Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Well, you need fairly long file names for this to work so that eliminates DOS and the 14-character Unix file systems but also means that you'll have no problem using [-A-Za-z0-9._] in the file names. This gives us lots of room for "reversible fun".

I'd go with something like:

  • [A-Za-z0-9.] remain unchanged
  • [ ] becomes [_]
  • Everything else becomes something like "-garbage-" where "garbage" is some reversible translation of the "everything else".

Now, with [A-Za-z0-9. ] already taken care of, we only need to encode the remaining 31 punctation characters, 33 ASCII control characters, and 128 eight-bit characters. And we can use [A-Za-z0-9._] (64 characters) for this encoding. That would make it easy to just go for a base-64 encoding, but that would make the filename longer than it needs to be for most titles.

Since I assume that control characters and 8-bit characters won't appear often in titles (international characters should be inserted using HTML escapes as 8-bit encodings vary), I wanted punctuation characters to be very compactly represented. After some playing around, I like converting "-" to "--", isolated punctuation characters to "-[a-z0-3]", and all other illegal sequences of characters to -([A-Z4-7]|[89_][\w.])+- where [A-Z4-7] represent a punctuation mark and [_89][\w.] represents a control or 8-bit character. Note that this leaves "-." available for future meanings.

Here is the code to do this packaged with a test program that encodes or decodes from STDIN to STDOUT. I tested it with:

perl titles.pl <thispost.txt >thispost.enc perl titles.pl -d <thispost.enc >thispost.dec fc thispost.txt thispost.dec perl titles.pl -e <perl.exe >perl.enc perl titles.pl -e -d <perl.enc >perl.dec fc perl.exe perl.dec
The decoded files were identical to the originals. (-d means "decode" and "-e" means en/decode entire file rather than 1 line at a time.)

The encoding of this posting contained the follow lines as examples:

Well-l_you_need_fairly_long_file_names_for_this_to_work_so_that eliminates_DOS_and_the_14--character_Unix_file_systems_but_also means_that_you-gll_have_no_problem_using_-pcode-RU---A--Za--z0--9.-YWP +M-code-r in_the_file_names.__This_gives_us_lots_of_room_for_-breversible fun-b. -PM-p-RP-p-r [...] my_-deight1-q_pack_-bC-JBL-_0x80..0xBF-o my_-deight2-q_pack_-bC-JBL-_0xC0..0xFF-o my_-dctrl-q_pack_-bC-JBL-_0..31-l128-o my_-dpunct-q_q---ABCDEFGHIJKLMNOPQRSTUVWXYZ4567----o my_-dcode-q_join_-BBL-_-g0-g..-g9-GLG-a-g..-gz-GLG-A-g..-gZ-GLGYGLG-.- +GO- [...] ________-2_elsif-h__1_-QQ-_length-HD-1-i__-FF-__0_-PQ-_index-HD-punct- +LD-1-i__-i_-0

Here is the code:

#!/usr/bin/perl -w use strict; # [A-Za-z0-9.] => unchanged # " " => "_" # "-" => "--" # isolated punctation => "-[a-z0-3]" # other => "-" . ( [A-Z4-7] | [_89][\w.] )+ . "-" # punctuation _=control, 8,9=8-bit my $eight1= pack "C*", 0x80..0xBF; my $eight2= pack "C*", 0xC0..0xFF; my $ctrl= pack "C*", 0..31,128; my $punct= q-!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~-; my $code= join "", '0'..'9','a'..'z','A'..'Z','_','.'; sub encode { local( $_ )= shift; s{(-|[^-a-zA-Z0-9. ]+)}{ if( "-" eq $1 ) { "--"; } elsif( 1 == length($1) && 0 <= index($punct,$1) ) { my $p= $1; $p =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.a-z0-3.; "-$p"; } else { my $g= $1; $g =~ s{([\Q$punct\E])|([$ctrl])|([$eight1])|([$eight2])}{ my $c= $1 || $2 || $3 || $4; if( $1 ) { $c =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.A-Z4-7.; $c; } elsif( $2 ) { "_" . substr( $code, index($ctrl,$c), 1 ); } elsif( $3 ) { "8" . substr( $code, index($eight1,$c), 1 ); } else { "9" . substr( $code, index($eight2,$c), 1 ); } }ge; "-$g-"; } }ge; tr/ /_/; return $_; } sub decode { local( $_ )= shift; s{(_)|-(-)|-([a-z0-3])|-((?:[A-Z4-7]|[_89][\w.])+)-}{ if( $1 ) { " "; } elsif( $2 ) { "-"; } elsif( ! $4 ) { my $g= $3; $g =~ tr.a-z0-3.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.; $g; } else { my $g= $4; $g =~ s{([A-Z4-7])|([_89])([\w.])}{ #{ local( $^W); # warn "else: 1-($1) 2-($2) 3-($3)\n"; #} my $g= $1 || $3; if( $1 ) { $g =~ tr.A-Z4-7.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.; $g; } elsif( "_" eq $2 ) { substr( $ctrl, index($code,$g), 1 ); } elsif( "8" eq $2 ) { substr( $eight1, index($code,$g), 1 ); } else { substr( $eight2, index($code,$g), 1 ); } }ge; $g; } }ge; return $_; } my $decode= grep /^-d/, @ARGV; binmode( $decode ? \*STDOUT : \*STDIN ); if( grep /^-e/, @ARGV ) { my $data= do { local( $/ ); <STDIN> }; print $decode ? decode($data) : encode($data); } else { while( <STDIN> ) { s#\r?\n$##; print $decode ? decode($_) : encode($_), "\n"; } }
Enjoy!

        - tye (but my friends call me "-757-_-ZM-_-h--")

In reply to (tye)Re: Portably transforming a string to a valid filename by tye
in thread Portably transforming a string to a valid filename by Anonymous Monk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (4)
As of 2024-04-25 07:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found