http://www.perlmonks.org?node_id=59721


in reply to Portably transforming a string to a valid filename

Well, you need fairly long file names for this to work so that eliminates DOS and the 14-character Unix file systems but also means that you'll have no problem using [-A-Za-z0-9._] in the file names. This gives us lots of room for "reversible fun".

I'd go with something like:

Now, with [A-Za-z0-9. ] already taken care of, we only need to encode the remaining 31 punctation characters, 33 ASCII control characters, and 128 eight-bit characters. And we can use [A-Za-z0-9._] (64 characters) for this encoding. That would make it easy to just go for a base-64 encoding, but that would make the filename longer than it needs to be for most titles.

Since I assume that control characters and 8-bit characters won't appear often in titles (international characters should be inserted using HTML escapes as 8-bit encodings vary), I wanted punctuation characters to be very compactly represented. After some playing around, I like converting "-" to "--", isolated punctuation characters to "-[a-z0-3]", and all other illegal sequences of characters to -([A-Z4-7]|[89_][\w.])+- where [A-Z4-7] represent a punctuation mark and [_89][\w.] represents a control or 8-bit character. Note that this leaves "-." available for future meanings.

Here is the code to do this packaged with a test program that encodes or decodes from STDIN to STDOUT. I tested it with:

perl titles.pl <thispost.txt >thispost.enc perl titles.pl -d <thispost.enc >thispost.dec fc thispost.txt thispost.dec perl titles.pl -e <perl.exe >perl.enc perl titles.pl -e -d <perl.enc >perl.dec fc perl.exe perl.dec
The decoded files were identical to the originals. (-d means "decode" and "-e" means en/decode entire file rather than 1 line at a time.)

The encoding of this posting contained the follow lines as examples:

Well-l_you_need_fairly_long_file_names_for_this_to_work_so_that eliminates_DOS_and_the_14--character_Unix_file_systems_but_also means_that_you-gll_have_no_problem_using_-pcode-RU---A--Za--z0--9.-YWP +M-code-r in_the_file_names.__This_gives_us_lots_of_room_for_-breversible fun-b. -PM-p-RP-p-r [...] my_-deight1-q_pack_-bC-JBL-_0x80..0xBF-o my_-deight2-q_pack_-bC-JBL-_0xC0..0xFF-o my_-dctrl-q_pack_-bC-JBL-_0..31-l128-o my_-dpunct-q_q---ABCDEFGHIJKLMNOPQRSTUVWXYZ4567----o my_-dcode-q_join_-BBL-_-g0-g..-g9-GLG-a-g..-gz-GLG-A-g..-gZ-GLGYGLG-.- +GO- [...] ________-2_elsif-h__1_-QQ-_length-HD-1-i__-FF-__0_-PQ-_index-HD-punct- +LD-1-i__-i_-0

Here is the code:

#!/usr/bin/perl -w use strict; # [A-Za-z0-9.] => unchanged # " " => "_" # "-" => "--" # isolated punctation => "-[a-z0-3]" # other => "-" . ( [A-Z4-7] | [_89][\w.] )+ . "-" # punctuation _=control, 8,9=8-bit my $eight1= pack "C*", 0x80..0xBF; my $eight2= pack "C*", 0xC0..0xFF; my $ctrl= pack "C*", 0..31,128; my $punct= q-!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~-; my $code= join "", '0'..'9','a'..'z','A'..'Z','_','.'; sub encode { local( $_ )= shift; s{(-|[^-a-zA-Z0-9. ]+)}{ if( "-" eq $1 ) { "--"; } elsif( 1 == length($1) && 0 <= index($punct,$1) ) { my $p= $1; $p =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.a-z0-3.; "-$p"; } else { my $g= $1; $g =~ s{([\Q$punct\E])|([$ctrl])|([$eight1])|([$eight2])}{ my $c= $1 || $2 || $3 || $4; if( $1 ) { $c =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.A-Z4-7.; $c; } elsif( $2 ) { "_" . substr( $code, index($ctrl,$c), 1 ); } elsif( $3 ) { "8" . substr( $code, index($eight1,$c), 1 ); } else { "9" . substr( $code, index($eight2,$c), 1 ); } }ge; "-$g-"; } }ge; tr/ /_/; return $_; } sub decode { local( $_ )= shift; s{(_)|-(-)|-([a-z0-3])|-((?:[A-Z4-7]|[_89][\w.])+)-}{ if( $1 ) { " "; } elsif( $2 ) { "-"; } elsif( ! $4 ) { my $g= $3; $g =~ tr.a-z0-3.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.; $g; } else { my $g= $4; $g =~ s{([A-Z4-7])|([_89])([\w.])}{ #{ local( $^W); # warn "else: 1-($1) 2-($2) 3-($3)\n"; #} my $g= $1 || $3; if( $1 ) { $g =~ tr.A-Z4-7.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.; $g; } elsif( "_" eq $2 ) { substr( $ctrl, index($code,$g), 1 ); } elsif( "8" eq $2 ) { substr( $eight1, index($code,$g), 1 ); } else { substr( $eight2, index($code,$g), 1 ); } }ge; $g; } }ge; return $_; } my $decode= grep /^-d/, @ARGV; binmode( $decode ? \*STDOUT : \*STDIN ); if( grep /^-e/, @ARGV ) { my $data= do { local( $/ ); <STDIN> }; print $decode ? decode($data) : encode($data); } else { while( <STDIN> ) { s#\r?\n$##; print $decode ? decode($_) : encode($_), "\n"; } }
Enjoy!

        - tye (but my friends call me "-757-_-ZM-_-h--")

Replies are listed 'Best First'.
Re^2: Portably transforming a string to a valid filename (mnemonic)
by tye (Sage) on Aug 06, 2003 at 22:20 UTC

    One thing bothers me about this: The codes for punctuation are not pneumatic. This is easily fixed. The association between punctuation and letters/numbers should be changed to be:

    tr[@{}$='`^"+/<>#()?\[]_|~!&%*,:;] [abcdefghijklmnopqrstuvwxyz0123];
    based on the following phlegmatic code: At Brace CloseCurly Dollar Equals Feet Grave-accent Hat Inches Join whacK/stroKe Less-than More-than Number Open Paren Question Reverse-whack Square brackeT Underline Verticalbar Wiggle eXclaim y(Spanish for "and"). % and z are shaped the same as are * and 0. Comma has 1 "dot" while colon has 2 dots and semicolon is a bit more than colon.

    I could see swapping /=k and &=y to get /=y and &=k based on similarity of shape (the whacK/stroKe connection is rather weak). *shrug*

                    - tye
      The codes for punctuation are not pneumatic.

      Pneumatic? As in, "Of or relating to air or other gases" ?

      I'm thinking you meant mnemonic, like the title suggests. But I am certainly amused by the idea of pneumatic filenames.

        It's a joke, Son.

                        - tye ("They can't all be winners.")