Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

(tye)Re: Portably transforming a string to a valid filename

by tye (Cardinal)
on Feb 20, 2001 at 21:51 UTC ( #59721=note: print w/ replies, xml ) Need Help??


in reply to Portably transforming a string to a valid filename

Well, you need fairly long file names for this to work so that eliminates DOS and the 14-character Unix file systems but also means that you'll have no problem using [-A-Za-z0-9._] in the file names. This gives us lots of room for "reversible fun".

I'd go with something like:

  • [A-Za-z0-9.] remain unchanged
  • [ ] becomes [_]
  • Everything else becomes something like "-garbage-" where "garbage" is some reversible translation of the "everything else".

Now, with [A-Za-z0-9. ] already taken care of, we only need to encode the remaining 31 punctation characters, 33 ASCII control characters, and 128 eight-bit characters. And we can use [A-Za-z0-9._] (64 characters) for this encoding. That would make it easy to just go for a base-64 encoding, but that would make the filename longer than it needs to be for most titles.

Since I assume that control characters and 8-bit characters won't appear often in titles (international characters should be inserted using HTML escapes as 8-bit encodings vary), I wanted punctuation characters to be very compactly represented. After some playing around, I like converting "-" to "--", isolated punctuation characters to "-[a-z0-3]", and all other illegal sequences of characters to -([A-Z4-7]|[89_][\w.])+- where [A-Z4-7] represent a punctuation mark and [_89][\w.] represents a control or 8-bit character. Note that this leaves "-." available for future meanings.

Here is the code to do this packaged with a test program that encodes or decodes from STDIN to STDOUT. I tested it with:

perl titles.pl <thispost.txt >thispost.enc perl titles.pl -d <thispost.enc >thispost.dec fc thispost.txt thispost.dec perl titles.pl -e <perl.exe >perl.enc perl titles.pl -e -d <perl.enc >perl.dec fc perl.exe perl.dec
The decoded files were identical to the originals. (-d means "decode" and "-e" means en/decode entire file rather than 1 line at a time.)

The encoding of this posting contained the follow lines as examples:

Well-l_you_need_fairly_long_file_names_for_this_to_work_so_that eliminates_DOS_and_the_14--character_Unix_file_systems_but_also means_that_you-gll_have_no_problem_using_-pcode-RU---A--Za--z0--9.-YWP +M-code-r in_the_file_names.__This_gives_us_lots_of_room_for_-breversible fun-b. -PM-p-RP-p-r [...] my_-deight1-q_pack_-bC-JBL-_0x80..0xBF-o my_-deight2-q_pack_-bC-JBL-_0xC0..0xFF-o my_-dctrl-q_pack_-bC-JBL-_0..31-l128-o my_-dpunct-q_q---ABCDEFGHIJKLMNOPQRSTUVWXYZ4567----o my_-dcode-q_join_-BBL-_-g0-g..-g9-GLG-a-g..-gz-GLG-A-g..-gZ-GLGYGLG-.- +GO- [...] ________-2_elsif-h__1_-QQ-_length-HD-1-i__-FF-__0_-PQ-_index-HD-punct- +LD-1-i__-i_-0

Here is the code:

#!/usr/bin/perl -w use strict; # [A-Za-z0-9.] => unchanged # " " => "_" # "-" => "--" # isolated punctation => "-[a-z0-3]" # other => "-" . ( [A-Z4-7] | [_89][\w.] )+ . "-" # punctuation _=control, 8,9=8-bit my $eight1= pack "C*", 0x80..0xBF; my $eight2= pack "C*", 0xC0..0xFF; my $ctrl= pack "C*", 0..31,128; my $punct= q-!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~-; my $code= join "", '0'..'9','a'..'z','A'..'Z','_','.'; sub encode { local( $_ )= shift; s{(-|[^-a-zA-Z0-9. ]+)}{ if( "-" eq $1 ) { "--"; } elsif( 1 == length($1) && 0 <= index($punct,$1) ) { my $p= $1; $p =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.a-z0-3.; "-$p"; } else { my $g= $1; $g =~ s{([\Q$punct\E])|([$ctrl])|([$eight1])|([$eight2])}{ my $c= $1 || $2 || $3 || $4; if( $1 ) { $c =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.A-Z4-7.; $c; } elsif( $2 ) { "_" . substr( $code, index($ctrl,$c), 1 ); } elsif( $3 ) { "8" . substr( $code, index($eight1,$c), 1 ); } else { "9" . substr( $code, index($eight2,$c), 1 ); } }ge; "-$g-"; } }ge; tr/ /_/; return $_; } sub decode { local( $_ )= shift; s{(_)|-(-)|-([a-z0-3])|-((?:[A-Z4-7]|[_89][\w.])+)-}{ if( $1 ) { " "; } elsif( $2 ) { "-"; } elsif( ! $4 ) { my $g= $3; $g =~ tr.a-z0-3.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.; $g; } else { my $g= $4; $g =~ s{([A-Z4-7])|([_89])([\w.])}{ #{ local( $^W); # warn "else: 1-($1) 2-($2) 3-($3)\n"; #} my $g= $1 || $3; if( $1 ) { $g =~ tr.A-Z4-7.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.; $g; } elsif( "_" eq $2 ) { substr( $ctrl, index($code,$g), 1 ); } elsif( "8" eq $2 ) { substr( $eight1, index($code,$g), 1 ); } else { substr( $eight2, index($code,$g), 1 ); } }ge; $g; } }ge; return $_; } my $decode= grep /^-d/, @ARGV; binmode( $decode ? \*STDOUT : \*STDIN ); if( grep /^-e/, @ARGV ) { my $data= do { local( $/ ); <STDIN> }; print $decode ? decode($data) : encode($data); } else { while( <STDIN> ) { s#\r?\n$##; print $decode ? decode($_) : encode($_), "\n"; } }
Enjoy!

        - tye (but my friends call me "-757-_-ZM-_-h--")


Comment on (tye)Re: Portably transforming a string to a valid filename
Select or Download Code
Replies are listed 'Best First'.
Re^2: Portably transforming a string to a valid filename (mnemonic)
by tye (Cardinal) on Aug 06, 2003 at 22:20 UTC

    One thing bothers me about this: The codes for punctuation are not pneumatic. This is easily fixed. The association between punctuation and letters/numbers should be changed to be:

    tr[@{}$='`^"+/<>#()?\[]_|~!&%*,:;] [abcdefghijklmnopqrstuvwxyz0123];
    based on the following phlegmatic code: At Brace CloseCurly Dollar Equals Feet Grave-accent Hat Inches Join whacK/stroKe Less-than More-than Number Open Paren Question Reverse-whack Square brackeT Underline Verticalbar Wiggle eXclaim y(Spanish for "and"). % and z are shaped the same as are * and 0. Comma has 1 "dot" while colon has 2 dots and semicolon is a bit more than colon.

    I could see swapping /=k and &=y to get /=y and &=k based on similarity of shape (the whacK/stroKe connection is rather weak). *shrug*

                    - tye
      The codes for punctuation are not pneumatic.

      Pneumatic? As in, "Of or relating to air or other gases" ?

      I'm thinking you meant mnemonic, like the title suggests. But I am certainly amused by the idea of pneumatic filenames.

        It's a joke, Son.

                        - tye ("They can't all be winners.")

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://59721]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (17)
As of 2015-07-29 12:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (263 votes), past polls