Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

Well, you need fairly long file names for this to work so that eliminates DOS and the 14-character Unix file systems but also means that you'll have no problem using [-A-Za-z0-9._] in the file names. This gives us lots of room for "reversible fun".

I'd go with something like:

  • [A-Za-z0-9.] remain unchanged
  • [ ] becomes [_]
  • Everything else becomes something like "-garbage-" where "garbage" is some reversible translation of the "everything else".

Now, with [A-Za-z0-9. ] already taken care of, we only need to encode the remaining 31 punctation characters, 33 ASCII control characters, and 128 eight-bit characters. And we can use [A-Za-z0-9._] (64 characters) for this encoding. That would make it easy to just go for a base-64 encoding, but that would make the filename longer than it needs to be for most titles.

Since I assume that control characters and 8-bit characters won't appear often in titles (international characters should be inserted using HTML escapes as 8-bit encodings vary), I wanted punctuation characters to be very compactly represented. After some playing around, I like converting "-" to "--", isolated punctuation characters to "-[a-z0-3]", and all other illegal sequences of characters to -([A-Z4-7]|[89_][\w.])+- where [A-Z4-7] represent a punctuation mark and [_89][\w.] represents a control or 8-bit character. Note that this leaves "-." available for future meanings.

Here is the code to do this packaged with a test program that encodes or decodes from STDIN to STDOUT. I tested it with:

perl titles.pl <thispost.txt >thispost.enc perl titles.pl -d <thispost.enc >thispost.dec fc thispost.txt thispost.dec perl titles.pl -e <perl.exe >perl.enc perl titles.pl -e -d <perl.enc >perl.dec fc perl.exe perl.dec
The decoded files were identical to the originals. (-d means "decode" and "-e" means en/decode entire file rather than 1 line at a time.)

The encoding of this posting contained the follow lines as examples:

Well-l_you_need_fairly_long_file_names_for_this_to_work_so_that eliminates_DOS_and_the_14--character_Unix_file_systems_but_also means_that_you-gll_have_no_problem_using_-pcode-RU---A--Za--z0--9.-YWP +M-code-r in_the_file_names.__This_gives_us_lots_of_room_for_-breversible fun-b. -PM-p-RP-p-r [...] my_-deight1-q_pack_-bC-JBL-_0x80..0xBF-o my_-deight2-q_pack_-bC-JBL-_0xC0..0xFF-o my_-dctrl-q_pack_-bC-JBL-_0..31-l128-o my_-dpunct-q_q---ABCDEFGHIJKLMNOPQRSTUVWXYZ4567----o my_-dcode-q_join_-BBL-_-g0-g..-g9-GLG-a-g..-gz-GLG-A-g..-gZ-GLGYGLG-.- +GO- [...] ________-2_elsif-h__1_-QQ-_length-HD-1-i__-FF-__0_-PQ-_index-HD-punct- +LD-1-i__-i_-0

Here is the code:

#!/usr/bin/perl -w use strict; # [A-Za-z0-9.] => unchanged # " " => "_" # "-" => "--" # isolated punctation => "-[a-z0-3]" # other => "-" . ( [A-Z4-7] | [_89][\w.] )+ . "-" # punctuation _=control, 8,9=8-bit my $eight1= pack "C*", 0x80..0xBF; my $eight2= pack "C*", 0xC0..0xFF; my $ctrl= pack "C*", 0..31,128; my $punct= q-!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~-; my $code= join "", '0'..'9','a'..'z','A'..'Z','_','.'; sub encode { local( $_ )= shift; s{(-|[^-a-zA-Z0-9. ]+)}{ if( "-" eq $1 ) { "--"; } elsif( 1 == length($1) && 0 <= index($punct,$1) ) { my $p= $1; $p =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.a-z0-3.; "-$p"; } else { my $g= $1; $g =~ s{([\Q$punct\E])|([$ctrl])|([$eight1])|([$eight2])}{ my $c= $1 || $2 || $3 || $4; if( $1 ) { $c =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.A-Z4-7.; $c; } elsif( $2 ) { "_" . substr( $code, index($ctrl,$c), 1 ); } elsif( $3 ) { "8" . substr( $code, index($eight1,$c), 1 ); } else { "9" . substr( $code, index($eight2,$c), 1 ); } }ge; "-$g-"; } }ge; tr/ /_/; return $_; } sub decode { local( $_ )= shift; s{(_)|-(-)|-([a-z0-3])|-((?:[A-Z4-7]|[_89][\w.])+)-}{ if( $1 ) { " "; } elsif( $2 ) { "-"; } elsif( ! $4 ) { my $g= $3; $g =~ tr.a-z0-3.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.; $g; } else { my $g= $4; $g =~ s{([A-Z4-7])|([_89])([\w.])}{ #{ local( $^W); # warn "else: 1-($1) 2-($2) 3-($3)\n"; #} my $g= $1 || $3; if( $1 ) { $g =~ tr.A-Z4-7.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.; $g; } elsif( "_" eq $2 ) { substr( $ctrl, index($code,$g), 1 ); } elsif( "8" eq $2 ) { substr( $eight1, index($code,$g), 1 ); } else { substr( $eight2, index($code,$g), 1 ); } }ge; $g; } }ge; return $_; } my $decode= grep /^-d/, @ARGV; binmode( $decode ? \*STDOUT : \*STDIN ); if( grep /^-e/, @ARGV ) { my $data= do { local( $/ ); <STDIN> }; print $decode ? decode($data) : encode($data); } else { while( <STDIN> ) { s#\r?\n$##; print $decode ? decode($_) : encode($_), "\n"; } }
Enjoy!

        - tye (but my friends call me "-757-_-ZM-_-h--")

In reply to (tye)Re: Portably transforming a string to a valid filename by tye
in thread Portably transforming a string to a valid filename by Anonymous Monk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others imbibing at the Monastery: (12)
    As of 2015-07-08 04:18 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









      Results (94 votes), past polls