Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

Portably transforming a string to a valid filename

by Anonymous Monk
on Feb 20, 2001 at 02:31 UTC ( #59505=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm currently trying to shove together a script to back up all the writeups an Everything2 user has to files, each one named for the title of the node it comes from.

There is just one problem -- the set of legal characters for node titles is larger than the set of characters most operating systems will support in the filesystem. The perlport man page suggests dropping everything except alphanumerics and the period, but I would rather keep the title as close as possible to the original, perhaps even able to be translated back with the right incantations. I have thought about using the CGI library function escape(), but that would use the % sign for some things, and a question mark is still legal.

Is there some portable way to escape characters appropriately for a given filesystem? Failing that, what would an approximation that will cover the common operating systems in use look like? Ideally, I'd like to make it fairly foolproof in terms of not having to install new modules, but solutions featuring that are okay too.

Replies are listed 'Best First'.
(tye)Re: Portably transforming a string to a valid filename
by tye (Sage) on Feb 20, 2001 at 21:51 UTC

    Well, you need fairly long file names for this to work so that eliminates DOS and the 14-character Unix file systems but also means that you'll have no problem using [-A-Za-z0-9._] in the file names. This gives us lots of room for "reversible fun".

    I'd go with something like:

    • [A-Za-z0-9.] remain unchanged
    • [ ] becomes [_]
    • Everything else becomes something like "-garbage-" where "garbage" is some reversible translation of the "everything else".

    Now, with [A-Za-z0-9. ] already taken care of, we only need to encode the remaining 31 punctation characters, 33 ASCII control characters, and 128 eight-bit characters. And we can use [A-Za-z0-9._] (64 characters) for this encoding. That would make it easy to just go for a base-64 encoding, but that would make the filename longer than it needs to be for most titles.

    Since I assume that control characters and 8-bit characters won't appear often in titles (international characters should be inserted using HTML escapes as 8-bit encodings vary), I wanted punctuation characters to be very compactly represented. After some playing around, I like converting "-" to "--", isolated punctuation characters to "-[a-z0-3]", and all other illegal sequences of characters to -([A-Z4-7]|[89_][\w.])+- where [A-Z4-7] represent a punctuation mark and [_89][\w.] represents a control or 8-bit character. Note that this leaves "-." available for future meanings.

    Here is the code to do this packaged with a test program that encodes or decodes from STDIN to STDOUT. I tested it with:

    perl <thispost.txt >thispost.enc perl -d <thispost.enc >thispost.dec fc thispost.txt thispost.dec perl -e <perl.exe >perl.enc perl -e -d <perl.enc >perl.dec fc perl.exe perl.dec
    The decoded files were identical to the originals. (-d means "decode" and "-e" means en/decode entire file rather than 1 line at a time.)

    The encoding of this posting contained the follow lines as examples:

    Well-l_you_need_fairly_long_file_names_for_this_to_work_so_that eliminates_DOS_and_the_14--character_Unix_file_systems_but_also means_that_you-gll_have_no_problem_using_-pcode-RU---A--Za--z0--9.-YWP +M-code-r in_the_file_names.__This_gives_us_lots_of_room_for_-breversible fun-b. -PM-p-RP-p-r [...] my_-deight1-q_pack_-bC-JBL-_0x80..0xBF-o my_-deight2-q_pack_-bC-JBL-_0xC0..0xFF-o my_-dctrl-q_pack_-bC-JBL-_0..31-l128-o my_-dpunct-q_q---ABCDEFGHIJKLMNOPQRSTUVWXYZ4567----o my_-dcode-q_join_-BBL-_-g0-g..-g9-GLG-a-g..-gz-GLG-A-g..-gZ-GLGYGLG-.- +GO- [...] ________-2_elsif-h__1_-QQ-_length-HD-1-i__-FF-__0_-PQ-_index-HD-punct- +LD-1-i__-i_-0

    Here is the code:

    #!/usr/bin/perl -w use strict; # [A-Za-z0-9.] => unchanged # " " => "_" # "-" => "--" # isolated punctation => "-[a-z0-3]" # other => "-" . ( [A-Z4-7] | [_89][\w.] )+ . "-" # punctuation _=control, 8,9=8-bit my $eight1= pack "C*", 0x80..0xBF; my $eight2= pack "C*", 0xC0..0xFF; my $ctrl= pack "C*", 0..31,128; my $punct= q-!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~-; my $code= join "", '0'..'9','a'..'z','A'..'Z','_','.'; sub encode { local( $_ )= shift; s{(-|[^-a-zA-Z0-9. ]+)}{ if( "-" eq $1 ) { "--"; } elsif( 1 == length($1) && 0 <= index($punct,$1) ) { my $p= $1; $p =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.a-z0-3.; "-$p"; } else { my $g= $1; $g =~ s{([\Q$punct\E])|([$ctrl])|([$eight1])|([$eight2])}{ my $c= $1 || $2 || $3 || $4; if( $1 ) { $c =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.A-Z4-7.; $c; } elsif( $2 ) { "_" . substr( $code, index($ctrl,$c), 1 ); } elsif( $3 ) { "8" . substr( $code, index($eight1,$c), 1 ); } else { "9" . substr( $code, index($eight2,$c), 1 ); } }ge; "-$g-"; } }ge; tr/ /_/; return $_; } sub decode { local( $_ )= shift; s{(_)|-(-)|-([a-z0-3])|-((?:[A-Z4-7]|[_89][\w.])+)-}{ if( $1 ) { " "; } elsif( $2 ) { "-"; } elsif( ! $4 ) { my $g= $3; $g =~ tr.a-z0-3.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.; $g; } else { my $g= $4; $g =~ s{([A-Z4-7])|([_89])([\w.])}{ #{ local( $^W); # warn "else: 1-($1) 2-($2) 3-($3)\n"; #} my $g= $1 || $3; if( $1 ) { $g =~ tr.A-Z4-7.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.; $g; } elsif( "_" eq $2 ) { substr( $ctrl, index($code,$g), 1 ); } elsif( "8" eq $2 ) { substr( $eight1, index($code,$g), 1 ); } else { substr( $eight2, index($code,$g), 1 ); } }ge; $g; } }ge; return $_; } my $decode= grep /^-d/, @ARGV; binmode( $decode ? \*STDOUT : \*STDIN ); if( grep /^-e/, @ARGV ) { my $data= do { local( $/ ); <STDIN> }; print $decode ? decode($data) : encode($data); } else { while( <STDIN> ) { s#\r?\n$##; print $decode ? decode($_) : encode($_), "\n"; } }

            - tye (but my friends call me "-757-_-ZM-_-h--")

      One thing bothers me about this: The codes for punctuation are not pneumatic. This is easily fixed. The association between punctuation and letters/numbers should be changed to be:

      tr[@{}$='`^"+/<>#()?\[]_|~!&%*,:;] [abcdefghijklmnopqrstuvwxyz0123];
      based on the following phlegmatic code: At Brace CloseCurly Dollar Equals Feet Grave-accent Hat Inches Join whacK/stroKe Less-than More-than Number Open Paren Question Reverse-whack Square brackeT Underline Verticalbar Wiggle eXclaim y(Spanish for "and"). % and z are shaped the same as are * and 0. Comma has 1 "dot" while colon has 2 dots and semicolon is a bit more than colon.

      I could see swapping /=k and &=y to get /=y and &=k based on similarity of shape (the whacK/stroKe connection is rather weak). *shrug*

                      - tye
        The codes for punctuation are not pneumatic.

        Pneumatic? As in, "Of or relating to air or other gases" ?

        I'm thinking you meant mnemonic, like the title suggests. But I am certainly amused by the idea of pneumatic filenames.

Re: Portably transforming a string to a valid filename
by arturo (Vicar) on Feb 20, 2001 at 02:49 UTC

    I'm not at all up on "encryption" algorithms, but it seems that what you want is a two-way algorithm that (a) isn't computationally expensive and (b) yields 'encrypted' data that's encoded in filesystem-friendly strings. It's the last that's got me stumped. Just that thinking "encryption" might give you some ideas for other searches.

    How about an alternate scheme: instead of backing things up one file to a node, write out all the user's nodes into a standard format (XML, anybody?):

    <node> <title>How I learned to stop worrying and Love the NodeReaper</title> <time>December 31, 2001 : 24:00:00</time> <node_content>He got rid of those nasty nodes. nuff said</node_conte +nt> </node>

    Then gzip the whole thing, and there's your backup.

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

      You can just use pack to uuencode your file-name strings to be extremely file-system friendly (or, you can install linux or freebsd or something and get a friendlier file-system :)

      To the original poster: don't do this. It's a waste of bandwidth (as well as a serious copyright violation.) Just /msg nate and ask him if he'll send you a copy of the database. He (probably) won't and you can move on to another project.

      update ignore the last paragraph -- my misunderstanding. thought you planned on backing up the *entire* database. apologies all around.

      update2 too bad E2's displaytype=xml doesn't show softlinks as well.

        I think you misunderstand my intent; it's for a user to back up their own writeups. In fact, it won't work on another user unless you had their login and password, as it taps into E2's XML writeup displaytype, which requires that one be logged in as the user whose writeups are to be viewed.
Re: Portably transforming a string to a valid filename
by chipmunk (Parson) on Feb 20, 2001 at 10:10 UTC
    Here's a very straightforward solution. Instead of %, it uses _ as the escape character, since I think _ should be safe just about anywhere. (If not, maybe -?)

    Anyway, my thinking is, escape almost everything (including the escape character, of course). :)

    sub encode_title { my($title) = @_; $title =~ s/([^A-Za-z0-9])/sprintf '_%02X', ord $1/ge; return $title; } sub decode_title { my($title) = @_; if ($title =~ /[^A-Za-z0-9_]/ or $title =~ /_(?![A-F0-9]{2})/) { return undef; } $title =~ s/_(..)/chr hex $1/ge; return $title; }

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://59505]
Approved by root
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (1)
As of 2021-12-02 04:14 GMT
Find Nodes?
    Voting Booth?
    R or B?

    Results (17 votes). Check out past polls.