Well, you need fairly long file names for this to work so that
eliminates DOS and the 14-character Unix file systems but also
means that you'll have no problem using [-A-Za-z0-9._]
in the file names. This gives us lots of room for "reversible
fun".
I'd go with something like:
- [A-Za-z0-9.] remain unchanged
- [ ] becomes [_]
- Everything else becomes something like "-garbage-"
where "garbage" is some reversible translation of the
"everything else".
Now, with [A-Za-z0-9. ] already taken care of, we only
need to encode the remaining 31 punctation characters, 33 ASCII
control characters, and 128 eight-bit characters. And we can
use [A-Za-z0-9._] (64 characters) for this encoding.
That would make it easy to just go for a base-64 encoding, but
that would make the filename longer than it needs to be for most titles.
Since I assume that control characters and 8-bit characters won't
appear often in titles (international characters should be inserted
using HTML escapes as 8-bit encodings vary), I wanted punctuation
characters to be very compactly represented. After some playing
around, I like converting "-" to "--", isolated punctuation
characters to "-[a-z0-3]", and all other illegal
sequences of characters to -([A-Z4-7]|[89_][\w.])+-
where [A-Z4-7] represent a punctuation mark and
[_89][\w.] represents a control or 8-bit character.
Note that this leaves "-." available for future meanings.
Here is the code to do this packaged with a test program that
encodes or decodes from STDIN to STDOUT. I tested it with:
perl titles.pl <thispost.txt >thispost.enc
perl titles.pl -d <thispost.enc >thispost.dec
fc thispost.txt thispost.dec
perl titles.pl -e <perl.exe >perl.enc
perl titles.pl -e -d <perl.enc >perl.dec
fc perl.exe perl.dec
The decoded files were identical to the originals. (-d means "decode" and
"-e" means en/decode entire file rather than 1 line at a time.)
The encoding of this posting contained the follow lines as examples:
Well-l_you_need_fairly_long_file_names_for_this_to_work_so_that
eliminates_DOS_and_the_14--character_Unix_file_systems_but_also
means_that_you-gll_have_no_problem_using_-pcode-RU---A--Za--z0--9.-YWP
+M-code-r
in_the_file_names.__This_gives_us_lots_of_room_for_-breversible
fun-b.
-PM-p-RP-p-r
[...]
my_-deight1-q_pack_-bC-JBL-_0x80..0xBF-o
my_-deight2-q_pack_-bC-JBL-_0xC0..0xFF-o
my_-dctrl-q_pack_-bC-JBL-_0..31-l128-o
my_-dpunct-q_q---ABCDEFGHIJKLMNOPQRSTUVWXYZ4567----o
my_-dcode-q_join_-BBL-_-g0-g..-g9-GLG-a-g..-gz-GLG-A-g..-gZ-GLGYGLG-.-
+GO-
[...]
________-2_elsif-h__1_-QQ-_length-HD-1-i__-FF-__0_-PQ-_index-HD-punct-
+LD-1-i__-i_-0
Here is the code:
#!/usr/bin/perl -w
use strict;
# [A-Za-z0-9.] => unchanged
# " " => "_"
# "-" => "--"
# isolated punctation => "-[a-z0-3]"
# other => "-" . ( [A-Z4-7] | [_89][\w.] )+ . "-"
# punctuation _=control, 8,9=8-bit
my $eight1= pack "C*", 0x80..0xBF;
my $eight2= pack "C*", 0xC0..0xFF;
my $ctrl= pack "C*", 0..31,128;
my $punct= q-!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~-;
my $code= join "", '0'..'9','a'..'z','A'..'Z','_','.';
sub encode {
local( $_ )= shift;
s{(-|[^-a-zA-Z0-9. ]+)}{
if( "-" eq $1 ) {
"--";
} elsif( 1 == length($1) && 0 <= index($punct,$1) ) {
my $p= $1;
$p =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.a-z0-3.;
"-$p";
} else {
my $g= $1;
$g =~ s{([\Q$punct\E])|([$ctrl])|([$eight1])|([$eight2])}{
my $c= $1 || $2 || $3 || $4;
if( $1 ) {
$c =~ tr.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.A-Z4-7.;
$c;
} elsif( $2 ) {
"_" . substr( $code, index($ctrl,$c), 1 );
} elsif( $3 ) {
"8" . substr( $code, index($eight1,$c), 1 );
} else {
"9" . substr( $code, index($eight2,$c), 1 );
}
}ge;
"-$g-";
}
}ge;
tr/ /_/;
return $_;
}
sub decode {
local( $_ )= shift;
s{(_)|-(-)|-([a-z0-3])|-((?:[A-Z4-7]|[_89][\w.])+)-}{
if( $1 ) {
" ";
} elsif( $2 ) {
"-";
} elsif( ! $4 ) {
my $g= $3;
$g =~ tr.a-z0-3.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.;
$g;
} else {
my $g= $4;
$g =~ s{([A-Z4-7])|([_89])([\w.])}{
#{ local( $^W);
# warn "else: 1-($1) 2-($2) 3-($3)\n";
#}
my $g= $1 || $3;
if( $1 ) {
$g =~ tr.A-Z4-7.!"#$%&'()*+,/:;<=>?@[\\]^_`{|}~.;
$g;
} elsif( "_" eq $2 ) {
substr( $ctrl, index($code,$g), 1 );
} elsif( "8" eq $2 ) {
substr( $eight1, index($code,$g), 1 );
} else {
substr( $eight2, index($code,$g), 1 );
}
}ge;
$g;
}
}ge;
return $_;
}
my $decode= grep /^-d/, @ARGV;
binmode( $decode ? \*STDOUT : \*STDIN );
if( grep /^-e/, @ARGV ) {
my $data= do { local( $/ ); <STDIN> };
print $decode ? decode($data) : encode($data);
} else {
while( <STDIN> ) {
s#\r?\n$##;
print $decode ? decode($_) : encode($_), "\n";
}
}
Enjoy!
-
tye
(but my friends call me "-757-_-ZM-_-h--") |