If you're already storing the unique URL in a database, why not use a unique key from the db for the URL portion? That would let you turn
my $string = '/h/plkr/3/www.plkr.org/rss.pl';
into
my $string = '/h/plkr/3/426933';
where 426933 would be the unique ID for retrieving that URI from the database. It has the added bonus of letting the end user tweak things like depth and output format without having to decode/re-encode the string.
If you really want to encode the string as-is, without having to store anything in a DB, you might be able to use something like Algorithm::Huffman in CPAN to generate a short bit vector for your data, and then use some other method (encode_base64 or somesuch) to make it safe for use in a URL.
You'll need to grab a good sample of the sort of strings you want to encode, and do a frequency count on them so that Algorithm::Huffman can build up a good dictionary.
Here's an untested example:
my $token_counts = {
'/' => 10000,
'.' => 3000,
'w' => 2342,
'e' => 2000,
't' => 2000,
# ... etc, for any character that might appear in the input strin
+g
};
my $string = '/h/plkr/3/www.plkr.org/rss.pl';
my $huff = Algorithm::Huffman->new($token_counts);
my $huff_encoded = $huff->encode($string);
my $printable = encode_base64($huff_encoded);
I have no idea what sort of size savings this will get you, though. Base-64 encoding costs you one character of output for every 6 bits of input, so as long as your Huffman encoding saves you more bits than Base 64 encoding wastes, you should end up with a URL-safe encoding that's shorter than the input.
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.