
Imploding URLs

by mobiGeek (Beadle)
on Jun 09, 2005 at 20:55 UTC [id://465321]

mobiGeek has asked for the wisdom of the Perl Monks concerning the following question:

I want to "implode" URLs, that is, shorten them by replacing long and/or frequently seen substrings with single non-visible ASCII characters. Essentially, the code I have is:
my $tnt = [ "http://www.", "https://www." ];
my $implode_url = sub {
    my ($url) = @_;
    my $len = scalar(@$tnt);
    for ( my $i = 0; $i < $len; $i++ ) {
        $url =~ s/\Q$tnt->[$i]\E/chr($i + 1)/eg;    # +1 to avoid \0
    }
    return $url;
};
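For the proxy to make use of this, the mapping also has to be reversible; a minimal sketch of the inverse, assuming the same $tnt table (the two-entry table is just the example from above):

```perl
use strict;
use warnings;

# Inverse of implode_url: map chr(1)..chr(N) back to their substrings.
# Assumes the same $tnt table as the implode side.
my $tnt = [ "http://www.", "https://www." ];

my $explode_url = sub {
    my ($url) = @_;
    for my $i ( 0 .. $#$tnt ) {
        my $c = chr( $i + 1 );
        $url =~ s/\Q$c\E/$tnt->[$i]/g;
    }
    return $url;
};

print $explode_url->( chr(2) . "example.org/" ), "\n";  # prints "https://www.example.org/"
```

This works because the replacement substrings never contain the control characters themselves, so substitution order does not matter.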
What I'm looking to do is, given a large list of URLs, find the top 31 (chr(1) to chr(31)) most valuable substrings to populate $tnt.

Obviously, a substring's value is a measure of its length times its frequency within the collection.

So, given a collection of URLs (one per line), does anyone have a suggested algorithm outside of brute force (pseudo-code below)?

while ( my $u = <STDIN> ) {
    for ( my $i = 1; $i < length($u) - 1; $i++ ) {
        for ( my $j = 0; $j < length($u) - $i; $j++ ) {
            my $s = substr( $u, $j, $i );
            $matches{$s} += () = $u =~ m/\Q$s\E/g;
        }
    }
}
# now loop over %matches finding the top 31 of:
#   $matches{$key} * length($key)
Thanks in advance!


Replies are listed 'Best First'.
Re: Imploding URLs
by tall_man (Parson) on Jun 09, 2005 at 22:22 UTC
    You could use String::Ediff to find common substrings between pairs of URLs, and then break those down into pieces that are 31 characters or less and count those with a hash.

    It uses a suffix tree to find the substrings, so it should be fairly efficient. Out-of-the-box, it finds substrings of length >=4, but that could probably be changed. Substrings of length one would not be compressed, anyway.

    Update: You might prefer Algorithm::Diff, which has a nicer interface and more options.

    Update2: The node Re: finding longest common substring also builds a suffix tree and it might be adaptable to your problem.

Re: Imploding URLs
by shemp (Deacon) on Jun 09, 2005 at 22:37 UTC
    This sounds like something where you could dig a Huffman encoding out of the URLs. I don't have personal experience with Huffman coding, but Chapter 19 of O'Reilly's 'Computer Science & Perl Programming' discusses it quite well. That article also appeared in The Perl Journal, but I don't know when.
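    For reference, the core of Huffman coding fits in a few lines of Perl; this is a sketch over character frequencies, and the sample text and node layout are illustrative only (not taken from the chapter):

```perl
use strict;
use warnings;

# Minimal Huffman-coding sketch over the characters of some sample
# URL text; the text is made up for illustration.
my $text = "http://www.example.com/ http://www.example.org/";

my %freq;
$freq{$_}++ for split //, $text;

# Each node: [ weight, leaf-char (undef for internal), left, right ]
my @nodes = map { [ $freq{$_}, $_ ] } keys %freq;
while ( @nodes > 1 ) {
    @nodes = sort { $a->[0] <=> $b->[0] } @nodes;
    my $l = shift @nodes;
    my $r = shift @nodes;
    push @nodes, [ $l->[0] + $r->[0], undef, $l, $r ];
}

# Walk the tree, assigning '0'/'1' path strings as codes.
my %code;
sub walk {
    my ( $node, $prefix ) = @_;
    if ( defined $node->[1] ) {
        $code{ $node->[1] } = $prefix eq '' ? '0' : $prefix;
        return;
    }
    walk( $node->[2], $prefix . '0' );
    walk( $node->[3], $prefix . '1' );
}
walk( $nodes[0], '' );

my $bits = join '', map { $code{$_} } split //, $text;
printf "%d bits raw, %d bits Huffman-coded\n", 8 * length($text), length($bits);
```

    Note that Huffman gives a bit-level code, whereas the original question wants whole-byte substitutions, so it would only be a component of a solution here.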
Re: Imploding URLs
by QM (Parson) on Jun 09, 2005 at 21:12 UTC
    A simplification of the brute force method would be to split on \W, and use a hash to count the frequencies of the "words". That avoids generating the canonical list of substrings from every URL (which is a huge list, even for a single URL of significant length).

    Quantum Mechanics: The dreams stuff is made of
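    QM's suggestion above can be sketched directly; the sample URLs are made up for illustration:

```perl
use strict;
use warnings;

# Sketch of the split-on-\W approach: count "word" frequencies across
# a URL collection, then rank by (length x frequency).
my @urls = (
    "http://www.example.com/index.html",
    "https://www.example.org/news/index.html",
);

my %count;
for my $url (@urls) {
    $count{$_}++ for grep { length } split /\W+/, $url;
}

# Rank candidate substrings by value, best first.
my @top = sort {
    $count{$b} * length($b) <=> $count{$a} * length($a)
} keys %count;

my $last = $#top < 30 ? $#top : 30;    # keep at most 31 entries
print "$_ ($count{$_}x)\n" for @top[ 0 .. $last ];
```

    This misses multi-word substrings such as "http://www.", but it keeps the candidate set small enough to count with a plain hash.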

Re: Imploding URLs
by kscaldef (Pilgrim) on Jun 10, 2005 at 01:12 UTC
    Agreed. Sounds like you're looking for Compress::Zlib.

    (If you intend this just as a fun exercise, then by all means go for it; but if this is for real use, you should use the standard and well-tested libraries.)

      Fun, no. Exercise, sorta. Real use, yes. Standard libs, I wish.

      See my reply to the post before yours.


Re: Imploding URLs
by Cody Pendant (Prior) on Jun 10, 2005 at 00:08 UTC
    Call me crazy, but aren't there already highly efficient methods for compressing files, as in, gzip?

    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
      You are crazy. :-)

      So here's the bigger gist. I am improving a special-purpose HTTP proxy server that rewrites URLs in pages that it fetches so that they all point back to itself (e.g. the URL "" gets rewritten as "http://my_proxy?url="). So though I have a large collection of URLs (from my logs), I need to "implode" URLs on a one-by-one basis. GZip and the like don't do very much on a single URL.

      Finding the collection of "top" substrings has already reduced my downloads by 20% on a given page, but that was done by hand for a single test page with only 30 or so URLs in it.

      So the problem as stated stands... I wish it were as simple as GZip/Compress. In fact, I used those, and in many cases the URLs actually came out larger (for short URLs), especially once the data is encrypted and base64'd...
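      That overhead is easy to reproduce with Compress::Zlib itself; for a short string, the zlib header and checksum outweigh any savings (the URL here is made up):

```perl
use strict;
use warnings;
use Compress::Zlib;

# A short URL usually grows under zlib: the 2-byte header and
# 4-byte checksum outweigh any savings on so little input.
my $url      = "http://www.example.com/";
my $deflated = compress($url);

printf "original: %d bytes, compressed: %d bytes\n",
    length($url), length($deflated);
```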


        So though I have a large collection of URLs (from my logs), I need to "implode" URLs on a one-by-one basis.

        Why? Is the space savings that significant?

        If all you have is a handful of substitutions, you can probably hand-pick the strings:

        http www. .com .org .net :// index .htm .jpg .gif google yahoo mail news ebay
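        That hand-picked list drops straight into the original scheme; a sketch, sorting longest-first so longer tokens win, with a made-up example URL:

```perl
use strict;
use warnings;

# QM's hand-picked substrings, longest first; each maps to chr($i + 1).
my @tnt = sort { length($b) <=> length($a) }
    qw( http www. .com .org .net :// index .htm .jpg .gif
        google yahoo mail news ebay );

sub implode_url {
    my ($url) = @_;
    for my $i ( 0 .. $#tnt ) {
        my $pat = quotemeta $tnt[$i];    # make '.' etc. match literally
        $url =~ s/$pat/chr($i + 1)/eg;   # +1 to avoid \0
    }
    return $url;
}

# Made-up example URL:
my $url   = "http://www.google.com/index.htm";
my $short = implode_url($url);
printf "%d bytes -> %d bytes\n", length($url), length($short);  # 31 -> 8
```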
Re: Imploding URLs
by ank (Scribe) on Jun 11, 2005 at 07:17 UTC
    If you just want to shorten the URLs after the HTMLs have been parsed and their URLs changed, what keeps you from doing something like exchanging them for numeric keys in a disk-tied hash, and presenting each of them as an http request parameter?
      Saving state server-side is (currently) not an option.
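      For anyone without that constraint, ank's idea can be sketched with a core DBM module (SDBM_File here; any DBM would do). The proxy hostname is made up, and a real version would also have to persist the counter between runs:

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

# Swap each URL for a short numeric key kept in a disk-tied hash.
my $dir = tempdir( CLEANUP => 1 );
tie my %key_for, 'SDBM_File', "$dir/urlmap", O_RDWR | O_CREAT, 0666
    or die "tie failed: $!";

my $next = 0;    # a real proxy would persist this across restarts
sub shorten {
    my ($url) = @_;
    $key_for{$url} = $next++ unless exists $key_for{$url};
    return "http://my_proxy?u=$key_for{$url}";
}

print shorten("http://www.example.com/some/long/path"), "\n";  # prints "http://my_proxy?u=0"
```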


Re: Imploding URLs
by dReKurCe (Scribe) on Jun 11, 2005 at 09:33 UTC

    I haven't come across a post that made me laugh as hard as this one in a while (mostly because I've been away). Once I figured out what was being sought, I nearly dove into the regexing. I think the ranking algorithm holds the intrigue. I'm not sure if I've made the code more or less accessible. Anyway, it made me think.
    #!/usr/bin/perl
    # Authored By: DReKurCe
    # Fri Jun 10 21:10:33 EDT 2005
    use warnings;
    use strict;

    my @mostcommon = qw( www com org edu );
    my @urlist     = qw( );    # list of URLs elided in the original

    graburl(@urlist);

    sub graburl {
        my @urlist = @_;
        foreach my $url (@urlist) {
            urlcom( $url, @mostcommon );
        }
    }

    sub urlcom {
        my ( $url, @mostcommon ) = @_;    # URL first, since arrays flatten
        for my $match0 (@mostcommon) {
            my @atomize = split //, $url;
            foreach my $atom (@atomize) {
                if    ( <masterful regex> ) { rank( $url, 1 ) }
                elsif ( <masterful regex> ) { rank( $url, 2 ) }
                # ...
            }
        }
    }

    sub rank {
        # <impliment ranking behaviour for url>
    }
Re: Imploding URLs
by elwarren (Priest) on Jun 10, 2005 at 22:33 UTC
    Sounds kind of like an obfu-proxy.
      Head of nail, hit you have.


Node Type: perlquestion [id://465321]
Approved by davidrw
Front-paged by tall_man