
Re^2: Imploding URLs

by mobiGeek (Beadle)
on Jun 10, 2005 at 02:59 UTC


in reply to Re: Imploding URLs
in thread Imploding URLs

You are crazy. :-)

So here's the bigger gist. I am improving a special-purpose HTTP proxy server that rewrites the URLs in the pages it fetches so that they all point back to itself (e.g. the URL "http://www.yahoo.com/" gets rewritten as "http://my_proxy?url=http://www.yahoo.com/"). So although I have a large collection of URLs (from my logs), I need to "implode" URLs one at a time. GZip and the like don't accomplish much on a single URL.
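
A minimal sketch of that rewrite step (assuming a regex pass and the http://my_proxy address from the example; a real rewriter would use HTML::Parser, and the embedded URL is percent-escaped here so it survives the query string):

use strict;
use warnings;
use URI::Escape qw(uri_escape);   # from the libwww-perl distribution

my $proxy = 'http://my_proxy';

# Point every absolute href in a fetched page back at the proxy.
sub rewrite_page {
    my ($html) = @_;
    $html =~ s{href="(http://[^"]+)"}
              {'href="' . $proxy . '?url=' . uri_escape($1) . '"'}ge;
    return $html;
}

print rewrite_page('<a href="http://www.yahoo.com/">Yahoo</a>'), "\n";
# <a href="http://my_proxy?url=http%3A%2F%2Fwww.yahoo.com%2F">Yahoo</a>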

Finding the collection of "top" substrings has already reduced my downloads by 20% on a given page, but that was done by hand for a single test page with only 30 or so URLs in it.

So the problem stands as stated... I wish it were as simple as GZip/Compress. In fact, I tried those, and for short URLs the result is often larger than the original... especially once the data is encrypted and base64-encoded.
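
The inflation is easy to reproduce with the core IO::Compress::Gzip and MIME::Base64 modules (using the URL from the example above):

use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use MIME::Base64 qw(encode_base64);

my $url = 'http://www.yahoo.com/';

my $packed;
gzip \$url => \$packed or die "gzip failed: $GzipError";
my $b64 = encode_base64($packed, '');   # '' suppresses the trailing newline

printf "original: %d bytes, gzipped: %d, gzip+base64: %d\n",
    length($url), length($packed), length($b64);
# gzip's ~18 bytes of header and trailer alone outweigh any deflate
# savings on a 21-byte string, and base64 adds another ~33% on top.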

mG.

Re^3: Imploding URLs
by TilRMan (Friar) on Jun 10, 2005 at 04:44 UTC
    So although I have a large collection of URLs (from my logs), I need to "implode" URLs one at a time.

    Why? Are the space savings that significant?

    If all you have is a handful of substitutions, you can probably hand-pick the strings:

    http www. .com .org .net :// index .htm .jpg .gif google yahoo mail news ebay
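
    A minimal sketch of such a table in Perl, mapping each hand-picked string to a one-byte token (the dictionary is the list above; the 0x80 token range is an illustrative choice and would have to survive whatever encoding the proxy applies afterwards):

    use strict;
    use warnings;

    # Hand-picked dictionary of common URL substrings, each mapped to a
    # single high byte that never appears in an ASCII URL.
    my @strings = ('http', 'www.', '.com', '.org', '.net', '://',
                   'index', '.htm', '.jpg', '.gif',
                   'google', 'yahoo', 'mail', 'news', 'ebay');
    my (%enc, %dec);
    my $byte = 0x80;
    for my $s (@strings) {
        my $tok = chr($byte++);
        $enc{$s} = $tok;
        $dec{$tok} = $s;
    }

    # Try longest strings first so 'google' beats shorter overlaps.
    my $enc_re = join '|', map { quotemeta }
                 sort { length($b) <=> length($a) } keys %enc;
    my $dec_re = join '|', map { quotemeta } keys %dec;

    sub implode { my $u = shift; $u =~ s/($enc_re)/$enc{$1}/g; $u }
    sub explode { my $u = shift; $u =~ s/($dec_re)/$dec{$1}/g; $u }

    my $url    = 'http://www.google.com/mail/index.htm';
    my $packed = implode($url);
    printf "%d -> %d bytes\n", length($url), length($packed);   # 36 -> 10
    die "round-trip failed\n" unless explode($packed) eq $url;
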
      Yes, the savings are quite significant. From a list hand-selected around one user's habits, I was able to reduce some pages by more than 20%.

      The other thing is that this is not a collection of URLs from across the entire web. The URLs being crawled vary, but the proxy is part of a kind of "portal". So there are potentially thousands of URLs, but they come from a select list of sites. Hence my interest in weighting the substrings.

      If one URL or one particular site (i.e. a particular substring) is crawled extremely frequently, then imploding that string could save much more bandwidth than simply imploding "http://" on every URL.
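
      A rough sketch of that weighting (the URL list and candidate lengths are illustrative; the real input would be URLs pulled from the proxy logs): score every repeated substring by (length - 1) * count, the bytes saved if each occurrence became a one-byte token, so a long string from a heavily crawled site can outrank a short universal one.

      use strict;
      use warnings;

      my @urls = (   # stand-ins for URLs read from the proxy logs
          'http://www.example.com/news/index.htm',
          'http://mail.example.com/news/index.htm',
          'http://www.example.com/mail/today.htm',
      );

      # Count every substring of the candidate lengths.
      my %count;
      for my $url (@urls) {
          for my $len (4 .. 12) {
              $count{ substr $url, $_, $len }++ for 0 .. length($url) - $len;
          }
      }

      # Rank by net savings, not raw frequency. Overlapping candidates
      # double-count, so a real pass would greedily pick one winner,
      # re-count, and repeat.
      my @ranked = sort { $b->[1] <=> $a->[1] }
                   map  { [ $_, (length($_) - 1) * $count{$_} ] }
                   grep { $count{$_} > 1 } keys %count;

      for my $r ( @ranked[ 0 .. ($#ranked < 9 ? $#ranked : 9) ] ) {
          printf "%-14s saves %3d bytes\n", "'$r->[0]'", $r->[1];
      }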

      mG.
