
Re: LWP::UserAgent and HTML::Parser and the joys of Open Source

by merlyn (Sage)
on Oct 15, 2001 at 18:18 UTC


in reply to LWP::UserAgent and HTML::Parser and the joys of Open Source

For a more robust solution, you only want to push an HTML parser onto a stream that has announced itself as MIME type text/html. I did that for a client once, talked about the code in a recent Usenet article, and hope to have it published soon. In that article, I said:
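
A minimal sketch of that content-type check, assuming the response is available as an HTTP::Response object from LWP. The helper name wants_html_rewrite is hypothetical and is not taken from the unpublished proxy code:

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: true only when the response declares itself text/html.
# In scalar context, content_type returns the lowercased media type with
# any parameters (such as charset) stripped off.
sub wants_html_rewrite {
    my ($response) = @_;
    my $type = $response->content_type;
    return defined $type && $type eq 'text/html';
}

my $html_res  = HTTP::Response->new(200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ]);
my $movie_res = HTTP::Response->new(200, 'OK',
    [ 'Content-Type' => 'video/quicktime' ]);

print wants_html_rewrite($html_res)  ? "parse\n" : "pass through\n";
print wants_html_rewrite($movie_res) ? "parse\n" : "pass through\n";
```

Anything that is not text/html (images, movies, SSL tunnels) would be streamed through untouched, so the parser never sees bytes it cannot make sense of.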

I have (unpublished) a dynamic-pre-forking Apache-style web streaming proxy server in about 300 lines of pure Perl (using HTTP::Daemon and the other LWP items, of course). It takes the same parameters as Apache child management:

    ### configuration
    my $HOST = 'www.stonehenge.com';
    my $PORT = 42001;                  # 0 = pick next available user-port
    my $START_SERVERS = 4;             # start this many, and don't go below
    my $MAX_CLIENTS = 12;              # don't go above
    my $MAX_REQUESTS_PER_CHILD = 250;  # just in case there's a leak
    my $MIN_SPARE_SERVERS = 1;         # minimum idle (if 0, never start new)
    my $MAX_SPARE_SERVERS = 12;        # maximum idle (should be "single browser max")

And acts accordingly, using a simple scoreboarding mechanism similar to the Apache method.
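
The proxy's actual scoreboard code is unpublished, so the following is only a guess at the kind of decision a parent process might make from idle and busy counts. The function pool_adjustment and its argument names are hypothetical. It returns a positive number of children to start, a negative number to retire, or zero:

```perl
use strict;
use warnings;

# Hypothetical sketch, not the original code: decide how to resize the
# child pool from a scoreboard summary (idle and busy child counts).
sub pool_adjustment {
    my (%c) = @_;
    my $total = $c{idle} + $c{busy};

    # Below the floor: top the pool back up to start_servers.
    return $c{start_servers} - $total if $total < $c{start_servers};

    # Too few idle children: start more, but never exceed max_clients.
    if ($c{idle} < $c{min_spare}) {
        my $need = $c{min_spare} - $c{idle};
        my $room = $c{max_clients} - $total;
        $room = 0 if $room < 0;
        return $need < $room ? $need : $room;
    }

    # Too many idle children: retire the surplus, but keep the floor.
    if ($c{idle} > $c{max_spare}) {
        my $surplus     = $c{idle} - $c{max_spare};
        my $above_floor = $total - $c{start_servers};
        return -($surplus < $above_floor ? $surplus : $above_floor);
    }

    return 0;
}

# Limits taken from the configuration above; the counts are made up.
my %limits = (start_servers => 4, max_clients => 12,
              min_spare => 1, max_spare => 12);
print pool_adjustment(%limits, idle => 0, busy => 4), "\n";
```

With every child busy and room left under max_clients, the sketch asks for one more child. Once the pool reaches max_clients it asks for none, and clients simply wait.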

Using this code, the apache-benchmark program shows that I'm only half as fast as Apache, but with one quarter the footprint!

The best part is that in those 300 lines, I handle full SSL streaming (the CONNECT call), full content streaming (I was watching live-feed quicktime movies through the proxy), and if the content-type is text/html, an HTML parser in token mode is inserted, allowing real-time rewriting. For example, I could insert <font color=blue> tags around all <a href=> links, while not impeding the stream of the rest of the HTML... there'd just be a hiccup while the <a href=> was being noticed.
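
To illustrate the token-mode rewriting, here is a rough sketch using HTML::TokeParser. It reads from a string rather than a live stream, and the function name highlight_links is hypothetical. The original proxy code is not shown here and may well differ:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sketch: wrap every <a href=...>...</a> in <font color=blue>,
# copying all other tokens through exactly as they appeared in the source.
sub highlight_links {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot create parser";
    my $out     = '';
    my $in_link = 0;    # true while inside an <a> that had an href

    while (my $t = $p->get_token) {
        my $type = $t->[0];
        # The original source text is the last element for start tags,
        # end tags and processing instructions, and the second element
        # for text, comments and declarations.
        my $text = ($type eq 'S' || $type eq 'E' || $type eq 'PI')
                 ? $t->[-1] : $t->[1];

        if ($type eq 'S' && $t->[1] eq 'a' && exists $t->[2]{href}) {
            $out .= '<font color=blue>' . $text;
            $in_link = 1;
        }
        elsif ($type eq 'E' && $t->[1] eq 'a' && $in_link) {
            $out .= $text . '</font>';
            $in_link = 0;
        }
        else {
            $out .= $text;
        }
    }
    return $out;
}

print highlight_links('<p>See <a href="/x">this</a> and <a name="n">that</a>.</p>'), "\n";
```

Only anchors carrying an href are wrapped; a bare <a name=...> passes through unchanged. In a streaming proxy the same loop would presumably emit each token as soon as it is recognized instead of building one string.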

The code was originally written as a work-for-hire for a client who had intended my work to become open source. But the client dot-bombed, so I'm still trying to get clarification on whether I can release the code under my own copyright. As soon as that clears up, expect a WebTechniques column or two on it. :)

-- Randal L. Schwartz, Perl hacker

