How to download html with threads?

by Zeokat (Novice)
on Jul 30, 2007 at 19:31 UTC (#629647, perlquestion)

Zeokat has asked for the wisdom of the Perl Monks concerning the following question:

OK, I'm using the LWP module to download HTML files and parse them. The problem is that I have too many HTML files to download and parse: my code gets one URL and, when it finishes, gets the next URL, and so on. There are about 30000 URLs, so this is too slow. I need at least 10 threads downloading at the same time. I read some info about threads and threads::shared, but I'm a beginner and I can't find the way to do that by myself. My code is as follows:
    ##########################START######################
    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;

    print "Working...\n";
    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)");
    $ua->timeout(15);

    open (URL_PLANETS,"url_planets.txt");
    my @urls = <URL_PLANETS>;

    foreach my $planet (@urls) {
        open (NAMES,">>planet_names.txt");
        print "Downloading: " , $planet , "\n";
        my $req = HTTP::Request->new(GET => $planet);
        my $response = $ua->request($req);
        my $content = $response->content();
        print NAMES $content =~ m@Rotations<i>(.*)</i>@m,"\n";
    }
    close(URL_PLANETS);
    close(NAMES);
    ###############################END########################
Any easy code, easy example, tutorial or something for a beginner will be very useful. Sorry for my poor English. Thanks in advance.

Replies are listed 'Best First'.
Re: How to download html with threads?
by ikegami (Pope) on Jul 30, 2007 at 19:32 UTC
      Thanks for the fast reply. I edited the post and added the <code> tags. I will try this module. Thanks a ton, bro, I need this a lot. ;)
Re: How to download html with threads?
by Trizor (Pilgrim) on Jul 30, 2007 at 20:47 UTC

    Threaded programming is not easy, and while parallel LWP may be what you want, if you would like to learn threading in general this is a good example for introducing the concept.

    If you look at the steps your program goes through, they are:

    • Load List of URLS
    • Fetch URL
    • Search through the content
    • Store the retrieved result in a file

    The bolded items can be parallelized but aren't inherently parallel, and each step can be in a separate thread, combining the Pipeline and Work crew models of threading.

    Pipeline? Work crew? What are those, you ask? The work crew model of threading creates multiple threads that do the same thing on different bits of data, letting you use the multiple processors on your system to do things faster. Be warned, however: the overhead of creating a ton of threads will outweigh this benefit. The most work crew threads per job you probably want is the number of cores you have plus one.
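A minimal sketch of the work crew model, assuming a perl built with thread support. The name square_all and the toy job (squaring numbers) are hypothetical and stand in for real work such as fetching a URL. Each worker gets its own undef end marker so that every worker stops.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Work crew: several identical workers pull jobs from one shared queue.
# square_all() is a hypothetical name used only for this illustration.
sub square_all {
    my ($nr_workers, @numbers) = @_;

    my $jobs    = Thread::Queue->new;
    my $results = Thread::Queue->new;

    my @crew = map {
        threads->create(sub {
            # Each worker stops when it dequeues its own undef marker.
            while (defined(my $n = $jobs->dequeue)) {
                $results->enqueue($n * $n);
            }
        });
    } 1 .. $nr_workers;

    $jobs->enqueue(@numbers);
    $jobs->enqueue((undef) x $nr_workers);   # one end marker per worker
    $_->join for @crew;

    # All workers have finished, so the results queue is complete.
    my @squares;
    while (defined(my $r = $results->dequeue_nb)) {
        push @squares, $r;
    }
    # Workers finish in no particular order, so sort before returning.
    return sort { $a <=> $b } @squares;
}

print join(',', square_all(3, 1 .. 5)), "\n";
```

The workers run the same code on different items; only the queue decides which worker gets which job.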

    The pipeline threading model creates separate threads for separate tasks that normally run sequentially but need to run over large amounts of data, so that each task can feed the next. This again can use the multiple cores on a system: if 4 threads are running for a 4-part task, you are essentially running 4 of the tasks in parallel. If one part can run faster (say, the ingest part), it doesn't have to wait; it can complete, freeing the system to do other things while its data waits in the queue.
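A minimal sketch of the pipeline model, again assuming a thread-enabled perl. The function run_pipeline and its two toy stages (trim, then upper-case) are hypothetical; in the downloader they would be "fetch" and "parse". Each stage forwards an undef end marker to the next one when its input is exhausted.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Pipeline: one thread per stage, each stage feeding the next via a queue.
# run_pipeline() and both stages are hypothetical illustrations.
sub run_pipeline {
    my @words = @_;

    my $raw     = Thread::Queue->new;   # main thread -> stage 1
    my $trimmed = Thread::Queue->new;   # stage 1     -> stage 2
    my $done    = Thread::Queue->new;   # stage 2     -> main thread

    # Stage 1: strip surrounding whitespace.
    my $stage1 = threads->create(sub {
        while (defined(my $w = $raw->dequeue)) {
            $w =~ s/^\s+|\s+$//g;
            $trimmed->enqueue($w);
        }
        $trimmed->enqueue(undef);   # pass the end marker downstream
    });

    # Stage 2: upper-case each item.
    my $stage2 = threads->create(sub {
        while (defined(my $w = $trimmed->dequeue)) {
            $done->enqueue(uc $w);
        }
        $done->enqueue(undef);
    });

    $raw->enqueue(@words, undef);

    # With one thread per stage and FIFO queues, items stay in order.
    my @out;
    while (defined(my $w = $done->dequeue)) {
        push @out, $w;
    }
    $_->join for $stage1, $stage2;
    return @out;
}

print join('|', run_pipeline(' mars ', 'venus  ')), "\n";
```

Stage 1 can already be working on the second item while stage 2 is still busy with the first, which is where the speedup comes from.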

    Inter-thread communication

    These processing models sound great, but how does data move through the pipeline? There are many hard, complex, wizardly answers to this question, but perl makes things easy and provides Thread::Queue for dealing with this.

    Thread::Queue provides a thread-safe queue for passing data between threads. Its two main methods are enqueue and dequeue, which behave just like those of the plain, non-thread-safe queue from any basic CS class.
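A small sketch of the basic Thread::Queue calls, run in a single thread so the ordering is easy to follow. The wrapper fifo_demo is a hypothetical name; enqueue, dequeue, dequeue_nb and pending are the module's own methods.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# fifo_demo() is a hypothetical name used only for this illustration.
sub fifo_demo {
    my $q = Thread::Queue->new;

    $q->enqueue('first', 'second');   # append items at the tail
    my $waiting = $q->pending;        # number of items currently queued

    my $head = $q->dequeue;           # take from the head; blocks if empty
    my $next = $q->dequeue_nb;        # non-blocking: returns undef if empty
    my $none = $q->dequeue_nb;        # the queue is empty by now

    return ($waiting, $head, $next, $none);
}

my ($waiting, $head, $next, $none) = fifo_demo();
print "$waiting waiting, then $head, $next, ",
      defined $none ? $none : 'nothing left', "\n";
```

In threaded code the blocking dequeue is usually what you want in a worker, since the thread simply sleeps until another thread enqueues something.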

    To extensively thread the code you provided, three Thread::Queue objects are required (#XXX: Has anyone ever thought of a Thread::Stack object...?): one for sending URLs from the file to the downloader, one for the downloader to send its content to the parser, and one for the parser to send its parsed data to the file writer.

    So I've got my fancy data structures, how do I create threads??

    The threads module facilitates the creation and management of threads. Creating threads is very easy to do with perl: simply pass threads->create a subroutine reference and the arguments for that subroutine, and it will be up and running in its own thread.
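A minimal sketch of creating and joining a thread, assuming a thread-enabled perl. The worker add_up is a hypothetical subroutine. Note the backslash in \&add_up: it passes a reference to the subroutine, whereas a bare &add_up would call it immediately in the current thread.

```perl
use strict;
use warnings;
use threads;

# add_up() is a hypothetical worker used only for this illustration.
sub add_up {
    my $sum = 0;
    $sum += $_ for @_;
    return $sum;
}

# Pass a code reference plus the arguments the sub should receive.
my $thr = threads->create(\&add_up, 1 .. 10);

# join waits for the thread to finish and hands back its return value.
my $total = $thr->join;
print "total: $total\n";
```

Because the thread was created in scalar context, join returns a single scalar here.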

    Putting it all together

    So, you have threads, you have data structures, you have a model. What to do what to do what to do? Well stitch it all together!

    1. You need to refactor your code so that specific tasks are in their own subroutines and set them up to take a Queue or two as arguments and put any useful values into it. Your return value is now your exit code.
    2.
        use strict;
        use warnings;
        use threads;
        use Thread::Queue;
        use LWP::UserAgent;
        use HTTP::Request;

        # $nr_workers is the number of downloader threads reading $queue.
        sub ReadURLS {
            my ($queue,$filename,$nr_workers) = @_;
            # Use three arg open for security reasons, die on errors so we
            # don't spew nonsense or crash worse later.
            open my $urlFile,'<',$filename or die "ReadURLS: bad file!: $!";
            chomp(my @urls = <$urlFile>);
            close $urlFile;
            # Place each line into the queue, followed by one undef per
            # downloader to signal the end of data.
            $queue->enqueue(@urls, (undef) x $nr_workers);
            return 1; # Success! Return true. Or, if you're a unixy person,
                      # return 0 or maybe even 0 but true.
        }

        # Inqueue should be the queue object passed to ReadURLS
        # Outqueue should be the queue object passed to ParseContent
        sub DownloadContent {
            my ($outqueue,$inqueue) = @_;
            # Each thread needs their own UA
            my $ua = LWP::UserAgent->new;
            $ua->agent("Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)");
            $ua->timeout(15);
            # Wait for data and stop when undef comes down the pipe (that
            # means there's no more). Testing with defined() keeps a false
            # value such as "0" from ending the loop early.
            while (defined(my $url = $inqueue->dequeue)) {
                # this part should look familiar
                print "Downloading: $url\n";
                my $req = HTTP::Request->new(GET => $url);
                my $response = $ua->request($req);
                # this changes, send the output to the next task handler.
                $outqueue->enqueue($response->content());
            }
            $outqueue->enqueue(undef);
            return 1; # See above return
        }

        # inqueue should be the outqueue from the downloader sub
        # outqueue should be passed to the output sub.
        # regex is of course, your regex. This allows for re-use of the code.
        # You could also consider taking some parsing rules and using an
        # HTML parser of some type...
        sub ParseContent {
            my ($outqueue,$inqueue,$regex) = @_;
            while (defined(my $content = $inqueue->dequeue)) {
                $outqueue->enqueue(join '',$content =~ m/$regex/m,"\n");
            }
            $outqueue->enqueue(undef);
            return 1;
        }

        # queue should be the outqueue passed to ParseContent
        # $nr_producers is the number of parser threads; each one sends a
        # single undef when it is done, so stop after seeing that many.
        sub WriteOut {
            my ($queue,$filename,$nr_producers) = @_;
            open my $outFH,'>>',$filename or die "WriteOut: open failed: $!";
            while ($nr_producers > 0) {
                my $data = $queue->dequeue;
                if (defined $data) { print $outFH $data; }
                else               { $nr_producers--; } # one parser finished
            }
            close $outFH;
            return 1;
        }
    3. If that seemed confusing, just wait. You'll understand when the code ties it together. You just start all of your various threads with their queues and watch the magic happen.
    4.
        my $nr_workers = 5; # set this value for the number of side by side
                            # downloaders and parsers. Better yet, take it
                            # as an argument
        my $urlfile = "url_planets.txt";  # see comment about arguments
        my $outfile = "planet_names.txt"; # arguments are nice here too, but
                                          # not the current point

        my $URLQueue     = Thread::Queue->new;
        my $ContentQueue = Thread::Queue->new;
        my $ParsedQueue  = Thread::Queue->new;

        my @threadObjs;
        # Create the reading thread, and store a reference to it in the
        # threadObjs array, this will be important later. Note the \& : a
        # bare &ReadURLS would call the sub right here instead of passing it.
        push @threadObjs,threads->create(\&ReadURLS,$URLQueue,$urlfile,$nr_workers);

        # Set up the workers, any number of them can manipulate the queues.
        for (1..$nr_workers) {
            push @threadObjs,threads->create(\&DownloadContent,$ContentQueue,$URLQueue);
            push @threadObjs,threads->create(\&ParseContent,$ParsedQueue,$ContentQueue,qr!Rotations<i>(.*)</i>!);
        }
        push @threadObjs,threads->create(\&WriteOut,$ParsedQueue,$outfile,$nr_workers);

        # Now that all the threads are created, the main thread should call
        # join on all of its child thread objects to ask perl to clean up
        # after them, and so it doesn't exit before they're done causing an
        # abrupt termination.
        foreach my $thr (@threadObjs) {
            $thr->join(); # Join can have a return value, but checking it
                          # adds overhead, only if you really need to
        }
        # At this point, barring some horrible catastrophe, the specified
        # $outfile should have the desired output.

    It should be noted that this is much more than is needed for just a speed boost, and this post is intended to provide some example-based direction for learning threaded programming. If you made it to the end, I suggest you go read perlthrtut and explore the references and See Alsos it mentions.

    If you're looking for a simpler answer, BrowserUK's response will do just fine.

      I applaud (and upvoted) your post, but would just point out one thing. Since you are retrieving the entire contents of the URLs as a single string, and then processing that string using a single regex, the cost of pushing the data to a shared queue, reading it back to process it and then passing the concatenated results to another thread via another queue is going to be far more than it will ever save.

      You are also starting multiple threads all appending to a single file, but you are not mutexing the writes. In the olden days, it was generally considered safe to write in append mode to files from multiple processes because CRTs guaranteed 'atomic' writes in append mode. It's not at all clear whether any or all builds of Perl use the underlying CRT for this. Nor is it clear whether any or all CRTs make the same guarantees when called from multiple threads.
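A minimal sketch of what mutexing the writes could look like, assuming a thread-enabled perl. The shared variable $file_mutex and the function write_lines_locked are hypothetical names; lock comes from threads::shared and is released automatically when the enclosing block ends.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use File::Temp qw(tempfile);

# A shared scalar used purely as a lock; its value is never read.
my $file_mutex :shared;

# write_lines_locked() is a hypothetical name used only for illustration.
sub write_lines_locked {
    my ($path, $nr_threads, $lines_each) = @_;

    my @writers = map {
        my $id = $_;
        threads->create(sub {
            for my $n (1 .. $lines_each) {
                # Only one thread at a time gets past this point; the lock
                # is released at the end of each loop iteration.
                lock($file_mutex);
                open my $fh, '>>', $path or die "open $path: $!";
                print {$fh} "thread $id line $n\n";
                close $fh;
            }
        });
    } 1 .. $nr_threads;

    $_->join for @writers;
    return;
}

# Demo: three threads append four lines each to a temporary file.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close $tmp_fh;
write_lines_locked($tmp_path, 3, 4);

open my $check, '<', $tmp_path or die "open $tmp_path: $!";
my @lines = <$check>;
close $check;
print scalar(@lines), " lines written\n";
```

With the lock in place, correctness no longer depends on whatever append-mode guarantees the C runtime happens to give.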

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        There aren't multiple threads writing to a single file in my example code, only the capability, because WriteOut was wrapped in a sub so it could be made a thread. Only one writer thread is created, to atomically dequeue processed data and write it out.

        As for the overhead issue: in its current state the overhead doesn't merit separate threads, but if this grows and starts using some form of HTML parser in the parse stage, then the split begins to make more sense, as HTML parsers can be slower than downloading the document that feeds them. Separating the stages allows the downloads to finish faster and make room for the parsing.

Re: How to download html with threads?
by BrowserUk (Pope) on Jul 30, 2007 at 20:27 UTC

    With minimal changes to your existing code, an (untested) threaded solution might look like:

        #!/usr/bin/perl -w
        use strict;
        use threads;
        use threads::shared;
        use LWP::UserAgent;
        use HTTP::Request;

        print "Working...\n";
        my $ua = LWP::UserAgent->new;
        $ua->agent("Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)");
        $ua->timeout(15);

        open URL_PLANETS, '<', "url_planets.txt" or die $!;
        my @urls = <URL_PLANETS>;
        close(URL_PLANETS);
        chomp @urls;

        open NAMES, '>>', 'planet_names.txt' or die $!;

        my $mutexStdout :shared;
        my $mutexFile :shared;
        my $running :shared = 0;

        foreach my $planet (@urls) {
            async {
                { lock $running; ++$running; }
                { lock $mutexStdout; print "Downloading: " , $planet , "\n" };
                my $req = HTTP::Request->new(GET => $planet);
                my $response = $ua->request($req);
                my $content = $response->content();
                lock $mutexFile;
                print NAMES $content =~ m[Rotations<i>(.*)</i>]m,"\n";
                { lock $running; --$running; }
            }->detach;
            sleep 1 while $running > 10;
        }
        sleep 1 while $running; ## Let the last 10 finish.
        close(NAMES);

    Updated: misc++

      Just for completeness..
      I believe your code should end with
      sleep 1 while $running; close(NAMES);

      I'd always prefer a self-written solution over an existing module, if it's not too complicated.
      You'll learn this way; besides, using an existing module can sometimes be more expensive than writing your own code,
      since you'll have to learn the API and possibly deal with unexpected behaviour.

      I'm not sure what you're going to do (30000 planets??), but if you have to fetch the data regularly, it might be sensible to save the modification time of the web pages along with your data, and later compare just the modification time of the online pages with your locally stored data.