PerlMonks
Mass downloads.

by BrowserUk (Pope)
on May 20, 2005 at 18:43 UTC ( #459090=monkdiscuss )

I wish to download a large volume of post bodies from PM.

My thought is to do this via the displaytype=xml;node_id=nnnnnn interface, probably early hours on Sunday morning.

  • Is this acceptable?
  • Is there a preferred time (GMT)?
  • At what rate may I make the pulls without having undue impact upon the site?

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" may be good enough for the now, and perfection may be unobtainable, but that should not preclude us from striving for perfection when time, circumstance or desire allow.
    Re: Mass downloads.
    by jeffa (Chancellor) on May 20, 2005 at 19:13 UTC

      When tilly temporarily left Perlmonks and there were rumors that his posts would be deleted (the rumors were that his posts were the property of the company he worked for), i ... um, downloaded them all ... on a weekend, in XML format. This did not seem to stress the server anywhere near as badly as a Super Search tends to (especially back then, before tye made major improvements), but i had my script sleep for 1 minute between hits and let the script run for several hours.

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)
      
    Re: Mass downloads. (N+5)
    by tye (Cardinal) on May 20, 2005 at 19:40 UTC

      Time how long a fetch takes, add at least a few seconds to that, and then wait at least that long before starting the next fetch. That should do a pretty good job of preventing server overload for batches of requests that only get run rarely.

      Note that I mean for you to time each fetch. If the server gets bogged down, then your script should immediately notice that the previous fetch took longer and automatically compensate by waiting longer before trying the next fetch.
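      tye's scheme can be sketched like this. A minimal illustration, not his code: the $fetcher coderef and the 3-second margin are my own assumptions; plug in your own LWP call.

      ```perl
      #! perl -slw
      use strict;
      use Time::HiRes qw( time sleep );

      # Adaptive client-side throttle: time each fetch, then wait at least
      # that long (plus a safety margin) before starting the next one.
      # $fetcher is any coderef that fetches one node -- an assumption here.
      sub throttled_fetch {
          my( $fetcher, @nodes ) = @_;
          my $margin = 3;                    # assumed safety margin, in seconds
          my @results;
          for my $i ( 0 .. $#nodes ) {
              my $start = time;
              push @results, $fetcher->( $nodes[ $i ] );
              my $elapsed = time - $start;
              sleep( $elapsed + $margin ) if $i < $#nodes;  # no wait after last
          }
          return @results;
      }
      ```

      If the server bogs down, $elapsed grows and so does the wait, so the throttle backs off automatically, which is exactly the compensation described above.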

      Thanks.

      - tye        

        Hey, cool, that's like reflexive tit for tat.
          It's client-side throttling, effective when both sides play nice. If perlmonks wanted to add complexity and security, it'd have to be done on the server side. You lose some flexibility, such as getting 10 nodes really fast; regardless, at the end of the day it was only 10 requests. Not a horrible thing, but mine is only an opinion. :)

          ----
          Give me strength for today.. I will not talk it away..
          Just for a moment.. It will burn through the clouds.. and shine down on me.

        I'm not really sure how many posts I would be pulling -- it is dependent upon the contents of those I pull -- but it would probably be in the order of 10s of 1000s.

        At the rate of 1 every 5 or more seconds, 10,000 would take over 13 hours, which, given the 2-hour cutoff on my dialup account, is somewhat impractical. I was hoping to get authority to run at a rather faster rate at times of low system load.

        If that is not permissible, I may have to abort the idea.



          Why don't you ask jcwren to give you an account on perlmonk.org? There you will have all the time you need.

          HTH, Valerio

          Can't whatever program you run download them in batches and keep a marker of what node was last successfully downloaded and saved? Heck, when it can't reach the server, have it sleep for 5 or 10 minutes. When you connect again, it'll just pick up where it left off.
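          That checkpointing idea might look something like this. A sketch under stated assumptions: the 'last_node.txt' marker file name and the helper names are mine, not from the thread.

          ```perl
          #! perl -slw
          use strict;

          # Resumable batch download: record the last node saved successfully,
          # so a dropped connection just means restarting the script.
          # The marker file name is an assumption for illustration.
          sub next_batch {
              my( $marker_file, @all_ids ) = @_;
              my $last = -1;
              if( open my $fh, '<', $marker_file ) {
                  chomp( $last = <$fh> );
              }
              # Skip every node id at or below the marker.
              return grep { $_ > $last } @all_ids;
          }

          sub save_marker {
              my( $marker_file, $id ) = @_;
              open my $fh, '>', $marker_file or die $!;
              print $fh $id;
          }
          ```

          Call save_marker() after each successful fetch; on restart, next_batch() returns only the ids still to do.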


    Re: Mass downloads.
    by BUU (Prior) on May 21, 2005 at 01:38 UTC
      What about hitting thepen instead? (Unless it stopped being up to date at some point.) All of its pages are static, so none of the bloated CGI stuff perlmonks goes through for each page.

        As was noted here somewhere recently, thepen has been redirecting back to this site for a while now.


          Whups, you are completely correct. I stand corrected. As to your stated goal, I wonder if it would be possible to get the raw database dump, so perlmonks doesn't have to actually render however many pages.
    Re: Mass downloads.
    by TedPride (Priest) on May 21, 2005 at 07:53 UTC
      Why not have PM append all posts to a log file (cropped every so often via crontab) that people can download? We're probably talking at most a MB or two of download with very little additional strain on the site.
    Re: Mass downloads.
    by zby (Vicar) on Jun 06, 2005 at 16:10 UTC
      Would you provide the downloaded texts for public consumption? I have an idea to make a 'module index' for PM - like a quick list of all nodes mentioning a module, or some module popularity contests. If the data was available, this should not be a very complicated task - but doing the downloading seems a bit excessive. I am sure others will come up with other ideas on how to use the downloaded texts.

        I abandoned the idea. To do justice to the indexing I envisioned, I would have had to download the great majority of PM's nodes. At the mandated rate of 1 every 5+ seconds, it would require 500 hours. Split that into 2-hour chunks of connect time and it becomes untenable.

        Hence I've never bothered to extend the scripts beyond their simplest form:

        PMDown.pl takes a filename containing a list of PM node IDs to download:

        Beware: Even with 1 thread running, this will far exceed the approved download rate.
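        The original PMDown.pl was attached to the post and isn't reproduced here. A minimal sketch of the approach: the URL format comes from the displaytype=xml;node_id=nnnnnn interface mentioned at the top of the thread, while LWP::Simple and the 6-second sleep are my assumptions, not the original code.

        ```perl
        #! perl -slw
        use strict;

        # Build the XML-view URL for one node, per the displaytype=xml interface.
        sub node_url {
            my( $id ) = @_;
            return "http://www.perlmonks.org/?displaytype=xml;node_id=$id";
        }

        # Read a list of node IDs, one per line, and save each node as <id>.xml.
        if( @ARGV ) {
            require LWP::Simple;

            my( $listfile ) = @ARGV;
            open my $fh, '<', $listfile or die $!;
            chomp( my @ids = <$fh> );

            for my $id ( @ids ) {
                LWP::Simple::getstore( node_url( $id ), "$id.xml" );
                sleep 6;    # stay on the polite side of 1 request per 5 seconds
            }
        }
        ```

        Single-threaded with a fixed sleep, this stays within the rate discussed above; the threaded original, as the warning says, did not.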

        ExtractWords.pl

        #! perl -slw
        use G;

        my %words;
        while( <> ) {
            $words{ $_ }++ for m[\b([a-zA-Z][a-zA-Z']+[a-zA-Z])\b]g;
        }

        open WORDS, '>', 'words.dat' or die $!;
        print WORDS for sort keys %words;
        close WORDS;

        IndexDocs.pl

        #! perl -slw
        use strict;
        use G;

        $| = 1;

        # Load the (manually filtered) word list.
        chomp( my @words = do{ open my $fh, '<', 'words.dat'; <$fh> } );
        print "loaded: " . @words . ' words';

        local $/;
        my %index;
        @index{ @ARGV } = ( '' ) x @ARGV;

        # For each file, set bit N of its vector if word N appears in it.
        while( <> ) {
            chomp( my $file = lc );
            1 + index( $file, $words[ $_ ] )
                and vec( $index{ $ARGV }, $_, 1 ) = 1
                for 0 .. $#words;
        }

        open INDEX, '>', 'index.dat' or die $!;
        print INDEX "$_(@{[ unpack '%b*', $index{ $_ } ]}) : [@{[ unpack 'b*', $index{ $_ } ]}]"
            for sort keys %index;
        close INDEX;

        Note: G.pm is Jenda's module that does wildcard ARGV expansion.

        The result of processing is a file that looks like this:

        .\171594.txt : all an and anonymous asked at back be better but by com concerning contain create directories even excluding expression following for gone has have hours in index jun last list looking monks of on over pl probably question renders replies round seekers simple thanks that the this to want wisdom without would

        .\171599.txt : am and are at be being brothers but by com comes create darkness directories doubt enlightenment etiquette help here if in index jun light list living me my no not of on order piece pl re replies reply seeking so someone strong sure tell that the thread to unsure until way weak will with without

        But I manually filtered the intermediate words list.


          I will not use it, but thanks. The pain of downloading was exactly the reason why I asked if you would provide the results of it :)

    Node Type: monkdiscuss [id://459090]
    Approved by ww
    Front-paged by calin