PerlMonks  

Re: Scraping using WWW::Mechanize::Firefox

by ww (Bishop)
on Feb 06, 2013 at 23:28 UTC (#1017527)


in reply to Scraping using WWW::Mechanize::Firefox

Sorry that this reply is almost boiler-plate, but...

  1. Does that site's usage guidance permit scraping?
  2. Do you have authority/permission to extract data?
  3. Does the site publish an API you could use rather than rolling your own?
  4. Assuming that by "the number 1975" you mean you're looking for white males, DOB in 1923 and resident of San Diego Cty in 1940, who still turn up on the site in 1975 data, why would you expect that (to find 1975 data in records focused on 1923 and 1940)? Is your description of the site imprecise?
      If the 1975 you're looking for is not a date, please describe what it is.
  5. And speaking of descriptions, "doe not seem to wok" (sic, interpreted as 'does not seem to work') is NOT a description adequate to help us help you.
      What doesn't work? How do the results, if any, vary from your intent?
      What warnings or error messages appear (please report them verbatim).

Perhaps some of these questions will help you diagnose your problems; that sometimes happens just by virtue of asking the question ( see this one node (among many) re the Teddy Bear effect ); sometimes a set of possible issues is useful... and most certainly, more detail about how your program fails will help us to help you.


Re^2: Scraping using WWW::Mechanize::Firefox
by tobyink (Abbot) on Feb 06, 2013 at 23:56 UTC

    If you visit the link in your browser you'll see the page contains:

    1-20 of 1,975 results

    I assume that this is the source of the "1975" which the OP wants to extract.
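A minimal sketch of pulling that number out of the text once the page content is in hand. It assumes the count always appears in the form "1-20 of 1,975 results"; the helper name total_results is hypothetical and is not taken from the OP's code, and fetching the page itself is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the total from text like "1-20 of 1,975 results".
# Assumes the page always uses the "... of N results" wording.
sub total_results {
    my ($text) = @_;
    # Capture the number after "of", allowing thousands separators.
    return unless $text =~ /\bof\s+(\d[\d,]*)\s+results\b/;
    (my $total = $1) =~ s/,//g;    # strip commas: "1,975" becomes "1975"
    return $total;
}

print total_results('1-20 of 1,975 results'), "\n";
```

Stripping the comma matters if the value is later compared or summed as a number.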

    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
      Yes, that is clearly the 1975 I am looking for. By "does not work" I mean that I am not getting 1975. The code would be working if it gave me 1975 in a legitimate way for that test case.
      Now that tobyink has -- in effect, :-) -- vouched for that site and specific address, I've been trying to get there too.

      That raises a new issue: when I connected (and it's an LDS genealogy site, for all concerned) the query ran a very long time -- more than 3 minutes of rotating arrow without any timeout or other informational message -- each of the three times I tried to follow it.

      I don't have superfast DSL, but it's not that slow...and so I wonder if the problem may be in the length of the (multiply-compounded) query or in the OP's connection.

Re^2: Scraping using WWW::Mechanize::Firefox
by census (Initiate) on Feb 07, 2013 at 01:43 UTC
    I've updated the code. Now I'm stuck in a different loop, as I noted. On my connection and computer, the site takes less than 10 seconds to load; it was probably less than 2.5 seconds. However, at the moment, Perl has been going through that infinite loop of printing sleep for well over a minute.
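One way to keep a wait loop like this from running forever is to give it a deadline. The sketch below is not the code from this thread, which is not shown here; wait_for is a hypothetical helper, and the check passed to it is a stand-in for whatever test the real script makes against the loaded page.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper: call $check until it returns a true value
# or $timeout seconds have passed, pausing $interval seconds between tries.
sub wait_for {
    my ($check, $timeout, $interval) = @_;
    my $deadline = time() + $timeout;
    while (1) {
        my $result = $check->();
        return $result if $result;
        return if time() >= $deadline;    # give up instead of looping forever
        sleep($interval);
    }
}

# Stand-in for "is the results count on the page yet?"; succeeds on the third poll.
my $polls = 0;
my $found = wait_for(sub { ++$polls >= 3 ? '1-20 of 1,975 results' : undef }, 5, 0.1);
print defined($found) ? "found: $found\n" : "timed out\n";
```

If the loop times out, that at least separates "the page never produced the text" from "the page is slow", which is the question raised earlier in the thread.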
