PerlMonks  

Scraping using WWW::Mechanize::Firefox

by census (Initiate)
on Feb 06, 2013 at 22:42 UTC ( #1017521=perlquestion )
census has asked for the wisdom of the Perl Monks concerning the following question:

I am interested in writing a Perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219

That page shows the number of white men born in 1923 who lived in San Diego County, California, in 1940. I want to do this in a loop so that it generalizes over multiple counties and birth years.
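
One way to generalize the link over counties and birth years is to build it from its parts. The following is only a sketch: `encode_component`, `quote_if_spaced`, and `build_url` are hypothetical helper names, and the rule that values containing a space get double quotes is an assumption inferred from the single example URL above, not from any documentation of the site.

```perl
use strict;
use warnings;

# Percent-encode everything outside the RFC 3986 unreserved set.
sub encode_component {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Assumption: values containing whitespace are wrapped in double quotes,
# as "San Diego" is in the example URL.
sub quote_if_spaced {
    my ($value) = @_;
    return $value =~ /\s/ ? qq{"$value"} : $value;
}

# Build the results URL for one state, county, and birth year.
sub build_url {
    my (%arg) = @_;
    my $query = join ' ',
        '+event_place_level_1:' . quote_if_spaced($arg{state}),
        '+event_place_level_2:' . quote_if_spaced($arg{county}),
        "+birth_year:$arg{year}-$arg{year}~",
        '+gender:M',
        '+race:White';
    return 'https://familysearch.org/search/collection/results'
         . '#count=20&query=' . encode_component($query)
         . '&collection_id=2000219';
}

# Print one URL per county and birth year.
for my $county ('San Diego', 'Los Angeles') {
    for my $year (1923 .. 1924) {
        print build_url(state => 'California', county => $county, year => $year), "\n";
    }
}
```

For California, San Diego, and 1923 this reproduces the link quoted in the question; whether other counties and years are accepted in the same form would need to be checked against the site.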

I have installed WWW::Mechanize::Firefox.

Here is some code that I have written. I make it to the forms section and then I'm completely confused about what to do. :(

    use strict;
    use warnings;
    use WWW::Mechanize::Firefox;

    my $mech = WWW::Mechanize::Firefox->new(
        activate => 1,    # bring the tab to the foreground
    );
    $mech->get(
        'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',
        ':content_file' => 'main.html',
        synchronize     => 0,
    );

    my $retries = 10;
    while ($retries-- and $mech->is_visible( xpath => '//*[@id="hourglass"]' )) {
        print "Sleep until we find the thing\n";
        sleep 2;
    }
    die "Timeout while waiting for application" if 0 > $retries;

    # Now the hourglass is not visible anymore

    # fill out the search form
    my @forms = $mech->forms();
    # <input id="census_bp" name="birth_place" type="text" tabindex="0"/>
    # A selector prefixed with '#' must match the id attribute of the input.
    # A selector prefixed with '.' matches the class attribute.
    # A selector prefixed with '^' or with no prefix matches the name attribute.
    $mech->field( birth_place => 'value_for_birth_place' );

    # Click on the submit
    $mech->click({ xpath => '//*[@class="form-submit"]' });

Would appreciate any help in getting it to work!

Re: Scraping using WWW::Mechanize::Firefox
by ww (Bishop) on Feb 06, 2013 at 23:28 UTC
    Sorry that this reply is almost boiler-plate, but...
    1. Does that site's usage guidance permit scraping?
    2. Do you have authority/permission to extract data?
    3. Does the site publish an API you could use rather than rolling your own?
    4. Assuming that by "the number 1975" you mean you're looking for white males, DOB in 1923 and resident of San Diego Cty in 1940, who still turn up on the site in 1975 data, why would you expect that (to find 1975 data in records focused on 1923 and 1940)? Is your description of the site imprecise?
        If the 1975 you're looking for is not a date, please describe what it is.
    5. And speaking of descriptions, "doe not seem to wok" (sic, interpreted as 'does not seem to work') is NOT a description adequate to help us help you.
        What doesn't work? How do the results, if any, vary from your intent?
        What warnings or error messages appear (please report them verbatim).

    Perhaps some of these questions will help you diagnose your problems; that sometimes happens just by virtue of asking the question ( see this one node (among many) re the Teddy Bear effect ); sometimes a set of possible issues is useful... and most certainly, more detail about how your program fails will help us to help you.

      If you visit the link in your browser you'll see the page contains:

      1-20 of 1,975 results

      I assume that this is the source of the "1975" which the OP wants to extract.
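
      Once the text of the rendered page is in hand (however it is fetched), pulling the total out of a line like the one above can be sketched as follows. `extract_total` is a hypothetical helper name, and the pattern assumes the wording "of N results" seen on that page.

```perl
use strict;
use warnings;

# Pull the total out of text such as "1-20 of 1,975 results".
# Returns nothing when the text does not match.
sub extract_total {
    my ($text) = @_;
    return unless $text =~ /\bof\s+([\d,]+)\s+results\b/;
    (my $total = $1) =~ tr/,//d;    # drop thousands separators
    return $total;
}

print extract_total('1-20 of 1,975 results'), "\n";
```

      Stripping the comma matters because the page shows "1,975" while the number wanted is 1975.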

      package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
        Yes, that is clearly the 1975 I am looking for. By "does not work" I mean that I am not getting 1975. The code would be working if it returned 1975 legitimately for that test case.
        Now that tobyink has -- in effect, :-) -- vouched for that site and specific address, I've been trying to get there too.

        That raises a new issue: when I connected (and it's an LDS genealogy site, for all concerned) the query ran a very long time -- more than 3 minutes of rotating arrow without any timeout or other informational message -- each of three times I tried to follow it.

        I don't have superfast DSL, but it's not that slow...and so I wonder if the problem may be in the length of the (multiply-compounded) query or in the OP's connection.

      I've updated the code. Now I'm stuck in a different loop, as I noted. On my connection and computer, the site takes less than 10 seconds to load; it was probably under 2.5 seconds. At the moment, however, Perl stays in that loop, printing the sleep message, for well over a minute.
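
      A wait loop like this can be guarded by keeping the retry limit in one place, so it cannot run forever. The following is a minimal sketch, not the module's own mechanism: `wait_until` is a hypothetical helper, and the condition used here is a stand-in counter. In the script above, the condition could instead be a closure wrapping the `is_visible` check from the original code, negated so that it is true once the hourglass is gone.

```perl
use strict;
use warnings;

# Poll $condition until it returns true or $max_tries attempts are used up.
# Returns 1 on success, 0 on timeout. $sleep_seconds may be 0.
sub wait_until {
    my ($condition, $max_tries, $sleep_seconds) = @_;
    for my $try (1 .. $max_tries) {
        return 1 if $condition->();
        sleep $sleep_seconds if $sleep_seconds and $try < $max_tries;
    }
    return 0;
}

# Stand-in condition: becomes true on the third check.
my $checks = 0;
my $ready  = wait_until(sub { ++$checks >= 3 }, 10, 0);
print $ready ? "ready after $checks checks\n" : "timed out\n";
```

      If the helper returns 0, the caller can `die` with a timeout message instead of looping on.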
