PerlMonks  

Getting the innerHTML from the DOM , not the “source”

by Karels (Initiate)
on Apr 23, 2013 at 18:24 UTC (#1030210, perlquestion)
Karels has asked for the wisdom of the Perl Monks concerning the following question:

I am processing a webpage with Perl using IO::All and the io($url) function.

The pages I am processing ostensibly have well-formed URLs, e.g.:

http://www.forbes.com/billionaires/list/#page:15_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states

Note the #page:15_ portion of the URL.

When I view the source, or print out the source returned from io(), I see names that appear on page 1 of the website, e.g.:

<!-- Start: list_row -->
<tr>
  <td class="rank">1</td>
  <td class="company">
    <a href="/profile/carlos-slim-helu/">
      <img src="http://i.forbesimg.com/media/lists/people/carlos-slim-helu_50x50.jpg" alt="">
      <h3>Carlos Slim Helu & family</h3>
    </a>
  </td>
  <td class="worth">$73 B</td>
  <td>73</td>
  <td>telecom</td>
  <td>Mexico</td>
</tr>

However, if I open the page in the browser's developer tools and look through the object model (the HTML tab in IE), I see the list entries for the people I expect to see on page 15, e.g.:

<TR>
  <TD class=rank>1342</TD>
  <TD class=company><A href="/profile/park-hyeon-joo/">
    <IMG alt="" src="http://i.forbesimg.com/media/lists/people/park-hyeon-joo_50x50.jpg">
    <H3>Park Hyeon-Joo</H3></A>
  </TD>
  <TD class=worth>$1 B</TD>
  <TD>54</TD>
  <TD>Mirae</TD>
  <TD>South Korea</TD>
</TR>

Can I get Perl to open the page and give me the right contents?

Someone suggested looking at Mechanize. I have looked at pQuery and Web::Query. The ->text methods just return the data from page 1, not page 15.

Is there a way to keep the URL "alive" until it refreshes with the correct content?

Thanks in advance
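
A likely explanation for the page 1 result is that everything after the # in a URL is a fragment: an HTTP client does not send it to the server, so the server returns the same base page and the browser's JavaScript then uses the fragment to load page 15. The minimal sketch below only illustrates that split. The helper name split_fragment is hypothetical, and the shortened URL is an assumption made for readability.

```perl
use strict;
use warnings;

# Hypothetical helper: split a URL into the part that is requested from the
# server and the fragment, which stays on the client side.
sub split_fragment {
    my ($url) = @_;
    my ($request_part, $fragment) = split /#/, $url, 2;
    return ($request_part, $fragment);
}

# Shortened form of the URL from the question (assumption, for readability).
my $url = 'http://www.forbes.com/billionaires/list/#page:15_sort:0_direction:asc';
my ($sent, $fragment) = split_fragment($url);

print "Requested from server: $sent\n";
print "Kept client-side:      ", (defined $fragment ? $fragment : '(none)'), "\n";
```

If that is what is happening here, any fetch that does not run the page's JavaScript would see only the markup for the base URL, whatever the fragment says.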

Re: Getting the innerHTML from the DOM , not the “source”
by moritz (Cardinal) on Apr 23, 2013 at 18:28 UTC
      OK, so I am using WWW::Mechanize::Firefox and I am getting at the data I am after. From a processing POV I need to loop through several pages. I would like to update the contents of $mech with the contents of a new URL. Is there a way to do this? When I try re-issuing $mech->get with a new argument, the program appears to hang. I don't see anything obvious in the Cookbook, examples, etc.

      Actually, now I get a warning:

      Subroutine MozRepl::__load_plugins redefined at C:/Perl/site/lib/Module/Pluggable/Fast.pm line 104.
      Any thoughts would be welcome.
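
The page loop described above could be structured roughly as follows. This is a sketch, not a tested solution to the hang: page_url and for_each_page are hypothetical helpers, the URL pattern is copied from the example in the question, and the fetcher is a stand-in so the code runs without a browser. The callback is where a call such as $mech->get($url) would go.

```perl
use strict;
use warnings;

# Hypothetical helper: build the list URL for a given page number, following
# the "#page:N_..." pattern seen in the question's example URL.
sub page_url {
    my ($page) = @_;
    return 'http://www.forbes.com/billionaires/list/'
         . "#page:${page}_sort:0_direction:asc_search:"
         . '_filter:All%20industries_filter:All%20countries_filter:All%20states';
}

# Hypothetical driver: call $fetch once per page and collect the results.
# $fetch is where a browser-backed call such as $mech->get($url) would go.
sub for_each_page {
    my ($first, $last, $fetch) = @_;
    return map { $fetch->(page_url($_)) } $first .. $last;
}

# Stand-in fetcher so the sketch runs without a browser: it only echoes the URL.
my @seen = for_each_page(14, 16, sub { my ($url) = @_; return $url });
print "$_\n" for @seen;
```

Keeping the URL construction separate from the fetch makes it easier to check which URL is being requested when a call appears to hang.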
