PerlMonks  

Re: Parsing HTTP...

by philipbailey (Chaplain)
on Jun 15, 2011 at 18:58 UTC (#909837)


in reply to Parsing HTTP...

Ask yourself what is different about the two requests: the one from your browser and the one from your Perl code. The differences usually fall into two classes:

  1. Differences in the request.
  2. Differences in the processing of the response document.

For (1), remember that the request is much more than the URL: your browser may send a number of headers along with it. Headers that commonly change behaviour include Cookie, User-Agent and Referer, but any header is worth checking. You can inspect the headers by sniffing the network (Wireshark), with a browser plugin (e.g. Firebug for Firefox) or through a proxy (Fiddler, on Windows). LWP (if that is what you are using) allows you to change the headers of your request.
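As a rough sketch of that last point, the following builds a request with LWP's companion class HTTP::Request and sets the three headers mentioned above. The URL, the User-Agent string and the cookie value are all made-up placeholders, and `build_request` is just a hypothetical helper name; substitute whatever your sniffer shows the browser actually sending.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper: build a GET request carrying browser-like headers.
# Every header value below is a placeholder, not taken from a real site.
sub build_request {
    my ($url) = @_;
    my $request = HTTP::Request->new(GET => $url);
    $request->header(
        'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:5.0) Gecko/20100101 Firefox/5.0',
        'Referer'    => 'http://www.example.com/',
        'Cookie'     => 'session=abc123',
    );
    return $request;
}

my $request = build_request('http://www.example.com/page');

# Dump the request so it can be compared line by line with the browser's.
print $request->as_string;

# To actually send it (needs network access):
#   my $ua       = LWP::UserAgent->new;
#   my $response = $ua->request($request);
#   print $response->decoded_content if $response->is_success;
```

Printing the request before sending it makes it easy to diff against the capture from Wireshark or Firebug and spot whichever header is still missing.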

For (2), the cause is usually JavaScript. The commonly used Perl tools, LWP and its derivatives (e.g. WWW::Mechanize), do not support JavaScript. In most cases you can read the JavaScript yourself and mimic what it does by hand, with further requests or Perl code. There do seem to be some Perl modules floating around that claim JavaScript capabilities, usually by driving a conventional browser; have a look on CPAN. You could also look at Selenium.
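Mimicking the script by hand often boils down to finding the URL it fetches and requesting that yourself. Here is a minimal sketch, assuming a page whose script loads its data with a jQuery-style `$.get("...")` call; the HTML fragment, the base URL and the pattern are all invented for illustration, and a real page will need a pattern fitted to its own script.

```perl
use strict;
use warnings;
use URI;

# Made-up page fragment standing in for the HTML that LWP returned.
my $html = <<'HTML';
<script>
  $.get("/data/results.json?id=42", function (d) { render(d); });
</script>
HTML

# Hypothetical address of the page the fragment came from.
my $base = 'http://www.example.com/page';

# Collect every URL passed to $.get(...) and make it absolute.
my @urls;
while ($html =~ /\$\.get\(\s*"([^"]+)"/g) {
    push @urls, URI->new_abs($1, $base)->as_string;
}

print "$_\n" for @urls;
```

Each URL found this way can then be fetched with LWP like any other, usually returning the data the browser would have rendered.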

Finally, think laterally: perhaps you can get your data another way. The website you mention seems to have various XML feeds.


Re^2: Parsing HTTP...
by insectopalo (Initiate) on Jun 19, 2011 at 23:01 UTC

    Thank you all. I have been trying some of the alternatives, but it seems pretty difficult in general. However, using WWW::Mechanize::Firefox has (in practical terms) solved the issue. It looks gross, though, to see the browser doing the dirty work.
