http://www.perlmonks.org?node_id=909765

insectopalo has asked for the wisdom of the Perl Monks concerning the following question:

Hi there

Don't kill me if this is stupid. I'm a beginner... What I find puzzling is that when I copy and paste the following URL into the browser, I get a fantastic page of results with details on prices.

http://www.booking.com/searchresults.html?src=index&error_url=http%3A%2F%2Fwww.booking.com%2Findex.en.html%3Fsid%3D6c8cbc31dd4bdb6a976db86f3442c32b%3B&sid=6c8cbc31dd4bdb6a976db86f3442c32b&si=ai%2Cco%2Cci%2Cre%2Cdi&ss=Andalucia%2C+Spain&checkin_monthday=1&checkin_year_month=2011-7&checkout_monthday=7&checkout_year_month=2011-7&group_adults=1&group_children=0&clear_group=1

HOWEVER, when I do

$agent->get($url);

The $agent->content is not complete, it is missing the prices... Is that Javascript? Why doesn't get() "get" it, if it's HTML in the end?

Cheers!

Re: Parsing HTTP...
by Corion (Patriarch) on Jun 15, 2011 at 10:39 UTC
Re: Parsing HTTP...
by philipbailey (Curate) on Jun 15, 2011 at 18:58 UTC

    Ask yourself "what is different" between the two requests: the one from your browser and the one from your Perl code. There are two classes of common reasons for differences:

    1. Differences in the request.
    2. Differences in the processing of the response document.

    For (1), remember that the request is much more than the URL: a number of headers may be sent by your browser. Headers that commonly change behaviour include Cookie, User-Agent and Referer, but any header should be looked at. You can inspect the headers by sniffing the network (Wireshark), by using a browser plugin (e.g. Firebug for Firefox), or by using a proxy (Fiddler, on Windows). LWP (if that is what you are using) allows you to change the headers of your request.
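    A minimal sketch of point (1), assuming WWW::Mechanize (which sits on top of LWP): the User-Agent string and Referer value below are made-up examples of what a browser might send, not values the site is known to require.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Present browser-like headers so the server is more likely to return
# the same page it gives a real browser.
my $agent = WWW::Mechanize->new(
    agent      => 'Mozilla/5.0 (X11; Linux x86_64; rv:5.0) Gecko/20100101 Firefox/5.0',
    cookie_jar => {},    # keep session cookies (e.g. the sid seen in the URL) across requests
);
$agent->add_header(
    'Referer'         => 'http://www.booking.com/index.en.html',
    'Accept-Language' => 'en-US,en;q=0.5',
);

# then fetch as before:
# $agent->get($url);
# print $agent->content;
```

    Comparing these values against what Wireshark or Firebug shows your browser sending is usually the quickest way to spot the difference that matters.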

    For (2), usually this is Javascript. The commonly-used Perl tools, LWP and derivatives (e.g. WWW::Mechanize) do not support Javascript. In most cases you can read the Javascript yourself and manually mimic what it is doing by further requests or Perl code. But there do seem to be some Perl modules floating around that claim Javascript capabilities, usually through a conventional browser; have a look on CPAN. You could also look at Selenium.

    Finally, think laterally--perhaps you can get your data another way. The website you mention seems to have various XML feeds.

      Thank you all. I have been trying some of the alternatives, but it seems pretty difficult in general. However, using WWW::Mechanize::Firefox has, in practical terms, solved the issue. It feels gross, though, to watch the browser doing the dirty work.
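      For reference, the WWW::Mechanize::Firefox approach mentioned above looks roughly like this. It is a sketch, not a drop-in solution: it assumes a running Firefox with the MozRepl add-on installed and active, which is how the module drives the browser.

```perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Let a real Firefox execute the page's Javascript, then read the
# finished HTML back into Perl.
my $mech = WWW::Mechanize::Firefox->new();

$mech->get('http://www.booking.com/searchresults.html?...');  # full URL as in the original post
print $mech->content;  # page content after Javascript has run
```

      The trade-off is exactly the one the OP notices: you get a fully rendered page, but at the cost of driving a visible browser instead of a lightweight HTTP client.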

Re: Parsing HTTP...
by sundialsvc4 (Abbot) on Jun 15, 2011 at 12:11 UTC

    In a way, “Perl isn’t about Perl ... it’s about the CPAN library.”

    Wanna do something?   Odds are that thousands of other folks have wanted to do the same thing, and that some of them have written a good CPAN module to do it.   You can also be sure that they have talked about it extensively here, at PerlMonks.   So, whether your assigned task is “HTML Parsing” or something else, “search, and you shall find.”

    Only n00b13s try to start on a task by coding it from scratch, given that hundreds of well-tested packages are available for almost any conceivable purpose.   That simply is not how things are done.

      $agent->get($url); $agent->content();
      There's a very good chance the OP is using a CPAN module. Further, the question is not about parsing HTML but about incomplete HTML. Finally, the OP hasn't tried "...to code it from scratch".

      The OP asked a very good question (not, as he/she fears, in the least stupid) and made a reasonable stab at the answer. Corion has helpfully given some pointers that may well lead to a solution.

      ++ to the OP and ++ to Corion. Your contribution? I'll leave that as an exercise for the reader.