PerlMonks  

Screen scraping complex tables and divs

by parser (Acolyte)
on Oct 13, 2017 at 18:53 UTC ( [id://1201337]=perlquestion )

parser has asked for the wisdom of the Perl Monks concerning the following question:

I have been screen scraping for a few years with WWW::Mechanize and HTML::TokeParser and they have served me well. However, I recently encountered a set of pages which use complex table structures and numerous tab divs. I need a module (or methodology) which will allow me to search for sections of HTML in a more jQuery find()-like manner rather than simply consuming tokens from a stream of HTML.

I read through the post The State of Web spidering in Perl and, while it was helpful, its focus is more on spidering than scraping. I am interested in recommendations from the Monks on whether there are higher-order methods of finding constructs in HTML using Perl besides regular expressions and token parsing.

I read about Mahmoud's jQuery module on CPAN with interest, but it appears not to have been maintained since 2013 and I am uncertain whether it can query table structures. To be fair, jQuery is limited in querying unlabeled table structures as well.

Here is a small example of what I am trying to accomplish:
1) Find the 6th and 9th rows in a named table (given an id) and pull out the name and value pairs.
2) Slurp in every row in a named table and parse out the name value pairs.

Cheers!

Replies are listed 'Best First'.
Re: Screen scraping complex tables and divs (updated)
by LanX (Saint) on Oct 13, 2017 at 19:18 UTC
    I'm confused because the thread you linked to is already very good.

    You mostly use CSS selectors or XPath in live inspections (i.e. when you need a browser for JS), and as far as I remember WWW::Mechanize::Firefox and its various siblings support both.

    The alternative is mirroring the DOM into a Perl/XML data structure and using its query API (mostly something like XPath).
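
A minimal sketch of that approach, assuming XML::LibXML as the Perl/XML backend (the thread does not name a specific module, so this is one choice among several). The table id "specs" and the sample rows are made up for illustration. It covers both tasks from the question: picking rows 6 and 9, and collecting every row as a name/value pair.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample data; the table id "specs" is hypothetical.
my @data = (
    [ Name   => 'Widget' ], [ Color  => 'Blue' ], [ Weight   => '2 kg' ],
    [ Width  => '10 cm' ],  [ Height => '5 cm' ], [ Material => 'Steel' ],
    [ Price  => '9.99' ],   [ Stock  => '42' ],   [ Origin   => 'Norway' ],
);
my $html = '<table id="specs">'
    . join( '', map { "<tr><td>$_->[0]</td><td>$_->[1]</td></tr>" } @data )
    . '</table>';

# Parse the HTML into a DOM; recover => 2 tolerates tag soup silently.
my $doc  = XML::LibXML->load_html( string => $html, recover => 2 );
my $rows = '//table[@id="specs"]//tr';

# 1) Rows 6 and 9 of the table with the given id.
my @picked;
for my $row ( $doc->findnodes( $rows . '[position() = 6 or position() = 9]' ) ) {
    push @picked, [ $row->findvalue('td[1]'), $row->findvalue('td[2]') ];
}

# 2) Every row as a name/value pair.
my %pairs;
for my $row ( $doc->findnodes($rows) ) {
    my ( $name, $value ) = map { $_->textContent } $row->findnodes('td');
    next unless defined $value;    # skip header or malformed rows
    $pairs{$name} = $value;
}

printf "%s = %s\n", @$_ for @picked;
```

Note that position() counts rows among siblings of the same parent, so nested tables or several tbody sections would need a more specific path.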

    Maybe you should ask more precisely and show what you tried?

    update

    > 1) Find the 6th and 9th rows in a named table (given an id) and pull out the name and value pairs.

    > 2) Slurp in every row in a named table and parse out the name value pairs.

    See

    • $mech->xpath( $query, %options)
    and alternatively
    • $mech->selector( $css_selector, %options )
    Both methods support querying child elements of a given ID.

    Query syntax is not a Perl question, but there are plenty of good tutorials online.

    Look out for browser features/add-ons that let you play around with queries.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Rolf,

      I am confused now too. Are you saying WWW::Mechanize supports CSS selector and XPath? Or that WWW::Mechanize::Firefox does? If the latter, I also read it was very difficult to build.

      > Query syntax is not a Perl question, but there are plenty of good tutorials online.

      I agree. However, determining how best to query HTML source via Perl is.

      The option of mirroring the DOM into a Perl/XML data structure and using the query API sounds quite good. I'll give that a go and see how it works. Anything is better than parsing table tags with TokeParser.
        WWW::Mechanize::Firefox does, and I took it as one example out of many because I have worked with it in the past.

        But it really depends on whether you need JS or not, so I don't want to go into details.

        Querying HTML was your question, and something like XPath or CSS selectors is mostly the solution.

        Regarding the Perl backend: it depends.

        Sorry, there is no generic answer because of TIMTOWTDI.

        Cheers Rolf

        PS:

        > > Look out for browser features/addons allowing to play around with queries.

        I have had very good experiences using FirePath to find the right CSS selectors / XPath expressions inside Firefox.

        You can copy an auto-generated explicit expression by right-clicking on a DOM element and then change it interactively.

        Then simply copy the final path and/or selector into your Perl code.

        HTH! :)

        Cheers Rolf

Re: Screen scraping complex tables and divs
by marto (Cardinal) on Oct 13, 2017 at 20:56 UTC

    Mojolicious provides Mojo::DOM, which makes life much simpler if you can use CSS selectors. In this example I use Mojolicious to parse a page and download associated links. If you run into problems, post what you've tried and an example of the HTML you have to work with.

      Thank you Marto,

      I will check out Mojolicious.

      Update: Mojo::DOM is perfect! It combines both CSS selectors and XML DOM parsing and has eliminated about 60% of my existing code.
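
For readers following along, here is a rough sketch of how the two tasks from the original question might look with Mojo::DOM and CSS selectors. It is not the poster's actual code; the table id "specs" and the sample rows are invented for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up sample data; the table id "specs" is hypothetical.
my @data = (
    [ Name   => 'Widget' ], [ Color  => 'Blue' ], [ Weight   => '2 kg' ],
    [ Width  => '10 cm' ],  [ Height => '5 cm' ], [ Material => 'Steel' ],
    [ Price  => '9.99' ],   [ Stock  => '42' ],   [ Origin   => 'Norway' ],
);
my $html = '<table id="specs">'
    . join( '', map { "<tr><td>$_->[0]</td><td>$_->[1]</td></tr>" } @data )
    . '</table>';

my $dom = Mojo::DOM->new($html);

# 1) Rows 6 and 9 of the table with the given id.
my @picked;
for my $n ( 6, 9 ) {
    my $row = $dom->at("table#specs tr:nth-of-type($n)") or next;
    my ( $name, $value ) = $row->find('td')->map('text')->each;
    push @picked, [ $name, $value ];
}

# 2) Every row as a name/value pair.
my %pairs;
for my $row ( $dom->find('table#specs tr')->each ) {
    my ( $name, $value ) = $row->find('td')->map('text')->each;
    next unless defined $value;    # skip header or malformed rows
    $pairs{$name} = $value;
}

printf "%s = %s\n", @$_ for @picked;
```

The same selectors work on a DOM fetched with Mojo::UserAgent, since its response body can be parsed into a Mojo::DOM object.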

Re: Screen scraping complex tables and divs
by Anonymous Monk on Oct 13, 2017 at 22:33 UTC
