PerlMonks  

Re: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR

by LanX (Saint)
on Nov 26, 2022 at 13:58 UTC ( [id://11148398]=note )


in reply to WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR

I had a quick look at the docs for ->xpath and found these passages (I emphasized two parts):

    $mech->xpath( $query, %options )

    • my $link = $mech->xpath('//a[id="clickme"]', one => 1);
      # croaks if there is no link or more than one link found
    • my @para = $mech->xpath('//p');
      # Collects all paragraphs
    • my @para_text = $mech->xpath('//p/text()', type => $mech->xpathResult('STRING_TYPE'));
      # Collects all paragraphs as text
    ...
    • node - node relative to which the query is to be executed. Note that you will have to use a relative XPath expression as well. Use

      .//foo

      instead of

      //foo

      Querying relative to a node only works for restricting to children of the node, not for anything else. This is because we need to do the ancestor filtering ourselves instead of having a Chrome API for it.

So, two insights into potential bottlenecks:

  • the module has to do the ancestor filtering itself instead of handing Chrome a single xpath. Putting everything into one path yourself might be far more efficient (and your identifier is probably not as unambiguous as you thought)
  • you might get an expensive wrapper object for each result, unless you ask for a string result type, as in the STRING_TYPE example above
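
To make the two points above concrete, here is a minimal sketch that builds one absolute xpath and asks for strings. It uses only ->xpath, the type option and ->xpathResult('STRING_TYPE') as quoted from the docs; the helper names cells_text_xpath and row_cell_texts and the row expression in the usage comment are made up for illustration, and none of this has been run against the original page.

```perl
use strict;
use warnings;

# Join a row-selecting XPath and the cell step into ONE absolute expression,
# so the browser does the filtering in a single query.
sub cells_text_xpath {
    my ($row_xpath) = @_;
    return "$row_xpath/td//text()";
}

# $mech is assumed to be a WWW::Mechanize::Chrome object. It is passed in,
# so this snippet does not load the module itself.
sub row_cell_texts {
    my ($mech, $row_xpath) = @_;
    return $mech->xpath(
        cells_text_xpath($row_xpath),
        # ask for strings instead of node objects, as in the quoted docs
        type => $mech->xpathResult('STRING_TYPE'),
    );
}

# Hypothetical usage; the row expression is invented for this example:
# my @cells = row_cell_texts($mech, '//table[@id="results"]//tr[3]');
```

Whether this actually removes the slowdown can only be confirmed by timing it against the real page.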

Of course this is all speculation as long as you can't provide an SSCCE ... :)

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)

Replies are listed 'Best First'.
Re^2: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR
by ait (Hermit) on Nov 27, 2022 at 10:27 UTC

    After adding HTML::Tree and parsing some stuff in pure Perl land I think that IS actually the right approach:

    1. Use W::M::Chrome for JS rendering, JS interactions and high-level xpath
    2. Slurp HTML chunks and process in the Perl side as much as possible
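
A rough sketch of step 2, assuming the HTML has already been pulled out of the browser as a string (for example via $mech->content, which is an assumption here and not shown in the thread). The sub name table_rows_from_html is hypothetical; HTML::TreeBuilder comes from the HTML::Tree distribution mentioned above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Turn a chunk of table HTML into a list of rows, each an array ref of
# cell texts. Everything happens on the Perl side, with no call to the
# browser per row or per cell.
sub table_rows_from_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @rows;
    for my $tr ($tree->look_down(_tag => 'tr')) {
        # look_down searches descendants, so TDs of nested tables are included too
        my @cells = map { $_->as_trimmed_text } $tr->look_down(_tag => 'td');
        push @rows, \@cells if @cells;    # skip rows without any TD (e.g. header rows)
    }
    $tree->delete;    # release the parse tree explicitly
    return @rows;
}

# Hypothetical usage:
# my @rows = table_rows_from_html($html_chunk);
# print join("\t", @$_), "\n" for @rows;
```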

      That's one approach.

      But as I said, I think putting the logic into a more elaborate xpath, so that the heavy lifting happens inside the browser, would fix your performance issue without needing HTML::Tree.

      IMHO your code forces the Perl part of W::M::C to do a lot of its own filtering and to create thousands of proxy objects. For most method calls, these Perl objects will also tunnel requests back and forth to the browser.

      Hence many potential bottlenecks.

      update

      As an illustration, this xpath, run in Chrome's dev console on https://meta.wikimedia.org/wiki/Wikipedia_article_depth , returns 1016 strings at once:

      //table[3]//tr//td//text()

      Disclaimer: I don't have W::M::C installed and my xpath-fu is rusty, so I'm pretty sure there are even better ways to do it.

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)

        True.
