Re: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR
by marto (Cardinal) on Nov 25, 2022 at 13:08 UTC
|
"Speculating: it would seem that relative to node xpath has some bug and may be scanning the whole table each time." Did you test this hypothesis? Do you have an example URL? If you don't need JavaScript you could benchmark alternatives such as Mojo::UserAgent.
| [reply] |
|
> If you don't need JavaScript
Even if ...
supposing communication overhead or an implementation loop are causing a bottleneck ...
... he could also try to fetch the whole table as html once using WWW::Mechanize::Chrome and do the parsing with Mojo::UserAgent
| [reply] |
|
" ... he could also try to fetch the whole table as html once using WWW::Mechanize::Chrome and do the parsing with Mojo::UserAgent"
I've used this workaround in the past for sites that need a special sign-in, or that bounce back clients they don't detect as a 'real' browser, purely so I don't have to make a lot of code changes :) As the location of the bottleneck is not yet understood, this may not resolve the performance issue.
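A minimal sketch of the "fetch once, parse in Perl" idea, using Mojo::DOM (the HTML parser that ships with Mojolicious). The sample table and the variable names are made up for illustration; in a real script the HTML string would come from the browser, for example from the table node's outerHTML.

```perl
use strict;
use warnings;
use feature qw(say);
use Mojo::DOM;

# Stand-in for HTML fetched once from the browser (made-up sample data).
my $html = <<'HTML';
<table>
  <thead><tr><th>Name</th><th>Value</th></tr></thead>
  <tbody>
    <tr><td>alpha</td><td>1</td></tr>
    <tr><td>beta</td><td></td></tr>
  </tbody>
</table>
HTML

my $dom = Mojo::DOM->new($html);

# Column names come from the header cells.
my @colnames = $dom->find('thead th')->map('all_text')->each;

# Build one hash per body row, keyed by column name.
my @table_data;
for my $tr ($dom->find('tbody tr')->each) {
    my @cells = $tr->find('td')->map('all_text')->each;
    my %row;
    # Missing cells become empty strings.
    @row{@colnames} = map { $_ // '' } @cells[0 .. $#colnames];
    push @table_data, \%row;
}

for my $row (@table_data) {
    say join ', ', map { "$_=$row->{$_}" } @colnames;
}
```

All the parsing happens in a single Perl process, so there are no per-cell round trips to the browser.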
| [reply] |
|
|
I will try this, thank you!
I noticed that fetching the TRs of the table seems pretty fast with WWW::Mechanize::Chrome and xpath. What seems absurd is that fetching the TDs relative to a single TR takes so long, and that the time is proportional to the total number of TRs. That doesn't make any sense unless there's a bug somewhere in WWW::Mechanize::Chrome's xpath implementation.
| [reply] |
|
Here is some simple timing code to replicate the issue.
I couldn't find any large tables on public websites, but I found one on Wikipedia with 162 rows that illustrates the problem. If you find one with 400+ rows, you'll see it takes 3-4 seconds to obtain the TDs of a single TR.
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
no warnings qw(experimental);
use Data::Dumper;
use Log::Log4perl qw(:easy);
use WWW::Mechanize::Chrome;
use Time::HiRes qw( gettimeofday tv_interval );

my $debug = 0;
my ($t0, $elapsed);

Log::Log4perl->easy_init($ERROR);
my $mech = WWW::Mechanize::Chrome->new(
    headless  => 0,
    autodie   => 0,
    autoclose => 0
);

$mech->get('https://meta.wikimedia.org/wiki/Wikipedia_article_depth');
sleep(2);

my @nodes = $mech->xpath('//table');

# time the fetch of all TRs of one table
$t0 = [gettimeofday];
my @rows = $mech->xpath('.//tr', node => $nodes[3]);
say 'xpath for TR took:'.tv_interval ( $t0 );

my @cell_keys  = ( );
my @table_data = ( );

# scalar(@rows) is the row count; $#rows would be one less
say 'Timing for ' . scalar(@rows) . ' rows.';
foreach my $row_index (0 .. $#rows) {
    my %row_data = ( );

    # column names
    if($row_index == 0){
        $t0 = [gettimeofday];
        my @cells = $mech->xpath('.//th', node => $rows[$row_index]);
        say 'xpath for TH took:'.tv_interval ( $t0 );
        foreach (0 .. $#cells) {
            say "HEADER CELL: $_, VALUE:".$cells[$_]->get_text() if $debug;
            push @cell_keys, $cells[$_]->get_text();
        }
        if($debug) {
            say 'Column Names:';
            say $_ foreach @cell_keys;
        }
    }
    # data row
    else{
        # this is the slow call: TDs relative to a single TR
        $t0 = [gettimeofday];
        my @cells = $mech->xpath('.//td', node => $rows[$row_index]);
        say 'xpath for TD took:'.tv_interval ( $t0 );
        say "DATA ROW: $row_index" if $debug;
        foreach (0 .. $#cells) {
            say "DATA CELL: $_, VALUE:" . $cells[$_]->get_text() if $debug;
            $row_data{ $cell_keys[$_] } = $cells[$_]->get_text();
        }
        push @table_data, \%row_data;
        if($debug) {
            say 'Column Data:';
            say $row_data{$_} foreach @cell_keys;
        }
    }
}
say Dumper(\@table_data) if $debug;
Here are the results:
| [reply] [d/l] [select] |
|
No, I haven't done more than simple measurements to pinpoint the delays in my own code. But because the delays in fetching the TDs in the context of a specific TR node are proportional (or maybe exponential) to the number of TRs, it seems obvious that there's either a bug or some intrinsic limitation in the way xpath is implemented (e.g. re-parsing the whole page every time).
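One way to check whether the per-row cost really grows with the number of rows is to time the same call at several table sizes. Here is a minimal sketch using only the core Time::HiRes module. The names time_call and cells_of_row are made up, and cells_of_row is a deliberately naive stand-in that scans every row; it is not a claim about what WWW::Mechanize::Chrome actually does.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run $code once and return the elapsed wall-clock time in seconds.
sub time_call {
    my ($code) = @_;
    my $t0 = [gettimeofday];
    $code->();
    return tv_interval($t0);
}

# Hypothetical stand-in for "fetch the TDs of one TR":
# it scans the whole table to find a single row.
sub cells_of_row {
    my ($table, $wanted) = @_;
    my @hits = grep { $_->{id} == $wanted } @$table;
    return @{ $hits[0]{cells} };
}

my %avg;    # table size => average seconds per call
for my $n (100, 200, 400) {
    my @table = map { +{ id => $_, cells => [ 1 .. 5 ] } } 1 .. $n;
    my $total = 0;
    for my $row (1 .. $n) {
        $total += time_call(sub { my @cells = cells_of_row(\@table, $row) });
    }
    $avg{$n} = $total / $n;
    printf "%4d rows: %.6f s per call on average\n", $n, $avg{$n};
}
```

In the real script the closure would wrap $mech->xpath('.//td', node => $rows[$i]) instead. If the average per call stays flat as the table grows, the "scans the whole table each time" theory is out; if it grows with the row count, that supports it.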
| [reply] |
Re: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR
by LanX (Saint) on Nov 26, 2022 at 13:58 UTC
|
I had a quick glimpse into the docs of ->xpath,
found these passages, and emphasized two parts.
Two insights into potential bottlenecks, then:
- the module has to identify the parent itself, instead of assembling a single xpath. Putting everything into one xpath yourself might be far more efficient (and probably your identifier is not as unambiguous as you thought)
- you might get expensive wrapper objects for each result, unless you specify a type of text
Of course this is all speculation as long as you can't provide an SSCCE ... :)
| [reply] [d/l] [select] |
|
That's one approach.
But as I said, I think putting the logic into a more elaborate xpath, so that the heavy lifting is done inside the browser, would fix your performance issue without needing HTML::Tree.
IMHO your code forces the Perl part of W:M:C to do a lot of its own filtering and to create thousands of proxy objects. These Perl objects will also tunnel requests back and forth to the browser for most method calls.
Hence many potential bottlenecks.
Update:
As an illustration, this xpath, run in Chrome's dev console for https://meta.wikimedia.org/wiki/Wikipedia_article_depth, returns 1016 strings at once:
//table[3]//tr//td//text()
Disclaimer: I don't have W:M:C installed and my xpath-fu is rusty, so I'm pretty sure there are even better ways to do it.
| [reply] [d/l] |
|
Re: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR
by ait (Hermit) on Nov 25, 2022 at 21:18 UTC
|
Thank you all, as always, for your valuable input and ideas! Ye monks are a smart bunch.
As much as I'd love to help debug W::M::Chrome, I have a short deadline, so I decided to use LanX's idea: use xpath to get the table node and its HTML content, and then parse that in Perl land. I decided to use HTML::Tree, which is simple and tried and tested.
For anyone having a similar issue, here is the code I wrote for this (assuming it has thead, th, and tbody, YMMV):
use feature qw(signatures);
no warnings qw(experimental::signatures);
use HTML::TreeBuilder;

my @nodes = $mech->xpath('//table');
my $data  = parse_table($nodes[0]);    # parse_table returns an array reference

sub parse_table ($table_node){
    # fetch the table's HTML once, then parse it entirely in Perl
    my $root = HTML::TreeBuilder->new_from_content($table_node->get_attribute('outerHTML'));
    my @tparts = $root->find_by_tag_name('table')->content_list;
    my @colnames = ( );
    my @data;
    foreach my $tpart (@tparts){
        if($tpart->tag eq 'thead'){
            my @rows = $tpart->content_list;
            foreach my $row (@rows) {
                if($row->tag eq 'tr'){
                    my @cells = $row->content_list;
                    # assumes no TH is empty (see below safeguard for data cells)
                    foreach (@cells) {
                        push @colnames, $_->content->[0];
                    }
                }
            }
        }
        elsif($tpart->tag eq 'tbody'){
            my @rows = $tpart->content_list;
            foreach my $row (@rows) {
                if($row->tag eq 'tr'){
                    my %row_data = ();
                    my @cells = $row->content_list;
                    foreach my $cell (0..$#cells) {
                        # HTML::Element's content method weirdness
                        if($cells[$cell]->content && scalar(@{$cells[$cell]->content})){
                            $row_data{ $colnames[$cell] } = $cells[$cell]->content->[0];
                        }
                        else{
                            $row_data{ $colnames[$cell] } = '';
                        }
                    }
                    # only store rows that really were TRs
                    push @data, \%row_data;
                }
            }
        }
    }
    return \@data;
}
Thanks again y'all !
--
Alex
| [reply] [d/l] |