Re: Extract Portion of HTML

by Rhandom (Curate)
on Sep 20, 2011 at 14:44 UTC ( #926959=note )


in reply to Extract Portion of HTML

I was interested in the relative performance, so I benchmarked all of the modules mentioned so far in this thread, plus App::scrape and Web::Scraper.

There is also a module called SGMLExtract, which I wrote but have generally only used internally at my company. SGMLExtract is a regex-based extractor, which means it requires regularly formed HTML, though not necessarily well-formed HTML. The other solutions are great for parsing documents that could be poorly formed, but I haven't come across a situation where I had a legitimate reason to scrape information from poorly formed HTML. (I do use HTML::TreeBuilder for a module I'll be releasing to CPAN sometime in the next year; TreeBuilder is awesome at "being a browser".) I haven't released SGMLExtract to CPAN because I wasn't sure there was enough external demand (and we have at least 7 modules here filling the niche), but I could release it if there is enough interest. It is a whopping 90 lines of code with 0 dependencies.
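To illustrate what "regex based" can mean here, a minimal sketch follows. It is not SGMLExtract's code or API: extract_by_class is a hypothetical helper, and it assumes the class attribute holds exactly one quoted class name, that no attribute value contains a '>', and that open and close tags of the target element are balanced.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT SGMLExtract's API): return the inner content of the
# first <$tag> whose class attribute is exactly $class, or nothing if not found.
sub extract_by_class {
    my ($html_ref, $tag, $class) = @_;
    pos($$html_ref) = undef;    # start scanning from the beginning

    # Find the opening tag that carries the requested class.
    return
        unless $$html_ref =~ m{<\Q$tag\E\b[^>]*\bclass\s*=\s*(["'])\Q$class\E\1[^>]*>}gi;
    my $start = pos($$html_ref);    # offset just past the opening tag
    my $depth = 1;

    # Walk the following open/close tags of the same name, tracking nesting.
    while ($$html_ref =~ m{<(/?)\Q$tag\E\b[^>]*>}gi) {
        $depth += $1 ? -1 : 1;
        next if $depth;
        my $end = $-[0];            # offset where the matching close tag begins
        pos($$html_ref) = undef;
        return substr($$html_ref, $start, $end - $start);
    }
    pos($$html_ref) = undef;
    return;                         # unbalanced markup: give up
}

my $html = <<'HTML';
<div>Stuff I do not want</div>
<div class="myBody">
--all the stuff <b>I</b> want, which might include <div>div tags</div>, too--
</div>
HTML

print extract_by_class(\$html, 'div', 'myBody');
```

The depth counter is what lets the wanted div contain other div tags. Markup that is not regularly formed (an unclosed div, for instance) makes the helper return nothing, which is the trade-off against a real parser.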

Of all of the outputs, the SGMLExtract one is the only one that does what the OP requested, which is to pull the content of the div tag without the enclosing div. The Mojo::DOM one also dropped the legacy bold tag, along with its text.
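If one of the as_HTML-style results is otherwise acceptable, the enclosing tag can be peeled off afterwards. A rough sketch, assuming the fragment is a single element and that no attribute value contains a '>' (strip_outer_tag is a hypothetical helper, not part of any module mentioned in this thread):

```perl
use strict;
use warnings;

# Hypothetical helper: drop the outermost open/close tag pair of a fragment.
# Returns the fragment unchanged if it is not wrapped in a single element.
sub strip_outer_tag {
    my ($fragment) = @_;

    # $1 = tag name, $2 = everything up to the LAST matching close tag
    # (the greedy .* keeps nested elements of the same name inside $2).
    if ($fragment =~ m{\A\s*<(\w+)\b[^>]*>(.*)</\1\s*>\s*\z}s) {
        return $2;
    }
    return $fragment;
}

my $outer = '<div class="myBody"> --all the stuff <b>I</b> want-- </div>';
print strip_outer_tag($outer), "\n";
```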

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw(cmpthese timethese);
use App::scrape qw(scrape);
use HTML::Query qw(Query);
use HTML::Selector::XPath qw(selector_to_xpath);
use HTML::TreeBuilder qw();
use HTML::TreeBuilder::XPath;
use Mojo::DOM;
use SGMLExtract qw(sgml_find sgml_extract);
use Web::Query qw(wq);
use Web::Scraper qw(process scraper);
use Debug;

my $html = q{<html>
--stuff--
<head>
--more stuff--
</head>
<body>
--still more stuff--
<div>Stuff I do not want</div>
<div class="myBody">
--all the stuff <b>I</b> want, which might include div tags, too--
</div>
--yet more stuff--
</body>
</html>
};

# appse and sgmle cheat because they go off relative position of the div - not the class name
sub m_appse { (scrape($html, ['div'], {class => 'myBody'}))[1]->[0] }
sub m_hselx { (HTML::TreeBuilder::XPath->new_from_content($html)->findnodes(selector_to_xpath('div.myBody')))[0]->as_HTML }
sub m_htmlq { Query(text => $html)->query('div.myBody')->as_HTML }
sub m_mojod { Mojo::DOM->new->parse($html)->at('.myBody')->text }
sub m_sgmle { sgml_extract(\$html, 'div', {all => 1, content => 1})->[1]->{'content'} }
sub m_sgmlf { sgml_find(\$html, 'div', {class => 'myBody'})->[0]->{'content'} }
sub m_treeb { HTML::TreeBuilder->new_from_content($html)->look_down(_tag => 'div', class => 'myBody')->as_HTML(q{}) }
sub m_webqy { wq($html)->find('div.myBody')->html }
sub m_websc { (scraper { process "div.myBody", key => 'TEXT' }->scrape($html))[0]->{'key'} }

debug m_appse(), m_hselx(), m_htmlq(), m_mojod(), m_treeb(), m_sgmle(), m_sgmlf(), m_webqy(), m_websc();

cmpthese timethese -1, {
    appse => \&m_appse,
    hselx => \&m_hselx,
    htmlq => \&m_htmlq,
    mojod => \&m_mojod,
    sgmle => \&m_sgmle,
    sgmlf => \&m_sgmlf,
    treeb => \&m_treeb,
    webqy => \&m_webqy,
    websc => \&m_websc,
};

__END__

debug: paul/bench.pl line 45
m_appse() = "--all the stuff I want, which might include div tags, too--";
m_hselx() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_htmlq() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_mojod() = "\n--all the stuff want, which might include div tags, too--\n";
m_treeb() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_sgmle() = "\n--all the stuff <b>I</b> want, which might include div tags, too--\n";
m_sgmlf() = "\n--all the stuff <b>I</b> want, which might include div tags, too--\n";
m_webqy() = "<div class=\"myBody\"> --all the stuff <b>I</b> want, which might include div tags, too-- </div>";
m_websc() = " --all the stuff I want, which might include div tags, too-- ";

         Rate webqy hselx websc appse htmlq treeb mojod sgmlf sgmle
webqy   697/s    --   -4%   -5%  -37%  -47%  -54%  -72%  -97%  -98%
hselx   724/s    4%    --   -1%  -35%  -44%  -52%  -71%  -97%  -97%
websc   731/s    5%    1%    --  -34%  -44%  -51%  -70%  -97%  -97%
appse  1110/s   59%   53%   52%    --  -15%  -26%  -55%  -95%  -96%
htmlq  1305/s   87%   80%   78%   18%    --  -13%  -47%  -94%  -95%
treeb  1506/s  116%  108%  106%   36%   15%    --  -39%  -93%  -95%
mojod  2465/s  254%  240%  237%  122%   89%   64%    --  -89%  -91%
sgmlf 22330/s 3103% 2983% 2953% 1912% 1611% 1383%  806%    --  -22%
sgmle 28709/s 4018% 3864% 3825% 2486% 2100% 1807% 1065%   29%    --
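For anyone who has not read a Benchmark chart before: each cell compares the row's rate with the column's rate, so the sgmle/sgmlf cell is 28709 / 22330 - 1, or about 29%. The sketch below shows the same timethese/cmpthese pattern using only core Perl. The subs by_regex and by_index and the helper pct_diff are hypothetical stand-ins, not part of the benchmark above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese timethese);

my $text = ('x' x 1000) . '<b>needle</b>';

# Two hypothetical stand-ins for the extractors being compared.
sub by_regex { my ($m) = $text =~ m{<b>(.*?)</b>}; return $m }
sub by_index {
    my $from = index($text, '<b>') + 3;
    return substr($text, $from, index($text, '</b>', $from) - $from);
}

# Percentage shown in a chart cell: how much faster (or slower, if negative)
# the row's rate is than the column's rate.
sub pct_diff {
    my ($row_rate, $col_rate) = @_;
    return sprintf '%.0f%%', 100 * ($row_rate / $col_rate - 1);
}

# A negative count means "run each sub for at least that many CPU seconds".
# The 'none' style suppresses timethese's own per-sub report.
my $results = timethese(-1, { regex => \&by_regex, index => \&by_index }, 'none');

# cmpthese accepts the results hash and prints the rate/percentage chart.
cmpthese($results);

print pct_diff(28709, 22330), "\n";    # the sgmle-vs-sgmlf cell from the chart
```

The absolute rates depend on the machine, so only the relative ordering and the percentages are worth comparing between runs.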


my @a=qw(random brilliant braindead); print $a[rand(@a)];

