PerlMonks
Re^3: extracting data from HTML

by bitingduck (Friar)
on Jun 04, 2012 at 04:00 UTC (#974215)


in reply to Re^2: extracting data from HTML
in thread extracting data from HTML

Don't look for one general module that will solve all your HTML-to-data problems. Look at the page or pages you want to extract data from, and figure out which modules best fit those particular cases. In my experience (which is less than most others here), it's not worth the trouble to find something that will go straight from HTML to appropriately structured XML. Whoever generated the page had some database model and spewed it into some template that they invented, probably with no thought whatsoever to making it easy to turn back into data. Or they didn't even do things in a consistent way, making the problem of inverting it even worse.

If you have access to a lot of O'Reilly stuff, don't look at the general books. Look at a practical one--I started HTML scraping with recipes out of Spidering Hacks and still refer back to it occasionally.

Here's a recent example where I had a bunch of pages on a website and wanted to copy the book metadata from all of them into XML so I could generate a catalog from the XML. The catch is that the pages were all hand-coded. They did a pretty good job of using CSS to identify the relevant parts, but there were still inconsistencies, and a few of the older pages were so out of whack that they didn't get processed at all.

If you look at the code, it's pretty specific to the pages I was scraping, so it's ugly in all sorts of ways. It could also be made somewhat simpler if I needed to do this a bunch more times--it's repetitive in pulling out the labeled items, so those could become a loop through an array of names, maybe with flags in the array for special treatment. There are also some extraneous modules loaded--the original pages were inconsistent about odd characters and entities, and that was one of the bigger headaches. Note how I find the pieces I want: I know how they're named, so I just do a look_down() to find them and then process the contents from there. Note also that I use XML::Writer to generate the XML rather than trying to do it myself.
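The core pattern--look down the tree for a named element, write its text out--is small, though. Here's a minimal, self-contained sketch of it; the sample HTML and the title/author class names are made up for illustration, not taken from the real pages:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use XML::Writer;

# Hypothetical sample page: the class names here stand in for
# whatever the real pages actually use.
my $html = <<'HTML';
<div class="title">Some Book</div>
<div class="author">by Jane Doe</div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my $writer = XML::Writer->new();   # writes to STDOUT by default
$writer->startTag('book');
for my $field (qw(title author)) {
    # look_down() walks the tree and returns the first element
    # matching all the attribute criteria
    my $node = $tree->look_down( _tag => 'div', class => $field );
    $writer->dataElement( $field, $node->as_text ) if $node;
}
$writer->endTag('book');
$writer->end();
$tree->delete;
```

Because dataElement() escapes its text, the output stays well-formed even when the scraped content contains & or <, which is exactly why I let XML::Writer do the work.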

#!/usr/bin/perl
# bosonbooks_scrape.pl
#
# scraper to scrape a local dump of the boson books website
# and extract out the information for each of the books
# gets the URL
# generates a QR code of the URL
# gets the title: <div class="title">
# gets the author: <div class="author">
# gets the price: (if there) <span class="price">
# gets the ISBN: <span class="isbn">
# gets the book cover location: <img class="bookcover" align="right" src="FILENAME" />
# gets the description: <div class="bookdescription">
# gets the about the author: <div id="aboutauthor">
#

use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;
use HTML::Entities;
use Data::Dumper;
use XML::Writer;
use Encode;
use GD::Barcode::QRcode;

binmode STDOUT, ":utf8";

our $max_desc = 0;
our $max_auth = 0;
my %title_list = ();
my $total_books = 0;
my $starturl = 'http://localhost/~homedirectory/BosonBooks/www.bosonbooks.com/boson/fiction/fiction.html';
my $baseurl  = 'http://localhost/~homedirectory/BosonBooks/www.bosonbooks.com/boson/fiction';
my $QRbase   = 'http://www.bosonbooks.com/';

my $DTD = <<END;
<!ELEMENT booklist (book)*>
<!ELEMENT book (url,title,QRcode?,author+,price?,isbn?,coverfile?,description?,aboutauth?)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT QRcode (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT coverfile (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT aboutauth (#PCDATA)>
END

#print $DTD;                                  # write the DTD at the top

my $writer = new XML::Writer( OUTPUT => 'STDOUT', ENCODING => 'utf-8' );
$writer->xmlDecl('UTF-8');
$writer->doctype('booklist');
$writer->startTag('booklist'); print "\n";

my $mech = WWW::Mechanize->new();
$mech->get($starturl);
die $mech->response->status_line unless $mech->success;
#print $mech->title, "\n";
my $html  = $mech->content;
my @links = $mech->find_all_links();          # get all the links in the page
my @urls  = map { $_->[0] } @links;

foreach my $url (@urls) {                     # walk through them
    my $link = $baseurl . '/' . $url;
    #print $link . "\n";
    if ($link =~ /^$baseurl\/(.*?)\/\1\.html$/) {
        my $page = WWW::Mechanize->new();
        $page->get($link);
        if ($page->success) {
            $link =~ /(^$baseurl\/(.*?)\/)\2\.html$/;
            my $imgbase = $1;
            #print STDERR $imgbase . "\n";
            my $pagehtml = $page->content;
            if ($pagehtml =~ /isbn/) {        # if the page contains "isbn" it generates an xml record
                unless (defined($title_list{$link})) {
                    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] (strip invalid xml chars)
                    #$pagehtml = decode_utf8($pagehtml);
                    $pagehtml =~ s%<br/>%\n\n%go;        # change <br/> to double line breaks
                    $pagehtml =~ s%(</p>)%\</p\>\n\n%go;
                    $pagehtml =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
                    scrape_page($pagehtml, $link, $imgbase, $writer);
                    $title_list{$link} = 1;
                    $total_books++;
                    #print STDERR "WRITING";
                }
            }
        }
    }
}

$writer->endTag('booklist');
$writer->end();
print STDERR "max description " . $max_desc . "\n";
print STDERR "max authlength " . $max_auth . "\n";
print STDERR "total written " . $total_books . "\n";

sub scrape_page {
    # makes a single pass with treebuilder to pull out the metadata
    # and put it into an xml record
    # also generates a QR code based on the url and puts the filename in the XML record.
    # record is
    # <book>
    #   <title>
    #   <author></author>
    #   <description></description>
    #   <coverart href="filename"/>
    #   <isbn></isbn>
    #   <url></url>
    #   <QRCode href="QRfilename"/>
    # <book>
    my $html    = $_[0];    # the passed html
    my $link    = $_[1];    # the passed link
    my $baseurl = $_[2];    # the base url for getting the cover image
    my $writer  = $_[3];    # the xml writer object
    #my $writer = new XML::Writer( OUTPUT => 'STDOUT', ENCODING => 'utf-8');
    #print $html . "\n";
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $writer->startTag('book'); print "\n";
    $link =~ m%http://localhost/~homedirectory/BosonBooks/(.*)%;    # generate the correct URL
    my $realurl = 'http://' . $1;
    $writer->dataElement("url", $realurl); print "\n";

    my $t1 = $tree->look_down( _tag => 'div', class => 'title' );
    if ($t1) {
        my $title = $t1->as_text;
        $writer->dataElement("title", $title); print "\n";
        my $filename = 'qrdata/' . $title . "QR.png";    # put the qrcodes in "qrdata/filename"
        $filename =~ s/[ ']//g;                          # generate the filename for the QR code
        qrgen($realurl, $filename);                      # generate the QR code
        $writer->emptyTag("QRCode", 'href' => "file://" . $filename); print "\n";    # store the QR code fname in a tag
    }
    else {
        warn "no title! in $baseurl";
    }

    my $t2 = $tree->look_down( _tag => 'div', class => 'author' );    # get author
    if ($t2) {
        my $author = $t2->as_text;
        $author =~ m/by (.*)/;
        if ($1) { $writer->dataElement("author", $1);      print "\n"; }
        else    { $writer->dataElement("author", $author); print "\n"; }
    }
    else {
        warn "no author! in $baseurl";
    }

    my $t3 = $tree->look_down( _tag => 'span', class => 'price' );    # get price
    if ($t3) {
        my $price = $t3->as_text;
        $writer->dataElement("price", $price); print "\n";
    }

    my $t4 = $tree->look_down( _tag => 'span', class => 'isbn' );     # get isbn
    if ($t4) {
        my $isbn = $t4->as_text;
        $writer->dataElement("isbn", $isbn); print "\n";
    }

    my $imageobj = WWW::Mechanize->new();    # new mech object to get the image
    my $t5 = $tree->look_down( _tag => 'img', class => 'bookcover' );    # get filename for cover art
    if ($t5) {
        my $coverfile = $t5->attr('src');
        $imageobj->get($baseurl . $coverfile);
        $imageobj->save_content('coverart/' . $coverfile);
        $writer->emptyTag("coverfile", 'href' => "file://coverart/" . $coverfile); print "\n";
    }

    # get the book description
    my $t6 = $tree->look_down( _tag => 'div', class => 'bookdescription' );
    if ($t6) {
        my $description = $t6->as_text;
        #$description =~ s/ \& / \&amp; /;
        if (length($description) > $max_desc) { $max_desc = length($description); }
        $writer->dataElement("description", $description); print "\n";
    }

    # get the about the author.
    # might want to try to retain formatting (i.e. italic and bold tags)
    # need to remove authors website
    my $t7 = $tree->look_down( _tag => 'div', id => 'aboutauthor' );
    if ($t7) {
        my $aboutauth = $t7->as_text;
        #$aboutauth =~ s/ \& / \&amp;/;
        $aboutauth =~ /About the Author(.*)/;
        if ($1) {
            #print $1, "\n";
            if (length($1) > $max_auth) { $max_auth = length($1); }
            my $aboutauth = $1;
            $writer->dataElement("aboutauth", $aboutauth); print "\n";
        }
        else {
            if (length($aboutauth) > $max_auth) { $max_auth = length($aboutauth); }
            $writer->dataElement("aboutauth", $aboutauth); print "\n";
        }
    }

    $writer->endTag('book');
    print "\n"; print "\n";
    $tree->delete;
}

sub qrgen {
    # generates a QR code from a url and filename
    my $url      = $_[0];
    my $filename = $_[1];
    open FILE, ">", $filename;
    print FILE GD::Barcode::QRcode->new( $url,
        { Ecc => 'L', Version => 4, ModuleSize => 4 } )->plot->png;
    close FILE;
}

