HTML::TreeBuilder:: identifying xpath-expression

Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Hello and good morning, dear Monks,

First of all: I have learned a lot here! Yesterday Morgon gave me some hints on working with XPather. Now I am trying to take some first steps, and I want to apply everything I have learned.

I am currently working on a parser script. I have to parse all the detail pages of this site: <url>http://www.educa.ch/dyn/79362.asp?action=search#0</url> There are several ways to do it. I have to strip away a lot of markup and keep only the text data from each page. Here is one detail page, which is very simple: <url>http://www.educa.ch/dyn/79376.asp?id=1187</url> The output I want:

Altes Schulhaus Ossingen
Guntibachstrasse 10
8475  Ossingen
sekretariat.psossingen@bluewin.ch
Tel:052 317 15 45
Fax:052 317 04 42
[download]

As we can see, I need a small Perl script to get these six lines of text out of the HTML page. If I can parse one page, I can do it for all 5000 to 6000 available pages, and I have to parse all of them. A true Perl job, and I am sure Perl can do it with ease! So how do we do that? Personally I like HTML::TreeBuilder::XPath, which has to be installed from CPAN. Here is how we would extract the name from one of the files with it:

Note: I am not sure which XPath expressions I have to use. My attempts are further below.

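use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

# use the real file name here
open(my $fh, "<", "file.html") or die "Cannot open file.html: $!";

$tree->parse_file($fh);
close $fh;

# second cell of the first table row
my ($name) = $tree->findnodes(q{/html/body/table/tr[1]/td[2]});

# findnodes returns an empty list when nothing matches
print $name->as_text, "\n" if $name;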
[download]

As we can see, we simply use an XPath expression to identify the node we want.
So how do we determine that expression?

I tried a Firefox plugin called XPather, which lets us click on an HTML element and extract the corresponding XPath.
So we load the file we want to parse in Firefox, click on the element we want, get the XPath and use it in the Perl script.
I am not sure that I used XPather correctly, though. I tried to find the expressions for this very simple detail page: http://www.educa.ch/dyn/79376.asp?id=1187
See also the full page:
http://www.educa.ch/dyn/79363.asp?action=search#62

Below are my attempts, the expressions that I found with XPather. Are they really the expressions that will help me parse the detail page mentioned above?

/html/body/div[3]/text()
/html/body/div[4]/text()
/html/body/div[6]/text()
/html/body/div[7]/text()
/html/body/div[9]/a/text()
/html/body/div[10]/text()
/html/body/div[11]/text()[1]
/html/body/div[11]/text()[2]
/html/body/div[12]/text()[1]
/html/body/div[12]/text()[2]
/html/body/div[13]/text()
[download]

This is the HTML code of http://www.educa.ch/dyn/79376.asp?id=1187:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://
+www.w3.org/TR/html4/loose.dtd"><html><head><meta name="generator" con
+tent="DigiOnline GmbH - WebWeaver 3.4 CMS - http://www.webweaver.de">
+<title>educa.ch</title><meta http-equiv="Content-Type" content="text/
+html; charset=iso-8859-1"><link rel="stylesheet" href="101.htm"><scri
+pt src="102.htm"></script><script language="JavaScript"><!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin
+="0" marginwidth="0" marginheight="0" onload="check();"><table cellsp
+acing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" 
+class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></t
+d><td width="99%" class="popuphead">Adresse - Schulen in der Schweiz<
+/td><td width="20" class="popuphead" valign="middle"><a href="#" titl
+e="Print" onclick="window.print(); return false;"><img src="../pics/p
+rint16x13.gif" alt="Drucken" width="16" height="13"></a></td><td widt
+h="20" class="popuphead" valign="middle"><a href="#" title="close" on
+click="window.close(); return false;"><img src="../pics/close21x13.gi
+f" alt="Schliessen" width="21" height="13"></a></td></tr>


<tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="
+1" height="1"></td></tr></table><div class="leerzeile">&#160;</div><d
+iv class="leerzeile"><img src="/0.gif" alt="" width="15"height="8">Al
+tes Schulhaus Ossingen    </div><div class="leerzeile">&#160;</div><d
+iv><img src="/0.gif" alt="" width="15" height="8">Guntibachstrasse 10
+</div><div><img src="/0.gif" alt="" width="15" height="8"></div><div>
+<img src="/0.gif" alt="" width="15" height="8">8475 &#160;Ossingen</d
+iv><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" w
+idth="15" height="8"><a href="" target="_blank"></a></div><div><img s
+rc="/0.gif" alt="" width="15" height="8"><a href="mailto: sekretariat
+.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div><d
+iv class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width=
+"15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">052
+ 317 15 45 </div><div><img src="/0.gif" alt="" width="15" height="8">
+Fax:<img src="/0.gif" alt="" width="4" height="8">052 317 04 42 </div
+><div>&#160;</div></body></html>
[download]

If I am able to identify the XPath expressions for this page, http://www.educa.ch/dyn/79376.asp?id=1187, then I am able to do the job!

Note: if I can do it for one page, I can do it for more than 5000, since I have to parse all of them. ;-) There are three tasks:

a. fetching the pages
b. parsing them
c. storing the results in a database

For the first task we can use LWP::UserAgent or WWW::Mechanize, for the second HTML::Parser, and for the third we need some knowledge of DBI.
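
The three tasks might fit together roughly as in the sketch below. It is only an outline under several assumptions that are not from this thread: it parses with HTML::TreeBuilder::XPath (which is built on HTML::Parser) rather than HTML::Parser directly, it stores rows in an SQLite file through DBD::SQLite, and the names `fetch_page`, `extract_lines`, `store_lines`, the table `school` and the file `schools.db` are all made up for illustration. It also assumes every detail page keeps its data in `<div>` elements directly under `<body>`, as the one page shown above does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use DBI;

# a. fetch one page; returns the decoded HTML or dies on failure
sub fetch_page {
    my ($ua, $url) = @_;
    my $res = $ua->get($url);
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;
    return $res->decoded_content;
}

# b. parse: collect the non-empty text of each <div> directly under <body>
sub extract_lines {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);
    my @lines;
    for my $div ($tree->findnodes(q{/html/body/div})) {
        my $text = $div->as_text;
        $text =~ s/[\s\xA0]+/ /g;    # collapse whitespace and no-break spaces
        $text =~ s/^ | $//g;         # trim the ends
        push @lines, $text if length $text;
    }
    $tree->delete;                   # free the parse tree
    return @lines;
}

# c. store: one row per extracted line (hypothetical table layout)
sub store_lines {
    my ($dbh, $url, @lines) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO school (url, line_no, text) VALUES (?, ?, ?)');
    my $n = 0;
    $sth->execute($url, ++$n, $_) for @lines;
    return $n;
}

# Only touch the network and the database when URLs are given.
if (@ARGV) {
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $dbh = DBI->connect('dbi:SQLite:dbname=schools.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS school
              (url TEXT, line_no INTEGER, text TEXT)');
    for my $url (@ARGV) {
        my @lines = eval { extract_lines(fetch_page($ua, $url)) };
        if ($@) { warn $@; next; }   # skip pages that fail, keep going
        store_lines($dbh, $url, @lines);
    }
    $dbh->disconnect;
}
```

Called with one or more detail-page URLs as arguments, the script stores one row per text line. With 5000 or more pages it would probably be polite to add a short `sleep` between requests.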


Replies are listed 'Best First'.
Re: HTML::TreeBuilder:: identifying xpath-expression - first attempt by kcott (Archbishop) on Oct 17, 2010 at 07:59 UTC
XPath is a W3C Recommendation. The documentation from that link will help you understand what XPather is outputting.

I looked at the source for the link (http://www.educa.ch/dyn/79376.asp?id=1187) you provided. All of the data that you are collecting appears to be in `<div class="leerzeile">` elements. It's generally better to target something like that than to use hard-coded indices (e.g. `div[3]`, `div[4]`, etc.): if the site owners decide to add some additional comment at the top of the page, what's in the 3rd `<div>` today may be in the 4th tomorrow; addresses won't necessarily have the same number of lines, so perhaps you only want up to the 12th `<div>` or could need the 14th `<div>` as well.

Following the XPath link above, you'll see a large number of examples under Location Paths. One in particular stands out as close to what you want:

`child::para[attribute::type="warning"]` selects all para children of the context node that have a type attribute with value warning

So, you can probably get all the data you need by accessing `div[attribute::class="leerzeile"]` instead of a long list of `div[N]` paths. I didn't study the markup in minute detail: if that doesn't get everything you want, I'd still use that type of technique rather than opting for a hard-coded index.

Finally, just a note on the markup in your question: links are created with `[url]` and named links with `[url|name]`; this and related information are explained in Writeup Formatting Tips.

-- Ken
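
To see how an attribute predicate compares with selecting every `<div>` under `<body>`, a small experiment along these lines may help. It is a sketch, not a tested solution for the live site: the HTML is a shortened stand-in built from the first few `<div>` elements of the page source in the question, and the helper name `texts_for` is invented here. `//div[@class="leerzeile"]` is the abbreviated XPath form of the `attribute::class` predicate.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Shortened stand-in for the detail page (assumption: the real page
# keeps this mix of spacer <div>s and data <div>s).
my $html = <<'__HTML__';
<html><body>
<div class="leerzeile">&#160;</div>
<div class="leerzeile"><img src="/0.gif" alt="">Altes Schulhaus Ossingen    </div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="">Guntibachstrasse 10</div>
<div><img src="/0.gif" alt=""></div>
<div><img src="/0.gif" alt="">8475 &#160;Ossingen</div>
</body></html>
__HTML__

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Return the non-empty text of every node matched by an XPath expression.
sub texts_for {
    my ($xpath) = @_;
    my @out;
    for my $node ($tree->findnodes($xpath)) {
        my $text = $node->as_text;
        $text =~ s/[\s\xA0]+/ /g;    # collapse whitespace and no-break spaces
        $text =~ s/^ | $//g;         # trim the ends
        push @out, $text if length $text;
    }
    return @out;
}

# Attribute predicate, no positional index
my @by_class = texts_for(q{//div[@class="leerzeile"]});

# Every <div> under <body>, with empty spacer rows filtered out in Perl
my @all_divs = texts_for(q{/html/body/div});

print "class match: $_\n" for @by_class;
print "any div:     $_\n" for @all_divs;

$tree->delete;
```

Running both expressions side by side shows which lines each one picks up in this markup, so the choice of expression can be checked against the page before relying on it. Neither expression depends on a `div[N]` position.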
Re^2: HTML::TreeBuilder:: identifying xpath-expression - first attempt by viveksnv (Sexton) on Oct 17, 2010 at 08:36 UTC
Hi, try WWW::Mechanize. It can solve your problem easily. Regards, Vivek
Re^3: HTML::TreeBuilder:: identifying xpath-expression - first attempt by Anonymous Monk on Oct 17, 2010 at 09:20 UTC
Hi viveksnv, kcott doesn't have the problem, Perlbeginner1 does (clicking the correct reply/comment on link is important). WWW::Mechanize won't help him extract HTML by XPath queries.
Re^4: HTML::TreeBuilder:: identifying xpath-expression - first attempt by Khen1950fx (Canon) on Oct 17, 2010 at 23:17 UTC
Re^5: HTML::TreeBuilder:: identifying xpath-expression - first attempt by Perlbeginner1 (Scribe) on Oct 19, 2010 at 15:54 UTC
Re^2: HTML::TreeBuilder:: identifying xpath-expression - first attempt by Anonymous Monk on Oct 17, 2010 at 09:18 UTC
FWIW, this might be a gentler introduction: http://w3schools.com/xpath/default.asp
Re: HTML::TreeBuilder:: identifying xpath-expression - first attempt by Khen1950fx (Canon) on Oct 17, 2010 at 11:39 UTC
This is as simple as I could make it:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use LWP::Simple;

my $html = get('http://www.educa.ch/dyn/79376.asp?id=1187')
    or die "Could not fetch the page\n";

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# keep the matched node; $tree->as_text would print the text of the whole page
my ($node) = $tree->findnodes(q{//tr[1]/td[2]});
print $node->as_text, "\n" if $node;
[download]
Re^2: HTML::TreeBuilder:: identifying xpath-expression - first attempt by Anonymous Monk on Oct 17, 2010 at 12:31 UTC
With help from htmltreexpather.pl - xpath helper, creates xpath search strings from html

#!/usr/bin/perl --
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
#~ $XML::XPathEngine::DEBUG = 1;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(<<'__HTML__');
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://
+www.w3.org/TR/html4/loose.dtd"><html><head><meta name="generator" con
+tent="DigiOnline GmbH - WebWeaver 3.4 CMS - http://www.webweaver.de">
+<title>educa.ch</title><meta http-equiv="Content-Type" content="text/
+html; charset=iso-8859-1"><link rel="stylesheet" href="101.htm"><scri
+pt src="102.htm"></script><script language="JavaScript"><!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin
+="0" marginwidth="0" marginheight="0" onload="check();"><table cellsp
+acing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" 
+class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></t
+d><td width="99%" class="popuphead">Adresse - Schulen in der Schweiz<
+/td><td width="20" class="popuphead" valign="middle"><a href="#" titl
+e="Print" onclick="window.print(); return false;"><img src="../pics/p
+rint16x13.gif" alt="Drucken" width="16" height="13"></a></td><td widt
+h="20" class="popuphead" valign="middle"><a href="#" title="close" on
+click="window.close(); return false;"><img src="../pics/close21x13.gi
+f" alt="Schliessen" width="21" height="13"></a></td></tr>


<tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="
+1" height="1"></td></tr></table><div class="leerzeile">&#160;</div><d
+iv class="leerzeile"><img src="/0.gif" alt="" width="15"height="8">Al
+tes Schulhaus Ossingen    </div><div class="leerzeile">&#160;</div><d
+iv><img src="/0.gif" alt="" width="15" height="8">Guntibachstrasse 10
+</div><div><img src="/0.gif" alt="" width="15" height="8"></div><div>
+<img src="/0.gif" alt="" width="15" height="8">8475 &#160;Ossingen</d
+iv><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" w
+idth="15" height="8"><a href="" target="_blank"></a></div><div><img s
+rc="/0.gif" alt="" width="15" height="8"><a href="mailto: sekretariat
+.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div><d
+iv class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width=
+"15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">052
+ 317 15 45 </div><div><img src="/0.gif" alt="" width="15" height="8">
+Fax:<img src="/0.gif" alt="" width="4" height="8">052 317 04 42 </div
+><div>&#160;</div></body></html>
__HTML__

# you can delete html/body
for my $query ( qw!
    /html/body/div[2]
    /html/body/div[4]
    /html/body/div[6]
    /html/body/div[9]
    /html/body/div[11]
    /html/body/div[12]
! ) {
    print $query,"\n",$tree->findvalue($query),"\n\n";
}
__END__
/html/body/div[2]
Altes Schulhaus Ossingen

/html/body/div[4]
Guntibachstrasse 10

/html/body/div[6]
8475  Ossingen

/html/body/div[9]
sekretariat.psossingen@bluewin.ch

/html/body/div[11]
Tel:052 317 15 45

/html/body/div[12]
Fax:052 317 04 42
[download]
Re^3: HTML::TreeBuilder:: identifying xpath-expression - first attempt by Perlbeginner1 (Scribe) on Oct 17, 2010 at 13:35 UTC
Hello Ken (kcott), viveksnv, Khen1950fx and Anonymous Monk,

this is a great place for learning. I am so happy about the answers. They show me that this community is alive and great at giving a helping hand. This is a great experience!

I will read all the answers later, since I have to leave the house at the moment. I will come back later today. Meanwhile, many thanks to all!

Update: If I am able to identify the XPath expressions for this page, http://www.educa.ch/dyn/79376.asp?id=1187, then I am able to do the job! If I can do it for one page, I can do it for more than 5000, since I have to parse all of them. ;-) There are three tasks:

a. fetching the pages
b. parsing them
c. storing the results in a database

For the first task we can use LWP::UserAgent or WWW::Mechanize, for the second HTML::Parser, and for the third we need some knowledge of DBI.
Re^4: HTML::TreeBuilder:: identifying xpath-expression - first attempt by Perlbeginner1 (Scribe) on Oct 17, 2010 at 17:10 UTC
Re^3: HTML::TreeBuilder:: identifying xpath-expression - first attempt by Perlbeginner1 (Scribe) on Oct 17, 2010 at 17:29 UTC
Hello dear Anonymous Monk!

I am trying to understand your posting. You refer to the page that explains and helps with finding XPaths. That is very interesting, and I am trying to learn something here. You use this link: http://www.perlmonks.org/?node_id=865792 It leads to the htmltreexpather.pl code. This is a great tutorial and a great tool. Let me ask you if I got this right: with it I can determine the paths, in other words I can find out all the paths in an HTML file?

$ perl htmltreexpather.pl select.html _tag option
HTML::Element=HASH(0xb139ec) 0.1.1.0.0 Chose Some aaa
/html/body/form/select/option
/html/body/form/select/option
/html/body[@bgcolor='red']/form[@action='/foo.cgi' and @name='queryfoo
+']/select[@name='singlelist']/option[@value='aaa']
------------------------------------------------------------------
[download]

Question: does the code mentioned above print the paths of any (general) HTML document? At least you make use of it in the script in your reply, which runs the six queries /html/body/div[2], /html/body/div[4], /html/body/div[6], /html/body/div[9], /html/body/div[11] and /html/body/div[12] against the detail page and prints the name, street, town, e-mail, telephone and fax lines.

That is very impressive. I am trying to understand this code and how you used the example you were referring to. If I understand you correctly, I can use htmltreexpather.pl in many cases to get the XPaths out of a page. Is this right?

I look forward to hearing from you. I guess that I can learn a lot! Please help me here.
Re^4: HTML::TreeBuilder:: identifying xpath-expression - first attempt by Anonymous Monk on Apr 02, 2011 at 15:11 UTC

