PerlMonks  

HTML::TreeBuilder:: identifying xpath-expression - first attempt

by Perlbeginner1 (Scribe)
on Oct 17, 2010 at 07:07 UTC ([id://865774], perlquestion)

Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Hello and good morning, dear Monks,

First of all: I have learned a lot here! Yesterday Morgon gave me some hints on working with XPather. Now I am trying to take some first steps and apply what I have learned.

I am currently working on a parser script. I have to parse all the detail pages of this site: <url>http://www.educa.ch/dyn/79362.asp?action=search#0</url> There are several ways to do it. I need to strip away the markup and keep only the text data from each page. See this page, which is very simple: <url> http://www.educa.ch/dyn/79376.asp?id=1187</url> The output I want:

Altes Schulhaus Ossingen
Guntibachstrasse 10
8475 Ossingen
sekretariat.psossingen@bluewin.ch
Tel:052 317 15 45
Fax:052 317 04 42


So I need a small Perl script to get these six lines of text out of the HTML page. And if I can parse one page, I can do it for all 5000 to 6000 available pages. I have to parse all of them. I am sure Perl can do this job with ease! As for how: personally I like HTML::TreeBuilder::XPath, which has to be installed from CPAN. Here is how the name could be extracted from one of the files with it:

Note: I am not sure which arguments (XPath expressions) I have to pass! See my attempts below:



use strict;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

# use real file name here
open(my $fh, "<", "file.html") or die $!;
$tree->parse_file($fh);

my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2] });
print $name->as_text;


As we can see, we simply use an XPath expression to identify the node we want.
So how do we determine that expression?

I tried a Firefox plugin called XPather, which lets us click on an HTML element and extract the corresponding XPath.
So we load the file we want to parse in Firefox, click on the element we want, get the XPath and use it in the Perl script.
I am not sure that I used XPather correctly. I tried to find the expressions for the following page, which is very simple: http://www.educa.ch/dyn/79376.asp?id=1187
See the full page:
http://www.educa.ch/dyn/79363.asp?action=search#62

See my attempts below. These are the expressions that I found with XPather. Are they really expressions that help me parse the detail result page mentioned above (http://www.educa.ch/dyn/79376.asp?id=1187)?


/html/body/div[3]/text()
/html/body/div[4]/text()
/html/body/div[6]/text()
/html/body/div[7]/text()
/html/body/div[9]/a/text()
/html/body/div[10]/text()
/html/body/div[11]/text()[1]
/html/body/div[11]/text()[2]
/html/body/div[12]/text()[1]
/html/body/div[12]/text()[2]
/html/body/div[13]/text()



See: http://www.educa.ch/dyn/79376.asp?id=1187

Here is the HTML code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS - http://www.webweaver.de"><title>educa.ch</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link rel="stylesheet" href="101.htm"><script src="102.htm"></script><script language="JavaScript"><!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="check();"><table cellspacing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></td><td width="99%" class="popuphead">Adresse - Schulen in der Schweiz</td><td width="20" class="popuphead" valign="middle"><a href="#" title="Print" onclick="window.print(); return false;"><img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13"></a></td><td width="20" class="popuphead" valign="middle"><a href="#" title="close" onclick="window.close(); return false;"><img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13"></a></td></tr>
<tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="1" height="1"></td></tr></table><div class="leerzeile">&#160;</div><div class="leerzeile"><img src="/0.gif" alt="" width="15"height="8">Altes Schulhaus Ossingen </div><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width="15" height="8">Guntibachstrasse 10</div><div><img src="/0.gif" alt="" width="15" height="8"></div><div><img src="/0.gif" alt="" width="15" height="8">8475 &#160;Ossingen</div><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width="15" height="8"><a href="" target="_blank"></a></div><div><img src="/0.gif" alt="" width="15" height="8"><a href="mailto: sekretariat.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">052 317 15 45 </div><div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif" alt="" width="4" height="8">052 317 04 42 </div><div>&#160;</div></body></html>


So if I am able to identify the XPath expressions for this page (http://www.educa.ch/dyn/79376.asp?id=1187), then I am able to do the job!

Note: if I can do it for one page, I can do it for more than 5000, since I have to parse all of them. ;-) I see three tasks:


a. fetching the pages
b. parsing them
c. storing the results in a database



For the first task we can use LWP::UserAgent or WWW::Mechanize, for the second HTML::Parser, and for the third we need some knowledge of DBI.
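The three tasks listed above could be wired together roughly as follows. This is only a sketch under assumptions: the sub names (fetch_page, extract_lines, store_lines), the SQLite file schools.db (via DBD::SQLite) and the school_line table are hypothetical, and the XPath /html/body/div assumes every detail page has the same shape as the markup quoted in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use DBI;

# a. Fetch one page and return its decoded HTML (dies on HTTP errors).
sub fetch_page {
    my ($ua, $url) = @_;
    my $res = $ua->get($url);
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;
    return $res->decoded_content;
}

# b. Return the non-empty text lines found in the body-level <div>s.
sub extract_lines {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);
    my @lines;
    for my $div ($tree->findnodes('/html/body/div')) {
        my $text = $div->as_text;
        $text =~ s/\xA0/ /g;      # &#160; is decoded to a no-break space
        $text =~ s/\s+/ /g;       # collapse runs of whitespace
        $text =~ s/^ | $//g;      # trim
        push @lines, $text if length $text;
    }
    $tree->delete;
    return @lines;
}

# c. Store the lines of one page; the table layout is hypothetical.
sub store_lines {
    my ($dbh, $url, @lines) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO school_line (url, line_no, content) VALUES (?, ?, ?)');
    my $n = 0;
    $sth->execute($url, ++$n, $_) for @lines;
    return $n;
}

# Usage: perl schools.pl URL [URL ...]
if (@ARGV) {
    my $dbh = DBI->connect('dbi:SQLite:dbname=schools.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS school_line
              (url TEXT, line_no INTEGER, content TEXT)');
    my $ua = LWP::UserAgent->new(timeout => 30);
    for my $url (@ARGV) {
        my @lines = extract_lines(fetch_page($ua, $url));
        store_lines($dbh, $url, @lines);
        sleep 1;    # pause between requests to go easy on the server
    }
    $dbh->disconnect;
}
```

Keeping the three steps in separate subs means the parsing step can be tried on a saved file before any of the 5000 or so pages are fetched.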

Replies are listed 'Best First'.
Re: HTML::TreeBuilder:: identifying xpath-expression - first attempt
by kcott (Archbishop) on Oct 17, 2010 at 07:59 UTC

    XPath is a W3C Recommendation. The documentation from that link will help you understand what XPather is outputting.

    I looked at the source for the link (http://www.educa.ch/dyn/79376.asp?id=1187) you provided. All of the data that you are collecting appears to be in <div class="leerzeile"> elements.

    It's generally better to target something like that than to use hard-coded indices (e.g. div[3], div[4], etc.): if the site owners decide to add some additional comment at the top of the page, what's in the 3rd <div> today may be in the 4th tomorrow; addresses won't necessarily have the same number of lines so perhaps you only want up to the 12th <div> or could need the 14th <div> as well.

    Following the XPath link above, you'll see a large number of examples under Location Paths. One in particular stands out as close to what you want:

    child::para[attribute::type="warning"] selects all para children of the context node that have a type attribute with value warning

    So, you can probably get all the data you need by accessing div[attribute::class="leerzeile"] instead of a long list of div[N] paths. I didn't study the markup in minute detail: if that doesn't get everything you want, I'd still use that type of technique rather than opting for a hard-coded index.
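A minimal sketch of that kind of predicate with HTML::TreeBuilder::XPath, run against a trimmed-down copy of the markup quoted in the question (the trimming is an assumption; the real page has more attributes and rows). In that quoted markup only the name line and the spacer rows carry class="leerzeile", so the sketch also selects every body-level <div> and skips the empty ones.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed-down copy of the markup quoted in the question.
my $html = <<'__HTML__';
<html><body>
<div class="leerzeile">&#160;</div>
<div class="leerzeile"><img src="/0.gif" alt="">Altes Schulhaus Ossingen </div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="">Guntibachstrasse 10</div>
<div><img src="/0.gif" alt=""></div>
<div><img src="/0.gif" alt="">8475 &#160;Ossingen</div>
</body></html>
__HTML__

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Attribute predicate (abbreviated form of attribute::class) instead of an index.
my @by_class = $tree->findnodes('//div[@class="leerzeile"]');
print scalar(@by_class), " div(s) with class leerzeile\n";

# Every <div> directly under <body>, whatever its class; spacers are dropped.
my @lines;
for my $div ($tree->findnodes('/html/body/div')) {
    my $text = $div->as_text;
    $text =~ s/\xA0/ /g;        # &#160; is decoded to a no-break space
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    next unless length $text;   # skip spacer and image-only divs
    push @lines, $text;
}
print "$_\n" for @lines;

$tree->delete;
```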

    Finally, just a note on the markup in your question: links are created with [url] and named links with [url|name] - this and related information are explained in Writeup Formatting Tips.

    -- Ken

      Hi,

      Try with WWW::Mechanize

      It can solve your problem easily

      Regards,
      Vivek
        Hi viveksnv
        • kcott doesn't have the problem, Perlbeginner1 does (clicking the correct reply/comment on link is important)
        • WWW::Mechanize won't help him extract html by xpath queries
Re: HTML::TreeBuilder:: identifying xpath-expression - first attempt
by Khen1950fx (Canon) on Oct 17, 2010 at 11:39 UTC
    This is as simple as I could make it:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;
    use LWP::Simple;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content(get 'http://www.educa.ch/dyn/79376.asp?id=1187');

    # note: the result of findnodes() is not used here;
    # the print below outputs the text of the whole tree
    $tree->findnodes(q{//tr[1]/td[2]});
    print $tree->as_text, "\n";
      With help from htmltreexpather.pl - xpath helper, creates xpath search strings from html
      #!/usr/bin/perl --
      use strict;
      use warnings;
      use HTML::TreeBuilder::XPath;
      #~ $XML::XPathEngine::DEBUG = 1;

      my $tree = HTML::TreeBuilder::XPath->new;
      $tree->parse_content(<<'__HTML__');
      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS - http://www.webweaver.de"><title>educa.ch</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link rel="stylesheet" href="101.htm"><script src="102.htm"></script><script language="JavaScript"><!--
      var did='d79376';
      var root=new Array('d200','d205','d73137','d1566','d79376','d');
      var usefocus = 1;
      function check() {
      if ((self.focus) && (usefocus)) {
      self.focus();
      }
      }
      // --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="check();"><table cellspacing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></td><td width="99%" class="popuphead">Adresse - Schulen in der Schweiz</td><td width="20" class="popuphead" valign="middle"><a href="#" title="Print" onclick="window.print(); return false;"><img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13"></a></td><td width="20" class="popuphead" valign="middle"><a href="#" title="close" onclick="window.close(); return false;"><img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13"></a></td></tr>
      <tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="1" height="1"></td></tr></table><div class="leerzeile">&#160;</div><div class="leerzeile"><img src="/0.gif" alt="" width="15"height="8">Altes Schulhaus Ossingen </div><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width="15" height="8">Guntibachstrasse 10</div><div><img src="/0.gif" alt="" width="15" height="8"></div><div><img src="/0.gif" alt="" width="15" height="8">8475 &#160;Ossingen</div><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width="15" height="8"><a href="" target="_blank"></a></div><div><img src="/0.gif" alt="" width="15" height="8"><a href="mailto: sekretariat.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">052 317 15 45 </div><div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif" alt="" width="4" height="8">052 317 04 42 </div><div>&#160;</div></body></html>
      __HTML__

      # you can delete html/body
      for my $query ( qw!
          /html/body/div[2]
          /html/body/div[4]
          /html/body/div[6]
          /html/body/div[9]
          /html/body/div[11]
          /html/body/div[12]
      ! ) {
          print $query,"\n",$tree->findvalue($query),"\n\n";
      }
      __END__
      /html/body/div[2]
      Altes Schulhaus Ossingen

      /html/body/div[4]
      Guntibachstrasse 10

      /html/body/div[6]
      8475  Ossingen

      /html/body/div[9]
      sekretariat.psossingen@bluewin.ch

      /html/body/div[11]
      Tel:052 317 15 45

      /html/body/div[12]
      Fax:052 317 04 42
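Since the number of address lines may differ from page to page, some fields could also be picked by their content rather than by position. The following is only a sketch: the %xpath_for mapping and the field names are hypothetical, the sample markup is a trimmed-down copy of the page quoted in the question, and it assumes every detail page labels its phone lines with "Tel:" and "Fax:" and links the e-mail address with a mailto: href.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed-down copy of the contact rows from the quoted markup.
my $html = <<'__HTML__';
<html><body>
<div><a href="mailto: sekretariat.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div>
<div class="leerzeile">&#160;</div>
<div>Tel:<img src="/0.gif" alt="">052 317 15 45 </div>
<div>Fax:<img src="/0.gif" alt="">052 317 04 42 </div>
</body></html>
__HTML__

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Hypothetical field names mapped to content-based XPath expressions.
my %xpath_for = (
    email => '//a[starts-with(@href, "mailto:")]',
    tel   => '//div[starts-with(., "Tel:")]',
    fax   => '//div[starts-with(., "Fax:")]',
);

my %record;
for my $field (sort keys %xpath_for) {
    my $value = $tree->findvalue($xpath_for{$field});
    $value =~ s/^(?:Tel|Fax)://;     # drop the label, keep the number
    $value =~ s/^\s+|\s+$//g;        # trim surrounding whitespace
    $record{$field} = $value;
    print "$field: $value\n";
}

$tree->delete;
```

A field that is missing on a page simply comes back as an empty string from findvalue, which is easier to handle than an index that silently points at the wrong line.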
        Hello Ken (kcott), viveksnv, Khen1950fx, and hello Anonymous Monk,

        This is a great place for learning. I am so happy about the answers. They show me that this community is alive and great at lending a helping hand.

        This is a great experience!

        I will read all the answers later, since I have to leave the house at the moment.

        I will come back later today.
        Meanwhile, many, many thanks to all!


        Hello dear Anonymous Monk! I am trying to understand your posting.


        You refer to the page that explains and helps with finding XPaths. That is very interesting! I am trying to learn something here.

        You use this link: http://www.perlmonks.org/?node_id=865792

        It leads to this code!

        This is a great tutorial and a great tool. Let me ask you whether I got this right: with it I can determine the paths, in other words I can find out all the paths in an HTML file?

        $ perl htmltreexpather.pl select.html _tag option
        HTML::Element=HASH(0xb139ec) 0.1.1.0.0
        Chose Some aaa
        /html/body/form/select/option
        /html/body/form/select/option
        /html/body[@bgcolor='red']/form[@action='/foo.cgi' and @name='queryfoo']/select[@name='singlelist']/option[@value='aaa']
        ------------------------------------------------------------------



        Question: does the code mentioned above print the paths of any (general) HTML document?

        At least you make use of it in the script you posted above: the one that feeds the page source to parse_content and then prints $tree->findvalue($query) for /html/body/div[2], div[4], div[6], div[9], div[11] and div[12].



        That is very impressive. I am trying to understand this code and how you used the example you were referring to:


        $ perl htmltreexpather.pl select.html _tag option
        HTML::Element=HASH(0xb139ec) 0.1.1.0.0
        Chose Some aaa
        /html/body/form/select/option
        /html/body/form/select/option
        /html/body[@bgcolor='red']/form[@action='/foo.cgi' and @name='queryfoo']/select[@name='singlelist']/option[@value='aaa']


        If I understand you correctly, I can use this script in many cases to get the XPaths. Is this right?

        I look forward to hearing from you! I guess that I can learn a lot. Please help me here!
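To illustrate the general idea behind such a path-listing helper, here is a small sketch that walks a parsed tree and prints an index-based path for every <div> that contains text. It is not the htmltreexpather.pl script from the linked node; the xpath_of sub is a hypothetical helper written for this example, and the sample markup is a trimmed-down copy of the page quoted in the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Scalar::Util qw(refaddr);

# Build an absolute, index-based path for one element (hypothetical helper).
sub xpath_of {
    my ($elem) = @_;
    my @steps;
    for (my $e = $elem; defined $e; $e = $e->parent) {
        my $step   = $e->tag;
        my $parent = $e->parent;
        if (defined $parent) {
            # Siblings with the same tag decide whether an index is needed.
            my @same = grep { ref $_ && $_->tag eq $e->tag } $parent->content_list;
            if (@same > 1) {
                my ($pos) = grep { refaddr($same[$_ - 1]) == refaddr($e) } 1 .. @same;
                $step .= "[$pos]";
            }
        }
        unshift @steps, $step;
    }
    return '/' . join '/', @steps;
}

my $tree = HTML::TreeBuilder->new_from_content(<<'__HTML__');
<html><body>
<div class="leerzeile">&#160;</div>
<div class="leerzeile">Altes Schulhaus Ossingen </div>
<div>Guntibachstrasse 10</div>
</body></html>
__HTML__

# Collect [path, text] pairs for every div that has visible text.
my @found;
for my $elem ($tree->look_down(_tag => 'div')) {
    my $text = $elem->as_text;
    $text =~ s/\xA0/ /g;          # &#160; is decoded to a no-break space
    $text =~ s/^\s+|\s+$//g;
    next unless length $text;
    push @found, [ xpath_of($elem), $text ];
    print "$found[-1][0]\t$text\n";
}

$tree->delete;
```

The paths printed this way can be pasted straight into findvalue or findnodes, with the caveat already raised in this thread that positional indices break as soon as the page layout shifts.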
