HTML::TreeBuilder:: identifying xpath-expression - first attempt
by Perlbeginner1 (Scribe) on Oct 17, 2010 at 07:07 UTC ( [id://865774]=perlquestion )
Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:
Hello and good morning, dear Monks,
First of all: I have learned a lot here! Yesterday Morgon gave me some hints on working with XPather, and now I am trying to take my first steps and apply what I have learned. I am currently working on a parser script: I have to parse all the detail pages of this site: http://www.educa.ch/dyn/79362.asp?action=search#0 There are several ways to do it. I have to get rid of a lot of cruft and keep only the text data from the page. See this page, which is very simple: http://www.educa.ch/dyn/79376.asp?id=1187 Output:
So, as we can see, I need a little Perl script to get these six lines of text out of the HTML page. And yes: if I can parse one page, I can do it for all 5000 to 6000 available pages, and I have to parse all of them. A true Perl job, and I am sure Perl can do it with ease! How do we do that? Personally I like HTML::TreeBuilder::XPath, which has to be installed from CPAN. Here is how we would then extract the name from one of the files with it. Note: I am not sure which arguments I have to use! See my trials below:
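A minimal sketch of that kind of extraction could look like the following. The HTML snippet, the class name and the XPath expressions are made-up placeholders, since the real structure of the detail page is not reproduced here; only the module calls (parse, eof, findvalue, findnodes, as_text, delete) are the ones HTML::TreeBuilder::XPath and HTML::Element provide.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup standing in for one saved detail page.
# The id, the class and the cell contents are placeholders only.
my $html = <<'HTML';
<html><body>
<div id="content">
  <table>
    <tr><td class="name">Example School</td></tr>
    <tr><td>Example Street 1</td></tr>
    <tr><td>1234 Exampletown</td></tr>
  </table>
</div>
</body></html>
HTML

# Build the tree from a string; for a saved file, parse_file($path) would do.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# One value: the text of the node matched by the expression.
my $name = $tree->findvalue('//td[@class="name"]');

# Several nodes: every table cell below the (assumed) content div.
my @lines = map { $_->as_text } $tree->findnodes('//div[@id="content"]//td');

print "Name: $name\n";
print "Line: $_\n" for @lines;

# Free the tree explicitly, as the HTML::TreeBuilder docs recommend.
$tree->delete;
```

For the real page, only the two XPath strings would have to change; the rest stays the same for every one of the detail pages.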
As we can see, we simply use an XPath expression to identify the node we want. So how do we determine that expression? I tried a Firefox plugin called XPather, which lets us click on an HTML element and extract the corresponding XPath. So we load the file we want to parse in Firefox, click on the element we want, get the XPath, and use that in the Perl script. I am not sure that I used XPather correctly, though. I tried to find the arguments for the detail page http://www.educa.ch/dyn/79376.asp?id=1187 (full page: http://www.educa.ch/dyn/79363.asp?action=search#62). See my trials below: these are the arguments that I found with XPather. Are they really the arguments that help me parse the above-mentioned detail result page?
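One way to check whether an expression taken from a browser tool really works is to run several candidates against the parsed tree and count the matches. The sketch below assumes a made-up table with id "detail"; the three candidate expressions are illustrations, not the ones XPather reported. A browser's DOM often contains elements such as tbody that are not in the HTML source, so a path copied from the browser may match nothing in the tree that Perl builds.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical page fragment; the table id is a placeholder.
my $html = <<'HTML';
<html><body>
<table id="detail"><tr><td>Example School</td></tr></table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Candidate expressions to compare (all made up for this sketch).
my @candidates = (
    '/html/body/table/tbody/tr/td',   # absolute path as a browser tool might report it
    '/html/body/table/tr/td',         # the same path without tbody
    '//table[@id="detail"]//td',      # relative path anchored on an attribute
);

# Record and report how many nodes each candidate matches.
my %matches;
for my $xpath (@candidates) {
    my @nodes = $tree->findnodes($xpath);
    $matches{$xpath} = scalar @nodes;
    printf "%-32s matched %d node(s)\n", $xpath, $matches{$xpath};
}

$tree->delete;
```

An expression that reports zero matches here will also return nothing in the real script, so this is a cheap test before running it over thousands of pages. Relative expressions anchored on an id or class tend to survive small layout differences between pages better than long absolute paths.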
See http://www.educa.ch/dyn/79376.asp?id=1187 and its HTML code.
Well - if I am able to identify the XPath expressions for this page, http://www.educa.ch/dyn/79376.asp?id=1187, then I am able to do the job! Note: if I can do it for one page, I can do it for more than 5000, since I have to parse all of them ;-) We see that there are three tasks:
a. fetching the pages
b. parsing them
c. storing the results in a database
For the first task we can use LWP::UserAgent or WWW::Mechanize, for the second we can use HTML::Parser, and for the third we need some knowledge of DBI.
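The three tasks could be wired together roughly as follows. This is only a sketch under assumptions: the XPath expression, the table layout and the SQLite file name schools.db are placeholders, DBD::SQLite is assumed to be installed as the DBI driver, and HTML::TreeBuilder::XPath is used for the parsing step instead of HTML::Parser directly. The URLs to process are taken from the command line.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use DBI;

# a. fetch one page and return its decoded HTML, or die with the HTTP status
sub fetch_page {
    my ($ua, $url) = @_;
    my $res = $ua->get($url);
    die "GET $url failed: " . $res->status_line . "\n" unless $res->is_success;
    return $res->decoded_content;
}

# b. parse the HTML and return the non-empty text of every node matching $xpath
sub extract_lines {
    my ($html, $xpath) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;
    my @lines = grep { length } map { $_->as_text } $tree->findnodes($xpath);
    $tree->delete;
    return @lines;
}

# c. store the extracted lines, one row per line
sub store_lines {
    my ($dbh, $url, @lines) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO detail (url, line_no, content) VALUES (?, ?, ?)');
    my $n = 0;
    $sth->execute($url, ++$n, $_) for @lines;
    return $n;
}

my $xpath = '//div[@id="content"]//td';   # placeholder expression

my $dbh = DBI->connect('dbi:SQLite:dbname=schools.db', '', '',
    { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS detail
          (url TEXT, line_no INTEGER, content TEXT)');

my $ua = LWP::UserAgent->new(timeout => 30);

for my $url (@ARGV) {
    # A failed download should not stop the whole run: warn and move on.
    my @lines = eval { extract_lines(fetch_page($ua, $url), $xpath) };
    if ($@) { warn $@; next; }
    store_lines($dbh, $url, @lines);
    sleep 1;   # be polite to the server between requests
}

$dbh->disconnect;
```

Run it as, for example, perl fetch_details.pl 'http://www.educa.ch/dyn/79376.asp?id=1187' (the script name is made up). Keeping fetch, parse and store in separate subs makes it easy to test the XPath on a saved file first and to swap in WWW::Mechanize or another database later.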