PerlMonks  

How to extract xpath from the webpage

by perladdict (Chaplain)
on Nov 03, 2009 at 10:01 UTC ( #804640=perlquestion )
perladdict has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
How can I extract the XPath of each element in a web page? I have gone through the nodes in the Monastery Gates and the HTML::TreeBuilder::XPath module documentation, but I don't know how to start. I am seeking help from the expert monks on how to begin in order to achieve this task.

Re: How to extract xpath from the webpage
by moritz (Cardinal) on Nov 03, 2009 at 10:13 UTC
    Your question is a bit puzzling - do you really want to obtain an xpath expression for each HTML tag?

    Usually it's the other way round: You need to extract some specific HTML tags, and use xpath for that.

    So the best way to start would be to learn XPath a bit, then look at the HTML page you want to extract stuff from, and write an XPath expression to extract what you need.

    Install HTML::TreeBuilder::XPath, experiment with it, and refine your xpath expression until it does what you want.
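As a starting point for that kind of experimenting, here is a minimal sketch that parses an HTML string and runs a few XPath queries against it. The sample page and the expressions are made up for illustration; `findnodes` and `findvalue` come from HTML::TreeBuilder::XPath, and `attr` is inherited from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page; in practice this would be the page you fetched.
my $html = <<'HTML';
<html><head><title>Demo</title></head>
<body>
  <table><tr>
    <td><div><a href="/shoes"><img src="/img/shoes.png" alt="Shoes"></a></div></td>
    <td><a href="/about">About us</a></td>
  </tr></table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Text content of the first match.
my $title = $tree->findvalue('/html/head/title');

# All link targets, then only the images nested as td/div/a/img.
my @hrefs = map { $_->attr('href') } $tree->findnodes('//a[@href]');
my @imgs  = map { $_->attr('src') }  $tree->findnodes('//td/div/a/img');

print "title: $title\n";
print "link:  $_\n" for @hrefs;
print "image: $_\n" for @imgs;

$tree->delete;    # free the parse tree when done
```

Changing one expression at a time and rerunning the script is a quick way to see which nodes a given XPath actually selects.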

Re: How to extract xpath from the webpage
by Corion (Pope) on Nov 03, 2009 at 10:14 UTC

    Do you really mean that you have an HTML structure and want one XPath expression for each element?

    This smells of homework to me because constructing an XPath expression if you have the path to an element is trivial:

    <myml>
      <foo>
        <bar id="1" />
        <bar id="2" />
      </foo>
    </myml>

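    To get the XPath expression for each element, you concatenate the tags of all its ancestors, separated by /, and add the position of each element among its siblings as a positional predicate, for example /myml/foo/bar[2] (:nth-child is the CSS counterpart, not an XPath axis).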
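The concatenation described here can be sketched in plain Perl. This is only an illustration under assumptions: the tree is a hand-built nested hash standing in for the `<myml>` sample (the `tag` and `children` keys and the `xpaths_for` helper are hypothetical names), and positions are counted among same-named siblings, which is what an XPath predicate such as `bar[2]` means.

```perl
use strict;
use warnings;

# Hypothetical in-memory tree for the <myml> sample: each node is a hash
# with a tag name and a list of child nodes.
my $tree = {
    tag      => 'myml',
    children => [
        {
            tag      => 'foo',
            children => [
                { tag => 'bar', children => [] },
                { tag => 'bar', children => [] },
            ],
        },
    ],
};

# Return one absolute XPath expression per element, in document order.
# $path is the expression that already addresses $node itself.
sub xpaths_for {
    my ($node, $path) = @_;
    my @paths = ($path);
    my %seen;    # number of children seen so far, per tag name
    for my $child (@{ $node->{children} }) {
        my $index = ++$seen{ $child->{tag} };
        my $child_path = sprintf '%s/%s[%d]', $path, $child->{tag}, $index;
        push @paths, xpaths_for($child, $child_path);
    }
    return @paths;
}

my @xpaths = xpaths_for($tree, sprintf('/%s[1]', $tree->{tag}));
print "$_\n" for @xpaths;
```

For the sample this prints /myml[1], /myml[1]/foo[1], /myml[1]/foo[1]/bar[1] and /myml[1]/foo[1]/bar[2], one per line. With a real parser the same walk would run over the parser's element objects instead of hashes.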

    Generating such an XPath expression does not help you much, which is why I think this is homework. But if this is not homework, maybe you can explain what actual problem you're trying to solve.

      Hi Corion, I am doing web page automation with Selenium to find links, text, and image links. Selenium uses XPath expressions such as "//td[2]/div/a/img" to locate the links in the web page source.
      I am trying HTML::TreeBuilder::XPath, but I don't know which other modules I can import in my script.

        If Selenium supports XPath queries, you don't need any Perl XPath modules. If you want to access Selenium and its results, see WWW::Selenium. If you want to use HTML::TreeBuilder::XPath, I'm not sure where your actual problem in your code is. The "synopsis" section shows how to extract HTML fragments from a given HTML string. Maybe you want to fetch the images using LWP::UserAgent then?

        Personally, I automate websites with WWW::Mechanize::Firefox, which supports JavaScript (and XPath).
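If the goal is to collect links and image URLs without Selenium, one possible combination of the modules mentioned above is sketched below. It is an illustration, not code from this thread: the helper name `links_and_images` is hypothetical, the page is fetched only when a URL is passed on the command line, and relative references are resolved with `URI->new_abs`.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use URI;

# Collect absolute link and image URLs from an HTML string.
# $base is the URL the page came from, used to resolve relative references.
sub links_and_images {
    my ($html, $base) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my @links  = map { URI->new_abs($_->attr('href'), $base)->as_string }
                 $tree->findnodes('//a[@href]');
    my @images = map { URI->new_abs($_->attr('src'), $base)->as_string }
                 $tree->findnodes('//img[@src]');
    $tree->delete;    # free the parse tree
    return (\@links, \@images);
}

# Fetch a page only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $ua       = LWP::UserAgent->new(timeout => 10);
    my $response = $ua->get($url);
    die "Cannot fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;

    my ($links, $images) = links_and_images($response->decoded_content, $url);
    print "link:  $_\n" for @$links;
    print "image: $_\n" for @$images;
}
```

The image files themselves could then be downloaded with further `$ua->get` calls on the collected URLs. Pages that build their content with JavaScript will not show that content to this approach, which is where a browser-driving tool comes in.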

Re: How to extract xpath from the webpage
by spx2 (Chaplain) on Nov 03, 2009 at 14:16 UTC

    Here you go, this should get you started with HTML::TreeBuilder::XPath. It's code that parses Google search results. You will get an IP ban if you use it too much, so this is just for educational purposes.

    Also consider reading the actual documentation.

    Good luck and most importantly, have fun!

      Can you please add a script that uses the Google Perl script you have shared? I am looking to capture all the XPaths for the search term "blue suede shoes" on the Google page. Thanks, M
