PerlMonks  

Re^3: Trivial HTML extractor utility

by eserte (Deacon)
on Nov 22, 2007 at 20:53 UTC (#652448)


in reply to Re^2: Trivial HTML extractor utility
in thread Trivial HTML extractor utility

>> If you used HTML::TreeBuilder::XPath it would be even more powerful.
> Not for me; I don't know how to write an xpath expression.
You should really give it a try; it's one of the few fine things to come out of the XML world. I once wrote a utility called xmlgrep, which uses XPath expressions to extract things from HTML or XML files. To extract links, one would write:
GET http://www.perlmonks.org | xmlgrep -parse-html '//a/@href'
but you can also add additional conditions, for example extract only absolute links:
GET http://www.perlmonks.org | xmlgrep -parse-html '//a/@href[contains(.,"http://")]'
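For anyone who would rather stay inside Perl, the same two queries can probably be run directly with HTML::TreeBuilder::XPath, the module mentioned earlier in the thread. This is only a minimal sketch, not how xmlgrep itself is implemented: the helper name extract_values and the sample HTML are made up for illustration. Note that contains() matches "http://" anywhere in the value, so the standard XPath 1.0 function starts-with() is the stricter test for absolute links; the second call uses it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: return the string values of all nodes
# matching $xpath in the HTML text $html.
sub extract_values {
    my ($html, $xpath) = @_;
    my $tree   = HTML::TreeBuilder::XPath->new_from_content($html);
    my @values = $tree->findvalues($xpath);
    $tree->delete;    # release the parse tree explicitly
    return @values;
}

# Made-up sample input, for illustration only.
my $html = <<'HTML';
<html><body>
<a href="http://www.perlmonks.org/">absolute</a>
<a href="/index.pl?node=Tutorials">relative</a>
</body></html>
HTML

# All link targets.
print "$_\n" for extract_values($html, '//a/@href');

# Only links whose target starts with "http://".
print "$_\n"
    for extract_values($html, '//a[starts-with(@href, "http://")]/@href');
```

Neither expression picks up https:// links; an "or" clause in the predicate would be needed for those.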

