PerlMonks  

Re^3: Trivial HTML extractor utility

by eserte (Deacon)
on Nov 22, 2007 at 20:53 UTC (#652448)


in reply to Re^2: Trivial HTML extractor utility
in thread Trivial HTML extractor utility

> > If you used HTML::TreeBuilder::XPath it would be even more powerful.
> Not for me; I don't know how to write an xpath expression.
You should really give it a try; it's one of the few fine things to come from the XML world. I once wrote a utility called xmlgrep, which uses XPath expressions to extract things from HTML or XML files. To extract links, one would write:
GET http://www.perlmonks.org | xmlgrep -parse-html '//a/@href'
You can also add conditions, for example to extract only absolute links:
GET http://www.perlmonks.org | xmlgrep -parse-html '//a/@href[contains(.,"http://")]'
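For comparison, here is a minimal sketch of the same two queries run directly through HTML::TreeBuilder::XPath, the module mentioned at the top of this reply. It is not how xmlgrep itself is implemented. The sample HTML and the variable names are made up for illustration, and it assumes HTML::TreeBuilder::XPath is installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample document; in practice this would be the fetched page.
my $html = <<'HTML';
<html><body>
<a href="http://www.perlmonks.org/">absolute</a>
<a href="/index.pl?node=Tutorials">relative</a>
<a name="anchor">no href at all</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Every href attribute value found on an <a> element.
my @all = $tree->findvalues('//a/@href');

# Only the href values that contain the string "http://".
my @absolute = $tree->findvalues('//a/@href[contains(.,"http://")]');

print "all:      @all\n";
print "absolute: @absolute\n";

$tree->delete;    # release the parse tree
```

Note that contains() matches the substring anywhere in the attribute value; starts-with(.,"http://") would be the stricter XPath test for links that actually begin with a scheme.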

