PerlMonks  

Re^2: Trivial HTML extractor utility

by Dominus (Parson)
on Nov 22, 2007 at 19:14 UTC ( [id://652433]=note )


in reply to Re: Trivial HTML extractor utility
in thread Trivial HTML extractor utility

If you used HTML::TreeBuilder::XPath it would be even more powerful.
Not for me; I don't know how to write an xpath expression.

Seriously, I think it's really interesting how we seem to have completely different outlooks on this. I was worried that the program was already excessively general and overfeaturized. I've only used it once, to extract titles, and I was considering getting rid of the command-line argument, downgrading it to a program that does nothing but extract titles. Meanwhile you, who have used it even less than I have, want to enhance it to do all sorts of other stuff.

Maybe you have some application in mind for some of that fancy stuff, but you didn't say you did, and you didn't give an example, so I wonder what value you see in enhancing the features of a program that already has way more features than have ever been used.

Please take this as a serious question, not as rhetoric.

Replies are listed 'Best First'.
Re^3: Trivial HTML extractor utility
by eserte (Deacon) on Nov 22, 2007 at 20:53 UTC
    If you used HTML::TreeBuilder::XPath it would be even more powerful.
    Not for me; I don't know how to write an xpath expression.
    You should really give it a try, it's one of the few fine things coming from the XML world. I once wrote a utility called xmlgrep, which uses XPath expressions for extracting things from HTML or XML files. For extracting links one would write:
    GET http://www.perlmonks.org | xmlgrep -parse-html '//a/@href'
    but you can also add additional conditions, for example extract only absolute links:
    GET http://www.perlmonks.org | xmlgrep -parse-html '//a/@href[contains(.,"http://")]'
Re^3: Trivial HTML extractor utility
by hossman (Prior) on Nov 22, 2007 at 21:17 UTC
    Not for me; I don't know how to write an xpath expression.
    ...
    I wonder what value you see in enhancing the features of a program that already has way more features than have ever been used.

    Fair enough ... but I suspect if you knew XPath my comment would make more sense.

    you strike me as the kind of guy who whips up little scripts to solve problems a lot -- heck, anyone who uses perl on a regular basis becomes that kind of person if they weren't already. as you say: right now it's got a feature you've never used (the ability to pick an arbitrary tag name at run time) and if you never use the script again, oh well ... it's not like it took you a lot of work to code it, right? But if at some point in your life you think "i need to get the <h1> tags out of all these HTML pages", you might remember your handy script, use it, and then realize what you really want is the *first* <h1> out of all the files, and you'd probably add a quick option to let you pick the first instance. Then maybe 6 months later you're crunching some more HTML files and you want the "content" attribute of any <meta http-equiv="refresh" ... > tags ... so you crank out another little script.

    Or, if you know XPath, the first time you need something a little more complicated than just all values of all the tags with a certain name, you add about 12 characters to your current script, and start passing some simple XPath expressions on the command line.

    Or you don't.

    Like you say: it's a trivial utility ... if it does everything you want then call it a day and go fishing. To answer your specific question: The value I see in enhancing it comes from the ability to gain a large amount of additional functionality from a small amount of additional work.
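
For readers who, like Dominus, have not written XPath before, here is a minimal sketch of the cases discussed above (titles, the first <h1>, the "content" attribute of a meta refresh tag, and absolute links). It assumes HTML::TreeBuilder::XPath is installed from CPAN. The extract helper and the sample page are made up for illustration and are not the original utility's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN module named earlier in the thread

# Hypothetical helper (not from the original utility): return the
# string value of every node in $html that matches $xpath.
sub extract {
    my ($html, $xpath) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;
    my @values = $tree->findvalues($xpath);
    $tree->delete;              # free the parse tree
    return @values;
}

# Made-up sample page, for illustration only.
my $html = <<'HTML';
<html><head>
<title>Sample</title>
<meta http-equiv="refresh" content="5; url=http://example.com/">
</head><body>
<h1>First heading</h1>
<p><a href="http://example.com/abs">absolute</a> <a href="/rel">relative</a></p>
<h1>Second heading</h1>
</body></html>
HTML

# The original use case: every <title>.
print "$_\n" for extract($html, '//title');

# Only the first <h1>; the "first" is picked here with a Perl list slice.
print "$_\n" for (extract($html, '//h1'))[0];

# The "content" attribute of any <meta http-equiv="refresh"> tag.
print "$_\n" for extract($html, '//meta[@http-equiv="refresh"]/@content');

# Links whose href contains "http://", as in eserte's xmlgrep example.
print "$_\n" for extract($html, '//a/@href[contains(.,"http://")]');
```

The design point is hossman's: the script itself stays fixed, and each new need becomes a different expression passed in, rather than a new option or a new script.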
