Re: text extract

by b10m (Vicar)
on Feb 02, 2004 at 13:21 UTC (#325847)

in reply to text extract

This can be done in many ways (as usual), but one way would be to grab the webpage with LWP::Simple and parse it. What you get back will include the HTML markup, so you'd have to filter that out (assuming you just want to extract the text) with, say, HTML::Parser or HTML::Strip.
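A minimal sketch of that first approach, assuming LWP::Simple and HTML::Parser are installed. The helper name html_to_text is made up for illustration, and skipping script/style content and collapsing whitespace are assumptions about what "just the text" means for you:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::Parser ();

# Hypothetical helper: return the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes each text chunk with HTML entities decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    # Assumption: the contents of <script> and <style> are not wanted
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;

    # Collapse runs of whitespace and trim both ends
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Fetch and print only when a URL is given on the command line
if (my $url = shift @ARGV) {
    my $html = get($url);    # returns undef on failure
    die "Could not fetch $url\n" unless defined $html;
    print html_to_text($html), "\n";
}
```

Note that this simply concatenates the text chunks, so adjacent block elements such as two paragraphs run together without a separator. If layout matters, the text handler would need to be smarter than this.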

WWW::Mechanize, a very popular module amongst some monks, can probably help too, although I have been too lazy to check it out so far.

Then, of course, you could use a program such as "lynx" to dump the page as plain text, and parse that. No need for HTML stripping or LWP-like modules then.
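The "lynx" route could be sketched like this, assuming lynx is installed and on your PATH. The helper names lynx_command and lynx_dump are hypothetical; -dump writes the rendered page to standard output and -nolist leaves out the list of links lynx would otherwise append:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the lynx command line for a URL.
sub lynx_command {
    my ($url) = @_;
    return ('lynx', '-dump', '-nolist', $url);
}

# Hypothetical helper: run lynx and return the rendered page as text.
sub lynx_dump {
    my ($url) = @_;
    # List-form pipe open, so the URL never passes through a shell
    open(my $fh, '-|', lynx_command($url))
        or die "Cannot run lynx: $!\n";
    local $/;                # slurp the whole output at once
    my $text = <$fh>;
    close($fh) or die "lynx exited with an error\n";
    return $text;
}

# Dump and print only when a URL is given on the command line
if (my $url = shift @ARGV) {
    print lynx_dump($url);
}
```

What you get back is already formatted text, so whatever parsing you do afterwards is plain string or regex work.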

Update: it'd be helpful if you could tell us a little bit more about your motives for doing this (so we know if you want to get rid of the HTML, if you should go for the "lynx" approach, etc.), and what you've tried so far.


All code is usually tested, but rarely trusted.
