Re: text extract

by b10m (Vicar)
on Feb 02, 2004 at 13:21 UTC (#325847)

in reply to text extract

This can be done in many ways (as usual), but one way would be to grab the webpage with LWP::Simple and parse it. What you get back will include the HTML markup, so you'd have to filter that out (assuming you just want to extract the text) with, say, HTML::Parser or HTML::Strip.
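A minimal sketch of that first approach, assuming both LWP::Simple and HTML::Strip are installed from CPAN. The sub name html_to_text and the script name extract.pl are made up for illustration, and the URL is whatever you pass on the command line:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::Strip;

# Remove the markup from an HTML string and return what is left.
# (html_to_text is a hypothetical helper name, not part of either module.)
sub html_to_text {
    my ($html) = @_;
    my $hs   = HTML::Strip->new;
    my $text = $hs->parse($html);
    $hs->eof;    # reset the parser state
    return $text;
}

# Run as: perl extract.pl http://www.example.com/
if (my $url = shift @ARGV) {
    my $html = get($url);    # LWP::Simple's get() returns undef on failure
    die "Could not fetch $url\n" unless defined $html;
    print html_to_text($html), "\n";
}
```

HTML::Strip only throws the tags away; it does not try to lay the text out nicely, so expect to tidy up the whitespace yourself afterwards.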

WWW::Mechanize, a very popular module amongst some monks, can probably help too, although I have so far been too lazy to check it out.

Then, of course, you could use a program such as "lynx" to dump the rendered page as plain text, and parse that. No need for HTML stripping or LWP-like modules then.
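A rough sketch of the "lynx" route, assuming a lynx binary is on your PATH. The sub name lynx_dump and the script name dump.pl are made up for illustration; -dump writes the rendered page to standard output and -nolist leaves out the list of links that lynx otherwise appends:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the plain-text rendering of a page as produced by "lynx -dump".
# (lynx_dump is a hypothetical helper name.)
sub lynx_dump {
    my ($url) = @_;

    # The list form of open bypasses the shell, so the URL is passed to
    # lynx as a single argument and is never interpreted by a shell.
    open my $fh, '-|', 'lynx', '-dump', '-nolist', $url
        or die "Cannot run lynx: $!\n";

    local $/;    # slurp mode: read all of the output at once
    my $text = <$fh>;
    close $fh or die "lynx exited with an error\n";
    return $text;
}

# Run as: perl dump.pl http://www.example.com/
if (my $url = shift @ARGV) {
    print lynx_dump($url);
}
```

The text you get this way is already formatted the way lynx would show it on screen, which may or may not be what you want to parse.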

Update: it'd be helpful if you could tell us a little more about your motives for doing this (so we know whether you want to get rid of the HTML, whether you should go for the "lynx" approach, etc.), and what you've tried so far.


All code is usually tested, but rarely trusted.

