PerlMonks
Re: text extract
by b10m (Vicar) on Feb 02, 2004 at 13:21 UTC ( [id://325847]=note )
This can be done in many ways (as usual), but one way would be to grab the web page with LWP::Simple and parse it. The result will include HTML, so you'd have to filter that out (assuming you just want to extract the text) with, say, HTML::Parser or HTML::Strip. WWW::Mechanize, a very popular module amongst some monks, can probably help too, although I have been too lazy to check it out so far. Then, of course, you could use a program such as "lynx" to dump the page as plain text and parse that. No need for HTML stripping and/or LWP-like modules then.

Update: it'd be helpful if you could tell us a little bit more about your motives for doing this (so we know whether you want to get rid of the HTML, whether you should go for the "lynx" approach, etc.), and what you've tried so far.
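To make the two routes a bit more concrete, here is a minimal sketch, not a finished solution. It assumes LWP::Simple and HTML::Strip are installed from CPAN and, for the second helper, that a "lynx" binary is on your PATH. The helper names html_to_text and lynx_dump are made up for this illustration, and the URL is whatever you pass on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);   # CPAN: get($url) returns the page, or undef on failure
use HTML::Strip;           # CPAN: removes markup from an HTML string

# Hypothetical helper: turn an HTML string into whitespace-normalised text.
sub html_to_text {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;                    # reset the parser so the object could be reused
    $text =~ s/\s+/ /g;          # collapse runs of whitespace left by removed tags
    $text =~ s/^\s+|\s+$//g;     # trim both ends
    return $text;
}

# Hypothetical helper: the "lynx" route. Lynx renders the page itself,
# so no HTML stripping or LWP-like module is involved.
sub lynx_dump {
    my ($url) = @_;
    # List form of open avoids passing the URL through a shell.
    open(my $fh, '-|', 'lynx', '-dump', $url)
        or die "Cannot run lynx: $!";
    local $/;                    # slurp the whole output at once
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Only fetch when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    print html_to_text($html), "\n";
}
```

Save it under any name and run it with the page's URL as the only argument. To try the lynx route, swap the last print for print lynx_dump($url). Which of the two fits better depends on what you want to do with the text afterwards.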