shu has asked for the wisdom of the Perl Monks concerning the following question:
Replies are listed 'Best First'.
Re: text extract
by b10m (Vicar) on Feb 02, 2004 at 13:21 UTC | |
This can be done in many ways (as usual), but one way would be to grab the web page with LWP::Simple and parse it. The result will include HTML, so you'd have to filter that out (assuming you just want to extract the text) with, say, HTML::Parser or HTML::Strip. WWW::Mechanize, a very popular module amongst some monks, can probably help too, although I have still been too lazy to check it out. Then, of course, you could use a program such as "lynx" to dump the page as plain text, and parse that. No need for HTML stripping and/or LWP-like modules then.

Update: it'd be helpful if you could tell us a little bit more about your motives for doing this (so we know whether you want to get rid of the HTML, whether you should go for the "lynx" approach, etc.), and what you've tried so far.
Re: text extract
by Roger (Parson) on Feb 02, 2004 at 13:39 UTC | |
Ok, I have written a demo using the following modules:
LWP::UserAgent ... to fetch the HTML source
HTML::Strip ... to strip HTML tags
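A minimal sketch of this fetch-then-strip approach could look like the following. It is an illustration only and not necessarily the demo as posted: the sub names `fetch_html` and `html_to_text` are hypothetical, the URL is whatever you pass on the command line, and both LWP::UserAgent and HTML::Strip are assumed to be installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Strip;

# Hypothetical helper: strip tags from an HTML string, then collapse
# runs of whitespace so the result is one readable line of text.
sub html_to_text {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;                      # reset the parser state
    $text =~ s/\s+/ /g;            # collapse whitespace runs
    $text =~ s/^\s+|\s+$//g;       # trim both ends
    return $text;
}

# Hypothetical helper: fetch a URL and return the decoded body,
# or die with the HTTP status line if the request failed.
sub fetch_html {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new(timeout => 10);
    my $response = $ua->get($url);
    die "Cannot fetch $url: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    print html_to_text(fetch_html($ARGV[0])), "\n";
}
```

Keeping the fetch and the strip in separate subs means the stripping step can be tried on a saved HTML file without any network access.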
And the output -
by shu (Initiate) on Feb 03, 2004 at 05:44 UTC | |
by shu (Initiate) on Feb 03, 2004 at 08:41 UTC | |
by Roger (Parson) on Feb 03, 2004 at 13:10 UTC | |
That's no moon, that's a space station -- Obiwan Kenobi.

Text extraction based on a known pattern is easy if you know what the start and finish of the section look like in general. However, if you are looking for a generic algorithm for logical text extraction, you need to build a text-classification/pattern-recognition engine, and that's going to be very, very difficult. Difficult, but not impossible. But that's way beyond me; besides, I don't want to lose too many brain cells over this. ;-)

I will only cover the easy way, i.e., (deterministic) text extraction based on a set of known patterns...
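As a rough sketch of the deterministic approach, each section can be described by a start marker and an end marker, and everything between them is captured. The marker strings, the section names and the `extract_sections` sub below are all invented for the illustration; real markers would have to be chosen by looking at the actual page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table of known patterns: section name => start and end
# markers. These markers are made up and not taken from any real page.
my %patterns = (
    headline => { start => qr/Headline:\s*/, end => qr/\n/ },
    body     => { start => qr/BEGIN TEXT\n/, end => qr/\nEND TEXT/ },
);

# Return a hash ref of section name => extracted text for every
# pattern whose start and end markers are both found, in that order.
sub extract_sections {
    my ($text, $patterns) = @_;
    my %found;
    for my $name (sort keys %$patterns) {
        my ($start, $end) = @{ $patterns->{$name} }{qw(start end)};
        # Non-greedy capture, so we stop at the first end marker;
        # /s lets the captured text span several lines.
        if ($text =~ /$start(.*?)$end/s) {
            $found{$name} = $1;
        }
    }
    return \%found;
}

my $sample = "Headline: Perl turns heads\nnoise\n"
           . "BEGIN TEXT\nline one\nline two\nEND TEXT\nmore noise\n";

my $sections = extract_sections($sample, \%patterns);
print "$_: $sections->{$_}\n" for sort keys %$sections;
```

A section whose markers are missing is simply left out of the result, so the caller can tell "not found" apart from "found but empty".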
And the output -
Re: text extract
by LordWeber (Monk) on Feb 02, 2004 at 14:33 UTC | |