Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re: Re: Re: text extract

by Roger (Parson)
on Feb 03, 2004 at 13:10 UTC ( #326186=note: print w/ replies, xml ) Need Help??


in reply to Re: Re: text extract
in thread text extract

Now suppose i need to make a generic code that just searched for the keywords "publications" and "interests" within a given page, how do i reform the code.

That's no moon, that's a space station -- Obiwan Kenobi.

To do text extraction based on known pattern is easy if you know what the section start and finish look like in general. However you are looking for a generic algorithm on logical text extraction, you need to build a text-classification/pattern-recognition engine, and that's going to be very very difficult. Difficult, but not impossible. But that's way beyond me, besides I don't want to lose too many brain cells over this. ;-)

I will only cover the easy way, ie, (deterministic) text extraction based on a set of known patterns...

use strict; use warnings; use Data::Dumper; # build a hash of known patterns for each known web site my %patterns = ( 'www.foo.com' => { start => "<h3><font[^>]*><b><!--KEY--></b>", finish => "(?<!</font>\n)<br>", }, 'www.bar.com' => { start => "...", finish => "...", }, ); my $html = do { local $/; <DATA> }; print ExtractSection($html, 'www.foo.com', 'Section 2'), "\n\n"; print ExtractSection($html, 'www.foo.com', 'Section 1'), "\n\n"; print ExtractSection($html, 'www.foo.com', 'Section 3'), "\n\n"; # ----------------------------------------------------- sub ExtractSection { my ($html, $site, $section) = @_; my $ps = $patterns{$site}->{start}; my $pf = $patterns{$site}->{finish}; $ps =~ s/<!--KEY-->/$section/; $pf =~ s/<!--KEY-->/$section/; my ($text) = $html =~ /($ps.*?$pf)/sm; return $text; } __DATA__ <HTML> <h3><font size=+1><b>Section 1</b></font> <br> <li>Item 1 <li>Item 2 <li>Item 3 <br> <h3><font size=+1><b>Section 2</b></font> <br> <li>Item 4 <li>Item 5 <li>Item 6 <br> <h3><font size=+1><b>Section 3</b></font> <br> <li>Item 7 <li>Item 8 <li>Item 9 <br> </HTML>

And the output -
<h3><font size=+1><b>Section 2</b></font> <br> <li>Item 4 <li>Item 5 <li>Item 6 <br> <h3><font size=+1><b>Section 1</b></font> <br> <li>Item 1 <li>Item 2 <li>Item 3 <br> <h3><font size=+1><b>Section 3</b></font> <br> <li>Item 7 <li>Item 8 <li>Item 9 <br>


Comment on Re: Re: Re: text extract
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://326186]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (10)
As of 2015-07-30 08:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (270 votes), past polls