Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re: Re: Re: text extract

by Roger (Parson)
on Feb 03, 2004 at 13:10 UTC ( #326186=note: print w/replies, xml ) Need Help??


in reply to Re: Re: text extract
in thread text extract

Now suppose i need to make a generic code that just searched for the keywords "publications" and "interests" within a given page, how do i reform the code.

That's no moon, that's a space station -- Obiwan Kenobi.

To do text extraction based on known pattern is easy if you know what the section start and finish look like in general. However you are looking for a generic algorithm on logical text extraction, you need to build a text-classification/pattern-recognition engine, and that's going to be very very difficult. Difficult, but not impossible. But that's way beyond me, besides I don't want to lose too many brain cells over this. ;-)

I will only cover the easy way, ie, (deterministic) text extraction based on a set of known patterns...
use strict; use warnings; use Data::Dumper; # build a hash of known patterns for each known web site my %patterns = ( 'www.foo.com' => { start => "<h3><font[^>]*><b><!--KEY--></b>", finish => "(?<!</font>\n)<br>", }, 'www.bar.com' => { start => "...", finish => "...", }, ); my $html = do { local $/; <DATA> }; print ExtractSection($html, 'www.foo.com', 'Section 2'), "\n\n"; print ExtractSection($html, 'www.foo.com', 'Section 1'), "\n\n"; print ExtractSection($html, 'www.foo.com', 'Section 3'), "\n\n"; # ----------------------------------------------------- sub ExtractSection { my ($html, $site, $section) = @_; my $ps = $patterns{$site}->{start}; my $pf = $patterns{$site}->{finish}; $ps =~ s/<!--KEY-->/$section/; $pf =~ s/<!--KEY-->/$section/; my ($text) = $html =~ /($ps.*?$pf)/sm; return $text; } __DATA__ <HTML> <h3><font size=+1><b>Section 1</b></font> <br> <li>Item 1 <li>Item 2 <li>Item 3 <br> <h3><font size=+1><b>Section 2</b></font> <br> <li>Item 4 <li>Item 5 <li>Item 6 <br> <h3><font size=+1><b>Section 3</b></font> <br> <li>Item 7 <li>Item 8 <li>Item 9 <br> </HTML>

And the output -
<h3><font size=+1><b>Section 2</b></font> <br> <li>Item 4 <li>Item 5 <li>Item 6 <br> <h3><font size=+1><b>Section 1</b></font> <br> <li>Item 1 <li>Item 2 <li>Item 3 <br> <h3><font size=+1><b>Section 3</b></font> <br> <li>Item 7 <li>Item 8 <li>Item 9 <br>

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://326186]
help
Chatterbox?
[erix]: 1=0 is as short as TOP :)
[erix]: "code of someone that died" -- kinda nice if your code stops working too
[erix]: hard to implement, hmm
[Corion]: erix: Well, they also seem to have changed the server, or some software, or whatever, and seem to be in the process of changing the DB schema from having the "username" as primary key to something else.
[Corion]: Far too many things being done at once, or maybe only now has it become apparent that nobody knows that piece of software anymore
[marto]: good morning all
[Corion]: I consider having an abstract key as userid in your system good, because the "real" company-wide (or even larger) user id will likely not fit your criteria well
[Corion]: A good morning marto!

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (8)
As of 2017-01-23 09:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you watch meteor showers?




    Results (192 votes). Check out past polls.