
Re: text extract

by Roger (Parson)
on Feb 02, 2004 at 13:39 UTC ( #325850=note )

in reply to text extract

I remember you came here a while ago asking for advice on how to automatically parse online staff info HTML pages into useful data. How far are you into your project anyway? Did you experiment with the various CPAN modules mentioned in the replies to your original post?

Ok, I have written a demo using the following modules:
LWP::UserAgent ... to fetch the HTML source
HTML::Strip ... to strip HTML tags
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Strip;

# fetch the html source
my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => ' +/eekteoh.htm');
my $res = $ua->request($req);

# a method call does not interpolate inside a string, so concatenate it
die "unable to fetch HTML source: " . $res->status_line
    if !$res->is_success();

my $html = $res->content();   # raw page content
$html =~ s/\x0D//g;           # convert to unix format

# grab the html fragments
my ($research_interest, $selected_publications) =
    $html =~ /(Research Interests.*)(Selected Publications.*)Projects/s;

# strip html tags
my $hs = HTML::Strip->new();
$research_interest = $hs->parse( $research_interest );
$research_interest =~ s/\n+/\n/sg;
$research_interest =~ s/(Research Interests)/$1\n------------------/;

$selected_publications = $hs->parse($selected_publications);
$selected_publications =~ s/\n+/\n/sg;
$selected_publications =~ s/(Selected Publications)/$1\n---------------------/;
$hs->eof;

print "$research_interest\n\n$selected_publications";

And the output -
Research Interests
------------------
Computer Vision and Pattern Recognition
Autonomous Navigation of Outdoor AGVs
Intelligent Systems
Robotics
Industrial Automation

Selected Publications
---------------------
DG Shen, Harace HS Ip, ....
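
For anyone who wants to try the two-group capture without fetching a page, here is a minimal offline sketch. The HTML fragment is made up for illustration, and `strip_tags` is a hypothetical, crude regex stand-in for HTML::Strip (which handles real-world markup far better), so treat this as an approximation of the demo rather than a replacement for it.

```perl
use strict;
use warnings;

# Hypothetical fragment standing in for the fetched page.
my $html = join "\n",
    '<h3>Research Interests</h3>',
    '<ul><li>Robotics</li><li>Intelligent Systems</li></ul>',
    '<h3>Selected Publications</h3>',
    '<p>Paper A</p>',
    '<h3>Projects</h3>',
    '<p>Project X</p>';

# Same two-group capture as in the demo: each group runs from its
# heading up to the next one; /s lets "." match newlines too.
my ($interests, $publications) =
    $html =~ /(Research Interests.*)(Selected Publications.*)Projects/s;

# Crude tag stripper (an assumption for this sketch, not HTML::Strip).
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]*>/\n/g;       # replace every tag with a newline
    $text =~ s/\n+/\n/g;           # collapse runs of newlines
    $text =~ s/\A\n+|\n+\z//g;     # trim leading and trailing newlines
    return $text;
}

print strip_tags($interests), "\n\n", strip_tags($publications), "\n";
```

Because both `.*` groups are greedy, the match relies on "Selected Publications" and "Projects" each appearing only once after the first heading; on a page where they repeat, the groups would stretch to the last occurrence.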

Re: Re: text extract
by shu (Initiate) on Feb 03, 2004 at 05:44 UTC
    Hi there, thanks for the code. Yes, I have explored all the modules and have managed to grab the HTML pages as well as parse the HTML to get the required data. What I'm having problems with is getting just a part of the text on the page, and recognising the beginning and end of the text under the required headings. I have been able to get ALL the text, but not just the parts I want under the specific headings. Can you help with a snippet of code? Thanks.
Re: Re: text extract
by shu (Initiate) on Feb 03, 2004 at 08:41 UTC
    As far as my project goes, I have made a flowchart of what I need to do and have started development. Because HTML formats are dynamic and ever-changing, I have narrowed my focus to around 10 educational web pages and to extracting info from those. However, being new to Perl, I am confused as to how to use the modules effectively. I have read and understood their functionality and can perform the basic functions, but combining regular expressions with the HTML parsing modules to extract only CERTAIN parts of the text from a page is where I keep getting stuck. Please help or advise what I should do. The code you gave works fine with the page I mentioned. Now suppose I need to write generic code that just searches for the keywords "publications" and "interests" within a given page: how do I rework the code? These small hiccups are what keep me from moving on quickly. In the end I need a GUI as well, which I think I can manage as I've worked on one before. Thanks.
      Now suppose i need to make a generic code that just searched for the keywords "publications" and "interests" within a given page, how do i reform the code.

      That's no moon, that's a space station -- Obiwan Kenobi.

      Text extraction based on known patterns is easy if you know what the start and finish of each section look like in general. However, if you are looking for a generic algorithm for logical text extraction, you need to build a text-classification/pattern-recognition engine, and that's going to be very, very difficult. Difficult, but not impossible. But that's way beyond me; besides, I don't want to lose too many brain cells over this. ;-)

      I will only cover the easy way, i.e., (deterministic) text extraction based on a set of known patterns...
      use strict;
      use warnings;
      use Data::Dumper;

      # build a hash of known patterns for each known web site
      my %patterns = (
          '' => {
              start  => "<h3><font[^>]*><b><!--KEY--></b>",
              finish => "(?<!</font>\n)<br>",
          },
          '' => {
              start  => "...",
              finish => "...",
          },
      );

      my $html = do { local $/; <DATA> };

      print ExtractSection($html, '', 'Section 2'), "\n\n";
      print ExtractSection($html, '', 'Section 1'), "\n\n";
      print ExtractSection($html, '', 'Section 3'), "\n\n";

      # -----------------------------------------------------
      sub ExtractSection
      {
          my ($html, $site, $section) = @_;

          my $ps = $patterns{$site}->{start};
          my $pf = $patterns{$site}->{finish};

          # substitute the section name for the <!--KEY--> placeholder
          $ps =~ s/<!--KEY-->/$section/;
          $pf =~ s/<!--KEY-->/$section/;

          # non-greedy match from the section start to the first finish pattern
          my ($text) = $html =~ /($ps.*?$pf)/sm;
          return $text;
      }

      __DATA__
      <HTML>
      <h3><font size=+1><b>Section 1</b></font>
      <br>
      <li>Item 1
      <li>Item 2
      <li>Item 3
      <br>
      <h3><font size=+1><b>Section 2</b></font>
      <br>
      <li>Item 4
      <li>Item 5
      <li>Item 6
      <br>
      <h3><font size=+1><b>Section 3</b></font>
      <br>
      <li>Item 7
      <li>Item 8
      <li>Item 9
      <br>
      </HTML>

      And the output -
      <h3><font size=+1><b>Section 2</b></font>
      <br>
      <li>Item 4
      <li>Item 5
      <li>Item 6
      <br>

      <h3><font size=+1><b>Section 1</b></font>
      <br>
      <li>Item 1
      <li>Item 2
      <li>Item 3
      <br>

      <h3><font size=+1><b>Section 3</b></font>
      <br>
      <li>Item 7
      <li>Item 8
      <li>Item 9
      <br>
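
On the keyword question ("publications", "interests"), one possible middle ground is a line-based heuristic applied after the tags have been stripped. The sketch below is only that, a sketch: `extract_by_keyword` is a hypothetical helper, the sample text is invented, and it assumes each heading sits on its own line and that you can list the headings a page is likely to use. It is still deterministic extraction from known patterns, not the generic engine discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the non-blank lines that follow the first
# heading containing $keyword, up to the next line that starts with any
# of the known headings.
sub extract_by_keyword {
    my ($text, $keyword, @headings) = @_;
    my $any_heading = join '|', map { quotemeta } @headings;
    my @section;
    my $inside = 0;
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*(?:$any_heading)\b/i) {
            last if $inside;                          # next heading ends the section
            $inside = 1 if $line =~ /\Q$keyword\E/i;  # heading we are looking for
            next;
        }
        push @section, $line if $inside && $line =~ /\S/;
    }
    return @section;
}

# Invented sample, standing in for the tag-stripped page text.
my $text = <<'END';
Research Interests
Robotics
Intelligent Systems

Selected Publications
Paper A
Paper B

Projects
Project X
END

my @headings = ('Research Interests', 'Selected Publications', 'Projects');
print join("\n", extract_by_keyword($text, 'publications', @headings)), "\n";
```

The heading list does the real work here: a page that labels the section differently, or puts the heading inline with other text, will slip through, which is where the per-site pattern table above remains the more reliable option.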
