shu has asked for the wisdom of the Perl Monks concerning the following question:

Hi. Can anyone show me code to extract the text under the headings "Research Interests" and "Selected Publications", using Perl, from the following page? Thanks.

Replies are listed 'Best First'.
Re: text extract
by b10m (Vicar) on Feb 02, 2004 at 13:21 UTC

    This can be done in many ways (as usual), but one way would be to grab the web page with LWP::Simple and parse it. The result will include HTML, so you'd have to filter that out (assuming you just want to extract the text) with, say, HTML::Parser or HTML::Strip.

    WWW::Mechanize, a very popular module amongst some monks, can probably help too, although I have been too lazy to check it out so far.

    Then, of course, you could use a program such as "lynx" to dump the page as plain text, and parse that. No need for HTML stripping or LWP-like modules then.
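
    A minimal sketch of parsing such a plain-text dump, assuming each heading sits on its own line and its section ends at the next blank line. The sub name section_after_heading and the sample text are made up for illustration (the sample is not real lynx output), and lynx is only run when a URL is passed on the command line:

```perl
use strict;
use warnings;

# Return the non-blank lines that follow $heading, up to the next blank
# line (assumption: a blank line ends the section). The heading must be
# alone on its line; the match is case-insensitive.
sub section_after_heading {
    my ($text, $heading) = @_;
    my @found;
    my $in_section = 0;
    for my $line (split /\n/, $text) {
        if (!$in_section) {
            $in_section = 1 if $line =~ /^\s*\Q$heading\E\s*$/i;
            next;
        }
        last if $line =~ /^\s*$/ && @found;   # blank line after content: done
        next if $line =~ /^\s*$/;             # blank line right after heading
        $line =~ s/^\s+|\s+$//g;              # trim indentation
        push @found, $line;
    }
    return @found;
}

# Made-up sample standing in for a text dump of a staff page.
my $dump = "Research Interests\n\n   Robotics\n   Intelligent Systems\n\nProjects\n";
print "$_\n" for section_after_heading($dump, 'Research Interests');

# Optional: read a real dump from lynx when a URL is given.
if (@ARGV) {
    open my $lynx, '-|', 'lynx', '-dump', '-nolist', $ARGV[0]
        or die "cannot run lynx: $!";
    my $text = do { local $/; <$lynx> };
    close $lynx;
    print "$_\n" for section_after_heading($text, 'Selected Publications');
}
```

    Real pages may well put blank lines inside a section, so the end-of-section rule would need adjusting per page.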

    Update: it'd be helpful if you could tell us a little more about your motives for doing this (so we know whether you want to get rid of the HTML, whether you should go for the "lynx" approach, etc.), and what you've tried so far.


    All code is usually tested, but rarely trusted.
Re: text extract
by Roger (Parson) on Feb 02, 2004 at 13:39 UTC
    I remember you came here a while ago asking for advice on how to automatically parse online staff info HTML pages into useful data. How far along are you with your project? Did you experiment with the various CPAN modules mentioned in the replies to your original post?

    Ok, I have written a demo using the following modules:
    LWP::UserAgent ... to fetch the HTML source
    HTML::Strip ... to strip HTML tags
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::Strip;

    # fetch the html source
    my $ua  = LWP::UserAgent->new;
    my $req = HTTP::Request->new(GET => ' +/eekteoh.htm');
    my $res = $ua->request($req);

    # a method call does not interpolate inside a string, so concatenate it
    die "unable to fetch HTML source: " . $res->status_line
        if !$res->is_success();

    my $html = $res->content();
    $html =~ s/\x0D//g;    # convert to unix format

    # grab the html fragments
    my ($research_interest, $selected_publications) =
        $html =~ /(Research Interests.*)(Selected Publications.*)Projects/s;

    # strip html tags
    my $hs = HTML::Strip->new();

    $research_interest = $hs->parse($research_interest);
    $research_interest =~ s/\n+/\n/sg;
    $research_interest =~ s/(Research Interests)/$1\n------------------/;

    $selected_publications = $hs->parse($selected_publications);
    $selected_publications =~ s/\n+/\n/sg;
    $selected_publications =~ s/(Selected Publications)/$1\n---------------------/;

    $hs->eof;

    print "$research_interest\n\n$selected_publications";

    And the output -
    Research Interests
    ------------------
    Computer Vision and Pattern Recognition
    Autonomous Navigation of Outdoor AGVs
    Intelligent Systems
    Robotics
    Industrial Automation

    Selected Publications
    ---------------------
    DG Shen, Harace HS Ip, ....
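
    For trying the fragment-splitting step without a network connection or HTML::Strip, here is a rough core-Perl sketch. It is not the demo itself: strip_tags is a crude regex stand-in for HTML::Strip, split_sections is a made-up name, the sample page is invented, and the captures are non-greedy (.*?) so that each one stops at the first following heading, whereas the demo uses greedy matches.

```perl
use strict;
use warnings;

# Crude stand-in for HTML::Strip: drop anything that looks like a tag,
# then tidy the whitespace. Real-world HTML needs a proper parser.
sub strip_tags {
    my ($fragment) = @_;
    $fragment =~ s/<[^>]*>//g;      # remove tags
    $fragment =~ s/[ \t]+\n/\n/g;   # remove trailing spaces on each line
    $fragment =~ s/\n+/\n/g;        # collapse runs of newlines
    $fragment =~ s/^\s+|\s+$//g;    # trim both ends
    return $fragment;
}

# Split a page into the two sections, using the heading that follows
# each section as its end marker. Returns an empty list on no match.
sub split_sections {
    my ($html) = @_;
    my ($interests, $publications) =
        $html =~ /(Research Interests.*?)(Selected Publications.*?)Projects/s;
    return unless defined $publications;
    return (strip_tags($interests), strip_tags($publications));
}

# Invented sample page, laid out like the headings named in the question.
my $sample = <<'HTML';
<h3>Research Interests</h3>
<ul><li>Robotics</li>
<li>Intelligent Systems</li></ul>
<h3>Selected Publications</h3>
<p>A. Author, "A made-up title", 2003.</p>
<h3>Projects</h3>
HTML

my ($interests, $publications) = split_sections($sample);
print "$interests\n\n$publications\n";
```

    Like the demo, this only works while "Projects" is the heading that follows the publications; a page with a different layout needs a different end marker.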

      As far as my project goes, I have made a flowchart of what I need to do and have started development. Because HTML formats are dynamic and ever-changing, I have narrowed my focus to around 10 educational web pages and to extracting info from those. However, being new to Perl, I am confused about how to use the modules effectively. I have read about and understood their functionality and can perform the basic functions, but combining regular expressions with HTML::Parser etc. to extract only CERTAIN parts of the text from a page is where I keep getting stuck. Please help or advise what I should do. The code you gave works fine with the page I mentioned. Now suppose I need to write generic code that just searches for the keywords "publications" and "interests" within a given page: how do I rework the code? These small hiccups are what keep me from moving on quickly. In the end I also need a GUI, which I think I can manage as I've worked on that before. Thanks...
        Now suppose I need to write generic code that just searches for the keywords "publications" and "interests" within a given page: how do I rework the code?

        That's no moon, that's a space station -- Obiwan Kenobi.

        Text extraction based on a known pattern is easy if you know what the start and finish of each section look like in general. However, if you are looking for a generic algorithm for logical text extraction, you need to build a text-classification/pattern-recognition engine, and that's going to be very, very difficult. Difficult, but not impossible. But that's way beyond me; besides, I don't want to lose too many brain cells over this. ;-)

        I will only cover the easy way, i.e., (deterministic) text extraction based on a set of known patterns...
        use strict;
        use warnings;
        use Data::Dumper;

        # build a hash of known patterns for each known web site
        my %patterns = (
            '' => {
                start  => "<h3><font[^>]*><b><!--KEY--></b>",
                finish => "(?<!</font>\n)<br>",
            },
            '' => {
                start  => "...",
                finish => "...",
            },
        );

        my $html = do { local $/; <DATA> };

        print ExtractSection($html, '', 'Section 2'), "\n\n";
        print ExtractSection($html, '', 'Section 1'), "\n\n";
        print ExtractSection($html, '', 'Section 3'), "\n\n";

        # -----------------------------------------------------
        sub ExtractSection
        {
            my ($html, $site, $section) = @_;

            my $ps = $patterns{$site}->{start};
            my $pf = $patterns{$site}->{finish};
            $ps =~ s/<!--KEY-->/$section/;
            $pf =~ s/<!--KEY-->/$section/;

            my ($text) = $html =~ /($ps.*?$pf)/sm;
            return $text;
        }

        __DATA__
        <HTML>
        <h3><font size=+1><b>Section 1</b></font>
        <br>
        <li>Item 1
        <li>Item 2
        <li>Item 3
        <br>
        <h3><font size=+1><b>Section 2</b></font>
        <br>
        <li>Item 4
        <li>Item 5
        <li>Item 6
        <br>
        <h3><font size=+1><b>Section 3</b></font>
        <br>
        <li>Item 7
        <li>Item 8
        <li>Item 9
        <br>
        </HTML>

        And the output -
        <h3><font size=+1><b>Section 2</b></font>
        <br>
        <li>Item 4
        <li>Item 5
        <li>Item 6
        <br>

        <h3><font size=+1><b>Section 1</b></font>
        <br>
        <li>Item 1
        <li>Item 2
        <li>Item 3
        <br>

        <h3><font size=+1><b>Section 3</b></font>
        <br>
        <li>Item 7
        <li>Item 8
        <li>Item 9
        <br>
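
        The site keys in the pattern hash above are blank as posted, so here is a self-contained sketch of the same lookup idea that can be run as-is. The key 'example-site', the sub name extract_section and the sample HTML are placeholders made up for illustration, and the sample lives in a heredoc instead of __DATA__. The section name is also quoted with \Q...\E in case it contains regex metacharacters.

```perl
use strict;
use warnings;

# Known start/finish patterns per site. 'example-site' is a hypothetical
# placeholder key, not one of the sites from the thread.
my %patterns = (
    'example-site' => {
        start  => '<h3><font[^>]*><b><!--KEY--></b>',
        finish => '(?<!</font>\n)<br>',   # a <br> not directly after </font>
    },
);

# Return the HTML of one section, or undef if the site or section is unknown.
sub extract_section {
    my ($html, $site, $section) = @_;
    my $entry = $patterns{$site} or return;
    (my $start = $entry->{start}) =~ s/<!--KEY-->/\Q$section\E/;
    my $finish = $entry->{finish};
    my ($text) = $html =~ /($start.*?$finish)/s;
    return $text;
}

# Invented sample page in the same layout as the demo data.
my $page = <<'HTML';
<h3><font size=+1><b>Section 1</b></font>
<br>
<li>Item 1
<li>Item 2
<br>
<h3><font size=+1><b>Section 2</b></font>
<br>
<li>Item 3
<li>Item 4
<br>
HTML

print extract_section($page, 'example-site', 'Section 2'), "\n";
```

        Each new site still needs its own start/finish pair worked out by hand; this does not make the extraction generic.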

      Hi there, thanks for the code. Yes, I have explored all the modules and have managed to grab the HTML pages as well as parse the HTML to get the required data. What I'm having problems with is getting just a part of the text on the page, and recognising the beginning and end of the text under the required headings. I have been able to get ALL the text, but not just the stuff I want under the specific headings. Can you help with a snippet of code? Thanks.
Re: text extract
by LordWeber (Monk) on Feb 02, 2004 at 14:33 UTC