http://www.perlmonks.org?node_id=468233

agynr has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have a problem extracting the text from an HTML page. Is it possible to open Internet Explorer with a specified page and then extract the text contents of the page (not the source)? Kindly help me. Thanks.

Replies are listed 'Best First'.
Re: Getting the text of the html page
by ercparker (Hermit) on Jun 20, 2005 at 06:57 UTC
    Here is one way using WWW::Mechanize:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new(
        autocheck  => 1,
        cookie_jar => {},
    );

    $mech->get("http://perlmonks.org/?node_id=468232");
    print $mech->content( format => "text" );

    That will strip all of the markup and print a text version of the page.

    Hopefully I understood your question.

    -Eric
      Hello Eric, when I try WWW::Mechanize, it gives an error on the get statement. The error reads: Can't locate object method "host" via package "URI::Foreign".... Where can I get this package? It is not installed on my system.
Re: Getting the text of the html page
by gube (Parson) on Jun 20, 2005 at 08:20 UTC
Re: Getting the text of the html document
by dyer85 (Acolyte) on Jun 20, 2005 at 08:30 UTC

    If I understand your question properly, I think you mean you want to strip out the HTML tags. If so, the following ought to do the trick:

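    #!/usr/bin/perl -w
    use strict;

    # CGI header; the blank line after it ends the header block
    print "Content-type: text/html\r\n\r\n";

    my $file = "path/to/page.html";
    open(my $fp, '<', $file) or die "Couldn't open file: $!";

    while ( my $output = <$fp> ) {
        # strip anything that looks like a tag
        $output =~ s/<[^>]*?>//g;

        # escape what is left so it displays literally in a browser
        $output =~ s/&/&amp;/g;
        $output =~ s/"/&quot;/g;
        $output =~ s/</&lt;/g;
        $output =~ s/>/&gt;/g;

        print $output;    # each line already ends in a newline
    }

    close($fp);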

      Best to let one of the several CPAN modules do this for you. I'd look at HTML::Strip for starters.
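For illustration, a minimal sketch of the HTML::Strip approach, assuming the module has been installed from CPAN. The helper name html_to_text and the sample markup are hypothetical, and the helper returns nothing when the module cannot be loaded:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of $html with the markup removed,
# or an empty return if HTML::Strip is not installed on this system.
sub html_to_text {
    my ($html) = @_;
    return unless eval { require HTML::Strip; 1 };

    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;    # reset the parser state before handling another document
    return $text;
}

# Sample markup, made up for this example.
my $text = html_to_text('<p>Hello <b>world</b></p>');
print defined $text ? "$text\n" : "HTML::Strip is not installed\n";
```

Calling eof after each document matters when the same HTML::Strip object is reused, because the parser otherwise carries its state from one parse call into the next.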


      —Brad
      "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
      The only way to deal with HTML (or other mark-up languages) is to parse the HTML-code. A "simple" regex-solution is not guaranteed to work in all cases.
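A small sketch of where the tag-stripping regex from the post above can go wrong. The helper name strip_tags_naive and the three sample inputs are made up for this illustration; only the substitution itself comes from the thread:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the s/<[^>]*?>//g substitution shown earlier.
sub strip_tags_naive {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*?>//g;
    return $text;
}

# Hypothetical inputs: a ">" inside an attribute value, a "<" inside
# script code, and a ">" inside a comment.
my %samples = (
    attribute => '<img alt="a > b" src="x.png">done',
    script    => '<script>if (a < b) { go(); }</script>text',
    comment   => '<!-- a > b -->visible',
);

for my $name (sort keys %samples) {
    printf "%-10s => [%s]\n", $name, strip_tags_naive( $samples{$name} );
}
```

In each case the regex stops at the first ">" it sees, so part of the attribute, the script code, or the comment leaks into the output. A parser knows which context it is in and does not make that mistake.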

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        That's a good point. My little regexes there don't convert every single entity, but they strip EVERY tag and convert the <'s, >'s, quotes, and ampersands. Not much else would be left behind, honestly.

        Regardless, bradcathey seems to have a very nice solution, which is much faster than regex anyway.

Re: Getting the text of the html page
by Anonymous Monk on Jun 20, 2005 at 06:56 UTC
    Is it possible to open Internet explorer with a specified page and then extract the text contents of the page(not the source). Kindly help me. Thanx
    That's a question for Microsoft.