PerlMonks  

Getting the text of the html page

by agynr (Acolyte)
on Jun 20, 2005 at 06:44 UTC (#468233=perlquestion)
agynr has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have a problem extracting text from an HTML page. Is it possible to open Internet Explorer with a specified page and then extract the text contents of the page (not the source)? Kindly help me. Thanks.

Re: Getting the text of the html page
by Anonymous Monk on Jun 20, 2005 at 06:56 UTC
    Is it possible to open Internet Explorer with a specified page and then extract the text contents of the page (not the source)? Kindly help me. Thanks.
    That's a question for Microsoft.
Re: Getting the text of the html page
by ercparker (Hermit) on Jun 20, 2005 at 06:57 UTC
    Here is one way using WWW::Mechanize:

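    use strict;
    use warnings;
    use WWW::Mechanize;

    # autocheck => 1 makes a failed request die instead of failing silently
    my $mech = WWW::Mechanize->new(
        autocheck  => 1,
        cookie_jar => {},
    );

    $mech->get("http://perlmonks.org/?node_id=468232");

    # format => "text" relies on HTML::TreeBuilder being installed
    print $mech->content( format => "text" );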

    That will strip all of the markup and print a text version of the page.

    Hopefully I understood your question.

    -Eric
      Hello Eric, when I try this with WWW::Mechanize, the get statement gives an error: Can't locate object method "host" via package "URI::Foreign".... Where can I get this package? It is not installed on my system.
Re: Getting the text of the html page
by gube (Parson) on Jun 20, 2005 at 08:20 UTC
Re: Getting the text of the html document
by dyer85 (Acolyte) on Jun 20, 2005 at 08:30 UTC

    If I understand your question properly, you want to strip out the HTML tags. If so, the following ought to do the trick:

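    #!/usr/bin/perl -w
    use strict;

    # CGI header; the second \r\n is the blank line that ends the headers
    print "Content-type: text/html\r\n\r\n";

    my $file = "path/to/page.html";
    open(my $fh, '<', $file) or die "Couldn't open file: $!";

    while ( my $output = <$fh> ) {
        # strip tags (only those that open and close on the same line)
        $output =~ s/<[^>]*?>//g;

        # escape what is left so the browser shows it literally
        $output =~ s/&/&amp;/g;
        $output =~ s/"/&quot;/g;
        $output =~ s/</&lt;/g;
        $output =~ s/>/&gt;/g;

        print $output . "\n";
    }
    close($fh);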

      Best to let one of the several CPAN modules do this for you. I'd look at HTML::Strip for starters.
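
For illustration, a minimal sketch of the HTML::Strip route might look like the following. It assumes HTML::Strip is installed from CPAN; the helper name strip_html and the sample markup are made up for the example and are not part of the module.

```perl
use strict;
use warnings;
use HTML::Strip;

# strip_html is a hypothetical helper name, not part of HTML::Strip.
sub strip_html {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);   # returns the input with the markup removed
    $hs->eof;                       # reset parser state before any reuse
    return $text;
}

# Made-up sample input.
print strip_html('<p>Hello <b>Monks</b>, how are you?</p>'), "\n";
```

The exact spacing of the result depends on the module's options, so treat the output as approximate plain text rather than a faithful rendering.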


      —Brad
      "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
      The only way to deal with HTML (or other mark-up languages) is to parse the HTML-code. A "simple" regex-solution is not guaranteed to work in all cases.
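
As a rough sketch of the parsing approach, something like this could work, assuming HTML::TreeBuilder (from the HTML-Tree distribution on CPAN) is available. The helper name html_to_text and the sample page are hypothetical. Unlike a line-by-line regex, the parser copes with a tag that is split across several lines.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# html_to_text is a hypothetical helper name used only for this example.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;   # concatenated text nodes, no markup
    $tree->delete;               # release the parse tree explicitly
    return $text;
}

# Made-up sample page; note the <a> tag spanning two lines.
my $page = <<'HTML';
<html><body>
<p>Some <a
  href="http://example.com/">linked</a> text</p>
</body></html>
HTML

print html_to_text($page), "\n";
```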

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        That's a good point. My little regexes there don't convert every single entity, but they strip EVERY tag and convert the <'s, >'s, quotes, and ampersands. Not much else would be left behind, honestly.

        Regardless, bradcathey seems to have a very nice solution, which is much faster than regexes anyway.
