Getting the text of the html page

agynr has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have a problem extracting text from an HTML page. Is it possible to open Internet Explorer with a specified page and then extract the text contents of the page (not the source)? Kindly help me. Thanks

Re: Getting the text of the html page by ercparker (Hermit) on Jun 20, 2005 at 06:57 UTC
Here is one way using WWW::Mechanize: `use strict; use warnings; use WWW::Mechanize; my $mech = WWW::Mechanize->new( autocheck => 1, cookie_jar => {}, ); $mech->get("http://perlmonks.org/?node_id=468232"); print $mech->content( format => "text" );` That will strip all of the markup and print a text version of the page. Hopefully I understood your question. -Eric
Re^2: Getting the text of the html page by agynr (Acolyte) on Jun 20, 2005 at 07:30 UTC
Hello Eric, when I use WWW::Mechanize, the get statement fails with this error: Can't locate object method "host" via package "URI::Foreign".... Where can I get this package? It is not installed on my system.
Re^3: Getting the text of the html page by ank (Scribe) on Jun 20, 2005 at 08:13 UTC
You'll find these references useful: "A guide to installing modules" and "Writing, Installing, and Using Perl Modules". Also, take a look at CPAN. -- ank
Re^4: Getting the text of the html page by agynr (Acolyte) on Jun 20, 2005 at 08:27 UTC
Re^5: Getting the text of the html page by ank (Scribe) on Jun 20, 2005 at 08:50 UTC
Re: Getting the text of the html page by gube (Parson) on Jun 20, 2005 at 08:20 UTC
Refer to this node: Extract Web Page
Re: Getting the text of the html document by dyer85 (Acolyte) on Jun 20, 2005 at 08:30 UTC
If I understand your question properly, I think you mean you want to strip out the HTML tags. If so, the following ought to do the trick: `use strict; use warnings; print "Content-type: text/plain\r\n\r\n"; my $file = "path/to/page.html"; open(my $fh, '<', $file) or die "Couldn't open file: $!"; while ( my $output = <$fh> ) { $output =~ s/<[^>]*?>//g; $output =~ s/&quot;/"/g; $output =~ s/&lt;/</g; $output =~ s/&gt;/>/g; $output =~ s/&amp;/&/g; print $output; } close($fh);` My Site in Progress
Re^2: Getting the text of the html document by bradcathey (Prior) on Jun 20, 2005 at 12:34 UTC
Best to let one of the several CPAN modules do this for you. I'd look at HTML::Strip for starters. -- Brad "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
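As a rough illustration of the HTML::Strip suggestion, a minimal sketch might look like the following. It assumes HTML::Strip is installed from CPAN; the `$raw_html` string is a made-up sample standing in for the real page source.

```perl
use strict;
use warnings;
use HTML::Strip;

# Hypothetical sample input; in practice this would be the fetched page source.
my $raw_html = '<html><body><h1>Title</h1><p>Fish and chips</p></body></html>';

my $hs   = HTML::Strip->new();      # parser with default options
my $text = $hs->parse($raw_html);   # returns the input with the markup removed
$hs->eof;                           # reset parser state before parsing another document

print "$text\n";
```

The exact whitespace in the result depends on the module's options, so it is worth inspecting the output on a real page before relying on it.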
Re^2: Getting the text of the html document by CountZero (Bishop) on Jun 20, 2005 at 13:12 UTC
The only way to deal with HTML (or other mark-up languages) is to parse the HTML-code. A "simple" regex-solution is not guaranteed to work in all cases. CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
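To make the point concrete, here is a small sketch comparing the tag-stripping regex from the earlier reply with a real parser. It assumes HTML::TreeBuilder (from the HTML-Tree distribution on CPAN) is installed; the sample markup is invented, chosen because it has a ">" inside a quoted attribute value.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: note the ">" inside the title attribute.
my $html = '<p>See <a href="/x" title="a > b">this link</a> now.</p>';

# Regex approach: the match for the <a> tag ends at the first ">",
# so part of the attribute value leaks into the "text".
(my $regex_text = $html) =~ s/<[^>]*?>//g;

# Parser approach: build a tree, then ask for the text content only.
my $tree        = HTML::TreeBuilder->new_from_content($html);
my $parsed_text = $tree->as_text;
$tree->delete;    # free the tree explicitly

print "regex:  $regex_text\n";
print "parser: $parsed_text\n";
```

The regex output still contains a fragment of the attribute, while the parsed version contains only the visible text.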
Re^3: Getting the text of the html document by dyer85 (Acolyte) on Jul 19, 2005 at 09:26 UTC
That's a good point. My little regexes there don't convert every single entity, but they strip EVERY tag and convert the &lt;, &gt;, &quot;, and &amp; entities. Not much else would be left behind, honestly. Regardless, bradcathey seems to have a very nice solution, which is much faster than the regexes anyway. My Site in Progress
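For the entities that the hand-written regexes miss, one option is to let HTML::Entities do the decoding. This is only a sketch, assuming HTML::Entities (part of the HTML-Parser distribution) is installed; the `$encoded` string is an invented example.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical sample with named and numeric entities.
my $encoded = 'caf&eacute; &amp; cr&egrave;me &lt;br&gt; &#169; 2005';

# In non-void context decode_entities returns a decoded copy.
my $decoded = decode_entities($encoded);

binmode STDOUT, ':encoding(UTF-8)';   # the result contains non-ASCII characters
print "$decoded\n";
```

Decoding should happen after the tags are stripped, otherwise a decoded "&lt;" could be mistaken for the start of a tag.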
Re^4: Getting the text of the html document by davorg (Chancellor) on Jul 19, 2005 at 09:36 UTC
Re: Getting the text of the html page by Anonymous Monk on Jun 20, 2005 at 06:56 UTC
"Is it possible to open Internet Explorer with a specified page and then extract the text contents of the page (not the source)? Kindly help me. Thanks" That's a question for Microsoft.