PerlMonks  

Re: save a page as text

by Hero Zzyzzx (Curate)
on Apr 22, 2005 at 00:43 UTC (#450244)


in reply to save a page as text

No need to involve a browser at all. Here's one way, using the excellent HTML::TokeParser::Simple by the monastery's own Ovid.

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TokeParser::Simple;

my $url  = 'http://www.page.you.want.com/some/path';
my $page = get($url);

# get() returns undef on failure, so check before parsing
die "Could not fetch $url\n" unless defined $page;

my $p = HTML::TokeParser::Simple->new( \$page );

while ( my $token = $p->get_token ) {
    # This prints all text in an HTML doc (i.e., it strips the HTML)
    next unless $token->is_text;
    print $token->as_is;
}

Stuffing it into a file is left as an exercise for the poster.
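For anyone who wants the exercise spelled out, here is a minimal sketch of one way to do it, not the only one. The sub names html_to_text and save_text, the script name in the usage comment, and the choice of UTF-8 for the output file are my own assumptions; the parsing loop is the same one as above.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TokeParser::Simple;

# Return only the text tokens of an HTML string (the tags are dropped).
sub html_to_text {
    my ($html) = @_;
    my $p    = HTML::TokeParser::Simple->new( \$html );
    my $text = '';
    while ( my $token = $p->get_token ) {
        next unless $token->is_text;
        $text .= $token->as_is;
    }
    return $text;
}

# Write a string to a file, dying with the OS error on failure.
# Assumption: UTF-8 is an acceptable encoding for the output file.
sub save_text {
    my ( $text, $file ) = @_;
    open my $fh, '>:encoding(UTF-8)', $file
        or die "Cannot open $file: $!";
    print {$fh} $text;
    close $fh or die "Cannot close $file: $!";
    return;
}

# Usage (script name is hypothetical): perl page2text.pl URL OUTFILE
if ( @ARGV == 2 ) {
    require LWP::Simple;    # only needed when actually fetching a page
    my ( $url, $file ) = @ARGV;
    my $page = LWP::Simple::get($url);
    die "Could not fetch $url\n" unless defined $page;
    save_text( html_to_text($page), $file );
}
```

Keeping the fetch, the stripping, and the file write separate makes it easy to try the stripping on a literal string first. Note that this still only sees what the server sends, so text generated by JavaScript in the browser will not appear in the output.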

-Any sufficiently advanced technology is
indistinguishable from doubletalk.



Re^2: save a page as text
by Anonymous Monk on Apr 22, 2005 at 00:56 UTC
    I can't just strip the HTML; if that were enough, I'd know how to do it myself. There is JavaScript in the page that prints something out, and I need to retrieve that output.

    I can't get it from the page source because the source contains only the JS code, not the data it prints. So I need a way to write a Perl screen scraper that captures the text of the page as displayed, without any HTML markup in the result.

      Sorry, I was a bit confused by the question. I do very little on or for Windows, so hopefully someone more experienced with automating the evil empire will speak up.

      -Any sufficiently advanced technology is
      indistinguishable from doubletalk.


      JavaScript has long been a problem for page scraping. People have tried to work around it by, say, recording the actual HTTP parameters, but that doesn't help with your problem. The other approach is to drive IE using Win32::OLE. I have used Win32::IE::Mechanize before, but it's mainly for navigation and parsing; someone would still need to figure out how to call the "Save As" method through COM.

      I didn't know that "Save As Text" would capture the output of JavaScript printing. I tried it out, and apparently it does.

      Update: I just saw the module Win32::CaptureIE; it looks more promising.

        Being a *nix user, anything that drives IE wouldn't be very useful to me, so my approach would be to examine the JavaScript and port it to Perl.
