
Re: save a page as text

by Hero Zzyzzx (Curate)
on Apr 22, 2005 at 00:43 UTC (#450244)

in reply to save a page as text

No need to involve a browser at all. Here's one way, using the excellent HTML::TokeParser::Simple by the monastery's own Ovid.

#!/usr/bin/perl
use warnings;
use strict;

use LWP::Simple;
use HTML::TokeParser::Simple;

my $page = get('');
my $p    = HTML::TokeParser::Simple->new( \$page );

while ( my $token = $p->get_token ) {
    # This prints all text in an HTML doc (i.e., it strips the HTML)
    next unless $token->is_text;
    print $token->as_is;
}

Stuffing it into a file is left as an exercise for the poster.
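
For anyone who wants the exercise worked out, here is a minimal sketch rather than a definitive answer. The helper names html_to_text and save_text are my own invention for illustration, and the UTF-8 output encoding is an assumption. It uses the same HTML::TokeParser::Simple calls as the script in this post, but collects the text into a string and then writes it out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: return only the text tokens of an HTML string.
# Tags and comments are dropped; entities are left as they appear in the source.
sub html_to_text {
    my ($html) = @_;
    my $p    = HTML::TokeParser::Simple->new( \$html );
    my $text = '';
    while ( my $token = $p->get_token ) {
        next unless $token->is_text;
        $text .= $token->as_is;
    }
    return $text;
}

# Hypothetical helper: write a string to a file, dying on any I/O error.
# UTF-8 output is an assumption; adjust the layer to suit the page.
sub save_text {
    my ( $path, $text ) = @_;
    open( my $fh, '>:encoding(UTF-8)', $path )
        or die "Cannot open $path: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close $path: $!";
    return;
}
```

With the $page fetched by LWP::Simple, something like save_text( 'page.txt', html_to_text($page) ) would do it, where page.txt is just a placeholder name. LWP::Simple's get returns undef on failure, so check that $page is defined first.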

-Any sufficiently advanced technology is
indistinguishable from doubletalk.


Re^2: save a page as text
by Anonymous Monk on Apr 22, 2005 at 00:56 UTC
    I can't just strip the HTML; if I could, I'd know how to do that myself. There is JavaScript in the page that prints something out, and I need to retrieve what it prints.

    I can't retrieve it from the source code because only the JS code is there, not the data it prints. So I need a way to make a Perl screen scraper that scrapes the rendered text from a page without picking up any HTML markup.

      JavaScript has long been a problem for page scraping. People have tried to get around it by, say, recording the actual HTTP parameters, but that is not relevant to your problem. The other approach is to drive IE using Win32::OLE. I have used Win32::IE::Mechanize before, but it's mainly for navigation and parsing; someone would need to figure out how to call the "Save As" method through COM.

      I didn't know that "Save As Text" would evaluate JavaScript output. I tried it out, and apparently it works.

      Update: I just saw the module Win32::CaptureIE; it looks more promising.

        Being a *nix user, anything that drives IE wouldn't be very useful to me, so my approach would be to examine the JavaScript and port it to Perl.

      Sorry, I was a bit confused by the question. I do very little on or for Windows, so hopefully someone more experienced with automating the evil empire will speak up.

      -Any sufficiently advanced technology is
      indistinguishable from doubletalk.

