 
PerlMonks  

Browser automation to copy webpage to text

by eversuhoshin (Sexton)
on Oct 20, 2015 at 22:11 UTC

eversuhoshin has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I desperately need help with browser automation.

Specifically, I would like to copy what I see on a web page to either a Word file or a text file, keeping the formatting.

For instance, I would like to copy what I see on

http://www.sec.gov/Archives/edgar/data/1557421/000100201412000509/iogcs1-9132012.htm

to a Word or text file, exactly as I see it in the browser.

Thank you so much!

Re: Browser automation to copy webpage to text
by Athanasius (Archbishop) on Oct 21, 2015 at 03:55 UTC

    Hello eversuhoshin,

    I would like to copy what I see ... to a word or text file, exactly as I see it on the browser

    The requirement is unclear, especially as an HTML page contains markup which can’t be translated into plain text.

    Here is a plain text approach using LWP::Simple to get the web page and HTML::FormatText to extract the text from the HTML:

        #! perl
        use strict;
        use warnings;
        use HTML::FormatText;
        use LWP::Simple;

        my $address = 'http://www.sec.gov/Archives/edgar/data/1557421/' .
                      '000100201412000509/iogcs1-9132012.htm';

        # get() returns undef on failure; it does not set $!,
        # so the error message reports only the address.
        my $content = get($address);
        defined $content or die "Cannot read '$address'\n";

        # Render the HTML as plain text wrapped between columns 5 and 75.
        my $string  = HTML::FormatText->format_string
                      (
                          $content,
                          leftmargin  =>  5,
                          rightmargin => 75,
                      );
        print $string;

    Output (opening lines only):

    I’m not sure whether that output suits your needs? You could also look at HTML::HTML5::ToText.

    To produce a Word-readable file, change HTML::FormatText to HTML::FormatRTF:

        use strict;
        use warnings;
        use HTML::FormatRTF;
        use LWP::Simple;

        my $outfile = 'test.rtf';
        my $address = 'http://www.sec.gov/Archives/edgar/data/1557421/' .
                      '000100201412000509/iogcs1-9132012.htm';

        # get() returns undef on failure; it does not set $!,
        # so the error message reports only the address.
        my $content = get($address);
        defined $content or die "Cannot read '$address'\n";

        open(my $rtf, '>', $outfile)
            or die "Cannot open file '$outfile' for writing: $!";

        # Convert the HTML to RTF, which Word can open directly.
        print $rtf HTML::FormatRTF->format_string($content);

        close $rtf or die "Cannot close file '$outfile': $!";

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Thank you so much! This is very helpful.

      Would there be a way for me to save the entire web page as a PDF instead of an RTF?

      I realize that even with RTF, some of the formatting is broken.

      Ideally, I would like the web page to be saved as a PDF and then copied into Microsoft Word.

      Again, thank you so much.

        For a Perl solution, you can try PDF::FromHTML — if you can get it to install. :-(
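
        A minimal sketch of the PDF::FromHTML route, assuming the module does install and that the page has already been saved to a local HTML file. The sub name html_file_to_pdf and the file names in the usage line are hypothetical; the new, load_file, convert and write_file calls follow the module's documented synopsis. The module is loaded at run time so that a missing install gives a readable error.

```perl
use strict;
use warnings;

# Convert a local HTML file to PDF with PDF::FromHTML.
# html_file_to_pdf is a hypothetical helper name, not part of the module.
sub html_file_to_pdf {
    my ($html_file, $pdf_file) = @_;

    # Load lazily: the module can be hard to install, so fail with a
    # clear message at run time rather than at compile time.
    eval { require PDF::FromHTML; 1 }
        or die "PDF::FromHTML is not available: $@";

    my $pdf = PDF::FromHTML->new(encoding => 'utf-8');
    $pdf->load_file($html_file);        # parse the saved HTML
    $pdf->convert(LineHeight => 10);    # lay the content out as PDF pages
    $pdf->write_file($pdf_file);        # write the result to disk
    return $pdf_file;
}

# Hypothetical usage: perl html2pdf.pl page.html page.pdf
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;
    html_file_to_pdf($in, $out);
    print "Wrote $out\n";
}
```

        Complex table layouts, such as those in SEC filings, may still not survive the conversion intact.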

        For automated, non-Perl solutions, you can look at something like HTMLDOC (free, but you have to build it from source), or Doxillion Document Converter (not free).
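
        If HTMLDOC is installed, it can still be driven from Perl. The sketch below assumes the htmldoc binary is on the PATH; the helper name htmldoc_command is hypothetical. The --webpage and -f options are HTMLDOC's own: the first treats the input as an unstructured web page, the second names the output file.

```perl
use strict;
use warnings;

# Build the argument list for an HTMLDOC run.
# Assumption: the 'htmldoc' executable is on the PATH.
sub htmldoc_command {
    my ($source, $pdf_file) = @_;
    return ('htmldoc', '--webpage', '-f', $pdf_file, $source);
}

# Hypothetical usage: perl run_htmldoc.pl page.html page.pdf
if (@ARGV == 2) {
    my @cmd = htmldoc_command(@ARGV);

    # List-form system() bypasses the shell, so the source name or URL
    # needs no extra quoting.
    system(@cmd) == 0
        or die "Command '@cmd' failed (wait status $?)\n";
    print "Wrote $ARGV[1]\n";
}
```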

        But you’ll probably get the best results by manually saving (or “printing”) the page to PDF format in your browser. For example, in Google Chrome select Print..., then under Destination click the Change button and select Save as PDF. In Firefox, install the “Save as PDF” add-on which places a Save as PDF by pdfcrown.com button on the address bar.
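
        Later versions of Chrome can perform the same "Save as PDF" step without a window, through the --headless and --print-to-pdf command-line switches. The sketch below is one way to call that from Perl. The binary name google-chrome, the CHROME_BIN environment variable and the helper name chrome_pdf_command are assumptions; adjust them for your system.

```perl
use strict;
use warnings;

# Build the argument list that asks headless Chrome to print a URL to PDF.
# chrome_pdf_command is a hypothetical helper name.
sub chrome_pdf_command {
    my ($chrome, $url, $pdf_file) = @_;
    return (
        $chrome,
        '--headless',                  # run without opening a window
        '--disable-gpu',               # commonly recommended with headless mode
        "--print-to-pdf=$pdf_file",    # where to write the PDF
        $url,
    );
}

# Hypothetical usage: perl chrome_pdf.pl URL out.pdf
if (@ARGV == 2) {
    # Assumption: the binary is called 'google-chrome' unless the
    # (hypothetical) CHROME_BIN environment variable says otherwise.
    my $chrome = $ENV{CHROME_BIN} || 'google-chrome';
    my @cmd    = chrome_pdf_command($chrome, @ARGV);

    system(@cmd) == 0
        or die "Command '@cmd' failed (wait status $?)\n";
    print "Wrote $ARGV[1]\n";
}
```

        Because the browser's own rendering engine produces the PDF, the result should match what the Print dialog gives you.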

        You may be able to automate this browser-based approach from Perl via a module such as WWW::Mechanize::Firefox; but that’s way outside my experience.

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Browser automation to copy webpage to text
by u65 (Chaplain) on Oct 20, 2015 at 22:52 UTC

    I'm not sure that what you want can be done exactly as you wish (but if it can, someone here can tell you how to do it). In the meantime, could you please put the link you cite inside square brackets to make it a hot link? Thanks.
