about retrieving and parsing html without writing on disk

by limner (Novice)
on Apr 09, 2018 at 21:26 UTC (#1212612=perlquestion)

limner has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all brother monks,

I've successfully written a Perl script that retrieves an HTML page, parses it, and,
at the end, produces a logfile from the HTML page.

At the moment, the program does the following:

1) unlink the file from disk, if it exists
2) retrieve the correct HTML page into memory
3) write the HTML page to disk under a standard filename (file.html)
4) read the file from disk (file.html) and parse it
5) write the logfile to disk

What I would like to do is avoid writing "file.html" to disk and work only
in RAM: retrieve the page, NOT write it to disk, and parse it in memory.

The following are the program lines that do this:
use WWW::Mechanize;
use HTML::TableExtract;
use HTML::Entities;
use Text::Unidecode;

$user_agent='Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13';
my $mech = WWW::Mechanize->new(agent => $user_agent);

$nomefile="file.html";             ### name of temporary file
unlink $nomefile;                  ### remove the file
$url="http://www.sitename.com/pagespecial.html";
$mech->get($url);
$mech->save_content($nomefile);    ### instruction I would like to change

my $headers = ['col1', 'col2', 'col3', 'col4', 'col5'];
my $table_extract = HTML::TableExtract->new(headers => $headers);
$table_extract->parse_file($nomefile);    ### instruction I would like to change
my ($table) = $table_extract->tables;

Everything works as I want, but this way, every time I parse a page
I remove and rewrite file.html in order to parse it.

How can I do everything in memory without writing the file?
Thanks, Limner

Replies are listed 'Best First'.
Re: about retrieving and parsing html without writing on disk
by LanX (Cardinal) on Apr 09, 2018 at 22:15 UTC
    Hmm, I'm too busy to install the modules, but it's at least possible to open a variable for reading and writing.

    open my $fh, "<", \$cache;

    So if you can operate on filehandles instead of files, this should work.
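To make the filehandle idea concrete, here is a minimal sketch that uses only core Perl. The contents of `$cache` are a made-up stand-in for a page that was fetched into memory; nothing here comes from the original script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML that was fetched into memory.
my $cache = "<html>\n<body>hello</body>\n</html>\n";

# Open the scalar itself as a read-only filehandle; no file is created on disk.
open my $fh, '<', \$cache or die "Cannot open in-memory handle: $!";

# Read it line by line, exactly as one would with a handle on a real file.
my @lines = <$fh>;
close $fh;

print scalar(@lines), " lines read from memory\n";
```

Any routine that accepts a filehandle rather than a filename should be able to consume `$fh` in the same way.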

    update

    HTML::Parser allows ->parse_file($fh) and even ->parse($string)

    update

    Maybe have a look at $string = $mech->content(...) from WWW::Mechanize

    Cheers Rolf

      Maybe have a look at $string = $mech->content(...) from WWW::Mechanize

      and maybe at HTTP::Response as well, because

      $mech->get( $uri )

      returns an object of that type.

        Good note about checking $response->code and such. Along those lines, for the OP: if you use WWW::Mechanize, remember that it fails hard (dies) on any non-success response, such as the 400s and 500s, unless you set autocheck => 0. You also have access to the response object from the mech object via $mech->response, so you don't necessarily need a new variable for it.
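The points above about `autocheck => 0` and `$mech->response` could be combined roughly as follows. This is only a sketch: `fetch_html` is a hypothetical helper name invented for illustration, and the warning message is an arbitrary choice.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# fetch_html is a hypothetical helper, not part of WWW::Mechanize.
sub fetch_html {
    my ($url) = @_;

    # autocheck => 0 makes get() return on HTTP errors instead of dying.
    my $mech = WWW::Mechanize->new( autocheck => 0 );
    $mech->get($url);

    unless ( $mech->success ) {
        my $res = $mech->response;    # HTTP::Response object of the last request
        warn 'Fetch failed: ' . ( $res ? $res->status_line : 'no response' ) . "\n";
        return;
    }

    # The page body as a string; nothing is written to disk.
    return $mech->content;
}
```

A caller might then write `my $html = fetch_html($url);` and skip parsing when `$html` is undefined.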

Re: about retrieving and parsing html without writing on disk
by marto (Cardinal) on Apr 11, 2018 at 09:24 UTC
Re: about retrieving and parsing html without writing on disk
by learnedbyerror (Monk) on Apr 15, 2018 at 19:03 UTC

    The short answer is yes, you can. I don't use the exact parsing utilities that you are using, but I routinely use WWW::Mechanize and parse the content.

    Something like the following should work for you. NOTE: I did not test this exact code.

    use HTML::TableExtract;
    use WWW::Mechanize;

    my $user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13';
    my $mech = WWW::Mechanize->new( autocheck => 0, agent => $user_agent );

    # fetch the page (URL from the question)
    $mech->get('http://www.sitename.com/pagespecial.html');

    if ( $mech->success ) {
        my $html_string = $mech->content;    # page body as a string, no temp file
        my $headers = ['col1', 'col2', 'col3', 'col4', 'col5'];
        my $te = HTML::TableExtract->new( headers => $headers );
        my @tables = $te->parse($html_string)->tables;
    }
    ...

    lbe
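For trying the string-parsing step without network access, a self-contained sketch along these lines may help. The sample HTML and the two-column header list are assumptions made up for illustration; in the real script the string would come from `$mech->content` and the headers would be the five from the question.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample page standing in for $mech->content.
my $html = <<'HTML';
<html><body>
<table>
  <tr><th>col1</th><th>col2</th></tr>
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td><td>d</td></tr>
</table>
</body></html>
HTML

my $te = HTML::TableExtract->new( headers => [ 'col1', 'col2' ] );
$te->parse($html);    # parse the string directly, no file involved
$te->eof;             # signal end of input so buffered text is flushed

# Collect each data row as a comma-joined string.
my @rows;
for my $table ( $te->tables ) {
    push @rows, map { join( ',', @$_ ) } $table->rows;
}
print "$_\n" for @rows;
```

The only change relative to a file-based version is calling `parse` on a string instead of `parse_file` on a filename.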
