grabbing link and 3 regexes to save HTML to disk

by Discipulus (Curate)
on Mar 22, 2013 at 08:59 UTC ( #1024892=perlquestion )
Discipulus has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I'm rewriting the parsing part of my WebTimeLoad because I discovered that HTML::Parse is deprecated, so I want to switch to HTML::LinkExtor. I also want to make the render option (save the page to disk and display it) more accurate.

The logic of the program is: get the page content (if a frame is found, it is pushed onto the pages queue), parse the content to grab the links and put them into a %cache, then process all the links.

The code uses this setup (simplified):
    my $url;
    my %stat;    # the cache hash where pages and link are accumulated in their keys
    my $ua     = LWP::UserAgent->new;
    my $parser = HTML::LinkExtor->new;
    my $resp   = $ua->get($url);
    $parser->parse($resp->content);
    my $base   = $resp->base;

1) grab all links

    foreach my $link_found ( $parser->links ) {
        next unless $$link_found[1] eq 'src';
        my $uriobj = URI->new( $$link_found[2] );
        my $absurl = $uriobj->abs($base);
        #if is a frame add to pages adding an iteration to this sub
        if ( $$link_found[0] eq 'frame' || $$link_found[0] eq 'iframe' ) {
            push @{ $stat{'pages'} }, "$absurl";    #? need to stringify $absurl
            next;
        }
        #else is a content and we add this to the cache hash
        $stat{cache}{$absurl} = [];    # will store there length and time later on
    }
$parser->links returns an AoA. Is it safe to select everything where the second field is 'src', or do I have to select based on the link type? Can only the 'frame iframe img input layer script textarea video' tags have a src associated? Does it make sense to grab all of them to repaint the page?
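For what it's worth, here is a minimal sketch of the selection step. It assumes the documented shape of the records returned by HTML::LinkExtor's links method, [$tag, $attr => $url, ...]. The sample records and the helper name src_links are made up for illustration, and no parser or network is involved. Since a record can carry more than one attribute/URL pair, the sketch looks at all pairs instead of only index 1.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape HTML::LinkExtor->links returns:
# [ $tag, $attr1 => $url1, $attr2 => $url2, ... ]
my @links = (
    [ 'img',    src        => 'img/logo.png', usemap => '#map' ],
    [ 'a',      href       => 'about.html' ],
    [ 'script', src        => 'js/jquery.color.js?ver=2.0' ],
    [ 'body',   background => 'bg.gif' ],
    [ 'iframe', src        => 'frames/menu.html' ],
);

# Return [ $tag, $url ] for every 'src' attribute, wherever it sits in the record.
sub src_links {
    my @found;
    for my $link (@_) {
        my ( $tag, %attr ) = @$link;
        push @found, [ $tag, $attr{src} ] if exists $attr{src};
    }
    return @found;
}

my ( @pages, %cache );
for my $pair ( src_links(@links) ) {
    my ( $tag, $url ) = @$pair;
    if ( $tag eq 'frame' || $tag eq 'iframe' ) {
        push @pages, $url;    # frames are queued as pages
    }
    else {
        $cache{$url} = [];    # everything else goes to the cache
    }
}

print "pages: @pages\n";
print "cache: ", join( ', ', sort keys %cache ), "\n";
```

Note that resources referenced through other attributes (a stylesheet via the href of a link tag, or a body background) are skipped by a src-only filter, so a repainted page may still miss them.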

2) modify the page

I want to modify the page before writing it to disk, so that every src points to a local resource and all web characters not permitted on the filesystem are translated (because some links are naughty, such as www.it.org/js/jquery/jquery.color.js?ver=2.0-4561m):

    if ($render){
        mkdir "$ENV{TEMP}\\_temp_files"||die;
        open RENDER, "> $ENV{TEMP}/_temp.html"|| die "unable to write to %TEMP%\\_temp.html";
        # localize src
        (my $localcont = $resp->content ) =~s/src="([^"]*)\//src=".\/_temp_files\//gm;
        # translate chars to be filesystem safe
        $localcont =~ s/(:?src=".\/_temp_files\/)[\?=&,;:]+(:?")/_/gm;
        print RENDER $localcont;
        close RENDER;
    }
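One possible way to do both substitutions of step 2 in a single pass, sketched with core Perl only. The helper names local_name and localize_src and the sample HTML are assumptions for illustration. Only double-quoted src attributes are handled, and a real page would probably be better served by an HTML parser than by a regex.

```perl
use strict;
use warnings;

# Turn a URL into a file name that is safe on the filesystem:
# keep only what follows the last slash, then replace unsafe characters.
sub local_name {
    my ($url) = @_;
    ( my $name = $url ) =~ s{^.*/}{};
    $name =~ s{[?=&,;:]}{_}g;
    return $name;
}

# Point every double-quoted src attribute at ./_temp_files/<safe name>.
sub localize_src {
    my ($html) = @_;
    $html =~ s{src="([^"]*)"}{'src="./_temp_files/' . local_name($1) . '"'}ge;
    return $html;
}

my $html = '<script src="http://www.it.org/js/jquery/jquery.color.js?ver=2.0-4561m"></script>';
print localize_src($html), "\n";
```

Using the same local_name helper when the resources themselves are saved keeps the rewritten src values and the file names on disk in agreement. A URL ending in a slash yields an empty name here, which a real program would have to handle.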

3) sanitize the resources in the same way, to be filesystem safe

    # foreach link's $url
    if ($render){
        (my $ele = $url )=~s/^.*\///;
        $ele =~ s/[\?=&,;:]/_/gm;    ##same regex as above?
        open RENDER, "> $ENV{TEMP}\\_temp_files\\$ele"|| die "unable to write to %TEMP%\\_temp_files\\$ele";
        binmode RENDER;
        print RENDER $resp->content;
        close RENDER;
    }

With the code shown above I get many errors (binmode on closed filehandle...) and missing elements in the page. Can someone show me a better way to do this? A working regex, or a completely different way?
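The "binmode on closed filehandle" message suggests that an open failed without the die ever firing. Below is a hedged sketch of step 3 with every call checked, using a three-argument open and a lexical filehandle. save_resource is a hypothetical helper, File::Temp only supplies a scratch directory for the demo, and the byte string stands in for $resp->content.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: write raw bytes to $dir/$name, dying on any failure.
sub save_resource {
    my ( $dir, $name, $bytes ) = @_;
    $name =~ s{[?=&,;:]}{_}g;    # make the name filesystem safe
    my $path = File::Spec->catfile( $dir, $name );
    open my $fh, '>', $path or die "unable to write to $path: $!";
    binmode $fh;                 # raw bytes, no newline translation
    print {$fh} $bytes;
    close $fh or die "unable to close $path: $!";
    return $path;
}

my $dir  = tempdir( CLEANUP => 1 );    # scratch directory for the demo
my $path = save_resource( $dir, 'jquery.color.js?ver=2.0', "\x89PNG\r\n" );
print "wrote ", ( -s $path ), " bytes to $path\n";
```

With the low-precedence or, a failed open stops the program with the system error in $!, instead of letting binmode and print run on a handle that was never opened.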

thanks in advance for the patience
L*
there are no rules, there are no thumbs..

Re: grabbing link and 3 regexes to save HTML to disk
by Athanasius (Monsignor) on Mar 22, 2013 at 13:01 UTC

    Hello Discipulus,

    I don’t have an answer to your question, sorry, just a few comments on syntax:

    • The comma operator has a lower precedence than ||, so a line such as:

      open RENDER, "> $ENV{TEMP}/_temp.html" || die "unable to write to %TEMP%\\_temp.html";

      actually parses as:

      open RENDER, ( "> $ENV{TEMP}/_temp.html" || die "unable to write to %TEMP%\\_temp.html" );

      which is not what you want. Either change || to the lower-precedence or, or put the arguments to open into parentheses.

    • In a regex, (:?X) captures X preceded by zero or one literal colons. For clustering (which is non-capturing), you need (?:X).

    • You can avoid “leaning toothpick syndrome” by using regex delimiters other than the forward slash:

      s{src="([^"]*)/}{src="./_temp_files/}gm
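The three points above can be checked with a few lines of core Perl. This is only a sketch: always_fails is a hypothetical stand-in for an open that fails, and the sample strings are made up.

```perl
use strict;
use warnings;

# Stand-in for a failing open: a list operator that always returns false.
sub always_fails { return 0 }

# '||' binds tighter than the comma, so it tests the (always true) string,
# not the return value of the call: die can never fire here.
my $r1 = eval { always_fails 'RENDER', '> file' || die "never reached\n"; 'no die' };

# 'or' binds looser than the list operator, so it tests the return value.
my $r2  = eval { always_fails 'RENDER', '> file' or die "failed\n"; 'no die' };
my $err = $@;

print "with ||: ", ( defined $r1 ? $r1 : 'died' ), "\n";
print "with or: ", ( defined $r2 ? $r2 : "died: $err" );

my $text = ':src="a.js" src="b.js"';

# (:?src) is a capturing group: an optional literal colon followed by 'src'.
my @captured = $text =~ /(:?src)/g;

# (?:src) only groups; with no capturing group, //g returns the whole matches.
my @grouped = $text =~ /(?:src)/g;

print "captured: @captured\n";
print "grouped:  @grouped\n";

# Brace delimiters: no need to escape the forward slashes.
( my $local = 'src="http://x/y/z.js"' ) =~ s{src="([^"]*)/}{src="./_temp_files/};
print "$local\n";
```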

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Hello Athanasius (I wish our nicks came true)

      many thanks for your points:
      • I never realized this about precedence: commas bite me every time. I reopened perldoc and I see that every example uses parentheses! I never used them (bad). I'll take care in the future.
      • as you see, I am lacking a lot with regexes (I was trying to install yape-regex-explain but got stuck on a 5.8 version..)
      • LOL .. imagine a non-native English speaker mentally translating this syndrome.. lol, now I know it is an idiom born in Perl's culture. I'm chronic with that syndrome because I have always used a colorized Perl IDE.. but I'll try
      thanks a lot for the kindness, even if OT.

      L*

      there are no rules, there are no thumbs..
