grabbing link and 3 regexes to save HTML to disk
by Discipulus (Priest)
on Mar 22, 2013 at 08:59 UTC
Discipulus has asked for the
wisdom of the Perl Monks concerning the following question:
I'm rewriting the parsing part of my WebTimeLoad because I discovered that HTML::Parse is deprecated, so I want to switch to HTML::LinkExtor. I also want to make the render option (save the page to disk and display it) more accurate.
The logic of the program is: get the page content (if a frame is found, it is pushed onto the pages queue), parse the content to grab the links and put them into a %cache, then process all the links.
The code uses this setup (simplified):
1) Grab all links
$parser->links returns an AoA. Is it safe to select every entry where the third field is 'src', or do I have to select based on the link type? Can only the 'frame iframe img input layer script textarea video' tags have a src associated? Does it make sense to grab all of them to repaint the page?
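A minimal sketch of step 1, assuming the documented shape of each entry returned by HTML::LinkExtor's `links` method: `[ $tag, $attr => $url, $attr => $url, ... ]`. Under that layout the attribute name comes before its URL and one tag can carry several pairs, so unpacking the pairs into a hash is safer than testing a fixed position. The sample markup and the `@resources` / `@frames` names are invented for illustration. Note that a src-only filter would miss stylesheets, which are referenced through `<link href=...>`.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Sample markup, invented for this sketch.
my $html = <<'HTML';
<html><body>
<a href="page2.html">next</a>
<img src="img/logo.png">
<script src="/js/jquery/jquery.color.js?ver=2.0-4561m"></script>
<iframe src="inner.html"></iframe>
</body></html>
HTML

# Without a callback the parser simply collects the links it finds.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

my (@resources, @frames);
for my $link ($parser->links) {
    # Each entry is [ $tag, attr => url, attr => url, ... ].
    my ($tag, %attr) = @$link;
    next unless defined $attr{src};

    if ($tag eq 'frame' or $tag eq 'iframe') {
        push @frames, $attr{src};       # candidates for the pages queue
    }
    else {
        push @resources, $attr{src};    # needed to repaint the page
    }
}

print "resource: $_\n" for @resources;
print "frame:    $_\n" for @frames;
```

Keeping frames apart from the other src values matches the program logic described above, where frames go back into the pages queue instead of being saved as plain resources.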
2) Modify the page
I want to modify the page before writing it to disk so that every src points to a local resource and all web chars not permitted on the filesystem are translated (because some links are naughty, like www.it.org/js/jquery/jquery.color.js?ver=2.0-4561m):
3) Sanitize the resources in the same way so they are filesystem safe
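One possible way to keep steps 2 and 3 consistent is to derive the local name from a single function and use the resulting map both for rewriting the page and for naming the files on disk. The sketch below is an assumption, not the original code: `local_name`, the `%local` map, the sample page, and the `http://` prefix on the URLs are all hypothetical. It only handles quoted src values whose text matches the extracted URL exactly.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a URL into one flat, filesystem-safe file name.
sub local_name {
    my ($url) = @_;
    (my $name = $url) =~ s{^[a-z][a-z0-9+.-]*://}{}i;   # drop the scheme, if any
    $name =~ s{[^A-Za-z0-9._-]}{_}g;                    # '/', '?', '=' and the like become '_'
    return $name;
}

# Sample page and URL list, invented for this sketch.
my $page = '<script src="http://www.it.org/js/jquery/jquery.color.js?ver=2.0-4561m"></script>'
         . '<img src="http://www.it.org/img/logo.png">';
my @urls = (
    'http://www.it.org/js/jquery/jquery.color.js?ver=2.0-4561m',
    'http://www.it.org/img/logo.png',
);

# One map drives both the page rewrite and the names used when saving files.
my %local = map { $_ => local_name($_) } @urls;

for my $url (sort { length $b <=> length $a } keys %local) {
    # \Q...\E makes '?', '.' and friends in the URL match literally.
    $page =~ s/(\bsrc\s*=\s*["'])\Q$url\E(["'])/$1$local{$url}$2/gi;
}

print "$page\n";
```

Because every risky character collapses to `_`, two different URLs could in principle map to the same name; a real program would probably want to detect that case before overwriting a file.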
With the code shown above I get many errors (binmode on closed filehandle..) and missing elements in the page. Can someone show me a better way to do this? A working regex, or a completely different way?
Thanks in advance for your patience.
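A "binmode on closed filehandle" warning usually means that an earlier `open` failed and its return value was not checked, for example because the target name still contained a `/` pointing into a directory that does not exist. Whether that is the cause here cannot be told from the post, so the following is only a sketch of how to surface the real error; `save_bytes` and the file names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: write raw bytes to $path and die with the OS error
# instead of carrying on with a filehandle that was never opened.
sub save_bytes {
    my ($path, $bytes) = @_;
    open my $fh, '>', $path or die "cannot open '$path': $!\n";
    binmode $fh;                                  # safe: open succeeded
    print {$fh} $bytes or die "cannot write '$path': $!\n";
    close $fh          or die "cannot close '$path': $!\n";
    return $path;
}

my $dir = tempdir(CLEANUP => 1);

# A flat, sanitized name inside an existing directory works.
my $ok = save_bytes(File::Spec->catfile($dir, 'logo.png'), "\x89PNG");

# A name that still points into a missing directory makes open fail.
my $bad = File::Spec->catfile($dir, 'missing_dir', 'jquery.color.js');
my $err = eval { save_bytes($bad, 'x'); 1 } ? '' : $@;

print "saved: $ok\n";
print "failed as expected: $err" if $err;
```

With the check in place the message names the offending path and the system error, which should also help explain the elements missing from the saved page.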
there are no rules, there are no thumbs..