PerlMonks  

(almost) preserving a web page

by punkish (Priest)
on Jun 18, 2011 at 00:51 UTC ( #910277=perlquestion )
punkish has asked for the wisdom of the Perl Monks concerning the following question:

My program grabs web pages and stores two versions of each in a database:
  1. A version without any HTML tags, for which I use HTML::Strip. The text, that is, the non-tags content, of the web page is used to build a full-text index which is used for later searches;
  2. A version as the page was at the instant of downloading it. This one is used to show the user the web page as it was at the time and date when it was downloaded.
I am facing the following problem -- JavaScript in some web pages wreaks havoc on viewing them (the whole mechanism is part of a web application; the "historic" web pages are shown in an iframe). So, I thought perhaps I could remove the script tags and the enclosed JavaScript from the HTML content. First, how do I do that? Second, I am not sure that it will even help: since some web pages are actually built by JavaScript when they load, they might simply fail to render once the scripts are gone.

So, I am seeking two kinds of advice -- one, how to strip out only the JavaScript from a web page; and two, how to generally better accomplish the above.
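
To make the first request concrete, here is a minimal sketch of what "strip out only the JavaScript" could mean, using nothing but core Perl. The helper name strip_javascript is hypothetical, and the regex approach is a deliberate simplification: it can be fooled by malformed markup, by "</script>" inside a string literal, or by text that merely looks like an on...= attribute. A real HTML parser such as HTML::TreeBuilder would likely be more robust.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): crude, regex-based
# removal of JavaScript from an HTML string. A sketch, not a sanitizer.
sub strip_javascript {
    my ($html) = @_;

    # Drop <script ...> ... </script> elements together with their contents.
    $html =~ s{<script\b[^>]*>.*?</script\s*>}{}gis;

    # Drop inline event-handler attributes such as onclick="..." or onload='...'.
    # Caveat: this is not tag-aware, so body text of the same shape is removed too.
    $html =~ s{\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)}{}gi;

    # Replace quoted javascript: URLs in href/src attributes with "#".
    $html =~ s{\b(href|src)\s*=\s*(["'])\s*javascript:.*?\2}{${1}=${2}#${2}}gis;

    return $html;
}

my $page = <<'HTML';
<html><head><script type="text/javascript">top.location = "http://example.com/";</script></head>
<body onload="init()"><p>Hello <a href="javascript:void(0)">world</a></p></body></html>
HTML

print strip_javascript($page);
```

This only addresses scripts that misbehave when the stored page is shown; it does nothing for pages whose content is generated by JavaScript in the first place.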



when small people start casting long shadows, it is time to go to bed

Re: (almost) preserving a web page
by Anonymous Monk on Jun 18, 2011 at 02:56 UTC

      I wouldn't use Javascript for getting at the text:

      print $mech->text

      Alternatively, if you're really interested in the textContent of a specific element, use

      print $element->{textContent}

      Thanks to MozRepl::RemoteObject, almost everything you can get at with JavaScript, you can also get at from Perl.

      #!perl -w
      use strict;
      use WWW::Mechanize::Firefox::DSL;

      get 'http://perlmonks.org';
      print text
        Thanks :) The javascript was just proof of concept for the correct DOM incantation ... I have never actually used WWW::Mechanize::Firefox :)
      It is possible that my original question was not clear enough, and hence, something else got answered. On the other hand, it is also possible that your answers are actually leading me to the right solution, but I can't see it yet. So, more discussion follows --

      I don't really want to get text via JavaScript on a page by page basis. If I had only one, predictable web site, perhaps I could devise a mechanism to work around its idiosyncrasies.

      However, what I have is an application that visits 30 different web sites on a periodic basis. It extracts the links from the "front page" of each of these web sites, discarding all the links that point outside the base domain. Then, it follows each one of those links. So, if we have an average of 10 links in the text of each web site's front page, the program will visit 30 * 10 web pages.
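
      The link-filtering step can be sketched roughly as follows. This is only an illustration in core Perl: host_of and links_within_domain are hypothetical names, the links are assumed to be absolute http(s) URLs already (relative ones would first need resolving against the front page, for example with URI->new_abs), and "inside the base domain" is assumed to mean the front page's host or any subdomain of it.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-cased host of an absolute http(s) URL,
# or undef for anything else (mailto:, relative links, and so on).
sub host_of {
    my ($url) = @_;
    return $url =~ m{^https?://(?:[^/@]*@)?([^/:?#]+)}i ? lc $1 : undef;
}

# Hypothetical helper: keep only the links whose host is the base host
# itself or a subdomain of it; everything else is discarded.
sub links_within_domain {
    my ($base_url, @links) = @_;
    my $base_host = host_of($base_url);
    return () unless defined $base_host;

    return grep {
        my $host = host_of($_);
        defined $host
            && ($host eq $base_host || $host =~ /\.\Q$base_host\E\z/);
    } @links;
}

my @links = (
    'http://example.com/news/1.html',
    'http://blog.example.com/post',
    'http://ads.example.net/click',
    'mailto:someone@example.com',
);

# Prints the two example.com links and drops the other two.
print "$_\n" for links_within_domain('http://example.com/', @links);
```

      Note that with this definition a front page at www.example.com would treat news.example.com as outside, so the base host may need normalising first.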

      For each of the web pages that it visits, it downloads the content, makes a copy, and strips out all the HTML tags from the copies. Then, it searches the plain text for certain keywords. If the keywords are present, it stores the plain text version in a full-text search (FTS) table (using SQLite's FTS4 implementation), and also stores its original web source, with HTML tags and all.
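
      The keyword test that decides whether a page gets stored could look something like the sketch below. It assumes the HTML tags have already been stripped and that the keywords are plain words; matches_keywords is a hypothetical name, and whole-word, case-insensitive matching is my assumption about what "present" means. Writing to the FTS4 table is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if any keyword occurs in the plain text
# as a whole word (case-insensitive), 0 otherwise.
sub matches_keywords {
    my ($text, @keywords) = @_;
    return 0 unless @keywords;

    # Escape each keyword so it is matched literally, then try them all at once.
    my $alternation = join '|', map { quotemeta($_) } @keywords;
    return $text =~ /\b(?:$alternation)\b/i ? 1 : 0;
}

my @keywords = ('flood', 'drought');
for my $text ('The Flood warning was lifted', 'Floodlights were installed') {
    my $verdict = matches_keywords($text, @keywords) ? 'store' : 'skip';
    print "$verdict: $text\n";
}
```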

      At a later time, the user arrives at the application web page and is able to search the FTS content for various terms. If matching content is found, a link is presented to the user so the original web page may be examined. On clicking the link, the original web page (also stored in the database) is presented in an iFrame.

      For the most part, actually, having the exact content as it was originally is a good thing. It allows reconstructing the original web page as truthfully as possible. Sometimes this tactic fails, and more often than not, the failure is because of JavaScript in the original page firing off and doing something wonky.

      So, the intent is to be able to view the original web page as it appeared when it was published in a fool-proof, universally applicable manner.



      when small people start casting long shadows, it is time to go to bed

        httrack does that by mining the JavaScript for links. It gets the more common ones but not all of them, and some JavaScript will redirect you from your local copy back to the internet.

        http://crawler.archive.org/ does that by inserting its own javascript which does url rewriting so the images show up (even the dynamic ones), but like httrack, actual links are rewritten ...

        Then there is Mozilla Archive Format (with Faithful Save), which does a much better version of save-as; it's close to perfect :)

        Another common tactic is to print-to-pdf from a browser like Firefox via automation.
