http://www.perlmonks.org?node_id=910277

punkish has asked for the wisdom of the Perl Monks concerning the following question:

My program grabs web pages and stores two versions of each in a database:
  1. A version without any HTML tags, for which I use HTML::Strip. The text content of the page (everything that is not a tag) is used to build a full-text index for later searches;
  2. A version of the page exactly as it was at the instant of download. This one is used to show the user the web page as it was at that date and time.
I am facing the following problem -- JavaScript in some web pages wreaks havoc on viewing them (the whole mechanism is part of a web application; the "historic" web pages are shown in an iframe). So, I thought perhaps I could remove the script tags and the enclosed JavaScript from the HTML content. First, how do I do that? Second, I am not sure that removing the scripts will actually help: some web pages are built by JavaScript as they load, so without their scripts they might simply fail to render.

So, I am seeking two kinds of advice -- one, how to strip out only the JavaScript from a web page; and two, how to generally better accomplish the above.
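
To make the first part concrete, here is a minimal sketch of the kind of stripping I mean. It uses only core Perl and a single regex, so it is an approximation rather than a real HTML parse; strip_javascript is a hypothetical helper name of my own, and it assumes every script element has a closing tag. A parser-based approach (HTML::Parser, for example) would likely cope better with malformed markup.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any module: removes <script>...</script>
# elements from an HTML string. Assumes each opening tag has a matching
# closing tag and that no attribute value contains a literal ">".
sub strip_javascript {
    my ($html) = @_;

    # /i matches <SCRIPT> as well as <script>, /s lets "." span newlines,
    # and the non-greedy ".*?" stops at the first closing tag.
    $html =~ s{<script\b[^>]*>.*?</script\s*>}{}gis;

    return $html;
}

my $page = <<'HTML';
<html><head><script type="text/javascript">alert("hi");</script></head>
<body><p>Hello</p></body></html>
HTML

print strip_javascript($page);
```

This leaves inline event handlers (onclick, onload and so on) and javascript: URLs untouched, so it only covers the script-tag case described above.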



when small people start casting long shadows, it is time to go to bed