Re^2: (almost) preserving a web page
by punkish (Priest)
on Jun 18, 2011 at 22:31 UTC
However, what I have is an application that periodically visits 30 different web sites. It extracts the links from the "front page" of each of these web sites and discards all the links that point outside the base domain. Then, it follows each of the remaining links. So, if we have an average of 10 such links on each web site's front page, the program will visit about 30 * 10 = 300 web pages per run.
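To make the link-filtering step concrete, here is a minimal sketch in Python using only the standard library. The names `LinkCollector` and `same_domain_links` are made up for illustration, and "same base domain" is assumed to mean an exact host match, which may be stricter than what the application really does (for example, `www.example.com` and `example.com` would count as different hosts here).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(base_url, html):
    """Return absolute http(s) links whose host equals the host of base_url.

    Assumption: "base domain" is compared as an exact, case-insensitive host.
    """
    base_host = urlparse(base_url).netloc.lower()
    collector = LinkCollector()
    collector.feed(html)

    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc.lower() != base_host:
            continue  # link points outside the base domain
        if absolute not in links:  # drop duplicates, keep page order
            links.append(absolute)
    return links
```

Each front page would be passed through `same_domain_links` once, and the returned URLs are the pages to fetch next.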
For each of the web pages that it visits, it downloads the content, makes a copy, and strips out all the HTML tags from the copy. Then, it searches the resulting plain text for certain keywords. If any of the keywords are present, it stores the plain text version in a full-text search (FTS) table (using SQLite's FTS4 implementation), and also stores the original web source, HTML tags and all.
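The strip, check, and store step could look roughly like the following Python sketch. The table names (`pages`, `pages_fts`), the column names, and the function names are assumptions made for this example, not the application's actual schema. The keyword test is a plain case-insensitive substring match, and the sketch assumes the SQLite build has FTS4 enabled.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(html):
    """Return the page text with tags removed and whitespace collapsed.

    Text nodes are joined with a space, so a word split by inline markup
    (such as <b>W</b>ord) ends up with a space inside it.
    """
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.parts).split())


def open_store(path=":memory:"):
    """Create the two tables assumed by this sketch."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(id INTEGER PRIMARY KEY, url TEXT, fetched_at TEXT, html TEXT)"
    )
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages_fts USING fts4(body)")
    return db


def store_if_relevant(db, url, html, keywords, fetched_at):
    """Store the page only if its plain text contains one of the keywords.

    Returns the new row id, or None when no keyword was found.
    """
    text = strip_tags(html)
    lowered = text.lower()
    if not any(keyword.lower() in lowered for keyword in keywords):
        return None
    cur = db.execute(
        "INSERT INTO pages (url, fetched_at, html) VALUES (?, ?, ?)",
        (url, fetched_at, html),
    )
    # Reuse the pages.id as the FTS docid so the two rows stay linked.
    db.execute(
        "INSERT INTO pages_fts (docid, body) VALUES (?, ?)",
        (cur.lastrowid, text),
    )
    db.commit()
    return cur.lastrowid
```

Keeping the original HTML in an ordinary table and only the plain text in the FTS table means the index stays free of markup, while the `docid` still leads back to the untouched source.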
At a later time, the user arrives at the application's web page and is able to search the FTS content for various terms. If matching content is found, a link is presented to the user so that the original web page may be examined. On clicking the link, the original web page (also stored in the database) is presented in an iframe.
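The search side might be sketched like this in Python. It assumes a `pages` table (`id`, `url`, `html`) and an FTS4 table `pages_fts` whose `docid` matches `pages.id`; those names, the `/page/<id>` route, and the three functions are all hypothetical and only illustrate the flow of search, link, and iframe.

```python
import sqlite3


def search(db, query, limit=10):
    """Run an FTS4 MATCH query and return one dict per hit.

    The "view" value is a made-up route under which the application
    would serve the stored HTML for that page.
    """
    hits = db.execute(
        "SELECT docid, snippet(pages_fts) FROM pages_fts "
        "WHERE pages_fts MATCH ? LIMIT ?",
        (query, limit),
    ).fetchall()

    results = []
    for docid, excerpt in hits:
        row = db.execute(
            "SELECT url FROM pages WHERE id = ?", (docid,)
        ).fetchone()
        if row is None:
            continue  # index entry without a stored page; skip it
        results.append(
            {
                "id": docid,
                "url": row[0],
                "snippet": excerpt,
                "view": "/page/%d" % docid,
            }
        )
    return results


def stored_page(db, page_id):
    """Return the stored original HTML for a page id, or None if unknown."""
    row = db.execute(
        "SELECT html FROM pages WHERE id = ?", (page_id,)
    ).fetchone()
    return row[0] if row else None


def iframe_tag(page_id):
    """Build the iframe element that points at the hypothetical route."""
    return '<iframe src="/page/%d"></iframe>' % page_id
```

The handler behind `/page/<id>` would simply return whatever `stored_page` gives back, so the iframe shows the saved source rather than the live site.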
So, the intent is to be able to view the original web page as it appeared when it was published in a fool-proof, universally applicable manner.
when small people start casting long shadows, it is time to go to bed