Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
It is possible that my original question was not clear enough, and hence, something else got answered. On the other hand, it is also possible that your answers are actually leading me to the right solution, but I can't see it yet. So, more discussion follows --

I don't really want to get text via JavaScript on a page by page basis. If I had only one, predictable web site, perhaps I could devise a mechanism to work around its idiosyncrasies.

However, what I have is an application that visits 30 different web sites on a periodic basis. It extracts the links from the "front page" of each of these web sites, discarding all the links that point outside the base domain. Then, it follows each one of those links. So, if we have an average of 10 links in the text of each web site's front page, the program will visit 30 * 10 web pages.

For each of the web pages that it visits, it downloads the content, makes a copy, and strips out all the HTML tags from the copies. Then, it searches the plain text for certain keywords. If the keywords are present, it stores the plain text version in a full-text search (FTS) table (using SQLite's FTS4 implementation), and also stores its original web source, with HTML tags and all.

At a later time, the user arrives at the application web page and is able to search the FTS content for various terms. If matching content is found, a link is presented to the user so the original web page may be examined. On clicking the link, the original web page (also stored in the database) is presented in an iFrame.

For the most part, actually, having the exact content as it was originally is a good thing. It allows reconstructing the original web page as truthfully as possible. Sometimes this tactic fails, and more often than not, the failure is because of JavaScript in the original page firing off and doing something wonky.

So, the intent is to be able to view the original web page as it appeared when it was published in a fool-proof, universally applicable manner.



when small people start casting long shadows, it is time to go to bed

In reply to Re^2: (almost) preserving a web page by punkish
in thread (almost) preserving a web page by punkish

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others drinking their drinks and smoking their pipes about the Monastery: (7)
    As of 2014-12-20 10:08 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      Is guessing a good strategy for surviving in the IT business?





      Results (95 votes), past polls