PerlMonks  


This is just a basic primer on what I have learned about identifying the main (dynamic) content of web pages.

In web-based information retrieval, a common problem is discerning the interesting text in a web page. A human decides this based on the formatting, location, and content of the text. Programmatically, the problem can be approached by splitting the DOM tree into leaf containers and comparing n-grams or word token frequency counts computed for those leaves and for the page.

The first step is to decide how to segment the DOM tree. The DOM should be segmented on HTML containers, but choosing those containers can be difficult and domain specific, and the choice may need to change depending on how the majority of your corpus is constructed. For instance, if <p> tags often contain "large" amounts of text, they can be used as container leaves. If their word counts are "small", they might not be useful as container leaves. Using <table> and <div> should be near universal, given modern template design.
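The segmentation step can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the Perl modules mentioned at the end of this post. The container set, the `LeafSegmenter` class, and the `segment` helper are all assumptions made for the example, and it expects reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Assumed container set; the right choice depends on the corpus.
DEFAULT_CONTAINERS = frozenset({"div", "table", "p"})


class LeafSegmenter(HTMLParser):
    """Collect the text of containers that hold no nested container."""

    def __init__(self, containers=DEFAULT_CONTAINERS):
        super().__init__()
        self.containers = containers
        self.stack = []   # open containers: [tag, text_parts, has_child_container]
        self.leaves = []  # (tag, text) for every leaf container, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.containers:
            if self.stack:
                self.stack[-1][2] = True  # the parent is no longer a leaf
            self.stack.append([tag, [], False])

    def handle_data(self, data):
        # Text belongs to the innermost open container, if any.
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        # Ignore end tags that do not match any open container.
        if not any(entry[0] == tag for entry in self.stack):
            return
        # Pop until the matching container closes, so an unclosed inner
        # container does not block its parent from closing.
        while self.stack:
            open_tag, parts, has_child = self.stack.pop()
            text = " ".join("".join(parts).split())  # normalise whitespace
            if not has_child and text:
                self.leaves.append((open_tag, text))
            if open_tag == tag:
                break


def segment(html, containers=DEFAULT_CONTAINERS):
    """Return (tag, text) pairs for the leaf containers of an HTML string."""
    parser = LeafSegmenter(containers)
    parser.feed(html)
    parser.close()
    return parser.leaves


if __name__ == "__main__":
    page = "<div><p>Home</p><p>About</p></div><div>Main story text</div>"
    for tag, text in segment(page):
        print(tag, "->", text)
```

Passing a different `containers` set (for example only `{"div", "table"}`) is the knob for the corpus-specific choice described above. Text sitting directly in a container that also has nested containers is dropped in this sketch, as are containers still open at the end of the input.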

Once the DOM is segmented, the leaves must be sorted into two groups: static content, which doesn't change (or doesn't change much) between pages, and dynamic content, which changes on every page and usually represents the desired information on the page.

Depending on the desired trade-off between accuracy and processing time, either word token frequency counts or n-grams of increasing accuracy can be created for each leaf. Leaving the HTML markup intact can help identify static leaves that hold little text and whose text might change between pages (advertisement anchors, navigation bars, etc.). Removing the HTML markup, on the other hand, can help clean up word frequency counts or n-grams by dropping highly repetitive but useless information.
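The two per-leaf representations could be computed roughly like this. Again a Python standard-library sketch; the tokenizer regex, the function names, and the regex-based tag stripping are assumptions for illustration (the stripping is deliberately crude and no substitute for a real HTML parser).

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")  # assumed tokenizer: lowercase word-ish runs
TAG_RE = re.compile(r"<[^>]+>")       # crude tag stripper, for illustration only


def tokens(text, strip_markup=True):
    """Lowercase word tokens, optionally with HTML tags removed first."""
    if strip_markup:
        text = TAG_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())


def word_counts(text, strip_markup=True):
    """Word token frequency counts: cheap, but ignores word order."""
    return Counter(tokens(text, strip_markup))


def ngram_counts(text, n=2, strip_markup=True):
    """Counts of n consecutive tokens: costlier, but keeps local word order."""
    toks = tokens(text, strip_markup)
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


if __name__ == "__main__":
    leaf = '<a href="/sale">Buy now</a> <a href="/new">Buy new</a>'
    print(word_counts(leaf))                      # markup stripped
    print(word_counts(leaf, strip_markup=False))  # tag and attribute names counted too
    print(ngram_counts(leaf, n=2))
```

With `strip_markup=False` the tag and attribute names become tokens themselves, which is one way to keep the markup signal for leaves that carry very little text.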

Multiple pages from the same site, which should share the same template, can be used to build n-grams and word token frequency counts for each terminal leaf. Leaves whose counts show high deltas between pages can then be classified as dynamic content.
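A sketch of the multi-page comparison, in standard-library Python. It assumes each page has already been reduced to a mapping from a stable leaf key (for example a DOM path) to that leaf's text. The delta measure (one minus a weighted Jaccard overlap), the 0.5 threshold, and all names are choices made for the example, not something prescribed above.

```python
import re
from collections import Counter
from itertools import combinations


def word_counts(text):
    """Word token frequency counts for one leaf."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def delta(a, b):
    """1 - weighted Jaccard overlap of two Counters (0 = identical, 1 = disjoint)."""
    union = sum((a | b).values())  # per-token maximum
    if union == 0:
        return 0.0
    return 1.0 - sum((a & b).values()) / union  # per-token minimum over maximum


def classify_leaves(pages, threshold=0.5):
    """pages: list of {leaf_key: text} dicts, one per page of the same template.

    Returns {leaf_key: "dynamic" or "static"} for leaves present on every page.
    """
    if len(pages) < 2:
        raise ValueError("need at least two pages to compute deltas")
    shared_keys = set.intersection(*(set(page) for page in pages))
    labels = {}
    for key in shared_keys:
        counts = [word_counts(page[key]) for page in pages]
        deltas = [delta(x, y) for x, y in combinations(counts, 2)]
        mean_delta = sum(deltas) / len(deltas)
        labels[key] = "dynamic" if mean_delta > threshold else "static"
    return labels


if __name__ == "__main__":
    pages = [
        {"nav": "Home About Contact", "main": "Perl regex tutorial for beginners"},
        {"nav": "Home About Contact", "main": "Notes on garbage collection in Java"},
    ]
    print(classify_leaves(pages))
```

Aligning leaves across pages is the hard part in practice. Keying on a DOM path is only one option, and templates that insert or drop blocks will break a purely positional match.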

A single page can also serve, but with lower accuracy. By creating a word token frequency count of the page's text with HTML stripped and stopwords removed, or an n-gram of the same, the "gist" of the page can be ascertained. Each leaf can then be compared to this list, with highly similar leaves being considered part of the important text of the page, and thus dynamic content. There are many counter-examples where this would not be true, but for pages with a large amount of content on their main subject, leaves not pertinent to that subject, such as navigation bars, advertisements, and link lists, should be scored as static.
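The single-page variant might look like this in standard-library Python. The tiny stopword list, the use of cosine similarity as the comparison, and the function names are all assumptions for the sketch; the leaf texts are taken to be HTML-stripped already.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset(
    "a an and are as at be by for in is it of on or the to with".split()
)


def content_counts(text):
    """Frequency counts of non-stopword tokens in HTML-stripped text."""
    toks = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in toks if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity of two Counters (0.0 when either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def score_leaves(leaf_texts):
    """Score each leaf by its similarity to the gist of the whole page."""
    per_leaf = [content_counts(text) for text in leaf_texts]
    gist = Counter()
    for counts in per_leaf:
        gist.update(counts)  # page gist = summed counts of all leaves
    return [cosine(counts, gist) for counts in per_leaf]


if __name__ == "__main__":
    leaves = [
        "Home About Contact Login",
        "Perl hashes map keys to values. A hash in Perl stores keys and "
        "values; hash keys are unique strings.",
        "Buy cheap widgets now",
    ]
    for text, score in zip(leaves, score_leaves(leaves)):
        print(f"{score:.2f}  {text[:40]}")
```

High-scoring leaves are the candidates for dynamic content. Where to put the cutoff is left open here, since it depends on how much of the page the main text occupies.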

Perl, of course, offers all the tools required to implement something like this. LWP, HTML::Tree (and friends), and the Lingua modules can be combined in many ways to meet almost anyone's needs.


In reply to Information Retrieval - Segmenting DOM Trees (Static vs Dynamic content) by perlmonkey2
