
This is just a basic primer on what I have learned about identifying the main (dynamic) content of web pages.

In web-based information retrieval, a common problem is discerning the interesting text in a web page. A human decides this based on the formatting, location, and content of the text. Programmatically, the problem can be approached by taking the leaf containers of the DOM tree and comparing n-grams or word token frequency counts computed for each of them.

The first step is to decide how to segment the DOM tree. The DOM should be segmented on HTML containers, but choosing those containers can be difficult and domain specific, and the choice may need to change depending on how the majority of your corpus is constructed. For instance, if <p> tags often contain "large" amounts of text, then they can be used as container leaves. If their word count is usually "small", they might not be useful as container leaves. Using <table> and <div> should be near universal, given modern template design.
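As a minimal sketch of the segmentation step (in Python with only the standard library, to keep it self-contained; HTML::Tree would play the same role in Perl), the following collects the text of every container that holds no other container. The names `CONTAINER_TAGS`, `LeafSegmenter`, and `segment` are made up for illustration, the tag set is an assumption you would tune to your corpus, and the code assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Assumption: which tags count as containers is corpus specific.
CONTAINER_TAGS = {"div", "table", "p"}


class LeafSegmenter(HTMLParser):
    """Collect the text of containers that hold no nested container."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open containers: [tag, text parts, has child container]
        self.leaves = []  # (tag, text) for every leaf container found

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            if self.stack:
                self.stack[-1][2] = True  # the parent is no longer a leaf
            self.stack.append([tag, [], False])

    def handle_endtag(self, tag):
        # Only well-nested closes are handled; stray end tags are ignored.
        if tag in CONTAINER_TAGS and self.stack and self.stack[-1][0] == tag:
            _, parts, has_child = self.stack.pop()
            text = " ".join(" ".join(parts).split())  # collapse whitespace
            if not has_child and text:
                self.leaves.append((tag, text))

    def handle_data(self, data):
        # Text belongs to the innermost open container.
        if self.stack:
            self.stack[-1][1].append(data)


def segment(html):
    """Return a list of (tag, text) pairs, one per leaf container."""
    parser = LeafSegmenter()
    parser.feed(html)
    parser.close()
    return parser.leaves
```

Note that text sitting directly inside a container that also has child containers is dropped here. Whether that loss matters depends on your templates.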

Once the DOM is segmented, the leaves must be divided into two groups: static content, which won't change (or won't change much) between pages, and dynamic content, which changes on every page and usually represents the desired information on the page.

Depending on the desired trade-off between accuracy and processing time, either word token frequency counts or the more accurate but more expensive n-grams can be created for each leaf. Leaving the HTML markup intact can help identify static leaves that contain little text and might still change between pages (advertisement anchors, navigation bars, etc.). Removing the HTML markup, on the other hand, can help clean up word frequency counts or n-grams that would otherwise be full of highly repetitive but useless information.
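Here is a rough sketch of both representations, with a switch for keeping or stripping the markup. It is Python with only the standard library; the function names `tokens` and `ngrams` and the regex-based tag handling are my own simplifications for illustration, not a real HTML tokenizer.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z0-9']+")


def tokens(fragment, keep_markup=False):
    """Split an HTML fragment into lowercase tokens.

    With keep_markup=True each tag is reduced to a bare token such as
    "<a>" or "</a>" (attributes dropped), so markup-heavy leaves such as
    navigation bars keep a recognizable signature.
    """
    text = fragment.lower()
    if keep_markup:
        text = re.sub(r"<\s*(/?\w+)[^>]*>", r" <\1> ", text)
        return re.findall(r"</?\w+>|[a-z0-9']+", text)
    return WORD_RE.findall(TAG_RE.sub(" ", text))


def ngrams(toks, n=2):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


# Word token frequency count for one leaf:
counts = Counter(tokens('<a href="/x">Buy now</a> and buy again'))
```

Plain counts are cheap. Bigrams or trigrams cost more memory and time per leaf but preserve word order.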

With multiple pages from the same site, which should share a template, n-grams and word token frequency counts can be computed for each terminal leaf on every page. Leaves with high deltas between pages can then be classified as dynamic content.
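One way to put a number on the "delta" between pages is sketched below in standard-library Python. It assumes each page has already been reduced to a dict mapping some leaf key (for example a DOM path) to that leaf's text. The key scheme, the weighted Jaccard measure, and the 0.5 threshold are all assumptions for illustration, not something the approach dictates.

```python
from collections import Counter
from itertools import combinations


def delta(a, b):
    """Return 0.0 for identical token counts and 1.0 for disjoint ones."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    total = sum((ca | cb).values())   # sum of per-token maximum counts
    if total == 0:
        return 0.0
    shared = sum((ca & cb).values())  # sum of per-token minimum counts
    return 1.0 - shared / total


def classify(pages, threshold=0.5):
    """Label each leaf key present on every page as static or dynamic.

    pages: list of dicts mapping a leaf key to that leaf's text.
    """
    if len(pages) < 2:
        raise ValueError("need at least two pages to compare")
    labels = {}
    shared_keys = set.intersection(*(set(page) for page in pages))
    for key in sorted(shared_keys):
        pairs = list(combinations([page[key] for page in pages], 2))
        mean = sum(delta(a, b) for a, b in pairs) / len(pairs)
        labels[key] = "dynamic" if mean > threshold else "static"
    return labels
```

Leaves that exist on only some of the pages are skipped here. In practice you would have to decide how to treat them.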

A single page can also work, but with lower accuracy. By creating a word token frequency count of the HTML-stripped, non-stopword text, or an n-gram of the same, the "gist" of the page can be ascertained. Each leaf can then be compared to this list, with highly similar leaves being considered part of the important text of the page, and thus dynamic content. There are many counter-examples where this would not hold, but for pages with a large amount of content on their main subject, leaves not pertinent to that subject (navigation bars, advertisements, and link lists) should be scored as static.
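The single-page variant could look roughly like this, again in standard-library Python. The tiny `STOPWORDS` set is a placeholder (a real list, such as one from the Lingua modules, would be much longer), and cosine similarity is just one reasonable way to compare a leaf against the page-wide counts.

```python
import math
import re
from collections import Counter

# Placeholder stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def content_tokens(text):
    """Lowercase word tokens of already HTML-stripped text, minus stopwords."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower())
            if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two token Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def score_leaves(leaves):
    """Score each leaf's text against the gist of the whole page."""
    per_leaf = [Counter(content_tokens(leaf)) for leaf in leaves]
    gist = Counter()
    for counts in per_leaf:
        gist.update(counts)  # page-wide token frequencies
    return [cosine(counts, gist) for counts in per_leaf]
```

Leaves with the highest scores are the candidates for main content. Where to cut off is a judgment call, and long leaves have an advantage because they contribute most of the gist themselves.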

Perl, of course, offers all the tools required to implement something like this. LWP, HTML::Tree (and friends), and the Lingua modules can be combined in many ways to meet almost anyone's needs.

In reply to Information Retrieval - Segmenting DOM Trees (Static vs Dynamic content) by perlmonkey2
