Information Retrieval - Segmenting DOM Trees (Static vs Dynamic content)
by perlmonkey2 (Beadle)
on Jan 14, 2007 at 22:13 UTC
This is just a basic primer on what I have learned about identifying the main (dynamic) content of web pages.
In web-based information retrieval, a common problem is discerning the interesting text in a web page. A human decides this based on the formatting, location, and content of the text. Programmatically, the problem can be approached by using leaf containers from the DOM tree together with n-grams or word token frequency counts of the page.
The first step is to decide how to segment the DOM tree. The DOM should be segmented on HTML container elements, but choosing those containers can be difficult and domain specific, and the choice may need to change depending on how the majority of your corpus is constructed. For instance, if <p> tags often contain "large" amounts of text, then <p> can be used as a leaf container. But if the word count is typically "small", it might not be useful as one. Using <table> and <div> should be near universal, given modern template design.
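As a rough illustration of the segmentation step, here is a minimal sketch in Python using only the standard library (the same idea could be written in Perl on top of HTML::TreeBuilder). The container set, the class name LeafSegmenter, and the helper leaf_segments are my own assumptions for the example, not a fixed recipe. A "leaf" here is a container element with no other container nested inside it. Text sitting directly in a non-leaf container is simply dropped.

```python
from html.parser import HTMLParser

# Assumed container set; tune it for your corpus (e.g. add "p" if paragraphs are long).
CONTAINERS = {"div", "table", "td"}
IGNORED = {"script", "style"}  # their contents are not visible text


class LeafSegmenter(HTMLParser):
    """Collect the text of container elements that hold no nested container."""

    def __init__(self, containers=CONTAINERS):
        super().__init__()
        self.containers = containers
        self.stack = []     # open containers: [tag, text_parts, has_child_container]
        self.leaves = []    # (tag, text) for each leaf container, in closing order
        self.ignoring = 0   # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self.ignoring += 1
        elif tag in self.containers:
            if self.stack:
                self.stack[-1][2] = True  # the parent is no longer a leaf
            self.stack.append([tag, [], False])

    def handle_endtag(self, tag):
        if tag in IGNORED:
            self.ignoring = max(0, self.ignoring - 1)
            return
        if not any(entry[0] == tag for entry in self.stack):
            return  # not a container, or a stray closing tag
        # Pop down to the matching tag, closing any unclosed containers on the way.
        while self.stack:
            name, parts, has_child = self.stack.pop()
            text = " ".join(" ".join(parts).split())
            if not has_child and text:
                self.leaves.append((name, text))
            if name == tag:
                break

    def handle_data(self, data):
        if self.stack and not self.ignoring:
            self.stack[-1][1].append(data)


def leaf_segments(html):
    """Return a list of (tag, text) pairs, one per leaf container."""
    parser = LeafSegmenter()
    parser.feed(html)
    parser.close()
    return parser.leaves
```

Calling leaf_segments() on a page's HTML gives one (tag, text) pair per leaf, which is the unit the later steps work on.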
Once the DOM is segmented, the leaves must be separated into two groups: static content, which changes little or not at all between pages, and dynamic content, which changes on every page and usually represents the desired information.
Depending on the desired accuracy vs. processing time, either word token frequency counts or n-grams of increasing length can be created for each leaf. Leaving the HTML markup intact can help differentiate static leaves that contain little text and might change slightly between pages (advertisement anchors, navigation bars, etc.). Removing the markup, on the other hand, can clean up word frequency counts or n-grams by dropping highly repetitive but useless information.
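To make the two representations concrete, here is a small Python sketch that builds word token counts and n-gram counts for one leaf, with a switch for keeping or dropping the markup. The function names, the keep_markup flag, and the rough regex tokenizer are assumptions for illustration only. The regex is not a real HTML parser and will, for example, count words inside comments.

```python
import re
from collections import Counter

# Either an HTML tag (captured as its name, attributes discarded) or a word.
TOKEN_RE = re.compile(r"<(/?[a-z][a-z0-9]*)[^>]*>|([a-z0-9']+)")


def tokenize(fragment, keep_markup=False):
    """Split a leaf's HTML into lowercase tokens.

    With keep_markup=True each tag becomes a token such as "<a>" or "</a>";
    otherwise tags are dropped and only words remain.
    """
    tokens = []
    for tag, word in TOKEN_RE.findall(fragment.lower()):
        if word:
            tokens.append(word)
        elif keep_markup:
            tokens.append("<" + tag + ">")
    return tokens


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


leaf = '<a href="/buy">Buy now</a> <a href="/buy">Buy later</a>'
word_counts = Counter(tokenize(leaf))               # cheap: plain word frequencies
bigrams = ngram_counts(tokenize(leaf), 2)           # costlier: keeps word order
with_markup = Counter(tokenize(leaf, keep_markup=True))
```

Plain counts are the cheaper option. Raising n keeps more of the word order at the cost of more processing and sparser counts.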
Using multiple pages from the same site, which should share the same template, you can build n-grams and word token frequency counts for each terminal leaf. Leaves with high deltas between pages can then be classified as dynamic content.
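A minimal Python sketch of the multi-page comparison might look like the following. It assumes the leaves of two pages have already been extracted and line up by position (reasonable when both pages come from the same template, but still an assumption). The delta measure, one minus the multiset overlap of the token counts, and the 0.5 threshold are illustrative choices of mine, not the only sensible ones.

```python
import re
from collections import Counter


def token_counts(text):
    """Lowercase word frequency counts for one leaf's text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def delta(counts_a, counts_b):
    """Return 0.0 for identical token counts and 1.0 for no tokens in common."""
    shared = sum((counts_a & counts_b).values())   # per-token minimum
    total = sum((counts_a | counts_b).values())    # per-token maximum
    return 1.0 - shared / total if total else 0.0


def classify_leaves(page_a, page_b, threshold=0.5):
    """Label each positionally aligned pair of leaves as static or dynamic."""
    labels = []
    for text_a, text_b in zip(page_a, page_b):
        d = delta(token_counts(text_a), token_counts(text_b))
        labels.append(("dynamic" if d > threshold else "static", d))
    return labels


page_a = ["Home About Contact", "Perl 5.8 released with new features"]
page_b = ["Home About Contact", "Local team wins the championship game"]
labels = classify_leaves(page_a, page_b)
```

Here the shared navigation leaf has a delta of 0.0 and is labeled static, while the two unrelated article leaves share no tokens and are labeled dynamic. With more than two pages, the deltas could be averaged over all pairs.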
Using only a single page can also work, but with lower accuracy. By creating a word token frequency count of the HTML-stripped, non-stopword text, or an n-gram of the same, the "gist" of the page can be ascertained. Each leaf can then be compared to this list, with highly similar leaves being considered part of the important text of the page, and thus dynamic content. There are many counter-examples where this would not hold, but for pages with a large amount of content on the main subject, leaves not pertinent to that subject (navigation bars, advertisements, and link lists) should score as static.
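The single-page approach can be sketched in Python along these lines. The tiny stopword list, the top_k cutoff, and the scoring rule (the fraction of a leaf's non-stopword tokens that appear among the page's most frequent terms) are all assumptions chosen to keep the example short. A real stopword list and a better similarity measure would be needed in practice.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "with"}


def content_tokens(text):
    """Lowercase words of already HTML-stripped text, minus stopwords."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]


def gist_scores(leaves, top_k=10):
    """Score each leaf by how much of it matches the page's most common terms."""
    page_counts = Counter()
    for leaf in leaves:
        page_counts.update(content_tokens(leaf))
    gist = {term for term, _ in page_counts.most_common(top_k)}

    scores = []
    for leaf in leaves:
        tokens = content_tokens(leaf)
        hits = sum(1 for t in tokens if t in gist)
        scores.append(hits / len(tokens) if tokens else 0.0)
    return scores


leaves = [
    "Home About Contact Login",
    "The Perl parser builds a tree. The parser walks the tree and the parser prunes the tree.",
    "Buy cheap watches today",
]
scores = gist_scores(leaves, top_k=2)
```

In this toy page the middle leaf scores highest because it carries the page's dominant terms, while the navigation and advertisement leaves score zero. On short pages the gist is dominated by whatever happens to repeat, which is one reason this approach is less accurate than comparing several pages.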
Perl, of course, offers all the tools required to implement something like this. LWP, HTML::Tree (and friends), and the Lingua::* modules can be combined in many ways to meet almost anyone's needs.