Information Retrieval - Segmenting DOM Trees (Static vs Dynamic content)

by perlmonkey2 (Beadle)
on Jan 14, 2007 at 22:13 UTC

This is a basic primer on what I have learned about identifying the main (dynamic) content of web pages.

In the field of web-based information retrieval, a common problem is discerning the interesting text in a web page. A human decides this based on the formatting, location, and content of the text. Programmatically, the problem can be attacked by combining leaf containers from the DOM tree with n-grams or word-token frequency counts of the page.

The first step is to decide how to segment the DOM tree, which means segmenting on HTML container tags. Choosing those containers can be difficult and domain-specific, and the choice may need to change based on how the majority of your corpus is constructed. For instance, if <p> tags often contain "large" amounts of text, they can serve as container leaves; if their word counts are "small", they might not be useful. Segmenting on <table> and <div> should be near universal, given modern template design.
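
As a minimal sketch of this step (the tag list, the 20-word floor, and the URL are illustrative assumptions, not fixed rules), HTML::TreeBuilder can pull out candidate container leaves like this:

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::TreeBuilder;

    my $url  = 'http://example.com/page.html';   # placeholder URL
    my $html = get($url) or die "could not fetch $url";
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Candidate container tags; tune these to your corpus.
    my $container = qr/^(?:div|table|p)$/;

    # Keep only "leaf" containers: no nested containers inside,
    # and at least 20 words of text (an arbitrary floor).
    my @leaves = grep {
        my $node  = $_;
        my @words = split ' ', $node->as_text;
        @words >= 20
            && !$node->look_down(_tag => $container,
                                 sub { $_[0] != $node });
    } $tree->look_down(_tag => $container);

    print scalar(@leaves), " container leaves found\n";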

Once the DOM is segmented, the leaves must be differentiated into static content, which won't change (or won't change much) between pages, and dynamic content, which changes on every page and usually represents the desired information on the page.

Depending on the trade-off between desired accuracy and processing time, either word-token frequency counts or n-grams of increasing order can be created for each leaf. Leaving the HTML markup intact can help identify low-text static leaves whose contents change between pages (advertisement anchors, navigation bars, etc.), while removing the markup helps clean highly repetitive but useless tokens out of the frequency counts or n-grams.
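
As a rough illustration (the tokenizer and the keep_markup switch are assumptions of this sketch, not part of any module), a per-leaf frequency counter might look like:

    # Word-token frequency count for one HTML::Element leaf.
    # keep_markup => 1 counts the raw HTML instead, so low-text
    # static leaves still contribute boilerplate tokens.
    sub leaf_freq {
        my ($leaf, %opt) = @_;
        my $text = $opt{keep_markup} ? $leaf->as_HTML : $leaf->as_text;
        my %freq;
        $freq{ lc $_ }++ for $text =~ /(\w+)/g;
        return \%freq;
    }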

Using multiple pages from the same site, which should share the same template, provides n-grams and word-token frequency counts for corresponding terminal leaves, so leaves with high deltas between pages can be classified as dynamic content.
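
Something along these lines could compute that delta, assuming %leaves_a and %leaves_b map each leaf's $leaf->address (its position in the tree, which should line up when two pages share a template) to the leaf node, and reusing leaf_freq from the sketch above; the 0.5 cutoff is a guess to tune, not a standard value:

    # Normalized difference between two frequency counts:
    # 0 means identical token counts, 1 means no overlap at all.
    sub freq_delta {
        my ($f1, $f2) = @_;
        my ($diff, $total) = (0, 0);
        my %all = (%$f1, %$f2);
        for my $word (keys %all) {
            my $a = $f1->{$word} || 0;
            my $b = $f2->{$word} || 0;
            $diff  += abs($a - $b);
            $total += $a + $b;
        }
        return $total ? $diff / $total : 0;
    }

    my %is_dynamic;
    for my $addr (keys %leaves_a) {
        next unless exists $leaves_b{$addr};
        my $delta = freq_delta(leaf_freq($leaves_a{$addr}),
                               leaf_freq($leaves_b{$addr}));
        $is_dynamic{$addr} = $delta > 0.5;
    }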

A single page can also serve, though with lower accuracy. By creating a word-token frequency count (or an n-gram model) of the HTML-stripped, stopword-filtered text, the "gist" of the page can be ascertained. Each leaf is then compared to this list, and highly similar leaves are considered part of the important text of the page, and thus dynamic content. There are many counter-examples where this would not hold, but for pages with a large amount of content on one main subject, leaves not pertinent to that subject, such as navigation bars, advertisements, and link lists, should score as static.
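
A sketch of that comparison, using Lingua::StopWords for the stopword list and plain cosine similarity between frequency vectors (the similarity measure is my choice for illustration, not the only option):

    use Lingua::StopWords qw(getStopWords);

    my $stop = getStopWords('en');

    # Stopword-filtered frequency count of a chunk of text:
    # run on the whole page, this is the "gist" vector that
    # each leaf is compared against.
    sub gist_freq {
        my ($text) = @_;
        my %freq;
        for my $word (map { lc } $text =~ /(\w+)/g) {
            $freq{$word}++ unless $stop->{$word};
        }
        return \%freq;
    }

    # Cosine similarity between two frequency vectors.
    sub cosine {
        my ($f1, $f2) = @_;
        my ($dot, $n1, $n2) = (0, 0, 0);
        $dot += $f1->{$_} * ($f2->{$_} || 0) for keys %$f1;
        $n1  += $_ ** 2 for values %$f1;
        $n2  += $_ ** 2 for values %$f2;
        return ($n1 && $n2) ? $dot / sqrt($n1 * $n2) : 0;
    }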

Perl, of course, offers all the tools required to implement something like this. LWP, HTML::Tree (and friends), and the Lingua modules can be combined to meet almost anyone's needs.
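
Tying the sketches above together (reusing $tree and @leaves from the segmentation sketch, gist_freq and cosine from the previous one, and an arbitrary 0.2 threshold), a single-page scorer might run like:

    my $gist = gist_freq($tree->as_text);

    for my $leaf (@leaves) {
        my $score = cosine(gist_freq($leaf->as_text), $gist);
        printf "%-12s %s (%.2f)\n",
            $leaf->address,
            ($score > 0.2 ? 'dynamic' : 'static'),
            $score;
    }

    $tree->delete;   # HTML::TreeBuilder trees must be freed explicitly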

Re: Information Retrieval - Segmenting DOM Trees (Static vs Dynamic content)
by lin0 (Curate) on Jan 15, 2007 at 19:35 UTC

    Hi perlmonkey2

    You are bringing up some very interesting points. However, I think you should include some links providing background information to better illustrate what you are talking about. In any case...

    In the field of web-based information retrieval, a common problem is discerning the interesting text in a web page. A human decides this based on the formatting, location, and content of the text.

    Yes, we tend to analyse the content to determine what concepts are described and how those concepts relate to our previous experiences. In that sense, marking a text as interesting will always depend on the person doing the evaluation. One research area that could help you deal with concepts is Granular Computing. Granular Computing will also allow you to deal with different levels of abstraction, helping you control the level of accuracy required for each leaf in the DOM tree. One paper that could interest you is: Information Granulation for Web based Information Retrieval Support Systems. Finally, it is important that you consider different ways of grouping the data (or creating information granules, in Granular Computing terms) to see which one is most suitable for your particular application. For that, I recommend having a look at clustering methods like those in Algorithm::Cluster or in Re: module for cluster analysis.

    perlmonkey2, good luck with this project. Please keep us posted on how it progresses.

    Cheers,

    lin0
