Extracting arbitrary data from HTML
by vbfg (Initiate)
on Apr 21, 2004 at 11:39 UTC
vbfg has asked for the
wisdom of the Perl Monks concerning the following question:
This isn't really a Perl question as such, I suppose; it's more a question of how best to proceed.
My problem is this. For the last few months I've been using Google News to search for news stories about Rugby League and Rugby League clubs. The intention has always been to collect stories and publish them on a Rugby League site in RDF format, but before I go live with it I want to take Google News out of the loop and harvest news stories from the sites directly.
Google News has been incredibly useful. I have a large body of documents that I can use to train AI::Categorizer to recognise valid Rugby League stories. I have a huge collection of sites that actively publish stories about Rugby League. Most important of all so far, at least as far as any future users of this are concerned, is that Google gives me consistent HTML to parse and understand irrespective of which site the story came from. The headline of each news story is also the link to the story, so finding the headline gives me the link that will ultimately go into the RDF file.
Here's the problem. How do I extract such information from the raw HTML of the source site? I cannot rely on the tag that contains the information, since it may not be unique within the document. Use regexes, you say? Well, I can't really do that either. The HTML of one of the source sites will change at the least opportune moment, and someone will then have to sit down and work out a new regex to extract the information. I don't want that person to be me. It could be absolutely anyone involved in running our Rugby League site. It's a community effort, and experience could range from IT project manager to interested school kid looking for experience or who just wants to help.
My plan thus far is as follows:
Using Perl/Tk I've built a small app which downloads a page and displays the HTML in its tree structure, a bit like the DOM Inspector in Mozilla would. I can look at the page directly in a browser and copy the text of the headline into my application. The application then searches down through the tree from the root to find the relevant node.
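To make that search concrete, here is a rough sketch of the idea. It is not the helper application itself: it is written in Python with only the standard library, and the names `Node`, `TreeBuilder`, `parse` and `find_by_text` are made up for the illustration. It builds a simple tree from the HTML and returns the deepest element whose text, with whitespace collapsed, equals the headline that was pasted in.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class Node:
    """Hypothetical minimal tree node: a tag, its attributes, parent and children."""
    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def full_text(node):
    """All text inside the node, in document order."""
    return "".join(c if isinstance(c, str) else full_text(c) for c in node.children)

def normalise(text):
    return " ".join(text.split())

def find_by_text(node, wanted):
    """Depth-first search for the deepest element whose text equals `wanted`."""
    wanted = normalise(wanted)
    for child in node.children:
        if isinstance(child, Node):
            hit = find_by_text(child, wanted)
            if hit is not None:
                return hit
    if node.tag != "#root" and normalise(full_text(node)) == wanted:
        return node
    return None

# Made-up page, standing in for a downloaded news story.
page = ('<html><body><div class="nav"><a href="/">Home</a></div>'
        '<h2 class="headline"><a href="/story/1">Saints beat Wigan</a></h2>'
        '<br><p>Site created and maintained by someone</p></body></html>')
node = find_by_text(parse(page), "Saints beat  Wigan")
print(node.tag, node.attrs, node.parent.tag)
```

Searching children before testing the node itself means the innermost matching element is returned, which is usually the link rather than the heading that wraps it.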
Thus far that's all this helper application does. What I want to do next is, using that node as a starting point, discover what it is about the parent, siblings and children in the local part of the tree that makes it unique within the document. When that is done I can analyse that particular chunk of tree and devise a rule to describe it, possibly just dumping it out in Newick format or something like it. I can then apply this rule to extract data from all pages of the same type that come from that site. Since these news sites are typically generated from a database, or according to some other rule set, the HTML is at least consistent on a given site even if it is liable to change without warning.
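For the Newick idea, a chunk of tree can be written out as nested parentheses, each followed by the node's label. The sketch below is only an illustration under two assumptions: the fragment is well-formed markup (so Python's standard `xml.etree.ElementTree` can stand in for a real HTML parser), and tag names alone are used as labels. The function name `to_newick` and the sample fragment are invented for the example.

```python
import xml.etree.ElementTree as ET

def to_newick(elem):
    """Serialise a subtree as Newick: children in parentheses, then the node label."""
    kids = list(elem)
    if not kids:
        return elem.tag
    return "(" + ",".join(to_newick(k) for k in kids) + ")" + elem.tag

# Made-up, well-formed fragment around a headline link.
fragment = ET.fromstring(
    '<div><h2><a href="/story/1">Saints beat Wigan</a></h2><p>Report</p></div>'
)
rule = to_newick(fragment) + ";"
print(rule)  # ((a)h2,p)div;
```

A string like this is easy to store per site and to compare against the same region of a freshly downloaded page.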
That's the key: rule creation and subsequent application have to be automatic.
I think I know what to do to analyse the tree and find unique portions of it. I can find the node with the headline we searched for and then look for other instances of it. If there are no other instances then that's my rule right there. If there are then I can look at the parent and siblings and see if other instances of the same node have the same parents and siblings, and so on until I have a unique description.
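A rough sketch of that widening search, under some loudly stated assumptions: it is Python rather than Perl, it uses the standard `xml.etree.ElementTree` parser and so only works on well-formed markup, and it looks at ancestor tags only (siblings and attributes could be added to each step in the same way). The names `parent_map`, `chain`, `matches` and `unique_rule` are hypothetical, and the pages are made up.

```python
import xml.etree.ElementTree as ET

def parent_map(root):
    """ElementTree has no parent pointers, so build a child -> parent lookup."""
    return {child: parent for parent in root.iter() for child in parent}

def chain(node, parents, depth):
    """Tags of the node and up to `depth` ancestors, outermost first."""
    tags = []
    while node is not None and len(tags) <= depth:
        tags.append(node.tag)
        node = parents.get(node)
    return tuple(reversed(tags))

def matches(root, rule):
    """All elements whose tag and ancestor tags agree with the rule."""
    parents = parent_map(root)
    depth = len(rule) - 1
    return [n for n in root.iter(rule[-1]) if chain(n, parents, depth) == rule]

def unique_rule(root, target):
    """Add one ancestor at a time until the description matches only the target."""
    parents = parent_map(root)
    depth = 0
    while True:
        rule = chain(target, parents, depth)
        if len(matches(root, rule)) == 1:
            return rule
        if len(rule) <= depth:  # ran out of ancestors without becoming unique
            return None
        depth += 1

page = ET.fromstring(
    "<html><body>"
    "<div><ul><li><a>Home</a></li><li><a>Fixtures</a></li></ul></div>"
    "<div><h2><a>Saints beat Wigan</a></h2><p><a>Read more</a></p></div>"
    "</body></html>"
)
target = next(a for a in page.iter("a") if a.text == "Saints beat Wigan")
rule = unique_rule(page, target)
print(rule)  # ('h2', 'a')

# The same rule applied to another page with the same layout.
other = ET.fromstring(
    "<html><body>"
    "<div><ul><li><a>Home</a></li></ul></div>"
    "<div><h2><a>Leeds sign prop</a></h2><p><a>Read more</a></p></div>"
    "</body></html>"
)
headline = matches(other, rule)[0].text
print(headline)  # Leeds sign prop
```

When the walk reaches the root without finding a unique description, the function returns `None`, which is the signal that ancestors alone are not enough and siblings or attributes need to be brought in.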
The question is, is it worth the effort? How would you tackle this? It seems incredible to me that Google would do something similar for their news service; their sources must number in the many tens of thousands, and managing them all would probably require significant effort. Whatever they use does make mistakes, even though it mostly seems to get it right. One site in particular (Manchester Evening News) usually comes up with 'Site created and maintained by...' as the headline of the story. Looking at the text of the HTML, it's not falling back on the <TITLE> or anything like that; it is extracting that from some point deep within the HTML tree, as though some rules are guiding it to that point.
Suggestions and hints on how to proceed are very welcome. Thanks for your time.