smalhotra has asked for the wisdom of the Perl Monks concerning the following question:
I have been working on this project on and off for a couple of months and wanted to share the idea to receive some feedback. I do a lot of screen scraping, parsing web pages into Perl data structures. So what I have is a simple framework that provides a unified interface to drivers that scrape news articles from websites, kind of like what WWW::Search does for search engines, databases, etc.
Take some news website, say BBC News or The Hindu. I pass the URL to Khabar, which finds the appropriate parser, if available, and gives it the URL to parse. Then I can get back a data structure that's got basically the same information available on the website, but in a form I could use in my application.
The basic structure I'm using now is title, publisher, date, author, byline, content, category, related articles, related links, embedded images, ad banner URLs and links, etc. I also have a simple module that can output this as RSS 2.0.
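As a rough illustration, here is what such a structure and the RSS 2.0 output could look like. The field names are taken from the list above, but the hash layout, sample values, and `article_to_rss_item` function are invented for this sketch; a real module would likely use XML::RSS rather than printing tags by hand.

```perl
use strict;
use warnings;

# One parsed article, using the fields listed in the post. The exact
# layout is illustrative -- the real Khabar structure may differ.
my $article = {
    title     => 'Example headline',
    publisher => 'BBC News',
    date      => 'Thu, 25 Mar 2004 00:00:00 GMT',
    author    => 'A. Reporter',
    content   => 'Body text of the story...',
    category  => 'World',
    related   => [ 'http://example.com/related1' ],
    images    => [ 'http://example.com/img/photo.jpg' ],
};

# Minimal hand-rolled RSS 2.0 item; real code would use XML::RSS.
sub article_to_rss_item {
    my ($art) = @_;
    return join '',
        "<item>\n",
        "  <title>$art->{title}</title>\n",
        "  <pubDate>$art->{date}</pubDate>\n",
        "  <category>$art->{category}</category>\n",
        "  <description>$art->{content}</description>\n",
        "</item>\n";
}

print article_to_rss_item($article);
```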
Do the wise monks have any ideas on other projects to look at, design suggestions, potential pitfalls ...? I could also use hints/tips/advice on better screen scraping/parsing. Hopefully in time every monk can contribute a parser that can read their local news website, and we will no longer be dependent on the Googleopoly for news aggregation.
Re: Framework for News Articles
by kvale (Monsignor) on Mar 24, 2004 at 22:20 UTC
Sounds like a cool project. One potential pitfall is legal. Some sites don't like robots gathering information from their pages automatically, because (1) ads are not seen and (2) automatic collection of info could be a violation of copyright, depending on the use it is put to.
The solution is to check terms of use on the website, or ask the webmaster.
Then they should put a robots.txt file on their website; of course all well-behaved robots check that and obey it.
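As a sketch, honoring that file takes only a few lines. The hand-rolled check below is illustrative only (it covers just the `User-agent: *` block and prefix matching); production code should use WWW::RobotRules from CPAN, which implements the exclusion standard properly.

```perl
use strict;
use warnings;

# Collect the Disallow paths that apply to all robots (User-agent: *).
# A deliberately minimal sketch -- use WWW::RobotRules in real code.
sub disallowed_paths {
    my ($robots_txt) = @_;
    my (@paths, $applies);
    for my $line (split /\n/, $robots_txt) {
        if    ($line =~ /^User-agent:\s*\*/i) { $applies = 1 }
        elsif ($line =~ /^User-agent:/i)      { $applies = 0 }
        elsif ($applies && $line =~ /^Disallow:\s*(\S+)/i) {
            push @paths, $1;
        }
    }
    return @paths;
}

# A path is allowed unless it starts with a disallowed prefix.
sub allowed {
    my ($path, @disallowed) = @_;
    for my $prefix (@disallowed) {
        return 0 if index($path, $prefix) == 0;
    }
    return 1;
}

my $robots = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /private/\n";
my @deny   = disallowed_paths($robots);
print allowed('/news/story.html',    @deny) ? "fetch\n" : "skip\n";
print allowed('/private/draft.html', @deny) ? "fetch\n" : "skip\n";
```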
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Framework for News Articles
by ryantate (Friar) on Mar 25, 2004 at 00:19 UTC
With WWW::Search, sites are updated or added by writing a subclass module, which is then, ideally, distributed through CPAN. This seems to me like too much friction to keep up with changes in the particulars of individual sites in an effective way.
I am not sure I have a better idea. But one approach I have thought of is to have a base class/module that reads in site-specific data through simple XML files. The XML would contain meta-data about the site including, at heart, one or more Perl5 regexes. Another key piece of information in the XML would be one or more URLs for updating the file when it seems to be out of date.
The advantage to this scheme is that, since there is support for Perl5 regexes outside of Perl5, the XML files could be used in other applications, for example a Windows-based aggregator. Also, the update URL(s) allow for more rapid correction when the site information changes.
Finally, because site descriptor files could be created with Perl5 regular expressions and a few pieces of information about the site, there is potentially a wider audience of authors than on CPAN. (Especially if someone created a Web service that made creating or updating a site descriptor file as easy as filling out a Web form.)
The disadvantage to such a scheme, of course, is that it relies heavily on regexes to extract data, which can be less efficient, less reliable and less powerful than proper parsing.
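A sketch of what such a descriptor and its use might look like. The element names, the `field` helper, and the sample page are all invented for illustration, and the regex-based XML reading plus entity un-escaping below stand in for a proper XML parser such as XML::Simple.

```perl
use strict;
use warnings;

# A hypothetical site-descriptor file: metadata plus a Perl5 regex,
# kept in XML so non-Perl tools could reuse it. Element names invented.
my $descriptor_xml = <<'XML';
<site>
  <name>Example News</name>
  <update-url>http://example.com/descriptors/example-news.xml</update-url>
  <title-regex>&lt;h1 class="headline"&gt;(.*?)&lt;/h1&gt;</title-regex>
</site>
XML

# Crude one-tag extractor; real code would use an XML parser instead.
sub field {
    my ($xml, $tag) = @_;
    return $xml =~ m{<\Q$tag\E>(.*?)</\Q$tag\E>}s ? $1 : undef;
}

my $title_re = field($descriptor_xml, 'title-regex');
# Un-escape the XML entities so the stored regex can be applied.
$title_re =~ s/&lt;/</g;
$title_re =~ s/&gt;/>/g;

my $page    = '<h1 class="headline">Monks Release Framework</h1>';
my ($title) = $page =~ /$title_re/;
print "$title\n";
```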
Re: Framework for News Articles
by Jaap (Curate) on Mar 24, 2004 at 22:38 UTC
- Are you going to make it an open source project?
- Do you already have it online somewhere?
- Is it merely for archiving or also for current news?
- Do you have some docs on it?
Just a few questions that come to mind reading your post.
Also, what is it exactly that you want to know from us?
Re: Framework for News Articles
by artist (Parson) on Mar 24, 2004 at 23:11 UTC
There exist thousands of news websites on the Internet. The problem is that many of them change over time: the structure changes constantly, and it is not always easy to write a parser that gets the appropriate information. There are several concerns. How do you find which news is 'new'? What do you do when a story spans several pages? How do you find where the links to the stories are? How do you tell the actual image for a story apart from advertisements and logos?
Answering all these questions takes time and careful study of each website. The question can be better answered by the monks who have done some work in news search. For example, Google searches 4000+ news sources, and not all of them have RSS. Some mechanism must have been employed ..
There exist thousands of news websites on the Internet. The problem is that many of them change over time: the structure changes constantly, and it is not always easy to write a parser that gets the appropriate information. There are several concerns. How do you find which news is 'new'? What do you do when a story spans several pages? How do you find where the links to the stories are? How do you tell the actual image for a story apart from advertisements and logos?
You've just hit on the difference between a programming interface and an implementation of that interface. While the specific details of a site's design may change over time, there are certain characteristics inherent to the data that should always exist. A well-designed solution for scraping news articles would have some sort of API that defines common attributes of all articles, and a set of site-specific implementations of this API. If a single site changes its structure, then just change the implementation while keeping the API consistent.
For example, smalhotra discusses "title, publisher, date, author, byline, content, category, related articles, related links, embedded images" as general classes of information that a news article could have.
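The split described above can be sketched as a base class that fixes the API and a per-site subclass that supplies the implementation. The class names, methods, and the trivial title regex here are invented for illustration, not Khabar's actual interface.

```perl
use strict;
use warnings;

# Abstract interface: every site parser promises a parse() that
# returns the same shape of data.
package NewsParser;
sub new   { my $class = shift; return bless {}, $class }
sub parse { die "subclass must implement parse()" }

# One site-specific implementation. When the site's layout changes,
# only this class changes; the API stays the same.
package NewsParser::ExampleSite;
our @ISA = ('NewsParser');
sub parse {
    my ($self, $html) = @_;
    my ($title) = $html =~ m{<title>(.*?)</title>}s;
    return { title => $title, publisher => 'Example Site' };
}

package main;
my $parser  = NewsParser::ExampleSite->new;
my $article = $parser->parse('<title>Hello</title>');
print "$article->{title}\n";
```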
Yep, smalhotra asked about screen scraping/parsing of HTML.
That's implementation, and that's not easy. Defining the API is a trivial task compared to the implementation here.
Re: Framework for News Articles
by smalhotra (Scribe) on Mar 25, 2004 at 00:36 UTC
There seems to be some confusion on the intentions of this project; I apologize for not being clear. Thanks for raising these issues. They help clear up ambiguities as well as implementation problems (copyrights or changing websites).
1. Khabar itself is not a crawler or an aggregator. Given a page from a website, it parses the content into data you could use for whatever reason. It separates article-specific content from things like page headers, menus, etc. What you do with the data is up to you, in accordance with the site's usage license. You could use Khabar to read pages found by a crawler or aggregator.
2. In most cases downloading this content for personal use is fair. It is really no different from Finance::Quote.
3. Dealing with page structure changes is up to the person who writes the parser. Good idea; perhaps it should be suggested that they write tests to ensure the parser/format is still valid. I like the XML idea, but it's perhaps too much for the first version.
4. I suggest that the parsers return any advertisements they can read from the page as part of fair use. The person using the data can decide what to do with it. In general, the more details you can accurately parse out, the better.
Keep it coming ...
at the mercy of change
My problem is that I want data, not *pretty web pages*: raw data in a feed format that I can process. I'm pretty much getting the results you are looking for now, but without beating my head against parsing HTML with all its problems: namely, you are at the mercy of a web designer's whim to change the layout.
use rdf, rss or pda feeds
So I avoid HTML. I'm lazy. I look for the RSS, RDF, or PDA HTML pages, point my spider at them, and dump them in a directory for later parsing. Most news sites have RSS feeds (though my local newspaper, The Age, supplies RSS feeds for a fee but produces a lite page for PDAs), so some parsing is still necessary.
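For simple feeds, pulling the headlines out of a dumped file takes only a few lines. The regex approach below is a sketch over an invented sample feed; a real application should use XML::RSS or XML::Simple, which handle entities, CDATA, and other edge cases properly.

```perl
use strict;
use warnings;

# A tiny sample RSS 2.0 feed, inlined here in place of a fetched file.
my $feed = <<'RSS';
<rss version="2.0"><channel>
  <item><title>First story</title><link>http://example.com/1</link></item>
  <item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>
RSS

# Grab each item's title in document order. Regex parsing of XML is
# fragile -- fine for a sketch, wrong for production.
my @titles;
while ($feed =~ m{<item>.*?<title>(.*?)</title>}gs) {
    push @titles, $1;
}
print "$_\n" for @titles;
```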
Now suppose I want to parse a page (in Perl): why wouldn't I use Andy Lester's fine WWW::Mechanize? (WWW::Mechanize article).
questions, questions, devils advocate
I'm not actually knocking the idea.
- is there an existing CPAN module that does a subset of this already?
- could you build upon such a module?
I ask this for 2 reasons. The first is that the idea of RSS feeds and web APIs is gaining traction. The second is that for quick hacks, tools already exist. Take this example of Andy's hack to get and sort the Perl Haiku results.
- is the intention to build it to scratch your itch or solve a generic problem?
- if you are using data structures to store the data, could you investigate using/supporting YAML (for multi-language support)?
Now you may say, "goon, you're an idiot, be quiet." But ...
-1 is what you get for having the update button near the vote button :(
Re: Framework for News Articles
by pg (Canon) on Mar 25, 2004 at 04:32 UTC
My thoughts approach this issue from a different side.
This kind of project will always be a kind of pain. The biggest concern is really that it is too difficult to make your application stable, as the way the content is organized can change at any time, and parsing HTML is not a very reliable way of extracting data. The value of this sort of software is quite limited.
Looking at the big picture, the right way to resolve this problem is not to attempt it on the receiver side, but to have the web content organized in a more structured way, with data separated from presentation more clearly, if the content provider wishes.
From a technical point of view, XML islands would be the best choice I can see at this time. In this sense, web content is RENDERED through HTML, but the "data" (the news here) is presented as an XML island.
If you are interested in XML island, just google it.
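A sketch of the idea: the page renders as ordinary HTML, but the news data rides along inside an embedded XML block that a consumer can pick out while ignoring the presentation entirely. The markup, the `id` attribute, and the extraction regex below are invented for illustration.

```perl
use strict;
use warnings;

# HTML page carrying its news data as an embedded XML island
# (illustrative markup -- IE-era islands used an <xml> element).
my $page = <<'HTML';
<html><body>
<h1>Pretty rendered headline</h1>
<xml id="article-data">
  <article>
    <title>Monks Release Framework</title>
    <date>2004-03-25</date>
  </article>
</xml>
</body></html>
HTML

# Pull out the island, then read fields from it without ever
# touching the presentational HTML around it.
my ($island) = $page   =~ m{<xml id="article-data">(.*?)</xml>}s;
my ($title)  = $island =~ m{<title>(.*?)</title>}s;
print "$title\n";
```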