Framework for News Articles

by smalhotra (Scribe)
on Mar 24, 2004 at 21:44 UTC ( [id://339559] )

smalhotra has asked for the wisdom of the Perl Monks concerning the following question:

I have been working on this project on and off for a couple of months and wanted to share the idea to get some feedback. I do a lot of screen scraping - parsing web pages into Perl data structures. What I have is a simple framework that provides a unified interface to drivers that scrape news articles from websites - kind of like what WWW::Search does for search engines, databases, etc.

Take some news website, say BBC News or The Hindu. I pass the URL to Khabar, which finds the appropriate parser if one is available and gives it the URL to parse. I then get back a data structure that holds basically the same information available on the website, but in a form I can use in my application.

The basic structure I'm using now is title, publisher, date, author, byline, content, category, related articles, related links, embedded images, ad banner URLs and links, etc. I also have a simple module that can output this as RSS 2.0.
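
A rough sketch of how I picture the calling side (none of the names below are final; parse_url, Khabar::RSS and the field keys are just what I'm toying with):

    use Khabar;   # nothing is released yet; all names here are provisional

    # Hand Khabar a URL; it picks the matching driver and returns
    # one article as a plain data structure.
    my $article = Khabar->parse_url('http://news.bbc.co.uk/some/story.stm');

    print "$article->{title}\n";
    print "$article->{publisher}, $article->{date}\n";
    print "$article->{byline}\n\n";
    print "$article->{content}\n";

    # Related material comes back as array refs.
    print "See also: $_\n" for @{ $article->{related_links} };

    # ... and the whole thing can be re-serialized as RSS 2.0.
    print Khabar::RSS->output($article);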

Do the wise monks have any ideas for other projects to look at, design suggestions, potential pitfalls ...? I could also use hints/tips/advice on better screen scraping and parsing. Hopefully in time every Monk can contribute a parser that can read their local news website, and we will no longer be dependent on the Googleopoly for news aggregation.

Replies are listed 'Best First'.
Re: Framework for News Articles
by kvale (Monsignor) on Mar 24, 2004 at 22:20 UTC
    Sounds like a cool project. One potential pitfall is legal. Some sites don't like robots gathering information from their pages automatically, because (1) ads are not seen and (2) automatic collection of info could be a violation of copyright, depending on the use it is put to.

    The solution is to check terms of use on the website, or ask the webmaster.

    -Mark

      Then they should put a robots.txt file on their website, and of course all well-behaved robots check that and honor it.
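
      (On the robot side, LWP::RobotUA takes care of fetching and obeying robots.txt for you; a minimal sketch, with a made-up agent name and URL:)

          # Minimal sketch; the agent name and URL are made up.
          use LWP::RobotUA;

          my $ua = LWP::RobotUA->new('khabar-bot/0.1', 'bot-admin@example.com');
          $ua->delay(1);    # wait at least one minute between requests to a host

          my $response = $ua->get('http://www.example.com/news/');
          print $response->is_success ? $response->content : $response->status_line;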

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        While robots.txt is an established standard for regulating the behavior of robots, the non-existence of a robots.txt is not a license to violate copyright. Many sites want robots from Google and other search engines to index their pages, but they don't want any random person scraping their content and putting it up on another site. You could make a case for downloading content for personal use, but there are definitely gray areas out there.

        The moral of the story is that the legality definitely depends on the use of downloaded information.

Re: Framework for News Articles
by ryantate (Friar) on Mar 25, 2004 at 00:19 UTC

    With WWW::Search, sites are updated or added by writing a subclass module, which is then, ideally, distributed through CPAN. That seems to me like too much friction to keep up with changes in the particulars of individual sites in an effective way.

    I am not sure I have a better idea. But one approach I have thought of is to have a base class/module that reads in site-specific data through simple XML files. The XML would contain metadata about the site including, at heart, one or more Perl5 regexes. Another key piece of information in the XML would be one or more URLs for updating the file when it seems to be out of date.

    The advantage to this scheme is that, since there is support for Perl5 regexes outside of Perl5, the XML files could be used in other applications, for example a Windows-based aggregator. Also, the update URL(s) allow for more rapid correction when the site information changes.

    Finally, because site descriptor files could be created with Perl5 regular expressions and a few pieces of information about the site, there is potentially a wider audience of authors than on CPAN. (Especially if someone created a Web service that made creating or updating a site descriptor file as easy as filling out a Web form.)

    The disadvantage to such a scheme, of course, is that it relies heavily on regexes to extract data, which can be less efficient, less reliable and less powerful than proper parsing.
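
    To make the idea concrete, a descriptor and the code that applies it might look something like this (the format, field names and URLs are all invented on the spot):

        # Sketch only: the descriptor format, site and URLs are made up.
        use LWP::Simple;
        use XML::Simple;

        # One site descriptor: per-field Perl5 regexes plus an update URL.
        my $descriptor = q{
          <site name="Example Daily"
                update="http://www.example.com/descriptors/example-daily.xml">
            <field name="title"><![CDATA[<h1 class="headline">([^<]+)</h1>]]></field>
            <field name="byline"><![CDATA[<p class="byline">([^<]+)</p>]]></field>
          </site>
        };

        my $site = XMLin($descriptor, ForceArray => ['field'], KeyAttr => []);
        my $html = get('http://www.example.com/news/12345.html');

        # The base module just walks the fields and applies each regex.
        my %article;
        for my $field (@{ $site->{field} }) {
            ($article{ $field->{name} }) = $html =~ /$field->{content}/s;
        }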

Re: Framework for News Articles
by Jaap (Curate) on Mar 24, 2004 at 22:38 UTC
    - Are you going to make it an open source project?
    - Do you already have it online somewhere?
    - Is it merely for archiving or also for current news?
    - Do you have some docs on it?

    Just a few questions that come to mind reading your post.
    Also, what is it exactly that you want to know from us?

      1. Yes. It will be on CPAN, and I would suggest individual contributors do the same.
      2. No.
      3. The idea is to convert a human-readable page back into something a computer can understand. What you do after that is up to you. Once the framework and parsers are developed, perhaps an aggregation system can be built around it. But that's much later.
      4. No.

      I am already writing this. What I am looking for is maybe similar projects, design suggestions, etc. before I put something up on CPAN. It never hurts to bounce ideas off smart people.

Re: Framework for News Articles
by artist (Parson) on Mar 24, 2004 at 23:11 UTC
    There exist thousands of news websites on the Internet. The problem is that many of them change over time: the structure changes constantly, and it is not always easy to write a parser that gets the appropriate information. There are several concerns. How do you find which items are the 'new' news? What do you do when a story spans several pages? How do you find where the links to the stories are? How do you know which images are advertisements or logos, and which is the actual image for the story?

    Answering all these questions takes time and a careful study of each website. The question can be answered better by monks who have done some work in "NewsSearch". For example, Google searches 4000+ news sources, and not all of them have RSS, so some mechanism must have been employed ..


      You've just hit on the difference between a programming interface and an implementation of that interface. While the specific details of a site's design may change over time, there are certain characteristics inherent to the data that should always exist. A well-designed solution for scraping news articles would have some sort of API that defines common attributes of all articles, and a set of site-specific implementations of this API. If a single site changes its structure, then just change the implementation while keeping the API consistent.

      For example, smalhotra discusses "title, publisher, date, author, byline, content, category, related articles, related links, embedded images" as general classes of information that a news article could have.
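
      To make that concrete, here is a bare-bones sketch of what the split might look like (all package, method and field names here are invented):

          # Invented names; only meant to illustrate the API/implementation split.
          package News::Article;

          # The API: every driver must return an object that can answer these
          # questions, no matter how the underlying site is laid out.
          sub new     { my ($class, %f) = @_; return bless {%f}, $class }
          sub title   { $_[0]->{title} }
          sub byline  { $_[0]->{byline} }
          sub content { $_[0]->{content} }

          package News::Source::ExampleDaily;

          # The implementation: turn this one site's HTML into a News::Article.
          # When example.com redesigns its pages, only this sub has to change.
          sub parse {
              my ($class, $html) = @_;
              my ($title)  = $html =~ m{<h1[^>]*>(.*?)</h1>}s;
              my ($byline) = $html =~ m{class="byline"[^>]*>(.*?)<}s;
              return News::Article->new(title => $title, byline => $byline);
          }

          1;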

        Yep,
        smalhotra asked about screen scraping and parsing of HTML. That's the implementation, and that's not easy. Defining the API is a trivial task compared to the implementation here.
Re: Framework for News Articles
by smalhotra (Scribe) on Mar 25, 2004 at 00:36 UTC
    There seems to be some confusion about the intentions of this project; I apologize for not being clear. Thanks for raising these issues. They help clear up ambiguities as well as implementation problems (copyrights or changing websites).

    1. Khabar itself is not a crawler or an aggregator. Given a page with certain content, it parses that page into data you can use for whatever purpose. It separates article-specific content from things like page headers, menus, etc. What you do with the data is up to you, in accordance with the site's usage license. You could use Khabar to read pages found by a crawler or aggregator.
    2. In most cases downloading this content for personal use is fair. It is really no different from Finance::Quote.
    3. Dealing with page structure changes is up to the person who writes the parser. Good idea; perhaps it should be suggested that they write tests to ensure the parser/format is still valid (see the sketch after this list).
    4. I suggest that the parsers return any advertisements they can read from the page as part of fair use. The person using the data can decide what to do with them. In general, the more details you can accurately parse out, the better.
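
    For the tests in point 3, I'm imagining something along these lines (Khabar::BBC and the fixture file are placeholders; nothing is written yet):

        # Placeholder names throughout; Khabar::BBC does not exist yet.
        use Test::More tests => 3;
        use Khabar::BBC;

        # Parse a saved copy of a known article so the test is repeatable.
        open my $fh, '<', 't/pages/bbc-sample.html' or die "cannot read fixture: $!";
        my $html = do { local $/; <$fh> };

        my $article = Khabar::BBC->parse($html);

        ok( defined $article, 'parser returned something' );
        is( $article->{title}, 'Example headline', 'title still extracted' );
        like( $article->{content}, qr/\S/, 'content is non-empty' );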

    Keep it coming ...
      at the mercy of change

      My problem is that I want data, not *pretty web pages* - raw data in a feed format that I can process. I'm pretty much getting the results you are looking for now, but without beating my head against having to parse HTML with all its problems: namely, you are at the mercy of a web designer's whim to change the layout.

      use rdf, rss or pda feeds

      So I avoid HTML. I'm lazy. I look for the RSS, RDF or PDA pages, point my spider at them, and dump them in a directory for later parsing. Most news sites have RSS feeds (though my local newspaper, The Age, supplies RSS feeds for a fee but produces a lite page for PDAs), so some parsing is still necessary.

      Now suppose I want to parse a page (in Perl): why wouldn't I use Andy Lester's fine WWW::Mechanize? (WWW::Mechanize article)
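
      Something along these lines does most of what I want (the feed URL and agent string are only examples):

          # Rough sketch; the feed URL and agent string are made up.
          use WWW::Mechanize;
          use XML::RSS;

          my $mech = WWW::Mechanize->new( agent => 'lazy-news-spider/0.1' );
          $mech->get('http://www.example.com/news/index.rss');

          # Let XML::RSS do the parsing instead of fighting with HTML.
          my $rss = XML::RSS->new;
          $rss->parse( $mech->content );

          for my $item ( @{ $rss->{items} } ) {
              print "$item->{title}\n  $item->{link}\n";
          }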

      questions, questions, devil's advocate

      I'm not actually knocking the idea.

      • does an existing CPAN module exist that does a subset already?
      • could you build upon such a module?
          I ask this for two reasons. The first is that RSS feeds and web APIs are gaining traction. The second is that tools for quick hacks already exist. Take this example of Andy's hack to get and sort the Perl Haiku results.
      • is the intention to build it to scratch your itch or solve a generic problem?
      • if you are using data structures to store the data, could you investigate using/supporting YAML (for multi-language support)? See the snippet after this list.
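
      (Just to show what I mean by the YAML point; the fields are only an example:)

          # Only an example of the idea: dump the parsed article as YAML
          # so non-Perl tools can read it too.
          use YAML;

          my $article = {
              title   => 'Example headline',
              byline  => 'A. Reporter',
              content => 'Body text of the story ...',
          };

          print Dump($article);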

      Now you may say, "goon, you're an idiot, be quiet." But ...

      -1 is what you get for having the update button near the vote button :(
Re: Framework for News Articles
by pg (Canon) on Mar 25, 2004 at 04:32 UTC

    My thoughts come at this issue from a different angle.

    This kind of project will always be a bit of a pain. The biggest concern is that it is very difficult to make your application stable, as the way the content is organized can change at any time, and parsing HTML is not a very meaningful way of extracting data. The value of this sort of software is quite limited.

    Looking at the big picture, the right way to solve this problem is not to attempt it on the receiver side, but to have the web content organized in a more structured way, with data separated from presentation more clearly, if the content provider wishes.

    From a technical point of view, XML islands would be the best choice I can see at this time. In this sense, web content is RENDERED through HTML, but the "data" (the news here) is presented as an XML island.

    If you are interested in XML islands, just google the term.
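
    For instance, if a provider shipped the story inside the page as an island, the receiving side would only need something like this (the markup and URL are made up):

        # Made-up example: assumes the page carries an island such as
        #   <xml id="story"><article><title>...</title><byline>...</byline></article></xml>
        use LWP::Simple;
        use XML::Simple;

        my $html = get('http://www.example.com/news/12345.html');

        # Lift the island out of the rendered page, then parse it as real XML.
        my ($island) = $html =~ m{<xml[^>]*id="story"[^>]*>(.*?)</xml>}si;
        my $story    = XMLin($island);

        print "$story->{title}\n$story->{byline}\n";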
