http://www.perlmonks.org?node_id=467819

Here's the thing: I've been looking at playing around with some basic website spidering/parsing for interesting things. However, I only want to parse the actual content, and most of the sites I'm looking at (typically news-type sites: reuters, theregister, that sort of thing) have lots of generic content on each page, such as titles, menus and so forth.

So the question is: how do you ensure you get just the actual content and not the layout? I don't think there's any real definitive answer for this sort of thing, which is why I'm posting in meditations, but perhaps someone will surprise me.

The only solution I've come up with so far is try to visit multiple pages, format them all the same and then diff them to find the common features, but this sounds a tad buggy and hard to do. Anyone have a better idea?

Re: How would you extract *content* from websites?
by Ovid (Cardinal) on Jun 17, 2005 at 18:29 UTC

    Barring something useful like RSS feeds, you're going to have to do this on a site-by-site basis. Ideally, when your spider visits a site, it should load the rules for parsing that site. Maybe subclasses that override a &content method would be appropriate.
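
    Something like the following, perhaps. This is just a minimal sketch of the subclass idea; the class names, the &content override, and the <div id="body"> rule are all made up for illustration.

        package Spider::Site;                      # generic base class
        sub new     { my ($class, %args) = @_; bless { %args }, $class }
        sub content { die "subclass must override content()\n" }

        package Spider::Site::TheRegister;         # one subclass per site
        our @ISA = ('Spider::Site');
        use HTML::TreeBuilder;

        sub content {
            my ($self, $html) = @_;
            my $tree = HTML::TreeBuilder->new_from_content($html);
            # site-specific rule: pretend the article body lives in <div id="body">
            my $div  = $tree->look_down( _tag => 'div', id => 'body' );
            my $text = $div ? $div->as_text : '';
            $tree->delete;
            return $text;
        }

        1;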

    Regrettably, I do a lot of work like this and it's easier said than done. One thing which can help is looking for "printer friendly" links. Those often lead to a page that strips a lot of the extraneous information off.

    Cheers,
    Ovid

    New address of my CGI Course.

Re: How would you extract *content* from websites?
by kirbyk (Friar) on Jun 17, 2005 at 18:22 UTC
    One tip: many news sites these days have RSS feeds, if not directly from them then from someone like Yahoo. I'm sure you can get your Reuters through there. An RSS feed is exactly what you want: content without layout.
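
    For example, pulling items out of a feed takes only a few lines with XML::RSS (the feed URL below is just a placeholder):

        use LWP::Simple qw(get);
        use XML::RSS;

        my $xml = get('http://example.com/news.rss') or die "couldn't fetch feed";
        my $rss = XML::RSS->new;
        $rss->parse($xml);

        for my $item ( @{ $rss->{items} } ) {
            print "$item->{title}\n$item->{description}\n\n";
        }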

    For anything else, a solution is going to be specific to the site, and only good until they change their design. A lot of work. I don't see a way around it.

    Good luck!

    -- Kirby, WhitePages.com

Re: How would you extract *content* from websites?
by idsfa (Vicar) on Jun 17, 2005 at 18:19 UTC

    HTML::Strip, for example?

    use HTML::Strip;

    my $hs         = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;

    The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. -- Cyrus H. Gordon

      The problem is that this is going to leave a lot of "non content" data such as menu link names, possible advertising text, etc. While it's a very poor guide, HTML can serve as "metadata" that allows you to navigate to the actual content. Remove that before getting to your content and the spider won't be able to make intelligent decisions.

      Cheers,
      Ovid

      New address of my CGI Course.

Re: How would you extract *content* from websites?
by TedPride (Priest) on Jun 17, 2005 at 18:21 UTC
    Strip everything outside the BODY tags. Remove all remaining tags, replacing images with their alt text. Then compare the start and end of each page to every other page, and remove material common between x number of pages that's more than x number of words in length (or some combination of the two). This will be the header and footer material.
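
    As a rough sketch of that trimming step (assuming the two pages are already tag-stripped text; the 10-word threshold is arbitrary):

        sub strip_common_edges {
            my ($page_a, $page_b, $min_words) = @_;
            $min_words = 10 unless defined $min_words;
            my @a = split /\s+/, $page_a;
            my @b = split /\s+/, $page_b;

            # shared header: count identical leading words
            my $prefix = 0;
            $prefix++ while $prefix < @a && $prefix < @b && $a[$prefix] eq $b[$prefix];

            # shared footer: count identical trailing words, without overlapping the prefix
            my $suffix = 0;
            $suffix++ while $suffix < @a - $prefix
                         && $suffix < @b - $prefix
                         && $a[-1 - $suffix] eq $b[-1 - $suffix];

            # only treat long runs as boilerplate
            $prefix = 0 if $prefix < $min_words;
            $suffix = 0 if $suffix < $min_words;

            return join ' ', @a[ $prefix .. $#a - $suffix ];
        }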

    What's left is the classic "longest substrings common between two pieces of text" problem. There was a discussion of that recently - let me see if I can find the thread...

Re: How would you extract *content* from websites?
by TedPride (Priest) on Jun 17, 2005 at 18:46 UTC
    No, the whole idea is that he wants to automatically separate content from material repeated between pages - headers, footers, menus, etc. A proper solution won't care if the design was changed, so long as it has a sufficient number of recent pages to work from.

    Here's the node I was talking about:
    Imploding URLs

    The connection may not be readily apparent, but the problem is essentially the same, only on a much larger scale. You can probably speed comparisons up some by storing all the words in an array and then converting each word to a value corresponding to its subscript; you should only need two bytes per word. You can also speed things up by doing detailed comparisons only between pages that haven't had their common material determined yet. If page A and page B have common material x, and page C also has all of x, then you can be pretty sure that it doesn't need to be checked. And you can speed things up by starting comparisons with pages closest in location to the current page - differences in query string first, then in page name, then a single folder, and so on.
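
    The word-to-subscript trick might look something like this (a hedged sketch; it assumes fewer than 65,536 distinct words so each fits in two bytes):

        my %word_id;
        my @id_word;

        sub intern_words {
            my ($text) = @_;
            my @ids;
            for my $w ( split /\s+/, $text ) {
                unless ( exists $word_id{$w} ) {
                    $word_id{$w} = scalar @id_word;   # next free subscript
                    push @id_word, $w;
                }
                push @ids, $word_id{$w};
            }
            # pack the subscripts as 16-bit values -- two bytes per word
            return pack 'v*', @ids;
        }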

Re: How would you extract *content* from websites?
by Your Mother (Archbishop) on Jun 17, 2005 at 19:23 UTC

    The diff thing is error-prone on lots of sites because ads are randomized and menus often change from page to page, even if only by a single link. Ovid made some good points. Another thing I've relied on when doing this kind of thing is that content has entirely different semantics from navigation and junk.

    An article will be made of sentences - not just one or two, but a dozen or more. Ads and navigation will rarely be complete sentences and never more than one or two. I had pretty good success with this strategy building a news/story fetcher 3 years ago for sites without RSS. Plain text --> lines --> filter out everything but contiguous blocks of sentences --> choose the largest remaining item.
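
    A rough sketch of the last two steps of that pipeline (the sentence regex is deliberately crude and just an assumption about what "looks like a sentence"):

        sub largest_sentence_block {
            my ($plain_text) = @_;
            my @blocks = split /\n{2,}/, $plain_text;   # paragraph-ish blocks
            my ( $best, $best_count ) = ( '', 0 );
            for my $block (@blocks) {
                # count runs that look like sentences: capital letter ... terminal punctuation
                my $count = () = $block =~ /[A-Z][^.!?]{10,}[.!?]/g;
                ( $best, $best_count ) = ( $block, $count ) if $count > $best_count;
            }
            return $best;
        }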

Re: How would you extract *content* from websites?
by Popcorn Dave (Abbot) on Jun 18, 2005 at 03:03 UTC
    If I understand your question correctly, you may be able to look for comment tags and grab what's between those.

    A few years ago I wrote a program to parse headlines from newspapers - this was pre-RSS - and I initially got it down to 9 rules for 25 papers. Of course I was still learning Perl at that point, but I had to go through every web page and find the similarities, much like you're talking about. So what I ended up doing was building a config file with a start and end marker for every web page I was looking at, and parsed my info from there.

    I've since gone back and reworked the program using HTML::TokeParser, which made the job a whole lot easier. I did have the advantage that the papers I was looking at were from the same news organizations, just in different towns, so a lot of the layouts were the same. I still use the config file, though.
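
    The config-file approach can be as simple as this sketch (the file name, the pipe-delimited format and the marker strings are all invented for illustration):

        # sites.cfg: one line per site, e.g.  theregister|<!-- begin story -->|<!-- end story -->
        my %markers;
        open my $cfg, '<', 'sites.cfg' or die "sites.cfg: $!";
        while (<$cfg>) {
            chomp;
            my ( $site, $start, $end ) = split /\|/, $_, 3;
            $markers{$site} = [ $start, $end ];
        }
        close $cfg;

        sub extract_between {
            my ( $site, $html ) = @_;
            my ( $start, $end ) = @{ $markers{$site} };
            return $1 if $html =~ /\Q$start\E(.*?)\Q$end\E/s;
            return '';
        }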

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
      That sounds reasonable, but how do you programmatically determine the starting and ending comments?
        Boy, I wish I knew the answer to that. Like I said, I looked at the page layouts of the web sites I was after and built my config file with the comments to look for - starting and ending.

        Along the lines of what you're after, I suppose you could just parse for comments and build a list of comment tags to look for. You had mentioned doing a diff on the files you wanted to look at, so that may be the way to start.
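
        Collecting the comments on a page is straightforward with HTML::TokeParser (here $html is assumed to already hold the fetched page source):

            use HTML::TokeParser;

            my $p = HTML::TokeParser->new( \$html ) or die "couldn't parse page";
            my @comments;
            while ( my $token = $p->get_token ) {
                push @comments, $token->[1] if $token->[0] eq 'C';   # comment tokens
            }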

        Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: How would you extract *content* from websites?
by polettix (Vicar) on Jun 17, 2005 at 23:18 UTC
    You may find some inspiration looking at the Road Runner project; I haven't worked on the project, but it could be an idea for my final thesis. There are some papers, and the system prototype is implemented in Java, but... TMTOWTDI, and one way is surely Perl.

    Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')

    Don't fool yourself.
Re: How would you extract *content* from websites?
by ambrus (Abbot) on Jun 17, 2005 at 21:55 UTC

    I don't know, but if you figure out a good way to do that, you could earn a fortune. Search engine companies are working on this question too.

Re: How would you extract *content* from websites?
by artist (Parson) on Jun 17, 2005 at 20:09 UTC
    Convince them to provide RSS feeds.
Re: How would you extract *content* from websites?
by kaif (Friar) on Jun 21, 2005 at 07:52 UTC

    This is a problem I've thought a lot about and written many programs to do on a site-by-site basis. Although I haven't really come up with a good solution (and there probably isn't one), I currently scrape websites looking for images. Depending on how you look at it, this can be a considerably harder or easier thing. Basically, to decide which image on a given page is the "most interesting", I look at the filename (and host, to see if they match), the size (filtering out common ad sizes), and the placement on the page (in my experience, on a page that has only one "useful" image, it's likely to be at the end, since all the ads are up front).
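
    A scoring heuristic along those lines might look like this sketch (the ad sizes, weights, and field names are illustrative guesses, not the values I actually use):

        my %ad_sizes = map { $_ => 1 } qw( 468x60 728x90 300x250 120x600 );

        sub score_image {
            my (%img) = @_;   # src, width, height, position (0..1), page_host
            my $score = 0;

            # same host (or a relative URL) suggests it isn't a third-party ad
            my ($img_host) = $img{src} =~ m{^https?://([^/]+)}i;
            $score += 2 if !defined $img_host || lc $img_host eq lc $img{page_host};

            # common banner/ad dimensions count against it
            my $size = ( $img{width} || 0 ) . 'x' . ( $img{height} || 0 );
            $score -= 5 if $ad_sizes{$size};

            # images late in the page tend to be the "useful" one
            $score += 1 if defined $img{position} && $img{position} > 0.5;

            return $score;
        }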