Re: How would you extract *content* from websites?
by Ovid (Cardinal) on Jun 17, 2005 at 18:29 UTC
Barring something useful like RSS feeds, you're going to have to do this on a site-by-site basis. Ideally, when your spider visits a site, it should load the rules for parsing that site. Maybe subclasses that override a &content method would be appropriate.
Regrettably, I do a lot of work like this, and it's easier said than done. One thing that can help is looking for "printer friendly" links; those often lead to a page with much of the extraneous material already stripped away.
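A minimal sketch of that subclass idea (the class names, the &content method, and the comment markers are all made up for illustration):

```perl
use strict;
use warnings;

package Spider::Site;
sub new { my ($class, %args) = @_; return bless { %args }, $class }
# Generic fallback: no site-specific rules, so return the raw HTML untouched.
sub content { my ($self, $html) = @_; return $html }

package Spider::Site::Example;
our @ISA = ('Spider::Site');
# Site-specific override: this (hypothetical) site wraps its article
# in <!-- begin story --> ... <!-- end story --> markers.
sub content {
    my ($self, $html) = @_;
    return $1 if $html =~ /<!-- begin story -->(.*?)<!-- end story -->/s;
    return $self->SUPER::content($html);
}

package main;
my $spider = Spider::Site::Example->new;
my $html   = '<html><!-- begin story -->The article text.<!-- end story --></html>';
print $spider->content($html), "\n";   # prints "The article text."
```

The spider only has to ask for the right subclass per hostname; everything site-specific stays in the override.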
Re: How would you extract *content* from websites?
by kirbyk (Friar) on Jun 17, 2005 at 18:22 UTC
One tip: many news sites these days have RSS feeds, if not directly from the site itself, then from an aggregator like Yahoo. I'm sure you can get your Reuters stories that way. An RSS feed is exactly what you want: content without layout.
For anything else, the solution is going to be specific to their site, and will only last until they change their design. A lot of work; I don't see a way around it.
Good luck!
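To show how little "layout" there is to fight in a feed, here's a sketch against a canned RSS 2.0 fragment (in real use you'd fetch the feed with LWP::Simple and parse it with XML::RSS rather than regexes; the titles and links below are made up):

```perl
use strict;
use warnings;

# A canned RSS 2.0 fragment standing in for a real fetched feed.
my $rss = <<'FEED';
<rss version="2.0"><channel>
<item><title>First headline</title><link>http://example.com/1</link></item>
<item><title>Second headline</title><link>http://example.com/2</link></item>
</channel></rss>
FEED

# Crude extraction for illustration only; use a real parser in production.
my @titles = $rss =~ m{<title>(.*?)</title>}gs;
print "$_\n" for @titles;
```

Compare that to guessing your way through a table-based layout: the feed hands you titles and links with nothing to strip.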
Re: How would you extract *content* from websites?
by idsfa (Vicar) on Jun 17, 2005 at 18:19 UTC
use HTML::Strip;
my $hs = HTML::Strip->new();
my $clean_text = $hs->parse( $raw_html );
$hs->eof;
The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. -- Cyrus H. Gordon
The problem is that this will leave a lot of "non-content" data, such as menu link names, advertising text, and so on. While it's a very poor guide, HTML can serve as "metadata" that lets you navigate to the actual content. Remove it before reaching your content, and the spider won't be able to make intelligent decisions.
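To make that concrete, here's a toy page (the div ids are invented) showing the difference between stripping tags up front and using the markup as a map first:

```perl
use strict;
use warnings;

# Hypothetical page: the markup itself says where the real content lives.
my $html = '<div id="nav"><a href="/">Home</a></div>'
         . '<div id="story"><p>The actual article.</p></div>'
         . '<div id="ads">Buy now!</div>';

# Stripping all tags up front loses the roadmap: nav text, article,
# and ad text all run together.
(my $stripped = $html) =~ s/<[^>]+>//g;
print "stripped: $stripped\n";

# Using the markup as metadata first, *then* stripping, keeps just
# the article.
my ($story) = $html =~ m{<div id="story">(.*?)</div>}s;
$story =~ s/<[^>]+>//g;
print "targeted: $story\n";   # "The actual article."
```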
Re: How would you extract *content* from websites?
by TedPride (Priest) on Jun 17, 2005 at 18:21 UTC
Remove everything outside the BODY tag. Remove all remaining tags, replacing images with their alt text. Then compare the start and end of each page against every other page, and remove material that is common to at least N pages and more than M words long (or some combination of the two). That will be the header and footer material.
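A sketch of the header/footer trimming step for just two pages (the page text is invented); real code would run this across many pages and apply the length thresholds above:

```perl
use strict;
use warnings;

# Two hypothetical pages from the same site, as word lists.
my @a = split ' ', "Site Menu Home News Story one text here Footer Copyright";
my @b = split ' ', "Site Menu Home News Different story entirely Footer Copyright";

# Trim the common prefix (header) and common suffix (footer).
my ($pre, $suf) = (0, 0);
$pre++ while $pre < @a && $pre < @b && $a[$pre] eq $b[$pre];
$suf++ while $suf < @a - $pre && $suf < @b - $pre
          && $a[$#a - $suf] eq $b[$#b - $suf];

my @content_a = @a[$pre .. $#a - $suf];
print "@content_a\n";   # "Story one text here"
```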
What's left is the classic "longest common substring" problem. There was a discussion of that recently; let me see if I can find the thread...
Re: How would you extract *content* from websites?
by TedPride (Priest) on Jun 17, 2005 at 18:46 UTC
No, the whole idea is that he wants to automatically separate content from material repeated between pages: headers, footers, menus, etc. A proper solution won't care if the design changes, as long as it has a sufficient number of recent pages to work from.
Here's the node I was talking about:
Imploding URLs
The connection may not be readily apparent, but the problem is essentially the same, only on a much larger scale. You can probably speed comparisons up by storing all the distinct words in an array and converting each word to the value of its subscript; you should only need two bytes per word. You can also speed things up by doing detailed comparisons only between pages that haven't had their common material determined yet: if page A and page B have common material x, and page C also contains all of x, then you can be pretty sure C doesn't need to be checked. Finally, you can speed things up by starting comparisons with the pages closest in location to the current page: differences in the query string first, then in the page name, then a single folder, and so on.
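The word-to-subscript trick looks something like this (a sketch; the sample text is made up):

```perl
use strict;
use warnings;

# Intern each distinct word as a small integer, so page comparisons
# work on numbers (cheap) instead of strings.
my (%id, @word);
sub intern {
    my ($w) = @_;
    $id{$w} = push(@word, $w) - 1 unless exists $id{$w};
    return $id{$w};
}

my @page = map { intern($_) } split ' ', "the cat sat on the mat";
print "@page\n";            # 0 1 2 3 0 4  -- "the" maps to 0 both times
print "$word[$page[0]]\n";  # recover the word from its id

# pack 'v*', @page would store each id in two bytes, as described above.
my $packed = pack 'v*', @page;
print length($packed), " bytes\n";   # 12 bytes for 6 words
```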
Re: How would you extract *content* from websites?
by Your Mother (Archbishop) on Jun 17, 2005 at 19:23 UTC
The diff approach is error-prone on lots of sites, because ads are randomized and menus often change per page, even if only by a single link. Ovid made some good points. Another thing I've relied on when doing this kind of work is that content has entirely different semantics from navigation and junk.
An article will be made of sentences, and not just one or two but a dozen or more. Ads and navigation will rarely be complete sentences and will almost never be more than one or two. I had pretty good success with this strategy building a news/story fetcher three years ago for sites without RSS: plain text --> lines --> filter out everything but contiguous blocks of sentences --> choose the largest remaining item.
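A sketch of that pipeline's scoring step, with invented sample lines and a deliberately crude notion of "sentence" (capitalised start, a dozen-plus characters, terminal punctuation):

```perl
use strict;
use warnings;

# Toy page text after tag stripping: nav junk, an article, legal junk.
my @lines = (
    "Home | News | Sports | Contact",
    "This is the first sentence of the story. It runs on for a while. "
      . "Then it keeps going with more complete sentences. And a few more.",
    "Copyright 2005. All rights reserved.",
);

# Score each block by how many sentence-like chunks it contains,
# then keep the highest-scoring block.
my ($best, $best_n) = ('', 0);
for my $block (@lines) {
    my $n = () = $block =~ /[A-Z][^.!?]{10,}[.!?]/g;
    ($best, $best_n) = ($block, $n) if $n > $best_n;
}
print "$best\n";   # the story block wins
```

The menu line has no terminal punctuation at all, and the copyright line has only a couple of short sentences, so the article wins on density.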
Re: How would you extract *content* from websites?
by Popcorn Dave (Abbot) on Jun 18, 2005 at 03:03 UTC
If I understand your question correctly, you may be able to look for comment tags and grab what's between them.
A few years ago I wrote a program to parse headlines from newspapers (this was pre-RSS), and I initially got it down to 9 rules for 25 papers. Of course, I was still learning Perl at that point, and I had to go through every web page and find the similarities, much as you're describing. What I ended up doing was building a config file with a start and end marker for every web page I was looking at, and parsing my info from there.
I've since gone back and reworked the program using HTML::TokeParser, which made the job a whole lot easier. I did have the advantage that the papers I was looking at were from the same news organizations, just in different towns, so a lot of the layouts were the same. I still use the config file, though.
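The config-file approach boils down to something like this (the hostnames and markers here are invented stand-ins for the real config entries):

```perl
use strict;
use warnings;

# A tiny in-memory stand-in for the config file: per-site start and
# end markers bracketing the content.
my %config = (
    'papera.example.com' => { start => '<!-- story -->',  end => '<!-- /story -->' },
    'paperb.example.com' => { start => '<td class=body>', end => '</td>' },
);

sub extract {
    my ($site, $html) = @_;
    my $c = $config{$site} or return;
    my ($s, $e) = map { quotemeta } @{$c}{qw(start end)};
    return $html =~ /$s(.*?)$e/s ? $1 : undef;
}

my $html = 'junk <!-- story -->Headline and body<!-- /story --> junk';
print extract('papera.example.com', $html), "\n";   # "Headline and body"
```

Adding a paper then means adding one config entry, not writing new parsing code.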
Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
That sounds reasonable, but how do you programmatically determine the starting and ending comments?
Re: How would you extract *content* from websites?
by polettix (Vicar) on Jun 17, 2005 at 23:18 UTC
You may find some inspiration in the Road Runner project. I haven't worked on the project, but it could be an idea for my final thesis. There are some papers, and the system prototype is implemented in Java, but... TMTOWTDI, and one of the ways is surely Perl.
Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')
Don't fool yourself.
Re: How would you extract *content* from websites?
by ambrus (Abbot) on Jun 17, 2005 at 21:55 UTC
I don't know, but if you figure out a good way to do that, you could earn a fortune. Search engine companies are working on this question too.
Re: How would you extract *content* from websites?
by artist (Parson) on Jun 17, 2005 at 20:09 UTC
Convince them to provide RSS feeds.
Re: How would you extract *content* from websites?
by kaif (Friar) on Jun 21, 2005 at 07:52 UTC
This is a problem I've thought about a lot and written many programs to solve on a site-by-site basis. Although I haven't really come up with a good general solution (and there probably isn't one), I currently scrape websites looking for images. Depending on how you look at it, this can be considerably harder or easier. Basically, to decide which image on a given page is the "most interesting", I look at the filename (and host, to see if they match), the size (filtering out common ad sizes), and the placement on the page (in my experience, on a page with only one "useful" image, it's likely to be at the end, since all the ads are up front).
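Those heuristics might be scored roughly like this (the filenames, the ad-size list, and the weights are all invented for the sketch):

```perl
use strict;
use warnings;

# Hypothetical candidates: [filename, width, height, position-on-page].
my @imgs = (
    [ 'banner_728x90.gif', 728,  90, 0 ],   # classic ad banner, up front
    [ 'spacer.gif',          1,   1, 1 ],   # layout spacer
    [ 'story_photo.jpg',   400, 300, 2 ],   # the one we want, near the end
);

# Common ad dimensions to filter out entirely.
my %ad_size = ( '728x90' => 1, '468x60' => 1, '120x600' => 1, '1x1' => 1 );

my ($best, $best_score) = (undef, -1);
for my $img (@imgs) {
    my ($name, $w, $h, $pos) = @$img;
    next if $ad_size{"${w}x${h}"};   # skip known ad sizes
    my $score = $w * $h              # bigger images are more interesting
              + 1000 * $pos;         # later on the page is better
    ($best, $best_score) = ($name, $score) if $score > $best_score;
}
print "$best\n";   # story_photo.jpg
```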