PerlMonks  

Re: How would you extract *content* from websites?

by Ovid (Cardinal)
on Jun 17, 2005 at 18:29 UTC


in reply to How would you extract *content* from websites?

Barring something useful like RSS feeds, you're going to have to do this on a site-by-site basis. Ideally, your spider should load the rules for parsing a given site when it visits it. Subclasses that override a &content method might be an appropriate way to structure that.
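A minimal sketch of that subclassing idea, assuming a hypothetical Spider::Site base class (the package names, the &content method, and the article regex are all illustrative, not an existing CPAN API):

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Spider::Site;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Default rule: no site-specific knowledge, return the raw HTML.
sub content {
    my ($self, $html) = @_;
    return $html;
}

package Spider::Site::Example;
our @ISA = ('Spider::Site');

# Override &content with this one site's rule: keep only what
# sits inside its (assumed) article container.
sub content {
    my ($self, $html) = @_;
    my ($body) = $html =~ m{<div id="article">(.*?)</div>}s;
    return defined $body ? $body : '';
}

package main;

my $parser = Spider::Site::Example->new;
my $html   = '<html><div id="article">Hello</div><div>ads</div></html>';
print $parser->content($html), "\n";   # prints "Hello"
```

The spider would pick the subclass from the hostname and fall back to the base class when no site-specific parser exists.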

Regrettably, I do a lot of work like this, and it's easier said than done. One thing that can help is looking for "printer friendly" links: those often lead to a version of the page with much of the extraneous markup stripped away.
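A rough sketch of hunting for such links with a plain regex (the function name and the sample markup are made up for illustration; for real pages a proper parser such as HTML::LinkExtor from CPAN is more robust than regex matching):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scan anchor tags and keep the hrefs whose link text suggests
# a printer-friendly version of the page.
sub find_printer_friendly {
    my ($html) = @_;
    my @links;
    while ($html =~ m{<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>}sig) {
        my ($href, $text) = ($1, $2);
        push @links, $href
            if $text =~ /print(?:er)?[\s-]*friendly|printable/i;
    }
    return @links;
}

my $page = '<a href="/story?id=1">Read</a> '
         . '<a href="/story?id=1;print=1">Printer Friendly</a>';
print "$_\n" for find_printer_friendly($page);   # prints "/story?id=1;print=1"
```

You would then fetch and parse the printer-friendly URL instead of the original, which usually leaves far less boilerplate to strip.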

Cheers,
Ovid

New address of my CGI Course.

