PerlMonks  

Re: How would you extract *content* from websites?

by Popcorn Dave (Abbot)
on Jun 18, 2005 at 03:03 UTC


in reply to How would you extract *content* from websites?

If I understand your question correctly, you may be able to look for comment tags and grab what's between those.

A few years ago I wrote a program to parse headlines from newspapers - this was pre-RSS - and I initially got it down to 9 rules for 25 papers. Of course I was still learning Perl at that point, but I had to go through every web page and find the similarities, much like you're talking about. So what I ended up doing was building a config file with a start and end marker for every web page I was looking at, and parsing my info from there.
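To make the config-file idea concrete, here is a minimal sketch using only core Perl. It is not the original program: the site names, the marker strings, and the `extract_block` helper are all made up for illustration, and a real config would live in a file rather than a hash.

```perl
use strict;
use warnings;

# Hypothetical per-site config: one start and one end marker per paper.
# Real markers would come from inspecting each site's HTML by hand.
my %markers = (
    'example-gazette' => {
        start => '<!-- begin headlines -->',
        end   => '<!-- end headlines -->',
    },
    'example-tribune' => {
        start => '<!-- top stories -->',
        end   => '<!-- /top stories -->',
    },
);

# Return the text between a site's start and end markers, or undef if
# the site is unknown or either marker is missing from the page.
sub extract_block {
    my ($site, $html) = @_;
    my $m = $markers{$site} or return;

    my $from = index($html, $m->{start});
    return if $from < 0;
    $from += length $m->{start};

    # Look for the end marker only after the start marker.
    my $to = index($html, $m->{end}, $from);
    return if $to < 0;

    return substr($html, $from, $to - $from);
}
```

Calling `extract_block('example-gazette', $page)` hands back whatever sits between the two markers, which can then be parsed for headlines. Plain `index` is used instead of a regex so the markers never need escaping.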

I've since gone back and reworked the program somewhat using HTML::TokeParser, which made the job a whole lot easier. I did have one advantage: the papers I was looking at belonged to the same news organizations, just in different towns, so a lot of the layouts were the same. I still use the config file, though.
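For anyone who hasn't used HTML::TokeParser, a rough sketch of how it can combine with start and end comment markers follows. This is an assumption about how such a parser might look, not the reworked program itself: the `headlines_between` name and the idea that each headline is the text of a link are both hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Collect the text of every <a> element found between two HTML comments.
# $start and $end are substrings to look for inside the comments.
sub headlines_between {
    my ($html, $start, $end) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "cannot create parser";

    my $inside = 0;
    my @headlines;
    while (my $tok = $p->get_token) {
        my $type = $tok->[0];
        if ($type eq 'C') {
            # For comment tokens, $tok->[1] is the raw comment text.
            if (!$inside) {
                $inside = 1 if index($tok->[1], $start) >= 0;
            }
            elsif (index($tok->[1], $end) >= 0) {
                last;
            }
        }
        elsif ($inside && $type eq 'S' && $tok->[1] eq 'a') {
            # Read the link text up to the closing </a>.
            push @headlines, $p->get_trimmed_text('/a');
        }
    }
    return @headlines;
}
```

Because the tokenizer copes with tags, attributes, and whitespace, the per-site config only has to supply the two marker strings.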

Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.

Re^2: How would you extract *content* from websites?
by BUU (Prior) on Jun 18, 2005 at 03:10 UTC
    That sounds reasonable, but how do you programmatically determine the starting and ending comments?
      Boy, I wish I knew the answer to that. Like I said, I looked at the page layouts of the web sites I was after and built my config file with the comments to look for - starting and ending.

      Along the lines of what you're after I suppose you could just parse for comments and build a list of comment tags to look for. You had mentioned doing a diff on the files you wanted to look at, so that may be the way to start.
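      A rough sketch of that idea, assuming you already have several pages from the same site in memory: pull out every comment and keep the ones that show up in all of the pages. Comments shared by every page are likely template markers and therefore candidates for start and end markers. The `common_comments` name is made up, and the regex is a simplification that would be fooled by comment-like text inside scripts.

```perl
use strict;
use warnings;

# Return the HTML comments (trimmed) that occur in every page given, sorted.
sub common_comments {
    my @pages = @_;
    my %seen_in;    # comment text => number of pages containing it

    for my $html (@pages) {
        my %here;   # count each comment only once per page
        while ($html =~ /<!--\s*(.*?)\s*-->/gs) {
            $here{$1} = 1;
        }
        $seen_in{$_}++ for keys %here;
    }

    my @common = grep { $seen_in{$_} == @pages } keys %seen_in;
    @common = sort @common;
    return @common;
}
```

      You would still have to eyeball the resulting list to decide which pair actually brackets the content, but it narrows the search considerably.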

