PerlMonks
Plucker Perl Spider (need help)

by hacker (Priest)
on Nov 18, 2001 at 13:23 UTC (#126104=perlmeditation)

My monk brethren:

For those who do not know, or have not heard of Plucker: Plucker is an Open Source PalmOS® application that lets you scrape content from any website and convert it for viewing on your Palm. It supports color, bookmarks, hrefs, images, mailto, and all the HTML elements you would expect (except CSS, JavaScript, and frames). An example screenshot of it in use Palm-side can be found here, and several others can be found here.

For those familiar with AvantGo and similar applications, this is very close to the same functionality, except we use the power of the desktop to render and convert the content (which gives us quite a few advantages over the alternatives). We have two mailing lists for anyone wishing to get a feel for the project's team and goals.

Plucker originally handled only text, no images, and the original desktop-side parser was written in sed and awk. I started the conversion from sed/awk over to perl (I am not the "Pat" named in the script; he was a friend helping me with the conversion), and a parallel effort to replace the sed/awk with Python was started at the same time. My day job took me away for too long; the Python spider (desktop parser) has since replaced the sed/awk one, and the perl parser has stagnated a bit.

My goal with this meditation is to enlist the help of those with the skills to help bring the perl spider back up to snuff (it's about 90% of the way there now, barring a complete rewrite, of course). I'd like to see if I can use LWP::Parallel to speed the gather/build processes as much as possible.

The architecture is quite simple. You have a small file, home.html, which resides in ~/.plucker, and contains a list of links you wish to "pluck". The format is very simple, something like:

<a href="http://www.geeknews.com/" MAXDEPTH=2 BPP=4>Geeknews.com</a>

There are some extra attributes in the HREF which are parsed out by the desktop parser and used to configure your content accordingly. In this example, MAXDEPTH is the maximum depth of links to follow from $parent, and BPP is the bit depth of the images you wish to include (4 bpp here, or 16 shades of grey).
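Pulling those pseudo-attributes out of a home.html line could be sketched roughly like this. This is a hypothetical helper, not the actual Plucker parser; the function name, the regex approach, and the defaults are all assumptions of mine:

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical sketch: extract the href plus the MAXDEPTH/BPP
# pseudo-attributes from one home.html link. Defaults are assumed.
sub parse_pluck_link {
    my ($line) = @_;
    my %opts = (maxdepth => 1, bpp => 1);        # assumed defaults
    return unless $line =~ /<a\s+href="([^"]+)"([^>]*)>/i;
    $opts{url} = $1;
    my $attrs = $2;                              # the pseudo-attributes
    $opts{maxdepth} = $1 if $attrs =~ /MAXDEPTH=(\d+)/i;
    $opts{bpp}      = $1 if $attrs =~ /BPP=(\d+)/i;
    return %opts;
}

my %link = parse_pluck_link(
    '<a href="http://www.geeknews.com/" MAXDEPTH=2 BPP=4>Geeknews.com</a>');
print "$link{url} depth=$link{maxdepth} bpp=$link{bpp}\n";
# prints: http://www.geeknews.com/ depth=2 bpp=4
```

A real parser would want something sturdier than a regex (HTML::Parser, say), but for a one-link-per-line config file this is in the right neighborhood.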

Another method is to hand the spider a URL directly and have it fetch and parse it. Something like:

$ ./fetch.pl http://www.geeknews.com --maxdepth=2 --bpp=4

This matches the other home.html example above.
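The option handling for that invocation could be sketched with Getopt::Long (core Perl). The option names mirror the example above; the defaults and the parse_args wrapper are assumptions of mine:

```perl
#!/usr/bin/perl -w
use strict;
use Getopt::Long;

# Sketch of fetch.pl-style argument parsing. Wrapped in a sub so it
# can be driven from a list rather than the real command line.
sub parse_args {
    my @args = @_;
    local @ARGV = @args;                    # GetOptions works on @ARGV
    my %opt = (maxdepth => 1, bpp => 1);    # assumed defaults
    GetOptions(\%opt, 'maxdepth=i', 'bpp=i') or return;
    $opt{url} = shift @ARGV or return;      # the bare URL argument
    return %opt;
}

my %opt = parse_args('http://www.geeknews.com', '--maxdepth=2', '--bpp=4');
print "Plucking $opt{url} to depth $opt{maxdepth} at $opt{bpp}bpp\n";
# prints: Plucking http://www.geeknews.com to depth 2 at 4bpp
```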

Ideally, several arrays (or hash references?) need to exist.

  • One to contain the links extracted from the page(s) found (@foundlinks)
  • Another to contain the links which have been tested for validity (a HEAD request on the URL?)
    • Out of this will fall @badlinks and @goodlinks
    • @badlinks should be written to a file and loaded the next time the spider is launched, so anything in @seenlinks which matches anything in @badlinks can be pushed to the bottom of the priority list.
  • Another which holds duplicate links, @dupelinks (if you have a depth of 2 and your parent page references a child whose links point back to the parent, there is no need to gather those more than once)
  • Lastly, before actually doing the conversion, @seenlinks, which should contain valid data, including the full HTML/content of each page fetched.
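As a rough sketch of that bookkeeping: the names follow the list above, but the hash alongside @foundlinks is my own choice, since it makes the duplicate check O(1) instead of a scan of the array:

```perl
#!/usr/bin/perl -w
use strict;

# Sketch of the proposed structures. Hashes keyed on URL make the
# dupe/bad-link membership tests cheap; the array preserves order.
my @foundlinks;   # links extracted from fetched pages, in order
my %seen;         # URL => times sighted; >1 means it was a dupe
my %badlinks;     # URL => 1 for links that failed a HEAD check
my %goodlinks;    # URL => 1 for links that answered
my %seenlinks;    # URL => full HTML content, ready for conversion

# Queue a link unless we have seen it before; returns 1 if queued.
sub note_link {
    my ($url) = @_;
    return 0 if $seen{$url}++;        # second sighting: a duplicate
    push @foundlinks, $url;
    return 1;
}

note_link('http://www.geeknews.com/');
note_link('http://www.geeknews.com/news.html');
note_link('http://www.geeknews.com/');   # child linking back to parent
print scalar(@foundlinks), " unique links queued\n";
# prints: 2 unique links queued
```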

To walk through the process in pseudocode: grab the URL(s) from STDIN or ~/.plucker/home.html, pull the links off the parent page, and keep traversing and gathering links until you reach $maxdepth. Each page in @foundlinks is tested in a parallel process for validity (is the link up or down?) and sorted into @badlinks or @goodlinks. As links reach @goodlinks, they are pulled off and gathered by the spider (the actual content is fetched). As these pages are retrieved, they are added to @seenlinks, ready for conversion to the Plucker format, which is open and well-documented.

Once the links and data are local, each link is given a number (a RecordID) and converted to the binary format Plucker reads. The resulting file is a PDB, or Palm Database (a serial collection of records), which is then sync'd to the Palm by whatever means required (pilot-link is one way). I'm sure I missed a step in here somewhere, but that's the short version.
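The traversal step could be sketched as a breadth-first loop bounded by $maxdepth. The link graph below is a static stub standing in for the web; a real spider would do the HEAD check and fetch with LWP (or LWP::Parallel, for the parallelism mentioned earlier) and extract links with an HTML parser:

```perl
#!/usr/bin/perl -w
use strict;

# Stand-in for the web: URL => list of links found on that page.
my %graph = (
    'parent'     => ['child1', 'child2'],
    'child1'     => ['parent'],          # points back: caught as a dupe
    'child2'     => ['grandchild'],
    'grandchild' => [],
);

# Breadth-first crawl to $maxdepth; returns URLs in fetch order.
sub crawl {
    my ($start, $maxdepth) = @_;
    my %seen  = ($start => 1);
    my @queue = ([$start, 0]);           # [url, depth] pairs
    my @order;
    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        push @order, $url;               # here: HEAD-check, then fetch
        next if $depth >= $maxdepth;     # don't follow links any deeper
        for my $link (@{ $graph{$url} || [] }) {
            push @queue, [$link, $depth + 1] unless $seen{$link}++;
        }
    }
    return @order;
}

print join(' ', crawl('parent', 2)), "\n";
# prints: parent child1 child2 grandchild
```

The %seen hash is what feeds @dupelinks in the scheme above: the second sighting of 'parent' from child1 is skipped rather than fetched again.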

I would like to head up the team that attacks this problem, so I'll be the POC (Point Of Contact) from an architecture standpoint, as well as a curious onlooker and contributor to whatever code I can help with. I am only a Sage when it comes to perl hackery of this sort, but I learn fast.

Interested perl hackers, please give me a shout.
