Plucker Perl Spider (need help)
by hacker (Priest) on Nov 18, 2001 at 13:23 UTC
My monk brethren:
For those familiar with AvantGo and similar applications, Plucker offers very nearly the same functionality, except that we use the power of the desktop to render and convert content (which gives us quite a few advantages over the alternatives). We have two mailing lists for people wishing to gain an understanding of the project team and goals.
Plucker originally handled only text, not images, and the original desktop-side parser was written in sed and awk. I started the conversion from sed/awk to perl (I am not the "Pat" named in the script; he was a friend helping me with the conversion), and a parallel effort to replace the sed/awk with Python was also started. My day job took me away for too long. In the meantime the Python spider (desktop parser) has replaced the sed/awk spider, and the perl parser has stagnated a bit.
My goal with this meditation is to enlist the help of those with the skills to help bring the perl spider back up to snuff (it's about 90% of the way there now, barring a complete rewrite, of course). I'd like to see if I can use LWP::Parallel to speed the gather/build processes as much as possible.
The architecture is quite simple. You have a small file, home.html, which resides in ~/.plucker, and contains a list of links you wish to "pluck". The format is very simple, something like:
<a href="http://www.geeknews.com/" MAXDEPTH=2 BPP=4>Geeknews.com</a>
The anchor tag carries some extra attributes, which are parsed out by the desktop parser and used to configure your content accordingly. In this example, MAXDEPTH is the maximum depth of links to follow from $parent, and BPP is the bit depth of the images that you wish to include (4bpp in this case, or 16 shades of grey).
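To make the home.html format concrete, here is a minimal sketch of pulling the URL, MAXDEPTH and BPP out of one such line. It is not the existing spider's code: the function name parse_home_link and the default values are hypothetical, and a regex is used only to keep the example dependency-free. A real parser such as HTML::Parser would be more robust against messy markup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one home.html anchor into a hash of settings.
# Assumes attribute values are either double-quoted or unquoted,
# as in the example line above.
sub parse_home_link {
    my ($line) = @_;
    my ($attrs, $title) = $line =~ m{<a\s+([^>]*)>(.*?)</a>}is
        or return;

    # Collect name=value pairs; attribute names are case-insensitive.
    my %attr;
    while ($attrs =~ /([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s>]+))/g) {
        $attr{ lc $1 } = defined $2 ? $2 : $3;
    }
    return unless defined $attr{href};

    my $bpp = defined $attr{bpp} ? $attr{bpp} : 1;    # hypothetical default
    return {
        url      => $attr{href},
        title    => $title,
        maxdepth => defined $attr{maxdepth} ? $attr{maxdepth} : 1,  # hypothetical default
        bpp      => $bpp,
        shades   => 2 ** $bpp,    # n bits per pixel gives 2**n grey levels
    };
}

my $link = parse_home_link(
    '<a href="http://www.geeknews.com/" MAXDEPTH=2 BPP=4>Geeknews.com</a>'
);
printf "%s: depth %d, %d bpp (%d shades)\n",
    @{$link}{qw(url maxdepth bpp shades)};
```

The shades field just spells out the arithmetic: 2**4 is the 16 shades of grey mentioned above.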
Another method is to hand the spider a URL directly on the command line, and have it fetch and parse it. Something like:
$ ./fetch.pl http://www.geeknews.com --maxdepth=2 --bpp=4
This is equivalent to the home.html example above.
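For the command-line form, something along these lines could work. This is only a sketch using the core Getopt::Long module; the sub name parse_args and the defaults are hypothetical, and a real fetch.pl would pass @ARGV rather than the hard-coded list used here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Turn a fetch.pl-style argument list into the same settings the
# home.html attributes would carry.
sub parse_args {
    my @args = @_;
    my %opt  = (maxdepth => 1, bpp => 1);    # hypothetical defaults
    GetOptionsFromArray(\@args, \%opt, 'maxdepth=i', 'bpp=i')
        or die "usage: fetch.pl URL [--maxdepth=N] [--bpp=N]\n";

    # Getopt::Long strips the options it recognised; what is left are URLs.
    die "no URL given\n" unless @args;
    return map { +{ url => $_, %opt } } @args;
}

# In a real script this would be parse_args(@ARGV).
my @jobs = parse_args(qw(http://www.geeknews.com --maxdepth=2 --bpp=4));
print "$_->{url} maxdepth=$_->{maxdepth} bpp=$_->{bpp}\n" for @jobs;
```

Each job comes back as a plain hash reference, so links read from home.html and links given on the command line can be fed to the same spider code.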
Ideally, several arrays (or hash references?) need to exist: @foundlinks, @goodlinks, @badlinks and @seenlinks.
To go through the process in pseudocode: you grab the URL(s) from STDIN or ~/.plucker/home.html, pull the links off of the parent page, and traverse, gathering links into @foundlinks until you reach $maxdepth. Each link in @foundlinks is tested in a parallel process for validity (is the link up or down?) and stored in @goodlinks or @badlinks accordingly. As links reach @goodlinks, they are pulled off and gathered by the spider (the actual content is fetched). As these pages are retrieved, they are added to @seenlinks, ready for conversion to Plucker format.
The Plucker format is open and well-documented. Once the links and data are local, each link is given a number (a RecordID) and then converted to the binary format Plucker needs in order to read it. The resulting file is a PDB, or Palm Database (a serial collection of records), which is then synced to the Palm by whatever means required (pilot-link is one way). I'm sure I missed a step in here somewhere, but that's the short version.
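The gathering loop described above could be sketched roughly as follows. To keep it runnable without a network, the web is faked with a hash, and link_is_up and links_on are hypothetical stubs standing in for the real validity check and link extraction. The sketch is sequential rather than parallel, uses a %seen hash only to avoid visiting a URL twice, and numbers RecordIDs from 1 purely as an assumption. The conversion to the PDB binary format is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the network: page URL => links found on that page.
# A real spider would fetch with LWP and extract links with an HTML parser.
my %fake_web = (
    'http://a/'  => ['http://a/1', 'http://a/2', 'http://a/dead'],
    'http://a/1' => ['http://a/2', 'http://a/3'],
    'http://a/2' => ['http://a/'],
    'http://a/3' => [],
);

sub link_is_up { exists $fake_web{ $_[0] } }        # validity check (stubbed)
sub links_on   { @{ $fake_web{ $_[0] } || [] } }    # link extraction (stubbed)

# Breadth-first walk from $parent, following links at most $maxdepth deep.
sub pluck {
    my ($parent, $maxdepth) = @_;
    my (@goodlinks, @badlinks, %seen, %record_id);
    my @foundlinks = ([$parent, 0]);

    while (my $item = shift @foundlinks) {
        my ($url, $depth) = @$item;
        next if $seen{$url}++;                       # already handled
        unless (link_is_up($url)) { push @badlinks, $url; next }
        push @goodlinks, $url;
        next if $depth >= $maxdepth;                 # do not follow further
        push @foundlinks, map { [$_, $depth + 1] } links_on($url);
    }

    # Give every gathered page a number; starting at 1 is an assumption.
    my $id = 0;
    $record_id{$_} = ++$id for @goodlinks;
    return (\@goodlinks, \@badlinks, \%record_id);
}

my ($good, $bad, $ids) = pluck('http://a/', 2);
print "record $ids->{$_}: $_\n" for @$good;
print "bad link: $_\n"          for @$bad;
```

Carrying the depth alongside each queued URL keeps the $maxdepth test local to the loop. Swapping the stubs for parallel requests (LWP::Parallel, for instance) should not change the overall shape.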
I would like to be heading up the team that attacks this problem, so I'll be the POC (Point Of Contact) from an architecture standpoint, as well as just a curious onlooker and contributor to the code that I can help with. I am only a Sage when it comes to perl hackery of this sort, but learn fast.
Interested perl hackers, please give me a shout.