Some light PerlMonks reading by the campfire
by hacker (Priest) on Feb 18, 2007 at 17:02 UTC
Ok, now that I've got your attention... what I'm wondering is the best way to capture a large chunk of PM for reading "offline", where there is no net connectivity (think WAY offline, as in no power for hundreds of miles), for potentially months at a time.
I'm going to start small, just to see if it's even feasible, and then expand it. I've done some similar projects in this area over the last few years, and they have been quite successful.
I also looked around the monastery here, and found these somewhat-relevant nodes:
There are quite a few useful replies in there, some referencing ThePen (which is down as I type this). Some suggest spidering the site, others converting the XML to HTML, and others simply pulling a database dump and reusing that.
Ideally, the best approach would be to dump the node tables and replies to some form of XML, as the Wikimedia projects do. They have a tool called mwdumper (written in Java) that takes the XML export and pumps it back into MySQL (I just did this with the latest Wikipedia database this weekend; it was over 4.5 million separate rows and took 20 hours to import, whew!).
But it doesn't have to be that complex... even just the XML dumps, with some sort of linking to each of the replies, would be perfect.
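To illustrate one shape such a linked dump could take, here's a minimal sketch. The record format and the `node_id`/`parent_node`/`title` names are my own assumptions for illustration, not the actual PerlMonks schema, and the quick regex matching only works because I control the sample records; a real importer should use a proper parser such as XML::Twig.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical per-node XML records; attribute and tag names are
# invented for this sketch, not the real PerlMonks export format.
my @records = (
    '<node node_id="100"><title>Root question</title></node>',
    '<node node_id="101" parent_node="100"><title>Re: Root question</title></node>',
    '<node node_id="102" parent_node="100"><title>Re^2: Root question</title></node>',
);

# Build title, parent, and parent-to-replies maps from the records.
my ( %title, %parent_of, %replies );
for my $rec (@records) {
    my ($id)     = $rec =~ /node_id="(\d+)"/ or next;
    my ($parent) = $rec =~ /parent_node="(\d+)"/;
    my ($t)      = $rec =~ m{<title>([^<]*)</title>};
    $title{$id}     = defined $t ? $t : '(untitled)';
    $parent_of{$id} = $parent;
    push @{ $replies{$parent} }, $id if defined $parent;
}

# Emit a simple offline index: each root node followed by its replies.
for my $root ( grep { !defined $parent_of{$_} } sort { $a <=> $b } keys %title ) {
    print "$title{$root} ($root)\n";
    print "  -> $title{$_} ($_)\n" for @{ $replies{$root} || [] };
}
```

With a map like `%replies` in hand, turning each root node into an HTML page with anchors to its replies is a small additional step.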
Now I could also spider ThePen during off-hours (when it comes back online) and store the plain HTML that way, but that introduces load, latency, bandwidth issues and so on. I'd rather avoid that strain on someone else's server, because I know what it's like when someone does it to my public servers.
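If spidering does turn out to be the only option, a throttled mirror keeps the strain down. This sketch assumes LWP::Simple's mirror() (which sends If-Modified-Since, so unchanged pages cost the server almost nothing on re-runs); the base URL, node id range, and delay are all placeholders, not ThePen's real layout:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(mirror);

# Placeholder URL layout -- substitute the real archive's scheme.
sub node_url {
    my ( $base, $id ) = @_;
    return "$base/$id.html";
}

my $base     = 'http://example.org/pen/node';    # hypothetical mirror
my @node_ids = ( 600000 .. 600004 );             # hypothetical id range

for my $id (@node_ids) {
    # mirror() only re-downloads when the server copy is newer
    # than the local file, and returns the HTTP status code.
    my $status = mirror( node_url( $base, $id ), "node-$id.html" );
    warn "node $id: HTTP $status\n";
    sleep 10;    # be gentle: at most one request every ten seconds
}
```

Run from cron in the small hours, something like this can accumulate an archive over weeks without ever hammering the host.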
Has there been any movement on implementing "nodeballs" in PM yet? The Everything Engine powering PerlMonks supports them, so I guess it's just a matter of consensus, a vote, and enabling it?
What say ye?