In my browser, I get a search feature by holding down <ctrl> and pressing F at the same time :)
How does it work? It is a series of scripts that have to be called in the right order. It's a bit messy, because it was very much an exploratory process.
The first pass is a script that just keeps walking down the chain of snippet pages until it can't find any more links. When it is run the second time around, it walks down the pages until it encounters a link that it has already seen. It sleeps 15 seconds between fetching each page (Perl Monk HTML hackers would do well to take notice of that last point). Similarly, the process is cronned at 4:15 UTC as I figure that's a pretty quiet time for yoda (the machine perlmonks.org is running on).
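For anyone who wants to try the same kind of polite walk, here is a rough sketch in Python rather than the original Perl. It is not the actual script: `fetch_page` is a hypothetical callable that returns the snippet links on a page plus the URL of the next page (or `None`), and the sketch assumes the newest snippets come first, so hitting a known link means everything after it has already been collected.

```python
import time

def crawl(start_url, fetch_page, seen, delay=15.0, sleep=time.sleep):
    """Walk the chain of index pages and return the links not seen before.

    fetch_page(url) is a hypothetical helper returning (links, next_url),
    where next_url is None on the last page. `seen` is a set of links
    collected by earlier runs; it is updated in place.
    """
    new_links = []
    url = start_url
    while url is not None:
        links, next_url = fetch_page(url)
        for link in links:
            if link in seen:
                # Second and later runs: stop at the first known link.
                return new_links
            seen.add(link)
            new_links.append(link)
        url = next_url
        if url is not None:
            # Be gentle with the server between page fetches.
            sleep(delay)
    return new_links
```

On the first run `seen` is empty, so the loop only ends when a page has no next link. On later runs the same `seen` set (reloaded from wherever it was saved) makes the walk stop early.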
A second script then kicks in to clean up some yucky inconsistencies, like reaped nodes with no titles, and to reformat the data so that it's easy to process afterwards. This was a bugger to get right. I first tried to do it all in the first script, but it turned out to be simpler to let the fetching script do as little as possible, just fetch and dump, and let another script do the cleaning. It's awkward to carry state around between HTML::Parser callbacks.
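The split between "fetch and dump" and "clean" can be illustrated with a small Python sketch. The dump format here is purely an assumption for the example (one `id|author|title` record per line), as is the `(reaped node)` placeholder; the real files may look quite different.

```python
def clean_records(raw_lines, placeholder="(reaped node)"):
    """Normalise dumped records into tab-separated id, author, title.

    Assumes a hypothetical dump format of 'id|author|title' per line.
    Records with an empty title (e.g. reaped nodes) get a placeholder.
    """
    cleaned = []
    for line in raw_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines left over from the dump
        # Split on the first two separators only, so titles may contain '|'.
        fields = [field.strip() for field in line.split("|", 2)]
        fields += [""] * (3 - len(fields))  # pad short records
        node_id, author, title = fields
        if not title:
            title = placeholder
        cleaned.append("\t".join((node_id, author, title)))
    return cleaned
```

Because the cleaning pass is a plain function over lines, it can be rerun on an old dump whenever the rules change, without fetching anything again.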
A third script then takes the cleansed file and loads it into several hashes, so that it can print the data out in a different sort order for each of the HTML views. For instance, it's at this stage that I calculate how many nodes a person has written, and how many nodes were written by monks whose nicks start with 't', which makes it easy to emit the correct rowspan attributes to get everything to line up.
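To see why those counts matter, here is a minimal Python sketch of the same bookkeeping (nested dictionaries standing in for the Perl hashes). The function names and the `(author, title)` record shape are made up for the illustration. A cell that spans every row for one initial needs the total node count for that initial, and a cell that spans one author's rows needs that author's count.

```python
from collections import defaultdict

def group_by_initial(records):
    """Group (author, title) pairs as initial -> author -> [titles]."""
    index = defaultdict(lambda: defaultdict(list))
    for author, title in records:
        initial = (author[:1] or "?").lower()
        index[initial][author].append(title)
    return index

def rowspans(index):
    """Return (rows per initial, rows per author) for rowspan attributes."""
    per_initial = {}
    per_author = {}
    for initial, authors in index.items():
        for author, titles in authors.items():
            per_author[author] = len(titles)
        per_initial[initial] = sum(len(titles) for titles in authors.values())
    return per_initial, per_author
```

With both counts computed up front, the HTML pass can write `rowspan` on the first row of each group and simply omit that cell on the rows that follow.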
A fourth script then walks through all the files generated by the third pass and encodes them as HTML. No, I didn't use any HTML-generating modules. Naughty me, I did it all by hand, serves me right for not paying attention to what modules jcwren has installed on the server. This script creates the files in a directory named /pmsinew under my document root, and then when it has finished, it renames /pmsi to /pmsiold and /pmsinew to /pmsi, and then proceeds to remove the /pmsiold directory. Which means it should be pretty hard to come across a half-constructed index.
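The build-then-swap trick is easy to reproduce. The following is a Python sketch of the idea, not the original script: `publish` and the `build` callback are hypothetical names, and the directory names simply follow the /pmsi, /pmsinew and /pmsiold scheme described above.

```python
import os
import shutil

def publish(docroot, build, live="pmsi"):
    """Build a fresh index next to the live one, then swap it in by renaming.

    build(path) is a hypothetical callback that writes the HTML files
    into the given (empty) directory.
    """
    cur = os.path.join(docroot, live)
    new = os.path.join(docroot, live + "new")
    old = os.path.join(docroot, live + "old")

    # Clear out leftovers from a run that died part way through.
    for stale in (new, old):
        if os.path.isdir(stale):
            shutil.rmtree(stale)

    os.makedirs(new)
    build(new)                 # slow part: visitors still see the old index

    if os.path.isdir(cur):
        os.rename(cur, old)    # move the live index aside
    os.rename(new, cur)        # put the fresh index in its place
    shutil.rmtree(old, ignore_errors=True)
```

All the slow work happens in the side directory, so the live index is only disturbed for the instant between the two renames. During that instant the live directory does not exist at all, which is why a half-built index is hard, though not strictly impossible, to stumble on.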
It would be an understatement to say that the code lacks elegance in certain places. I was more interested in hacking up something quickly than showing The Right Way to do things. I do promise to rewrite the scripts and maybe even drop in a comment here and there.
When I do, I'll post the link from the /pmsi/ homepage.
--
g r i n d e r