I've almost completed a web site rebuilding exercise and now need to check that no damage has been done. There are about a dozen 'types' of pages. I have a script that parses the divs and comments of each page and builds an array similar to the following (simplified):
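(The div names here are placeholders rather than the real pages; the dot/number prefixes show the nesting.)

    .1 div header
    .1 div article
    ..2 comment article-start
    ..2 div body
    ..2 comment article-end
    .1 div footer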
The dots and numbers indicate nested divs. In this case the two comments are important, since they are used by other scripts, but a known issue is that the 'article end' comment can end up outside its appropriate div. Another is that the footer div can end up inside the article div rather than outside it. There may (probably!) be other snags, which is why I'm looking for a general approach to validating such structures.
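Still, as a rough, untested sketch of those first two checks in particular (assuming entries in the placeholder format above, with the leading dots giving the depth):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One page's structure, as built by the parsing script
    # (placeholder names, same format as the listing above).
    my @structure = (
        '.1 div header',
        '.1 div article',
        '..2 comment article-start',
        '..2 div body',
        '..2 comment article-end',
        '.1 div footer',
    );

    # Depth of an entry = number of leading dots.
    sub depth { return $_[0] =~ /^(\.+)/ ? length $1 : 0 }

    my ( $in_article, $article_depth, $end_ok, $footer_inside ) = ( 0, 0, 0, 0 );
    for my $entry (@structure) {
        my $d = depth($entry);

        # Once nesting returns to the article's own depth, that div has closed.
        $in_article = 0 if $in_article && $d <= $article_depth;

        if ( $entry =~ /\bdiv article\b/ ) {
            ( $in_article, $article_depth ) = ( 1, $d );
        }
        $end_ok        = 1 if $in_article && $entry =~ /article-end/;
        $footer_inside = 1 if $in_article && $entry =~ /\bdiv footer\b/;
    }

    print "snag: 'article end' comment outside the article div\n" unless $end_ok;
    print "snag: footer div nested inside the article div\n"      if $footer_inside;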
The plan is to build an array of arrays and write it to, say, a bar-delimited flat file db. I would then have a series of scripts that use this to check various rules (mandatory divs, optional divs, etc.).
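Writing it out could be as simple as one pipe-joined line per page; the filenames and entries below are invented for illustration:

    use strict;
    use warnings;

    # Hypothetical pages: filename => flattened structure.
    my %pages = (
        'about.html'   => [ '.1 div header', '.1 div article', '.1 div footer' ],
        'contact.html' => [ '.1 div header', '.1 div footer' ],
    );

    open my $out, '>', 'structures.db' or die "structures.db: $!";
    for my $file ( sort keys %pages ) {
        print {$out} join( '|', $file, @{ $pages{$file} } ), "\n";
    }
    close $out;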
In addition to the snags already mentioned, I would be suspicious of files with unique structures and would want to identify them and have a look.
My idea at the moment is to load each structure into a hash and do a series of lookups.
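Something along these lines, say, reading back the hypothetical structures.db from the sketch above:

    use strict;
    use warnings;

    my %seen;    # structure signature => list of files sharing it
    open my $in, '<', 'structures.db' or die "structures.db: $!";
    while ( my $line = <$in> ) {
        chomp $line;
        my ( $file, @entries ) = split /\|/, $line;
        push @{ $seen{ join '|', @entries } }, $file;
    }
    close $in;

    # A signature that occurs only once is a page worth eyeballing.
    for my $sig ( sort keys %seen ) {
        my @files = @{ $seen{$sig} };
        print "unique structure: $files[0]\n" if @files == 1;
    }

The same hash could presumably feed the mandatory/optional-div checks too: match each signature against the required entries and flag anything missing.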
Speed and efficiency are not priorities; what I'm after is easily tweakable scripts that can test for various circumstances.
If anyone has suggestions on what keywords/techniques to look up (my jargon-fu is very poor), or any experience with similar tasks, it would be much appreciated.
Local copy of a static website on a standalone WinXP/ActiveState box. Approx. 80 MB, 3k pages.
All praise to File::Find and HTML::TokeParser::Simple! :-)