http://www.perlmonks.org?node_id=503836

wfsp has asked for the wisdom of the Perl Monks concerning the following question:

I've almost completed a web site rebuilding exercise and now need to check that no damage has been done. There are about a dozen 'types' of pages. I have a script that parses the divs and comments and builds an array similar to the following (simplified):
.lft      open : 1
.lft      close: 1
.top      open : 1
.top      close: 1
.mdl      open : 1
..article open : 2
<!-- article start -->
<!-- article end -->
..article close: 2
..footer  open : 2
..footer  close: 2
.mdl      close: 1
The dots and numbers indicate nesting depth. In this case the two comments are important and are used by other scripts, but a known issue is that the 'article end' comment can end up outside its appropriate div. Another is that the footer div can end up within the article div rather than outside it. There may (probably!) be other snags, which is why I'm looking for a general approach to validating such structures.
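A stack is the usual tool for this kind of nesting check. Here is a minimal sketch in core Perl, assuming your parser can emit (type, name) event pairs like the listing above; the event format, rule set and error messages are my inventions for illustration, not your script's:

```perl
use strict;
use warnings;

# Each event is [type, name]: type is 'open', 'close' or 'comment'.
# Returns a list of rule violations (empty list = structure looks OK).
sub check_structure {
    my @events = @_;
    my (@stack, @errors);
    for my $e (@events) {
        my ($type, $name) = @$e;
        if ($type eq 'open') {
            # known snag no. 2: footer must not live inside article
            push @errors, "footer div nested inside article div"
                if $name eq 'footer' and grep { $_ eq 'article' } @stack;
            push @stack, $name;
        }
        elsif ($type eq 'close') {
            my $top = pop @stack;
            if (!defined $top) {
                push @errors, "close of $name with no div open";
            }
            elsif ($top ne $name) {
                push @errors, "mismatched close: expected $top, got $name";
            }
        }
        elsif ($type eq 'comment') {
            # known snag no. 1: article comments belong inside the article div
            push @errors, "'$name' comment outside the article div"
                if $name =~ /^article/ and !grep { $_ eq 'article' } @stack;
        }
    }
    push @errors, "unclosed divs: @stack" if @stack;
    return @errors;
}
```

Because the checks are just pushes onto @errors, adding a new rule as you discover a new snag is a one-line change, which fits the "easily tweakable" goal better than one monolithic validator.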

The plan is to build an array of arrays and write it to, say, a bar-delimited flat-file db. I would then have a series of scripts that use this to check various rules (mandatory divs, optional divs, etc.).

In addition to the snags already mentioned I would be suspicious of files that had unique structures and would want to identify them and have a look.

My idea at the moment is to load each structure into a hash and do a series of lookups.
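For what it's worth, a sketch of that plan in core Perl; the '|'-delimited record layout (page name first, then one field per structural item) is an assumption for illustration, not your actual schema:

```perl
use strict;
use warnings;

# Write the array of arrays out as one bar-delimited record per page.
sub write_db {
    my ($path, $records) = @_;    # $records: arrayref of arrayrefs
    open my $fh, '>', $path or die "open $path: $!";
    print {$fh} join('|', @$_), "\n" for @$records;
    close $fh;
}

# Load the flat file back into a hash keyed on page name, ready for
# the series of rule-checking lookups.
sub load_db {
    my ($path) = @_;
    my %by_page;
    open my $fh, '<', $path or die "open $path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($page, @fields) = split /\|/, $line;
        $by_page{$page} = \@fields;
    }
    close $fh;
    return \%by_page;
}
```

Each rule-checking script then only needs load_db and a loop over the hash, so the rules stay small and independent.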

Speed and efficiency are not priorities; what matters is easily tweakable scripts to test for various circumstances.

If anyone has suggestions on what keywords/techniques to look for (my jargon foo is very poor) or any experiences with similar tasks it would be much appreciated.

Notes

Local copy of a static website on stand-alone WinXP/ActiveState. Approx. 80 MB, 3k pages.

All praise to File::Find and HTML::TokeParser::Simple! :-)

Re: Validating HTML structures
by b374 (Initiate) on Oct 31, 2005 at 00:27 UTC

    HTML::Tree is your friend. I used it very extensively for converting old pages to new formats, pulling out content, and even XML parsing.

    As for identifying various structures, the easiest approach I found was the following:

    Suppose you have around 10,000 pages of broadly similar structure but suspect some of them differ. Build the parse tree for each page and eliminate anything that isn't structural from the tree. Dump the HTML with no spaces and no comments to a string and make a hash out of that string. For each hash, keep a list of the files that matched it and the "blueprint" in some structure of your choice (mine are files, for example). After this step you'll get classes of pages, each matching one blueprint.

    After you get the structures based on the blocks on the pages, repeat the same procedure for each block and contained information.
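    A minimal sketch of the blueprint/hash step, using the core Digest::MD5 module. The tag-grabbing regex is a crude stand-in for a real parse tree (HTML::TreeBuilder would do this properly); it just keeps tag names and drops text, attributes and comments:

    ```perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Reduce a page to its structural skeleton (tag names only) and
    # hash that, so structurally identical pages get the same key.
    sub blueprint {
        my ($html) = @_;
        my @tags = $html =~ m{<(/?\w+)}g;   # crude: a real parser is better
        return md5_hex(join '|', @tags);
    }

    # Group files by blueprint; classes with only one file are the
    # "unique structures" worth a manual look.
    sub classify {
        my %class;
        local $/;                           # slurp whole files
        for my $file (@_) {
            open my $fh, '<', $file or next;
            push @{ $class{ blueprint(<$fh>) } }, $file;
            close $fh;
        }
        return \%class;
    }
    ```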

    The case where this method got the best results was extracting information from about 18,000 pages that contained something like 3 tables of key-value pairs, some free-form content (descriptions, reviews, comments), images and some other blocks. The layout was a mess: part of the pages had initially been generated by 3-4 different scripts and then updated manually over a few years with anything from text editors to *cough* FrontPage... the result was pretty impressive... in just about a day (actually a night), over 99% of the pages were in a nice database.

Re: Validating HTML structures
by petdance (Parson) on Oct 31, 2005 at 18:00 UTC