Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
I've almost completed a web site rebuilding exercise and now need to check no damage has been done. There are about a dozen 'types' of pages. I have a script that parses the divs and comments and builds an array similar to the following (simplified):
.lft open : 1 .lft close: 1 .top open : 1 .top close: 1 .mdl open : 1 ..article open : 2 <!-- article start --> <!-- article end --> ..article close: 2 ..footer open : 2 ..footer close: 2 .mdl close: 1
The dots and numbers indicate nested divs. In this case the two comments are important and are used by other scripts but a known issue is that the 'article end' comment can be outside of its appropriate div. Another is that the footer div could be within the article div rather than outside of it. There may (probably!) be other snags which is why I'm looking for a general approach to validating such structures.

The plan is to build an array of arrays and write it to, say, a bar delimited flat file db. I would then have a series of scripts that would use this to check various rules (mandatory divs, optional divs etc.).

In addition to the snags already mentioned I would be suspicious of files that had unique structures and would want to identify them and have a look.

My idea at the moment is to load each structure into a hash and do a series of lookups.

Speed and efficiency are not priorities but rather easily tweakable scripts to test for various circumstances.

If anyone has suggestions on what keywords/techniques to look for (my jargon foo is very poor) or any experiences with similar tasks it would be much appreciated.

Notes

Local copy of a static website on stand alone winXP/ActiveState. Approx 80MB, 3k pages.

All praise to File::Find and HTML::TokeParser::Simple! :-)


In reply to Validating HTML structures by wfsp

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others exploiting the Monastery: (3)
    As of 2020-06-05 01:18 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?
      Do you really want to know if there is extraterrestrial life?



      Results (35 votes). Check out past polls.

      Notices?