I've almost completed a web site rebuilding exercise and now need to check that no damage has been done. There are about a dozen 'types' of pages. I have a script that parses the divs and comments of each page and builds an array similar to the following (simplified):
    .lft open : 1
    .lft close: 1
    .top open : 1
    .top close: 1
    .mdl open : 1
    ..article open : 2
    <!-- article start -->
    <!-- article end -->
    ..article close: 2
    ..footer open : 2
    ..footer close: 2
    .mdl close: 1
The dots and numbers indicate the nesting depth of the divs. In this case the two comments are important and are used by other scripts, but a known issue is that the 'article end' comment can fall outside its article div. Another is that the footer div could be inside the article div rather than after it. There may (probably!) be other snags, which is why I'm looking for a general approach to validating such structures.
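The two known issues could be caught by walking the array with a stack of currently open divs. This is only a minimal sketch: the [type, name, depth] triples and the check_structure name are assumptions made for illustration, not the format the real script produces.

```perl
use strict;
use warnings;

# Assumed representation: one [type, name, depth] triple per entry.
my @page = (
    [ 'open',    'lft',           1 ],
    [ 'close',   'lft',           1 ],
    [ 'open',    'top',           1 ],
    [ 'close',   'top',           1 ],
    [ 'open',    'mdl',           1 ],
    [ 'open',    'article',       2 ],
    [ 'comment', 'article start', 2 ],
    [ 'comment', 'article end',   2 ],
    [ 'close',   'article',       2 ],
    [ 'open',    'footer',        2 ],
    [ 'close',   'footer',        2 ],
    [ 'close',   'mdl',           1 ],
);

# Walk the entries, tracking open divs on a stack, and collect problems.
sub check_structure {
    my @entries = @_;
    my ( @stack, @problems );
    for my $entry (@entries) {
        my ( $type, $name ) = @$entry;
        if ( $type eq 'open' ) {
            # The footer must not be opened while the article is still open.
            push @problems, "footer is inside article"
                if $name eq 'footer' && grep { $_ eq 'article' } @stack;
            push @stack, $name;
        }
        elsif ( $type eq 'close' ) {
            my $top = pop @stack;
            push @problems, "close of $name does not match the open div"
                unless defined $top && $top eq $name;
        }
        elsif ( $type eq 'comment' && $name =~ /^article (?:start|end)$/ ) {
            # Both marker comments must sit directly inside the article div.
            push @problems, "comment '$name' is outside the article div"
                unless @stack && $stack[-1] eq 'article';
        }
    }
    push @problems, "unclosed div: $_" for @stack;
    return @problems;
}

print "$_\n" for check_structure(@page);
```

New rules slot in as extra branches, which keeps the script easy to tweak.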

The plan is to build an array of arrays for each page and write it to, say, a bar-delimited flat-file db. I would then have a series of scripts that read this file and check various rules (mandatory divs, optional divs, etc.).
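One possible layout for the flat file is one line per page: the file name, then each entry flattened to type:name:depth, all joined with bars. The to_line and from_line names are made up for this sketch, and it assumes that div names and comments never contain a bar or a colon.

```perl
use strict;
use warnings;

# Flatten one page into "file|type:name:depth|type:name:depth|...".
sub to_line {
    my ( $file, @entries ) = @_;
    return join '|', $file, map { join ':', @$_ } @entries;
}

# Rebuild the file name and the array of arrays from one line.
sub from_line {
    my ($line) = @_;
    chomp $line;
    my ( $file, @fields ) = split /\|/, $line;
    return ( $file, map { [ split /:/, $_, 3 ] } @fields );
}

my $line = to_line(
    'index.html',
    [ 'open',    'article',       2 ],
    [ 'comment', 'article start', 2 ],
    [ 'close',   'article',       2 ],
);
print "$line\n";

my ( $file, @entries ) = from_line($line);
print "$file has ", scalar(@entries), " entries\n";
```

Only the fields after the first bar are split on colons, so a Windows path such as C:\site\index.html should survive the round trip.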

In addition to the snags already mentioned, I would be suspicious of any file whose structure is unique and would want to identify such files and have a look.
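Spotting the odd ones out might be as simple as using each page's flattened structure as a hash key and listing the keys that only a few files share. The %structure_of data and the rare_structures name below are invented for the sketch.

```perl
use strict;
use warnings;

# Assumed input: file name => that page's structure flattened to one string.
my %structure_of = (
    'index.html' => 'open:lft|close:lft|open:mdl|close:mdl',
    'about.html' => 'open:lft|close:lft|open:mdl|close:mdl',
    'odd.html'   => 'open:mdl|open:lft|close:lft|close:mdl',
);

# Return the files whose structure is shared by at most $max_count files.
sub rare_structures {
    my ( $structure_of, $max_count ) = @_;

    # Group the file names by their structure string.
    my %files_for;
    push @{ $files_for{ $structure_of->{$_} } }, $_ for keys %$structure_of;

    my @files = map { @{ $files_for{$_} } }
        grep { @{ $files_for{$_} } <= $max_count } keys %files_for;
    @files = sort @files;
    return @files;
}

print "look at: $_\n" for rare_structures( \%structure_of, 1 );
```

With a dozen page types across 3k pages, a structure that turns up only once or twice is probably worth opening by hand. Raising $max_count widens the net.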

My idea at the moment is to load each structure into a hash and do a series of lookups.
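A sketch of what such lookups could look like for the mandatory/optional rule. The two rule hashes and their contents are guesses based on the example above, and the entries are assumed to be [type, name, depth] triples.

```perl
use strict;
use warnings;

# Hypothetical rule tables; edit these to change what is checked.
my %mandatory = map { $_ => 1 } qw(lft top mdl article);
my %optional  = map { $_ => 1 } qw(footer);

# Report mandatory divs that are missing and divs that no rule mentions.
sub check_divs {
    my @entries = @_;

    # Record every div name that is opened on the page.
    my %seen;
    $seen{ $_->[1] }++ for grep { $_->[0] eq 'open' } @entries;

    my @problems;
    push @problems, "missing mandatory div: $_"
        for grep { !$seen{$_} } sort keys %mandatory;
    push @problems, "unexpected div: $_"
        for grep { !$mandatory{$_} && !$optional{$_} } sort keys %seen;
    return @problems;
}

my @page = (
    [ 'open', 'lft',     1 ], [ 'close', 'lft',     1 ],
    [ 'open', 'sidebar', 1 ], [ 'close', 'sidebar', 1 ],
);
print "$_\n" for check_divs(@page);
```

Each rule is just a hash, so adding a rule is a matter of adding a key.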

Speed and efficiency are not priorities. What I want are easily tweakable scripts that test for various circumstances.

If anyone has suggestions on what keywords/techniques to look for (my jargon-fu is very poor) or any experience with similar tasks, it would be much appreciated.

Notes

Local copy of a static website on a standalone WinXP machine with ActiveState Perl. Approx. 80MB, 3k pages.

All praise to File::Find and HTML::TokeParser::Simple! :-)


In reply to Validating HTML structures by wfsp
