PerlMonks  

HTML::Tree is your friend. I have used it extensively for converting old pages to new formats, pulling out content, and even XML parsing.

As for identifying various structures, the easiest approach I found was the following:

Suppose you have something like 10,000 pages of broadly similar structure, but you suspect some of them differ. Build the parse tree for each page and strip everything that isn't structural from the tree. Dump the HTML, with no whitespace and no comments, to a string and make a hash out of that string. For each hash, keep a list of the files that matched it, along with the "blueprint", in a structure of your choice (mine are files, for example). After this step you'll have some classes of pages, each matching one blueprint.
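A minimal sketch of that grouping step, written in Python with only the standard library rather than with HTML::Tree, so treat it as an illustration of the idea and not as the code I actually ran. The names `SkeletonParser`, `blueprint` and `group_by_blueprint` are made up for this example, reading "hash" as an MD5 digest is an assumption, and "structural" is taken to mean tag names only (text, comments and attributes are dropped).

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are always emitted self-closed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SkeletonParser(HTMLParser):
    """Collects only the tag structure; text, comments and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}/>" if tag in VOID_TAGS else f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.parts.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")


def blueprint(html):
    """Return the page reduced to its tags, with no whitespace between them."""
    parser = SkeletonParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def group_by_blueprint(pages):
    """pages maps file name -> HTML source; result maps digest -> blueprint and files."""
    classes = defaultdict(lambda: {"blueprint": None, "files": []})
    for name, html in pages.items():
        shape = blueprint(html)
        key = hashlib.md5(shape.encode("utf-8")).hexdigest()
        classes[key]["blueprint"] = shape
        classes[key]["files"].append(name)
    return dict(classes)
```

Classes with only a handful of files are usually the ones worth opening by hand first, since they tend to be the hand-edited oddballs.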

After you get the structures based on the blocks on the pages, repeat the same procedure for each block and the information it contains.
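The per-block pass can be sketched the same way. Again this is Python standard library standing in for HTML::Tree, the names `BlockSkeletons`, `block_blueprints` and `count_block_shapes` are hypothetical, and picking `table` as the block element is just an assumption for the example. It takes every outermost block of the chosen tag, reduces it to its tags, and counts how often each shape occurs.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are always emitted self-closed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlockSkeletons(HTMLParser):
    """Records the tag-only skeleton of every outermost <block_tag> element."""

    def __init__(self, block_tag):
        super().__init__()
        self.block_tag = block_tag
        self.depth = 0      # nesting depth of block_tag we are currently inside
        self.current = []   # skeleton parts of the block being read
        self.blocks = []    # finished skeleton strings, in document order

    def handle_starttag(self, tag, attrs):
        if tag == self.block_tag:
            self.depth += 1
        if self.depth:
            self.current.append(f"<{tag}/>" if tag in VOID_TAGS else f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        if self.depth:
            self.current.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID_TAGS:
            return
        self.current.append(f"</{tag}>")
        if tag == self.block_tag:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self.current))
                self.current = []


def block_blueprints(html, block_tag="table"):
    """Return one skeleton string per outermost block_tag element in the page."""
    parser = BlockSkeletons(block_tag)
    parser.feed(html)
    parser.close()
    return parser.blocks


def count_block_shapes(pages, block_tag="table"):
    """pages maps file name -> HTML source; result counts each block shape."""
    counts = Counter()
    for html in pages.values():
        counts.update(block_blueprints(html, block_tag))
    return counts
```

Each frequent shape then gets its own extraction routine, and the rare ones show where the manual edits broke the layout.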

The case in which this method got the best results was extracting information from about 18,000 pages, each containing something like 3 tables of key-value pairs, some free-form content (descriptions, reviews, comments), images and some other blocks. The layout was a mess: part of the pages had initially been generated by 3-4 different scripts and then updated by hand over a few years with anything from text editors to *cough* FrontPage. The result was pretty impressive: in just about a day (actually a night), over 99% of the pages were in a nice database.


In reply to Re: Validating HTML structures by b374
in thread Validating HTML structures by wfsp
