Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
the example html is truly appaling, using outdated html attributes for presentation, and not providing any elements that specify document structure. it is just the conditions that these sorts of html's live in, that make it conducive to break frequently and unexpectedly...as the person owning the document uses frontpage (or worse) msword to generate html content, with outdated html attributes all over the place. as there's no real structural html elements, with just look/feel elements and data intetmixed, changes to document by owner are usually very naive in terms of valid/sane html. for example you may start seeing at some stage several empty opening and closing font tags. still looks exactly same as before, but the naive generation of html using msword or older frontpage, now breaks the scraping code and you're back at square one trying to figure it out.
whilst scraping (sometimes very bad) html is inevitable, the way you go about it can make some difference. basing a regex for html scraping on the value of a particular attribute is particularly bad, e.g. don't look for "font size="1">"....if you must base it on the font tag, just look for the tag and nearest closing brace, as an anchor.
i've worked with a federated searching java based engine for some time, and it is exactly when the vendor wrote scraping code to match frequently changing html (e.g. html attributes) that often ended up breaking these scrapers. so instead of moving onto bigger and better things..you end up maitaining a whole lot of scrapers that break all the time, and you're at the pointy end of the "fix it now".
in my opinion the example html is so bad as to be practically of no use, and you might as well use module or whatever to strip html altogether, and just base the scraping of the well defined terms that have a following colon.
btw when i mentioned "naive" html document owner, i don't mean to be nasty, just means they don't know or do any better for whatever reason. it's naive in terms of using html with regard to spec and current best practice.
the hardest line to type correctly is: stty erase ^H

In reply to Re^2: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!) by aquarium
in thread how to quickly parse 50000 html documents? by brengo

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others rifling through the Monastery: (9)
    As of 2014-10-01 01:14 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      How do you remember the number of days in each month?











      Results (386 votes), past polls