Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
the example html is truly appaling, using outdated html attributes for presentation, and not providing any elements that specify document structure. it is just the conditions that these sorts of html's live in, that make it conducive to break frequently and the person owning the document uses frontpage (or worse) msword to generate html content, with outdated html attributes all over the place. as there's no real structural html elements, with just look/feel elements and data intetmixed, changes to document by owner are usually very naive in terms of valid/sane html. for example you may start seeing at some stage several empty opening and closing font tags. still looks exactly same as before, but the naive generation of html using msword or older frontpage, now breaks the scraping code and you're back at square one trying to figure it out.
whilst scraping (sometimes very bad) html is inevitable, the way you go about it can make some difference. basing a regex for html scraping on the value of a particular attribute is particularly bad, e.g. don't look for "font size="1">"....if you must base it on the font tag, just look for the tag and nearest closing brace, as an anchor.
i've worked with a federated searching java based engine for some time, and it is exactly when the vendor wrote scraping code to match frequently changing html (e.g. html attributes) that often ended up breaking these scrapers. so instead of moving onto bigger and better end up maitaining a whole lot of scrapers that break all the time, and you're at the pointy end of the "fix it now".
in my opinion the example html is so bad as to be practically of no use, and you might as well use module or whatever to strip html altogether, and just base the scraping of the well defined terms that have a following colon.
btw when i mentioned "naive" html document owner, i don't mean to be nasty, just means they don't know or do any better for whatever reason. it's naive in terms of using html with regard to spec and current best practice.
the hardest line to type correctly is: stty erase ^H

In reply to Re^2: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!) by aquarium
in thread how to quickly parse 50000 html documents? by brengo

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    [Corion]: Heh - I just realized, $work management just declared "no raises for anybody this year", and I'm not even angry about that. Even though that implicitly means "raises for the people with an automatic 5% raise in their contract", which the newly-merged ...
    [Corion]: ... coworkers have. But I guess I've gone more mellow since I get to relax more, and such stuff doesn't make me as angry as it used to.
    [Corion]: $boss will still get to listen to my interpretation :-D
    [Eily]: hey, I'm just behind Larry in SioB \o/
    [Corion]: Eily: Wheee ;)
    [Eily]: I'll add that to my résumé

    How do I use this? | Other CB clients
    Other Users?
    Others chanting in the Monastery: (7)
    As of 2018-01-22 11:10 GMT
    Find Nodes?
      Voting Booth?
      How did you see in the new year?

      Results (233 votes). Check out past polls.