in my opinion the example html is so bad as to be practically of no use, and you might as well use module or whatever to strip html altogether, and just base the scraping of the well defined terms that have a following colon.

There is a simple maxim taught to me by my first boss in programming: don't do what won't benefit you.

All we have to go on is that bad html snippet the OP posted. In all likelihood, all he has to go on is that html snippet grabbed from whatever website it came from. We could try to predict what might happen in the future and cater for it, but the highest probability is that whatever we guess will be wrong.

The only sensible thing to do is work with what we know. And what we know for now is that the simple regex used works. If, in the future, the HTML changes, then the 5 minutes it took to construct the program above may need to be repeated. If it then changes again, maybe there would be some pattern to the change that might suggest a better approach. But it might never change, and any effort expended now to try and cater for unknown changes that might never happen would be entirely wasted.

If these numbers were embedded in a plain text document, no one here would blink an eye about using a regex. But add a few <> into the mix and suddenly many start trotting out cargo-cult wisdom: "Don't parse HTML/XML/XHTML/whatever with regex"; completely missing that most of the time nobody wants to parse the HTML, just extract some small subset of text from a larger set of text. I.e., they want to do exactly what regexes are designed to do.
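A minimal sketch of that kind of extraction, in Python rather than the Perl used in the thread. It is not the program referred to above: the sample markup, the field names and the `extract_fields` helper are all made up for illustration, since the OP's snippet isn't reproduced here.

```python
import re

# Hypothetical page fragment; the real markup from the thread is not shown here.
SAMPLE = '''
<tr><td><font size="1">Height: 182 cm</font></td></tr>
<tr><td><font size="1">Weight:</font> <b>75 kg</b></td></tr>
'''

# A label, a colon, any run of tags/whitespace, then the text up to the next tag.
FIELD = re.compile(r'(\w+):\s*(?:<[^>]*>\s*)*([^<]+)')

def extract_fields(text):
    """Return {label: value} for every 'Label: value' pair found in text."""
    return {label: value.strip() for label, value in FIELD.findall(text)}

if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Nothing here builds a document tree. The tags are treated as noise to skip over, which is all the task needs as long as the labels stay put.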

basing a regex for html scraping on the value of a particular attribute is particularly bad, e.g. don't look for "font size="1">"....if you must base it on the font tag, just look for the tag and nearest closing brace, as an anchor.
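The distinction being drawn in that quote can be sketched as follows (Python, with invented markup; neither pattern is taken from the thread). One pattern is tied to the exact attribute text, the other anchors only on the tag name and the nearest closing `>`.

```python
import re

# Tied to the attribute value: any change to the attributes breaks the match.
BY_ATTRIBUTE = re.compile(r'<font size="1">([^<]*)')

# Anchored on the tag name and the nearest closing '>' only.
BY_TAG = re.compile(r'<font\b[^>]*>([^<]*)')

original = '<font size="1">42</font>'
restyled = '<font face="Arial" size="2">42</font>'  # hypothetical later redesign

def grab(pattern, html):
    """Return the captured text, or None if the pattern does not match."""
    m = pattern.search(html)
    return m.group(1) if m else None

if __name__ == "__main__":
    for name, pat in (("by attribute", BY_ATTRIBUTE), ("by tag", BY_TAG)):
        print(name, grab(pat, original), grab(pat, restyled))
```

The looser pattern survives a restyle of this kind, though it is only a guess at what a future change might look like.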

I'll take your word for the quality or lack thereof of the html, because I neither know nor care. It's just text within text to me.

For now, what I've suggested to the OP works. And it works 500 times more quickly than his existing solution. If he gets to use it once before the source changes, he can afford to spend 3 working days rewriting it and still have gained. And it took me less than 5 minutes to write this version and maybe 10 to test it, most of which was taken up generating 1000 test pages. If he gets to use it 10 times, he's saved himself enough time to take a month's vacation.
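A rough check of that arithmetic, as a Python sketch. It assumes the 3 minutes quoted in the parent node's title is the run time that the 500x figure applies to, and it assumes an 8-hour working day; neither is stated outright in this post.

```python
NEW_RUN_MIN = 3      # from the parent node's title: 50,000 pages in 3 minutes
SPEEDUP = 500        # "500 times more quickly"
WORKDAY_HOURS = 8    # assumption: length of a working day

# Implied run time of the existing solution, in hours.
old_run_hours = NEW_RUN_MIN * SPEEDUP / 60

# Time saved each time the faster version is used instead.
saved_per_run_hours = old_run_hours - NEW_RUN_MIN / 60

# Budget of "3 working days" for a rewrite.
rewrite_cost_hours = 3 * WORKDAY_HOURS

def saved_workdays(runs):
    """Working days saved after the given number of runs."""
    return runs * saved_per_run_hours / WORKDAY_HOURS

if __name__ == "__main__":
    print(f"old run: {old_run_hours:.1f} h, saved per run: {saved_per_run_hours:.2f} h")
    print(f"rewrite budget: {rewrite_cost_hours} h")
    print(f"saved after 10 runs: {saved_workdays(10):.1f} working days")
```

Under those assumptions one run saves slightly more than the three-day rewrite budget, and ten runs save roughly a month of working days.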

It's simple. It works. Job done. And if it requires change next week, or next month or next year, it is simple enough that it won't require deep knowledge of half a dozen co-dependent packages and APIs in order to fix it.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^3: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!) by BrowserUk
in thread how to quickly parse 50000 html documents? by brengo
