If the .html is indeed as regular as portrayed (and all the job specs are in a single html page as I think you have suggested), this seems almost trivial.

Based on the sample data you've shown and your (somewhat conflicted??) description of what your boss wants, I'm going to assume that you want to capture as much of the first <p> as is included in the "jobname" and "jobserial" <span>s, then skip over the (possibly outdated) incumbents, resuming your capture with the first blockquote.

What I'm hoping this labored phrasing suggests is that designing a regex (or a group of them) is at least as much about analysis of the source data as about coding.

In other words, it matters little whether you use a non-greedy lookahead, a negated class, or any one of several other techniques that leap to mind to skip the 2nd and 3rd blockquotes. Each of those happens to be immediately followed by an <a href..., which makes them easy to distinguish and thus eases the way to satisfying your "without grabbing" requirement.
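As one rough illustration of the lookahead variety (a sketch only, not necessarily what you had in mind): the markup below is invented, since I'm only assuming your blockquotes look something like this, and the name @wanted is mine. The body is matched with a "tempered" dot so a single match can never run across a closing </blockquote>, and a negative lookahead rejects any blockquote that is immediately followed by a link.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real page's markup is assumed, not known.
my $html = <<'HTML';
<blockquote>Duties: sweep the floor</blockquote>
<blockquote>Old notes</blockquote><a href="a.html">more</a>
<blockquote>Older notes</blockquote> <a href="b.html">more</a>
HTML

# Capture each blockquote body, but only when the closing tag is NOT
# followed (after optional whitespace) by an <a href...> link.
# (?:(?!</blockquote>).)* consumes one character at a time and stops
# before the first closing tag, so a match cannot span two blockquotes.
my @wanted = $html =~ m{
    <blockquote>
    ( (?: (?! </blockquote> ) . )* )
    </blockquote>
    (?! \s* <a \s+ href )
}gsx;

print "$_\n" for @wanted;
```

With the invented sample above, only the first blockquote survives. A negated class such as [^<]* would do the same job if the blockquote bodies never contain tags of their own.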

Similarly, analysis of the initial info (again, assuming regularity) tells you that you want to start capturing with the line following <p><b><span class="jobname"> and with the numeric data immediately following ="jobserial">( (or, if you prefer, the digits between the parentheses after ="jobserial">, by which I mean to suggest an alternate algorithm/regex technique).
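A minimal sketch of that analysis turned into code, again assuming regularity. The job name, the serial number and the exact tag layout below are made up for illustration; only the class names "jobname" and "jobserial" come from your description.

```perl
use strict;
use warnings;

# Hypothetical fragment; only the two class names come from the thread.
my $html = <<'HTML';
<p><b><span class="jobname">
Widget Polisher</span></b> <span class="jobserial">(12345)</span></p>
HTML

# Job name: skip the newline after the opening span, then take the text
# up to the next tag, leaving any trailing whitespace out of the capture.
my ($name) = $html =~ m{<span class="jobname">\s*([^<]+?)\s*<};

# Serial: the digits between the parentheses that follow ="jobserial">.
my ($serial) = $html =~ m{class="jobserial">\((\d+)\)};

print "$name / $serial\n" if defined $name and defined $serial;
```

Both captures come back undef if the page turns out to be less regular than portrayed, which is why the print is guarded.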

Following any (or, better, several!) of the approaches suggested by the above may not be what you actually had in mind, but might still serve "to expand (your) Perl skills...."

Of course, if you have a text editor that will remove .html tags and supports regexen, one approach might be to simply capture the webpage source (by whatever means: save_as from a browser, LWP, etc.), open the file in the editor, delete the tags and use two simple regexen to replace


In reply to Re: Parsing HTML files to recover data... by ww
in thread Parsing HTML files to recover data... by UrbanHick
