PerlMonks  


Amen, grinder++. I wholeheartedly agree.

I came to much the same conclusion in Being a heretic and going against the party line, after having used Perl for only a relatively short time. My experiences since have done little to change my mind.

Back in that old post I tried to make a distinction between the need to parse HTML and the need to extract something that just happens to be embedded within stuff that happens to be HTML. This distinction was roundly set upon as being wrong. I still hold with this distinction.

The dictionary definition of parse is:

  1. To break (a sentence) down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.
  2. To describe (a word) by stating its part of speech, form, and syntactical relationships in a sentence.
  3. a. To examine closely or subject to detailed analysis, especially by breaking up into components: “What are we missing by parsing the behavior of chimpanzees into the conventional categories recognized largely from our own behavior?” (Stephen Jay Gould).
     b. To make sense of; comprehend: I simply couldn't parse what you just said.

Whilst the dictionary definition of extract is:

  1. To draw or pull out, often with great force or effort: extract a wisdom tooth; used tweezers to extract the splinter.
  2. To obtain despite resistance: extract a promise.
  3. To obtain from a substance by chemical or mechanical action, as by pressure, distillation, or evaporation.
  4. To remove for separate consideration or publication; excerpt.
  5. a. To derive or obtain (information, for example) from a source.
     b. To deduce (a principle or doctrine); construe (a meaning).
     c. To derive (pleasure or comfort) from an experience.
  6. Mathematics. To determine or calculate (the root of a number).

From my perspective, when the need is to locate and capture one or more pieces of information from within any amount or structure of other stuff, without regard to the structural or semantic positioning of those pieces within the overall structure, the term extraction is more applicable than parsing. If I need to understand the structure, derive semantic meaning from the structure or verify its correctness, then I need to parse; otherwise I just need to extract. After all, Practical Extraction and Reporting is what that Language was first designed to do.
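
To make the distinction concrete, here is a minimal sketch of extraction in the sense described above. The sample markup, the class names and the variable names are all invented for illustration; the only assumption that matters is that the wanted values look like dollar amounts. The regex captures them without knowing or caring how the surrounding HTML is nested.

```perl
use strict;
use warnings;

# Hypothetical input: the prices are the information we want;
# the markup around them is just "other stuff".
my $page = <<'HTML';
<table><tr><td class="item">Widget</td><td class="price">$4.50</td></tr>
<tr><td class="item">Gadget</td><td class="price">$12.00</td></tr></table>
HTML

# Extraction: capture every dollar amount, with no regard to
# where it sits in the document structure.
my @prices = $page =~ /\$(\d+\.\d\d)/g;

print join(', ', @prices), "\n";
```

Nothing here builds a tree, checks tag balance or derives meaning from the structure, which is the work that would call for a parser.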

My final, and strongest, argument lies in a simple premise. If the information I was after were embedded amongst a lot of Arabic, Greek or Chinese, then no one would expect me to find and use a module that understood those languages, just to extract the bits I needed.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller



In reply to Re: Scraping HTML: orthodoxy and reality by BrowserUk
in thread Scraping HTML: orthodoxy and reality by grinder
