PerlMonks  


Although I agree that the choice between a regexp approach and a context-free grammar approach depends on the problem at hand, I'd like to stress that halley made a very important point:

Rules are meant to be broken, but you've to understand them before you can break them... safely.

Although a lot of Monks will know the distinction between a regular language and a context-free language (and I'm sure grinder and BrowserUK do), I'm rather sure that some don't. Unfortunately, those Monks simply don't know the rules and so have plenty of opportunity to mess up.

I'd like to paraphrase: "a little thinking is a dangerous thing" if the process is not supported by a proper amount of background knowledge.

It is possible to approximate a context-free grammar with a regular expression; Mark-Jan Nederhof has written a nice survey article about that. There are several good books about formal languages, but I'd particularly recommend Sipser's, since it is well written and pleasant to read.
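To make the idea of approximation concrete, here is a minimal sketch in Python (not Perl, and not taken from Nederhof's article). It assumes a toy context-free language, balanced parentheses, and approximates it with a true regular expression by capping the nesting depth. The names `bounded_paren_regex` and `is_balanced` are made up for illustration.

```python
import re


def bounded_paren_regex(max_depth: int) -> re.Pattern:
    """Build a regex for balanced parentheses nested at most max_depth deep."""
    # Depth 0: text that contains no parentheses at all.
    pattern = r"[^()]*"
    for _ in range(max_depth):
        # One more level: plain text interleaved with parenthesised groups
        # whose contents match the pattern of the previous (shallower) level.
        pattern = rf"[^()]*(?:\({pattern}\)[^()]*)*"
    return re.compile(pattern)


def is_balanced(text: str) -> bool:
    """Exact check using a counter, i.e. the unbounded memory a regular language lacks."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# The approximation is only correct up to the depth it was built for.
approx = bounded_paren_regex(2)
for sample in ["a(b)c", "((x))", "(((x)))", "(()"]:
    print(sample, bool(approx.fullmatch(sample)), is_balanced(sample))
```

The regex built for depth 2 agrees with the exact check on the first two samples and on the unbalanced one, but rejects `(((x)))` even though it is balanced. That gap is exactly the kind of rule you need to understand before relying on a regexp for nested input such as HTML.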

Conclusion: if you know the rules but don't understand them, don't try to break them. More importantly: try to understand the rules you're following.

Just my 2 cents, -gjb-


In reply to Re: Scraping HTML: orthodoxy and reality by gjb
in thread Scraping HTML: orthodoxy and reality by grinder
