
Re: A few random questions from Learning Perl 3

by Anonymous Monk
on Jan 06, 2003 at 04:44 UTC (#224539)


in reply to A few random questions from Learning Perl 3

I read a post a few days ago about someone saying something like "shame on you for using regex like that". I just finally understand what regex is and that statement doesn't make much sense to me. Regex is like patterns and substitutions, are there specific times when you're not supposed to use them or there is something better to use?

Yeah, you hear this crap sometimes, especially when you are trying to parse HTML. Basically, there are modules to handle the parsing, and these modules (e.g. HTML::Parse) tend to be very reliable and are an easy way to get the job done. HOWEVER, there is nothing wrong with using a regex. After all, Perl is optimized for regexes, and a regex will often result in faster code than a module. HOWEVER, most people do not fully understand regexes, so they often overlook something. Basically, I love regexes. Use them! Even for parsing HTML! Just make sure you know what you are doing! And have fun! But if your job is on the line, go ahead and use a module.

Re: Re: A few random questions from Learning Perl 3
by gjb (Vicar) on Jan 06, 2003 at 06:09 UTC

    It might be useful to read up a bit on the theory of formal languages. You'll see that there's a whole family of languages, each described by a certain mathematical formalism. Regular languages are one example, and as you can guess they're described by regular expressions. Unfortunately, HTML is not a regular language and hence cannot be described by regular expressions, since they're just not powerful enough.

    By way of example, consider <em>hello beautiful HTML <em>world</em></em>: it's easy to write a regular expression to get the inner "world", isn't it? Now consider <em>hello <em>beautiful<em>HTML world</em></em></em>. If you want to match something, you can again write a regular expression... as long as you know the maximum depth to which the <em>...</em> tags will be nested.

    HTML allows unbounded nesting of tags, so you can't write a general regular expression that describes every possible nesting situation. Regular expressions are simply not powerful enough; you'll need at least context free languages, hence a tool such as HTML::Parser or, for general cases, something like Parse::RecDescent.
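To make the nesting point concrete, here is a minimal sketch, assuming only core Perl. The pattern `$naive` and the helper `max_em_depth` are hypothetical names invented for this illustration, and the depth counter is a stand-in for what a real parser does, not a replacement for HTML::Parser.

```perl
use strict;
use warnings;

# Naive pattern: capture from an <em> up to the nearest </em>.
my $naive = qr{<em>(.*?)</em>}s;

my $flat   = '<em>world</em>';
my $nested = '<em>hello <em>beautiful <em>HTML world</em></em></em>';

# In list context a match returns its captures.
my ($flat_hit)   = $flat   =~ $naive;
my ($nested_hit) = $nested =~ $naive;

# Hypothetical helper: walk the <em> and </em> tags with an explicit
# counter. Returns the deepest nesting level, or -1 if the tags do not
# balance. The counter is the piece a plain regular expression lacks.
sub max_em_depth {
    my ($html) = @_;
    my ($depth, $max) = (0, 0);
    while ($html =~ m{<(/?)em>}g) {
        if ($1) {                       # closing tag
            $depth--;
            return -1 if $depth < 0;
        }
        else {                          # opening tag
            $depth++;
            $max = $depth if $depth > $max;
        }
    }
    return $depth == 0 ? $max : -1;
}

print "flat capture:   $flat_hit\n";
print "nested capture: $nested_hit\n";
print "nesting depth:  ", max_em_depth($nested), "\n";
```

On the nested input the naive pattern stops at the first closing tag, so its capture still contains two unmatched opening tags. The counter handles any depth because it keeps state outside the pattern.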

    Now you can argue:

    1. yeah right, but real world HTML is not that complicated, or
    2. you can fiddle with embedded code and cuts in regular expressions.
    As to the first argument: you don't always know this in advance if you don't control the HTML generation yourself. People are bound to do weird things, mostly not even on purpose.
    As to the second argument: true, but these are still experimental features (as the docs specify for 5.6.1) and they're not at all obvious to use, even up to the point that it is easier to use a more powerful tool than get the particular regular expression right. (Note from a formal language theory point of view: embedded code, cuts and the like increase Perl "regular expressions" beyond regular languages.)
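For the curious, here is roughly what "embedded code and cuts" looks like in practice. This is a sketch adapted from the well-known balanced-parentheses idiom in the perlre documentation, not something from the posts above; the names `$em` and `balanced_em` are made up for the example, and it only understands bare `<em>` tags with no attributes.

```perl
use strict;
use warnings;

# A pattern that refers to itself through (??{ ... }).
# Declared first so the code block can see the variable at match time.
my $em;
$em = qr{
    <em>
    (?:
        (?> [^<]+ )     # plain text, matched atomically: this is the cut
      |
        (??{ $em })     # a nested element, matched by this same pattern
    )*
    </em>
}x;

# Hypothetical helper: true if the whole string is one balanced element.
sub balanced_em {
    my ($html) = @_;
    return $html =~ /\A$em\z/ ? 1 : 0;
}

for my $sample ('<em>a<em>b</em></em>', '<em>a<em>b</em>') {
    printf "%-22s %s\n", $sample, balanced_em($sample) ? 'balanced' : 'not balanced';
}
```

It works, but it takes some staring to see why, which is rather the point of the second counter-argument.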

    Given this story, your claim that one can deal with all HTML problems by using regular expressions shows some unwarranted optimism on your part. Obviously there's no reason to believe me, so I'll suggest a number of references on the subject:

    And who knows, maybe our own mstone will write a MOPT on the subject one of these days? (Hint, hint ;-)

    Just my 2 cents, -gjb-

    Update: Thanks TheHobbit for reiterating the points I actually mention in my text if you bother to read it carefully. (?{...}) and /e are called code embedding.

      Hi,
      I'll add some considerations which look needed. This will also be an answer to the 'Anonymous' below, who thinks he or she can hide and insult people without even bothering to register with the community...

      Strictly speaking, Perl regexes are really much more powerful than those described in the wonderful books you refer to. To understand regexes as they are used in Perl (but also in other languages & tools) I'd rather refer to

      A basic thing that one always sees written about regexes is that they cannot count, meaning that you must know the maximum number of times the <em>...</em> will be nested.

      While this is true of 'standard' regexes, it is not true of Perl regexes. By using a careful combination of the /e modifier and the (?{}) programmatic pattern, you can do with a regex everything a parser will do.

      IMHO, using a regex or another approach is a matter of taste, and a carefully crafted and optimized regex will be more efficient than a sloppily written recursive descent parser.

      Just my 5 (euro) cents.

      Cheers


      Leo TheHobbit
        By using a careful combination of the /e modifier and the (?{}) programmatic pattern, you can do with a regex everything a parser will do.

        My guess is that you probably mean the (??{...}) assertion.

        (?{...}) merely executes, whereas
        (??{...}) executes and interpolates.

        (A possibly confusing mnemonic would be that one ? would be like one q, which doesn't interpolate. Double ? would be like double q, which interpolates. These are different types of interpolation (one interpolates into the construction, one interpolates its result), so ignore this if it doesn't make sense to you.)

        A bit generalized you may say that:
        (?{...}) is used for debugging and/or setting state.
        (??{...}) is used for generating patterns at "match-time".
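A small sketch of the difference, assuming a reasonably modern perl (lexical variables inside regex code blocks were unreliable in older releases). The variable names and the "digit says how many characters follow" format are invented for the illustration.

```perl
use strict;
use warnings;

# (?{ ... }) only runs code; nothing it returns is matched against the
# string. Here it sets state by counting the iterations of the group.
my $count = 0;
'aaa' =~ /^(?:a(?{ $count++ }))*$/;
print "group iterations: $count\n";

# (??{ ... }) runs code and then matches its return value as a pattern.
# Here a leading digit decides how many characters must follow it.
my $len_prefixed = qr/^(\d)(??{ sprintf '.{%d}', $1 })$/;

for my $s ('3abc', '2abc') {
    printf "%-5s %s\n", $s, $s =~ $len_prefixed ? 'matches' : 'does not match';
}
```

Note that a counter bumped inside (?{ ... }) is not rolled back when the engine backtracks unless it is localized, so this style of state keeping needs care on patterns that backtrack.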

        Beware of using =~ inside either of these assertions though. The engine is known to often blow up on that.

        Update:
        A good example that uses both of these assertions can be found at Re: Capturing brackets within a repeat group [plus dynamic backreferences].

        Hope I've helped,
        ihb
        I am the previous anonymous monk.

        Perl 6's regex syntax will make using it for parsing reasonable. But while you theoretically can do that now, with a lot of care and using constructs that very few people know about, the odds are strong that the average programmer who thinks they can simply misunderstands and underestimates the difficulties.

        Therefore on the odds I stand by my previous comments.

      You're right, and you're wrong... I'm fairly certain that ordinary regular expressions aren't up to parsing HTML, even on a theoretical basis, but Perl regular expressions are a whole 'nother breed. Matching regular expressions with backreferences is NP-complete; it's been proven at least twice. (Well, three times, but one of the proofs is buggy.) I suspect I'm missing something here... if anybody knows what (other than my mind), I'd love to hear it.


      Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

        NP-completeness is a property of a problem rather than of a particular algorithm. It implies that no algorithm is known that solves the problem in polynomial time.
        This means that, for the known exact algorithms, the worst-case execution time grows exponentially as the length of the input increases. (Of course there are input cases which are polynomial, but many of interest are not.) Essentially, it means that something close to brute force is the only known way to tackle the problem exactly.

        The question is how the complexity of deciding membership in a language relates to the class to which that language belongs. For regular languages and context free languages, polynomial time algorithms are known. But does the fact that matching regular expressions with backreferences is NP-complete necessarily mean that the languages they describe are a superset of the regular and context free languages?

        It certainly means it is hard to decide whether or not a certain string is an element of the language described by a regular expression with backreferences. But what does it tell us about the expressive power?

        The expression /^(.*)\1$/ defines the language {ww | w in sigma*}, which is known to be neither regular nor context free. On the other hand, regular expressions with backreferences can't describe {a^n b^n | n >= 0}, which is definitely context free.

        So on the one hand, regular expressions with backreferences describe languages that are not context free, but on the other they can't describe all context free languages either! This example illustrates that one has to be very careful when judging expressive power from algorithmic complexity. A high complexity is a sign that the expressive power must be high in some cases, but doesn't guarantee that everything can be done.
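The {ww} pattern is easy to try out. A minimal sketch, using the expression exactly as written above; the wrapper name `is_ww` is made up for the example.

```perl
use strict;
use warnings;

# True if the string is some word w followed by a second copy of w.
# The backreference \1 must repeat whatever (.*) captured.
sub is_ww {
    my ($s) = @_;
    return $s =~ /^(.*)\1$/ ? 1 : 0;
}

for my $s ('abab', 'abcabc', 'aba', 'abcab') {
    printf "%-7s %s\n", $s, is_ww($s) ? 'is of the form ww' : 'is not of the form ww';
}
```

The engine finds the split point by backtracking: (.*) first grabs everything, then gives characters back one at a time until the remainder equals the capture or no split is left.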

        Incidentally, the two Perl regular expressions below describe non-regular languages:

        { a^n b^n | n >= 0 }, which is context free as mentioned above:
        /^ (a*) (??{sprintf("b{%d}", (length($1)))}) $/x

        { a^n b^n c^n | n >= 0 }, which is context sensitive:
        /^ (a*) (??{sprintf("b{%d}", (length($1)))}) (??{sprintf("c{%d}", (length($1)))}) $/x
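Both expressions can be exercised with a short script. This is a sketch only: the patterns are the ones given above (minus a redundant pair of parentheses), while the variable names and the sample strings are assumptions made for the illustration.

```perl
use strict;
use warnings;

# a^n b^n: capture the run of a's, then build "b{n}" at match time.
my $anbn = qr/^ (a*) (??{ sprintf("b{%d}", length($1)) }) $/x;

# a^n b^n c^n: the same trick, applied once more for the c's.
my $anbncn = qr/^ (a*)
                  (??{ sprintf("b{%d}", length($1)) })
                  (??{ sprintf("c{%d}", length($1)) }) $/x;

for my $s ('aabb', 'aab', 'abab') {
    printf "%-7s %s\n", $s, $s =~ $anbn ? 'in a^n b^n' : 'not in a^n b^n';
}
for my $s ('aabbcc', 'aabbc', 'abc') {
    printf "%-7s %s\n", $s, $s =~ $anbncn ? 'in a^n b^n c^n' : 'not in a^n b^n c^n';
}
```

Each (??{ ... }) block is re-evaluated whenever the engine backtracks into (a*), so the generated quantifier always reflects the current length of $1.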

        Just my 2 cents, -gjb-

Re: Re: A few random questions from Learning Perl 3
by Anonymous Monk on Jan 06, 2003 at 06:35 UTC
    If you still think that regular expressions are appropriate for parsing HTML then I guarantee that you don't understand regular expressions as well as you think that you do, and you have written some really crappy parsing code.
