lenrobert has asked for the wisdom of the Perl Monks concerning the following question:

I am aware of OR ( | ), but is there a logical NOT in Perl regex syntax?

The task would be the following: to extract the relative links (i.e. the href property of the "a" element) from an HTML file, even if it is not enclosed in quotation marks. This means I don't want to retrieve hyperlinks beginning with /, or # or javascript:

I would like to express the following pattern and capture (extract) the content of the second parenthesis.

(  <a href="   OR <a href=) THEN NOT(/  OR   #   OR   javascript:  OR  \s   OR  "  ) THEN ( \s   OR  "  )

The best regexp I could do is this, but it does not handle the case of / # javascript: etc.

/(<a href="|<a href=)([^"]*?)(\s|")/gi

Does anyone know the answer and can help me? Thanks in advance,


Replies are listed 'Best First'.
Re: Boolean operators in PERL regexp?
by friedo (Prior) on Feb 24, 2005 at 18:45 UTC
    Parsing HTML with regular expressions can only end in tears. I would suggest checking out HTML::Parser.

    I think the closest thing to what you're looking for is the zero-width negative lookahead assertion. For example, /foo(?!bar)/ matches "foo" only when it is not immediately followed by "bar". See perlre for more information; negative lookaheads are difficult to master and can become quite complex.
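
A minimal sketch of that assertion in action. The sample strings are invented for illustration and are not from the original question.

```perl
use strict;
use warnings;

# Sample strings are made up for this example.
my @samples = ('foobar', 'foobaz', 'foo', 'xfoobarfoo');

# /foo(?!bar)/ succeeds if some "foo" in the string is not
# immediately followed by "bar". The lookahead consumes nothing.
my @matched = grep { /foo(?!bar)/ } @samples;

print "$_\n" for @matched;
```

Note that 'xfoobarfoo' still matches: the first "foo" is rejected, but the regex engine moves on and finds the second one, which is not followed by "bar".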

Re: Boolean operators in PERL regexp?
by Enlil (Parson) on Feb 24, 2005 at 18:55 UTC
    m{<a href="?(?!/|#|javascript)([^">]*)}
    or something near that. Note that there are much better tools for parsing HTML than regular expressions. Try HTML::TokeParser or HTML::Parser, and save yourself a headache.
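
A hedged sketch of how a lookahead-based pattern along these lines might be applied to a whole document. The sample markup and the names $html, $link_re and @relative are assumptions made for illustration. The pattern is a variation, not the exact one above: it also rejects a quote inside the lookahead, so that an optional opening quote cannot be skipped and then re-read as the start of the value, and it stops an unquoted value at whitespace.

```perl
use strict;
use warnings;

# Hypothetical sample markup, invented for this example.
my $html = <<'HTML';
<a href="page1.html">one</a>
<a href=/abs/path.html>two</a>
<a href="#top">three</a>
<a href="javascript:void(0)">four</a>
<a href=docs/page2.html>five</a>
<A HREF="img/pic.png" class="x">six</A>
HTML

# Brace delimiters keep "/" and "#" literal inside the pattern.
# The lookahead refuses values starting with a quote, "/", "#" or "javascript:".
my $link_re = qr{<a\s+href="?(?!["/#]|javascript:)([^"\s>]+)}i;

# In list context, //g returns every captured group.
my @relative = $html =~ /$link_re/g;

print "$_\n" for @relative;
```

This only handles an href that is the first attribute and is either unquoted or double-quoted, and it would still pick up absolute URLs such as http://...; a real parser avoids those limitations.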


Re: Boolean operators in PERL regexp?
by ikegami (Pope) on Feb 24, 2005 at 19:02 UTC

    I recommend against using these (a parser would work better), but here's a snippet from my scratchpad:

    /^(?:(?!$re).)*$/           # NOT re
    /$re1|$re2/                 # re1 OR re2
    /^(?=.*$re1)(?=.*$re2)/     # re1 AND re2

    NOT must be anchored on both ends, but it doesn't have to be with ^ and $.

    AND doesn't have to be anchored, but if the start is anchored (with ^ or by some other means), it should speed up the case where there is no match.

    The .* in AND may need to be replaced so it doesn't look too far ahead.
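
The three idioms can be tried side by side with something like the sketch below. The patterns in $re1 and $re2 and the test strings are placeholders chosen for illustration; %result is a hypothetical helper for collecting the outcomes.

```perl
use strict;
use warnings;

# Placeholder patterns, chosen only for illustration.
my $re1 = qr/cat/;
my $re2 = qr/dog/;

my @lines = ('cat and dog', 'only cat', 'only dog', 'neither');
my %result;

for my $line (@lines) {
    # NOT re1: no position in the string may start a match of re1.
    my $not = $line =~ /^(?:(?!$re1).)*$/ ? 1 : 0;
    # re1 OR re2: plain alternation.
    my $or  = $line =~ /$re1|$re2/ ? 1 : 0;
    # re1 AND re2: two lookaheads tested from the same starting point.
    my $and = $line =~ /^(?=.*$re1)(?=.*$re2)/ ? 1 : 0;

    $result{$line} = "$not$or$and";
    printf "%-12s NOT=%d OR=%d AND=%d\n", $line, $not, $or, $and;
}
```

Because "." does not match a newline by default, multi-line input would need the /s modifier or a different inner atom for the NOT and AND forms to behave the same way.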

      Friedo, Enlil, Ikegami, thank you for your responses.

      Yes, my problem might be easily solved with an HTML parser, but beyond my specific question, in my opinion a logical NOT would be essential for regexps. Lookahead and lookbehind assertions are very good; the only fatal problem with them is that they are non-capturing, i.e. I cannot extract the non-matching text. Maybe there is a reason for this, I don't know.

        Whatcha talking about?
                                                    NOT /XYZ/
                                                  vvvvvvvvvvvvv
        >perl -e "$_ = 'ABCDEFGHIJKL'; print(/ABC((?:(?!XYZ).)*)JKL/);"
        DEFGHI