PerlMonks  

Perl & regex help

by Anonymous Monk
on Jan 30, 2013 at 22:02 UTC ( #1016162=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm stuck trying to create a regex that will match & but not &amp;.

The problem is that I need to encode HTML entities in a string, but in some cases parts of the string will already have had their HTML entities encoded.

--TWH

Re: Perl & regex help
by choroba (Canon) on Jan 30, 2013 at 22:30 UTC
    To avoid &amp;, use a negative look-ahead:
    /&(?!amp;)/
    i.e. the & not followed by amp;.
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
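A minimal sketch of how that look-ahead could be used in a substitution, assuming the goal is to turn bare ampersands into &amp; without touching ones that are already encoded. The helper name escape_bare_amp is made up for illustration, and only &amp; is recognised here, so other entities such as &lt; would still be double-encoded.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every "&" that is not already the start
# of "&amp;". No other entity form is recognised by this pattern.
sub escape_bare_amp {
    my ($text) = @_;
    $text =~ s/&(?!amp;)/&amp;/g;
    return $text;
}

print escape_bare_amp("Tom & Jerry &amp; friends"), "\n";
# prints: Tom &amp; Jerry &amp; friends
```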

      Ah, yes. Perhaps a more general form for HTML entities, e.g.:

      /&(?!.+?;)/

      Update: I have no idea what I was 'thinking' here. Excellent and appreciated corrections to this below...

        That won't just exclude HTML entities from being matched, it will exclude any & character that is in the same line as a semicolon somewhere to the right of it, because .+? also matches whitespace.
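The over-exclusion described above can be checked with a short script. This is only an illustration; count_matches is a hypothetical helper, and the sample sentences are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times a pattern matches in a string.
sub count_matches {
    my ($re, $text) = @_;
    my $n = () = $text =~ /$re/g;
    return $n;
}

my $too_broad = qr/&(?!.+?;)/;

# The "&" is followed by " a cat;", which satisfies .+?; so the
# negative look-ahead rejects it and nothing matches.
print count_matches($too_broad, "I saw a dog & a cat;"), "\n";   # prints: 0

# Without a semicolon later in the line, the same "&" is matched.
print count_matches($too_broad, "I saw a dog & a cat"), "\n";    # prints: 1
```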

        Instead, you should match for HTML/XML entities specifically. There are three forms that they can take, and the corresponding regexes for matching them would be:

          /&#[0-9]+;/ - character referenced by decimal number

          /&#x[0-9a-f]+;/i - character referenced by hexadecimal number

          /&[a-z]+;/i - character referenced by name

        Putting it together, you get this regex for matching an HTML entity:
          /&(?:#(?:[0-9]+|x[0-9a-f]+)|[a-z]+);/i

        Although that's kinda messy and pedantic, and you can probably get away with using this simplified version:
          /&#?[0-9a-z]+;/i
        (Unlike the more pedantic version, it would match some false positives such as &#amp; or &1a2b3c;, but what are the chances such constructs will appear in the input document?)
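To see the difference between the two versions, here is a small comparison sketch. The \A and \z anchors and the helper names is_entity_strict and is_entity_loose are additions for this illustration; they are not part of the patterns given above.

```perl
use strict;
use warnings;

# Both patterns anchored so the whole string must be a single entity.
my $strict = qr/\A&(?:#(?:[0-9]+|x[0-9a-f]+)|[a-z]+);\z/i;
my $loose  = qr/\A&#?[0-9a-z]+;\z/i;

# Hypothetical helpers returning 1 or 0.
sub is_entity_strict { my ($s) = @_; return $s =~ $strict ? 1 : 0 }
sub is_entity_loose  { my ($s) = @_; return $s =~ $loose  ? 1 : 0 }

for my $candidate ('&amp;', '&#38;', '&#x26;', '&#amp;', '&1a2b3c;') {
    printf "%-10s strict=%d loose=%d\n",
        $candidate, is_entity_strict($candidate), is_entity_loose($candidate);
}
```

The first three candidates are accepted by both patterns; the last two are accepted only by the simplified one.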

        To do what the OP requested, wrap everything after the & in a negative look-ahead bracket like choroba suggested:

        #                  10        20        30        40        50
        #          ---------'---------'---------'---------'---------'----
        my $str = "&amp; ... &#38; ... &#x26; ... &no_entity; ... & ... ;";
        while ($str =~ /&(?!#?[0-9a-z]+;)/gi) {
            print "Found ampersand at position ".pos($str)."\n";
        }

        Output:

        Found ampersand at position 32
        Found ampersand at position 48

        (i.e. it only matches the last two & characters in $str)
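Since the original question was about encoding, the same look-ahead can be dropped into a substitution. This is a sketch under the assumption that only ampersands need handling; encode_bare_ampersands is a hypothetical name, and characters such as < and > would need separate treatment.

```perl
use strict;
use warnings;

# Hypothetical helper: encode every "&" that does not start something
# that looks like an entity (named, decimal, or hexadecimal).
sub encode_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!#?[0-9a-z]+;)/&amp;/gi;
    return $text;
}

print encode_bare_ampersands("Fish & chips &amp; peas &#38; &no_entity;"), "\n";
# prints: Fish &amp; chips &amp; peas &#38; &amp;no_entity;
```

Running the function on its own output leaves the text unchanged, because every ampersand is then followed by something the look-ahead recognises as an entity.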

        The /&(?!.+?;)/ pattern fails on a string like this: "I saw a dog & a cat;".

        A safer solution would be: /&(?!#?[a-zA-Z0-9]+;)/
Re: Perl & regex help
by Kenosis (Priest) on Jan 30, 2013 at 22:27 UTC

    Please enclose the items you want matched/not matched within <code> tags, so we can see what they are.

Node Type: perlquestion [id://1016162]
Approved by Corion