PerlMonks  

Re^3: Perl & regex help

by smls (Friar)
on Jan 30, 2013 at 23:53 UTC (#1016184)


in reply to Re^2: Perl & regex help
in thread Perl & regex help

That won't just exclude HTML entities from being matched, it will exclude any & character that is in the same line as a semicolon somewhere to the right of it, because .+? also matches whitespace.
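A minimal sketch of that over-matching, assuming the earlier pattern looked something like `/&(?!.+?;)/` (an assumption: the exact regex from the parent node is not reproduced here). The helper name `amp_offsets` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the start offset of every match of $re in $str.
sub amp_offsets {
    my ($str, $re) = @_;
    my @offsets;
    while ($str =~ /$re/g) {
        push @offsets, $-[0];    # $-[0] holds the start offset of the last match
    }
    return @offsets;
}

# Assumed shape of the earlier pattern: "&" not followed by "anything, then ;".
my $too_broad = qr/&(?!.+?;)/;

# The first "&" is a bare ampersand, but the unrelated ";" after "chips"
# makes the look-ahead succeed, so that "&" is skipped.
my @same_line = amp_offsets("fish & chips; salt & vinegar", $too_broad);

# With no ";" to the right, the same "&" is reported.
my @no_semi = amp_offsets("fish & chips", $too_broad);

print "same line: @same_line\n";       # prints "same line: 19"
print "no semicolon: @no_semi\n";      # prints "no semicolon: 5"
```

Only the second ampersand (offset 19) is found in the first string, even though neither of them starts an entity.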

Instead, you should match HTML/XML entities specifically. They can take three forms, and the corresponding regexes for matching them would be:

  /&#[0-9]+;/ - character referenced by decimal number

  /&#x[0-9a-f]+;/i - character referenced by hexadecimal number

  /&[a-z]+;/i - character referenced by name

Putting it together, you get this regex for matching an HTML entity:
  /&(?:#(?:[0-9]+|x[0-9a-f]+)|[a-z]+);/i

Although that's kinda messy and pedantic, and you can probably get away with using this simplified version:
  /&#?[0-9a-z]+;/i
(Unlike the more pedantic version, it would match some false positives such as &#amp; or &1a2b3c;, but what are the chances such constructs will appear in the input document?)
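To make the difference between the two patterns concrete, here is a small sketch that runs both against a handful of candidates. The helper `is_entity` and the sample list are made up for illustration; the two regexes are the ones given above.

```perl
use strict;
use warnings;

my $strict = qr/&(?:#(?:[0-9]+|x[0-9a-f]+)|[a-z]+);/i;   # the pedantic version
my $simple = qr/&#?[0-9a-z]+;/i;                          # the simplified version

# Hypothetical helper: 1 if the whole string is accepted by the pattern, else 0.
sub is_entity {
    my ($text, $re) = @_;
    return $text =~ /\A$re\z/ ? 1 : 0;
}

# Three well-formed references, the two false positives mentioned above,
# and one named entity whose name contains digits.
for my $candidate ('&amp;', '&#38;', '&#x26;', '&#amp;', '&1a2b3c;', '&frac12;') {
    printf "%-10s strict=%d simple=%d\n",
        $candidate,
        is_entity($candidate, $strict),
        is_entity($candidate, $simple);
}
```

Both patterns accept the first three candidates, and only the simplified one accepts `&#amp;` and `&1a2b3c;`. The last row shows a side effect worth knowing about: a named entity with digits in its name, such as `&frac12;`, is rejected by the `[a-z]+` branch of the pedantic version but accepted by the simplified one. If that matters for your input, widening that branch to `[a-z][a-z0-9]*` should cover it.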

To do what the OP requested, wrap everything after the & in a negative look-ahead bracket like choroba suggested:

  #                  10        20        30        40        50
  #          ---------'---------'---------'---------'---------'----
  my $str = "&amp; ... &#38; ... &#x26; ... &no_entity; ... & ... ;";
  while ($str =~ /&(?!#?[0-9a-z]+;)/gi) {
      print "Found ampersand at position ".pos($str)."\n";
  }

Output:

  Found ampersand at position 32
  Found ampersand at position 48

(i.e. it only matches the last two & characters in $str)
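If the end goal is to escape those remaining ampersands rather than just locate them (an assumption, since the original request is not quoted in this node), the same look-ahead can go straight into a substitution. A minimal sketch, with `escape_bare_ampersands` as a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: turn every "&" that does not already start something
# entity-shaped into "&amp;", leaving existing entities untouched.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!#?[0-9a-z]+;)/&amp;/gi;
    return $text;
}

print escape_bare_ampersands("Tom & Jerry &amp; friends &#38; co; R&D"), "\n";
# prints "Tom &amp; Jerry &amp; friends &#38; co; R&amp;D"
```

Because the look-ahead is zero-width, only the `&` itself is replaced, and `s///g` resumes scanning after each match, so the freshly inserted `&amp;` is not escaped a second time.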

Re^4: Perl & regex help
by Kenosis (Priest) on Jan 31, 2013 at 00:06 UTC

    I think your regex is the best, as it clearly aligns with the ISO specs. Nice work.
