
That won't just exclude HTML entities from being matched; it will exclude any & character that has a semicolon anywhere to its right on the same line, because .+? matches any run of characters, including whitespace.
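To see the problem concretely, here is a minimal sketch. It assumes the pattern under discussion looked roughly like /&(?!.+?;)/ (that exact form is a guess, it is not quoted here), and the sample line is made up:

```perl
use strict;
use warnings;

# Assumed form of the pattern being criticised (hypothetical, not quoted
# from the thread).
my $too_broad = qr/&(?!.+?;)/;

# A bare ampersand, followed much later on the same line by a semicolon.
my $line = "Tom & Jerry went home; then slept.";

# The look-ahead runs across the spaces to the ';' after "home", so the
# bare & is wrongly treated as the start of an entity and skipped.
my $found = ($line =~ $too_broad) ? 1 : 0;
print "bare ampersand found: $found\n";
```

The ampersand in "Tom & Jerry" is clearly not an entity, yet the pattern refuses to match it.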

Instead, you should match HTML/XML entities specifically. They can take three forms, and the corresponding regexes for matching them are:

/&#[0-9]+;/ - character referenced by decimal number

/&#x[0-9a-f]+;/i - character referenced by hexadecimal number

/&[a-z][a-z0-9]*;/i - character referenced by name (some names, such as &frac12;, contain digits)

Putting it together, you get this regex for matching an HTML entity:
/&(?:#(?:[0-9]+|x[0-9a-f]+)|[a-z][a-z0-9]*);/i
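As a quick sanity check, the combined pattern can be built from the three pieces and run against a few strings. This is only a sketch: the variable names and sample strings are made up, and the name piece here allows digits after the first letter, since names such as frac12 contain them.

```perl
use strict;
use warnings;

# The three forms, kept as separate pieces for readability.
my $decimal = qr/#[0-9]+/;
my $hex     = qr/#x[0-9a-f]+/i;
my $named   = qr/[a-z][a-z0-9]*/i;   # assumption: digits allowed after the first letter

# An entity is '&', one of the three forms, then ';'.
my $entity = qr/&(?:$decimal|$hex|$named);/;

my @samples = ('&#38;', '&#x26;', '&amp;', '&frac12;',
               '&#amp;', '&1a2b3c;', '&no_entity;');

my %is_entity;
for my $s (@samples) {
    # Anchor at both ends so the whole string must be one entity.
    $is_entity{$s} = ($s =~ /\A$entity\z/) ? 1 : 0;
    printf "%-12s %s\n", $s, $is_entity{$s} ? 'entity' : 'not an entity';
}
```

The first four samples are accepted and the last three are rejected.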

That's kinda messy and pedantic, though, and you can probably get away with using this simplified version:
/&#?[0-9a-z]+;/i
(Unlike the more pedantic version, it would match some false positives such as &#amp; or &1a2b3c;, but what are the chances such constructs will appear in the input document?)
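A small sketch of that trade-off, feeding the simplified pattern the two false positives mentioned above plus one string it should reject (the variable names are made up for illustration):

```perl
use strict;
use warnings;

my $simple = qr/&#?[0-9a-z]+;/i;

# None of these is a valid entity; the simplified pattern still accepts
# the first two, and rejects the third because of the underscore.
my @accepted = grep { /\A$simple\z/ } ('&#amp;', '&1a2b3c;', '&no_entity;');
print "accepted: @accepted\n";
```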

To do what the OP requested, wrap everything after the & in a negative look-ahead, (?!...), as choroba suggested:

#                  10        20        30        40        50
#          ---------'---------'---------'---------'---------'----
my $str = "&#38; ... &#x26; ... &amp; ... &no_entity; ... & ... ;";

while ($str =~ /&(?!#?[0-9a-z]+;)/gi) {
  print "Found ampersand at position ".pos($str)."\n";
}

Output:

Found ampersand at position 32
Found ampersand at position 48

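(i.e. it only matches the last two & characters in $str; pos() returns the offset just past each match, which lines up with the 1-based ruler above the string)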
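If the end goal is to escape those bare ampersands rather than just locate them (an assumption on my part), the same look-ahead drops straight into a substitution. A minimal sketch, reusing the sample string from above:

```perl
use strict;
use warnings;

my $str = "&#38; ... &#x26; ... &amp; ... &no_entity; ... & ... ;";

# Copy $str, then replace only the ampersands that do not start an
# entity-like sequence. With /g the search resumes after each
# replacement, so the inserted "&amp;" is not rescanned.
(my $escaped = $str) =~ s/&(?!#?[0-9a-z]+;)/&amp;/gi;
print "$escaped\n";
```

Note that &no_entity; becomes &amp;no_entity;, because the underscore keeps it from looking like an entity.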

In reply to Re^3: Perl & regex help by smls
in thread Perl & regex help by Anonymous Monk
