Yet another approach: general, but neither brief nor elegant. In exchange, it provides (IMO) a clear set of steps, with some explanation of the key regexen.
#!/usr/bin/perl
use 5.016;
use strict;
use warnings;

# 1040996

my @strs = (
    "foo bar &gt;lt;amp; blivitz",
    "FOO BAR BLIVITZ, &gt;apos; &quot;sect; ",
    "no entities here",
    "But there are &cent;pound; entities for 'cent' and 'pound' here.",
);

for my $str (@strs) {
    if ( $str =~ /(&[^ ]+)/ ) {    # match any ampersand followed by one or
                                   # more NON-spaces (aka \S; see below)
        my $found = $1;
        say "DEBUG: found semicolon(s) at |$found| in \"$str\"";
        if ( $str =~ /
                [^&;]*?    # anything that's neither "&" nor ";"
                (&.+)      # followed by an ampersand and multiple chars
                (?!\S)     # until prev capture is NOT followed by a
                           # non-space ("negative lookahead"), i.e. it is
                           # followed by whitespace or the end of the string
            /gx ) {        # globally, extended notation; end conditions, begin actions
            my $substr = $1;
            say "\$substr: $substr\n";
            ( my $fixed = $substr ) =~ s/(;)([a-z])/$1&$2/g;
            say "\$fixed: $fixed \n";
        }
    } else {
        say "\n\t No html entities found.\n";
    }
}

=pod

C:\>1040996.pl
DEBUG: found semicolon(s) at |&gt;lt;amp;| in "foo bar &gt;lt;amp; blivitz"
$substr: &gt;lt;amp; blivitz
$fixed: &gt;&lt;&amp; blivitz
DEBUG: found semicolon(s) at |&gt;apos;| in "FOO BAR BLIVITZ, &gt;apos; &quot;sect; "
$substr: &gt;apos; &quot;sect;
$fixed: &gt;&apos; &quot;&sect;
No html entities found.
DEBUG: found semicolon(s) at |&cent;pound;| in "But there are &cent;pound; entities for 'cent' and 'pound' here."
$substr: &cent;pound; entities for 'cent' and 'pound' here.
$fixed: &cent;&pound; entities for 'cent' and 'pound' here.

=cut
BUT... this is imperfect (note the captured, trailing "blivitz" in the first example, and the loss of non-entity material at the beginning of each array element). That's easily enough fixed with a little more code.
ANOTHER BUT: This breaks on some edge cases... such as a line where the bad entities immediately precede the EOL.
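For what it's worth, here is a minimal sketch of one way the "little more code" might look. It is not the script above: it swaps in an in-place substitution, so the text before and after the entity run is left alone and nothing depends on what follows the run. The helper name fix_entities is made up for illustration, and it assumes entity names are purely alphabetic (so numeric entities such as "&#169;" are not handled).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the script above): repair runs such as
# "&gt;lt;amp;" in place, leaving the surrounding text untouched.
# Assumption: entity names are purely alphabetic (no "&#169;"-style entities).
sub fix_entities {
    my ($str) = @_;
    $str =~ s{
        ( & [A-Za-z]+ ;         # a well-formed entity that starts the run
          (?: [A-Za-z]+ ; )+ )  # one or more names missing their ampersand
    }{
        my $run = $1;
        $run =~ s/;\K(?=[A-Za-z])/&/g;   # insert "&" after each inner ";"
        $run;
    }gex;
    return $str;
}

print fix_entities($_), "\n" for (
    "foo bar &gt;lt;amp; blivitz",
    "no entities here",
    "bad entities at the end: &cent;pound;",
);
```

Because the outer pattern only matches a run that begins with a real "&name;", ordinary text containing semicolons is not touched, and a run sitting right at the end of the line is treated the same as one in the middle.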
Update: added word "general" in first graf.
If you didn't program your executable by toggling in binary, it wasn't really programming!