Question for regex experts

evgen-i has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Question for regex experts by AnomalousMonk (Archbishop) on Jun 27, 2013 at 12:50 UTC
One possible approach: `>perl -wMstrict -le "my $bad = 'x &gt;lt;amp; y >amp;lt; z &>< y lt;gt; z'; print qq{'$bad'}; ;; my $tity = qr{ (?: amp \| gt \| lt) ; }xms; ;; (my $fixed = $bad) =~ s{ (?: (?: \G (?<! \A)) \| &) $tity \K (?= $tity) } '&'xmsg; print qq{'$fixed'}; " 'x &gt;lt;amp; y >amp;lt; z &>< y lt;gt; z' 'x &><& y >&< z &>< y lt;gt; z'` [download]	[reply] [d/l]
Re: Question for regex experts by ww (Archbishop) on Jun 27, 2013 at 13:15 UTC
Yet another approach --general, but neither brief nor elegant in exchange for providing (IMO) a clear set of steps with some explanation of the key regexen. #!/usr/bin/perl use 5.016; use strict; use warnings; # 1040996 my @strs = ("foo bar &gt;lt;amp; blivitz", "FOO BAR BLIVITZ, >apos; "sect; ", "no entities here", "But there are ¢pound; entities for 'cent' and 'pound +' here.", ); for my $str(@strs) { if ( $str =~ /(&[^ ]+)/ ) { # match any ampersand followed by on +e or # more NON-spaces (aka \S; see below +) my $found = $1; say "DEBUG: found semicolon(s) at \|$found\| in \"$str\""; if ($str =~ / [^&;]? # anything that's neither "&" nor ";" (&.+) # followed by an ampersand and multiple ch +ars (?!\S) # until prev capture is followed by someth +ing # NOT-a-space ("negative lookahead") /gx ) { # globally, extended notation, end conditi +ons, begin actions my $substr = $1; say "\$substr: $substr\n"; (my $fixed = $substr ) =~ s/(;)([a-z])/$1&$2/g; say "\$fixed: $fixed \n"; } } else { say "\n\t No html entities found.\n"; } } =head C:\>1040996.pl DEBUG: found semicolon(s) at \|&gt;lt;amp;\| in "foo bar &gt;lt; +amp; blivitz" $substr: &gt;lt;amp; blivitz $fixed: &><& blivitz DEBUG: found semicolon(s) at \|>apos;\| in "FOO BAR BLIVITZ, >apos +; "sect; " $substr: >apos; "sect; $fixed: >' "§ No html entities found. DEBUG: found semicolon(s) at \|¢pound;\| in "But there are ¢po +und; entities for 'cent' and 'pound' here." $substr: ¢pound; entities for 'cent' and 'pound' here. $fixed: ¢£ entities for 'cent' and 'pound' here. =cut [download] BUT... this is imperfect (note the captured, trailing "blivitz" in the first example, and the loss of non-entity material at the beginning of each array element. That's easily enough fixed with a little more code. ANOTHER BUT:* This breaks on some edge cases... such as a line where the bad entities immediately precede the EOL. Update: added word "general" in first graf. If you didn't program your executable by toggling in binary, it wasn't really programming!	[reply] [d/l]
Re: Question for regex experts by Sandy (Curate) on Jun 27, 2013 at 12:18 UTC
`perl -e '$x=qw(&apm;gt;lt;amp;);$x=~s/;(.)/;&$1/g;print $x,"\n";'` [download]	[reply] [d/l]
Re^2: Question for regex experts by evgen-i (Novice) on Jun 27, 2013 at 12:29 UTC
Thanks, but this is just a solution for this particular example; I need a general solution for any codes.	[reply]
Re^3: Question for regex experts by sumeetgrover (Monk) on Jun 27, 2013 at 12:37 UTC
You need to implement a lookup table that contains a list of all possible, or otherwise a subset, of HTML character codes. Therefore, each time your program encounters code; instead of &code; , it will need to add an ampersand at the start.	[reply]
Re: Question for regex experts by hdb (Monsignor) on Jun 27, 2013 at 12:53 UTC
I am not sure my proposal will not create havoc elsewhere in your html (ie how robust it is) but have a look: `use strict; use warnings; my $html = "&gt;lt;amp; something else <gt;amp;something else ; +"; $html =~ s\|&(([^ ;]+;)+)\| join '', map { "&$_;" } split /;/, $1 \|ge; print "$html\n";` [download] UPDATE: Probably `\w` is more robust than `[^ ;]`.	[reply] [d/l] [select]
Re: Question for regex experts by Eily (Monsignor) on Jun 27, 2013 at 12:46 UTC
You can replace the entities even if they are already right with `s{ \&? \b (amp\|lt\|gt) ; }{&$1;}xg;`. And you can generate the OR list (in parenthesis) with a map. Edit : Here the &? catches the ampersand if it's there so that the substitution does not add another one. The \b makes sure that you have something that looks like a failed HTML entity. You wouldn't want "Tom bought a lamp; he loves lamps" to become "Tom bought a l& he loves lamps" for example.	[reply] [d/l]
Re: Question for regex experts by evgen-i (Novice) on Jun 27, 2013 at 13:07 UTC
How about this (a suggestion of my colleague)? `$_ = "test;a &lt;>amp;kap;&da;ma; &amp alfa&romeo & mich;"; while ( s/([&][^ ]*);(?!( \|&\|$))/\1AMPSEMICOLON/ ) {}; s/AMPSEMICOLON/;\&/g; # Output: test;a &<>&&kap;&da;&ma; &amp alfa&romeo & mich;` [download] As you see, this is not restricted to a particular list of html codes, and that is what I want!	[reply] [d/l]
Re^2: Question for regex experts by AnomalousMonk (Archbishop) on Jun 27, 2013 at 13:41 UTC
... not restricted to a particular list of html codes ... I'm not sure that's really such a good idea (it should be easy to get a list of all HTML entities you could possibly be interested in), but here's a generalization: `>perl -wMstrict -le "my $bad = 'x &foo;xx;yz;qwe; y &de;fghj;h; z &>< y lt;gt; z'; print qq{'$bad'}; ;; my $tity = qr{ [[:alpha:]]+ ; }xms; ;; (my $fixed = $bad) =~ s{ (?: (?: \G (?<! \A)) \| &) $tity \K (?= $tity) } '&'xmsg; print qq{'$fixed'}; " 'x &foo;xx;yz;qwe; y &de;fghj;h; z &>< y lt;gt; z' 'x &foo;&xx;&yz;&qwe; y &de;&fghj;&h; z &>< y lt;gt; z'` [download]	[reply] [d/l]


There's more than one way to do things
	PerlMonks