Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Question for regex experts

by evgen-i (Novice)
on Jun 27, 2013 at 11:51 UTC ( [id://1040996]=perlquestion: print w/replies, xml ) Need Help??

evgen-i has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I would like to correct sequences of "wrong" html entities which look like this:
>lt;amp;
So basically, it is a sequence of html escape codes, of which only the first one starts with an &. I need to insert an & before each of the following (without spaces) codes. How can I do that? It looks like I need some left context for this (i.e. the condition: the whole word starts with an &), but I don't know how to specify it. So far I can insert the & sign only on the first wrong code (gt). Thank you! Eugene.

Replies are listed 'Best First'.
Re: Question for regex experts
by AnomalousMonk (Archbishop) on Jun 27, 2013 at 12:50 UTC

    One possible approach:

    >perl -wMstrict -le "my $bad = 'x &amp;gt;lt;amp; y &gt;amp;lt; z &amp;&gt;&lt; y lt;gt; z'; print qq{'$bad'}; ;; my $tity = qr{ (?: amp | gt | lt) ; }xms; ;; (my $fixed = $bad) =~ s{ (?: (?: \G (?<! \A)) | &) $tity \K (?= $tity) } '&'xmsg; print qq{'$fixed'}; " 'x &amp;gt;lt;amp; y &gt;amp;lt; z &amp;&gt;&lt; y lt;gt; z' 'x &amp;&gt;&lt;&amp; y &gt;&amp;&lt; z &amp;&gt;&lt; y lt;gt; z'
Re: Question for regex experts
by ww (Archbishop) on Jun 27, 2013 at 13:15 UTC
    Yet another approach --general, but neither brief nor elegant in exchange for providing (IMO) a clear set of steps with some explanation of the key regexen.
    #!/usr/bin/perl use 5.016; use strict; use warnings; # 1040996 my @strs = ("foo bar &amp;gt;lt;amp; blivitz", "FOO BAR BLIVITZ, &gt;apos; &quot;sect; ", "no entities here", "But there are &cent;pound; entities for 'cent' and 'pound +' here.", ); for my $str(@strs) { if ( $str =~ /(&[^ ]+)/ ) { # match any ampersand followed by on +e or # more NON-spaces (aka \S; see below +) my $found = $1; say "DEBUG: found semicolon(s) at |$found| in \"$str\""; if ($str =~ / [^&;]*? # anything that's neither "&" nor ";" (&.+) # followed by an ampersand and multiple ch +ars (?!\S) # until prev capture is followed by someth +ing # NOT-a-space ("negative lookahead") /gx ) { # globally, extended notation, end conditi +ons, begin actions my $substr = $1; say "\$substr: $substr\n"; (my $fixed = $substr ) =~ s/(;)([a-z])/$1&$2/g; say "\$fixed: $fixed \n"; } } else { say "\n\t No html entities found.\n"; } } =head C:\>1040996.pl DEBUG: found semicolon(s) at |&amp;gt;lt;amp;| in "foo bar &amp;gt;lt; +amp; blivitz" $substr: &amp;gt;lt;amp; blivitz $fixed: &amp;&gt;&lt;&amp; blivitz DEBUG: found semicolon(s) at |&gt;apos;| in "FOO BAR BLIVITZ, &gt;apos +; &quot;sect; " $substr: &gt;apos; &quot;sect; $fixed: &gt;&apos; &quot;&sect; No html entities found. DEBUG: found semicolon(s) at |&cent;pound;| in "But there are &cent;po +und; entities for 'cent' and 'pound' here." $substr: &cent;pound; entities for 'cent' and 'pound' here. $fixed: &cent;&pound; entities for 'cent' and 'pound' here. =cut

    BUT... this is imperfect (note the captured, trailing "blivitz" in the first example, and the loss of non-entity material at the beginning of each array element. That's easily enough fixed with a little more code.

    ANOTHER BUT: This breaks on some edge cases... such as a line where the bad entities immediately precede the EOL.

    Update: added word "general" in first graf.


    If you didn't program your executable by toggling in binary, it wasn't really programming!

Re: Question for regex experts
by Sandy (Curate) on Jun 27, 2013 at 12:18 UTC
    perl -e '$x=qw(&apm;gt;lt;amp;);$x=~s/;(.)/;&$1/g;print $x,"\n";'
      Thanks, but this is just a solution for this particular example; I need a general solution for any codes.

        You need to implement a lookup table that contains a list of all possible, or otherwise a subset, of HTML character codes.

        Therefore, each time your program encounters

        code;
        instead of
        &code;
        , it will need to add an ampersand at the start.

Re: Question for regex experts
by hdb (Monsignor) on Jun 27, 2013 at 12:53 UTC

    I am not sure my proposal will not create havoc elsewhere in your html (ie how robust it is) but have a look:

    use strict; use warnings; my $html = "&amp;gt;lt;amp; something else &lt;gt;amp;something else ; +"; $html =~ s|&(([^ ;]+;)+)| join '', map { "&$_;" } split /;/, $1 |ge; print "$html\n";

    UPDATE: Probably \w is more robust than [^ ;].

Re: Question for regex experts
by Eily (Monsignor) on Jun 27, 2013 at 12:46 UTC

    You can replace the entities even if they are already right with s{ \&? \b (amp|lt|gt) ; }{&$1;}xg;. And you can generate the OR list (in parenthesis) with a map. Edit : Here the &? catches the ampersand if it's there so that the substitution does not add another one. The \b makes sure that you have something that looks like a failed HTML entity. You wouldn't want "Tom bought a lamp; he loves lamps" to become "Tom bought a l& he loves lamps" for example.

Re: Question for regex experts
by evgen-i (Novice) on Jun 27, 2013 at 13:07 UTC
    How about this (a suggestion of my colleague)?
    $_ = "test;a &amp;lt;&gt;amp;kap;&da;ma; &amp alfa&romeo & mich;"; while ( s/([&][^ ]*);(?!( |&|$))/\1AMPSEMICOLON/ ) {}; s/AMPSEMICOLON/;\&/g; # Output: test;a &amp;&lt;&gt;&amp;&kap;&da;&ma; &amp alfa&romeo & mich;
    As you see, this is not restricted to a particular list of html codes, and that is what I want!
      ... not restricted to a particular list of html codes ...

      I'm not sure that's really such a good idea (it should be easy to get a list of all HTML entities you could possibly be interested in), but here's a generalization:

      >perl -wMstrict -le "my $bad = 'x &foo;xx;yz;qwe; y &de;fghj;h; z &amp;&gt;&lt; y lt;gt; z'; print qq{'$bad'}; ;; my $tity = qr{ [[:alpha:]]+ ; }xms; ;; (my $fixed = $bad) =~ s{ (?: (?: \G (?<! \A)) | &) $tity \K (?= $tity) } '&'xmsg; print qq{'$fixed'}; " 'x &foo;xx;yz;qwe; y &de;fghj;h; z &amp;&gt;&lt; y lt;gt; z' 'x &foo;&xx;&yz;&qwe; y &de;&fghj;&h; z &amp;&gt;&lt; y lt;gt; z'

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1040996]
Approved by Happy-the-monk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (7)
As of 2024-04-18 13:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found