PerlMonks  

Regex with HTML::Entities

by Horst.Lohnstein (Initiate)
on Nov 23, 2021 at 06:38 UTC ( #11139041=perlquestion )

Horst.Lohnstein has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have a question concerning regexes with HTML::Entities. I am trying to replace patterns containing a keyword with some HTML code. The keyword is enclosed between ✶ characters (a little star). The keyword itself is of the form: Adjektive (Nominalflexion)~84. The whole pattern looks like: {✶Adjektive (Nominalflexion)~87✶}

use HTML::Entities;
# $text is some text from a MySQL database
my $sep = decode_entities('&#10038;');
my @v = ($text =~ /\{$sep(.*?)$sep\}/sg); # all references
# @v contains all patterns (which means that they match)

The matched elements in @v have the form: {✶Adjektive (Nominalflexion)~87✶}. Going through all elements in @v, I try to replace the matched elements in the following way:

$b = "Adjektive (Nominalflexion)~87";
$c = "\{$sep$b$sep\}";
$r = "<div>some Text $b some other text</div>";
$text =~ s/$c/$r/s;

I tried a lot of variants, including quoting and non-capturing groups (?:...), but nothing worked! Is there anyone who can stop me from wasting more time on this? Thanks in advance! Best regards, Horst

Replies are listed 'Best First'.
Re: Regex with HTML::Entities
by Fletch (Chancellor) on Nov 23, 2021 at 07:33 UTC

    Not sure I'm completely following, but it jumps out at me in your second block that the string in $b contains regular expression metacharacters (specifically parens), so that's possibly the problem. Your "Adjektive (Nominalflexion)~87" is going to be treated as looking for the string "Adjektive" followed by a SPACE followed by the string "Nominalflexion" (which will be captured because of the parens) followed by "~87".

    If you use \Q...\E escapes and set it up as $c = "\Q{$sep$b$sep}\E", that should appropriately escape the metacharacters and let them match literally.
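A minimal sketch of the difference that \Q...\E makes, assuming a made-up one-line $text rather than the real database content. The variable names $keyword, $unquoted, and $quoted are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the database text.
my $sep     = "\x{2736}";                       # the star separator
my $keyword = "Adjektive (Nominalflexion)~87";
my $text    = "before {$sep$keyword$sep} after";

# Unquoted: the parens in $keyword form a capture group, so the pattern
# expects "Adjektive Nominalflexion~87" without literal parens and fails.
my $unquoted = ($text =~ /\{$sep$keyword$sep\}/) ? 1 : 0;

# Quoted: \Q...\E escapes the metacharacters, so the parens match literally.
my $quoted = ($text =~ /\{$sep\Q$keyword\E$sep\}/) ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted\n";
```

Here both strings are already character strings, so quoting alone is enough. If the text is still in encoded form, quoting by itself will not make the match succeed.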

    The cake is a lie.

      Hi Fletch, thank you for your advice! I checked it with \Q...\E and also with (?: ...), which omits the capturing of the expressions in the parens, but nothing appears to help. I was wondering whether the tilde ~ caused trouble, but when quoted that should not be the case. Best, Horst

        Don't know what to tell you other than try providing a SSCCE that can actually be run. This below works as I expect it so you're doing something strange or (entirely possible) your problem statement's being misread.

        (Also the <code> formatting is doing weird things but I actually have literal ✶ in my source and the output where it's being replaced with the entity below everywhere save the initialization of $wonky_char. Not sure what's the right way to get literal UTF8 chars in sample code using utf8.)

        #!/usr/bin/env perl
        use 5.034;
        use HTML::Entities qw( decode_entities );
        use utf8;

        my $input = qq{{&#10038;Adjektive (Nominalflexion)~87&#10038;}};
        my $wonky_char = decode_entities( q{&#10038;} );
        binmode( STDOUT, q{:utf8} );
        say qq{\$input: $input};
        say qq{\$wonky_char: $wonky_char};

        my $to_match = "Adjektive (Nominalflexion)~87";
        my $new_string = $input =~ s{\{$wonky_char(\Q$to_match\E)$wonky_char\}}{<div>I found '$1'</div>}r;
        say qq{\$new_string: $new_string};

        my $cleaner_regex_sample = $input =~ s{ \{ $wonky_char (\Q$to_match\E) $wonky_char \} }{<div>Also found '$1'</div>}rx;
        say qq{cleaner: $cleaner_regex_sample};

        exit 0;

        __END__
        $input: {&#10038;Adjektive (Nominalflexion)~87&#10038;}
        $wonky_char: &#10038;
        $new_string: <div>I found 'Adjektive (Nominalflexion)~87'</div>
        cleaner: <div>Also found 'Adjektive (Nominalflexion)~87'</div>


        Please show the output of

        printf "%vX\n", $text;

        I bet your text doesn't actually contain ✶. Did you decode your inputs? You probably have it in its encoded form.
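To illustrate what that printf would reveal, here is a small sketch that compares a decoded star with its UTF-8 encoded form. The encoded string is simulated with Encode::encode, on the assumption that the database returns raw UTF-8 bytes. The variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $decoded = "\x{2736}";                 # one character, U+2736
my $encoded = encode('UTF-8', $decoded);  # the same star as three UTF-8 bytes

# %vX prints the ordinal of each character in hex, joined by dots.
my $decoded_hex = sprintf "%vX", $decoded;
my $encoded_hex = sprintf "%vX", $encoded;
print "decoded: $decoded_hex\n";
print "encoded: $encoded_hex\n";

# A decoded separator does not match text that is still encoded bytes.
my $matches = ($encoded =~ /\Q$decoded\E/) ? 1 : 0;
print "decoded pattern matches encoded text: $matches\n";
```

If the %vX output for $text shows E2.9C.B6 where the star should be, rather than 2736, the text is still encoded and needs decoding before matching.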


        By the way,

        my $sep = decode_entities('&#10038;');

        is a complicated way of writing

        my $sep = "\N{U+2736}";

        or

        my $sep = "\x{2736}";

        or

        use utf8;
        my $sep = "✶";
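A quick sanity check on why these spellings are interchangeable: the decimal entity number 10038 is the same code point as hex 2736. The sketch below leaves out the literal-character variant, which additionally depends on the source file being saved as UTF-8. The @spellings array is just an illustrative name:

```perl
use strict;
use warnings;

# Three ways to write the same single character, U+2736.
my @spellings = (
    chr(10038),      # the number used in the &#10038; entity
    "\N{U+2736}",
    "\x{2736}",
);

# Count the spellings that differ from the first one.
my $differing = grep { $_ ne $spellings[0] } @spellings;
my $length    = length $spellings[0];

print "differing spellings: $differing, length: $length\n";
```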
        
Re: Regex with HTML::Entities
by LanX (Sage) on Nov 23, 2021 at 12:07 UTC
    This works for me:
    use v5.12;
    use warnings;
    use HTML::Entities;
    use Data::Dump qw/pp dd/;
    use utf8;

    my $sep = decode_entities('&#10038;');
    my $pat = "{${sep}Adjektive (Nominalflexion)~87$sep}";
    my $text = join " foo \n", ($pat) x 2;
    pp $text;

    my $b = "Adjektive (Nominalflexion)~87";
    my $c = "\Q{$sep$b$sep}\E";
    my $r = "<div>some Text $b some other text</div>";
    $text =~ s/$c/$r/sg;
    pp $text;

    -*- mode: compilation; default-directory: "d:/tmp/pm/" -*-
    Compilation started at Tue Nov 23 13:06:24

    C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/nominal_flexion.pl
    "{\x{2736}Adjektive (Nominalflexion)~87\x{2736}} foo \n{\x{2736}Adjektive (Nominalflexion)~87\x{2736}}"
    "<div>some Text Adjektive (Nominalflexion)~87 some other text</div> foo \n<div>some Text Adjektive (Nominalflexion)~87 some other text</div>"

    Compilation finished at Tue Nov 23 13:06:24

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: Regex with HTML::Entities
by Horst.Lohnstein (Initiate) on Nov 23, 2021 at 17:38 UTC
    Dear Monks, thanks a lot for your fast and more than helpful comments and proposals. Meanwhile I solved the problem, and -- as ikegami assumed -- the $text-variable was in utf-8 while the $sepb-variable was not. Encoding this variable together with the appropriate quoting suggested by the other monks solved the problem immediately. So, thank you very much for your useful advice! Best regards, Horst

      Encoding this variable together with the appropriate quoting suggested by the other monks solved the problem immediately

      You should have made sure both strings were *decoded*.
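A hedged sketch of the decode-first approach, assuming the database hands back raw UTF-8 bytes (simulated here with Encode::encode). $raw_text, $keyword, and $result are hypothetical names, not taken from the original script:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: this is what arrives from the database, still UTF-8 encoded.
my $raw_text = encode('UTF-8',
    "x {\x{2736}Adjektive (Nominalflexion)~87\x{2736}} y");

my $sep     = "\x{2736}";                       # decoded separator
my $keyword = "Adjektive (Nominalflexion)~87";

# Decode the input once, so text and pattern are both character strings.
my $text = decode('UTF-8', $raw_text);

# Quote the keyword so its parens match literally, then substitute.
(my $result = $text) =~ s/\{$sep\Q$keyword\E$sep\}/<div>$keyword<\/div>/;

print "$result\n";
```

With this approach the output would be encoded again only at the point where it leaves the program, for example through an :encoding(UTF-8) layer on the output handle.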

Node Type: perlquestion [id://11139041]
Front-paged by Corion