PerlMonks  

\b in Unicode regex

by Arik123 (Sexton)
on May 22, 2017 at 06:45 UTC (#1190836)

Arik123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I do

$string =~ /$_/

and it matches. I do

$string =~ /\b$_\b/

and it doesn't match, for the same values of $string and $_. I think it should match, since there's a hyphen or a dot after $_ in $string, which I think should match \b. Both $string and $_ are Unicode. Could it be that \b doesn't function for Unicode strings?

Replies are listed 'Best First'.
Re: \b in Unicode regex
by shmem (Chancellor) on May 22, 2017 at 06:52 UTC

    Why don't you write down what $string and $_ contain, so we don't have to guess? See I know what I mean. Why don't you?

    $s = "hüh-hott";
    $t = "hüh";
    print "matches: '$&'\n" if $s =~ /$t/;
    print "matches too: '$&'\n" if $s =~ /\b$t\b/;
    __END__
    matches: 'hüh'
    matches too: 'hüh'
    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      The actual strings are quite a mess. I just wanted to know whether there's some issue with \b in Unicode. If you insist, then $string is something like

      8^1589-20170113-102647-ויחי-דברי_הספד_על_הרב_משה_שפירא.mp3^עברית^הרב מנשה גולד^ויחי-דברי הספד על הרב משה שפירא, טו' טבת, תשע'ז^שיעורים בתנ"ך ובפרשת השבוע|שיעורים בפרשת השבוע|שיעורים קודמים|בראשית|ויחי

      and $_ is just

      שפירא

      (It's Hebrew, and I'm afraid your browser might mess up the right-to-left presentation, or even just show the Unicode numbers instead of the characters themselves. My browser makes a mess here. That's why I didn't think posting the strings would help.)
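When a browser or terminal cannot be trusted to display a string, one option is to post its code points instead. The following is only a sketch: `dump_chars` is a hypothetical helper name, and the two sample strings are made-up stand-ins (the first two letters of the search term, once as decoded characters and once as raw UTF-8 bytes).

```perl
use strict;
use warnings;

# Hypothetical helper: render each character of a string as U+XXXX.
sub dump_chars {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $s;
}

# Assumed sample data: the same two Hebrew letters in two representations.
my $decoded = "\x{05E9}\x{05E4}";   # two characters
my $raw     = "\xD7\xA9\xD7\xA4";   # four bytes (undecoded UTF-8)

my $decoded_dump = dump_chars($decoded);
my $raw_dump     = dump_chars($raw);

print "decoded: $decoded_dump\n";
print "raw:     $raw_dump\n";
```

A dump showing only values below U+0100 for text that should be Hebrew suggests the string is still raw bytes rather than decoded characters.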

        \b works for me, even with Hebrew:
        #! /usr/bin/perl
        use warnings;
        use strict;
        use utf8;
        my $string = 'שָׁלוֹם';
        print $string =~ /\bש/, "\n";
        

        (I had to use <pre> instead of <code> to make UTF-8 work.)

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

        Given your strings, they match fine with or without \b:

        #!/usr/bin/perl -CS
        use HTML::Entities;
        my $string = decode_entities <DATA>;
        $_ = decode_entities "&#1513;&#1508;&#1497;&#1512;&#1488;";
        print "matches: '$&'\n" if $string =~ /$_/;
        print "matches too: '$&'\n" if $string =~ /\b$_\b/;
        __DATA__
        8^1589-20170113-102647-&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497;_&#1492;&#1505;&#1508;&#1491;_&#1506;&#1500;_&#1492;&#1512;&#1489;_&#1502;&#1513;&#1492;_&#1513;&#1508;&#1497;&#1512;&#1488;.mp3^&#1506;&#1489;&#1512;&#1497;&#1514;^&#1492;&#1512;&#1489; &#1502;&#1504;&#1513;&#1492; &#1490;&#1493;&#1500;&#1491;^&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497; &#1492;&#1505;&#1508;&#1491; &#1506;&#1500; &#1492;&#1512;&#1489; &#1502;&#1513;&#1492; &#1513;&#1508;&#1497;&#1512;&#1488;, &#1496;&#1493;' &#1496;&#1489;&#1514;, &#1514;&#1513;&#1506;'&#1494;^&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1514;&#1504;"&#1498; &#1493;&#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1511;&#1493;&#1491;&#1502;&#1497;&#1501;|&#1489;&#1512;&#1488;&#1513;&#1497;&#1514;|&#1493;&#1497;&#1495;&#1497;
        __END__

        Output:

        matches: 'שפירא'
        matches too: 'שפירא'
        

        So, no issue with \b and unicode regex here.

        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

        Thanks a lot, Monks.

        Knowing that there's no issue with \b, I kept investigating. It turned out that one of the strings wasn't really decoded UTF-8 (for some reason, my terminal insisted on printing it as UTF-8, though). utf8::decode solved the problem.
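One plausible way to reproduce this symptom, sketched under assumptions: both strings are held as raw UTF-8 bytes (the byte values below are hand-encoded sample data, not the original poster's strings), and no `unicode_strings` feature or `/u` modifier is in effect. Under those conditions bytes above 0x7F are not word characters, so `\b` finds no boundary next to them, while a plain byte-for-byte match still succeeds.

```perl
use strict;
use warnings;

# Assumed sample data: raw UTF-8 bytes, as read from a file without decoding.
# The needle is the five-letter search term; the haystack puts it between
# a space and a comma.
my $needle = "\xD7\xA9\xD7\xA4\xD7\x99\xD7\xA8\xD7\x90";
my $string = "\xD7\x9E\xD7\xA9\xD7\x94 $needle, mp3";

# On undecoded bytes: the plain match works, the \b match does not.
my $plain_bytes = $string =~ /$needle/     ? 1 : 0;
my $bound_bytes = $string =~ /\b$needle\b/ ? 1 : 0;

# Decode both strings in place from UTF-8 bytes to characters.
utf8::decode($string);
utf8::decode($needle);

# On decoded characters: Hebrew letters are word characters, so \b works.
my $plain_chars = $string =~ /$needle/     ? 1 : 0;
my $bound_chars = $string =~ /\b$needle\b/ ? 1 : 0;

print "bytes: plain=$plain_bytes bounded=$bound_bytes\n";
print "chars: plain=$plain_chars bounded=$bound_chars\n";
```

Decoding at the input boundary (for example with an `:encoding(UTF-8)` layer on the filehandle) is a common alternative to calling `utf8::decode` by hand.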

Re: \b in Unicode regex
by kennethk (Abbot) on May 22, 2017 at 15:00 UTC
    Do you mean
    $string =~ /$_/;
    $string =~ /\b$_\b/;
    or do you really mean
    $string =~ /\Q$_\E/;
    $string =~ /\b\Q$_\E\b/;
    As soon as your variable contains metacharacters, the two forms are not the same. See quotemeta, Quoting metacharacters.
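A small sketch of the difference, using a made-up needle that contains a metacharacter (the dot). The variable names and sample strings are illustrative only.

```perl
use strict;
use warnings;

# Assumed sample needle: the dot is a regex metacharacter.
my $needle = 'a.c';

# Unquoted, the dot matches any character, so "abc" matches.
my $unquoted = 'abc' =~ /\b$needle\b/ ? 1 : 0;

# Quoted with \Q...\E, the dot is literal, so "abc" no longer matches...
my $quoted = 'abc' =~ /\b\Q$needle\E\b/ ? 1 : 0;

# ...but a string containing the literal text "a.c" still does.
my $literal = 'x a.c y' =~ /\b\Q$needle\E\b/ ? 1 : 0;

print "unquoted=$unquoted quoted=$quoted literal=$literal\n";
```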

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: \b in Unicode regex
by kcott (Bishop) on May 23, 2017 at 06:56 UTC

    G'day Arik123,

    Two pieces of information, from perlrebackslash, to note.

    From the "Character classes" section:

    "\w is a character class that matches any single word character (letters, digits, Unicode marks, and connector punctuation (like the underscore))." [my emphasis]

    From the "Assertions" section:

    "\b ... matches at any place between a word (something matched by \w) and a non-word character" [my emphasis again]

    In your reply with actual data, you're effectively trying to match "XXXXX", which occurs in your string as "_XXXXX.". Both '_' and 'X' match "\w": "\b" does not match between '_' and 'X'.
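To illustrate the point about the underscore, here is a minimal sketch. The sample string is a shortened stand-in for the posted data, written with `\x{...}` escapes, and the lookaround pattern is only one possible workaround (an assumption, not something proposed in this thread) for treating `_` as a separator.

```perl
use strict;
use warnings;

# Assumed sample data: the search term, preceded by "_" and followed by ".".
my $needle = "\x{05E9}\x{05E4}\x{05D9}\x{05E8}\x{05D0}";
my $string = "\x{05D4}_${needle}.mp3";

# "_" and the first Hebrew letter are both \w, so there is no boundary
# between them and the leading \b fails.
my $with_b = $string =~ /\b${needle}\b/ ? 1 : 0;

# Workaround sketch: require that the neighbours are not word characters
# other than "_". [^\W_] means "a word character that is not an underscore".
my $sep = qr/(?<![^\W_])${needle}(?![^\W_])/;
my $with_lookaround = $string =~ $sep ? 1 : 0;

# The same pattern still rejects the term when a letter precedes it directly.
my $embedded = "\x{05D0}${needle}" =~ $sep ? 1 : 0;

print "with_b=$with_b with_lookaround=$with_lookaround embedded=$embedded\n";
```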

    As already demonstrated twice[1,2], there is no Unicode issue here.

    — Ken

      The string I tried to match (that $_) is actually found twice in $string. The first time it's indeed preceded by _, but the second time it's between a space and a comma.

      Thank you all for your time, again.

        I was certain that I checked that before posting my reply; however, I went back and double-checked just now.

        &#1513;&#1508;&#1497;&#1512;&#1488;

        occurs only once, in the substring

        &#1492;_&#1513;&#1508;&#1497;&#1512;&#1488;.mp3

        We can only comment on the data you show us.

        — Ken
