PerlMonks  

Re: \b in Unicode regex

by shmem (Chancellor)
on May 22, 2017 at 06:52 UTC


in reply to \b in Unicode regex

Why don't you write down what $string and $_ contain, so we don't have to guess? See "I know what I mean. Why don't you?"

$s = "hüh-hott";
$t = "hüh";
print "matches: '$&'\n" if $s =~ /$t/;
print "matches too: '$&'\n" if $s =~ /\b$t\b/;
__END__
matches: 'hüh'
matches too: 'hüh'
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

Re^2: \b in Unicode regex
by Arik123 (Beadle) on May 22, 2017 at 07:30 UTC

    The actual strings are quite a mess. I just wanted to know whether there's some issue with \b in Unicode. If you insist, then $string is something like

    8^1589-20170113-102647-ויחי-דברי_הספד_על_הרב_משה_שפירא.mp3^עברית^הרב מנשה גולד^ויחי-דברי הספד על הרב משה שפירא, טו' טבת, תשע'ז^שיעורים בתנ"ך ובפרשת השבוע|שיעורים בפרשת השבוע|שיעורים קודמים|בראשית|ויחי

    and $_ is just

    שפירא

    (It's Hebrew, and I'm afraid your browser might mess up the right-to-left presentation, or even just show the Unicode numbers instead of the characters themselves. My browser makes a mess here. That's why I didn't think posting the strings would help.)

      \b works for me, even with Hebrew:
      #! /usr/bin/perl
      use warnings;
      use strict;
      use utf8;
      my $string = 'שָׁלוֹם';
      print $string =~ /\bש/, "\n";
      

      (I had to use <pre> instead of <code> to make UTF-8 work.)

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Given your strings, they match fine with or without \b:

      #!/usr/bin/perl -CS
      use HTML::Entities;
      my $string = decode_entities <DATA>;
      $_ = decode_entities "&#1513;&#1508;&#1497;&#1512;&#1488;";
      print "matches: '$&'\n" if $string =~ /$_/;
      print "matches too: '$&'\n" if $string =~ /\b$_\b/;
      __DATA__
      8^1589-20170113-102647-&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497;_&#1492;&#1505;&#1508;&#1491;_&#1506;&#1500;_&#1492;&#1512;&#1489;_&#1502;&#1513;&#1492;_&#1513;&#1508;&#1497;&#1512;&#1488;.mp3^&#1506;&#1489;&#1512;&#1497;&#1514;^&#1492;&#1512;&#1489; &#1502;&#1504;&#1513;&#1492; &#1490;&#1493;&#1500;&#1491;^&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497; &#1492;&#1505;&#1508;&#1491; &#1506;&#1500; &#1492;&#1512;&#1489; &#1502;&#1513;&#1492; &#1513;&#1508;&#1497;&#1512;&#1488;, &#1496;&#1493;' &#1496;&#1489;&#1514;, &#1514;&#1513;&#1506;'&#1494;^&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1514;&#1504;"&#1498; &#1493;&#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1511;&#1493;&#1491;&#1502;&#1497;&#1501;|&#1489;&#1512;&#1488;&#1513;&#1497;&#1514;|&#1493;&#1497;&#1495;&#1497;
      __END__

      Output:

      matches: 'שפירא'
      matches too: 'שפירא'
      

      So, no issue with \b in a Unicode regex here.

      Thanks a lot, Monks.

      Knowing that there's no issue with \b, I kept investigating. It turned out that one of the strings wasn't really utf8 (for some reason, my terminal insisted on printing it as utf8, though). utf8::decode solved the problem.
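
A terminal will happily render raw UTF-8 bytes as text, so looking at the output does not tell you whether Perl holds bytes or characters. One way to check is to list the ordinal of every character Perl sees in the string. The following is only a minimal sketch: `dump_chars` is a hypothetical helper name, and the sample is just the UTF-8 bytes for the first two letters of the search word.

```perl
use strict;
use warnings;

# Hypothetical helper: return the ordinal of each character Perl sees,
# as space-separated four-digit hex numbers.
sub dump_chars {
    my ($str) = @_;
    return join ' ', map { sprintf '%04X', ord } split //, $str;
}

# UTF-8 encoded bytes for U+05E9 U+05E4, written as escapes so the
# script itself stays pure ASCII.
my $word = "\xD7\xA9\xD7\xA4";

my $before = dump_chars($word);   # four "characters", one per byte
utf8::decode($word);              # decodes in place
my $after  = dump_chars($word);   # two characters, one per code point

print "before decode: $before\n"; # before decode: 00D7 00A9 00D7 00A4
print "after decode:  $after\n";  # after decode:  05E9 05E4
```

If the dump shows values like 00D7 where you expected something in the 05D0 to 05EA range, the string is still undecoded.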

        You actually had the opposite problem: You had UTF-8, but the regex engine expects a string of Unicode Code Points[1]. utf8::decode provides the latter from the former.


        1. More specifically, it's \w, \b, \d, etc. that are defined in terms of Unicode Code Points.
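
To illustrate the point about bytes versus code points, here is a small sketch. It assumes both the search word and the text arrive as raw UTF-8 bytes, which may not be exactly what happened in the original script. `$hay` is a shortened, hypothetical stand-in for the real string. The results are printed as 0/1 so that no output layer is needed.

```perl
use strict;
use warnings;

# UTF-8 encoded bytes, written as escapes so the script stays pure ASCII.
# $needle is the search word from the thread (U+05E9 U+05E4 U+05D9 U+05E8 U+05D0).
# $hay is a hypothetical, shortened haystack: another word, a space,
# the needle, then a comma.
my $needle = "\xD7\xA9\xD7\xA4\xD7\x99\xD7\xA8\xD7\x90";
my $hay    = "\xD7\x9E\xD7\xA9\xD7\x94 $needle, mp3";

# Before decoding, the regex engine sees one "character" per byte.
# None of these high bytes counts as \w, so there is no word boundary
# between the space and the first byte, and \b cannot match there.
my $raw_plain   = $hay =~ /$needle/     ? 1 : 0;
my $raw_bounded = $hay =~ /\b$needle\b/ ? 1 : 0;

# utf8::decode converts the bytes to code points in place and returns
# false if the input was not valid UTF-8.
utf8::decode($hay)    or die "haystack is not valid UTF-8";
utf8::decode($needle) or die "needle is not valid UTF-8";

# Now the Hebrew letters are single characters that match \w, so \b
# finds a boundary on both sides of the word.
my $dec_plain   = $hay =~ /$needle/     ? 1 : 0;
my $dec_bounded = $hay =~ /\b$needle\b/ ? 1 : 0;

print "bytes:      plain=$raw_plain bounded=$raw_bounded\n";
print "characters: plain=$dec_plain bounded=$dec_bounded\n";
```

The plain match succeeds either way, because identical byte sequences compare equal. Only the \b version changes, which fits the symptom described in this thread.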
