PerlMonks  

Re^2: \b in Unicode regex

by Arik123 (Sexton)
on May 22, 2017 at 07:30 UTC (#1190838=note)


in reply to Re: \b in Unicode regex
in thread \b in Unicode regex

The actual strings are quite a mess. I just wanted to know whether there's some issue with \b in Unicode. If you insist, then $string is something like

8^1589-20170113-102647-ויחי-דברי_הספד_על_הרב_משה_שפירא.mp3^עברית^הרב מנשה גולד^ויחי-דברי הספד על הרב משה שפירא, טו' טבת, תשע'ז^שיעורים בתנ"ך ובפרשת השבוע|שיעורים בפרשת השבוע|שיעורים קודמים|בראשית|ויחי

and $_ is just

שפירא

(It's Hebrew, and I'm afraid your browser might mess up the right-to-left presentation, or even just show the Unicode numbers instead of the characters themselves. My browser makes a mess here. That's why I didn't think posting the strings would help.)

Replies are listed 'Best First'.
Re^3: \b in Unicode regex
by choroba (Archbishop) on May 22, 2017 at 07:41 UTC
    \b works for me, even with Hebrew:
    #! /usr/bin/perl
    use warnings;
    use strict;
    use utf8;
    my $string = 'שָׁלוֹם';
    print $string =~ /\bש/, "\n";
    

    (I had to use <pre> instead of <code> to make UTF-8 work.)
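The `use utf8` line in the script above matters for the result. As a minimal sketch (assuming the source file is saved as UTF-8 and no locale or `unicode_strings` feature is in effect), the same literal and pattern can be compared with and without the pragma. The variable names `$with` and `$without` are made up for this illustration.

```perl
use strict;
use warnings;

my ($with, $without);

{
    use utf8;    # literals in this block are read as character strings
    $with = 'שלום' =~ /\bש/ ? 1 : 0;
}

{
    no utf8;     # literals in this block are raw UTF-8 octets
    # The first octet (0xD7) is not a \w character under these rules,
    # so there is no word boundary at the start of the string.
    $without = 'שלום' =~ /\bש/ ? 1 : 0;
}

print "with use utf8: $with, without: $without\n";
```

Under those assumptions the first match should succeed and the second should fail, which is the same symptom as matching against an undecoded string.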

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^3: \b in Unicode regex
by shmem (Chancellor) on May 22, 2017 at 08:55 UTC

    Given your strings, they match fine with or without \b:

    #!/usr/bin/perl -CS
    use HTML::Entities;
    my $string = decode_entities <DATA>;
    $_ = decode_entities "&#1513;&#1508;&#1497;&#1512;&#1488;";
    print "matches: '$&'\n" if $string =~ /$_/;
    print "matches too: '$&'\n" if $string =~ /\b$_\b/;
    __DATA__
    8^1589-20170113-102647-&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497;_&#1492;&#1505;&#1508;&#1491;_&#1506;&#1500;_&#1492;&#1512;&#1489;_&#1502;&#1513;&#1492;_&#1513;&#1508;&#1497;&#1512;&#1488;.mp3^&#1506;&#1489;&#1512;&#1497;&#1514;^&#1492;&#1512;&#1489; &#1502;&#1504;&#1513;&#1492; &#1490;&#1493;&#1500;&#1491;^&#1493;&#1497;&#1495;&#1497;-&#1491;&#1489;&#1512;&#1497; &#1492;&#1505;&#1508;&#1491; &#1506;&#1500; &#1492;&#1512;&#1489; &#1502;&#1513;&#1492; &#1513;&#1508;&#1497;&#1512;&#1488;, &#1496;&#1493;' &#1496;&#1489;&#1514;, &#1514;&#1513;&#1506;'&#1494;^&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1514;&#1504;"&#1498; &#1493;&#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1489;&#1508;&#1512;&#1513;&#1514; &#1492;&#1513;&#1489;&#1493;&#1506;|&#1513;&#1497;&#1506;&#1493;&#1512;&#1497;&#1501; &#1511;&#1493;&#1491;&#1502;&#1497;&#1501;|&#1489;&#1512;&#1488;&#1513;&#1497;&#1514;|&#1493;&#1497;&#1495;&#1497;
    __END__

    Output:

    matches: 'שפירא'
    matches too: 'שפירא'
    

    So, no issue with \b and Unicode regexes here.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re^3: \b in Unicode regex
by Anonymous Monk on May 23, 2017 at 09:23 UTC

    Thanks a lot, Monks.

    Knowing that there's no issue with \b, I kept investigating. It turned out that one of the strings wasn't really utf8 (for some reason, my terminal insisted on printing it as utf8, though). utf8::decode solved the problem.
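Because a terminal can render undecoded octets and decoded characters identically, printing the string is not a reliable check. One way to tell them apart is to dump the ordinal of each element. This is a rough sketch; `dump_chars` is a hypothetical helper, not something used in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: hex ordinal of every element of the string.
sub dump_chars {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $s;
}

my $bytes = "\xd7\xa9\xd7\xa4";    # UTF-8 octets for the letters shin, pe
my $chars = "\x{5E9}\x{5E4}";      # the same two letters as code points

print dump_chars($bytes), "\n";    # D7 A9 D7 A4
print dump_chars($chars), "\n";    # 5E9 5E4
```

Values that all stay below 100 hex, with pairs starting at D7 for Hebrew text, suggest the string still holds encoded octets.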

      You actually had the opposite problem: You had UTF-8, but the regex engine expects a string of Unicode Code Points[1]. utf8::decode provides the latter from the former.


      1. More specifically, it's \w, \b, \d, etc. that are defined in terms of Unicode Code Points (UCP).
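The effect described above can be reproduced with a small sketch. It assumes the input arrives as raw UTF-8 octets (written here as escapes) and that no locale or `unicode_strings` feature is active. `matches_word` and the sample haystack are invented for illustration.

```perl
use strict;
use warnings;

# UTF-8 octets for the five-letter Hebrew word from the thread.
my $needle   = "\xd7\xa9\xd7\xa4\xd7\x99\xd7\xa8\xd7\x90";
my $haystack = "x $needle y";

# Hypothetical helper: 1 if $n occurs in $h delimited by word boundaries.
sub matches_word {
    my ($h, $n) = @_;
    return $h =~ /\b\Q$n\E\b/ ? 1 : 0;
}

# Octets: none of these bytes count as \w, so \b finds no boundary.
my $before = matches_word($haystack, $needle);

# Turn the octets into Unicode code points, in place.
utf8::decode($haystack);
utf8::decode($needle);

# Characters: Hebrew letters are \w, so both boundaries exist.
my $after = matches_word($haystack, $needle);

print "before decode: $before, after decode: $after\n";
```

With these inputs the match fails before decoding and succeeds afterwards. Decoding at the input boundary (and encoding on output) keeps the rest of the program working with characters only.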
