PerlMonks  

Re^2: Substitute some Unicodes with their escapes

by jjmoka (Beadle)
on Jun 09, 2020 at 11:31 UTC


in reply to Re: Substitute some Unicodes with their escapes
in thread Substitute some Unicodes with their escapes

Yes, $_table{'Ö'} is just an example. There is no use utf8; in the module, and neither identifiers nor Unicode strings are built into it. The hash is populated by this function:
    my %_table = ();
    sub load_map {
        open( UTF8, "<:encoding(utf8)", 'tab.bin' ) || die "can't open tab.bin : $!";
        while( <UTF8> ) {
            chomp;
            my $offset = index($_,' ');
            my $bin = substr($_,$offset+1);
            my $esc = substr($_,0,$offset);
            $_table{$bin} = $esc;
        }
        close( UTF8 );
    }
A Dump of a (short) received string is this:
    SV = PVMG(0x12fca50) at 0x134f698
      REFCNT = 4
      FLAGS = (PADMY,POK,pPOK,UTF8)
      IV = 0
      NV = 0
      PV = 0x1361200 " <entry>R\303\226CHLING ENGINEERING PLASTICS (UK) LIMITED</entry>"\0 [UTF8 " <entry>R\x{d6}CHLING ENGINEERING PLASTICS (UK) LIMITED</entry>"]
      CUR = 85
      LEN = 88
Going to think about that SSCCE. Thanks

UPDATE: here the SSCCE:

    use strict;
    use utf8; # used only for this SSCCE to set scalar $SGML at line 44
    use Devel::Peek qw (Dump);
    use Encode qw(encode_utf8);

    binmode(STDOUT, ":utf8");

    my %_table = ();
    # -----------------------------------
    sub load_map {
        while( <DATA> ) {
            chomp;
            my $offset = index($_,' ');
            my $bin = substr($_,$offset+1);
            my $esc = substr($_,0,$offset);
            $_table{$bin} = $esc;
        }
    }
    # -----------------------------------
    sub _mapchar {
        my ($char) = @_;
        if ( $char !~ /[\r\n\s]/) {
            my $nbytes = length encode_utf8($char);
            if ($nbytes > 1) {
                $char = exists $_table{$char} ? $_table{$char} : '?';
            }
        }
        return $char;
    }
    # -----------------------------------
    sub escapeUTF8 {
        my ( $sgml_r) = @_;
        Dump $$sgml_r;
        $$sgml_r =~ s/(.)/_mapchar($1)/eg;
    }

    load_map();
    my $SGML='RÖCHLING';
    print "1: $SGML\n";
    escapeUTF8(\$SGML);
    print "2: $SGML\n";
    __DATA__
    &dollar; $
    &Ouml; Ö
    &raquo; »
    ~

it works, but still the regex is on every char

Replies are listed 'Best First'.
Re^3: Substitute some Unicodes with their escapes
by choroba (Cardinal) on Jun 09, 2020 at 13:53 UTC
    As noted before, you don't need to encode each character. Also, you can build a regex that matches all the keys of the hash; then you don't need to call any subroutine from /e. This way, you only replace the characters you know how to replace, so it will work even for the $, which is ASCII 36.
    my $chars = join "", keys %_table;
    sub escapeUTF8 # Now needs a better name!
    {
        my ( $sgml_r) = @_;
        $$sgml_r =~ s/([\Q$chars\E])/$_table{$1}/g;
    }
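
    For anyone who wants to try this in isolation, here is a minimal, self-contained sketch of the same idea. The %table contents, the escape_known name and the sample string are made up for illustration; in the thread the hash is filled by load_map() instead. Note that the character class has to be built after the table is loaded.

```perl
use strict;
use warnings;
use utf8;    # the literals in this file are written in UTF-8

# Hypothetical stand-in for the table that load_map() fills in the thread.
my %table = (
    '$' => '&dollar;',
    'Ö' => '&Ouml;',
    '»' => '&raquo;',
);

# Build the character class once, after the table is loaded.
# quotemeta protects characters such as '$', ']' or '-' inside the class.
my $class = join '', map { quotemeta } sort keys %table;
my $re    = qr/([$class])/;

# Replace only the characters that have an entry in the table.
sub escape_known {
    my ($text_r) = @_;
    $$text_r =~ s/$re/$table{$1}/g;
    return;
}

my $sgml = 'RÖCHLING » $5 ü';
escape_known(\$sgml);

binmode STDOUT, ':encoding(UTF-8)';
print "$sgml\n";
```

    The ü has no entry in the table, so it is left untouched rather than turned into a ?, which is the behavioural difference from the original _mapchar.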

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      That's a good spot. Thanks
Re^3: Substitute some Unicodes with their escapes
by haukex (Archbishop) on Jun 09, 2020 at 23:49 UTC
    UPDATE: here the SSCCE

    Thank you very much, that's very helpful!

    use utf8; # used only for this SSCCE to set scalar $SGML at line 44

    Just to nitpick this comment: the pragma is also necessary so that the DATA section is read as UTF-8 as well.

    it works, but still the regex is on every char

    Yes, that's true. There are a couple of different approaches on how to solve this - you could use the modules that Corion suggested (but that would replace the entire functionality of the code you inherited; you'd have to be sure that there isn't any tricky legacy behavior that you need to preserve), you could build a regex dynamically to match only those characters that have an entry in the hash (but in the root node you said "A builtin ? is returned for the Unicodes missing in that hash."), or my approach to answering this question so far has been to preserve as much of the original behavior as makes sense while still modernizing a bit.

    To that end, the regex that I suggested seems to work fine on this small bit of sample data. Also, note that in this case, the whole if length encode_utf8($char) > 1 logic isn't needed, because in UTF-8, the bytes 0x00-0x7F map 1:1 to ASCII and are always single bytes, while any characters >= 0x80 are guaranteed to be multibyte.
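
    A quick way to convince yourself of that equivalence is to compare both tests on a few code points. This is only a sketch; the sample characters are arbitrary choices and not taken from the real table.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Sample code points: '$' and DEL (ASCII), then U+0080, Ö, » and the euro sign.
my @samples = map { chr } 0x24, 0x7F, 0x80, 0xD6, 0xBB, 0x20AC;

my $mismatches = 0;
for my $char (@samples) {
    # Test 1: does the character need more than one byte in UTF-8?
    my $multibyte = length( encode_utf8($char) ) > 1 ? 1 : 0;
    # Test 2: is the character outside the ASCII range?
    my $non_ascii = $char =~ /[^\x00-\x7F]/ ? 1 : 0;

    $mismatches++ if $multibyte != $non_ascii;
    printf "U+%04X multibyte=%d non_ascii=%d\n", ord $char, $multibyte, $non_ascii;
}
print "mismatches: $mismatches\n";
```

    Both tests agree for every sample, so matching on [^\x00-\x7F] makes the per-character encode_utf8 call redundant.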

    if ( $char !~ /[\r\n\s]/ )

    Note you have to be careful with this one: under Unicode matching rules, \s will match Unicode whitespace characters as well, so for example if you were to have a table entry for &nbsp; (U+00A0, NO-BREAK SPACE), because of this regex it wouldn't be applied! You probably want the /a modifier, and the regex could be simplified to just \s. However, because [^\x00-\x7F] only matches non-ASCII characters, the $char !~ /\s/a test will always be true, and so it can be omitted as well. In fact, in the below code I've inlined the entire sub _mapchar.
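
    The code from the original reply is not reproduced on this page. As a stand-in, here is a minimal sketch (not the reply's actual code) that shows the \s pitfall and what inlining _mapchar might look like, assuming the [^\x00-\x7F] regex discussed above and a made-up two-entry table. The /a modifier needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $nbsp = "\x{A0}";    # NO-BREAK SPACE

# Upgrade to Perl's internal UTF-8 form, as a string read through an
# :encoding(utf8) layer would be, so that Unicode rules apply to \s.
utf8::upgrade($nbsp);

my $unicode_ws = $nbsp =~ /\s/  ? 1 : 0;    # Unicode rules: NBSP is whitespace
my $ascii_ws   = $nbsp =~ /\s/a ? 1 : 0;    # /a: \s is limited to ASCII whitespace
print "Unicode \\s matches NBSP: $unicode_ws, with /a: $ascii_ws\n";

# Hypothetical table; in the thread it is loaded from a file.
my %table = ( "\x{A0}" => '&nbsp;', "\x{D6}" => '&Ouml;' );

# Inlined mapping: only non-ASCII characters are visited, known ones are
# replaced by their escape and unknown ones by '?'.
my $text = "R\x{D6}CHLING\x{A0}UK \x{20AC}";
$text =~ s/([^\x00-\x7F])/ exists $table{$1} ? $table{$1} : '?' /eg;
print "$text\n";
```

    ASCII characters, including ordinary spaces and newlines, never reach the replacement expression, so no whitespace test is needed at all.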

    By the way, in the root node you said you're using the bytes pragma, note that its documentation says "Use of this module for anything other than debugging purposes is strongly discouraged."
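
    The root node isn't shown here, so how bytes was used there is a guess on my part. If it was only needed to count bytes, a sketch of doing the same with Encode instead could look like this:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $str = "R\x{D6}CHLING";

# length() counts characters; encoding first gives the UTF-8 byte count.
my $chars  = length $str;
my $octets = length encode( 'UTF-8', $str );

print "characters: $chars, UTF-8 bytes: $octets\n";
```

    This keeps the character/byte distinction explicit instead of switching the semantics of built-ins for a whole lexical scope.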

      Wow! That is a great cup of Perl. Pure beauty. Really, thank you very much. I've learnt so much today from this place and from ALL the people who helped me. I'm never sure about posting questions; I try to stick to browsing, reading, studying and trying things myself. But, man, I'm amazed at how much I learnt from this open sharing of knowledge! Nice one, guys. Thanks again.
