PerlMonks  

Matching/replacing a unicode character only works after decode()

by FloydATC (Chaplain)
on Jul 25, 2014 at 10:10 UTC (#1095020=perlquestion)
FloydATC has asked for the wisdom of the Perl Monks concerning the following question:

After hours of struggling with a seemingly simple problem involving utf8 I finally made it work -- but I have no clue why. Allow me to explain.

I have a script which scrapes a particular web site for data about communication links. The data is utf-8 encoded and contains (among a great many other things) the unicode character 'GREEK SMALL LETTER MU' (U+03BC) which (after being scraped, put through several scripts, stored in MySQL, later extracted and presented on a web page) renders as "Î¼". My co-workers didn't really mind but after a couple of years it started to annoy me so much I reached the point where I just had to fix it. Today was that day.

I wanted to replace all occurrences of this character with either unicode character 'MICRO SIGN' (U+00B5) which renders as expected, or (even better) simply with the HTML entity &micro;.

The method in question produces clickable links to present each commlink in many different contexts.

    sub commlink {
        my $self = shift;
        return "" unless $self->{'id'};
        my $label = $self->{'label'};
        $label =~ s/\x{00b5}/\&micro;/g;
        $label =~ s/\x{03bc}/\&mu;/g; # Looks almost exactly the same as &micro;
        return "<A href=\"commlink.html?id=".$self->{'id'}."\" class=\"".$self->{'state'}."\">".$label."</A>";
    }

I knew the data stored in MySQL was utf8, the string was untouched and the web page charset was specified as utf8. If I tried to change it, the Norwegian characters on the same page would become garbled so I knew the encoding setting was sent and detected properly.

So... utf8 in, no encoding/decoding or string mangling prior to the regex... and still the regex didn't match.

The solution?

    sub commlink {
        my $self = shift;
        return "" unless $self->{'id'};
        my $label = $self->{'label'};
        $label = decode('utf8', $label); # Why? It's already utf8 and I need it to stay utf8
        $label =~ s/\x{00b5}/\&micro;/g;
        $label =~ s/\x{03bc}/\&mu;/g; # Looks almost exactly the same as &micro;
        return "<A href=\"commlink.html?id=".$self->{'id'}."\" class=\"".$self->{'state'}."\">".$label."</A>";
    }

My question is... Why?! Before decoding the utf8 string, how could the string go from input to output unchanged but fail to match the regex? Why do I need to decode the utf8 string to match a utf8 character when the string already prints as a utf8 character? This is so confusing...

UPDATE:

OK, thanks for the pointers. It sounds so very very simple in theory, but in practice... This system is made up of more than 50 different scripts and modules that shuffle data back and forth, present it via HTML and SVG, and generate javascript, text messages, emails and what have you. As soon as I started trying to fix things to "do it right", absolutely everything broke. I'm going to need weeks to get on top of this.

This is exactly why I have always hated Unicode. Why, oh why could I not have left this stupid bug alone.

-- FloydATC

Time flies when you don't know what you're doing

Re: Matching/replacing a unicode character only works after decode()
by Anonymous Monk on Jul 25, 2014 at 10:19 UTC
Re: Matching/replacing a unicode character only works after decode()
by hippo (Curate) on Jul 25, 2014 at 10:24 UTC

    The correct order of operations for working with encoded data (whether utf8 or any other encoding) is:

    1. Input
    2. Decode
    3. Operate
    4. Encode
    5. Output

    If you don't decode your input you'll be comparing apples and elephants which is why your regex fails to match. However, if you do no operations on the data at all, then you can skip the middle three steps because your perl script in that case is just essentially a pipe between your input (eg. database) and your output (eg. web page).

    This is all explained in much better detail in A UTF8 round trip with MySQL. HTH.
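The five steps above can be sketched as follows. This is only a minimal illustration, not the original poster's code: the helper name `label_to_html` and the sample input are hypothetical, and it assumes the data arrives as UTF-8 encoded bytes and must leave as UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 encoded bytes, returns UTF-8 encoded bytes.
sub label_to_html {
    my ($bytes) = @_;

    # 2. Decode: UTF-8 bytes -> Perl character string
    my $label = decode('UTF-8', $bytes);

    # 3. Operate: the regexes now compare code points with code points
    $label =~ s/\x{00b5}/&micro;/g;    # MICRO SIGN
    $label =~ s/\x{03bc}/&mu;/g;       # GREEK SMALL LETTER MU

    # 4. Encode: character string -> UTF-8 bytes, ready for output
    return encode('UTF-8', $label);
}

# 1. Input: sample bytes as they might come back from the database
#    ("10 ", then U+03BC encoded as 0xCE 0xBC, then "s")
my $input = "10 \xCE\xBCs";

# 5. Output
print label_to_html($input), "\n";     # prints "10 &mu;s"
```

Because the helper encodes again before returning, other non-ASCII characters in the label (the Norwegian ones, for instance) pass through as the same UTF-8 bytes they came in as.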

Re: Matching/replacing a unicode character only works after decode()
by Anonymous Monk on Jul 25, 2014 at 12:24 UTC
    Why?! Before decoding the utf8 string, how could the string go from input to output unchanged but fail to match the regex?
    Basically, that's because Perl by default thinks that a binary string is in Latin-1, rather than UTF-8. And that's a problem - every string in any encoding (UTF-8 or anything else) is valid Latin-1.

    Character \xb5 is one byte in Latin-1, but two bytes in UTF-8. And \x{3bc} is just too big for a one-byte encoding.

    Why do I need to decode the utf8 string to match an utf8 character
    If you have some string in UTF-8, and want to apply regexes to it, or get its length in characters, etc... you always have to do that. Because backwards compatibility. Perl is old. Other languages (Python, Ruby) broke compatibility to get better Unicode. Perl didn't.
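A small sketch of the difference between the undecoded bytes and the decoded string, for both length and regex matching. The helper `codepoints` is a hypothetical name used only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list the code points Perl sees in a string.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

my $bytes = "\xCE\xBC";                  # U+03BC encoded as UTF-8: two bytes
my $chars = decode('UTF-8', $bytes);     # the same data as one character

print "undecoded: ", length($bytes), " char(s): ", codepoints($bytes), "\n";
# undecoded: 2 char(s): U+00CE U+00BC
print "decoded:   ", length($chars), " char(s): ", codepoints($chars), "\n";
# decoded:   1 char(s): U+03BC

print "undecoded matches \\x{03bc}: ", ($bytes =~ /\x{03bc}/ ? "yes" : "no"), "\n";   # no
print "decoded matches \\x{03bc}:   ", ($chars =~ /\x{03bc}/ ? "yes" : "no"), "\n";   # yes
```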

      It would only be a backwards compatibility issue if you accept that UTF-8 is the ONLY encoding used in computing. It's not even a common default yet. Ruby's Unicode support was terrible and is only made passable by installing some specific gems and even then it's not as good as Perl. Here's an overview from tchrist: http://dheeb.files.wordpress.com/2011/07/gbu.pdf.

      Christiansen also once published a Yes/No style table of all the languages and Perl was by far the best among Java/Python/Ruby/PHP. I'm sorry I could not find this table again to link.

        It would only be a backwards compatibility issue if you accept that UTF-8 is the ONLY encoding used in computing.
        In the year 2014 UTF-8 is a more useful default than Latin-1, I'd say. BUT, the real problem is implicit upgrading from / downgrading to Latin-1. This is very similar to what Perl does with numbers / numeric-looking strings. The difference is not all strings look like numbers, but absolutely any binary string looks like Latin-1 (and some Unicode strings can be downgraded to Latin-1 without warnings).

        Consider this:

        perl -MDevel::Peek -wE 'my $r = qr/\x{03bc}/; Dump $r'
        ...
        FLAGS = (OBJECT,FAKE,UTF8)
        PV = 0x10eff20 "(?^u:\\x{03bc})" [UTF8 "(?^u:\\x{03bc})"]
        Now, what happens when UTF-8 regex meets a binary string? My guess is that the string gets upgraded to (Perl's internal) UTF-8... FROM (what Perl thinks is) Latin-1, like it happens in other situations. Which is a wrong thing to do.
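The guess above can be illustrated by upgrading the string by hand with `utf8::upgrade`. This is only a sketch: it shows what treating the undecoded bytes as Latin-1 produces, not what the regex engine does internally, and the helper `matches_mu` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: does the string contain GREEK SMALL LETTER MU?
sub matches_mu {
    my ($str) = @_;
    return $str =~ /\x{03bc}/ ? 1 : 0;
}

my $bytes = "\xCE\xBC";        # UTF-8 encoding of U+03BC, never decoded

my $upgraded = $bytes;
utf8::upgrade($upgraded);      # switch to Perl's internal UTF-8, reading the bytes as Latin-1

# The upgraded string is still the two characters U+00CE and U+00BC.
print "length after upgrade: ", length($upgraded), "\n";                           # 2
print "equal to original:    ", ($upgraded eq $bytes ? "yes" : "no"), "\n";        # yes
print "matches U+03BC:       ", (matches_mu($upgraded) ? "yes" : "no"), "\n";      # no
print "matches U+00CE U+00BC: ", ($upgraded =~ /\x{ce}\x{bc}/ ? "yes" : "no"), "\n"; # yes
```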
        Ruby's Unicode support was terrible
        It's still terribad. But at least, Ruby defaults to UTF-8 in its source, for example.

        There is a big difference between excellent Unicode support (which Perl has, of course) and convenient Unicode support. You know, something that is not a pain in the ass. For example: what can go wrong with

        open my $file, '<', '/bogus_file' or die "Can't open: $!\n";

        ?

      No, it's because the regex engine expects Unicode code points (the result of decode), not UTF-8 (what MySQL returned).

      It has nothing to do with backwards compatibility, or with being an old language. e.g. Java's regex library similarly expects chars, not bytes.

Re: Matching/replacing a unicode character only works after decode()
by graff (Chancellor) on Jul 26, 2014 at 00:20 UTC
    I pasted that two-character sequence from the OP into a file...
    cat > /tmp/junk.txt
    Î¼
    ^D
    With the line-feed added at the end, the file was 5 bytes -- just what I would expect for those two characters, given that they are encoded in utf8 (two bytes per character).

    If I use any given character-encoding conversion process on that file, and convert from utf8 to iso-8859-1 (or cp1252, which is roughly equivalent), I get the expected 3-byte result:

    0xCE 0xBC 0x0A # now 1-byte/char, the line-feed is unchanged
    If you look up those two byte values in a Latin-1 table, you'll see that they are the "Latin capital letter I with circumflex" and the "vulgar fraction one quarter"; but when that two-byte sequence is interpreted as a utf8 character, it turns out to be U+03BC, "Greek small letter mu".
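The same byte arithmetic can be reproduced in Perl with the Encode module. This is a sketch under the assumption that the two characters seen on the page are U+00CE and U+00BC; the helper `hex_bytes` is a hypothetical name, and the line-feed from the file is left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show a byte string as space-separated hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '0x%02X', $_ } unpack 'C*', $bytes;
}

# The two characters as seen on the page: U+00CE and U+00BC
my $seen = "\x{CE}\x{BC}";

my $as_utf8   = encode('UTF-8', $seen);        # two bytes per character
my $as_latin1 = encode('ISO-8859-1', $seen);   # one byte per character

print "utf8:   ", hex_bytes($as_utf8), "\n";     # utf8:   0xC3 0x8E 0xC2 0xBC
print "latin1: ", hex_bytes($as_latin1), "\n";   # latin1: 0xCE 0xBC

# Reading the Latin-1 bytes as UTF-8 gives back a single character.
my $mu = decode('UTF-8', $as_latin1);
printf "decoded: U+%04X\n", ord $mu;             # decoded: U+03BC
```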
