PerlMonks  

Malformed UTF-8 character

by BillKSmith (Monsignor)
on Nov 29, 2022 at 20:53 UTC (#11148448)

BillKSmith has asked for the wisdom of the Perl Monks concerning the following question:

I know that this question is slightly off-topic, but it still seems relevant. I am unable to download and run 1nickt's solution Re: Regex: matching any Number then a hyphen to a recent question because of a single non-ASCII character. While displaying the node in Chrome on Windows 7, I click on the 'download' button. The file displays correctly. I right-click, select 'save as', and save it as I.pl. When I run the file, I get the following error:
    C:\Users\Bill\forums\monks>perl I.pl
    1..3
    Malformed UTF-8 character: \x96 (unexpected continuation byte 0x96, with no preceding start byte) at I.pl line 9.
    Malformed UTF-8 character (fatal) at I.pl line 9.
Using Internet Explorer is slightly different, but no better. I have also tried cut-and-paste into the editor gvim. It does not even display correctly there, and I had no luck saving it to a file. What is the recommended way to download and edit files containing UTF-8 characters from PerlMonks on Windows?

Sorry if I have overlooked the tutorial that I need.

Bill

Replies are listed 'Best First'.
Re: Malformed UTF-8 character
by pryrt (Abbot) on Nov 29, 2022 at 21:25 UTC
    When I go to ?abspart=1;part=1;displaytype=displaycode;node_id=11148406 and SaveAs, and open it in Notepad++, Notepad++ sees the encoding as "ANSI" (which on my system is "Windows-1252"); when I run that, it gives me "Malformed UTF-8 character" error, because it's the single byte 0x96 but the use utf8 line has told the interpreter that the file should be interpreted as UTF-8... and UTF-8 doesn't have a single-byte 0x96. If I copy the contents manually from the browser, and instead paste into a new file in Notepad++ (which defaults to UTF-8 for me) and save it and run, it runs just fine. Alternatively, if I comment out use utf8 on the downloaded version, it also works.

    The problem is that the perlmonks website serves the pages as Content-Type: text/plain; charset=ISO-8859-1 (even though, technically, the en dash is at codepoint 0x96 in Windows-1252, but not in ISO-8859-1, where 0x96 is a control character), so any bytes that get saved use that encoding; but saying use utf8 tells perl to interpret bytes in the source code as UTF-8 -- so it tries to interpret the ISO-8859-1 or Windows-1252 bytes as UTF-8, and fails on any byte above 127.
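    The byte-level mismatch described above can be checked directly. This is a minimal sketch using the core Encode module; the variable names are only for illustration. It decodes the lone byte 0x96 as cp1252 and then shows which bytes a UTF-8 file would need to contain instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The single byte that the downloaded file contains where the dash should be.
my $raw = "\x96";

# Interpreted as cp1252 (Windows-1252), that byte is U+2013 EN DASH.
my $char = decode('cp1252', $raw);
printf "code point:  U+%04X\n", ord $char;

# A UTF-8 source file needs three bytes for the same character.
my $utf8 = encode('UTF-8', $char);
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $utf8;
```

    The second line of output should list the bytes E2 80 93, which is what use utf8 expects to find in the source instead of the single 0x96.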

      You have put me on the right track. I have found gvim commands to tell it that input is in CP1252 and output should be in UTF-8. This converts the byte 96 to e28093 (U+2013 EN DASH). The resulting file runs in perl and pastes back into perlmonks correctly. The character still does not display correctly in gvim or the Windows command prompt. The best solution is probably to download Notepad++, but it seems like overkill to learn another editor to solve such a rare problem.
      Bill
        The best solution is probably to download Notepad++, but it seems like overkill to learn another editor to solve such a rare problem.

        As much as it pains me to say it (given my Notepad++ fandom), it does seem like overkill. But iconv.exe comes with my Strawberry perl... and if it does with yours, then it can handle the translation. (Or gnuwin32's iconv). I believe one of the following two would properly translate the CP1252 encoding of the en dash into UTF-8.

        iconv -f ISO-8859-1 -t utf-8 savedfile > outfile.pl
        iconv -f CP1252 -t utf-8 savedfile > outfile.pl

        (Of course, the other fix is to not use utf8; after you download the script; perl will default to your native Windows encoding {if I understand things correctly}, so that should work -- at least, it did for me from that same downloaded source code.)
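        If iconv is not at hand, the same translation can be sketched in Perl itself with I/O encoding layers. The sub name convert_cp1252_to_utf8 and the file names in the usage line are hypothetical, and the sketch assumes the saved file really is cp1252.

```perl
use strict;
use warnings;

# Re-encode a file from cp1252 to UTF-8, line by line.
# Both arguments are file paths; the input file is left untouched.
sub convert_cp1252_to_utf8 {
    my ($in, $out) = @_;

    open my $src, '<:encoding(cp1252)', $in
        or die "Cannot read $in: $!";
    open my $dst, '>:encoding(UTF-8)', $out
        or die "Cannot write $out: $!";

    # Each line is decoded on input and re-encoded on output by the layers.
    print {$dst} $_ while <$src>;

    close $src;
    close $dst or die "Cannot close $out: $!";
    return;
}

# Runs only when exactly two file names are given on the command line.
convert_cp1252_to_utf8(@ARGV) if @ARGV == 2;
```

        Saved as, say, recode.pl, it would be run as perl recode.pl savedfile outfile.pl.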

        If you use utf8 only for strings and not for variable names, you could convert your special characters to \N{...} notation, such as "\N{EN DASH}"

        Of course, it's your decision whether "Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk" is more readable than something like "Björk" or not ;-)

        N.B.: For the \N escape to work in Perl older than 5.16, you need an explicit use charnames;
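        A minimal sketch of the \N{...} approach: the source file stays pure ASCII, so it survives any download encoding and needs no use utf8. The variable names are only for illustration.

```perl
use strict;
use warnings;
use charnames ':full';    # needed before Perl 5.16, harmless afterwards

# Named characters keep the source file 7-bit clean.
my $dash = "\N{EN DASH}";
my $name = "Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk";

printf "dash is U+%04X\n", ord $dash;

# Encode on output so the terminal or file receives UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';
print "$name has ", length($name), " characters\n";
```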
Re: Malformed UTF-8 character
by jo37 (Chaplain) on Nov 29, 2022 at 21:18 UTC

    At least I'm able to reproduce this error. Never got hit by this problem as I usually select & copy code and paste it into a new file.

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re: Malformed UTF-8 character
by ikegami (Patriarch) on Nov 30, 2022 at 14:08 UTC

    That indicates a scalar which became corrupted when Perl or XS code improperly decoded a string.

    For example, use utf8; doesn't validate if the source code is actually valid UTF-8, and produces corrupt scalars if it's not.

    $ not_utf8="$( printf "\x96" )"
    $ perl -e"use utf8; q{$not_utf8}"
    Malformed UTF-8 character: \x96 (unexpected continuation byte 0x96, with no preceding start byte) at -e line 1.
    Malformed UTF-8 character (fatal) at -e line 1.

    (Fortunately, use utf8; catches the problem and bails.)

    Are you using use utf8; with a source file that isn't encoded using UTF-8?

    The likely culprit is a U+2013 EN DASH ("–") encoded using cp1252.


    Using the :utf8 encoding layer can also produce corrupt scalars.

    $ printf "\x96" | perl -nle'
       use open ":std", ":utf8";
       printf "%vX\n", $_;
    '
    Malformed UTF-8 character: \x96 (unexpected continuation byte 0x96, with no preceding start byte) in printf at -e line 1, <> line 1.
    0

    That's why :encoding(UTF-8) should be used instead.
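    To illustrate what validating looks like, here is a minimal sketch that uses the same strict UTF-8 decoder through Encode::decode rather than through the I/O layer. The sub name is_valid_utf8 is hypothetical; the point is only that a strict decode rejects the stray 0x96 up front instead of handing back a corrupt scalar.

```perl
use strict;
use warnings;
use Encode ();

# Returns true if the byte string is well-formed UTF-8, false otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    return eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    } ? 1 : 0;
}

# The cp1252 byte for the en dash is rejected; its UTF-8 form is accepted.
for my $sample ("\x96", "\xE2\x80\x93") {
    my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $sample;
    printf "%-8s %s\n", $hex, is_valid_utf8($sample) ? 'valid' : 'malformed';
}
```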

      I wish that I had recognized that your "likely suspect" was the key to the whole mystery.
      Bill
Re: Malformed UTF-8 character
by BillKSmith (Monsignor) on Dec 05, 2022 at 15:32 UTC
    Thanks for all the great suggestions. I should point out that the procedure in my original post is not my usual way of downloading from perlmonks. Rather, in the spirit of an SSCCE, it was a way I felt that any monk could duplicate the problem. I now know that the key to all of the methods is correctly specifying the encoding (cp1252) of the perlmonks download. The problem is so rare that I do not need a general solution. I can use any of your suggestions when the need arises.

    I do understand that some of your suggestions were 'workarounds' to make this particular program work with a different character. This was my first choice until I had a better understanding of the real problem.

    Bill

Node Type: perlquestion [id://11148448]
Approved by 1nickt
Front-paged by Corion