PerlMonks  

utf8 "\xD0" does not map to Unicode at /path/comparebin.pl line line_number, <STDIN> line line_number

by igoryonya (Pilgrim)
on Nov 18, 2014 at 13:55 UTC ( [id://1107573]=perlquestion )

igoryonya has asked for the wisdom of the Perl Monks concerning the following question:

Also, I get:
utf8 "\xD1" does not map to Unicode at /path/comparebin.pl line line_number, <STDIN> line line_number.

I get it with some file names piped from the find program. It happened only recently and only with some file names, for the first time in the few years that I've been using and developing this program.

Seems like some of the file names are corrupt.

When I print out such file names with my program, I get something like:

18.09.2012_-Протокол_вскрытия_конвертов_и_рассмотрения_заявок_на_участие_в_конк\xD1


Ф\xD1%80\xD1%8Dнк \xD0%9F\xD1%8C\xD1%8E\xD1%81елик. \xD0%9D\xD0%9B\xD0%9F. \xD0%9C\xD0%95Т\xD0%90 \xD0%9Cодел\xD1%8C.webm

The same file names, as displayed on the terminal by find before piping to my program, look like this:

18.09.2012_-Протокол_вскрытия_конвертов_и_рассмотрения_заявок_на_участие_в_конк?


Ф?%80?%8Dнк ?%9F?%8C?%8E?%81елик. ?%9D?%9B?%9F. ?%9C?%95Т?%90 ?%9Cодел?%8C.webm

As I said, it's the first time I have encountered such a problem after a few years of daily usage of this program.

Here is a sample piped invocation of the program from the Linux terminal:
find /some/path -type f|comparebin.pl /some/path/ /path/to_folder/with_similar_dir_tree/ -parameters

Update

I've just noticed that the file names get truncated when I tried: find /some/path -type f -exec /path/comparebin.pl {} /path/to_folder/with_similar_dir_tree/ -parameters \;
The path provided by {} is truncated significantly; maybe the same thing happens with stdout|stdin.
It seems there is either a very small limit on how many characters can be piped or passed by {}, or the file names are being truncated because of invalid characters.
I guess I have to resort to using Perl's internal find.
I don't see anything wrong with that approach; I just wanted my program to be flexible, so it could be used either way: with its internal directory traversal or with paths piped from some other program.

Update 2

Thank you to all who took part in solving my problem. To be honest, ever since I started converting my programs to unicode, my understanding of this topic has been pretty vague. Although many things got clarified after solving my problem, there is still a lot to understand about utf8 and unicode in general. When I look at the amount of Perl's unicode documentation, it's pretty daunting to realize that I literally need to read and digest all of it. Until now, I thought that unicode was the answer to all textual problems and that everything should be in utf8, until I stumbled on this particular problem. Now I realize that there are exceptions.

At first, I didn't even have a clue where to start solving my problem. After talking to you, I understood what needed to be done, but not how. That frustrated me, because I felt that unicode should stay behind the curtains, and I didn't want to spoil the fun of programming, which I love, with the daunting unicode "bookkeeping". Also, I keep confusing the encode and decode commands. Then I calmed down, skimmed the unicode, utf8 and encode documentation for the parts I needed, and started trying.

When I added a check for utf8-ness (utf8::is_utf8) on every variable involved in path/file name processing and, if it was utf8, turned the utf8 flag off (Encode::_utf8_off) along the path of the code, the final paths started resolving for existence (-e). I realize that if some part of the path was already corrupt before it became utf8, turning the flag off may still leave a final path that does not resolve for existence (-e). But I don't know how to process certain strings without them being converted to character mode (regex substitution, for example, seems to always return a value with the utf8 flag set), so for now I will leave it as it is, and work on the fix and read more of the utf8 and unicode docs when I encounter such a problem again.


Replies are listed 'Best First'.
Re: utf8 "\xD1" does not map to Unicode at /path/comparebin.pl line line_number, <STDIN> line line_number
by ikegami (Patriarch) on Nov 18, 2014 at 14:50 UTC

    Seems like some of the file names are corrupt.

    That's correct. D1 is valid as the start of a two-byte UTF-8 sequence, but you found it at the end of the file name. Looks like the file name got truncated.
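To illustrate the point about a lead byte with nothing after it, here is a minimal sketch that checks whether a byte string is well-formed UTF-8. The helper name is_valid_utf8 and the sample byte strings are assumptions made for this example; they do not come from the original program.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: return 1 if $bytes is well-formed UTF-8, 0 otherwise.
# $bytes is a copy of the caller's string, so decode() may consume it freely.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

my $complete  = "\xD0\xBA\xD1\x83";  # two complete two-byte sequences
my $truncated = "\xD0\xBA\xD1";      # same bytes, cut off after the lead byte D1

printf "complete:  %s\n", is_valid_utf8($complete)  ? 'valid' : 'malformed';
printf "truncated: %s\n", is_valid_utf8($truncated) ? 'valid' : 'malformed';
```

A check like this only reports the damage; it does not say where the truncation happened.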

Re: utf8 "\xD0" does not map to Unicode at /path/comparebin.pl line line_number, <STDIN> line line_number
by graff (Chancellor) on Nov 19, 2014 at 03:13 UTC
    What do you know about the process(es) creating these file names? That's likely to be the source of the problem.

    Assuming you are using a utf8-based terminal window, the question mark that you see in the terminal at the end of the file name is a symptom of a malformed character in a utf8 string (such as a start byte like \xD0 or \xD1 that is not followed by a valid continuation byte).

    The file system doesn't really care about how (or whether) the byte sequence used for a file name is interpreted via this or that character encoding - there are some characters in the ASCII range that can't be used in a file name (e.g. null or slash on unix/linux), but apart from that, any byte sequence is as good as any other, whether or not it makes sense when using any given character encoding.

    You should be able to rename the affected files - perl is especially handy for doing this: either you can infer the intended character(s), or you can simply replace bad bytes with something valid that yields a unique file name in the given directory. In order to rename the file, you have to treat the existing (bad) file name as a raw byte sequence, not as utf8 characters.
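One way the "replace bad bytes with something valid" option might look is sketched below. The names sanitize_name and rename_malformed are hypothetical, and the choice of "_" as the replacement is an assumption. The sketch relies on Encode's default behaviour of substituting U+FFFD for malformed input, so a name that legitimately contains U+FFFD would be altered as well.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return well-formed UTF-8 bytes in which each piece of
# malformed input in $bytes has been replaced by "_".
sub sanitize_name {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # malformed input becomes U+FFFD
    $chars =~ s/\x{FFFD}/_/g;              # assumption: "_" is an acceptable stand-in
    return encode('UTF-8', $chars);
}

# Hypothetical helper: rename every entry of $dir whose name is not
# well-formed UTF-8. Names from readdir are raw bytes and stay raw bytes.
sub rename_malformed {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = readdir $dh;
    closedir $dh;
    for my $old (@names) {
        next if $old eq '.' or $old eq '..';
        my $new = sanitize_name($old);
        next if $new eq $old;              # already well-formed
        if (-e "$dir/$new") {
            warn "Skipping $dir/$old: target name already exists\n";
            next;
        }
        rename("$dir/$old", "$dir/$new")
            or warn "Cannot rename $dir/$old: $!\n";
    }
    return;
}
```

Calling rename_malformed('/some/path') would process one directory; recursing into subdirectories is left out of the sketch.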

    (You might consider going to ASCII-only characters for file names - e.g. using a suitable Cyrillic-to-Latin transliteration - to avoid the problems that tend to come up with multi-byte characters in file names.)

      (You might consider going to ASCII-only characters for file names - e.g. using a suitable Cyrillic-to-Latin transliteration - to avoid the problems that tend to come up with multi-byte characters in file names.)
      What a... peculiar thing to say. There are no problems with multi-byte chars in file names. There might be problems with things truncating file names, and transliteration is a really bad way to fix that.

      The folders/files were recovered with the testdisk program on Linux from an accidentally deleted NTFS partition.

      When I test a corrupt file name passed to the Perl program with -e, it says that the file doesn't exist. However, if I use Perl's internal directory reading, it lists those files fine without any character problems, and -e confirms that the files listed by Perl exist.

      So, if I understand correctly, when I represent the path string, piped from the find process to my program, as a byte stream, it should test correctly for existence with -e.

      I've been trying to implement a routine that will recover from such corruption and find the file correctly when passed from stdin. I want to keep the ability to pipe the names from the external source.

        So, if I understand correctly, when I represent the path string, piped from the find process to my program, as a byte stream, it should test correctly for existence with -e.

        And it does if you stop trying to transform the input from UTF-8 (which it isn't) to Unicode Code Points.

        I've been trying to implement a routine that will recover from such corruption

        Much easier to remove the erroneous conversion attempt that's corrupting it.
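A minimal sketch of reading the piped paths without any conversion follows. The helper name read_paths and the sample paths are assumptions for illustration; the demo uses an in-memory handle so it can run on its own, where a real program would pass \*STDIN.

```perl
use strict;
use warnings;

# Hypothetical helper: read newline-separated paths from $fh as raw bytes.
sub read_paths {
    my ($fh) = @_;
    binmode($fh);              # no :encoding layer, so the bytes arrive untouched
    my @paths;
    while (my $line = <$fh>) {
        chomp $line;
        push @paths, $line;
    }
    return @paths;
}

# Demo input: one ASCII path and one path ending in a lone lead byte D1.
my $input = "/tmp/ok.txt\n/tmp/\xD0\xBA\xD0\xBE\xD0\xBD\xD0\xBA\xD1\n";
open(my $fh, '<', \$input) or die "Cannot open in-memory handle: $!";

for my $path (read_paths($fh)) {
    # -e receives exactly the bytes that find emitted.
    printf "%-7s %d bytes\n", (-e $path ? 'exists' : 'missing'), length $path;
}
close $fh;
```

Because nothing is decoded, a malformed name cannot trigger the "does not map to Unicode" error, and the bytes handed to -e match what is on disk.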

Re: utf8 "\xD0" does not map to Unicode at /path/comparebin.pl line line_number, <STDIN> line line_number
by Anonymous Monk on Nov 18, 2014 at 14:57 UTC
    Seems like some of the file names are corrupt.
    You can check it with
    find /YOUR/PATH -type f -name '18.09.2012*' | perl -ne 'printf "%vx\n", $_'
    Russian characters are two-byte and all start with d0 or d1, IIRC. Space is 20, - is 2d, newline is a.
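The byte values mentioned above can be checked with a short sketch. The helper name utf8_hex is an assumption for this example; the two letters are given as code point escapes so the script does not depend on its own source encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 bytes of $chars as space-separated hex pairs.
sub utf8_hex {
    my ($chars) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, encode('UTF-8', $chars);
}

print utf8_hex("\x{43A}"), "\n";   # CYRILLIC SMALL LETTER KA -> d0 ba
print utf8_hex("\x{443}"), "\n";   # CYRILLIC SMALL LETTER U  -> d1 83
print utf8_hex(" -\n"), "\n";      # space, hyphen, newline   -> 20 2d 0a
```

A file name that ends in d0 or d1 with no following byte in the 80..bf range has therefore been cut in the middle of a character.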

      Adding -l to perl's command line options will print the file name without the trailing .a.

      Use printf "%v02x\n" if you want to pad each number to two digits.

      find /YOUR/PATH -type f -name '18.09.2012*' | perl -nle 'printf "%v02x\n", $_'

Node Type: perlquestion [id://1107573]
Front-paged by tye