PerlMonks  

Problem with utf8 after nearly 4096 bytes

by gvieira (Initiate)
on Sep 02, 2013 at 03:39 UTC
gvieira has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone, this is my first post here. This problem has been blowing my mind for about a month.

I'm making a Perl/CGI application that receives a text file, splits the text into words, and compares them against a database. The texts are in Portuguese and are part of a research project in Linguistics, so proper UTF-8 handling is essential.

I'm using the module utf8::all and it's wonderful... until nearly byte 4096 of the text. I'm highlighting this because I don't think it can be a coincidence that the problem occurs almost exactly at 4 KB. From this point to the end of the file, Perl becomes simply incapable of recognizing any accented letter and starts to do weird things, like splitting a word at the accented letter: a word of the form (A)<accented letter>(B) becomes the two words (A) and (B).

Here is the part of the code that I think is relevant. The whole program has almost 400 lines.

use CGI;
use utf8::all;

my $c = new CGI;
$c->import_names('P');
my $fh = $P::script;
undef $/;
my $text = <$fh>;
close $fh;

Any help would be great. This problem is holding up the whole research project.

Thanks,

Gustavo Vieira.

Re: Problem with utf8 after nearly 4096 bytes
by Anonymous Monk on Sep 02, 2013 at 03:42 UTC

    $c->import_names('P');

    stick with param()

    ... slurp ...

    Don't do that, read the file line by line

      Don't slurp

      Eh?

      I don't recall slurp turning off utf8. Why would it? It should just prevent breaking into lines, no?

Re: Problem with utf8 after nearly 4096 bytes
by gvieira (Initiate) on Sep 02, 2013 at 05:41 UTC

    I'm not sure if I understood your advice, but I changed this:

    undef $/; my $text = <$fh>;

    For this:

    while (<$fh>) { $text .= $_; }

    But I still get the same result. Is that what you meant by reading the file line by line? When I did this kind of reading in "pure Perl" I used:

    open FL, "<", "file.txt";
    binmode(FL, ":utf8");

    And it worked fine, but now I'm sending the file from an HTML page (<input type="file" ...>), so I don't know any other way to get the filehandle. Thanks for the help, and sorry about the stupidity; I'm just a beginner at Perl programming.

      Hi,

      It's just a guess on my part what "Anonymous Monk" wanted to avoid with his recommendation. When you read a file block-wise, it can happen that the last byte in your buffer is the first byte of a two-or-more-byte representation of a character. An example: the German Ä has the following UTF-8 representation: 0xC3 0x84

      xxxxxx 0xc3|0x84 xxxxxx
      -----------^
                  End of buffer

      This can lead to decoding errors. But when you read AND DECODE line-wise, you can be pretty sure that all the bytes read up to the NL (or whatever your line ending is) can be decoded properly.

      Putting a decoding layer on your filehandle should also work with the handle you get from an upload, so

      binmode($fh, ":utf8");

      should be valid too.

      McA
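      To make this concrete, here is a minimal sketch (not the poster's code; a temp file stands in for the uploaded file, and the sample word "ação" is made up) of putting a strict decoding layer on a filehandle before slurping:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write raw UTF-8 bytes to a temp file; this stands in for the upload.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print $out "a\xC3\xA7\xC3\xA3o\n";    # the bytes of "ação"
close $out;

# Read it back through a strict decoding layer; :encoding(UTF-8)
# validates as it decodes, unlike the laxer :utf8 layer.
open my $fh, '<', $path or die "open: $!";
binmode $fh, ':encoding(UTF-8)';
my $text = do { local $/; <$fh> };    # slurp, decoding as we go
close $fh;

print length($text), "\n";            # 5 characters, not 7 bytes
```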

      I'm not sure if I understood your advice,...

      I made two suggestions that would reduce memory usage, please try them both out, and if your problem is solved, then you know one of the two solved it ...

      Growing a string line by line does not decrease memory usage; that is still slurping the whole file into memory.

      Another thing to check is your version of CGI.pm (get the latest for bug fixes), and check that no POST_MAX has been set

      Also, try a different file :)

      Your new version is still copying the entire file into memory. What you should try is changing your current approach:

      undef $/;
      my $bigfile = <$fh>;
      # Do some processing on $bigfile

      to something like this:

      while (my $line = <$fh>) {
          # Do some processing on $line
      }

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
Re: Problem with utf8 after nearly 4096 bytes
by Anonymous Monk on Sep 02, 2013 at 17:15 UTC

    My guess is that you’ve told (Perl ...) that the text consists of utf-8, and that you are (or, it is) transferring the data in about-4096 byte chunks, and that a UTF-8 character sequence is being “cut in half” across that boundary.   Well, it seems to me that if you said that the data was UTF-8, (Perl) is going to presume that every chunk is that way, and that every chunk is complete (which it is not).   Perl therefore might try to “fix” the tail of one chunk, and/or it will be confused by the start of the next one.

    So, I suggest that you tell (Perl) that the data-stream is binary, so that (Perl) won’t attempt to do anything to what it sees, until such time as you know that you have received and have concatenated-together the entire stream of data, chunk by chunk.   This, and only this, ought to be a proper utf-8 stream ... which you can and should verify.

    I put (Perl) in parentheses since several different software agents could make this sort of mistake.
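    As a sketch of this approach (the in-memory handle and sample bytes are stand-ins, not the poster's actual stream): read raw bytes chunk by chunk, concatenate, and decode only once at the end, so a multi-byte character split across a chunk boundary cannot confuse the decoder:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes, as if received from an upload; "é" is two bytes.
my $stream = "caf\xC3\xA9 " x 3;
open my $fh, '<:raw', \$stream or die "open: $!";

# Read in deliberately tiny chunks so that a multi-byte character
# can straddle a chunk boundary.
my $bytes = '';
while (read $fh, my $chunk, 4) {
    $bytes .= $chunk;
}
close $fh;

# Decode the whole accumulated stream once, strictly.
my $text = decode('UTF-8', $bytes, Encode::FB_CROAK);
print length($text), "\n";    # 15 characters ("café " three times)
```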

Re: Problem with utf8 after nearly 4096 bytes
by gvieira (Initiate) on Sep 02, 2013 at 20:52 UTC

    Hi everyone,

    @McA

    I had already wondered whether the problem was at the borders of the buffer. But if it were, only the words at the border of the 4 KB buffer would be affected. Instead, every single accented letter is mangled, even at byte 4k + (4k/2), which is supposed to be in the middle of a buffer.

    Using binmode($fh, ":utf8"); on the filehandle before putting it into the array made the algorithm give no results at all. I'll have to do some tests before I can give a better explanation of what happened.

    @Anonymous Monk

    I've tested both, as Random Walk explained, and got the same result:

    my @list;
    while (my $l = <$fh>) {
        push @list, split(/ /, $l);
    }

    I also checked the version of CGI.pm. It was 3.52; I updated it from CPAN using the command 'r CGI.pm' to get version 3.63. Unfortunately, that didn't solve the problem. I checked POST_MAX too, and it was equal to -1. I think that means unlimited, right?

    I have tried other files too, and they don't work either if they are larger than 4 KB.

    @Random Walk

    Thanks for the explanation. I tried this and it didn't work :/ The code I tried is above.

    @Another Anonymous Monk

    I've thought about this, but I have no idea how I could do it. I know a few ways of reading files in Perl, but since I've been working with CGI I can't find any other way to do this reading. Any tip?

    Thank you all guys,

    Vieira.

Re: Problem with utf8 after nearly 4096 bytes
by gvieira (Initiate) on Sep 07, 2013 at 23:07 UTC

    Hi guys,

    I'm reviving this discussion because I'm much closer to the solution.

    I finally discovered the real problem: Perl only recognizes the first 4096 bytes as UTF-8 because only the first block of text has the BOM. So it thinks the first block is UTF-8 (and that goes fine) but recognizes all the other blocks as plain Unicode. So if I put in a few en dashes ("–") I can force the code to interpret the block as UTF-8. Buuuut, I can't expect the users to put in these en dashes. The program will be used by some old-school Linguistics researchers with almost no computer knowledge.

    I tried to force dashes by concatenating each line of the file with an en dash:

    while (my $l = <$fh>) {
        $text .= "–" . $l;
    }

    But I get this error:

    Wide character in print at (eval 12) line 94.

    Any idea how I could force this (or another) character to make Perl understand that the block is in UTF-8? If there's any other way to do it, that would be awesome too.

    Thanks guys

    Vieira

      Perl only recognizes the first 4096 bytes as utf8 because only the first block of text have the BOM. So he thinks that the first block is utf8 (and it goes ok) but recognizes all the other blocks as Unicode.
      That cannot be. Perl does not use BOMs to determine encodings. If a file is opened with :encoding(utf-8), as utf8::all does, the entire file is assumed to be in that encoding.

      Here's something to try: get rid of any BOMs completely. In your original code, add this after populating $text.

      $text =~ s/\x{FEFF}//g;
      See this old discussion: UTF-8 text files with Byte Order Mark.

      Also, make sure utf-8 encoding is correctly specified in your HTML header.
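      As a small sketch of the BOM-stripping suggestion (the sample string is made up), the substitution also removes stray U+FEFF marks that appear mid-text, not just at the start:

```perl
use strict;
use warnings;

# Decoded text containing stray BOM (U+FEFF) characters.
my $text = "\x{FEFF}primeira linha\nsegunda \x{FEFF}linha\n";

# Remove every BOM, wherever it occurs.
$text =~ s/\x{FEFF}//g;

print $text =~ /\x{FEFF}/ ? "BOM remains\n" : "clean\n";
```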

        You have a point. Even without a BOM, the program still recognizes that the first block is in UTF-8. How could it not think the same of the rest of the file?

        Is there any way I can print these en dashes into the file? Maybe using their hex value?

Node Type: perlquestion [id://1051878]
Approved by NetWallah