PerlMonks  

Re: Problem with utf8 after nearly 4096 bytes

by gvieira (Initiate)
on Sep 02, 2013 at 05:41 UTC (#1051888)


in reply to Problem with utf8 after nearly 4096 bytes

I'm not sure if I understood your advice, but I changed this:

undef $/; my $text = <$fh>;

to this:

while(<$fh>){ $text .= <$fh>; }

But I still get the same result. Is that what you meant by reading the file line by line? When I did this kind of read in "pure Perl", I used:

open FL, "<"."file.txt"; binmode(FL, ":utf8");

And it worked fine, but now I'm sending the file from an HTML page (<input type="file"...), so I don't know any other way to get the file handle. Thanks for the help, and sorry about the stupidity; I'm just a beginner at Perl programming.


Re^2: Problem with utf8 after nearly 4096 bytes
by McA (Curate) on Sep 02, 2013 at 07:34 UTC

    Hi,

    it's just a guess on my part what "Anonymous Monk" wanted to avoid with his recommendation. When you read a file block-wise, the last byte in your buffer may be the first byte of a two-byte or longer representation of a character. For example, the German letter Ä has the following UTF-8 representation: 0xc3 0x84

    xxxxxx 0xc3|0x84 xxxxxx
    -----------^ End of buffer

    This can lead to decoding errors. But when you read AND DECODE line by line, you can be pretty sure that all bytes read up to the NL (or whatever your line ending is) can be decoded properly, because the newline byte never occurs inside a multi-byte UTF-8 sequence.
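To make the boundary problem concrete, here is a minimal sketch using the core Encode module. The byte string, the split position, and the helper name try_decode are made up for illustration; they are not taken from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "xÄy" as raw UTF-8 bytes; Ä is the two-byte sequence 0xC3 0x84.
my $bytes = "x\xC3\x84y";

# Hypothetical block boundary that falls between 0xC3 and 0x84.
my $first_block  = substr($bytes, 0, 2);   # "x" followed by 0xC3
my $second_block = substr($bytes, 2);      # 0x84 followed by "y"

# Hypothetical helper: strictly decode a copy of the octets and
# return undef if they are not well-formed UTF-8.
sub try_decode {
    my ($octets) = @_;
    my $text = eval { decode('UTF-8', $octets, Encode::FB_CROAK) };
    return $text;
}

my $broken = try_decode($first_block);                   # truncated sequence
my $whole  = try_decode($first_block . $second_block);   # complete sequence

printf "first block alone: %s\n", defined $broken ? 'decoded' : 'malformed';
printf "blocks joined:     %d characters\n", length($whole);
```

Decoding each block on its own trips over the dangling 0xC3, while the same bytes decode cleanly once the blocks are joined.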

    Putting a decoding layer on your filehandle should also work with the handle you get from an upload, so

    binmode($fh, ":utf8");

    should be valid too.
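A small sketch of the same idea, assuming the upload handle behaves like an ordinary read handle. An in-memory file stands in for the uploaded file here, and the sample bytes are invented. The sketch uses ":encoding(UTF-8)", which validates the input, where ":utf8" only marks the data as UTF-8 without checking it.

```perl
use strict;
use warnings;

# Stand-in for the upload handle: an in-memory file holding raw UTF-8 bytes.
# In the CGI script, $fh would come from the file upload instead (assumption).
my $raw = "K\xC3\xA4se\nBr\xC3\xB6tchen\n";
open(my $fh, '<', \$raw) or die "cannot open in-memory file: $!";

# Add the decoding layer before the first read from the handle.
binmode($fh, ':encoding(UTF-8)');

my @lines = <$fh>;
close($fh);
chomp @lines;

# Lengths are now counted in characters, not bytes.
printf "%d characters\n", length($_) for @lines;
```

The layer has to be in place before any data is read, otherwise the bytes already read stay undecoded.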

    McA

Re^2: Problem with utf8 after nearly 4096 bytes
by Anonymous Monk on Sep 02, 2013 at 08:00 UTC

    I'm not sure if I understood your advice,...

    I made two suggestions that would reduce memory usage. Please try them both out; if your problem is solved, then you know one of the two solved it ...

    Growing a string line by line does not decrease memory usage; that is still slurping the whole file into memory.
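As a rough illustration with a made-up four-line file held in memory: the loop from the question reads one line in the while condition and a second one in the body, so it silently drops every other line. Appending $_ instead keeps every line, but the string still grows to the size of the whole file.

```perl
use strict;
use warnings;

# Invented sample data standing in for the uploaded file.
my $data = "line 1\nline 2\nline 3\nline 4\n";

# The loop as posted: the condition reads a line into $_ and discards it,
# then the body reads the following line, so only lines 2 and 4 survive.
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $posted = '';
while (<$fh>) { $posted .= <$fh>; }
close($fh);

# Appending $_ keeps every line, but $text ends up holding the whole file.
open($fh, '<', \$data) or die "cannot open in-memory file: $!";
my $text = '';
while (<$fh>) { $text .= $_; }
close($fh);

printf "as posted: %d bytes, appending \$_: %d bytes, file: %d bytes\n",
    length($posted), length($text), length($data);
```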

    Another thing to check is your version of CGI.pm (get the latest for bug fixes), and check that no POST_MAX has been set.

    Also, try a different file :)

Re^2: Problem with utf8 after nearly 4096 bytes
by Random_Walk (Parson) on Sep 02, 2013 at 09:59 UTC

    Your new version is still copying the entire file into memory. What you should try is changing your current approach:

    undef $/;
    my $bigfile = <$fh>;
    # Do some processing on $bigfile

    to something like this:

    while (my $line = <$fh>) {
        # Do some processing on $line
    }
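A self-contained sketch of that per-line pattern, again with an in-memory file standing in for the upload and invented sample text. Counting characters is only a placeholder for whatever processing the real script does; the point is that only one line and a couple of running totals are held at any time.

```perl
use strict;
use warnings;

# Invented sample: three lines of raw UTF-8 bytes (Äpfel, Birnen, Käse).
my $raw = "\xC3\x84pfel\nBirnen\nK\xC3\xA4se\n";

# Open with a decoding layer so each line arrives as characters.
open(my $fh, '<:encoding(UTF-8)', \$raw)
    or die "cannot open in-memory file: $!";

my ($line_count, $char_count) = (0, 0);
while (my $line = <$fh>) {
    chomp $line;
    $line_count++;
    $char_count += length($line);   # placeholder for real per-line processing
}
close($fh);

printf "%d lines, %d characters\n", $line_count, $char_count;
```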

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
