PerlMonks  

Re: Read and write UTF-8

by Norah (Novice)
on Oct 14, 2016 at 22:44 UTC ( [id://1174033]=note )


in reply to Read and write UTF-8

Here is an example:

#!/usr/bin/perl
use Encode qw/encode decode/;

open (INFILE, "< :encoding(UTF-8)", "utf8.txt") || die "blah blah blah";
open (OUTFILE, "> :encoding(UTF-8)", "oututf8.txt") || die "blah blah";

while (<INFILE>) {
    $line = $_;
    chomp ($line);
    $linestart = substr($line,0,20);
    $outline = "First 20: "."$linestart";
    print OUTFILE "$outline\n";
}
close (INFILE);

Actually this one reads and writes the non-ASCII characters, but when there is a non-ASCII character in the record it doesn't count the correct # of characters.

Replies are listed 'Best First'.
Re^2: Read and write UTF-8
by Corion (Patriarch) on Oct 15, 2016 at 07:57 UTC

    You talk about characters - when using UTF-8, length and substr count characters, not octets, so in the output, you can find more than 21 octets. If you were already talking about characters, not octets, can you please show some short example input that exhibits the problem, preferably together with a hexdump of the relevant portion of the file?
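The difference between characters and octets can be checked directly. The following is a minimal sketch, not code from the thread: the helper name `char_and_octet_counts` and the sample string are made up for illustration. It counts the same text once as decoded characters and once as UTF-8 octets.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return (characters, octets) for a decoded string.
sub char_and_octet_counts {
    my ($chars) = @_;
    my $octets = encode('UTF-8', $chars);    # character string -> UTF-8 byte string
    return (length($chars), length($octets));
}

# Illustrative sample: "GJ349", U+00E1 (a with acute), "Bruce" = 11 characters.
my $text = "GJ349\x{E1}Bruce";
my ($n_chars, $n_octets) = char_and_octet_counts($text);
print "characters: $n_chars, octets: $n_octets\n";    # characters: 11, octets: 12

# substr on the decoded string counts characters, so the first 6 characters
# take 7 octets once written out as UTF-8 (the accented letter needs two).
my $first6 = substr($text, 0, 6);
print "first 6 characters occupy ", length(encode('UTF-8', $first6)), " octets\n";
```

If a program reports 12 for the length of this text after reading it from a file, the data was most likely never decoded.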

      If you want to extract a given number of bytes from a UTF-8 string, use bytes::substr :
      #!/usr/bin/perl
      use warnings;
      use strict;
      use feature qw{ say };
      use open IO => ':encoding(UTF-8)', ':std';

      my $string = join q(), map chr, 9312 .. 9321;
      say $string;
      say substr $string, 0, 7;
      say bytes::substr $string, 0, 7;

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^2: Read and write UTF-8
by hippo (Bishop) on Oct 15, 2016 at 08:52 UTC
    it doesn't count the correct # of characters.

    With no sample data, that's pretty hard to verify. Have you read "length() miscounting UTF8 characters?" There's lots of good info in there (length works analogously to substr) along with easy examples of how to provide code complete with data which demonstrates the issue.

Re^2: Read and write UTF-8
by Norah (Novice) on Oct 14, 2016 at 23:19 UTC
    And to be clear - what it did here - it did the exact same thing when I removed the encoding from the open statements.
Re^2: Read and write UTF-8
by Norah (Novice) on Oct 15, 2016 at 17:30 UTC
    Here is simple input data:
    Year*JEDocSrcP_USERE_DATE P_DATE CurLine
    2011617 GJ448 Bruce12/20/1101/01/11USD1500
    2011617 GJ349áBruce12/20/1101/01/11USD1500
    2011617 GJ350 Bruce12/20/1101/01/11USD1500
    2011617 GJ351 Bruce12/20/1101/01/11USD1500
    The output looks like this:
    First 20: Year*JEDocSrcP_
    First 20: 2011617 GJ448 Bruce1
    First 20: 2011617 GJ349áBruce
    First 20: 2011617 GJ350 Bruce1
    First 20: 2011617 GJ351 Bruce1
    Note that the asterisk * is really the UTF-8 heart symbol ♥. It wasn't displaying correctly here so I just put an asterisk there. I will work on the hex dump for you.
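A hex dump of the start of the file can be produced from Perl itself. This is only a sketch: the helpers `hex_octets` and `head_bytes` are hypothetical names, and the sample bytes (a UTF-8 BOM, "Year", and the heart symbol U+2665) are an assumption based on the description in this thread, not the actual file contents.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: format a byte string as space-separated hex octets.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

# Hypothetical helper: read the first $count raw bytes of a file, undecoded.
sub head_bytes {
    my ($path, $count) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $buf = '';
    defined read($fh, $buf, $count) or die "Cannot read $path: $!";
    close($fh);
    return $buf;
}

# Assumed sample: UTF-8 BOM, then "Year", then U+2665 encoded as UTF-8.
print hex_octets("\xEF\xBB\xBFYear\xE2\x99\xA5"), "\n";
# prints: ef bb bf 59 65 61 72 e2 99 a5

# To dump a real file (utf8.txt is the file name used in this thread):
# print hex_octets(head_bytes('utf8.txt', 32)), "\n";
```

A leading `ef bb bf` in the real dump would confirm a BOM, and a heart symbol stored as UTF-8 would show up as `e2 99 a5`.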

    but I think it is counting octets

      Note that the hexdump of your input data has a Byte Order Mark (BOM) at the front of it, which Perl counts as part of the first line: one character (U+FEFF) when the file is decoded as UTF-8, or three octets when it is not.

      Discounting the BOM, I get the expected output with the following program:

      #!/usr/bin/perl -w
      use strict;
      use Encode qw/encode decode/;

      open (INFILE, "<:encoding(UTF-8)", "utf8.txt") || die "blah blah blah";
      open (OUTFILE, ">:encoding(UTF-8)", "oututf8.txt") || die "blah blah";
      binmode STDOUT, ':encoding(UTF-8)';

      print "Ruler : [12345678901234567890]\n";
      while (my $line = <INFILE>) {
          chomp ($line);
          print "Input : [$line]\n";
          my $linestart = substr($line,0,20);
          my $outline = $linestart;
          print "20 : [$outline]\n";
          print "---\n";
          print OUTFILE "$outline\n";
      }
      close (INFILE);

      To remove the BOM at the start of your file, use maybe simply

      $line =~ s!^\N{BYTE ORDER MARK}!!;
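To see the effect of the BOM on the 20-character cut, here is a small sketch. The helper name `first20` and the sample record are illustrative assumptions (the exact spacing of the real data is not known), and `\x{FEFF}` is used as the code point of the byte order mark.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop a leading U+FEFF from an already decoded line,
# then return its first 20 characters.
sub first20 {
    my ($line) = @_;
    $line =~ s/^\x{FEFF}//;
    return substr($line, 0, 20);
}

binmode STDOUT, ':encoding(UTF-8)';

# Assumed sample record, prefixed with a BOM as the first line of a file would be.
my $record   = "2011617 GJ349\x{E1}Bruce12/20/11";
my $with_bom = "\x{FEFF}" . $record;

# Without stripping, the BOM uses up one of the 20 characters.
my $naive = substr($with_bom, 0, 20);
print "naive   : [", substr($naive, 1), "] plus an invisible BOM\n";

# With stripping, all 20 characters are visible text.
print "stripped: [", first20($with_bom), "]\n";
```

Only the first line of a file can carry a BOM, so the substitution is a no-op on every other line.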
        Thank you for the suggestion - but I get the same results. I think there is some system setting on the laptop that will not read in characters - it just does bytes.
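Whether a handle really decodes its input can be checked without touching any system settings. The sketch below is an assumption-laden illustration: `probe_decoding` is a hypothetical helper, and it reads from an in-memory byte string rather than the real utf8.txt so that it is self-contained. It lists the PerlIO layers on the handle and reports the length of the line it read.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read one line of UTF-8 bytes through a decoding layer
# and return its length in characters followed by the active PerlIO layers.
sub probe_decoding {
    my ($bytes) = @_;
    open(my $fh, '<:encoding(UTF-8)', \$bytes)
        or die "Cannot open in-memory file: $!";
    my @layers = PerlIO::get_layers($fh);
    my $line = <$fh>;
    close($fh);
    chomp $line;
    return (length($line), @layers);
}

# "GJ349", the two octets of U+00E1 in UTF-8, "Bruce": 12 octets, 11 characters.
my ($len, @layers) = probe_decoding("GJ349\xC3\xA1Bruce\n");
print "layers: @layers\n";
print "length: $len\n";
```

A length of 11 means the layer decoded the input; 12 would mean the octets were read undecoded. Running the same check against the real file handle would show whether the `:encoding(UTF-8)` layer is actually in effect there.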
