Re^2: Read and write UTF-8

by Norah (Novice)
on Oct 15, 2016 at 17:30 UTC


in reply to Re: Read and write UTF-8
in thread Read and write UTF-8

Here is simple input data:
Year*JEDocSrcP_USERE_DATE P_DATE CurLine
2011617 GJ448 Bruce12/20/1101/01/11USD1500
2011617 GJ349áBruce12/20/1101/01/11USD1500
2011617 GJ350 Bruce12/20/1101/01/11USD1500
2011617 GJ351 Bruce12/20/1101/01/11USD1500
The output looks like this:
First 20: Year*JEDocSrcP_
First 20: 2011617 GJ448 Bruce1
First 20: 2011617 GJ349áBruce
First 20: 2011617 GJ350 Bruce1
First 20: 2011617 GJ351 Bruce1
Note that the asterisk * is really the UTF-8 heart symbol ♥. It wasn't displaying correctly here so I just put an asterisk there. I will work on the hex dump for you.

but I think it is counting octets

Re^3: Read and write UTF-8
by Corion (Patriarch) on Oct 15, 2016 at 18:15 UTC

    Note that the hexdump of your input data has a Byte Order Mark (BOM) at the front of it, which Perl counts as a character of its own (U+FEFF once the file is decoded as UTF-8), so it takes up one of the 20 positions on the first line.

    Discounting the BOM, I get the expected output with the following program:

    #!/usr/bin/perl -w
    use strict;
    use Encode qw/encode decode/;
    open (INFILE, "<:encoding(UTF-8)", "utf8.txt") || die "blah blah blah";
    open (OUTFILE, ">:encoding(UTF-8)", "oututf8.txt") || die "blah blah";
    binmode STDOUT, ':encoding(UTF-8)';
    print "Ruler : [12345678901234567890]\n";
    while (my $line = <INFILE>) {
        chomp ($line);
        print "Input : [$line]\n";
        my $linestart = substr($line,0,20);
        my $outline = $linestart;
        print "20    : [$outline]\n";
        print "---\n";
        print OUTFILE "$outline\n";
    }
    close (INFILE);

    To remove the BOM at the start of your file, you could simply use

    $line =~ s!^\N{BYTE ORDER MARK}!!;
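
    As a rough illustration of why the BOM matters for counting, here is a minimal sketch that works on an in-memory string rather than on the real file. The sample octets and the helper name strip_bom are made up for this example; \x{FEFF} is the code point of the BYTE ORDER MARK used in the substitution above.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Remove a single leading BOM (U+FEFF) from an already decoded string.
    # strip_bom is a hypothetical helper name, not something from the thread.
    sub strip_bom {
        my ($text) = @_;
        $text =~ s/^\x{FEFF}//;
        return $text;
    }

    # Made-up sample: UTF-8 BOM octets, then "Year", a heart (U+2665), then "JE".
    my $octets = "\xEF\xBB\xBF" . "Year\xE2\x99\xA5JE";

    # Decoding turns the three BOM octets into the single character U+FEFF.
    my $decoded = decode('UTF-8', $octets);
    my $clean   = strip_bom($decoded);

    printf "octets=%d chars_with_bom=%d chars_without_bom=%d\n",
        length($octets), length($decoded), length($clean);
    ```

    If the BOM is left in place, substr($line,0,20) on the first line returns the BOM plus only 19 visible characters.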
      Thank you for the suggestion - but I get the same results. I think there is some system setting on the laptop that will not read in characters - it just does bytes.

        Maybe now is a good time to show the exact code you are running and the exact input (again, as hexdump) you are giving it, and also to describe what method you are using to inspect the output.
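
        If no hexdump tool is at hand, something along these lines could produce one from Perl itself. This is only a sketch: the helper names hex_octets and head_bytes are invented here, and utf8.txt is the file name used in the program above. The file is opened with the :raw layer so that the octets are shown exactly as stored, BOM included.

        ```perl
        use strict;
        use warnings;

        # Format a string of octets as space-separated two-digit hex values.
        sub hex_octets {
            my ($octets) = @_;
            return join ' ', map { sprintf '%02x', ord } split //, $octets;
        }

        # Read up to $count octets from the start of a file, with no decoding layer.
        sub head_bytes {
            my ($path, $count) = @_;
            open my $fh, '<:raw', $path or die "Cannot open $path: $!";
            my $got = read $fh, my $buf, $count;
            defined $got or die "Cannot read $path: $!";
            close $fh;
            return $buf;
        }

        # Dump the first 32 octets of the file named on the command line,
        # falling back to utf8.txt; do nothing if the file does not exist.
        my $file = @ARGV ? $ARGV[0] : 'utf8.txt';
        print hex_octets(head_bytes($file, 32)), "\n" if -e $file;
        ```

        A file that starts with a UTF-8 BOM would begin with ef bb bf in this dump.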

        For me, on Perl 5.20, on Windows 7, with the Latin-1 codepage, I get the following output from the program I posted with the input file, which shows some more "characters" on output, but that is expected because my terminal is not set to UTF-8:

        Ruler : [12345678901234567890]
        Input : [YearÔÖÑJEDocSrcP_USE_DATE P_DATE CurLine]
        20    : [YearÔÖÑJEDocSrcP_USE_D]

        On Perl 5.20, on Windows 7, with the UTF-8 codepage (via chcp 65001), I get the following output from the program I posted with the input file, which has 20 characters (not bytes) on output, as I expect:

        Ruler : [12345678901234567890]
        Input : [Year♥JEDocSrcP_USE_DATE P_DATE CurLine]
        20    : [Year♥JEDocSrcP_USE_D]
        ---
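
        To make the difference between characters and octets concrete, here is a small sketch using the header line from the output above, with the heart written as U+2665. The helper name first20 is invented for the example. It assumes that the "counting octets" symptom comes from calling substr on a string that was never decoded, which is a guess and not something the thread confirms.

        ```perl
        use strict;
        use warnings;
        use Encode qw(encode decode);

        # Header line from the thread as a character string.
        my $chars  = "Year\x{2665}JEDocSrcP_USE_DATE P_DATE CurLine";
        # The same line as UTF-8 octets, as it would arrive without a decoding layer.
        my $octets = encode('UTF-8', $chars);

        # first20 is a hypothetical helper: the first 20 units of whatever it is given.
        sub first20 { return substr $_[0], 0, 20 }

        my $char_prefix = first20($chars);                     # 20 characters
        my $byte_prefix = decode('UTF-8', first20($octets));   # 20 octets, decoded afterwards

        binmode STDOUT, ':encoding(UTF-8)';
        printf "decoded first: %d characters [%s]\n", length($char_prefix), $char_prefix;
        printf "octets first : %d characters [%s]\n", length($byte_prefix), $byte_prefix;
        ```

        The heart occupies three octets in UTF-8, so cutting at 20 octets yields two characters fewer than cutting at 20 characters.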

        The script I'm running is:

        #!/usr/bin/perl -w
        use strict;
        use Encode qw/encode decode/;
        open (INFILE, "<:encoding(UTF-8)", "utf8.txt") || die "blah blah blah";
        open (OUTFILE, ">:encoding(UTF-8)", "oututf8.txt") || die "blah blah";
        binmode STDOUT, ':encoding(UTF-8)';
        print "Ruler : [12345678901234567890]\n";
        while (my $line = <INFILE>) {
            chomp ($line);
            $line =~ s!^\N{BYTE ORDER MARK}!!;
            print "Input : [$line]\n";
            my $linestart = substr($line,0,20);
            my $outline = $linestart;
            print "20    : [$outline]\n";
            print "---\n";
            print OUTFILE "$outline\n";
        }
        close (INFILE);
