Byte counts and Seek function

by Dr Manhattan (Beadle)
on Aug 27, 2013 at 21:10 UTC
Hi all

I am parsing a file by sentences and outputting them with byte count at the start of each sentence.

my @byte_count;
push (@byte_count, 0);
my $length = scalar(@sentences);
for (my $x = 0; $x < $length; $x++) {
    my $count;
    $count = length(Encode::encode utf8($sentence[$x]));
    $count += $byte_count[$x];
    $count += 9;
    push (@byte_count, $count);
}
open (Out, ">:utf8", "Sentences and byte count.txt") or die "Can't open";
for (my $x = 0; $x < $length; $x++) {
    printf Out "%08d$sentences[$x]\n", $byte_count[$x];
}

I add the 9 for the 8 digits at the beginning of each line plus the newline at the end of the previous line.

The problem is that when I use the 'seek' function, it works perfectly fine for the first 3 lines and then somehow breaks. After the 3rd line, when I seek, it outputs from somewhere around the middle of the previous entry.

use warnings;
use utf8;
open (FILE, "<:utf8", "Sentences and byte count.txt") or die "Can't open";
seek(FILE, 656, 0);
my $line = <FILE>;
print "$line";

Somehow the byte counting is missing something somewhere. Any ideas?

Thanks in advance for any help, much appreciated

Re: Byte counts and Seek function
by chromatic (Archbishop) on Aug 27, 2013 at 21:59 UTC

    You're in for a world of pain if you try to mix byte counts with UTF-8, because a single character (code point) may be encoded as more than one byte in UTF-8. seek doesn't take variable-width encodings into account. It only counts bytes.
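To illustrate the difference between characters and bytes, here is a minimal sketch using the core Encode module. The sample string is a hypothetical stand-in, not data from the original post; it is written with escapes so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample text: "Café €5" written with escapes.
my $sentence = "Caf\x{e9} \x{20ac}5";

# length() on a character string counts characters (code points).
my $chars = length $sentence;

# length() on the encoded string counts the bytes that end up on disk.
my $bytes = length encode('UTF-8', $sentence);

print "characters: $chars\n";
print "bytes:      $bytes\n";
```

Since seek works in bytes, any offset built from a character count drifts further from the true position with each multi-byte character in the file.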

    (I don't know what your utf8 function does, so I can't comment on what your call to encode does.)

    Seems to me that it would be easier to use tell (not pos, as I first wrote) when you read in a sentence and keep that position around, rather than try to reconstruct it from the data you've read (and decoded, possibly normalized, et cetera).
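One way to apply the tell suggestion is sketched below. It is only an illustration under stated assumptions, not the original poster's program: the sentences, the temporary file, and the @offsets array are hypothetical, and the offsets are recorded while writing. The handles are opened with :raw and the text is encoded and decoded explicitly, so every offset that tell reports is a plain byte position that seek can use directly.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

# Hypothetical sample data; the second sentence contains multi-byte characters.
my @sentences = (
    "First sentence.",
    "Caf\x{e9} costs \x{20ac}5.",
    "Third sentence.",
);

# Temporary file so the sketch is self-contained.
my (undef, $path) = tempfile(UNLINK => 1);

# Write each sentence, asking the filehandle where it is before each write
# instead of computing the position from string lengths.
my @offsets;
open my $out, '>:raw', $path or die "Can't open $path for writing: $!";
for my $sentence (@sentences) {
    push @offsets, tell($out);                    # byte offset of this line
    print {$out} encode('UTF-8', "$sentence\n");  # write raw UTF-8 bytes
}
close $out or die "Can't close $path: $!";

# Later: jump straight to the third sentence using the recorded offset.
open my $in, '<:raw', $path or die "Can't open $path for reading: $!";
seek($in, $offsets[2], SEEK_SET) or die "Can't seek: $!";
my $raw  = <$in>;
my $line = decode('UTF-8', $raw);
chomp $line;
close $in;

print "offsets: @offsets\n";
print "line:    $line\n";
```

Because the offsets come from the filehandle itself, there is no need to add constants for prefixes or line endings by hand.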

      Are you sure you would use pos? I always thought seek should be used with tell.
        Yes, you're right. I was thinking of fgetpos in C for some reason (and even there I'd use ftell, so I don't know what I was thinking at all).

      utf8 (emphases added):

      utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source
      The "use utf8" pragma tells the Perl parser to allow UTF-8 in the
      program text in the current lexical scope ...

        That's the utf8 pragma. I know what it does in the posted code: nothing, because there are no non-ASCII characters appearing literally in the source code.

        What does the utf8 function in the OP's code do?

