PerlMonks  

Byte counts and Seek function

by Dr Manhattan (Beadle)
on Aug 27, 2013 at 21:10 UTC ( [id://1051189]=perlquestion )

Dr Manhattan has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

I am parsing a file by sentences and outputting them with byte count at the start of each sentence.

    my @byte_count;
    push (@byte_count, 0);
    my $length = scalar(@sentences);

    for (my $x = 0; $x < $length; $x++) {
        my $count;
        $count = length(Encode::encode utf8($sentence[$x]));
        $count += $byte_count[$x];
        $count += 9;
        push (@byte_count, $count);
    }

    open (Out, ">:utf8", "Sentences and byte count.txt") or die "Can't open";

    for (my $x = 0; $x < $length; $x++) {
        printf Out "%08d$sentences[$x]\n", $byte_count[$x];
    }

I add 9 to account for the 8 digits at the beginning of each line plus the newline at the end of the previous line.

The problem is that when I use the 'seek' function, it works perfectly fine for the first 3 lines and then somehow breaks. After the 3rd line, when I seek, the output starts somewhere around the middle of the previous entry.

    use warnings;
    use utf8;

    open (FILE, "<:utf8", "Sentences and byte count.txt") or die "Can't open";
    seek(FILE, 656, 0);
    my $line = <FILE>;
    print "$line";

Somehow the byte counting is missing something somewhere. Any ideas?

Thanks in advance for any help, much appreciated

Replies are listed 'Best First'.
Re: Byte counts and Seek function
by chromatic (Archbishop) on Aug 27, 2013 at 21:59 UTC

    You're in for a world of pain if you try to mix byte counts with UTF-8, because a single character may be encoded in UTF-8 as more than one byte. seek doesn't take variable-width encodings into account. It only counts bytes.
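A minimal sketch of the difference between character counts and byte counts, assuming the core Encode module is available. The sample string is made up for illustration and is built from code point escapes so the result does not depend on how the script itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Illustrative string only: "naive cafe" with accents, plus a snowman.
# Written with \x{...} escapes so the source file stays pure ASCII.
my $sentence = "na\x{EF}ve caf\x{E9} \x{2603}";

my $chars = length $sentence;                # number of characters
my $bytes = length encode_utf8($sentence);   # number of bytes once encoded as UTF-8

print "characters: $chars\n";
print "bytes:      $bytes\n";
```

The two accented letters take two bytes each and the snowman takes three, so the byte count comes out larger than the character count. seek works with the byte figure only.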

    (I don't know what your utf8 function does, so I can't comment on what your call to encode does.)

    Seems to me that it would be easier to use tell (originally written as pos, struck out and corrected) when you read in a sentence and keep that position around, rather than try to reconstruct it from the data you've read (and decoded, possibly normalized, et cetera).
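A rough sketch of the tell-based approach, not the OP's actual program: the file name tell_demo.txt and the sample sentences are hypothetical, and the file is handled in raw mode with explicit encode_utf8/decode_utf8 calls so that every offset is unambiguously a byte offset.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $file      = "tell_demo.txt";   # hypothetical file name
my @sentences = ("First sentence.", "Caf\x{E9} au lait.", "Third one.");

# Write the sentences as raw UTF-8 bytes, one per line.
open(my $out, ">:raw", $file) or die "Can't open $file: $!";
print {$out} encode_utf8("$_\n") for @sentences;
close($out) or die "Can't close $file: $!";

# Read the file back, asking tell where each line starts
# instead of reconstructing the offsets from string lengths.
open(my $in, "<:raw", $file) or die "Can't open $file: $!";
my @offsets;
while (1) {
    my $pos = tell($in);               # byte offset of the line about to be read
    defined(my $line = <$in>) or last; # stop at end of file
    push @offsets, $pos;
}

# Jump straight to the third sentence using its recorded offset.
seek($in, $offsets[2], 0) or die "Can't seek: $!";
my $third = decode_utf8(scalar <$in>);
chomp $third;
close($in);
unlink $file;                          # remove the demo file

print "offsets: @offsets\n";
print "third:   $third\n";
```

Because the accented letter in the second sentence occupies two bytes, the third offset is one larger than a character count would suggest. Recording tell sidesteps that arithmetic entirely.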

      Are you sure you would use pos? I always thought seek should be used with tell.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        Yes, you're right. I was thinking of fgetpos in C for some reason (and even there I'd use ftell, so I don't know what I was thinking at all).

      utf8 (emphases added):

      utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source
      code
      ...
      The "use utf8" pragma tells the Perl parser to allow UTF-8 in the
      program text in the current lexical scope ...

        That's the utf8 pragma. I know what it does in the posted code: nothing, because there are no non-ASCII characters appearing literally in the source code.

        What's the utf8 function in the OP's code do?
