Re^2: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode

by tobyink (Abbot)
on Mar 12, 2013 at 18:43 UTC (#1023035)


in reply to Re: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode
in thread How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode

Nonsense. It's easy-peasy. Slurp the whole file into memory, convert it to a character string, then offer filehandle-like accessors to that string.
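Something along these lines would do for the easy-peasy version. It's only a rough sketch: the name backwards_reader is made up for illustration, and instead of a real tied filehandle it just hands back a closure that returns one line per call, last line first.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: slurp a file, decode it, and return a closure
# that yields its lines in reverse order (undef once exhausted).
# $encoding is any name Encode understands, e.g. 'UTF-32LE'.
sub backwards_reader {
    my ($path, $encoding) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the raw bytes
    close $fh;

    my $text = decode($encoding, $bytes);  # bytes -> character string
    $text =~ s/\A\x{FEFF}//;               # drop a leading BOM, if present

    # Split just after each newline so every line keeps its terminator.
    my @lines = split /(?<=\n)/, $text;

    return sub { return pop @lines };
}
```

Usage would be something like `my $next = backwards_reader($file, 'UTF-32LE'); while (defined(my $line = $next->())) { ... }`.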

Obviously, you want to avoid slurping the whole file into memory, but that's "just" an optimization. Worry about that when you've got the easy-peasy implementation working right.

As it happens, with UTF-32 you do know whether you're in the middle of a character (as in: codepoint, rather than grapheme), because each character is exactly 32 bits. Just take the byte offset modulo 4. So UTF-32 is an easy case to optimize and avoid slurping the entire file.
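As a sketch of the modulo-4 trick (the names utf32_align and utf32le_tail are invented here, and the code assumes a well-formed UTF-32LE file whose size is a multiple of 4), you can seek near the end of the file, round the offset down to a character boundary, and decode only that chunk:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Round a byte offset down to the nearest UTF-32 code point boundary.
sub utf32_align {
    my ($offset) = @_;
    return $offset - ($offset % 4);
}

# Hypothetical helper: decode roughly the last $max_bytes of a UTF-32LE
# file. Aligning downwards may read up to 3 bytes more than requested.
sub utf32le_tail {
    my ($path, $max_bytes) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $size  = -s $fh;
    my $start = utf32_align($size > $max_bytes ? $size - $max_bytes : 0);

    seek $fh, $start, 0 or die "Cannot seek in $path: $!";
    my $bytes = '';
    read $fh, $bytes, $size - $start;
    close $fh;

    return decode('UTF-32LE', $bytes);
}
```

A backwards line reader would call something like this repeatedly, moving the window towards the start of the file and splitting each decoded chunk on newlines.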

UTF-8 is harder, but not by much. If the high bit is set on a byte, you're in a multibyte sequence. If the second-highest bit is also set, you're at the start of that sequence; if it's clear, you're on a continuation byte and need to step back. At the start byte, count the one bits before the first zero bit, and that tells you how many bytes are in the sequence.
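Here's a rough sketch of those bit tests. The function names are invented for illustration, and it assumes the buffer holds raw bytes of valid UTF-8 (valid sequences are never longer than 4 bytes, so a result above 4 means the input is malformed):

```perl
use strict;
use warnings;

# Classify one byte (an integer 0..255) of UTF-8 data.
# Returns 0 for a continuation byte, otherwise the number of bytes in
# the sequence that this byte starts (1 for plain ASCII).
sub utf8_sequence_length {
    my ($byte) = @_;
    return 1 if ($byte & 0x80) == 0;   # 0xxxxxxx: single-byte character
    return 0 if ($byte & 0x40) == 0;   # 10xxxxxx: continuation byte

    my $len = 0;
    while ($byte & 0x80) {             # count the leading one bits
        $len++;
        $byte = ($byte << 1) & 0xFF;
    }
    return $len;
}

# Step backwards from $offset until it points at the first byte of a
# character. $bytes must be a byte string, not a decoded one.
sub utf8_align {
    my ($bytes, $offset) = @_;
    $offset-- while $offset > 0
        && utf8_sequence_length(ord substr($bytes, $offset, 1)) == 0;
    return $offset;
}
```

So after seeking to an arbitrary offset you call utf8_align on the chunk you read, and you never have to back up more than 3 bytes in valid input.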

So you optimize specific, common cases, and fall back to the slurping technique.

package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name


Re^3: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode
by hashperl (Initiate) on Mar 13, 2013 at 09:11 UTC
    Thanks for your reply. I have been reading up on the Unicode file types to get a better understanding for myself. Funnily enough, I was thinking of reading the whole file in just to get the script working and then looking at optimisation. I like the idea of using byte offsets; it sounds like the solution I was looking for.
