Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode

by sundialsvc4 (Monsignor)
on Mar 12, 2013 at 17:30 UTC ( #1023024=note: print w/ replies, xml ) Need Help??


in reply to How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode

Realistically, how could that be “a bug report?”   Think about it ... every multi-byte character encoding scheme that has ever been invented (or that could be) involves significant-bytes that precede the data that they modify.   If you are reading the file from stern to stem, well, “either you read them or you didn’t.”

It stands to reason, therefore, that you must be the one to have read “a few more bytes than you need,” and, having read those bytes, you have to figure out whether (unlucky you ...) you started reading smack-dab in the middle of a multi-byte (MBCS) sequence or not.   There is no bright-line rule answer for this.   The only reliable strategy that I can think of is to rely upon some contextual knowledge about the data stream itself.   Find some string of (non-MBCS) sequence that you know will occur somewhere within the last n characters of the data.   Then, read some n+x (for some x...) bytes from the tail of the file, then use a regex to search within that data for that reliable sequence.   Advance suspiciously forward from there.

Bear in mind that the onus is upon your application, not merely to come up with the right answers if it can, but to reliably fail if it cannot.   Your application is the only player with the capability to do this.   The fact that the algorithm does “produce answers at all” must, itself, be a positive indication that those answers are in fact worthy to be trusted.


Comment on Re: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode
Re^2: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode
by tobyink (Abbot) on Mar 12, 2013 at 18:43 UTC

    Nonsense. It's easy-peasy. Slurp the whole file into memory; convert to a character string, then then offer filehandle-like accessors to that string.

    Obviously, you want to avoid slurping the whole file into memory, but that's "just" an optimization. Worry about that when you've got the easy-peasy implementation working right.

    As it happens, with UTF-32 you do know whether you're in the middle of a character (as in: codepoint, rather than grapheme), because each character is exactly 32 bits. Just take the byte offset modulo 4. So UTF-32 is an easy case to optimize and avoid slurping the entire file.

    UTF-8 is harder but not much. If the high bit is set on a byte, you're in a multibyte sequence. If the second highest bit is also set, you're at the start of a multibyte sequence, and then you can count how many bits there are until the first zero bit, and that tells you how many bytes are in the sequence.

    So you optimize specific, common cases, and fall back to the slurping technique.

    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
      Thanks for your reply, I have been reading up on the Unicode file types to get a better understanding for myself. Funny enough I was thinking of reading the whole file in just to get the script working and then look at optimisation. I like the idea of the using byte offset's and sounds like the solution I was looking for.
Re^2: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode
by hashperl (Initiate) on Mar 13, 2013 at 09:13 UTC
    Sorry for the rely misprints this is my first post, just getting used to the options.
Re^2: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode
by hashperl (Initiate) on Mar 13, 2013 at 09:18 UTC
    Thanks for your reply sundialsvc4, I think this was the conclusion that I was coming to but it's good to here it from someone else.

      I have found a simple solution for my particular case, as I know that I only want to match certain strings from the log file and none of the match strings will be none ASCII characters.

      I can use the ReadBackwards to read the line in then strip out none ASCII and null characters.

      my $ERRORLINES_IN = File::ReadBackwards->new($logfile) or die "$logfil +e : $!\n"; while (defined($logline = $ERRORLINES_IN->readline)) { $logline =~ s/[^[:ascii:]]//g; $logline =~ s/\0//g;

      It's not a perfect solution but it works for my case.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1023024]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (6)
As of 2014-07-12 14:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (240 votes), past polls