
How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode

by hashperl (Initiate)
on Mar 12, 2013 at 14:30 UTC (#1022985=perlquestion)
hashperl has asked for the wisdom of the Perl Monks concerning the following question:

I have just started working on Strawberry Perl, so that I can write some Perl scripts on Windows.

The first problem I have run into is reading "Unicode text, UTF-32, little-endian" files rather than standard "ASCII text".

To add to the problem I am using the "File::ReadBackwards" module as I need to read an application log file and I only want to read the newest errors on a large file.

    use File::ReadBackwards;

    my $ERRORLINES_IN = File::ReadBackwards->new($logfile)
        or die "$logfile : $!\n";

    while (defined($logline = $ERRORLINES_IN->readline)) {
        ...    # handle $logline here
    }

Going by my snippet of code, would someone be able to point me in the right direction for determining the file type and then opening the file with the File::ReadBackwards module?


Replies are listed 'Best First'.
Re: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode
by tobyink (Abbot) on Mar 12, 2013 at 15:02 UTC

    Sadly File::ReadBackwards doesn't provide the ability to specify encodings or PerlIO layers. It just reads the file as bytes. (File a bug report!)

    That doesn't mean your task is impossible - you just need to manually encode/decode in a few places.

        use 5.008;
        use strict;
        use warnings;
        use Encode qw( encode decode );
        use File::Temp qw( tempfile );
        use File::ReadBackwards qw();

        # Pick a random filename
        (undef, my $filename) = tempfile();

        # Create some content for an example
        {
            open my $fh, ">:encoding(UTF-32LE)", $filename
                or die "can haz file?? $!";
            print $fh "$_\n" for qw/ foo bar baz /;
            close $fh;
        }

        # Now let's open it. Note that we need to tell File::ReadBackwards
        # that the line separator is the UTF-32-encoded version of "\n".
        my $fh = "File::ReadBackwards"->new(
            $filename,
            encode("UTF-32LE", "\n") => 0,
        ) or die "can haz file?? $!";

        # Read each line
        while (defined(my $line = $fh->readline)) {
            # Need to decode line from UTF-32 to Perl's internal encoding
            $line = decode("UTF-32LE", $line);
            print "GOT: $line";
        }

        # Delete our temp file
        unlink $filename;
    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
      Thanks for your reply. I tried your code and it ran fine. Incorporating it into my own code isn't working for me at the moment, but that may be a problem at my end. I may need a rethink and some more background reading.
Re: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode
by sundialsvc4 (Abbot) on Mar 12, 2013 at 17:30 UTC

    Realistically, how could that be “a bug report?”   Think about it ... every multi-byte character encoding scheme that has ever been invented (or that could be) involves significant-bytes that precede the data that they modify.   If you are reading the file from stern to stem, well, “either you read them or you didn’t.”

    It stands to reason, therefore, that you must be the one to have read “a few more bytes than you need,” and, having read those bytes, you have to figure out whether (unlucky you ...) you started reading smack-dab in the middle of a multi-byte (MBCS) sequence or not.   There is no bright-line rule answer for this.   The only reliable strategy that I can think of is to rely upon some contextual knowledge about the data stream itself.   Find some string of (non-MBCS) sequence that you know will occur somewhere within the last n characters of the data.   Then, read some n+x (for some x...) bytes from the tail of the file, then use a regex to search within that data for that reliable sequence.   Advance suspiciously forward from there.

    Bear in mind that the onus is upon your application, not merely to come up with the right answers if it can, but to reliably fail if it cannot.   Your application is the only player with the capability to do this.   The fact that the algorithm does “produce answers at all” must, itself, be a positive indication that those answers are in fact worthy to be trusted.

      Nonsense. It's easy-peasy. Slurp the whole file into memory, convert it to a character string, then offer filehandle-like accessors to that string.

      Obviously, you want to avoid slurping the whole file into memory, but that's "just" an optimization. Worry about that when you've got the easy-peasy implementation working right.
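
A minimal sketch of that slurp-first approach, assuming the encoding is already known and the file fits in memory. The name lines_backwards is made up for illustration and is not part of File::ReadBackwards; it returns a plain list rather than a filehandle-like object.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (not part of File::ReadBackwards): slurp the file as
# raw bytes, decode it, and return its lines newest-first.
sub lines_backwards {
    my ($path, $encoding) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };    # read the whole file at once
    close $fh;

    my $text = decode($encoding, $bytes);
    $text =~ s/^\x{FEFF}//;                # drop a leading byte-order mark, if any

    my @lines    = split /\r?\n/, $text;   # line terminators are removed
    my @reversed = reverse @lines;
    return @reversed;
}
```

Calling lines_backwards($logfile, 'UTF-32LE') in list context would then give the last line of the log first.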

      As it happens, with UTF-32 you do know whether you're in the middle of a character (as in: codepoint, rather than grapheme), because each character is exactly 32 bits. Just take the byte offset modulo 4. So UTF-32 is an easy case to optimize and avoid slurping the entire file.
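
The modulo-4 idea might look like this in practice. This is only a sketch: utf32le_tail is a hypothetical name, and it assumes the file is UTF-32LE from byte 0 (a byte-order mark, if present, is itself 4 bytes, so it does not disturb the alignment).

```perl
use strict;
use warnings;
use Encode qw( decode );
use Fcntl  qw( SEEK_SET );

# Hypothetical helper: decode roughly the last $want bytes of a UTF-32LE
# file, always starting on a code point boundary.
sub utf32le_tail {
    my ($path, $want) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $size  = -s $fh;
    my $start = $size > $want ? $size - $want : 0;
    $start -= $start % 4;               # round down to a multiple of 4 bytes

    my $len = $size - $start;
    $len -= $len % 4;                   # ignore a truncated final code point

    seek $fh, $start, SEEK_SET or die "Cannot seek in $path: $!";
    my $bytes = '';
    defined read($fh, $bytes, $len) or die "Cannot read $path: $!";
    close $fh;

    return decode('UTF-32LE', $bytes);
}
```

The returned string can then be split on "\n". The first piece may be a partial line, so a caller would normally discard it unless the read started at offset 0.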

      UTF-8 is harder but not much. If the high bit is set on a byte, you're in a multibyte sequence. If the second highest bit is also set, you're at the start of a multibyte sequence, and the number of one bits before the first zero bit tells you how many bytes are in the sequence.
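
Those bit tests can be sketched as two small functions. Both names are hypothetical, and both expect undecoded bytes (for example, data read through a :raw handle), not a character string.

```perl
use strict;
use warnings;

# Hypothetical helper: given raw UTF-8 bytes and a byte offset, step back
# to the first byte of the character that contains that offset.
sub utf8_char_start {
    my ($bytes, $pos) = @_;
    # Continuation bytes have the form 10xxxxxx; skip backwards over them.
    while ($pos > 0 && (ord(substr($bytes, $pos, 1)) & 0xC0) == 0x80) {
        $pos--;
    }
    return $pos;
}

# Hypothetical helper: sequence length implied by a lead byte (an integer).
sub utf8_seq_length {
    my ($lead) = @_;
    return 1 if ($lead & 0x80) == 0x00;    # 0xxxxxxx: plain ASCII
    return 2 if ($lead & 0xE0) == 0xC0;    # 110xxxxx
    return 3 if ($lead & 0xF0) == 0xE0;    # 1110xxxx
    return 4 if ($lead & 0xF8) == 0xF0;    # 11110xxx
    return 0;                              # continuation or invalid lead byte
}
```

A backwards reader could seek to an arbitrary offset, call utf8_char_start on the buffer it read, and begin decoding from the offset returned.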

      So you optimize specific, common cases, and fall back to the slurping technique.

      package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
        Thanks for your reply. I have been reading up on the Unicode file types to get a better understanding for myself. Funnily enough, I was thinking of reading the whole file in just to get the script working and then looking at optimisation. I like the idea of using byte offsets, and it sounds like the solution I was looking for.
      Sorry for the reply misprints. This is my first post and I'm just getting used to the options.
      Thanks for your reply, sundialsvc4. I think this was the conclusion that I was coming to, but it's good to hear it from someone else.

        I have found a simple solution for my particular case, as I know that I only want to match certain strings from the log file and none of the match strings will contain non-ASCII characters.

        I can use File::ReadBackwards to read each line in, then strip out the non-ASCII and null characters.

            my $ERRORLINES_IN = File::ReadBackwards->new($logfile)
                or die "$logfile : $!\n";

            while (defined($logline = $ERRORLINES_IN->readline)) {
                $logline =~ s/[^[:ascii:]]//g;
                $logline =~ s/\0//g;
                ...    # match against $logline here
            }

        It's not a perfect solution but it works for my case.
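
To make the trade-off concrete, here is a small sketch of what those two substitutions do to UTF-32LE bytes. The wrapper name strip_to_ascii and the sample strings are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical wrapper around the two substitutions from the post.
sub strip_to_ascii {
    my ($raw) = @_;
    $raw =~ s/[^[:ascii:]]//g;    # drop bytes above 0x7F
    $raw =~ s/\0//g;              # drop the NUL padding bytes
    return $raw;
}

# Pure-ASCII text survives intact once the padding is removed.
my $ascii_line = strip_to_ascii( encode('UTF-32LE', "ERROR: disk full\n") );

# A non-ASCII character can leave stray ASCII behind: U+0141 is stored as
# the bytes 41 01 00 00, so what remains is "A\x01" rather than nothing.
my $mangled = strip_to_ascii( encode('UTF-32LE', "\x{0141}") );
```

So the approach is safe for matching known ASCII strings, as long as stray characters elsewhere on the line do not matter.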

Node Type: perlquestion [id://1022985]
Approved by tobyink