
Using File::Tail on a unicode file

by chrestomanci (Priest)
on Mar 07, 2012 at 09:52 UTC (#958228, perlquestion)
chrestomanci has asked for the wisdom of the Perl Monks concerning the following question:

Wise brothers, I seek your wisdom.

I am looking to monitor changes to a log file that has a Unicode (UTF-16) encoding. The file is a log from a Windows application. From time to time the application wakes up, truncates the file and starts emitting log messages to it. When it finishes a job it adds a 'job done' message to the log and goes back to sleep. I would like to know when those 'job done' messages are written so I can trigger other events.

My initial plan was to use File::Tail to monitor the file:

my $o_tail = File::Tail->new(
    name               => $scan_log_file,
    maxinterval        => 1,
    adjustafter        => 1,
    ignore_nonexistant => 1,
    reset_tail         => -1,
    tail               => -1,
);

while ( my $line = $o_tail->read() ) {
    if ( $line =~ m/finished/ ) {
        # Do stuff
    }
}

The problem with this is that the file is UTF-16, and the lines I get back are still encoded octets rather than decoded Perl strings.

From the File::Tail docs, there is no way to provide an encoding as a parameter. Looking at the source, it looks like the module makes a lot of sysseek and sysread calls, which I guess would not work properly via a Unicode input layer (because the one-to-one relationship between characters and bytes would no longer hold). So I think patching File::Tail to support Unicode would be a difficult undertaking.

Another approach I thought of is to accept the octet strings from File::Tail, and then pass them through $string = decode("utf-16", $octets) (from the Encode module). A possible problem with this approach is how newlines are encoded, and how File::Tail copes with them.
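One way to sidestep the newline question is to decode a whole buffer of octets first and only then split it into lines. The following is a minimal sketch, not a tested solution for the real log: it assumes the file is UTF-16LE with CRLF line endings and no BOM (the post does not state the byte order), and decode_lines is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode a buffer of UTF-16LE octets, then split into lines.
# Assumes the buffer starts on a character boundary and has an even length.
sub decode_lines {
    my ($octets) = @_;
    my $text = decode('UTF-16LE', $octets);
    return split /\r?\n/, $text;    # split on characters, not on raw bytes
}

# Simulated file contents, as a Windows application might write them.
my $octets = encode('UTF-16LE', "job started\r\njob finished\r\n");
my @lines  = decode_lines($octets);
print "$_\n" for @lines;
```

Because the split happens after decoding, a 0x0a byte that is only half of some other character can never be mistaken for a line ending.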

A third approach would be to abandon the use of File::Tail, and instead to stat() the file at frequent intervals, and every time it changes, read the entire file using standard IO with a suitable encoding.

Do you have any thoughts or suggestions?

Replies are listed 'Best First'.
Re: Using File::Tail on a unicode file
by Eliya (Vicar) on Mar 07, 2012 at 10:41 UTC

    The following hack seems to work:

    #!/usr/bin/perl -w
    use strict;
    use File::Tail;
    use IO::Handle;
    use Encode;
    #use Devel::Peek;

    my $scan_log_file = "foo.u16";

    unless (fork) {
        # emulate Windows file for testing
        open my $fh, ">:encoding(UTF-16le):crlf", $scan_log_file or die $!;
        $fh->autoflush(1);
        for (1..5) {
            print $fh "foo bar\n";
            sleep 1;
        }
        print $fh "finished\n";
        exit;
    }
    else {
        sleep 1;
        my $o_tail = File::Tail->new(
            name               => $scan_log_file,
            maxinterval        => 1,
            adjustafter        => 1,
            ignore_nonexistant => 1,
            reset_tail         => -1,
            tail               => -1,
        );

        while ( my $line = $o_tail->read() ) {
            $line =~ s/^\0//;    # fix possible char misalignment
            $line = decode("UTF-16le", $line);
            $line =~ s/\r$//;    # remove \r
            #Dump $line;
            print "$line\n";
            if ( $line =~ m/finished/ ) {
                print "done.\n";
                last;
            }
        }
    }

      Thank you, that was exactly what I was looking for. ++

      Unfortunately I can't use it because I have just discovered that File::Tail does not work under Windows, or at least not under ActiveState Perl 5.12. It is not available in their package repository, and when I tried to install it by hand most of the tests failed.

      Instead I have switched to a simpler algorithm that just monitors the file mtime, and if it changes, reads the entire file. It is inelegant but works.
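The mtime-based approach described above might look roughly like the sketch below. It is not the poster's actual code: read_log and check_log are hypothetical helper names, and the file is assumed to be UTF-16LE without a BOM. The demo writes a small temporary log and polls it once.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read the whole log and return its lines as Perl strings.
# Assumes UTF-16LE without a BOM.
sub read_log {
    my ($path) = @_;
    open my $fh, '<:raw:encoding(UTF-16LE)', $path
        or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close $fh;
    s/\r?\n\z// for @lines;    # strip CRLF or LF line endings
    return @lines;
}

# Hypothetical helper: re-read the log only if its mtime differs from the
# last one seen. Returns the current mtime and a flag that is true when a
# line matching /finished/ was found.
sub check_log {
    my ($path, $last_mtime) = @_;
    my $mtime = (stat $path)[9];
    return ($last_mtime, 0) unless defined $mtime;    # file missing
    return ($mtime, 0)
        if defined $last_mtime && $mtime == $last_mtime;    # unchanged
    my $done = grep { /finished/ } read_log($path);
    return ($mtime, $done ? 1 : 0);
}

# Demo: write a small UTF-16LE log, then poll it once.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw:encoding(UTF-16LE)';
print {$tmp_fh} "foo bar\r\n", "finished\r\n";
close $tmp_fh;

my ($mtime, $done) = check_log($path, undef);
print "job done\n" if $done;
```

In real use, check_log would be called in a loop with a short sleep, passing back the mtime from the previous call. Note that stat reports mtime in whole seconds, so two writes within the same second can look like a single change.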

Re: Using File::Tail on a unicode file
by moritz (Cardinal) on Mar 07, 2012 at 10:00 UTC

    If the file is in UTF-16BE, then the newline recognition isn't a huge problem. The newline is then encoded as 0x00 0x0a, so the line ending recognition doesn't break apart UTF-16 characters.

    Of course there is still the possibility of getting false positives for characters U+0A00 to U+0AFF - you have to judge for yourself if that is a problem for you.
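The false-positive case can be checked at the byte level. The sketch below counts 0x0a bytes in the UTF-16BE encoding of a few sample strings; count_lf_bytes is a hypothetical helper name, and U+0A05 (GURMUKHI LETTER A) is just one example from the affected range.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: count 0x0a bytes in the UTF-16BE encoding of a string.
sub count_lf_bytes {
    my ($string) = @_;
    my $octets = encode('UTF-16BE', $string);
    return scalar( () = $octets =~ /\x0a/g );
}

my $newline  = count_lf_bytes("\n");          # bytes 00 0a
my $gurmukhi = count_lf_bytes("\x{0A05}");    # bytes 0a 05
my $plain    = count_lf_bytes("A");           # bytes 00 41

printf "newline: %d, U+0A05: %d, A: %d\n", $newline, $gurmukhi, $plain;
```

A byte-oriented reader that splits on 0x0a would therefore see a line break in the middle of U+0A05, even though the text contains no newline there.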

Re: Using File::Tail on a unicode file
by Anonymous Monk on Mar 07, 2012 at 09:59 UTC
    Seems to me you could stat the file, then read the last 30 bytes and decode them. If the file is complete (the process has stopped logging and is sleeping again), copy the file and then process the copy, so it doesn't get truncated while you're parsing.
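Reading and decoding only the end of the file could be sketched as follows. This is an illustration under assumptions: the file is UTF-16LE with no BOM, no character outside the Basic Multilingual Plane sits at the cut point, and read_tail is a hypothetical helper name. The read is kept aligned on 2-byte units so the decoder never starts in the middle of a character.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Hypothetical helper: decode roughly the last $want bytes of a UTF-16LE file.
sub read_tail {
    my ($path, $want) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $size = -s $fh;
    $size -= $size % 2;    # ignore a half-written trailing byte
    my $len = $want < $size ? $want : $size;
    $len -= $len % 2;      # keep the read aligned on 2-byte units
    seek $fh, $size - $len, 0 or die "Cannot seek: $!";
    my $octets = '';
    my $got = read $fh, $octets, $len;
    die "Read failed: $!" unless defined $got;
    close $fh;
    return decode('UTF-16LE', $octets);
}

# Demo: write a small UTF-16LE log, then inspect its last 30 bytes.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw';
print {$tmp_fh} encode('UTF-16LE', "foo bar\r\n" x 5 . "finished\r\n");
close $tmp_fh;

my $tail = read_tail($path, 30);
print "job done\n" if $tail =~ /finished/;
```

Thirty bytes is only 15 UTF-16 code units, so the window has to be at least as long as the encoded 'job done' message plus its line ending.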
