Module to read - modify - write text files in any unicode encoding

Rudif has asked for the wisdom of the Perl Monks concerning the following question:

Venerable monks

I was looking for a perl module that would do the following

- write a perl text string to a disk file in encoding specified (none/ansi, utf-8, utf-16le, possibly others)
- read a disk file into a perl string, detecting the encoding and remebering it
- modify the perl string (could be done by an attached callback, as required by the application script)
- write the possibly modified perl string back to disk, using the remembered encoding
- preserve the Windows crlf sequences through the read - modify - write cycle

IOW, I want the simplicity of File::Slurp, with the added intelligence for automatic handling of the unicode encodings.

This would help me when writing scripts for maintenance of large number of text files (c/cpp source files, project files, xml config files, whatever) which may use a variety of encodings.

I found in perl standard modules and in CPAN modules various unicode-related building blocks, but not the synthesis that I am looking for.

Therefore, I wrote a prototype module which implements, partially, what I want.

Questions :

Is there something similar in the wpw (wide perl world), that I missed ?
Can you help me with a couple of problems with my prototype code ?
Any other advice or help on tackling this problem ?

Specific problems in my prototype code :

While my File::AnyEncoding::write_file for UTF-16LE test file correctly converts "\n" into 0D 00 0A 00,
my File::AnyEncoding::read_file for UTF-16LE does not convert this back to "\n" but to "\r\n", why ?
I did try to play with various settings of binmode, but I failed to obtain the desired result.

My File::AnyEncoding::write_file for UTF-8 test converts "\n" into 0A only, while I expect 0D 0A,
why is this and how to fix it ?

Same question with encoding NONE (by which I mean no BOM, utf-8 encoding).

My module File::AnyEncoding and a test file t\AnyEncoding.t are reproduced below.
They should be placed in subdirs

    .\File\AnyEncoding.pm
    .\t\AnyEncoding.t
[download]

The module uses File::BOM and the test file uses File::Path and Data::HexDump.
My perl is 5.10 build 1003 from ActiveState, on a WinXP SP2 machine.

TIA
Rudif

#!/usr/bin/perl


=pod

SYNOPSIS

use File::AnyEncoding;  # unicode file writer - reader

my $fun1 = new File::AnyEncoding('utf-16le');

my $text1 = "Hello world";
my $filepath1 = "AnyEncoding-test1.txt";

$fun1->write_file($filepath1, $text1);   # writes file with specified 
+encoding

my $fun2 = new File::AnyEncoding();

my $text2 = $fun2->read_file($filepath1);  # remembers the encoding fo
+und in $filepath1

$text2 =~s/world/unicode/;  # modify file contents

my $filepath2 = "AnyEncoding-test2.txt";
$fun2->write_file($filepath2, $text2);  # writes file with encoding fo
+und in $filepath1

AUTHOR

Rudif c/o Perlmonks

=cut


package File::AnyEncoding;

use strict;


use File::BOM qw( :all );

our %supported_encoding = map { $_ => 1 } ( 'NONE', 'UTF-8', 'UTF-16LE
+' );

sub new {
    my $class = shift;
    my $enc = shift // 'utf8';
    my $self = {};
    bless $self, $class;
    $self->set_encoding($enc);
    return $self;
}

sub set_encoding {
    my $self = shift;
    my $enc = shift;
    unless ( defined $enc && defined $supported_encoding{ $enc } ) {
        $enc = 'NONE';
        #warn "defaulting to $enc";
    }
    $self->{encoding} = $enc;
}

sub get_encoding {
    my $self = shift;
    $self->{encoding};
}

sub write_file {
    my $self = shift;
    my $filepath = shift;
    my $text = join '', @_;
    my $enc = $self->{encoding};
    my $FH;
    if ( $enc eq 'NONE' ) {
        open $FH, ">", $filepath;
        #open $FH, ">:raw:encoding(UTF-8):crlf:utf8", $filepath;
    }
    else {
        open $FH, ">:raw:encoding($enc):crlf:utf8", $filepath;
        print $FH "\x{FEFF}";
    }
    print $FH $text;
    close $FH;
}

sub read_file {
    my $self = shift;
    my $filepath = shift;

    open my $FH, '<:bytes', "$filepath";
    my ( $enc, $spillage ) = get_encoding_from_filehandle($FH);

    $enc = $self->set_encoding($enc);

    if ( $enc eq 'NONE' ) {
        #binmode $FH, ":encoding(UTF-8)";
        close $FH;
        open $FH, '<', "$filepath";
    }
    else {
        binmode $FH, ":encoding($enc)";
    }
    my @lines = <$FH>;
    close $FH;
    wantarray ? @lines : join '', @lines;
}

1;
[download]


#!/usr/bin/perl

use strict;
$|++;

use lib('..');

use Data::HexDump;
use File::Path;

my $data = '.\data';

rmtree $data;    # remove old data if any
mkdir $data;

use Test::More tests => 28;

use File::AnyEncoding;    # under test

use_ok('File::AnyEncoding');

# test utf-16le encoding
{

    # create object and write test file with specified encoding

    my $encoding = 'UTF-16LE';
    my $fan1     = File::AnyEncoding->new($encoding);

    isa_ok( $fan1, 'File::AnyEncoding', '$fan1' );
    is( $fan1->get_encoding(), 'UTF-16LE', "specified encoding" );

    my $text1 = "Hello world \x{263A}\n\n";
    my $file1 = "$data\\$encoding.txt";

    $fan1->write_file( $file1, $text1 );

    my $expected1 = <<HERE;
00000000  FF FE 48 00 65 00 6C 00 - 6C 00 6F 00 20 00 77 00  ..H.e.l.l
+.o. .w.
00000010  6F 00 72 00 6C 00 64 00 - 20 00 3A 26 0D 00 0A 00  o.r.l.d. 
+.:&....
00000020  0D 00 0A 00                                        ....
HERE

    is( my_hexdump($file1), $expected1, "write_file $file1" );

    # reset encoding - should be detected in read_file
    $fan1->set_encoding('NONE');
    is( $fan1->get_encoding(), 'NONE', "default encoding" );

    # read test file and remember encoding

    my $text2 = $fan1->read_file($file1);

    is( $fan1->get_encoding(), 'UTF-16LE', "detected encoding" );

    my $expected2 = 'Hello world \x{263a}\x{d}\x{a}\x{d}\x{a}';
    is( my_reasciify($text2), $expected2, "read_file $file1" );

    my @lines2 = $fan1->read_file($file1);
    is( scalar @lines2, 2, "read_file $file1" );
    my $join2 = join( '', @lines2 );
    is( my_reasciify($join2), $expected2, "read_file $file1" );

    # modify text and write second file using the remebered encoding

    $text2 =~ s/world/WORLD/;
    $text2 =~ s/\x{263A}/\x{20AC}/;

    ( my $file2 = $file1 ) =~ s/.txt/-2.txt/;

    $fan1->write_file( $file2, $text2 );    # writes file with remembe
+red encoding

    my $expected3 = <<HERE;
00000000  FF FE 48 00 65 00 6C 00 - 6C 00 6F 00 20 00 57 00  ..H.e.l.l
+.o. .W.
00000010  4F 00 52 00 4C 00 44 00 - 20 00 AC 20 0D 00 0D 00  O.R.L.D. 
+.. ....
00000020  0A 00 0D 00 0D 00 0A 00                            ........
HERE

    is( my_hexdump($file2), $expected3, "write_file $file1" );
}

# test utf-8 encoding
{

    # create object and write test file with specified encoding

    my $encoding = 'UTF-8';
    my $fan1     = File::AnyEncoding->new($encoding);

    isa_ok( $fan1, 'File::AnyEncoding', '$fan1' );
    is( $fan1->get_encoding(), 'UTF-8', "specified encoding" );

    my $text1 = "Hello world \x{263A}\n\n";
    my $file1 = "$data\\$encoding.txt";

    $fan1->write_file( $file1, $text1 );

    my $expected1 = <<HERE;
00000000  EF BB BF 48 65 6C 6C 6F - 20 77 6F 72 6C 64 20 E2  ...Hello 
+world .
00000010  98 BA 0A 0A                                        ....
HERE

    is( my_hexdump($file1), $expected1, "write_file $file1" );

    # reset encoding - should be detected in read_file
    $fan1->set_encoding('NONE');
    is( $fan1->get_encoding(), 'NONE', "default encoding" );

    # read test file and remember encoding

    my $text2 = $fan1->read_file($file1);

    is( $fan1->get_encoding(), 'UTF-8', "detected encoding" );

    my $expected2 = 'Hello world \x{263a}\x{a}\x{a}';
    is( my_reasciify($text2), $expected2, "read_file $file1" );

    my @lines2 = $fan1->read_file($file1);
    is( scalar @lines2, 2, "read_file $file1" );
    my $join2 = join( '', @lines2 );
    is( my_reasciify($join2), $expected2, "read_file $file1" );

    # modify text and write second file using the remebered encoding

    $text2 =~ s/world/WORLD/;
    $text2 =~ s/\x{263A}/\x{20AC}/;

    ( my $file2 = $file1 ) =~ s/.txt/-2.txt/;

    $fan1->write_file( $file2, $text2 );    # writes file with remembe
+red encoding

    my $expected3 = <<HERE;
00000000  EF BB BF 48 65 6C 6C 6F - 20 57 4F 52 4C 44 20 E2  ...Hello 
+WORLD .
00000010  82 AC 0A 0A                                        ....
HERE

    is( my_hexdump($file2), $expected3, "write_file $file1" );
}

# test no encoding
{

    # create object and write test file with specified encoding

    my $encoding = 'NONE';
    my $fan1     = File::AnyEncoding->new($encoding);

    isa_ok( $fan1, 'File::AnyEncoding', '$fan1' );
    is( $fan1->get_encoding(), 'NONE', "specified encoding" );

    my $text1 = "Hello world \x{263A}\n\n";
    my $file1 = "$data\\$encoding.txt";

    $fan1->write_file( $file1, $text1 );

    my $expected1 = <<HERE;
00000000  48 65 6C 6C 6F 20 77 6F - 72 6C 64 20 E2 98 BA 0A  Hello wor
+ld ....
00000010  0A                                                 .
HERE

    is( my_hexdump($file1), $expected1, "write_file $file1" );

    # reset encoding - should be detected in read_file
    $fan1->set_encoding('UTF-8');
    is( $fan1->get_encoding(), 'UTF-8', "default encoding" );

    # read test file and remember encoding

    my $text2 = $fan1->read_file($file1);

    is( $fan1->get_encoding(), 'NONE', "detected encoding" );

    my $expected2 = 'Hello world \x{e2}\x{98}\x{ba}\x{a}\x{a}';
    is( my_reasciify($text2), $expected2, "read_file $file1" );

    my @lines2 = $fan1->read_file($file1);
    is( scalar @lines2, 2, "read_file $file1" );
    my $join2 = join( '', @lines2 );
    is( my_reasciify($join2), $expected2, "read_file $file1" );

    # modify text and write second file using the remebered encoding

    $text2 =~ s/world/WORLD/;
    $text2 =~ s/\x{263A}/\x{20AC}/;

    ( my $file2 = $file1 ) =~ s/.txt/-2.txt/;

    $fan1->write_file( $file2, $text2 );    # writes file with remembe
+red encoding

    my $expected3 = <<HERE;
00000000  48 65 6C 6C 6F 20 57 4F - 52 4C 44 20 E2 98 BA 0A  Hello WOR
+LD ....
00000010  0A                                                 .
HERE

    is( my_hexdump($file2), $expected3, "write_file $file1" );

}

exit 0;

# returns hexdump of the $file
sub my_hexdump {
    my $file = shift;
    my $f    = new Data::HexDump;
    unless ( -f $file ) {
        warn "no such file $file";
        return '---';
    }
    $f->file($file);
    my $str = '';
    while ( local $_ = $f->dump ) {
        $str .= $_;
    }
    $str =~ s/.*00000000/00000000/s;
    return $str;
}

# returns sprintf of characters in $string,
# replacing those not printable as ascii by their hex code point numer
+ic value
# similar to sub in File::BOM
sub my_reasciify {
    my $string = shift;
    $string = join "", map {
        my $ord = ord($_);

        #    ($ord > 127 || ($ord < 32 && $ord != 10))
        ( $ord > 127 || $ord < 32 )
          ? sprintf '\x{%x}', $ord
          : $_
    } split //, $string;
}

__END__
[download]

Comment on Module to read - modify - write text files in any unicode encoding Select or Download Code

Replies are listed 'Best First'.
Re: Module to read - modify - write text files in any unicode encoding by ikegami (Patriarch) on May 19, 2008 at 22:21 UTC
You use `:raw:encoding($enc):crlf:utf8` for writing. You use `:encoding($enc)` for reading. This lack in symmetry in your IO layers accounts for your lack of symmetry in CRLF handling. Use `:raw:encoding($enc):crlf:utf8` for reading too. use Data::Dumper qw( Dumper ); sub hexdump { (my $dump = uc unpack 'H', $_[0]) =~ s/(..)/$1 /g; return $dump; } sub txtdump { local $Data::Dumper::Useqq = 1; local $Data::Dumper::Terse = 1; local $Data::Dumper::Indent = 0; return Dumper($_[0]); } { open(my $fh, '>:raw:encoding(UTF-16le):crlf:utf8', 'test') or die; my $data = "foo\nbar\n"; print("Orig: ", txtdump($data), "\n"); print $fh $data; } { open(my $fh, '<:raw', 'test') or die; local $/; my $data = <$fh>; print("File: ", hexdump($data), "\n"); } { open(my $fh, '<:raw:encoding(UTF-16le):crlf:utf8', 'test') or die; local $/; my $data = <$fh>; print("Read: ", txtdump($data), "\n"); } [download] `Orig: "foo\nbar\n" File: 66 00 6F 00 6F 00 0D 00 0A 00 62 00 61 00 72 00 0D 00 0A 00 Read: "foo\nbar\n"` [download] Update*: Added code.	[reply] [d/l] [select]
Re^2: Module to read - modify - write text files in any unicode encoding by Rudif (Hermit) on May 20, 2008 at 22:22 UTC
ikegami Your code just works, also when I apply it to UTF-8. Apart from the lack in symmetry in my IO layers, that you pointed out, I found another source of my confusion, which you probably noticed, but you did not comment on : my_hexdump() based on Data::Hexdump that I was using in tests is wrong - on Windows. Deep inside, Data::Hexdump reads the file without applying '<:raw', like you do. So, when reading the UTF-8 or plain ASCII sequence "\r\n", it converts it to "\n". In addition, I was using hdump.pl to dump my test files. It agreed with my_hexdump(), but they were both wrong!. Here is a correct file hexdump, based on your code : `sub hexdump { my $file = shift; open(my $fh, '<:raw', $file) or die; local $/; my $data = <$fh>; (my $dump = uc unpack 'H*', $data) =~ s/(..)/$1 /g; return $dump; }` [download] Thank you for the insight. Rudif	[reply] [d/l]
Re: Module to read - modify - write text files in any unicode encoding by almut (Canon) on May 19, 2008 at 22:09 UTC
While my File::AnyEncoding::write_file for UTF-16LE test file correctly converts "\n" into 0D 00 0A 00, my File::AnyEncoding::read_file for UTF-16LE does not convert this back to "\n" but to "\r\n", why ? I haven't (yet) looked at your code in detail, but my suspicion is that your problem has to do with a bug in the `crlf` PerlIO layer, which is that it doesn't correctly handle the multibyte encodings UTF-16, UTF-32, UCS-2 and UCS-4. I.e., it does its translation magic only on the byte sequences `0D 0A` (on input) or `0A` (on output). In other words, you'd have to rearrange the PerlIO layer stack in such a way that the `crlf` layer happens to be applied to an encoding (e.g. UTF-8) where the bytes `0D` and `0A` appear next to each other in the stream (no `00` in between) — like you already seem to be trying in your `write_file()` routine. Also see PerlIO: crlf layer on Windows interfering with UCS-2 unicode. Update: Maybe it's worth pointing out that you want the kludge (i.e. the layer stack `":raw:encoding($enc):crlf:utf8"`) only on Windows, where `\r\n` <—> `\n` translation is supposed to happen, and only if the encoding in question is representing `\r` and `\n` as more than one byte (or more precisely, not exactly as `0D` and `0A`, i.e. for example UTF-16, UTF-32, UCS-2, UCS-4 — note that UTF-8 and most other encodings are fine). In all other cases, you should use the normal `":encoding($enc)"` approach.	[reply] [d/l] [select]
Re^2: Module to read - modify - write text files in any unicode encoding by Rudif (Hermit) on May 20, 2008 at 21:39 UTC
almut I found your writeup is very useful, especially your explanation of the PerlIO layer stack : (Note that, when writing, layers are being applied from right-to-left, while when reading, they're being applied from left-to-right. IOW, the left hand side of the layer stack as shown corresponds to the external side (file), and the right hand side is the Perl-internal data representation.) I did experiment with the stack, but I was unsure about the order in which items on stack should be written, and the order in which they are applied. Now you clarified it. Rudif	[reply]
Re: Module to read - modify - write text files in any unicode encoding by moritz (Cardinal) on May 19, 2008 at 21:32 UTC
Encode is "core" since perl 5.7.3 and handles pretty much any encoding you can think of. You should use it. ~~To write a file with a specific encoding, it's enough to do things like this:~~ ~~`open (my $file_handle, '>:encoding(UTF-8)', $filename) or die $!; print $file_handle $text_string; close $file_handle;`~~ ~~[download]~~ (Update: I didn't see you already used that, sorry for the noise.) Guessing character encodings can be done with Encode::Guess, but it can never be done reliably. I know of no module that combines encoding guessing with file slurping, so it might be worth the effort. But don't roll any encoding handling code by hand, it's all been done before and properly tested. See perluniintro, perlunitut and perlunicode for details, I also wrote a short article on the subject. (Update: Fixed article link, thanks for reporting Rudif.	[reply] [d/l]
Re^2: Module to read - modify - write text files in any unicode encoding by Rudif (Hermit) on May 20, 2008 at 21:18 UTC
moritz, thank you for the perldocs links and for your article Character Encodings in Perl (which I found even though the suffix .de was missing from the href in your post). I also read the excellent Unicode backgrounder on Joel on Software site. But don't roll any encoding handling code by hand, it's all been done before and properly tested. Very right, except that some of it might have been properly tested on a Mac or on Linux, and not at all on Windows. Caveat emptor. Rudif	[reply]


Perl Monk, Perl Meditation
	PerlMonks