Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Module to read - modify - write text files in any unicode encoding

by Rudif (Hermit)
on May 19, 2008 at 21:17 UTC ( [id://687468]=perlquestion: print w/replies, xml ) Need Help??

Rudif has asked for the wisdom of the Perl Monks concerning the following question:

Venerable monks

I was looking for a perl module that would do the following

- write a perl text string to a disk file in encoding specified (none/ansi, utf-8, utf-16le, possibly others)
- read a disk file into a perl string, detecting the encoding and remebering it
- modify the perl string (could be done by an attached callback, as required by the application script)
- write the possibly modified perl string back to disk, using the remembered encoding
- preserve the Windows crlf sequences through the read - modify - write cycle

IOW, I want the simplicity of File::Slurp, with the added intelligence for automatic handling of the unicode encodings.

This would help me when writing scripts for maintenance of large number of text files (c/cpp source files, project files, xml config files, whatever) which may use a variety of encodings.

I found in perl standard modules and in CPAN modules various unicode-related building blocks, but not the synthesis that I am looking for.

Therefore, I wrote a prototype module which implements, partially, what I want.

Questions :

Is there something similar in the wpw (wide perl world), that I missed ?
Can you help me with a couple of problems with my prototype code ?
Any other advice or help on tackling this problem ?

Specific problems in my prototype code :

While my File::AnyEncoding::write_file for UTF-16LE test file correctly converts "\n" into 0D 00 0A 00,
my File::AnyEncoding::read_file for UTF-16LE does not convert this back to "\n" but to "\r\n", why ?
I did try to play with various settings of binmode, but I failed to obtain the desired result.

My File::AnyEncoding::write_file for UTF-8 test converts "\n" into 0A only, while I expect 0D 0A,
why is this and how to fix it ?

Same question with encoding NONE (by which I mean no BOM, utf-8 encoding).

My module File::AnyEncoding and a test file t\AnyEncoding.t are reproduced below.
They should be placed in subdirs

.\File\AnyEncoding.pm .\t\AnyEncoding.t
The module uses File::BOM and the test file uses File::Path and Data::HexDump.
My perl is 5.10 build 1003 from ActiveState, on a WinXP SP2 machine.

TIA
Rudif

#!/usr/bin/perl =pod SYNOPSIS use File::AnyEncoding; # unicode file writer - reader my $fun1 = new File::AnyEncoding('utf-16le'); my $text1 = "Hello world"; my $filepath1 = "AnyEncoding-test1.txt"; $fun1->write_file($filepath1, $text1); # writes file with specified +encoding my $fun2 = new File::AnyEncoding(); my $text2 = $fun2->read_file($filepath1); # remembers the encoding fo +und in $filepath1 $text2 =~s/world/unicode/; # modify file contents my $filepath2 = "AnyEncoding-test2.txt"; $fun2->write_file($filepath2, $text2); # writes file with encoding fo +und in $filepath1 AUTHOR Rudif c/o Perlmonks =cut package File::AnyEncoding; use strict; use File::BOM qw( :all ); our %supported_encoding = map { $_ => 1 } ( 'NONE', 'UTF-8', 'UTF-16LE +' ); sub new { my $class = shift; my $enc = shift // 'utf8'; my $self = {}; bless $self, $class; $self->set_encoding($enc); return $self; } sub set_encoding { my $self = shift; my $enc = shift; unless ( defined $enc && defined $supported_encoding{ $enc } ) { $enc = 'NONE'; #warn "defaulting to $enc"; } $self->{encoding} = $enc; } sub get_encoding { my $self = shift; $self->{encoding}; } sub write_file { my $self = shift; my $filepath = shift; my $text = join '', @_; my $enc = $self->{encoding}; my $FH; if ( $enc eq 'NONE' ) { open $FH, ">", $filepath; #open $FH, ">:raw:encoding(UTF-8):crlf:utf8", $filepath; } else { open $FH, ">:raw:encoding($enc):crlf:utf8", $filepath; print $FH "\x{FEFF}"; } print $FH $text; close $FH; } sub read_file { my $self = shift; my $filepath = shift; open my $FH, '<:bytes', "$filepath"; my ( $enc, $spillage ) = get_encoding_from_filehandle($FH); $enc = $self->set_encoding($enc); if ( $enc eq 'NONE' ) { #binmode $FH, ":encoding(UTF-8)"; close $FH; open $FH, '<', "$filepath"; } else { binmode $FH, ":encoding($enc)"; } my @lines = <$FH>; close $FH; wantarray ? @lines : join '', @lines; } 1;
#!/usr/bin/perl use strict; $|++; use lib('..'); use Data::HexDump; use File::Path; my $data = '.\data'; rmtree $data; # remove old data if any mkdir $data; use Test::More tests => 28; use File::AnyEncoding; # under test use_ok('File::AnyEncoding'); # test utf-16le encoding { # create object and write test file with specified encoding my $encoding = 'UTF-16LE'; my $fan1 = File::AnyEncoding->new($encoding); isa_ok( $fan1, 'File::AnyEncoding', '$fan1' ); is( $fan1->get_encoding(), 'UTF-16LE', "specified encoding" ); my $text1 = "Hello world \x{263A}\n\n"; my $file1 = "$data\\$encoding.txt"; $fan1->write_file( $file1, $text1 ); my $expected1 = <<HERE; 00000000 FF FE 48 00 65 00 6C 00 - 6C 00 6F 00 20 00 77 00 ..H.e.l.l +.o. .w. 00000010 6F 00 72 00 6C 00 64 00 - 20 00 3A 26 0D 00 0A 00 o.r.l.d. +.:&.... 00000020 0D 00 0A 00 .... HERE is( my_hexdump($file1), $expected1, "write_file $file1" ); # reset encoding - should be detected in read_file $fan1->set_encoding('NONE'); is( $fan1->get_encoding(), 'NONE', "default encoding" ); # read test file and remember encoding my $text2 = $fan1->read_file($file1); is( $fan1->get_encoding(), 'UTF-16LE', "detected encoding" ); my $expected2 = 'Hello world \x{263a}\x{d}\x{a}\x{d}\x{a}'; is( my_reasciify($text2), $expected2, "read_file $file1" ); my @lines2 = $fan1->read_file($file1); is( scalar @lines2, 2, "read_file $file1" ); my $join2 = join( '', @lines2 ); is( my_reasciify($join2), $expected2, "read_file $file1" ); # modify text and write second file using the remebered encoding $text2 =~ s/world/WORLD/; $text2 =~ s/\x{263A}/\x{20AC}/; ( my $file2 = $file1 ) =~ s/.txt/-2.txt/; $fan1->write_file( $file2, $text2 ); # writes file with remembe +red encoding my $expected3 = <<HERE; 00000000 FF FE 48 00 65 00 6C 00 - 6C 00 6F 00 20 00 57 00 ..H.e.l.l +.o. .W. 00000010 4F 00 52 00 4C 00 44 00 - 20 00 AC 20 0D 00 0D 00 O.R.L.D. +.. .... 00000020 0A 00 0D 00 0D 00 0A 00 ........ HERE is( my_hexdump($file2), $expected3, "write_file $file1" ); } # test utf-8 encoding { # create object and write test file with specified encoding my $encoding = 'UTF-8'; my $fan1 = File::AnyEncoding->new($encoding); isa_ok( $fan1, 'File::AnyEncoding', '$fan1' ); is( $fan1->get_encoding(), 'UTF-8', "specified encoding" ); my $text1 = "Hello world \x{263A}\n\n"; my $file1 = "$data\\$encoding.txt"; $fan1->write_file( $file1, $text1 ); my $expected1 = <<HERE; 00000000 EF BB BF 48 65 6C 6C 6F - 20 77 6F 72 6C 64 20 E2 ...Hello +world . 00000010 98 BA 0A 0A .... HERE is( my_hexdump($file1), $expected1, "write_file $file1" ); # reset encoding - should be detected in read_file $fan1->set_encoding('NONE'); is( $fan1->get_encoding(), 'NONE', "default encoding" ); # read test file and remember encoding my $text2 = $fan1->read_file($file1); is( $fan1->get_encoding(), 'UTF-8', "detected encoding" ); my $expected2 = 'Hello world \x{263a}\x{a}\x{a}'; is( my_reasciify($text2), $expected2, "read_file $file1" ); my @lines2 = $fan1->read_file($file1); is( scalar @lines2, 2, "read_file $file1" ); my $join2 = join( '', @lines2 ); is( my_reasciify($join2), $expected2, "read_file $file1" ); # modify text and write second file using the remebered encoding $text2 =~ s/world/WORLD/; $text2 =~ s/\x{263A}/\x{20AC}/; ( my $file2 = $file1 ) =~ s/.txt/-2.txt/; $fan1->write_file( $file2, $text2 ); # writes file with remembe +red encoding my $expected3 = <<HERE; 00000000 EF BB BF 48 65 6C 6C 6F - 20 57 4F 52 4C 44 20 E2 ...Hello +WORLD . 00000010 82 AC 0A 0A .... HERE is( my_hexdump($file2), $expected3, "write_file $file1" ); } # test no encoding { # create object and write test file with specified encoding my $encoding = 'NONE'; my $fan1 = File::AnyEncoding->new($encoding); isa_ok( $fan1, 'File::AnyEncoding', '$fan1' ); is( $fan1->get_encoding(), 'NONE', "specified encoding" ); my $text1 = "Hello world \x{263A}\n\n"; my $file1 = "$data\\$encoding.txt"; $fan1->write_file( $file1, $text1 ); my $expected1 = <<HERE; 00000000 48 65 6C 6C 6F 20 77 6F - 72 6C 64 20 E2 98 BA 0A Hello wor +ld .... 00000010 0A . HERE is( my_hexdump($file1), $expected1, "write_file $file1" ); # reset encoding - should be detected in read_file $fan1->set_encoding('UTF-8'); is( $fan1->get_encoding(), 'UTF-8', "default encoding" ); # read test file and remember encoding my $text2 = $fan1->read_file($file1); is( $fan1->get_encoding(), 'NONE', "detected encoding" ); my $expected2 = 'Hello world \x{e2}\x{98}\x{ba}\x{a}\x{a}'; is( my_reasciify($text2), $expected2, "read_file $file1" ); my @lines2 = $fan1->read_file($file1); is( scalar @lines2, 2, "read_file $file1" ); my $join2 = join( '', @lines2 ); is( my_reasciify($join2), $expected2, "read_file $file1" ); # modify text and write second file using the remebered encoding $text2 =~ s/world/WORLD/; $text2 =~ s/\x{263A}/\x{20AC}/; ( my $file2 = $file1 ) =~ s/.txt/-2.txt/; $fan1->write_file( $file2, $text2 ); # writes file with remembe +red encoding my $expected3 = <<HERE; 00000000 48 65 6C 6C 6F 20 57 4F - 52 4C 44 20 E2 98 BA 0A Hello WOR +LD .... 00000010 0A . HERE is( my_hexdump($file2), $expected3, "write_file $file1" ); } exit 0; # returns hexdump of the $file sub my_hexdump { my $file = shift; my $f = new Data::HexDump; unless ( -f $file ) { warn "no such file $file"; return '---'; } $f->file($file); my $str = ''; while ( local $_ = $f->dump ) { $str .= $_; } $str =~ s/.*00000000/00000000/s; return $str; } # returns sprintf of characters in $string, # replacing those not printable as ascii by their hex code point numer +ic value # similar to sub in File::BOM sub my_reasciify { my $string = shift; $string = join "", map { my $ord = ord($_); # ($ord > 127 || ($ord < 32 && $ord != 10)) ( $ord > 127 || $ord < 32 ) ? sprintf '\x{%x}', $ord : $_ } split //, $string; } __END__

Replies are listed 'Best First'.
Re: Module to read - modify - write text files in any unicode encoding
by ikegami (Patriarch) on May 19, 2008 at 22:21 UTC

    You use :raw:encoding($enc):crlf:utf8 for writing.
    You use :encoding($enc) for reading.
    This lack in symmetry in your IO layers accounts for your lack of symmetry in CRLF handling.

    Use :raw:encoding($enc):crlf:utf8 for reading too.

    use Data::Dumper qw( Dumper ); sub hexdump { (my $dump = uc unpack 'H*', $_[0]) =~ s/(..)/$1 /g; return $dump; } sub txtdump { local $Data::Dumper::Useqq = 1; local $Data::Dumper::Terse = 1; local $Data::Dumper::Indent = 0; return Dumper($_[0]); } { open(my $fh, '>:raw:encoding(UTF-16le):crlf:utf8', 'test') or die; my $data = "foo\nbar\n"; print("Orig: ", txtdump($data), "\n"); print $fh $data; } { open(my $fh, '<:raw', 'test') or die; local $/; my $data = <$fh>; print("File: ", hexdump($data), "\n"); } { open(my $fh, '<:raw:encoding(UTF-16le):crlf:utf8', 'test') or die; local $/; my $data = <$fh>; print("Read: ", txtdump($data), "\n"); }
    Orig: "foo\nbar\n" File: 66 00 6F 00 6F 00 0D 00 0A 00 62 00 61 00 72 00 0D 00 0A 00 Read: "foo\nbar\n"

    Update: Added code.

      ikegami

      Your code just works, also when I apply it to UTF-8.

      Apart from the lack in symmetry in my IO layers, that you pointed out, I found another source of my confusion, which you probably noticed, but you did not comment on :

      my_hexdump() based on Data::Hexdump that I was using in tests is wrong - on Windows.
      Deep inside, Data::Hexdump reads the file without applying '<:raw', like you do. So, when reading the UTF-8 or plain ASCII sequence "\r\n", it converts it to "\n".

      In addition, I was using hdump.pl to dump my test files. It agreed with my_hexdump(), but they were both wrong!.

      Here is a correct file hexdump, based on your code :

      sub hexdump { my $file = shift; open(my $fh, '<:raw', $file) or die; local $/; my $data = <$fh>; (my $dump = uc unpack 'H*', $data) =~ s/(..)/$1 /g; return $dump; }
      Thank you for the insight.

      Rudif

Re: Module to read - modify - write text files in any unicode encoding
by almut (Canon) on May 19, 2008 at 22:09 UTC
    While my File::AnyEncoding::write_file for UTF-16LE test file correctly converts "\n" into 0D 00 0A 00, my File::AnyEncoding::read_file for UTF-16LE does not convert this back to "\n" but to "\r\n", why ?

    I haven't (yet) looked at your code in detail, but my suspicion is that your problem has to do with a bug in the crlf PerlIO layer, which is that it doesn't correctly handle the multibyte encodings UTF-16, UTF-32, UCS-2 and UCS-4. I.e., it does its translation magic only on the byte sequences 0D 0A (on input) or 0A (on output).

    In other words, you'd have to rearrange the PerlIO layer stack in such a way that the crlf layer happens to be applied to an encoding (e.g. UTF-8) where the bytes 0D and 0A appear next to each other in the stream (no 00 in between) — like you already seem to be trying in your write_file() routine.  Also see PerlIO: crlf layer on Windows interfering with UCS-2 unicode.

    Update: Maybe it's worth pointing out that you want the kludge (i.e. the layer stack ":raw:encoding($enc):crlf:utf8") only on Windows, where \r\n <—> \n translation is supposed to happen, and only if the encoding in question is representing \r and \n as more than one byte (or more precisely, not exactly as 0D and 0A, i.e. for example UTF-16, UTF-32, UCS-2, UCS-4 — note that UTF-8 and most other encodings are fine). In all other cases, you should use the normal ":encoding($enc)" approach.

      almut

      I found your writeup is very useful, especially your explanation of the PerlIO layer stack :

      (Note that, when writing, layers are being applied from right-to-left, while when reading, they're being applied from left-to-right. IOW, the left hand side of the layer stack as shown corresponds to the external side (file), and the right hand side is the Perl-internal data representation.)

      I did experiment with the stack, but I was unsure about the order in which items on stack should be written, and the order in which they are applied. Now you clarified it.

      Rudif

Re: Module to read - modify - write text files in any unicode encoding
by moritz (Cardinal) on May 19, 2008 at 21:32 UTC
    Encode is "core" since perl 5.7.3 and handles pretty much any encoding you can think of. You should use it.

    To write a file with a specific encoding, it's enough to do things like this:

    open (my $file_handle, '>:encoding(UTF-8)', $filename) or die $!; print $file_handle $text_string; close $file_handle;

    (Update: I didn't see you already used that, sorry for the noise.)

    Guessing character encodings can be done with Encode::Guess, but it can never be done reliably.

    I know of no module that combines encoding guessing with file slurping, so it might be worth the effort. But don't roll any encoding handling code by hand, it's all been done before and properly tested.

    See perluniintro, perlunitut and perlunicode for details, I also wrote a short article on the subject.

    (Update: Fixed article link, thanks for reporting Rudif.

      moritz,
      thank you for the perldocs links and for your article Character Encodings in Perl
      (which I found even though the suffix .de was missing from the href in your post).

      I also read the excellent Unicode backgrounder on Joel on Software site.

      But don't roll any encoding handling code by hand, it's all been done before and properly tested.

      Very right, except that some of it might have been properly tested on a Mac or on Linux, and not at all on Windows. Caveat emptor.

      Rudif

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://687468]
Approved by moritz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (6)
As of 2024-04-19 11:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found