Text File Encoding under Windows

by pat_mc (Pilgrim)
on Mar 17, 2010 at 16:59 UTC (#829222)
pat_mc has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks,

I am having a problem with parsing a Windows text file with regular expressions. Somehow, the file won't match regexes that clearly should be matched by the contents of the file. I assume the problem is due to file encoding under Windows but simply can't get this to work OK.

The file contains hundreds of lines, some of which are in the format Text.1 // Text.2. I have been using the following code:
    #! /usr/bin/perl -w
    use strict;
    use locale;
    use utf8;

    while ( <> ) {
        if ( /\/\/ / ) {
            # Apply long list of regex-based substitutions
            print;
        }
    }
When I print the file in the console, all characters appear separated by a strange extra whitespace. I believe that as a result of this, the regexes don't match.

Since I could not get it to work under Windows, I tried to convert the Windows file to Unix format under Linux using the shell utility dos2unix. Also, I tried to convert character encodings using recode latin1..utf8. None of this worked.

Can you please advise how I can ensure that the Windows text file is read in and processed correctly?

Your help is much appreciated. Thanks in advance!

Cheers -

Pat

Re: Text File Encoding under Windows
by almut (Canon) on Mar 17, 2010 at 17:16 UTC
    When I print the file in the console, all characters appear separated by a strange extra whitespace.

    The file is most likely encoded as UTF-16 (or UCS-2, which for most practical purposes doesn't make much of a difference).  Try to open it with

        open my $fh, "<:encoding(UTF-16LE)", ...
        while (<$fh>) {

    ( :encoding(UTF-16) should work, too, if the file has a BOM (byte order mark), which it typically has. In this case, the BOM itself (\x{feff}) also won't be part of the data read via <$fh>. )
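
    A minimal sketch of the difference between the two layers, assuming a little-endian file that starts with a BOM. The sample content, the temporary file, and the helper read_lines() are invented for illustration and are not part of the original question.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a sample file: BOM (FF FE) followed by UTF-16LE text.
# The content is invented for this sketch.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;    # raw bytes, no newline translation
print {$tmp} encode('UTF-16LE', "\x{FEFF}Text.1 // Text.2\n");
close $tmp or die "Cannot close $path: $!";

# read_lines() is a hypothetical helper: decode $file with the named
# encoding and return its lines as character strings.
sub read_lines {
    my ($file, $encoding) = @_;
    open my $fh, "<:encoding($encoding)", $file
        or die "Cannot open $file: $!";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# UTF-16 uses the BOM to pick the byte order and drops it from the data.
my @detected = read_lines($path, 'UTF-16');

# UTF-16LE fixes the byte order, so the BOM arrives as the character U+FEFF.
my @fixed = read_lines($path, 'UTF-16LE');
(my $first = $fixed[0]) =~ s/\A\x{FEFF}//;    # strip it by hand

print "regex matches\n" if $detected[0] =~ m{// };
```

    With UTF-16LE the leftover U+FEFF only affects the first line, but it is enough to defeat a regex anchored at the start of the file.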

      Thanks, almut -

      This solved part of the problem. The file is now getting read in OK and the expected regex matches occur. However, the output is still causing problems because the non-ASCII characters in it do not get represented correctly. I have tried the following two approaches:

      1) Printing out to the DOS console and redirecting the output from there into a file. The result looks fine, except that the special characters get represented as EF, FC, etc.

      2) Printing to a UTF-16 encoded file with the following code:
          #! /usr/bin/perl -w
          use strict;
          use locale;

          open INPUT, "<:encoding(UTF-16LE)", $ARGV[0];
          open OUTPUT, ">:encoding(UTF-16LE)", "./Output_UTF-16";

          while ( <INPUT> ) {
              # long list of regex-based replacements
              print OUTPUT $_;
          }
      The result was an output file which represented all special characters correctly but contained a line of empty boxes in every second line.

      Can you please advise what I need to do to fix both output variants?

      Thanks again for your help!
      Pat

        What is the desired (or required) output encoding, i.e. which program are you using to view or further process the output? (can it handle UTF-16?)

        What special characters are involved; can they also be represented in a non-Unicode legacy encoding such as ISO-8859-1 (ISO-Latin1) or Windows CP1252?

        Maybe just try other output encodings:

            open OUTPUT, ">:encoding(UTF-8)", ...
            open OUTPUT, ">:encoding(CP1252)", ...
            open OUTPUT, ">:encoding(ISO-8859-1)"...
            ...

        The latter two should be used in combination with :encoding(UTF-16) on the input side, because that would swallow the BOM, which you don't want in non-unicode output  (in case of UTF-8 the BOM is optional, so you can decide for yourself).

        P.S.: can you view the original input file correctly with the same program that's showing the empty boxes with the output file?  (btw, do you really mean "line" in "a line of empty boxes in every second line", or rather character "column"? — the former would be kinda strange...)
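
        A rough sketch of the input/output combinations suggested above, reading with :encoding(UTF-16) so that the BOM is consumed and writing in a different encoding. The helper transcode(), the file names, and the sample text are assumptions made for this example only.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# transcode() is a hypothetical helper: copy $in to $out line by line,
# decoding with $from and encoding with $to.
sub transcode {
    my ($in, $from, $out, $to) = @_;
    open my $src, "<:encoding($from)", $in  or die "Cannot read $in: $!";
    open my $dst, ">:encoding($to)",   $out or die "Cannot write $out: $!";
    while (my $line = <$src>) {
        # the regex-based replacements would go here
        print {$dst} $line;
    }
    close $src;
    close $dst or die "Cannot finish $out: $!";
}

# Sample input, invented for this sketch: BOM + UTF-16LE text with a "ü".
my $dir   = tempdir(CLEANUP => 1);
my $input = "$dir/input.txt";
open my $raw, '>:raw', $input or die "Cannot write $input: $!";
print {$raw} encode('UTF-16LE', "\x{FEFF}M\x{FC}ller // Text.2\n");
close $raw or die "Cannot finish $input: $!";

# UTF-16 on the input side swallows the BOM, so none ends up in the output.
transcode($input, 'UTF-16', "$dir/out_utf8.txt",   'UTF-8');
transcode($input, 'UTF-16', "$dir/out_cp1252.txt", 'cp1252');
```

        Which output encoding is the right one still depends on the program that has to read the result.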

Re: Text File Encoding under Windows
by se@n (Initiate) on Mar 21, 2010 at 00:30 UTC

    I need to see specific examples to provide a complete answer. There are some things I would suggest. Don't trust text editors or the display when you view and print characters. If something strange is going on, then print out the ordinal values with ord($char). This will give you numeric values that you can trust. And it will show you any character that's not visible.
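
    A small sketch of that idea. The helper dump_codepoints() and the sample string are hypothetical. The sample imitates a UTF-16LE line read as raw bytes, where every other "character" is a NUL that a console may show as extra spacing.

```perl
use strict;
use warnings;

# dump_codepoints() is a hypothetical helper: return one "U+XXXX" token per
# character so that invisible characters (NUL, BOM, CR) become visible.
sub dump_codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $text;
}

# "Text" as UTF-16LE bytes that were never decoded (invented sample).
my $raw_bytes = "T\x00e\x00x\x00t\x00";
print dump_codepoints($raw_bytes), "\n";
# prints: U+0054 U+0000 U+0065 U+0000 U+0078 U+0000 U+0074 U+0000
```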

    A character in the 32-126 range is normal. If it's less than 32, and it's not \n, then change it to ' ': $text =~ s/\s+/ /g; If it's above 126, then it's an 8-bit quantity that will mess up the regexes, and probably Windows. What you do with these values depends on the assignment. This will delete them:

        my $low = chr(127);
        my $high = chr(255);
        $text =~ s/[$low-$high]//g;
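
    One way to write both rules with explicit hex ranges, as a sketch only. The helper to_plain_ascii() is hypothetical, and it assumes $text holds raw bytes or Latin-1 characters. Unlike s/\s+/ /g, it leaves \n alone, and deleting the upper range loses data.

```perl
use strict;
use warnings;

# to_plain_ascii() is a hypothetical helper following the rules above:
# control characters other than "\n" become spaces, and characters in the
# range 127-255 are deleted. Deleting them loses data (e.g. "ü" vanishes).
sub to_plain_ascii {
    my ($text) = @_;
    $text =~ tr/\x00-\x09\x0B-\x1F/ /;    # below 32, except \n (0x0A)
    $text =~ s/[\x7F-\xFF]//g;            # 127 through 255
    return $text;
}

# \xFC is "ü" in Latin-1; the sample line is invented.
print to_plain_ascii("M\xFCller\tText.1 // Text.2\n");
# prints: Mller Text.1 // Text.2
```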

    Some of the 8-bit values represent standard punctuation, and you can change them into 7-bit quantities. If there are two, three, or four consecutive 8-bit characters, then you have to deal with 16-bit, 24-bit, 32-bit UTFs. There's a definition on Wikipedia. There might be a package on CPAN. There's also a huge translation table online: http://www.utf8-chartable.de/unicode-utf8-table.pl

    Hope this helps. Sean
