Text File Encoding under Windows
by pat_mc (Pilgrim) on Mar 17, 2010 at 16:59 UTC
pat_mc has asked for the wisdom of the Perl Monks concerning the following question:
I am having a problem parsing a Windows text file with regular expressions. Regexes that clearly should match the contents of the file do not match. I assume the problem is due to the file's encoding under Windows, but I simply can't get it to work.
The file contains hundreds of lines, some of which are in the format Text.1 // Text.2. I have been using the following code:
When I print the file in the console, all characters appear separated by a strange extra whitespace. I believe that as a result of this, the regexes don't match.
Since I could not get it to work under Windows, I tried to convert the Windows file to Unix format under Linux using the shell utility dos2unix. Also, I tried to convert character encodings using recode latin1..utf8. None of this worked.
Can you please advise how I can ensure that the Windows text file is read in and processed correctly?
Your help is much appreciated. Thanks in advance!
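One reading of the symptom described above (every character followed by an extra "whitespace") is that the file is UTF-16 encoded, so each ASCII character is followed by a NUL byte. That is an assumption, not something confirmed here. If it holds, it would also explain why dos2unix and recode latin1..utf8 made no difference, since neither treats the input as UTF-16. A minimal sketch of checking for a byte-order mark (BOM) and decoding before matching follows. The helper name decode_text, the Latin-1 fallback, and the sample regex are all hypothetical, and the sample bytes stand in for the real file:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode raw bytes into a Perl character string.
# If a UTF-16 BOM is present, decode as UTF-16 (the BOM selects the
# byte order and is removed). Otherwise fall back to Latin-1, which is
# only an assumption about the file.
sub decode_text {
    my ($bytes) = @_;
    if ($bytes =~ /\A(?:\xFF\xFE|\xFE\xFF)/) {
        return decode('UTF-16', $bytes);
    }
    return decode('ISO-8859-1', $bytes);
}

# Stand-in for the real file contents: UTF-16LE with a BOM and CRLF
# line endings, as some Windows editors write them.
my $bytes = "\xFF\xFE" . encode('UTF-16LE', "Text.1 // Text.2\r\n");
my $text  = decode_text($bytes);

# Split on CRLF or LF, then match the "Text.1 // Text.2" line format.
for my $line (split /\r?\n/, $text) {
    if ($line =~ m{^(\S+)\s*//\s*(\S+)$}) {
        print "left=$1 right=$2\n";
    }
}
```

For a real file, the bytes could be read with open my $fh, '<:raw', $path and slurped with local $/ before being passed to decode_text. Reading in :raw mode keeps Perl's newline translation from touching the data before it has been decoded.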