Read text file - Encoding problem?

better has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

My script (at the bottom of this post) works when it reads a text file that I created beforehand with this script:

my $out = './data/IDs_created.txt';
open (OUT, ">>$out");
print "Enter ID: \n";
    while (<>) {
    print OUT "$_";
}
close OUT;

But the script doesn't work if I instead use a text file that I generated from an Excel CSV file with this script:

use warnings;
use Text::CSV;
use Encode;

$file = './data/IDS.csv'; #Default
  
#Parsing CSV
my $csv = Text::CSV ->new ({binary =>1, eol => $/});
 
open (CSV, '<:encoding(utf8)', $file) or die "Cannot open $file: $!\n";
open (OUT, '>:encoding(utf8)', './data/IDs_created.txt') or die "Kann Datei nicht öffnen: $!\n";
 
 while (my $line = <CSV>) {
    chomp $line;
        if ($csv->parse($line)) {
        my @fields = $csv->fields ();
        chomp (@fields);
        print OUT "@fields\n";
        }
        else {
        warn "Line could not be parsed: $line\n";
        }
 }
print "CSV parsed and saved as text file: IDs_created.txt!";
close CSV;
close OUT;

And here is the main script: it searches and copies files from one directory into another:

#!/usr/bin/perl
#
#Script allows to search and copy files 
#Thanks to Anonymous Monk
#Works only with PERL created text file!!!
#
#tested: --ok!

use strict; 
use warnings; 
use autodie; 
use File::Find::Rule; 
use File::Slurp; 
use File::Basename;
use File::Copy;

my $startdir = shift or die Usage(); 
my $dirTarget = '/cygdrive/d/tmp/';
my $fnames = join '|', map quotemeta, read_file('./data/IDs_created.txt' , qw/ chomp 1 /); 
my @fnames = find( file => name => qr{$fnames}, in => $startdir ); 
copy ($_, $dirTarget.basename ("$_")) for (@fnames);

Anonymous Monk pointed out that it might be an encoding problem. After hours of reading and unsuccessful attempts, I would appreciate any help with this question.

better


Replies are listed 'Best First'.
Re: Read text file - Encoding problem? by Kenosis (Priest) on Mar 17, 2013 at 02:05 UTC
Please note Text::CSV's documentation:

    open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
    while ( my $row = $csv->getline( $fh ) ) {
        $row->[2] =~ m/pattern/ or next; # 3rd field should match
        push @rows, $row;
    }

Let the `$csv` object read from the file handle (and parse the line), instead of:

    ...
    while (my $line = <CSV>) {
        chomp $line;
        if ($csv->parse($line)) {
    ...

In the documentation, `$row` is an array reference, pointing to the array that contains the line's parsed fields. You can just use `@$row` to get all those fields.

It's not clear to me what `chomp (@fields);` is doing in your script. It looks like you're expecting `@fields` to contain the parsed fields from the currently-read line. But there wouldn't be any newlines in `@fields`, so `chomp`ing it is unnecessary.

Also, as a side note, consider using lexical variables (`my`) for file handles, instead of barewords. For example:

    open my $CSV, '<:encoding(utf8)', $file or die "Cannot open $file: $!\n";
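Applied to the ID list, the suggestion to let the `$csv` object read from the file handle might look roughly like the sketch below. The helper name `first_column` is hypothetical, and the sketch assumes that the IDs sit in the first field of each row and that fields are comma-separated. It leaves out the `eol` attribute on purpose: as far as I understand the Text::CSV documentation, the parser then accepts `\n`, `\r\n` and `\r` as line endings by default.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical helper: return the first field of every row read from $fh.
sub first_column {
    my ($fh) = @_;

    # No "eol" attribute, so the parser's default line-ending handling applies.
    my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } )
        or die 'Cannot use Text::CSV: ' . Text::CSV->error_diag;

    my @ids;
    while ( my $row = $csv->getline($fh) ) {
        # $row is an array reference; element 0 is the first column.
        push @ids, $row->[0] if defined $row->[0] && length $row->[0];
    }
    return @ids;
}

# Usage, with the path from the original post:
#   open my $in, '<:encoding(utf8)', './data/IDS.csv' or die "Cannot open: $!\n";
#   print "$_\n" for first_column($in);
```

Writing one ID per line this way avoids the manual `chomp`/`parse` pair from the original script.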
Re^2: Read text file - Encoding problem? by better (Acolyte) on Mar 17, 2013 at 11:32 UTC
Hi Kenosis,

thanks again for your support. Of course, you are right to remind me to use lexical variables for file handles. I changed that and added 'use strict;', this time without getting errors like "requires explicit package name". The chomp command was intended to cut off a newline in case a field should contain one (it doesn't make a difference, so I deleted it).

I can't follow you on the matching command here. If I understand it correctly, it compares the strings in a row with a certain pattern? What I want to achieve is to take the string (ID) in the first field of the first column and use it as a regex for matching against the directory entries. Then take the second string of the first column, and so on (all IDs are listed in the first column only).

The reason it is not working seems to be something invisible at the end of each line. Following McA's hint to get the hex representation of the text file, the result is that each line ends with "d". This might cause the problem, because the script works with another text file in which the lines don't end with "d".

better
Re: Read text file - Encoding problem? by McA (Priest) on Mar 17, 2013 at 02:10 UTC
Hi,

just a hint to help yourself. Try to show the filenames in a hex representation. Then you can compare what you have in the file and what you get when reading the directory. Make a simple example reducing the problem:

    opendir my $dh, '.' or die "ERROR: Couldn't open: $!";
    my @entries = readdir($dh);
    closedir $dh;

    foreach my $entry (@entries) {
        print "$entry\n";
        print gethex($entry), "\n";
    }

    sub gethex {
        my $v = shift;
        return join '', map { sprintf("%x-", ord) } split //, $v;
    }

In the same way as this code, open your csv file and read the entries in there.

Another hint: I can't see an explicit decoding when read_file from File::Slurp is used. What do you get there? Are you sure that the csv file was created using UTF-8?

McA
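The same hex view can be applied to the lines of the ID file. The following is only a sketch: `hex_of` and `dump_lines` are hypothetical helper names, and, unlike `gethex` above, it prints two hex digits per character, so a carriage return appears as `0d` rather than `d`. The file is opened with the `:raw` layer on the assumption that no I/O layer should hide any bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: render every character of a string as two-digit hex.
sub hex_of {
    my ($text) = @_;
    return join '-', map { sprintf( '%02x', ord $_ ) } split //, $text;
}

# Print the hex form of each line of a file. chomp removes only the line
# feed here, so any other trailing character stays visible in the output.
sub dump_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    while ( my $line = <$fh> ) {
        chomp $line;
        print hex_of($line), "\n";
    }
    close $fh;
    return;
}

# Dump the file named on the command line, if one was given.
dump_lines( $ARGV[0] ) if @ARGV;
```

Running it once on the hand-typed file and once on the file produced from the CSV makes the two outputs directly comparable.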
Re^2: Read text file - Encoding problem? by better (Acolyte) on Mar 17, 2013 at 10:53 UTC
Hi McA,

Thanks for that script. It seems that it is not an encoding problem. I checked both text files that are read into a filehandle. There is one regularly occurring difference: each line of the "bad" text file, which was parsed from the csv and does not work, has a -d- at its end, while the lines of the "good" text file, which works with my script, do not. E.g.:

I C 7700 -> 49-20-43-20-37-37-30-30- #good
I C 7700 -> 49-20-43-20-37-37-30-30-d #bad

And this is what I get when reading the directory:

I C 7700.jpg -> 49-20-43-20-37-37-30-30-2e-4a-50-47

What does that mean? What does "d" stand for?

better
Re^3: Read text file - Encoding problem? by better (Acolyte) on Mar 17, 2013 at 12:18 UTC
While finding out how to remove this "d", which is invisibly attached to the end of each line, I added this to your script: `chop $entry`

chomp wouldn't do! The gethex function shows that "d" is removed without losing the last letter (or number). Later I will continue working on the question of how to get the "bad" text file into my main script without the invisible "d" and use these shortened strings there as a regex.

better
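For reference, hex `d` (0x0d) is the carriage return character, which is the first half of a Windows-style CRLF line ending. With the default `$/` of `"\n"`, `chomp` removes only the line feed, which would explain why it "wouldn't do" here. `chop` works, but it removes the last character whatever it is, so it would damage a line that has no carriage return. A more targeted approach is sketched below, assuming the IDs arrive one per line; `clean_ids` and `id_regex` are hypothetical helper names, not part of the original scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: strip trailing CR/LF characters and skip empty lines.
sub clean_ids {
    my @lines = @_;
    my @ids;
    for my $line (@lines) {
        $line =~ s/[\r\n]+\z//;    # handles "\n", "\r\n" and a lone "\r"
        push @ids, $line if length $line;
    }
    return @ids;
}

# Build one alternation regex from the cleaned IDs, in the same way the
# main script builds its pattern with join and quotemeta.
sub id_regex {
    my @ids = clean_ids(@_);
    die "No IDs given\n" unless @ids;    # an empty pattern would match everything
    my $alternation = join '|', map { quotemeta $_ } @ids;
    return qr{$alternation};
}
```

In the main script, the lines returned by `read_file` could be passed through `id_regex` before the result is handed to `find`, so the text file produced from the CSV would not need to be changed.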
Re^4: Read text file - Encoding problem? by poj (Abbot) on Mar 17, 2013 at 12:50 UTC
Re^5: Read text file - Encoding problem? by better (Acolyte) on Mar 17, 2013 at 15:14 UTC
