Read text file - Encoding problem?

by better (Acolyte)
on Mar 17, 2013 at 00:33 UTC (#1023863)
better has asked for the wisdom of the Perl Monks concerning the following question:


My script (at the bottom of this post) works if I read a text file that I created beforehand with this script:
my $out = './data/IDs_created.txt';
open (OUT, ">>$out");
print "Enter ID: \n";
while (<>) {
    print OUT "$_";
}
close OUT;

But the script doesn't work if I instead use a text file that I parsed from an Excel CSV file with this script:

use warnings;
use Text::CSV;
use Encode;

$file = './data/IDS.csv';    # Default

# Parsing CSV
my $csv = Text::CSV->new({ binary => 1, eol => $/ });
open (CSV, '<:encoding(utf8)', $file) or die "Cannot open $file: $!\n";
open (OUT, '>:encoding(utf8)', './data/IDs_created.txt') or die "Cannot open file: $!\n";
while (my $line = <CSV>) {
    chomp $line;
    if ($csv->parse($line)) {
        my @fields = $csv->fields ();
        chomp (@fields);
        print OUT "@fields\n";
    }
    else {
        warn "Line could not be parsed: $line\n";
    }
}
print "CSV parsed and saved as text file: IDs_created.txt!";
close CSV;
close OUT;

And here is the main script; it searches for files and copies them from one directory into another:

#!/usr/bin/perl
#
# Script allows to search and copy files
# Thanks to Anonymous Monk
# Works only with PERL created text file!!!
#
# tested: --ok!
use strict;
use warnings;
use autodie;
use File::Find::Rule;
use File::Slurp;
use File::Basename;
use File::Copy;

my $startdir  = shift or die Usage();
my $dirTarget = '/cygdrive/d/tmp/';
my $fnames    = join '|', map quotemeta, read_file('./data/IDs_created.txt', qw/ chomp 1 /);
my @fnames    = find( file => name => qr{$fnames}, in => $startdir );
copy ($_, $dirTarget.basename ("$_")) for (@fnames);

An Anonymous Monk pointed out that it might be a problem with encoding. But after hours of reading and trying without success, I would appreciate any help with this question.


Re: Read text file - Encoding problem?
by Kenosis (Priest) on Mar 17, 2013 at 02:05 UTC

    Please note Text::CSV's documentation:

    open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
    while ( my $row = $csv->getline( $fh ) ) {
        $row->[2] =~ m/pattern/ or next;    # 3rd field should match
        push @rows, $row;
    }

    Let the $csv object read from the file handle (and parse the line), instead of:

    ...
    while (my $line = <CSV>) {
        chomp $line;
        if ($csv->parse($line)) {
    ...

    In the documentation, $row is an array reference, pointing to the array that contains the line's parsed fields. You can just use @$row to get all those fields.

    It's not clear to me what chomp (@fields); is doing in your script. It looks like you're expecting @fields to contain the parsed fields from the currently-read line. But there wouldn't be any newlines in @fields, so chomping it is unnecessary.

    Also, as a side note, consider using lexical variables (my) for file handles, instead of barewords. For example:

    open my $CSV, '<:encoding(utf8)', $file or die "Cannot open $file: $!\n";

      Hi Kenosis,

      thanks again for your support.

      Of course, you are right to remind me to use lexical variables for file handles. I changed that and added 'use strict;', this time without getting errors like "requires explicit package name".

      The chomp command was intended to cut off a newline in case a field contained one (it doesn't make a difference, so I deleted it).

      I can't follow how to implement the matching command here. If I understand it correctly, it compares the strings in a row with a certain pattern? What I want to achieve is to get the string (ID) in the first field of the first column and use it as a regex for matching the file names from read_dir. Then get the second string of the first column, etc. (all IDs are listed in the first column only).

      The reason it is not working seems to be something invisible at the end of each line. I followed McA's hint and got the hex representation of the text file: each line ends with "d". This might cause the problem, because the script works with another text file in which the lines don't end with "d".


Re: Read text file - Encoding problem?
by McA (Priest) on Mar 17, 2013 at 02:10 UTC


    Just a hint to help yourself: try to show the filenames in a hex representation. Then you can compare what you have in the file with what you get reading the directory. Make a simple example that reduces the problem:

    opendir my $dh, '.' or die "ERROR: Couldn't open: $!";
    my @entries = readdir($dh);
    closedir $dh;

    foreach my $entry (@entries) {
        print "$entry\n";
        print gethex($entry), "\n";
    }

    sub gethex {
        my $v = shift;
        return join '', map { sprintf("%x-", ord) } split //, $v;
    }
    And like this code, open your csv file and read the entries in there.
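
A minimal sketch of that second step, assuming the entries sit one per line in a text file. It reuses gethex from the code above; hex_lines is a hypothetical helper name, and the in-memory sample data (with Windows-style line endings) is made up for illustration.

```perl
use strict;
use warnings;

# Same idea as gethex above: one hex code per character, '-' separated.
sub gethex {
    my ($v) = @_;
    return join '', map { sprintf '%x-', ord } split //, $v;
}

# Return the hex dump of every line read from an already opened handle.
sub hex_lines {
    my ($fh) = @_;
    my @dumps;
    while (my $line = <$fh>) {
        chomp $line;    # removes "\n" only; anything else stays visible in the dump
        push @dumps, gethex($line);
    }
    return @dumps;
}

# Demo on an in-memory file (hypothetical sample data, CR LF line endings)
my $sample = "I C 7700\r\nI C 7701\r\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for hex_lines($fh);
close $fh;
```

To run it against a real file, open that file instead of the in-memory string and pass the handle to hex_lines. Any character that survives chomp shows up as an extra code at the end of the dump.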

    Another hint: I can't see an explicit decoding while using read_file from File::Slurp. What do you get there? Are you sure that the csv file was created using UTF-8?


      Hi McA,

      Thanks for that script. It seems that it is not an encoding problem. I checked both text files that are read into a filehandle. There is one regularly occurring difference: each line of the "bad" text file (parsed from the csv, not working) has a -d- at its end, while the lines of the "good" text file (working with my script) do not:

      eg:  I C 7700 -> 49-20-43-20-37-37-30-30-    #good

         I C 7700 -> 49-20-43-20-37-37-30-30-d     #bad

      and what I get reading the directory:

        I C 7700.jpg -> 49-20-43-20-37-37-30-30-2e-4a-50-47

      What does that mean? What does the "d" stand for?


        While trying to find out how to remove this "d", which is invisibly attached to the end of each line, I added this to your script:

        chop $entry

        chomp wouldn't do!

        The gethex function shows that "d" is removed without losing the last letter (or number).
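
For context: hex d is character code 13, the carriage return (CR) in ASCII. Files saved on Windows usually end each line with CR LF, and chomp removes only the value of $/, which defaults to "\n", so the CR stays behind. chop removes the last character whatever it is, so it would also eat a real character on a line that has no CR. A minimal sketch of a more selective alternative, assuming the "d" really is a CR from CR LF line endings (strip_eol is a hypothetical helper name, and the sample line is made up):

```perl
use strict;
use warnings;

# Remove one trailing line ending: LF, CR LF, or a bare CR left over after chomp.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/(?:\r?\n|\r)\z//;
    return $line;
}

# Hypothetical sample line as it might come from a CSV saved on Windows
my $raw = "I C 7700\r\n";

my $chomped = $raw;
chomp $chomped;                  # removes "\n", leaves "\r" in place
my $clean = strip_eol($raw);     # removes both

printf "after chomp:     %d characters\n", length $chomped;
printf "after strip_eol: %d characters\n", length $clean;
```

Unlike chop, strip_eol leaves a line untouched when it has no line ending at all, so it is safe to apply to both the "good" and the "bad" file.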

        Later I will continue working on the question of how to read the "bad" text file into my main script without the invisible "d" and use these shortened strings there as a regex.
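
One way this could look, as a sketch only: clean every line before it goes through quotemeta and into the alternation. build_name_regex is a hypothetical helper, the sample IDs are made up, and the code assumes that an ID never ends in whitespace, because all trailing whitespace (including "\r" and "\n") is dropped.

```perl
use strict;
use warnings;

# Build one regex that matches any of the IDs, ignoring line-ending debris.
sub build_name_regex {
    my (@lines) = @_;
    my @ids;
    for my $line (@lines) {
        (my $id = $line) =~ s/\s+\z//;    # drops "\r", "\n" and trailing blanks
        next if $id eq '';                # an empty branch would match every file name
        push @ids, quotemeta $id;
    }
    die "No IDs found\n" unless @ids;
    my $alternation = join '|', @ids;
    return qr{$alternation};
}

# Hypothetical IDs as read from the "bad" file, still carrying CR LF
my $re = build_name_regex("I C 7700\r\n", "\r\n", "I C 7701\r\n");

for my $name ('I C 7700.jpg', 'I C 9999.jpg') {
    printf "%-14s %s\n", $name, ($name =~ $re ? 'matches' : 'does not match');
}
```

In the main script the lines would come from the ID file (read line by line, or via read_file) and the resulting regex would be handed to find( file => name => $re, in => $startdir ) in place of qr{$fnames}.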

