PerlMonks  

Text::CSV_XS and encoding

by PeterKaagman (Sexton)
on Sep 16, 2018 at 14:40 UTC

PeterKaagman has asked for the wisdom of the Perl Monks concerning the following question:

Hi there monks

The last couple of days I've been playing around with Text::CSV_XS to import some CSV data. That goes quite well up to the point where I encounter special characters.

I've made myself a test file:

    Naam,Adres,Woonplaats
    Peter,Liër,ôlsten

and try to parse that with the following code:

    #! /usr/bin/perl -w
    use strict;
    use Text::CSV_XS qw( csv );
    use Data::Dumper;

    my @rows;
    open my $FH, "<", "./test2.csv" or die "./test2.csv $!";
    my $aoh = csv( in => $FH, headers => 'auto' );
    close $FH;
    print Dumper($aoh);

which results in the following output:

    $VAR1 = [
              {
                'Woonplaats' => "\x{f4}lsten",
                'Adres' => "Li\x{eb}r",
                'Naam' => 'Peter'
              }
            ];

I tried to resolve this by putting an encoding on the file open, like "<:encoding(UTF-8)", or by adding "encoding => 'UTF-8'" to the csv parameters. Neither has the desired effect.

Perhaps a completely new problem, although I think it is related: the stream I need to parse (originally) has some gibberish at the start. I think that is a BOM. So I tried adding "detect_bom => 1" to the csv parameters, as the man page suggests. This results in an error:

    # CSV_XS ERROR: 1000 - INI - Unknown attribute 'detect_bom' @ rec 0 pos 0
    INI - Unknown attribute 'detect_bom' at ./parse.pl line 12.

    shell returned 25

Could anyone shine a light on this for me?

Peter

Replies are listed 'Best First'.
Re: Text::CSV_XS and encoding
by tangent (Parson) on Sep 16, 2018 at 23:28 UTC
    The Data::Dumper output is correct, but it is showing the escaped characters - for example \x{f4} is the "latin small letter o with circumflex".
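
To make that point concrete, here is a minimal sketch using only core modules. It assumes the value from the Dumper output above and shows that "\x{f4}lsten" is a perfectly ordinary six-character string; the variable names are just for illustration.

```perl
use strict;
use warnings;
use charnames ();            # core; viacode() maps a code point to its Unicode name
use Encode qw( encode );     # core

# The value Data::Dumper printed as "\x{f4}lsten"
my $town = "\x{f4}lsten";

# Name of the first character, and the bytes it becomes when encoded as UTF-8
my $name     = charnames::viacode(ord($town));
my $utf8_hex = join " ", unpack "(H2)*", encode("UTF-8", substr($town, 0, 1));

printf "characters: %d\n", length($town);
printf "first char: U+%04X %s\n", ord($town), $name;
printf "as UTF-8:   %s\n", $utf8_hex;
```

So the data is intact after parsing; what is left is getting it out of the program in an encoding the destination understands.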

    I suggest you try printing to a file like this:

        open my $in, "<:encoding(UTF-8)", "in_file.csv" or die "in_file.csv $!";
        my $aoh = csv( in => $in );
        close $in;

        open my $out, ">:encoding(UTF-8)", "out_file.txt" or die "out_file.txt $!";
        for my $hash (@$aoh) {
            while (my($key,$val) = each %$hash) {
                print $out "$key => $val\n";
            }
        }
        close $out;

      Curious....

      Had to adapt your example a bit: you do need "headers => 'auto'" to make the result an array of hashes. I sent the output to both the console and the file. On the console it is screwed up again (I did try that before), but vim reads out.txt just fine (I had not tried that before).

      Console:

          1.36
          Naam => Peter
          Woonplaats => ▒lsten
          Adres => Li▒r

      File:

          Naam => Peter
          Woonplaats => ôlsten
          Adres => Liër

      The goal is to put the values of 4500 students and 600 staff members in a database, so I'll go ahead with that and see how it turns out.

      For the curious: the ultimate goal is an interface which sits between our school information system and Microsoft's School Data Sync. C# and Delphi were suggested as tooling. I went ahead with Perl.

        binmode STDOUT, ":encoding(utf-8)"; to the rescue.


        Enjoy, Have FUN! H.Merijn
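
A minimal sketch of what an output encoding layer changes, using only core Perl. In-memory file handles stand in for the console here so that the bytes can be inspected; that substitution and the variable names are assumptions for illustration. The same ":encoding(UTF-8)" layer can be put on STDOUT with binmode.

```perl
use strict;
use warnings;

my $town = "\x{f4}lsten";

# Without a layer, the character U+00F4 goes out as the single byte F4,
# which a UTF-8 terminal cannot display.
open my $raw, ">", \my $without_layer or die "open: $!";
print {$raw} $town;
close $raw;

# With the layer, Perl encodes the character as the two bytes C3 B4.
open my $enc, ">:encoding(UTF-8)", \my $with_layer or die "open: $!";
print {$enc} $town;
close $enc;

my $hex_without = join " ", unpack "(H2)*", $without_layer;
my $hex_with    = join " ", unpack "(H2)*", $with_layer;

print "no layer:   $hex_without\n";
print "with layer: $hex_with\n";
```

This would explain why the file written through ">:encoding(UTF-8)" looked fine in vim while the console output did not.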
Re: Text::CSV_XS and encoding
by poj (Abbot) on Sep 16, 2018 at 15:05 UTC

    What version of Text::CSV_XS do you have? Try upgrading to the latest version, 1.36.

    print $Text::CSV_XS::VERSION;

    poj

      Nope that is not it... :(

          pkn@precious:~/scripts/perl/csv$ ./parse.pl
          1.36
          $VAR1 = [];

      I did manage to get "detect_bom => 1" into it with the new version, but as a result it no longer parses the file (correctly).

          pkn@precious:~/scripts/perl/csv$ ./parse.pl
          1.36
          $VAR1 = [
                    {
                      'Naam' => 'Peter',
                      'Adres' => "Li\x{eb}r",
                      'Woonplaats' => "\x{f4}lsten"
                    }
                  ];

      Without BOM detection it parses the test file, but the encoding is still screwed up.

      Got some more reading to do. One thing I came across was:

          my $aoh = csv( in => $FH, headers => 'auto' );

      I thought having "headers => 'auto'" in there would trigger automagic detection of the encoding. According to something I read in a man page, this is not the case. Now if I could only remember which man page I was reading, Text::CSV or Text::CSV_XS :S. It should not be too hard to find again.

      Version 1.21-1 from the Ubuntu repo.
      Will reinstall from CPAN and try again.
      Thanks, I did not think of that.

        The detect_bom attribute was added in 1.22. I must admit that the ChangeLog was not very clear about that: it was part of the new header work, and naming all the attributes that came with it didn't look very useful at the time. The docs clearly state:

        BOM (or Byte Order Mark) handling is available only inside the "header" method.

        The BOM-related changes in versions 1.25, 1.31, 1.33, 1.34, and 1.35 make its use more reliable. Note that BOM handling is unreliable (or not working at all) in perl-5.6.x.
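
For readers wondering what the "gibberish" at the start of such a stream is, here is a core-only sketch. It assumes a UTF-8 BOM (the bytes EF BB BF) in front of the test data and strips it by hand. This is not how Text::CSV_XS implements detect_bom, and the naive split on commas is only there to show the first column name; the sample data and variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw( decode );   # core

# Hypothetical input: UTF-8 BOM followed by the UTF-8 encoded test file
my $bytes = "\xEF\xBB\xBFNaam,Adres,Woonplaats\nPeter,Li\xC3\xABr,\xC3\xB4lsten\n";

# After decoding, the BOM is the single character U+FEFF at the start
my $text = decode("UTF-8", $bytes);

# Strip it; left in place it would become part of the first header name
my $had_bom = ($text =~ s/\A\x{FEFF}//) ? 1 : 0;

my ($header_line) = split /\n/, $text;
my @columns       = split /,/, $header_line;

print "BOM found: $had_bom\n";
print "columns:   @columns\n";
```

In real code, letting the header method (or the csv function) deal with the BOM is the safer route, since it also handles the other BOM variants.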


        Enjoy, Have FUN! H.Merijn
Re: Text::CSV_XS and encoding
by BillKSmith (Monsignor) on Sep 16, 2018 at 20:21 UTC
    I ran your code "as is" on Strawberry perl 5.24.1 under Windows 7. The output in the command window was different from yours, but also wrong. I then redirected the output to a file. That file is displayed correctly by both the editor "gvim" and the browser "Internet Explorer". To remove any ambiguity, I have converted that file to hex with the utility "xxd", which ships with vim.

        C:\Users\Bill\forums\monks>xxd tmp.txt
        00000000: 2456 4152 3120 3d20 5b0d 0a20 2020 2020  $VAR1 = [..     
        00000010: 2020 2020 207b 0d0a 2020 2020 2020 2020       {..        
        00000020: 2020 2020 2741 6472 6573 2720 3d3e 2027      'Adres' => '
        00000030: 4c69 eb72 272c 0d0a 2020 2020 2020 2020  Li.r',..        
        00000040: 2020 2020 2757 6f6f 6e70 6c61 6174 7327      'Woonplaats'
        00000050: 203d 3e20 27f4 6c73 7465 6e27 2c0d 0a20   => '.lsten',.. 
        00000060: 2020 2020 2020 2020 2020 2027 4e61 616d             'Naam
        00000070: 2720 3d3e 2027 5065 7465 7227 0d0a 2020  ' => 'Peter'..  
        00000080: 2020 2020 2020 2020 7d0d 0a20 2020 2020          }..     
        00000090: 2020 205d 3b0d 0a                        ];..

    Bill
Re: Text::CSV_XS and encoding
by bliako (Monsignor) on Sep 16, 2018 at 22:43 UTC

    Yes poj, Data::Dumper does this for my UTF-8 strings too, and it gets me stressed every time. Can anyone suggest an alternative dumper that bloody prints UTF-8 in a less scary way?
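
One possible answer, offered as a sketch rather than a recommendation: the core module JSON::PP leaves non-ASCII characters unescaped as long as neither its ascii nor its utf8 option is switched on, so it can double as a readable dumper for plain data structures. The sample data below is taken from the question; using JSON::PP for this purpose is an assumption of this example, not something suggested elsewhere in the thread.

```perl
use strict;
use warnings;
use JSON::PP;   # core since Perl 5.14

my $aoh = [
    { Naam => "Peter", Adres => "Li\x{eb}r", Woonplaats => "\x{f4}lsten" },
];

# No ->utf8 and no ->ascii: encode() returns a character string in which
# non-ASCII characters are left as they are.
my $dumper = JSON::PP->new->pretty->canonical;
my $text   = $dumper->encode($aoh);

# Encode on the way out so a UTF-8 terminal shows the characters.
binmode STDOUT, ":encoding(UTF-8)";
print $text;
```

This only covers data that JSON can represent (no code refs, no objects without extra options), which is enough for an array of hashes coming out of a CSV parser.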

Re: Text::CSV_XS and encoding
by PeterKaagman (Sexton) on Sep 17, 2018 at 21:29 UTC

    Just to let you all know

    • The funny escape sequence at the start of my stream is gone
    • Special chars appear nicely in an output file (assuming it will also work for the database)

    This monk has found some rest in this monastery.... thank you all _o_

    Putting it all together:

        my @medewerkers;
        open(my $FH, '<:encoding(utf8)', \$res->content) || die("could not open result as file: $!");
        my $csv = Text::CSV->new({ sep_char => ';', binary => 1, auto_diag => 1});
        $csv->header($FH,{detect_bom => 1});
        while (my $row = $csv->getline_hr($FH) ){
            push @medewerkers, $row;
        }
        close $FH;
        return \@medewerkers;

      As Text::CSV is back in sync with Text::CSV_XS, you can shorten that a lot :)

          #!/usr/bin/env perl

          use 5.14.2;
          use warnings;
          use Text::CSV "csv";
          use Data::Peek;

          my $content = <<"EOC";
          id;naaam;datum in dienst;functie
          1;Jip;2001-04-14;Chef lege dozen
          2;Janneke;2013-10-01;Miep kraak
          EOC

          my @medewerkers;
          open my $FH, "<:encoding(utf8)", \$content or
              die "could not open result as file-handle: $!";
          my $csv = Text::CSV_XS->new ({
              sep_char  => ";",
              binary    => 1,
              auto_diag => 1,
              });
          $csv->header ($FH, { detect_bom => 1 });
          while (my $row = $csv->getline_hr ($FH)) {
              push @medewerkers, $row;
              }
          close $FH;
          DDumper \@medewerkers;

          [
            {
              'datum in dienst' => '2001-04-14',
              functie => 'Chef lege dozen',
              id => '1',
              naaam => 'Jip'
            },
            {
              'datum in dienst' => '2013-10-01',
              functie => 'Miep kraak',
              id => '2',
              naaam => 'Janneke'
            }
          ]

      Now shorten that to

          #!pro/bin/perl

          use 5.14.2;
          use warnings;
          use Text::CSV "csv";
          use Data::Peek;

          my $content = <<"EOC";
          id;naaam;datum in dienst;functie
          1;Jip;2001-04-14;Chef lege dozen
          2;Janneke;2013-10-01;Miep kraak
          EOC

          DDumper csv (in => \$content, bom => 1);

      Or, if you want to be more explicit

          csv (in => \$content, bom => 1, sep => ";");

      Which would reduce your function to

          sub csv_content {
              my $res = shift;
              return csv (in => \$res->content, bom => 1, sep => ";");
              } # csv_content

      Add some error handling and you're done :)


      Enjoy, Have FUN! H.Merijn
