Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Substitute and converting to UTF8

by tomred (Novice)
on Jan 08, 2021 at 13:11 UTC ( #11126599=perlquestion: print w/replies, xml ) Need Help??

tomred has asked for the wisdom of the Perl Monks concerning the following question:

Dear gurus,

Not for the first time I found myself duelling with character encoding.

I am trying to process CSV files from a variety of sources. The data, once parsed by Text::CSV_XS, is encoded by Cpanel::JSON::XS and stored in a UTF8 database table.

I often get caught by fatal errors, from Cpanel::JSON::XS:

"Error creating json: malformed or illegal unicode character in string"
or
utf8 "\x92" does not map to Unicode.

I'd like to create a hash of the most often offending characters and their substitutions but I can't seem to find a means of capturing the hex value. In the code below, I've cat'ed a sample into the bottom of the file.

What am I missing? Thanks in advance,

Dermot
use v5.22; use warnings; # Causes "Wide Character.." warning for :std #use utf8; my %swaps = ( '91' => '‘', '92' => '’', '94' => '”', '96' => '–', 'A9' => '©', ); while (my $line = <DATA>) { say "Before=".$line; # Works. #my ($key) = $line =~ m/(\x92)/; # All fail #my ($key) = $line =~ /(\\x[[:xdigit:]]{1,3})/; #my ($key) = $line =~ m/(\x\d{2})/; #my ($key) = $line =~ /(0x[0-9A-F]{1,3})/i; #my ($key) = $line =~ m/(0x\d{1,3})/; #my ($key) = $line =~ m/(\0x\d{1,3})/; my ($key) = $line =~ m/0x(?-i:[\da-f]+)/; say "Key=$key"; $line =~ s/$key/$swaps{$key}/g if $key; say "After=".$line; } # This sample has \x92 between "heart" and "s". # In xxd, the line looks like this # 0000080: 6420 7061 6e73 792c 2068 6561 7274 9273 d pansy, heart.s __DATA__ "Wild pansy (Viola tricolor), 19th century illustration","19th-century + hand painted illustration of wild pansy, heart<92>s ease, or love in + idleness (Viola tricolor) flower by Pierre-Joseph Redoute (1759-1840 +). Published in Choix Des Plus Belles Fleurs, Paris (1827).",N/A,"Pan +sy, pansies, wild, Viola tricolor, 19th century, painted, Engraving, +illustration, nobody, no-one, flower, artwork, Pierre Joseph redoute, + bloom, blossom, botanical, botanist, bud, flora, floral, history, hi +storic, horticulture, leaves, petal, petals, plant, vintage, watercol +or, flower head, painting, stem, victorian style, botanic, flowers, p +lants, Botany",,C,Fl,N/A,,,,^M

Replies are listed 'Best First'.
Re: Substitute and converting to UTF8
by choroba (Archbishop) on Jan 08, 2021 at 14:01 UTC
    Rather than fixing the consequences of wrong input, fix the decoding of the input.

    When reading from the CSV, follow the documentation and use

    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 }); # ~~~~~~~~~~~ open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!"; # ~~~~~~~~~~~~~~~

    When encoding to JSON, be sure to decode the string, as Cpanel::JSON::XS works with UTF-8 encoded strings.

    #!/usr/bin/perl use strict; use warnings; use feature qw{ say }; use charnames ':full'; use Cpanel::JSON::XS; use Encode qw{ encode }; my $decoded = qq({"yellow horse": "\N{LATIN SMALL LETTER Z WITH CARON} +lu\N{LATIN SMALL LETTER T WITH CARON}ou\N{LATIN SMALL LETTER C WITH C +ARON}k\N{LATIN SMALL LETTER Y WITH ACUTE} k\N{LATIN SMALL LETTER U WI +TH RING ABOVE}\N{LATIN SMALL LETTER N WITH CARON}"}); my $encoded = encode('UTF-8', $decoded); my $structure = decode_json($encoded); binmode *STDOUT, ':encoding(UTF-8)'; say $structure->{'yellow horse'};

    Note that I also set the encoding of the output handle.

    I can't show you how to set the encoding/decoding properly for the database as you haven't told us what driver you use.

    NOTE: In real life, I'd use use utf8; and type "žluťoučký kůň" directly in the script, but PerlMonks can't display code containing non-latin1 characters. Some people even recommend to write your code this way, but I find utf-8 more readable.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      The exiting (production) code uses the open pragma.

      use open qw/:std :encoding(utf8)/;

      From what I tell, that forces all $fh to UTF-8. After that, I found all hex-type substitution fail. As I've been labouring with the intention of substituting, I've been testing with '<:raw' instead.

      But it looks like the encode step is what's needed. I can at least encode_json now.

      use v5.22; use warnings; use Devel::Dwarn; use Cpanel::JSON::XS; use Encode qw/ decode encode /; use Text::CSV_XS; my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 }); while (my $line = <DATA>) { say "Before=".$line; my $string = encode('UTF8', $line); my $x = $csv->parse($string); warn $x if !$x; my @data = $csv->fields; Dwarn \@data; my $structure = encode_json(\@data); } __DATA__ IR_Choix_0092.tif,C,psi,,"Wild pansy (Viola tricolor), 19th century il +lustration","19th-century hand painted illustration of wild pansy, he +art<92>s ease, or love in idleness (Viola tricolor) flower by Pierre- +Joseph Redoute (1759-1840). Published in Choix Des Plus Belles Fleurs +, Paris (1827).",N/A,"Pansy, pansies, wild, Viola tricolor, 19th cent +ury, painted, Engraving, illustration, nobody, no-one, flower, artwor +k, Pierre Joseph redoute, bloom, blossom, botanical, botanist, bud, f +lora, floral, history, historic, horticulture, leaves, petal, petals, + plant, vintage, watercolor, flower head, painting, stem, victorian s +tyle, botanic, flowers, plants, Botany",,C,Fl,N/A,,,,^M

      A big thank you. It's been a very frustrating day.

      Dermot
Re: Substitute and converting to UTF8
by hippo (Chancellor) on Jan 08, 2021 at 13:59 UTC
    I'd like to create a hash of the most often offending characters and their substitutions

    You could do it that way but it's a losing battle. Is there some reason not to decode your CSV data from whatever encoding it really uses (as it is clearly not utf-8) and then process that decoded data? That ought to be the simpler approach.


    🦛

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11126599]
Approved by LanX
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (3)
As of 2021-01-22 00:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Notices?