Beefy Boxes and Bandwidth Generously Provided by pair Networks Bob
Perl-Sensitive Sunglasses
 
PerlMonks  

UTF8 and Postgresql

by cm029 (Novice)
on Apr 30, 2008 at 20:26 UTC ( #683785=perlquestion: print w/ replies, xml ) Need Help??
cm029 has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, all: I'm trying to import a corpus of spam mail into a Postgresql database using Email::Folder to read in an MBox mail store. I'm parsing it and getting the info I want but I'm receiving an error trying to put the data into the database: I'm using a prepared statement to do the INSERT INTO and it works for most of the records. Some are giving this error:
DBD::Pg::st execute failed: ERROR: invalid byte sequence for encoding + "UTF8": 0x95 HINT: This error can also happen if the byte sequence does not match +the encoding expected by the server, which is controlled by "client_e +ncoding".
How do I convert or otherwise clean the data before putting it into the database? If I lose characters, I am OK with that. This is only for my own personal use so a bit of lost data on foreign language emails that I cannot read anyway is fine. Any help appreciated. I can post the (admittedly ugly and unpolished) code if necessary. Thanks!

Comment on UTF8 and Postgresql
Download Code
Re: UTF8 and Postgresql
by shmem (Canon) on Apr 30, 2008 at 21:01 UTC
    Not knowing the column spec of the table you are trying to import the data into, nor Postgres very well, nor even the purpose to which end you are storing spam, this is a complete shoot into the dark - but you might want to import that spam data full of nasty unicode into a blob (binary large object) field.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: UTF8 and Postgresql
by mscharrer (Hermit) on Apr 30, 2008 at 21:04 UTC
    Some emails seem to be in a non-ASCII, non-UTF8 encoding, e.g. ISO-8859-1. You could try to detect this and convert them to UTF8 before you pass them to the DB. Search for 'Unicode' at CPAN and look which of the modules there are best for you.

    If you just like to remove all non-ASCII characters, you could use tr/\200-\377//d.

Re: UTF8 and Postgresql
by pc88mxer (Vicar) on Apr 30, 2008 at 21:09 UTC
    I have a feeling that something is broken with the DBD::Pg driver. If it was Unicode clean, you shouldn't have this problem. Clearly your database is set up to store text as UTF8, and the perl side certainly knows about Unicode, so why the disconnect ???

    One solution would be to simply remove all non-ASCII characters before insertion:

    $text =~ s/[^\0-\x7f]//g;

    Another potential option (if you have administrator privs for the db) is to change the client_encoding to something like LATIN1.

    Yet another possibility is to convert your text to UTF8:

    use Encode; ... my $bytes = encode('utf8', $text); # insert $bytes instead of $text
    I think this last solution has a good chance of working if my theory of what's happening is correct. You'll have to check what you get out of the database when you read the data back - hopefully it will be the same as $text. If it isn't, try decoding it using decode('utf8', ...).
      Thanks, everyone! The encode() worked perfectly.
Re: UTF8 and Postgresql
by almut (Canon) on Apr 30, 2008 at 22:14 UTC

    Grepping through the source of Email::Folder doesn't find a single occurrence of "utf"... so I suspect the module is not UTF-8 aware. In this case your mail data is at the moment Perl-internally probably not handled as unicode, but as a mere octet/byte-string. Or are you decoding it yourself manually, already? If not, you could try to pass the data through Encode::decode to sanitize it and convert it to unicode text, which the DBD::Pg driver will hopefully then handle correctly:

    use Encode; ... my $utf8 = Encode::decode("UTF-8", $mail);

    The optional third 'CHECK' argument to decode() defaults to Encode::FB_DEFAULT, which should be what you want, i.e. it replaces invalid UTF-8 sequences with the (valid) codepoint FFFD (REPLACEMENT CHARACTER) — also see Handling Malformed Data, so Postgresql should no longer have any reason to complain...

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://683785]
Approved by Corion
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (10)
As of 2014-04-17 11:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (446 votes), past polls