Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Perl and Postgresql: Invalid byte sequence for encoding "UTF8"

by StoneTable (Beadle)
on Dec 21, 2006 at 20:19 UTC ( [id://591180]=perlquestion: print w/replies, xml ) Need Help??

StoneTable has asked for the wisdom of the Perl Monks concerning the following question:

Here's one that's been a persistent thorn in my side.

I'm trying to import some of my apache logs for statistical analysis. I get various bits of UTF-8 in there, either in the URI or the user-agent field. I've tried various hacks, such as running the text through tr, but I've never managed to solve this cleanly.

My locale is defined as UTF-8.
The database encoding is UTF8.
I have use UTF8;.

The error points to the text containing invalid UTF8 characters. What I'd like to do is truncate any invalid UTF8 characters, but I'm not sure how to go about that.

The error:
DBD::Pg::st execute failed: ERROR: invalid byte sequence for encoding "UTF8": 0xb0

I could ignore the errors, but I'm trying to batch insert the data. If it hits one of the invalid byte sequences, the entire transaction is rolled back. Not good, obviously.

What ways have you found to best handle/strip invalid UTF8 characters?

  • Comment on Perl and Postgresql: Invalid byte sequence for encoding "UTF8"

Replies are listed 'Best First'.
Re: Perl and Postgresql: Invalid byte sequence for encoding "UTF8"
by ides (Deacon) on Dec 22, 2006 at 14:47 UTC

    I ran into this recently myself with a project to insert E-mail messages into a PostgreSQL database. I solved it like so:

    use utf8; use Encode; my $possibly_bad_utf8_data = get_data(); my $good_data = encode( "UTF-8", $possibly_bad_utf8_data );

    The capitalization of the UTF-8 in the call to encode() is important, it tells encode to be "strict" about the UTF-8. It turns out use utf8; isn't 100% strict and hence can't be inserted into a strict mode PostgreSQL database.

    Hope this helps.

    Frank Wiles <frank@revsys.com>
    www.revsys.com

      Remarkable, thanks!

      I had tried using Encode before, but missed the "UTF-8" bit apparently. It's working perfectly now.

        It is NOT the capitalization that is needed. Encode is case-insensitive for the encoding. It is the hyphen that makes the difference, see this example:
        use Encode qw(resolve_alias); my @aliases = ('utf-8', 'UTF-8', 'utf8',); for my $alias ( @aliases ) { my $canonical_name = Encode::resolve_alias($alias); print "$alias \t has canonical name $canonical_name\n"; }
Re: Perl and Postgresql: Invalid byte sequence for encoding "UTF8"
by ferreira (Chaplain) on Dec 22, 2006 at 13:10 UTC

    This issue was recently discussed in London-pm mailing list. See the thread here (give special attention to Matt Lawrence's last posts):

    http://london.pm.org/pipermail/london.pm/Week-of-Mon-20061127/thread.html#5758

    It had to do with some weirdness of the bytea data type and the fine/fast solution was to use:

    $sth->bind_param($param_num, $bind_value, { pg_type => DBD::Pg::PG_BYTEA });
    where $sth is a DBI statement over a DBD::Pg connection, $param_num is the position of your placeholder in the SQL statement, $bind_value is the content that is making DBD::Pg unhappy. There is more info at the mentioned thread: escaping the data, handling BLOBs in Postgresql, dealing with Class::DBI, etc.
Re: Perl and Postgresql: Invalid byte sequence for encoding "UTF8"
by diotalevi (Canon) on Dec 21, 2006 at 23:09 UTC

    Encode has functions for error detection and recovery.

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Perl and Postgresql: Invalid byte sequence for encoding "UTF8"
by Fletch (Bishop) on Dec 21, 2006 at 21:31 UTC

    A complete kludge, but perhaps munge the possibly-invalid data through MIME::Base64's encode_base64 and decode_base64 routines into/out of the database? Presuming your columns are large enough to handle the inflation that might be a work around until you figure out the right way to do it.

Re: Perl and Postgresql: Invalid byte sequence for encoding "UTF8"
by Khen1950fx (Canon) on Dec 21, 2006 at 21:58 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://591180]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2025-04-18 03:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.