PerlMonks  

Malformed UTF-8 character Error

by walkingthecow (Friar)
on May 11, 2010 at 21:15 UTC (#839503)
walkingthecow has asked for the wisdom of the Perl Monks concerning the following question:

I have a mail log file (qmail) that I am processing line by line and trying to create a map so it's easier to read the mail log. The problem is that qmail has so many different patches and is logging in so many different ways. For the most part everything processes; however, I continue to get the following errors for some lines:

Malformed UTF-8 character (unexpected end of string)
Malformed UTF-8 character (unexpected continuation byte 0xae, with no preceding start byte)

The lines that are giving me trouble have odd characters like the following:
@400000004be972c621c0b57c 12760 > 4^Pd"}^HL\kk^T<89><99>-W!##<81> ^Zo^NoN^U^C^A^@ <83><9d>^L^Nq%^R^Z.9 7G'q<9d>7^M
@400000004be9731d05945a34 13997 < Subject: Don<92>t Pay Retail Prices^M
@400000004be9739614b4c9f4 15230 < Subject: The truth about work-at-home opportunities<85>^M
@400000004be9802004c81154 14584 > 4<81>s^BQz=^H<98>,<96>VN)rPp^Uq/<98><9f><93><97><89> ^E<9b>QMbs?^KR/o$^H^T+

How do I decode these lines, find out what their encoding is, or skip them altogether and stop the warnings?

I have tried Encode::Guess to no avail. I also used the following bit of code to get an idea of what the data looks like, but I am still getting the Malformed errors:

    use warnings;
    use strict;
    use Devel::Peek 'Dump';

    while ( my $line = <> ) {
        Dump $line;
    }


UPDATE: I'd really just be interested in skipping a line if it is not UTF-8, or not dealing with these lines at all.

Re: Malformed UTF-8 character Error
by grantm (Parson) on May 11, 2010 at 21:29 UTC
    It looks like you might have a mixture of encodings in your log file. You could try piping it through Encoding::FixLatin. The output might still be garbage (like your example above), but at least it should be well-formed UTF-8 garbage :-)
Re: Malformed UTF-8 character Error
by ikegami (Pope) on May 11, 2010 at 22:48 UTC

    Only the first three fields of this log file are text. The fourth appears to be portions of raw emails. To make sense of a given line, you would first need to reassemble the email by identifying the lines that belong to the same message. The leading fields surely provide the information needed to do that. Then you can process the message as any other mail client would, and properly decode the text if it is text.
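A minimal sketch of that reassembly step follows. It assumes, without confirmation from the thread, that the second field identifies the connection or process and that the third field (`<` or `>`) is the direction. `group_by_session` is a hypothetical helper, and the third sample line is made up purely to show two lines landing in the same group.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the raw fourth field of each line, keyed by
# the second field (ASSUMED to be a session/process id) and the third field
# (ASSUMED to be the direction marker).
sub group_by_session {
    my (@lines) = @_;
    my %sessions;
    for my $line (@lines) {
        (my $copy = $line) =~ s/\r?\n\z//;          # strip LF or CRLF
        my ($stamp, $id, $dir, $data) = split / /, $copy, 4;
        next unless defined $data;                  # fewer than four fields
        push @{ $sessions{$id}{$dir} }, $data;      # keep bytes undecoded
    }
    return \%sessions;
}

my @sample = (
    "\@400000004be9731d05945a34 13997 < Subject: Don\x92t Pay Retail Prices\r\n",
    "\@400000004be9739614b4c9f4 15230 < Subject: The truth about work-at-home opportunities\x85\r\n",
    "\@400000004be9731d05945a35 13997 < X-Example: made-up second line\r\n",   # not from the real log
);

my $sessions = group_by_session(@sample);
for my $id (sort keys %$sessions) {
    for my $dir (sort keys %{ $sessions->{$id} }) {
        printf "%s %s %d line(s)\n", $id, $dir, scalar @{ $sessions->{$id}{$dir} };
    }
}
```

The payload is deliberately left as bytes here; decoding only makes sense once a whole message has been put back together and its declared charset is known.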

    I doubt that the last field of the first line and the last field of the fourth line contain text. They appear to be part of the payload of non-text emails. It's no surprise that you run into problems if you treat them as text.

    The last field of the second line and the last field of the third line appear to be email headers. You would have to reassemble the message to decisively determine the encoding used by each message, but both appear to be encoded using cp1252. By no means does that mean that every email you logged used cp1252.
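To illustrate the cp1252 guess, here is a small sketch that decodes the two Subject lines from the question with the core Encode module. The choice of cp1252 is only the guess made above, not something the log confirms, and `cp1252_to_text` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: interpret a byte string as cp1252.
# ASSUMPTION: cp1252 is a guess; a real message may declare another charset.
sub cp1252_to_text {
    my ($bytes) = @_;
    return decode('cp1252', $bytes);
}

# \x92 and \x85 are the raw bytes the question shows as <92> and <85>.
my @raw = (
    "Subject: Don\x92t Pay Retail Prices",
    "Subject: The truth about work-at-home opportunities\x85",
);

for my $bytes (@raw) {
    # Encode explicitly for output instead of relying on an I/O layer.
    print encode('UTF-8', cp1252_to_text($bytes)), "\n";
}
# In cp1252, 0x92 is U+2019 (right single quotation mark) and
# 0x85 is U+2026 (horizontal ellipsis).
```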

    Update: Added first paragraph. Phrasing improvements elsewhere.

Re: Malformed UTF-8 character Error
by ikegami (Pope) on May 11, 2010 at 23:25 UTC

    UPDATE: I'd really just be interested in skipping a line if it is not UTF-8, or not dealing with these lines at all.

    Do you realize that you'd be skipping all four lines you posted since none are valid UTF-8?

    It's easy to do:

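        use strict;
        use warnings;
        use open ':std', ':locale';

        use Encode qw( decode );   # an empty import list would leave decode() undefined

        my $log = 'log';
        open(my $fh, '<:raw:perlio', $log)
            or die("Can't open log file \"$log\": $!\n");

        while (<$fh>) {
            s/\r?\n\z//;                          # strip LF or CRLF
            my $data = (split(/ /, $_, 4))[3];    # fourth field: the logged bytes
            next if !defined($data);              # line has fewer than four fields
            # decode() dies on malformed UTF-8, so eval returns an empty list
            my ($text) = eval { decode("UTF-8", $data, Encode::FB_CROAK) }
                or next;
            print("$text\n");                     # newline was stripped above
        }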
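The same check can be pulled out into a small predicate, which makes it easy to see which of the posted lines would be skipped. This is only a sketch: `is_valid_utf8` is a hypothetical helper, and the sample strings are fragments of the lines in the question plus one correctly encoded variant added for contrast.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: true if $bytes is a well-formed UTF-8 byte sequence.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops
    # decode() from modifying $bytes in place.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my %samples = (
    'plain ASCII'          => "Subject: hello",
    'cp1252 apostrophe'    => "Subject: Don\x92t Pay Retail Prices",
    'UTF-8 apostrophe'     => "Subject: Don\xe2\x80\x99t Pay Retail Prices",
);

for my $name (sort keys %samples) {
    printf "%-20s %s\n", $name, is_valid_utf8($samples{$name}) ? 'valid' : 'malformed';
}
```

Note that pure ASCII always passes, so this test alone cannot tell "really UTF-8" from "happens to contain no high bytes".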
Re: Malformed UTF-8 character Error
by graff (Chancellor) on May 12, 2010 at 06:36 UTC
    Rather than skipping whole lines because they contain non-character (or mixed encoding) data -- which might be most or all of the lines -- you could just "neutralize" the non-character content, by converting it to something innocuous. Here's an example, adopting the "open" method shown above by ikegami:
        use strict;
        use warnings;

        my $log = shift   # let's put the log file name in @ARGV
            or die "Usage: $0 name_of_input_file\n";
        open(my $fh, '<:raw:perlio', $log)
            or die("Can't open input file \"$log\": $!\n");

        while (<$fh>) {
            tr/\x09\x0a\x0d -~/./c;
            print;
        }
    That just converts every byte that is not a tab, line feed, carriage return, or printable ASCII character to "." -- which lets you see where the non-printable stuff is while leaving the printable ASCII legible, with no worries about character encoding errors.
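As a quick illustration of what that `tr///` does to one of the problem lines, here is a sketch that wraps it in a function and applies it to a string rather than a file. `neutralize` is a hypothetical name; the sample bytes come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $bytes in which every byte other than
# tab, LF, CR and printable ASCII (space through ~) is replaced by ".".
sub neutralize {
    my ($bytes) = @_;
    (my $clean = $bytes) =~ tr/\x09\x0a\x0d -~/./c;
    return $clean;
}

print neutralize("Subject: Don\x92t Pay Retail Prices\r\n");
# prints "Subject: Don.t Pay Retail Prices" followed by CR LF
```

Because the replacement is one byte for one byte, line lengths and field positions stay the same, which is handy if later processing splits on spaces.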
Re: Malformed UTF-8 character Error
by Krambambuli (Deacon) on May 13, 2010 at 15:28 UTC
    Sorry if my question is a bit off-topic, but have you looked in the 'qmail world' for existing solutions to what you're after in the first place?

    A quick Google search for 'qmail log' brings up a lot of links; one of the first is Working with qmail's log files, which points to a few useful scripts, like mtrack, strack or convert-multilog.

    There might be others too; perhaps one of those is at least a good starting point?

    Just an idea, maybe it helps too.


    Krambambuli
    ---
      It's a good suggestion, but of course I looked before deciding to write my own script. The problem with qmail, and other software by Daniel Bernstein, is that it relies so heavily on patches to do anything. So, mtrack/strack are built to work with core qmail logs, but we have 50+ patches on this qmail server, so the logs are completely different from the logs mtrack expects. I used mtrack/strack as a basis for my script, but had to heavily modify it to work in our environment.
