Note: the following is not about a bug in Perl itself (it concerns a misused feature), but about a bug that might be in your Perl program. What follows is a detailed account of what was found and presented to the audience.

At the Technical Dutch Open Source Event, T-DOSE, during the workshop "Exploiting Open Source" led by security consultant Tim Hemel, a flaw was discussed that exists in several Perl programs.

Technical background

Perl has Unicode strings, which are encoded internally as either ISO-8859-1 or UTF8. A flag called "SvUTF8", a.k.a. "the UTF8 flag", is set to 1 for strings that are UTF-8 internally, and to 0 for strings that are ISO-8859-1 (or raw binary) internally. On the Perl side of things, regardless of the internal encoding, you have a string that consists of characters (not bytes).
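To make the flag concrete, here is a minimal sketch using the core functions utf8::upgrade and utf8::is_utf8. The variable names are only illustrative. It builds two strings with the same characters but different internal encodings:

```perl
use strict;
use warnings;

# "caf" followed by e-acute (U+00E9). This fits in ISO-8859-1, so Perl
# stores it with the UTF8 flag off.
my $latin1 = "caf\xe9";

# A copy, converted to the UTF-8 internal representation.
my $utf8 = $latin1;
utf8::upgrade($utf8);

my $flag_latin1 = utf8::is_utf8($latin1) ? 1 : 0;   # flag off
my $flag_utf8   = utf8::is_utf8($utf8)   ? 1 : 0;   # flag on
my $same        = $latin1 eq $utf8       ? 1 : 0;   # same characters
my $len         = length $utf8;                     # counted in characters

print "flags: $flag_latin1/$flag_utf8, same: $same, length: $len\n";
# prints "flags: 0/1, same: 1, length: 4"
```

The two strings compare equal and have the same length, even though one of them occupies five bytes internally.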

Once the UTF8 flag is set, Perl does not check the validity of the UTF8 sequences further. Typically, this is okay, because it was Perl that set the flag in the first place. However, some people set the UTF8 flag manually. They circumvent protection built into encoding/decoding functions and PerlIO layers, either because it's easier (less typing), for performance reasons, or even because they don't know they're doing something wrong.

The :utf8 PerlIO layer sets the UTF8 flag, without checking the byte sequences, on incoming data. This is not a bug or a flaw, but the very function of this PerlIO layer. It is used internally by other layers (most importantly the :encoding layer), after they have (safely) converted the input to UTF8. A function that sets the UTF8 flag, _utf8_on, is available from the Encode module.
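The effect of setting the flag blindly can be sketched as follows. The byte string is an arbitrary example of invalid UTF8 (a start byte C9 followed by a semicolon); utf8::is_utf8 reports the flag and utf8::valid reports whether the internal bytes are well-formed:

```perl
use strict;
use warnings;
use Encode ();

# C9 announces a 2-byte character, but ";" is not a continuation byte.
my $bytes = "foo\xc9;id";

# Dangerous: this only flips the flag, it validates nothing.
my $blind = $bytes;
Encode::_utf8_on($blind);

my $flagged = utf8::is_utf8($blind) ? 1 : 0;   # the flag is on...
my $valid   = utf8::valid($blind)   ? 1 : 0;   # ...but the contents are malformed

print "flag: $flagged, well-formed: $valid\n";
# prints "flag: 1, well-formed: 0"
```

From this point on Perl treats $blind as a character string and trusts its contents.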

Several XS modules set the UTF8 flag on incoming data from a file or a socket (think of databases and network protocols), sometimes without checking the validity of the UTF8 sequences.

Perl's functions use Unicode semantics by default (except for a bug; see Unicode::Semantics for a workaround), which means that \w matches any alphanumeric character or underscore. That covers a huge number of Unicode characters. Similar semantics are in effect for \d and \s, but many people assume that \w is short for [A-Za-z0-9_], that \d is short for [0-9], and that \s is short for [ \f\t\r\n]. This is not true. Since 5.8, released more than 5 years ago, they match with Unicode semantics.
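A small sketch of the difference, using two characters picked purely as examples (any non-ASCII digit or letter above U+00FF would do):

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $greek_alpha  = "\x{03b1}";   # GREEK SMALL LETTER ALPHA

my $d_unicode = ($arabic_three =~ /^\d$/)             ? 1 : 0;
my $d_ascii   = ($arabic_three =~ /^[0-9]$/)          ? 1 : 0;
my $w_unicode = ($greek_alpha  =~ /^\w$/)             ? 1 : 0;
my $w_ascii   = ($greek_alpha  =~ /^[A-Za-z0-9_]$/)   ? 1 : 0;

print "\\d: $d_unicode, [0-9]: $d_ascii, \\w: $w_unicode, [A-Za-z0-9_]: $w_ascii\n";
# prints "\d: 1, [0-9]: 0, \w: 1, [A-Za-z0-9_]: 0"
```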

Proof of concept exploit

The (contrived) proof of concept exploit:

test.bin is a file containing the following 7 bytes:

    66 6f 6f c9 3b 69 64
    f  o  o  *****  i  d

The ***** represents an invalid UTF8 byte sequence: a start byte (C9) indicating a character length of 2 bytes, followed by a byte (3B) that in ASCII is a semicolon (!). The following simple Perl program reads this file:

    #!/usr/bin/perl -T
    use strict;
    %ENV = ( PATH => '/usr/bin' );
    open my $filehandle, "< :utf8", "test.bin" or die $!;
    my $word = readline $filehandle;
    my ($untainted) = $word =~ /^(\w+)$/;
    if ($untainted) {
        # It passed the regex, so it is "safe".
        system "echo $untainted";
    }

When this program is executed, the C9 3B together will be interpreted as the Unicode character U+027B (which when UTF8 encoded properly would have been C9 BB), but the shell sees a semicolon and executes not only echo, but also id.
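The arithmetic behind that interpretation can be sketched as follows. This assumes a lax decoder that simply masks out the payload bits without checking that the second byte is a continuation byte. That assumption is consistent with the U+027B result described above, but the code is an illustration, not a copy of Perl's actual decoder:

```perl
use strict;
use warnings;
use Encode ();

my ($lead, $next) = (0xC9, 0x3B);

# A well-formed continuation byte has the bit pattern 10xxxxxx.
my $is_continuation = (($next & 0xC0) == 0x80) ? 1 : 0;

# A lax decoder skips that check and combines the payload bits:
# 5 bits from the lead byte (110xxxxx), 6 bits from the next byte.
my $lax_codepoint = (($lead & 0x1F) << 6) | ($next & 0x3F);

# The proper UTF-8 encoding of U+027B, as hex, for comparison.
my $proper = uc unpack 'H*', Encode::encode('UTF-8', "\x{27B}");

printf "continuation: %d, lax: U+%04X, proper: %s\n",
    $is_continuation, $lax_codepoint, $proper;
# prints "continuation: 0, lax: U+027B, proper: C9BB"
```

U+027B is a lowercase letter, so \w accepts it, while the raw byte 3B is still there for the shell to see.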

For some reason, with warnings enabled, this program throws a fatal exception (not a warning) "Malformed UTF-8 character (unexpected non-continuation byte 0x3b, immediately after start byte 0xc9)". Because this is probably a side effect of something, and because warnings are often disabled dynamically (at a distance), this does not provide sufficient protection.

The solution is very simple: do not use :utf8. Use :encoding(UTF8) instead, or :encoding(UTF-8) (the same, but with a hyphen) for strict, Unicode-compliant UTF-8, as should have been done in the first place.
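A sketch of the corrected read, without the system call. It recreates the 7-byte file in a temporary location using File::Temp; the file handling and the warning handler are scaffolding for the demonstration only. How exactly the malformed byte is reported (a warning, an escape sequence in the data) may differ between Perl versions, so the sketch only checks that no semicolon can end up in the untainted value:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Recreate the 7-byte test file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "foo\xc9;id";
close $out or die $!;

# Decoding problems are reported as warnings; silence them for this demo.
local $SIG{__WARN__} = sub { };

# The validating layer, instead of the flag-only :utf8 layer.
open my $in, '<:encoding(UTF-8)', $path or die $!;
my $word = readline $in;
close $in;

my ($untainted) = defined $word ? $word =~ /^(\w+)$/ : ();
print defined $untainted ? "accepted: $untainted\n" : "rejected\n";
```

Because the layer decodes and validates the input, the semicolon remains a character of its own and can no longer hide inside something that \w accepts.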

More subtle vulnerabilities exist when a module like a database library assumes that data (e.g. from the database) is valid UTF8, but it isn't (for example, because the database engine allows inserting arbitrary binary data into the field). This was not tested at T-DOSE, but a quick look at the source code makes me think that while DBD::SQLite may be vulnerable (uses SvUTF8_on without checking), DBD::mysql (uses sv_utf8_decode) and DBD::Pg (uses is_utf8_string) are probably not.

The security vulnerability is the result of naive use of Perl's API, possibly inspired by misleading documentation. It is not a bug in perl itself.

There may be other vectors of attack for abusing malformed UTF8 sequences.


Please do not set the UTF8 flag unless you are fully convinced that your data is actually valid UTF8, and remember that :utf8 sets the UTF8 flag without checking.

Instead of the :utf8 PerlIO layer, use :encoding(UTF8) or :encoding(UTF-8).

Instead of _utf8_on, use utf8::decode, Encode::decode_utf8, Encode::decode("UTF8", ...), or Encode::decode("UTF-8", ...).
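How two of these alternatives react to malformed input can be sketched like this. The input bytes are the same invalid sequence as in the proof of concept; passing Encode::FB_CROAK is one possible choice of error handling, not the only one:

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "foo\xc9;id";   # malformed: C9 is not followed by a continuation byte

# utf8::decode works in place and returns false on malformed input,
# leaving the string as bytes with the flag off.
my $in_place = $bytes;
my $ok       = utf8::decode($in_place) ? 1 : 0;
my $flag     = utf8::is_utf8($in_place) ? 1 : 0;

# Encode::decode with FB_CROAK dies instead of guessing.
# It may modify its input, so hand it a copy.
my $copy    = $bytes;
my $decoded = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
my $croaked = defined $decoded ? 0 : 1;

print "utf8::decode ok: $ok, flag: $flag, Encode::decode croaked: $croaked\n";
# prints "utf8::decode ok: 0, flag: 0, Encode::decode croaked: 1"
```

Either way, the program learns that the data is bad before any character semantics are applied to it.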

Instead of SvUTF8_on, use sv_utf8_decode, or check validity first, with is_utf8_string.

Instead of writing \w, \d, or \s, write a literal character class if you do not want non-ASCII parts to match, or filter/forbid non-ASCII characters (those with a codepoint (numeric value) greater than 127) beforehand.
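Both options can be combined in a small helper. The function name ascii_word is hypothetical and the policy (ASCII letters, digits and underscore only) is just an example:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the word if it consists purely of ASCII
# letters, digits and underscores; otherwise returns undef.
sub ascii_word {
    my ($input) = @_;
    return undef if $input =~ /[^\x00-\x7f]/;        # forbid non-ASCII outright
    my ($word) = $input =~ /\A([A-Za-z0-9_]+)\z/;    # literal class, not \w
    return $word;
}

my $plain   = defined(ascii_word("foo_42"))       ? "ok" : "rejected";
my $unicode = defined(ascii_word("foo\x{27b}id")) ? "ok" : "rejected";
my $shell   = defined(ascii_word("foo;id"))       ? "ok" : "rejected";

print "$plain $unicode $shell\n";
# prints "ok rejected rejected"
```

The sketch uses \A and \z rather than ^ and $ so that a trailing newline is not silently accepted.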

Perl documentation flaws

Several official Perl documents use :utf8 in code examples. This has already been changed in the current development version earlier this year, and will be updated in the next release. My own document perlcheat is wrong about equivalencies for \w, \d and \s, and I will try to have this repaired soon.

Update: license added (requested): This report (© 2007 Juerd Waalboer) may be copied with attribution, under the CC:by license.
Update: calling system() with a Unicode string is a violation of text/binary separation, but explicitly encoding $untainted to UTF8 or UTF-8 (as should have been done) does not solve the security problem, because these encoders are optimized and use the internal value when Perl believes it is valid UTF(-)8.