http://www.perlmonks.org?node_id=940968

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi there

My script uses GDBM_File and ties a hash to a file on disk as follows:

tie(%term_doc_count, "GDBM_File", "term_docCount", GDBM_WRCREAT, 0666);

Then it tries to add to this hash as follows:

$term_doc_count{$stem_term}++;

This seems to be causing a "Wide character in null operation" error when $stem_term equals "today’s". The apostrophe in that string seems to be a special character rather than a plain ASCII one.

Referring to this thread: http://www.perlmonks.org/?node_id=565560, I gathered that this might be because GDBM cannot handle UTF-8 keys, and using the suggestion from user 'graff' I added the following lines to my code which seem to solve the problem:

my $usable_term = encode( 'utf8', $stem_term );
$term_doc_count{$usable_term}++;

However, I cannot understand why this works. If GDBM_File can't handle UTF-8 keys, then why does encoding the key as UTF-8 and handing that key to GDBM solve the problem? I'd really prefer to understand the solution to my problem rather than just use something that works without understanding it. Can anyone help?

Re: hash tied to GDBM_FILE causes Wide character in null operation
by ikegami (Patriarch) on Nov 30, 2011 at 23:44 UTC

    There are three approaches the authors could have taken.

    • The module could accept strings of bytes. If the string has three characters that are all bytes, it will be stored over three bytes. If the string has non-byte characters, garbage will be produced. ("Garbage" happens to be something similar to the UTF-8 encoding of the string due to internal Perl details that have nothing to do with GDBM.) Any GDBM file can be read.

    • The module could accept strings of text, and store them as UTF-8. If the string has three characters that are all bytes, it will be stored over three to six bytes. If the string has non-byte characters, they will be stored and extracted properly. Only GDBM files whose text fields contain UTF-8 can be read.

    • The module could accept strings of text, but use one of two storage formats depending on the contents of the string. Strings would be stored as efficiently as in the first option when possible (except for one extra byte per string), and arbitrary text could be stored. Only GDBM files whose text fields contain strings in this format can be read.

    The implementers went with the first. It's the only one that allows the module to read any GDBM file, and that's extremely important. That leaves it up to the user to serialise strings of text into strings of bytes by encoding them.
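
A minimal sketch of that encode-on-store, decode-on-fetch discipline. The helper names `increment_term` and `term_counts` are made up for illustration, and a plain hash stands in for the hash tied to GDBM_File so the snippet runs without a GDBM file on disk. The assumption is that the same calls behave the same way on the tied hash.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for the hash tied to GDBM_File (an assumption of this sketch).
my %term_doc_count;

# Hypothetical helper: encode the text key into UTF-8 bytes, then count it.
sub increment_term {
    my ($counts, $term) = @_;
    my $key = encode('UTF-8', $term);    # characters -> bytes
    return ++$counts->{$key};
}

# Hypothetical helper: decode the stored byte keys back into text.
sub term_counts {
    my ($counts) = @_;
    my %by_term;
    $by_term{ decode('UTF-8', $_) } = $counts->{$_} for keys %$counts;
    return %by_term;
}

increment_term(\%term_doc_count, "today\x{2019}s") for 1 .. 2;
my %by_term = term_counts(\%term_doc_count);
printf "%d\n", $by_term{"today\x{2019}s"};
```

The point of the design is that everything crossing the boundary to the module is bytes, and everything on the Perl side is text, so the same key always maps to the same stored bytes.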

    I suppose its constructor could accept an argument specifying an encoding, allowing the user to choose whether he wants the first or second behaviour. I guess the authors didn't consider that, but that's excusable because the module predates Perl's support for strings of non-ASCII text.


    Your error is that your string contains the characters

    74 6F 64 61 79 2019 73

    so one of the characters isn't a byte, yet the module expects bytes.
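
To see this for yourself, a small sketch along these lines dumps the code point of each character and checks whether any of them is too big to be a byte. The helper names `code_points` and `has_wide_char` are illustrative, not part of any module.

```perl
use strict;
use warnings;

# Hex code point of each character in a string.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf '%X', ord } split //, $str;
}

# 1 if any character is above 0xFF (cannot be a single byte), else 0.
sub has_wide_char {
    my ($str) = @_;
    return $str =~ /[^\x00-\xFF]/ ? 1 : 0;
}

my $stem_term = "today\x{2019}s";
print code_points($stem_term), "\n";      # 74 6F 64 61 79 2019 73
print has_wide_char($stem_term), "\n";    # 1
```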


    Note that UTF-8 isn't always produced, only when garbage (something that isn't a string of bytes) is given.

    Literal: "\xC9\x72\x69\x63"
    String:  C9 72 69 63
    Stored:  C9 72 69 63

    Literal: "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric"
    String:  C9 72 69 63
    Stored:  C9 72 69 63

    Literal: "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric\N{RIGHT SINGLE QUOTATION MARK}s"
    String:  C9 72 69 63 2019 73
    Stored:  C3 89 72 69 63 E2 80 99 73 (with warning)
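
The listing above is why encoding explicitly matters. As a rough sketch (the `hex_bytes` helper is made up for illustration, and this only uses Encode, not GDBM_File), compare the raw characters of "Éric" with what an explicit UTF-8 encoding produces:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of a string of bytes.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $eric  = "\x{C9}ric";             # every character fits in a byte
my $erics = "\x{C9}ric\x{2019}s";    # U+2019 does not fit in a byte

print hex_bytes($eric), "\n";                     # C9 72 69 63
print hex_bytes(encode('UTF-8', $eric)), "\n";    # C3 89 72 69 63
print hex_bytes(encode('UTF-8', $erics)), "\n";   # C3 89 72 69 63 E2 80 99 73
```

Passed unencoded, "Éric" is stored as C9 72 69 63 while "Éric’s" comes out UTF-8-like, so the two are inconsistent with each other. Encoding every string before storing it gives the same encoding in both cases.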