PerlMonks  

Re: hash tied to GDBM_FILE causes Wide character in null operation

by ikegami (Patriarch)
on Nov 30, 2011 at 23:44 UTC


in reply to hash tied to GDBM_FILE causes Wide character in null operation

There are three approaches the authors could have taken.

  • The module could accept strings of bytes. If the string has three byte characters, it will be stored as three bytes. If the string has non-byte characters, garbage will be produced. ("Garbage" happens to be something similar to the UTF-8 encoding of the string due to internal Perl details that have nothing to do with GDBM.) Any GDBM file can be read.

  • The module could accept strings of text, and store them as UTF-8. If the string has three byte characters, it will be stored as three to six bytes. If the string has non-byte characters, they will be stored and extracted properly. Only GDBM files whose text fields contain UTF-8 can be read.

  • The module could accept strings of text, but use one of two storage formats depending on the contents of the string. Strings would be stored as efficiently as in the first option when possible (except for one extra byte per string), and arbitrary text could be stored. Only GDBM files whose text fields contain strings in this format can be read.

The implementers went with the first. It's the only one that allows the module to read any GDBM file, and that's extremely important. That leaves it up to the user to serialise strings of text into strings of bytes by encoding them.
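
A minimal sketch of doing that encoding by hand, using the core Encode module. The helper names store_text and fetch_text are made up for illustration, and %db is a plain hash standing in for the tied hash so the snippet runs without a GDBM file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: %db stands in for a hash tied to GDBM_File.
my %db;

# Hypothetical helper: turn text into bytes before it reaches the hash.
sub store_text {
    my ($h, $key, $value) = @_;
    $h->{ encode('UTF-8', $key) } = encode('UTF-8', $value);
    return;
}

# Hypothetical helper: turn the stored bytes back into text.
sub fetch_text {
    my ($h, $key) = @_;
    my $bytes = $h->{ encode('UTF-8', $key) };
    return defined $bytes ? decode('UTF-8', $bytes) : undef;
}

my $text = "today\x{2019}s";          # contains a non-byte character
store_text(\%db, 'note', $text);

my $stored = $db{note};               # what the DBM layer would see: bytes only
my $back   = fetch_text(\%db, 'note');
```

Here $stored is nine bytes long (the quotation mark takes three bytes in UTF-8) and $back compares equal to $text.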

I suppose its constructor could accept an argument specifying an encoding, allowing the user to choose whether he wants the first or second behaviour. I guess the authors didn't consider that, but that's excusable because the module predates Perl's support for strings of non-ASCII text.
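
Something close to that choice can be approximated with the DBM filter hooks described in perldbmfilter, which lists GDBM_File among the modules that support them. The sketch below is not the module's constructor API. It uses SDBM_File purely as a stand-in so it runs without a GDBM library, and it assumes the same four filter calls behave the same way on a GDBM_File object.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Fcntl;                        # O_RDWR, O_CREAT
use File::Temp qw(tempdir);
use SDBM_File;                    # stand-in for GDBM_File (assumption, see above)

my $dir = tempdir(CLEANUP => 1);
my $db  = tie my %h, 'SDBM_File', "$dir/demo", O_RDWR | O_CREAT, 0666
    or die "Cannot tie: $!";

# Encode text to bytes on the way in, decode bytes to text on the way out.
$db->filter_store_key(  sub { $_ = encode('UTF-8', $_) });
$db->filter_store_value(sub { $_ = encode('UTF-8', $_) });
$db->filter_fetch_key(  sub { $_ = decode('UTF-8', $_) });
$db->filter_fetch_value(sub { $_ = decode('UTF-8', $_) });

my $text = "today\x{2019}s";
$h{note} = $text;                 # stored as UTF-8 bytes, no warning
my $round_trip = $h{note};        # decoded back to text

undef $db;                        # drop the object before untie
untie %h;
```

With the filters installed the file only ever holds UTF-8, so this gives the second behaviour from the list above. Leaving the filters off gives the first.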


Your error is that your string contains the characters

74 6F 64 61 79 2019 73

so one of the characters isn't a byte, yet the module expects bytes.
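
A quick way to check for this before storing, as a sketch. The function name is_byte_string is made up for illustration.

```perl
use strict;
use warnings;

my $string = "today\x{2019}s";

# List the characters (code points) of the string in hex.
my $dump = join ' ', map { sprintf '%02X', ord($_) } split //, $string;

# Hypothetical helper: true when every character fits in a byte (0x00-0xFF).
sub is_byte_string {
    my ($s) = @_;
    return $s !~ /[^\x00-\xFF]/;
}

my $ok_as_is   = is_byte_string($string);      # false: U+2019 is not a byte
my $ok_latin1  = is_byte_string("\xC9ric");    # true: all characters are bytes
```

$dump comes out as the same sequence shown above, 74 6F 64 61 79 2019 73.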


Note that UTF-8 isn't always produced, only when garbage (something that isn't a string of bytes) is given.

    Literal: "\xC9\x72\x69\x63"
    String:  C9 72 69 63
    Stored:  C9 72 69 63

    Literal: "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric"
    String:  C9 72 69 63
    Stored:  C9 72 69 63

    Literal: "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric\N{RIGHT SINGLE QUOTATION MARK}s"
    String:  C9 72 69 63 2019 73
    Stored:  C3 89 72 69 63 E2 80 99 73 (with warning)
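
The "String" lines can be reproduced without GDBM at all. This sketch only inspects the Perl strings. It does not go through the module, so the last line is produced with Encode rather than by the store itself, and it merely happens to match the "Stored" bytes of the third case. hex_dump is a made-up helper.

```perl
use strict;
use warnings;
use charnames ':full';            # needed for \N{...} on older perls
use Encode qw(encode);

# Hypothetical helper: show each character of a string as hex.
sub hex_dump { return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $bytes_literal = "\xC9\x72\x69\x63";
my $named_literal = "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric";
my $wide          = "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric\N{RIGHT SINGLE QUOTATION MARK}s";

my $same      = $bytes_literal eq $named_literal;   # same characters, same string
my $narrow    = hex_dump($named_literal);           # all characters are bytes
my $wide_dump = hex_dump($wide);                    # one character is not a byte
my $as_utf8   = hex_dump(encode('UTF-8', $wide));   # explicit UTF-8 encoding
```

The first two literals are the same string as far as Perl is concerned, which is why they store identically. Only the third contains something that is not a byte.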
