PerlMonks  

Re^3: RFC: Is this the correct use of Unicode::Collate?

by Jim (Curate)
on Jun 24, 2012 at 02:13 UTC


in reply to Re^2: RFC: Is this the correct use of Unicode::Collate?
in thread RFC: Is this the correct use of Unicode::Collate?

A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion.

What you need is an invisible letter in Unicode. Just such a letter was proposed several years ago by typographer Michael Everson. His proposed name for the character was INVISIBLE LETTER. Unfortunately, the Unicode Consortium rejected his proposal. See "Proposal to add INVISIBLE LETTER to the UCS" and "Every character has a story #11: U+???? (The Invisible Letter)".

If there were such an invisible Unicode character, you could do something like this:

#!perl

use strict;
use warnings;
use open qw( :std :encoding(UTF-8) );
use charnames qw( :full );

use Unicode::Collate;

my $DISAMBIGUATOR_CHARACTER = "\N{LATIN SMALL LIGATURE FFL}"; # U+FB04

my %president_number_by; # President number by president name
my %seen;

while (<DATA>) {
    chomp;

    my ($name, $number) = split m/,/, $_, 2;

    $seen{$name} = exists $seen{$name}
                 ? $seen{$name} . $DISAMBIGUATOR_CHARACTER
                 : $name
                 ;

    $president_number_by{$seen{$name}} = $number;
}

my $collator = Unicode::Collate->new();

for my $name ($collator->sort(keys %president_number_by)) {
    my $number = $president_number_by{$name};

    $name =~ s/$DISAMBIGUATOR_CHARACTER+$//;

    print "$name,$number\n";
}

exit 0;

__DATA__
Washington,1
Adams,2
Jefferson,3
Madison,4
Monroe,5
Adams,6
Jackson,7
Van Buren,8
Harrison,9
Tyler,10
Polk,11
Taylor,12
Fillmore,13
Pierce,14
Buchanan,15
Lincoln,16
Johnson,17
Simpson,18
Hayes,19
Garfield,20
Arthur,21
Cleveland,22
Harrison,23
Cleveland,24
McKinley,25
Roosevelt,26
Taft,27
Wilson,28
Harding,29
Coolidge,30
Hoover,31
Roosevelt,32
Truman,33
Eisenhower,34
Kennedy,35
Johnson,36
Nixon,37
Ford,38
Carter,39
Reagan,40
Bush,41
Clinton,42
Bush,43
Obama,44
Bush,45

This script produces this output:

Adams,2
Adams,6
Arthur,21
Buchanan,15
Bush,41
Bush,43
Bush,45
Carter,39
Cleveland,22
Cleveland,24
Clinton,42
Coolidge,30
Eisenhower,34
Fillmore,13
Ford,38
Garfield,20
Harding,29
Harrison,9
Harrison,23
Hayes,19
Hoover,31
Jackson,7
Jefferson,3
Johnson,17
Johnson,36
Kennedy,35
Lincoln,16
Madison,4
McKinley,25
Monroe,5
Nixon,37
Obama,44
Pierce,14
Polk,11
Reagan,40
Roosevelt,26
Roosevelt,32
Simpson,18
Taft,27
Taylor,12
Truman,33
Tyler,10
Van Buren,8
Washington,1
Wilson,28

(For the purpose of demonstrating more than two presidents with the same last name, I had to assume Barack Obama is re-elected in 2012 and Jeb Bush is elected in 2016. I'm sorry if this prospect offends you.)

This is a pure Unicode solution to the problem. There's no commingling of Unicode characters or graphemes with binary data. Unfortunately, however, there isn't a Unicode character with the general property L (Letter) that's guaranteed to be invisible. If there were, it would be just the right character to use for this "weirdo" purpose.
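
To make the property distinction concrete, here is a small sketch (an illustration added alongside the post, not part of the demo script) that uses Perl's Unicode property escapes to check the general category of the stand-in character and of a typical control code. The helper name describe_char is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a single character has the general
# category Letter (\p{L}) or Control (\p{Cc}).
sub describe_char {
    my ($char) = @_;
    return {
        is_letter  => ( $char =~ /\A\p{L}\z/  ? 1 : 0 ),
        is_control => ( $char =~ /\A\p{Cc}\z/ ? 1 : 0 ),
    };
}

my $ligature = describe_char("\x{FB04}");   # LATIN SMALL LIGATURE FFL
my $ctrl     = describe_char("\x{01}");     # START OF HEADING, a C0 control code

printf "U+FB04: letter=%d control=%d\n", @{$ligature}{qw(is_letter is_control)};
printf "U+0001: letter=%d control=%d\n", @{$ctrl}{qw(is_letter is_control)};
```

The ligature is a letter but a visible one, while the control code is invisible but not a letter, which is the gap an INVISIBLE LETTER character would have filled.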

Why did I use the Unicode character LATIN SMALL LIGATURE FFL in the demo script? I don't know exactly. Maybe because it's a character that collates high and seems impossibly unlikely ever to occur in real data.

Jim


Re^4: RFC: Is this the correct use of Unicode::Collate?
by flexvault (Parson) on Jun 24, 2012 at 10:35 UTC

    Jim,

    Thank you for your input. You seem to know quite a bit about Unicode.

    What I tried to ask in the original post was why 'use Unicode::Collate;' changed the meaning of characters 0..31. Everything I have read talked about Unicode not changing the meaning of 7-bit ASCII.
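
    One plausible explanation, offered as a sketch rather than a definitive diagnosis: Unicode::Collate does not redefine the characters themselves, but its default collation table gives most C0 control codes no weight, so they are ignored when two strings are compared. Assuming the module's default settings, the comparison below shows the difference between an ordinary code-point cmp and the collator's cmp for a key with U+0001 appended. With the default table I would expect the collator to report the two strings as equal (0), though the exact result depends on the table shipped with the installed module.

    ```perl
    use strict;
    use warnings;
    use Unicode::Collate;

    my $collator = Unicode::Collate->new();

    my $plain  = "Adams";
    my $tagged = "Adams\x{01}";    # same name with a control code appended

    # Code-point comparison sees the extra character.
    my $by_codepoint = $plain cmp $tagged;

    # Collation comparison uses weights from the collation table instead.
    my $by_collation = $collator->cmp($plain, $tagged);

    print "cmp: $by_codepoint, collator: $by_collation\n";
    ```

    If the collator treats both keys as equal, any scheme that relies on appended control codes for ordering loses that ordering once keys are sorted through Unicode::Collate.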

    History of the question:

    I don't know if you are familiar with the NoSQL database engine BerkeleyDB (now owned by Oracle), but I have written a pure Perl replacement that performs as well. In some cases, where the data portion of the key/value pair is very large, it outperforms BerkeleyDB.

    Most people on this forum believe that BerkeleyDB is free. Oracle has added some conditions that make it very expensive (according to our law firm's counsel). One example: If a company employee downloads BerkeleyDB and installs it, that's okay. But as a software vendor, if I download it and install it, the company owes Oracle a fee based on the number of cores and the type of box. For a POWER7 IBM p-series with 32 cores, the license fee is $48,000 for the "free" BerkeleyDB.

    Most of our products sell for under $5,000. It's hard to ask a company to pay an additional $48K.

    Since the PurePerlDB already exists, I was looking at adding a feature to use Unicode::Collate, but it broke other features of PurePerlDB. Unfortunately, my only solution for now is to put the burden on the software developer to handle Unicode and duplicates, which is the same as with BerkeleyDB.

    Thanks again for your input...Ed

    "Well done is better than well said." - Benjamin Franklin

      Most people on this forum believe that BerkeleyDB is free. Oracle has added some conditions that make it very expensive (according to our law firm's counsel). One example: If a company employee downloads BerkeleyDB and installs it, that's okay. But as a software vendor, if I download it and install it, the company owes Oracle a fee based on the number of cores and the type of box. For a POWER7 IBM p-series with 32 cores, the license fee is $48,000 for the "free" BerkeleyDB.

      Just in case anyone was wondering about it, see my take on it in "Open Source License for Berkeley DB unchanged".

      The situation hasn't changed with the latest Berkeley DB release, db-5.3.21. The license is essentially the same, though there is an addition of ASM for Java (it only affects the Java bits and doesn't affect distribution or pricing).

      But I'm not a businessman or a lawyer, nor do I work for Oracle.


      Regarding http://www.flexbasedb.com/, I notice you provide only PDF, not HTML, which is a minor hassle.

      For anyone interested about PurePerlDB/FlexBaseDB, from http://www.flexbasedb.com/FlexBaseDB_Introduction.pdf

      use strict;
      use warnings;
      use FlexBaseDB;

      my $dirname = '/home/FlexBaseDB';
      unlink glob("$dirname/*");

      my $fbenv = FB_OpenENV ( EnvHome => $dirname ); ## Directory for database(s)
      if ( ! $fbenv ) { die "FB_OpenENV: Bad ENV\n"; }

      my $filename = "TestDB"; ## Test file name in Environment!
      my $fb = FB_OpenDB ( FB_Name => $filename, ## Unique name of database
                           FB_ENV => $fbenv, ## reference from FB_OpenENV
                         );
      if ( ! $fb ) { die "FB_OpenDB: Bad FILE\n"; }

      my $key = "Hello";
      my $data = "World, we're here!";
      my $ret;
      for my $count ( 1..5 ) {
          $ret = FB_Write( $fb,\"$key-$count",\$data );
          if ( $ret==FALSE ) { die "Write failed $FB_Error \n"; }
      }

      if ( FB_Seek( $fb,\$key, FB_FIRST ) ) {
          print "\nOutput:\n\n";
          while( $ret ) {
              $key = ""; $data = "";
              $ret = FB_ReadNext( $fb,\$key,\$data );
              print "$key\t$data\n";
          }
      }

      print "\n","=" x 54, "\n"; ## Will print statistics for your DB
      my @results = FB_Stat ( $fb );
      for my $no ( 0 .. $#results ) {
          if ( substr($results[$no],0,1) eq "=" ) { $results[$no] = "=" x 54; }
          print "$results[$no]\n";
      }
      print "=" x 54, "\n";

      $ret = FB_CloseDB( $fb );
      $ret = FB_CloseENV( $fbenv );

      __END__

      Output:

      Hello-1 World, we're here!
      Hello-2 World, we're here!
      Hello-3 World, we're here!
      Hello-4 World, we're here!
      Hello-5 World, we're here!

        Dear Monk,

        I am not a lawyer, however if you do a web search on

        "The Sneaky Sleepycat License"
        you will find many legal opinions. Whether you are right or they are, I'm not the one to ask!

        Most of our clients have IBM *ix systems, and we have to be concerned about the legal use or misuse of our own or others' software. This isn't just about my opinion!

        YMMV!

        "Well done is better than well said." - Benjamin Franklin

      I don't know if you are familiar with the NoSQL database engine BerkeleyDB (now owned by Oracle), but I have written a pure Perl replacement that performs as well. In some cases, where the data portion of the key/value pair is very large, it outperforms BerkeleyDB.

      I'm familiar with NoSQL and key-value stores such as Berkeley DB. But what I'd never heard of before reading your PerlMonks post is the idiom—the trick—of modifying data to disambiguate otherwise identical keys by appending control codes or invisible characters to them. This idiom seems "weirdo" to me, just as it did to Tom, who first invoked the word to describe it.

      Is my example Perl script a fair representation of the idiom your NoSQL database software uses to disambiguate like keys?

      I'm not a database theory guru or a database programming wizard, but my gut sense is that the idiom you describe of ornamenting data with invisible control codes or other characters is fraught with problems. I understand how data modified this way would ensure uniqueness and preserve insertion order. But how then do you match such modified strings? Isn't there a better way to achieve the same objectives without altering data? Do other NoSQL database engines besides yours use this same idiom? If so, which ones?
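
      As one possible alternative, and only a sketch under the assumption that an in-memory structure is acceptable, duplicates can be kept as a list of values under the unmodified key. Insertion order is preserved by the list, the key stays matchable as-is, and Unicode::Collate only ever sees real names. The sample records and variable names below are illustrative and not taken from any database engine.

      ```perl
      use strict;
      use warnings;
      use Unicode::Collate;

      # Each unmodified key maps to a list of values, kept in insertion order.
      my %values_by_name;
      my @records = ( [ Adams => 2 ], [ Bush => 41 ], [ Adams => 6 ], [ Bush => 43 ] );
      for my $record (@records) {
          my ($name, $number) = @$record;
          push @{ $values_by_name{$name} }, $number;
      }

      # Sort the distinct keys with the collator, then emit each key's values
      # in the order they were inserted.
      my $collator = Unicode::Collate->new();
      my @lines;
      for my $name ( $collator->sort( keys %values_by_name ) ) {
          push @lines, map { "$name,$_" } @{ $values_by_name{$name} };
      }
      print "$_\n" for @lines;
      ```

      A lookup is then an ordinary hash access such as $values_by_name{Adams}, with no need to strip or guess at appended characters.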

      Jim
