Generate unique initials from a list of names

EdwardG has asked for the wisdom of the Perl Monks concerning the following question:

Recently I had the need to produce a report of a certain resource level over time. The basic report looked like this -

Date Range                  Level   Consumers
-------------------------- -------  --------------------------
13 Mar 2004 to 23 Mar 2004      7   AB, AT, DA, MS, RI, RR, RR
24 Mar 2004 to 31 Mar 2004      6   AB, DA, RC, RI, RR, SJ
01 Apr 2004 to 15 May 2004      3   Alan Po, Rohan Ito, Sam Jones
16 May 2004 to 31 Dec 2004      5   AP, PS, RC, RI, SJ
[download]

What this shows is that for each listed _date_range_, we require _level_ items, and that these items are to be allocated to the _consumer_ as identified by the listed names. Where the list of names is too long to fit neatly on one line, I programmatically shorten the name to just the initials to try keep it from wrapping or truncating. Incidentally, this report has proven very usful as it shows if we will exceed or have spare capacity in the known future.

But this node is not about that report, it is about a frustrating little problem (did you spot it?) that eventually resulted in the code below.

I noticed that I had neglected to cater for two or more people having the same initials. In fact my code to generate intials from names was simply

$init{$_} = join '', ($_ =~ /\b\w/g) for @names;
[download]

People with the same initials make the report ambiguous, although this doesn't matter if they are both included in a single date range. But when a single "RR" appears in a date range, as in the second line of the report, it is ambiguous.

Now this was not a big problem since the report was more about the resource level than about the resource consumers, but naively I thought it was a simple fix.

My first lazy attempt at distinguishing derived initials was to detect when the derived initials were already used, and in that case add more letters from the name (iteratively) until the second (etc) set of derived initials were unique. This resulted in a report that looked like this -

Date Range                  Level   Consumers
-------------------------- -------  --------------------------
13 Mar 2004 to 23 Mar 2004      7   AB, AT, DA, MS, RI, RR, RaRa
24 Mar 2004 to 31 Mar 2004      6   AB, DA, RC, RI, RaRa, SJ
01 Apr 2004 to 15 May 2004      3   Alan Po, Rohan Ito, Sam Jones
16 May 2004 to 31 Dec 2004      5   AP, PS, RC, RI, SJ
[download]

Notice "RaRa" is now used for the second person with initials of "RR". I can now see in the second line of the report, it is actually the second "RR" person who needs the resource, not the first like I could have mistakenly assumed.

This was an improvement, and in fact I left the report in this state since a) it was only for my information, b) it was good enough, and c) I had more important things to do.

At this point I could have abandoned names and maybe used numbers to identify people, but I avoided that option for two main reasons:

I didn't want the maintenance hassle of a new item of data (where would these "id numbers" be stored? How would I maintain them?)
I really wanted the report to tell me names without referring to yet another list. (The report was supposed to be simple, dammit)

Fast forward several months, I'm running this report again - by now it has many more people on the list - and once again I notice the ambiguity. With more time to spend on the problem, I dived in.

To cut the potentially long story short, I've now got the following code that takes a list of names and returns a hashref of name => initials.

Example:

Alphonse Romeo-Smith => ARS
Big Al Bundy => BAB
Dilbert => D
I Palindrome I => IPI
James O'Leary => JOL
Jimmy O'Brien => JOB
John Smith => JS
K9 => K
Medhi Majesh => MedMaj
Mickey Mouse => MicMou
Minnie Mouse => MinMou
Tiny Tim => TT
Tom Demarco => TomDem
Tom Denada => TomDen
[download]

Notice the additional characters used to distinguish Mickey, Minnie, and Medhi

For names that are nearly identical, this doesn't work so well . o O(but what can I do?)

A A Milne => AAMilne   
A A Milner => AAMilner
[download]

With some more work, and this is the real reason I'm posting it, I wonder if it could become a module fit for sharing.

So with that in mind, I hereby invite your thoughts on the sensibility of what I'm trying to do, and the code itself (bugs or inefficiencies), and also perhaps you can think of names I haven't catered for. I'm already aware that this code would not be suitable for a long list of names (like a telephone book), but as it stands I find it useful for up to say several hundred names.

In case anyone is wondering, I got the long list of names from a Random Name Generator.

Here is the code. (I'm going out on a limb here, I expect someone will post an embarassing 80 character golfed version within minutes :)


use strict;
use warnings;

MAIN: {

    my @names = <DATA>;
    chomp @names;
    my $initref = &DistinctInitials( @names );

    # Sorted by Name
    print "$_ => $$initref{$_}\n" for (sort keys %$initref);

    # Sorted by Initials
    for my $init (sort values %$initref) {
        print "$init => $_\n" for (grep {$$initref{$_} eq $init} keys 
+%$initref);
    }
}

exit;

sub DistinctInitials {

    # Derive unique "initials" for each name in a list

    # 1. Get simple initials using the first letter of each word in na
+me
    # 2. For non-unique initials, insert a sufficient quantity of lett
+ers from
    #    the original name to make the initial unique among all initia
+ls

    my @names = @_; # A copy to be modified 
    
    # capitalise words
    s/\b(\w)(\w+)\b/\u$1\L$2\E/g for @names;
    
    # remove duplicates
    my %seen;
    @names = grep { ! $seen{$_}++ } @names;    
    
    # derive initial of each name
    my %init;
    $init{$_} = join '', ($_ =~ /\b\w/g) for @names;

    # identify non-unique initials
    my %nonu;
    for my $nonu_v (grep {$seen{$_}++} values %init) {
        map {push @{$nonu{$nonu_v}},$_ } grep {$init{$_} eq $nonu_v} k
+eys %init;
    }
    
    # remove duplicates
    %seen=();
    @{$nonu{$_}} = grep { ! $seen{$_} ++ } @{$nonu{$_}} for keys %nonu
+; 

    for my $init (keys %nonu) {
        # determine minimum additional characters from each 
        # name that make all these initials unique

        my $c = 2; # start with one extra character
        my $notunique = 1;

        while ($notunique) {
            my %tryuniq;
            
            for my $name (@{$nonu{$init}}) {
                $tryuniq{ join '', ($name =~ /\b\w{1,$c}/g) } = $name;
+  # greedy {1,$c}
            }

            if (scalar keys %tryuniq == scalar @{$nonu{$init}}) {
                
                undef $notunique; # success, all initials are now uniq
+ue
                
                # update the initial hash with our newfound initials
                while (my ($tryinit,$tryname) = each %tryuniq) {
                    $init{$tryname} = $tryinit;
                } 
            } else { 
                # failed to make them all unique
                $c += 1;                
            }
        }
        
    }
    
    \%init;

}    

# Hand-crafted test names follows...
__DATA__
John Smith
Medhi Majesh
Mickey Mouse
Minnie Mouse
A A Milne
A A Milne
A A Milner
Jimmy O'Brien
James O'Leary
Dilbert
Alphonse Romeo-Smith
I Palindrome I
tiny tim
BIG AL BUNDY
Tom DeMarco
Tom DeNada
K9
[download]

And here is further list of names I used for testing. As I mentioned, these are random names, any similarity to persons alive or dead is purely coincidental :)

Adam Harper
Alan Pfaff
Alan Vanwinkle
Alana Stutsman
Albert Osterman
Alison Sands
Allan Goris
Allan Razor
Alma Ratcliffe
Alvaro Cangemi
Amanda Basile
Annabelle Ducker
Anthony Granado
Anthony Mcdevitt
Antonio Starnes
April Kain
Arlene Stephen
Armand Stower
Arthur Ragsdale
Ashlee Hurtt
Ballentine
Beata Zar
Bernice Turnage
Bertha Schuck
Billy Sundberg
Binns
Blanche Page
Bobby Eaton
Bobby Winkelman
Brandon Dominick
Brittanie Noviello
Brittian
Bulah Twombley
Candace Cote
Carey Balck
Carey Grishaber
Carey Twedell
Chad Guarino
Chandra Puffer
Cheryl Langley
Chris Creekmore
Clare Milum
Clayton Elbert
Clinton Dahmen
Clinton Ganley
Clinton Mcnaught
Cody Ridout
Cody Vigo
Dale Hernandes
Dane Hasychak
Daniel Colunga
Darren Tacey
Darren Vanaman
Darryl Buchta
Debbie Schlueter
Doggett
Doris Adkins
Eileen Bard
Elinor Rita
Elizabeth Hendricks
Elnora Hornbuckle
Elwood Steketee
Ericka Hodgin
Erik Duley
Ernest Cowley
Ernest Frey
Ernest Helmick
Esmeralda Kardos
Eve Goodfellow
Ewa Villerreal
Felicia Manson
Fernando Outler
Fidel Jacoway
Fidel Paone
Florence Lillie
Fred Pyles
Genna Tranbarger
George Lamoureux
Gerald Graziano
Gerald Minnich
Gina Nickelson
Gladys Rothman
Glenda Wheat
Gregory Walls
Guinn
Guy Gatton
Hannah Quijano
Harold Houser
Harriet Dreyer
Heinricher
Howard Sirois
Hugh Charland
Hugh Dipalma
Hugh Fuselier
Hugh Minaya
Ida Noe
Inez Marion
James Angell
James Cravens
Jami Gula
Jamie Brickhouse
Jamie Huot
Jason Willson
Jeannette Macaluso
Jerri Giesen
Jerri Houseknecht
Jerri Nickson
Jessie Olah
Ji Greenier
Jonathan Matos
Joyce Brian
Julio Sepeda
Karina Palka
Katherine Holzer
Kathryn Diaz
Katie Lloyd
Keila Brue
Keith Berner
Kelly Henegar
Kenneth Caraballo
Kevin Batchelder
Kevin Poindexter
Kimberly Cecil
Kimberly Hare
Kingsbury
Koteles
Kristina Kasper
Lakisha Trees
Larita Battisti
Laura Mcfadden
Laurie Carranza
Lilia Fouse
Linda Kavanaugh
Linda Power
Lonnie Sherrow
Lori Stjohn
Lorrie Lobdell
Louis Leggett
Louis Watterson
Louisa Galyean
Louisa Swingle
Louise Belanger
Lynda Knudsen
Madelene Robasciotti
Mae Brantley
Maggie Fernandez
Maggie Madore
Mallory Brodbeck
Mallory Sickels
Manuel Trotter
Marcel Avers
Marcel Cuffia
Marcel Merow
Margaret Amey
Maricela Bautch
Marilyn Shumpert
Mark Haynie
Marlin Gryniuk
Martin King
Mathew Mayton
Max Sprau
Merrilee Garelick
Mildred Colburn
Ming Negro
Nathan Shuster
Neil Helbert
Nelson Pinzon
Nelson Rayo
Nicholas Lowell
Nicole Tillman
Norman Thatcher
Olivia Bridgeman
Peggy Loggins
Penelope Maize
Peter Agee
Peter Valadez
Priscilla Katz
Rachel Marchand
Randy Frederickson
Raphael Dumaine
Raymond Nava
Robyn Duvall
Rod Schoneman
Roger Goldman
Ronald Looney
Roy Kushner
Royce
Schrock
Scott Obryan
Shawn Ault
Stanley Cleaver
Sydow
Sylvia Salazar
Sylvia Walther
Tabatha Goodsell
Tari Windish
Tarra Bellantuono
Thomas Coleman
Tia Drakeford
Tina Palomo
Tompkins
Tony Haight
Tyrone Crossett
Tyrone Golub
Victor Mcduffie
Vincent Nowak
Viola Mcnamee
Wayne Maye
Wayne Ulrich
Yolanda Hood
[download]

Comment on Generate unique initials from a list of names Select or Download Code

Replies are listed 'Best First'.
Re: Generate unique initials from a list of names by TomDLux (Vicar) on Mar 12, 2004 at 13:45 UTC
If fitting data onto one line is essential In your situation, I would look into Soundex and similar name encoding algorithms. Normally they are used to match up variant spellings of names, but since they convert a name to a code, yet still have a chancee of recognizing the name from the code. Consider multi-line output Alternatly, you could look into using format and turning the Consumer section into a multi-line component: `Date Range Level Consumers -------------------------- ------- ---------------------------------- 13 Mar 2004 to 23 Mar 2004 7 John Wayne, John Dillinger, George W. Bush, Federico Felini, Ian Sutherland, Katherine Hepburn, + Madonna` [download] No messy encodings, no problems figuring out the identities of the people in the Consumers list, only a minor inconvenience learning to use a format. -- `TTTATCGGTCGTTATATAGATGTTTGCA`	[reply] [d/l]
Re: Re: Generate unique initials from a list of names by EdwardG (Vicar) on Mar 12, 2004 at 14:08 UTC
The multi-line approach is probably my most realistic fallback, and I must confess that this has (for me) turned into a bit of an academic exercise, just for fun. I did briefly consider a soundex encoding (metaphone actually), but I couldn't find a suitable encoding. Text::Soundex gives results like 'E460' and 'L222', and Text::Metaphone gives results that are in some cases worse than my Initials hack - `D:\tmp>cat metaphone.pl use Text::Metaphone; print $_,' => ', Metaphone($_),"\n" for map {chomp;$_} <DATA>; __DATA__ Stephen Thrasher Steve Trasher D:\tmp>metaphone.pl Stephen Thrasher => STFN0RXR Steve Trasher => STFTRXR D:\tmp>initials.pl Stephen Thrasher => StTh Steve Trasher => StTr D:\tmp>` [download]	[reply] [d/l]
Re: Generate unique initials from a list of names by Limbic~Region (Chancellor) on Mar 12, 2004 at 13:48 UTC
EdwardG, Do you mean something like this: #!/usr/bin/perl use strict; use warnings; my %name = map {chomp; $_ => undef} <DATA>; for my $person ( keys %name ) { my $p_case = $person; $p_case =~ s/(\w+)/\u$1/g; $name{$p_case} = delete $name{$person}; } my $finished = 0; while ( ! $finished ) { my $update = 0; for my $person ( keys %name ) { my @shorter = split " " , $person; $_ =~ s/(\w+)\w/$1/ for @shorter; my $new_name = join " " , @shorter; if ( ! exists $name{$new_name} ) { $update = 1; $name{$new_name} = delete $name{$person}; } } $finished = 1 if ! $update; } [download] Cheers - L~R	[reply] [d/l]
Re: Re: Generate unique initials from a list of names by EdwardG (Vicar) on Mar 12, 2004 at 14:23 UTC
Nice and short :) But I'm uncertain it works as I intend. For example, try these names: `__DATA__ Adam Harper Alan Harper` [download] With your code I get this: `d:\tmp>test.pl $VAR1 = { 'A H' => undef, 'A Ha' => undef };` [download] Both of these are ambiguous.	[reply] [d/l] [select]
Re: Re: Re: Generate unique initials from a list of names by Limbic~Region (Chancellor) on Mar 12, 2004 at 17:43 UTC
EdwardG, You are right - I only had a minimal chance to test it. I think it works to your satisfacation. Sorry it took a while to get back to you - stupid thing called work ;-) #!/usr/bin/perl use strict; use warnings; my %name = map {chomp; $_ => undef} <DATA>; my $num_names = 0; for my $person ( keys %name ) { my @names = split " " , $person; $num_names = @names if @names > $num_names; $_ = ucfirst lc $_ for @names; $name{ $person } = \@names; } my $finished = 0; my $index = 0; while ( ! $finished ) { my $update = 0; for my $person ( keys %name ) { next if ! $name{$person}[$index]; if ( $name{$person}[$index] =~ /(\w+)\w/ ) { my $new_name = $1; if ( ! grep { $new_name eq $_ } map { $name{$_}[$index] } grep { @{$name{$_}} == @{$name{$person}} } keys %name ) { $update = 1; $name{$person}[$index] = $new_name; } } } $index++ if ! $update; $finished = 1 if $index > $num_names; } print "$_ => @{ $name{$_} }\n" for keys %name; [download] Cheers - L~R	[reply] [d/l]
Re: Re: Re: Re: Generate unique initials from a list of names by EdwardG (Vicar) on Mar 12, 2004 at 20:33 UTC
Re: Re: Re: Re: Re: Generate unique initials from a list of names by Limbic~Region (Chancellor) on Mar 12, 2004 at 21:00 UTC
Some notes below your chosen depth have not been shown here
Re: Generate unique initials from a list of names by BrowserUk (Patriarch) on Mar 13, 2004 at 03:58 UTC
This code is short, though it loops a bit, and it produces unique, logical (though not necessarily obvious), short abbreviations. #! perl -slw use strict; sub mapNM (&@) { my $code = shift; map{ local( a, b ) = \( @_[ 0, 1 ] ); $code->(shift); } 0 .. @_ - 2 } sub strcmp{ my( $p, $b ) = (0) x 2; $p++ until $b = substr( $_[ 0 ], $p, 1 ) cmp substr( $_[ 1 ], $p, 1 ); $p * $b; }; my @names = sort map{ join' ', map{ $_ = ucfirst } split '[^a-z0-9]+', lc } <DATA>; print "@$_" for mapNM{ if( $a->[ 0 ] eq $b->[ 0 ] ) { my $n = strcmp $a->[ 1 ], $b->[ 1 ]; if( $n ) { $a->[ 0 ] .= substr( $a->[ 1 ], abs $n, 1 ); $b->[ 0 ] .= substr( $b->[ 1 ], abs $n, 1 ); } } $a; } sort{ $a->[ 0 ] cmp $b->[ 0 ] } map{ [ join('', m[([A-Z])]g), $_ ] } @names; __DATA__ A Robertson A Robinson A A Milne A A Milner [SNIP] [download] Output Read more... (4 kB) Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l] [select]
Re: Re: Generate unique initials from a list of names by EdwardG (Vicar) on Mar 13, 2004 at 10:39 UTC
Your code certainly looks nicer than my attempt, nicer even than L~R's attempt, but it doesn't work in all cases. `__DATA__ A Remus A Ripper A Robertson A Robinson A Rupee d:\tmp>336330.pl ARe A Remus ARi A Ripper ARe A Robertson ARi A Robinson` [download] ...some duplicates, and what's happened to Rupee?	[reply] [d/l]
Re: Re: Re: Generate unique initials from a list of names by BrowserUk (Patriarch) on Mar 13, 2004 at 11:30 UTC
This fixes both those bugs, but there may be other edge cases. The fudge wasn't necessary once I fixed the real problem:) ~~Yes. 'ZZZZZZZZZZZ' is a fudge, but I am re-using mapNM which is a utility routine that wasn't specifically design for this case and this is the 'cheap fix' to flush the pump.~~ #! perl -slw use strict; sub mapNM (&@) { my $code = shift; map{ local( a, b ) = \( @_[ 0, 1 ] ); $code->(shift); # } 0 .. @_ - 2 } 0 .. @_ - 1 } sub strcmp{ my( $p, $b ) = ( $_[2]\|\|0, 0 ); $p++ until $b = substr( $_[ 0 ], $p, 1 ) cmp substr( $_[ 1 ], $p, 1 ); $p * $b; }; my @names = sort map{ join' ', map{ $_ = ucfirst } split '[^a-z0-9]+', + lc } <DATA>; my %abbrev; print "@$_" for mapNM{ if( defined $b ) { my $n = 0; while( exists $abbrev{ $a->[ 0 ] } or $a->[ 0 ] eq $b->[ 0 ] ) { $n = strcmp $a->[ 1 ], $b->[ 1 ], abs($n)+1; if( $n ) { $a->[ 0 ] .= substr( $a->[ 1 ], abs $n, 1 ); $b->[ 0 ] .= substr( $b->[ 1 ], abs $n, 1 ); } } } $abbrev{ $a->[ 0 ] } = undef; $a } sort{ $a->[ 0 ] cmp $b->[ 0 ] } map{ [ join('', m[([A-Z])]g), $_ ] #} @names, 'ZZZZZZZZZZZZ'; } @names; [download] Read more... (4 kB) Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l] [select]
Re: Generate unique initials from a list of names by QM (Parson) on Mar 12, 2004 at 22:23 UTC
Would it be useful to use initials with trailing digits, and have a legend at the end of (each page of) the report? Something like: `Date Range Level Consumers -------------------------- ------- -------------------------- 13 Mar 2004 to 23 Mar 2004 7 AB, AT, DA, MS, RI, RR1, RR2 24 Mar 2004 to 31 Mar 2004 6 AB, DA, RC, RI, RR1, SJ 01 Apr 2004 to 15 May 2004 3 Alan Po, Rohan Ito, Sam Jones 16 May 2004 to 31 Dec 2004 5 AP, PS, RC, RI, SJ --------------------- RR1: Ron Richards RR2: Renee Rotunda` [download] And what happens if you have 2 "John J. Smith"s? Do you append their employee numbers? -QM -- Quantum Mechanics: The dreams stuff is made of	[reply] [d/l]
Re: Re: Generate unique initials from a list of names by EdwardG (Vicar) on Mar 12, 2004 at 23:18 UTC
For this report, it is a given that the names are unique. And your idea of using numbers is of course workable, only it seems less elegant than trying for unique initials. As I mentioned in the OP, I considered just using numbers but didn't want the additional list. Perhaps unrealistic, but if I just wanted a practical solution I'd probably be using MS Project for resource levelling rather than a perl hack :)	[reply]
Re: Generate unique initials from a list of names by ambrus (Abbot) on Mar 12, 2004 at 17:19 UTC
I've written a slightly related piece of code (not in Perl). It was a patch for bash with which you can display the working directory in the prompt abbreviated, but so that the abbreviation is still unique (unless the file system has changed since the abbrev is computed). More information here: http://www.math.bme.hu/~ambrus/pu/compact-pwd However, this one creates abbreviations a different way, from A A Milne and A A Milner, it will print A and Ar resp. (I originally wrote this patch because of the silly long directory names of Mathematica.)	[reply]

Back to Seekers of Perl Wisdom

If fitting data onto one line is essential

Consider multi-line output