Recently I had the need to produce a report of a certain resource level over time. The basic report looked like this -
Date Range Level Consumers
-------------------------- ------- --------------------------
13 Mar 2004 to 23 Mar 2004 7 AB, AT, DA, MS, RI, RR, RR
24 Mar 2004 to 31 Mar 2004 6 AB, DA, RC, RI, RR, SJ
01 Apr 2004 to 15 May 2004 3 Alan Po, Rohan Ito, Sam Jones
16 May 2004 to 31 Dec 2004 5 AP, PS, RC, RI, SJ
What this shows is that for each listed _date_range_, we require _level_ items, and that these items are to be allocated to the _consumer_ as identified by the listed names. Where the list of names is too long to fit neatly on one line, I programmatically shorten the name to just the initials to try keep it from wrapping or truncating. Incidentally, this report has proven very usful as it shows if we will exceed or have spare capacity in the known future.
But this node is not about that report, it is about a frustrating little problem (did you spot it?) that eventually resulted in the code below.
I noticed that I had neglected to cater for two or more people having the same initials. In fact my code to generate intials from names was simply
$init{$_} = join '', ($_ =~ /\b\w/g) for @names;
People with the same initials make the report ambiguous, although this doesn't matter if they are both included in a single date range. But when a single "RR" appears in a date range, as in the second line of the report, it is ambiguous.
Now this was not a big problem since the report was more about the resource level than about the resource consumers, but naively I thought it was a simple fix.
My first lazy attempt at distinguishing derived initials was to detect when the derived initials were already used, and in that case add more letters from the name (iteratively) until the second (etc) set of derived initials were unique. This resulted in a report that looked like this -
Date Range Level Consumers
-------------------------- ------- --------------------------
13 Mar 2004 to 23 Mar 2004 7 AB, AT, DA, MS, RI, RR, RaRa
24 Mar 2004 to 31 Mar 2004 6 AB, DA, RC, RI, RaRa, SJ
01 Apr 2004 to 15 May 2004 3 Alan Po, Rohan Ito, Sam Jones
16 May 2004 to 31 Dec 2004 5 AP, PS, RC, RI, SJ
Notice "RaRa" is now used for the second person with initials of "RR". I can now see in the second line of the report, it is actually the second "RR" person who needs the resource, not the first like I could have mistakenly assumed.
This was an improvement, and in fact I left the report in this state since a) it was only for my information, b) it was good enough, and c) I had more important things to do.
At this point I could have abandoned names and maybe used numbers to identify people, but I avoided that option for two main reasons:
- I didn't want the maintenance hassle of a new item of data (where would these "id numbers" be stored? How would I maintain them?)
- I really wanted the report to tell me names without referring to yet another list. (The report was supposed to be simple, dammit)
Fast forward several months, I'm running this report again - by now it has many more people on the list - and once again I notice the ambiguity. With more time to spend on the problem, I dived in.
To cut the potentially long story short, I've now got the following code that takes a list of names and returns a hashref of name => initials.
Example:
Alphonse Romeo-Smith => ARS
Big Al Bundy => BAB
Dilbert => D
I Palindrome I => IPI
James O'Leary => JOL
Jimmy O'Brien => JOB
John Smith => JS
K9 => K
Medhi Majesh => MedMaj
Mickey Mouse => MicMou
Minnie Mouse => MinMou
Tiny Tim => TT
Tom Demarco => TomDem
Tom Denada => TomDen
Notice the additional characters used to distinguish Mickey, Minnie, and Medhi
For names that are nearly identical, this doesn't work so well . o O(but what can I do?)
A A Milne => AAMilne
A A Milner => AAMilner
With some more work, and this is the real reason I'm posting it, I wonder if it could become a module fit for sharing.
So with that in mind, I hereby invite your thoughts on the sensibility of what I'm trying to do, and the code itself (bugs or inefficiencies), and also perhaps you can think of names I haven't catered for. I'm already aware that this code would not be suitable for a long list of names (like a telephone book), but as it stands I find it useful for up to say several hundred names.
In case anyone is wondering, I got the long list of names from a Random Name Generator.
Here is the code. (I'm going out on a limb here, I expect someone will post an embarassing 80 character golfed version within minutes :)
use strict;
use warnings;
MAIN: {
my @names = <DATA>;
chomp @names;
my $initref = &DistinctInitials( @names );
# Sorted by Name
print "$_ => $$initref{$_}\n" for (sort keys %$initref);
# Sorted by Initials
for my $init (sort values %$initref) {
print "$init => $_\n" for (grep {$$initref{$_} eq $init} keys
+%$initref);
}
}
exit;
sub DistinctInitials {
# Derive unique "initials" for each name in a list
# 1. Get simple initials using the first letter of each word in na
+me
# 2. For non-unique initials, insert a sufficient quantity of lett
+ers from
# the original name to make the initial unique among all initia
+ls
my @names = @_; # A copy to be modified
# capitalise words
s/\b(\w)(\w+)\b/\u$1\L$2\E/g for @names;
# remove duplicates
my %seen;
@names = grep { ! $seen{$_}++ } @names;
# derive initial of each name
my %init;
$init{$_} = join '', ($_ =~ /\b\w/g) for @names;
# identify non-unique initials
my %nonu;
for my $nonu_v (grep {$seen{$_}++} values %init) {
map {push @{$nonu{$nonu_v}},$_ } grep {$init{$_} eq $nonu_v} k
+eys %init;
}
# remove duplicates
%seen=();
@{$nonu{$_}} = grep { ! $seen{$_} ++ } @{$nonu{$_}} for keys %nonu
+;
for my $init (keys %nonu) {
# determine minimum additional characters from each
# name that make all these initials unique
my $c = 2; # start with one extra character
my $notunique = 1;
while ($notunique) {
my %tryuniq;
for my $name (@{$nonu{$init}}) {
$tryuniq{ join '', ($name =~ /\b\w{1,$c}/g) } = $name;
+ # greedy {1,$c}
}
if (scalar keys %tryuniq == scalar @{$nonu{$init}}) {
undef $notunique; # success, all initials are now uniq
+ue
# update the initial hash with our newfound initials
while (my ($tryinit,$tryname) = each %tryuniq) {
$init{$tryname} = $tryinit;
}
} else {
# failed to make them all unique
$c += 1;
}
}
}
\%init;
}
# Hand-crafted test names follows...
__DATA__
John Smith
Medhi Majesh
Mickey Mouse
Minnie Mouse
A A Milne
A A Milne
A A Milner
Jimmy O'Brien
James O'Leary
Dilbert
Alphonse Romeo-Smith
I Palindrome I
tiny tim
BIG AL BUNDY
Tom DeMarco
Tom DeNada
K9
And here is further list of names I used for testing. As I mentioned, these are random names, any similarity to persons alive or dead is purely coincidental :)
Adam Harper
Alan Pfaff
Alan Vanwinkle
Alana Stutsman
Albert Osterman
Alison Sands
Allan Goris
Allan Razor
Alma Ratcliffe
Alvaro Cangemi
Amanda Basile
Annabelle Ducker
Anthony Granado
Anthony Mcdevitt
Antonio Starnes
April Kain
Arlene Stephen
Armand Stower
Arthur Ragsdale
Ashlee Hurtt
Ballentine
Beata Zar
Bernice Turnage
Bertha Schuck
Billy Sundberg
Binns
Blanche Page
Bobby Eaton
Bobby Winkelman
Brandon Dominick
Brittanie Noviello
Brittian
Bulah Twombley
Candace Cote
Carey Balck
Carey Grishaber
Carey Twedell
Chad Guarino
Chandra Puffer
Cheryl Langley
Chris Creekmore
Clare Milum
Clayton Elbert
Clinton Dahmen
Clinton Ganley
Clinton Mcnaught
Cody Ridout
Cody Vigo
Dale Hernandes
Dane Hasychak
Daniel Colunga
Darren Tacey
Darren Vanaman
Darryl Buchta
Debbie Schlueter
Doggett
Doris Adkins
Eileen Bard
Elinor Rita
Elizabeth Hendricks
Elnora Hornbuckle
Elwood Steketee
Ericka Hodgin
Erik Duley
Ernest Cowley
Ernest Frey
Ernest Helmick
Esmeralda Kardos
Eve Goodfellow
Ewa Villerreal
Felicia Manson
Fernando Outler
Fidel Jacoway
Fidel Paone
Florence Lillie
Fred Pyles
Genna Tranbarger
George Lamoureux
Gerald Graziano
Gerald Minnich
Gina Nickelson
Gladys Rothman
Glenda Wheat
Gregory Walls
Guinn
Guy Gatton
Hannah Quijano
Harold Houser
Harriet Dreyer
Heinricher
Howard Sirois
Hugh Charland
Hugh Dipalma
Hugh Fuselier
Hugh Minaya
Ida Noe
Inez Marion
James Angell
James Cravens
Jami Gula
Jamie Brickhouse
Jamie Huot
Jason Willson
Jeannette Macaluso
Jerri Giesen
Jerri Houseknecht
Jerri Nickson
Jessie Olah
Ji Greenier
Jonathan Matos
Joyce Brian
Julio Sepeda
Karina Palka
Katherine Holzer
Kathryn Diaz
Katie Lloyd
Keila Brue
Keith Berner
Kelly Henegar
Kenneth Caraballo
Kevin Batchelder
Kevin Poindexter
Kimberly Cecil
Kimberly Hare
Kingsbury
Koteles
Kristina Kasper
Lakisha Trees
Larita Battisti
Laura Mcfadden
Laurie Carranza
Lilia Fouse
Linda Kavanaugh
Linda Power
Lonnie Sherrow
Lori Stjohn
Lorrie Lobdell
Louis Leggett
Louis Watterson
Louisa Galyean
Louisa Swingle
Louise Belanger
Lynda Knudsen
Madelene Robasciotti
Mae Brantley
Maggie Fernandez
Maggie Madore
Mallory Brodbeck
Mallory Sickels
Manuel Trotter
Marcel Avers
Marcel Cuffia
Marcel Merow
Margaret Amey
Maricela Bautch
Marilyn Shumpert
Mark Haynie
Marlin Gryniuk
Martin King
Mathew Mayton
Max Sprau
Merrilee Garelick
Mildred Colburn
Ming Negro
Nathan Shuster
Neil Helbert
Nelson Pinzon
Nelson Rayo
Nicholas Lowell
Nicole Tillman
Norman Thatcher
Olivia Bridgeman
Peggy Loggins
Penelope Maize
Peter Agee
Peter Valadez
Priscilla Katz
Rachel Marchand
Randy Frederickson
Raphael Dumaine
Raymond Nava
Robyn Duvall
Rod Schoneman
Roger Goldman
Ronald Looney
Roy Kushner
Royce
Schrock
Scott Obryan
Shawn Ault
Stanley Cleaver
Sydow
Sylvia Salazar
Sylvia Walther
Tabatha Goodsell
Tari Windish
Tarra Bellantuono
Thomas Coleman
Tia Drakeford
Tina Palomo
Tompkins
Tony Haight
Tyrone Crossett
Tyrone Golub
Victor Mcduffie
Vincent Nowak
Viola Mcnamee
Wayne Maye
Wayne Ulrich
Yolanda Hood