Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Generate unique initials from a list of names

by EdwardG (Vicar)
on Mar 12, 2004 at 12:23 UTC ( #336125=perlquestion: print w/ replies, xml ) Need Help??
EdwardG has asked for the wisdom of the Perl Monks concerning the following question:

Recently I had the need to produce a report of a certain resource level over time. The basic report looked like this -

Date Range Level Consumers -------------------------- ------- -------------------------- 13 Mar 2004 to 23 Mar 2004 7 AB, AT, DA, MS, RI, RR, RR 24 Mar 2004 to 31 Mar 2004 6 AB, DA, RC, RI, RR, SJ 01 Apr 2004 to 15 May 2004 3 Alan Po, Rohan Ito, Sam Jones 16 May 2004 to 31 Dec 2004 5 AP, PS, RC, RI, SJ

What this shows is that for each listed _date_range_, we require _level_ items, and that these items are to be allocated to the _consumer_ as identified by the listed names. Where the list of names is too long to fit neatly on one line, I programmatically shorten the name to just the initials to try keep it from wrapping or truncating. Incidentally, this report has proven very usful as it shows if we will exceed or have spare capacity in the known future.

But this node is not about that report, it is about a frustrating little problem (did you spot it?) that eventually resulted in the code below.

I noticed that I had neglected to cater for two or more people having the same initials. In fact my code to generate intials from names was simply

$init{$_} = join '', ($_ =~ /\b\w/g) for @names;

People with the same initials make the report ambiguous, although this doesn't matter if they are both included in a single date range. But when a single "RR" appears in a date range, as in the second line of the report, it is ambiguous.

Now this was not a big problem since the report was more about the resource level than about the resource consumers, but naively I thought it was a simple fix.

My first lazy attempt at distinguishing derived initials was to detect when the derived initials were already used, and in that case add more letters from the name (iteratively) until the second (etc) set of derived initials were unique. This resulted in a report that looked like this -

Date Range Level Consumers -------------------------- ------- -------------------------- 13 Mar 2004 to 23 Mar 2004 7 AB, AT, DA, MS, RI, RR, RaRa 24 Mar 2004 to 31 Mar 2004 6 AB, DA, RC, RI, RaRa, SJ 01 Apr 2004 to 15 May 2004 3 Alan Po, Rohan Ito, Sam Jones 16 May 2004 to 31 Dec 2004 5 AP, PS, RC, RI, SJ

Notice "RaRa" is now used for the second person with initials of "RR". I can now see in the second line of the report, it is actually the second "RR" person who needs the resource, not the first like I could have mistakenly assumed.

This was an improvement, and in fact I left the report in this state since a) it was only for my information, b) it was good enough, and c) I had more important things to do.

At this point I could have abandoned names and maybe used numbers to identify people, but I avoided that option for two main reasons:

  1. I didn't want the maintenance hassle of a new item of data (where would these "id numbers" be stored? How would I maintain them?)
  2. I really wanted the report to tell me names without referring to yet another list. (The report was supposed to be simple, dammit)

Fast forward several months, I'm running this report again - by now it has many more people on the list - and once again I notice the ambiguity. With more time to spend on the problem, I dived in.

To cut the potentially long story short, I've now got the following code that takes a list of names and returns a hashref of name => initials.

Example:

Alphonse Romeo-Smith => ARS Big Al Bundy => BAB Dilbert => D I Palindrome I => IPI James O'Leary => JOL Jimmy O'Brien => JOB John Smith => JS K9 => K Medhi Majesh => MedMaj Mickey Mouse => MicMou Minnie Mouse => MinMou Tiny Tim => TT Tom Demarco => TomDem Tom Denada => TomDen

Notice the additional characters used to distinguish Mickey, Minnie, and Medhi

For names that are nearly identical, this doesn't work so well . o O(but what can I do?)

A A Milne => AAMilne A A Milner => AAMilner

With some more work, and this is the real reason I'm posting it, I wonder if it could become a module fit for sharing.

So with that in mind, I hereby invite your thoughts on the sensibility of what I'm trying to do, and the code itself (bugs or inefficiencies), and also perhaps you can think of names I haven't catered for. I'm already aware that this code would not be suitable for a long list of names (like a telephone book), but as it stands I find it useful for up to say several hundred names.

In case anyone is wondering, I got the long list of names from a Random Name Generator.

Here is the code. (I'm going out on a limb here, I expect someone will post an embarassing 80 character golfed version within minutes :)

use strict; use warnings; MAIN: { my @names = <DATA>; chomp @names; my $initref = &DistinctInitials( @names ); # Sorted by Name print "$_ => $$initref{$_}\n" for (sort keys %$initref); # Sorted by Initials for my $init (sort values %$initref) { print "$init => $_\n" for (grep {$$initref{$_} eq $init} keys +%$initref); } } exit; sub DistinctInitials { # Derive unique "initials" for each name in a list # 1. Get simple initials using the first letter of each word in na +me # 2. For non-unique initials, insert a sufficient quantity of lett +ers from # the original name to make the initial unique among all initia +ls my @names = @_; # A copy to be modified # capitalise words s/\b(\w)(\w+)\b/\u$1\L$2\E/g for @names; # remove duplicates my %seen; @names = grep { ! $seen{$_}++ } @names; # derive initial of each name my %init; $init{$_} = join '', ($_ =~ /\b\w/g) for @names; # identify non-unique initials my %nonu; for my $nonu_v (grep {$seen{$_}++} values %init) { map {push @{$nonu{$nonu_v}},$_ } grep {$init{$_} eq $nonu_v} k +eys %init; } # remove duplicates %seen=(); @{$nonu{$_}} = grep { ! $seen{$_} ++ } @{$nonu{$_}} for keys %nonu +; for my $init (keys %nonu) { # determine minimum additional characters from each # name that make all these initials unique my $c = 2; # start with one extra character my $notunique = 1; while ($notunique) { my %tryuniq; for my $name (@{$nonu{$init}}) { $tryuniq{ join '', ($name =~ /\b\w{1,$c}/g) } = $name; + # greedy {1,$c} } if (scalar keys %tryuniq == scalar @{$nonu{$init}}) { undef $notunique; # success, all initials are now uniq +ue # update the initial hash with our newfound initials while (my ($tryinit,$tryname) = each %tryuniq) { $init{$tryname} = $tryinit; } } else { # failed to make them all unique $c += 1; } } } \%init; } # Hand-crafted test names follows... __DATA__ John Smith Medhi Majesh Mickey Mouse Minnie Mouse A A Milne A A Milne A A Milner Jimmy O'Brien James O'Leary Dilbert Alphonse Romeo-Smith I Palindrome I tiny tim BIG AL BUNDY Tom DeMarco Tom DeNada K9

And here is further list of names I used for testing. As I mentioned, these are random names, any similarity to persons alive or dead is purely coincidental :)

Adam Harper Alan Pfaff Alan Vanwinkle Alana Stutsman Albert Osterman Alison Sands Allan Goris Allan Razor Alma Ratcliffe Alvaro Cangemi Amanda Basile Annabelle Ducker Anthony Granado Anthony Mcdevitt Antonio Starnes April Kain Arlene Stephen Armand Stower Arthur Ragsdale Ashlee Hurtt Ballentine Beata Zar Bernice Turnage Bertha Schuck Billy Sundberg Binns Blanche Page Bobby Eaton Bobby Winkelman Brandon Dominick Brittanie Noviello Brittian Bulah Twombley Candace Cote Carey Balck Carey Grishaber Carey Twedell Chad Guarino Chandra Puffer Cheryl Langley Chris Creekmore Clare Milum Clayton Elbert Clinton Dahmen Clinton Ganley Clinton Mcnaught Cody Ridout Cody Vigo Dale Hernandes Dane Hasychak Daniel Colunga Darren Tacey Darren Vanaman Darryl Buchta Debbie Schlueter Doggett Doris Adkins Eileen Bard Elinor Rita Elizabeth Hendricks Elnora Hornbuckle Elwood Steketee Ericka Hodgin Erik Duley Ernest Cowley Ernest Frey Ernest Helmick Esmeralda Kardos Eve Goodfellow Ewa Villerreal Felicia Manson Fernando Outler Fidel Jacoway Fidel Paone Florence Lillie Fred Pyles Genna Tranbarger George Lamoureux Gerald Graziano Gerald Minnich Gina Nickelson Gladys Rothman Glenda Wheat Gregory Walls Guinn Guy Gatton Hannah Quijano Harold Houser Harriet Dreyer Heinricher Howard Sirois Hugh Charland Hugh Dipalma Hugh Fuselier Hugh Minaya Ida Noe Inez Marion James Angell James Cravens Jami Gula Jamie Brickhouse Jamie Huot Jason Willson Jeannette Macaluso Jerri Giesen Jerri Houseknecht Jerri Nickson Jessie Olah Ji Greenier Jonathan Matos Joyce Brian Julio Sepeda Karina Palka Katherine Holzer Kathryn Diaz Katie Lloyd Keila Brue Keith Berner Kelly Henegar Kenneth Caraballo Kevin Batchelder Kevin Poindexter Kimberly Cecil Kimberly Hare Kingsbury Koteles Kristina Kasper Lakisha Trees Larita Battisti Laura Mcfadden Laurie Carranza Lilia Fouse Linda Kavanaugh Linda Power Lonnie Sherrow Lori Stjohn Lorrie Lobdell Louis Leggett Louis Watterson Louisa Galyean Louisa Swingle Louise Belanger Lynda Knudsen Madelene Robasciotti Mae Brantley Maggie Fernandez Maggie Madore Mallory Brodbeck Mallory Sickels Manuel Trotter Marcel Avers Marcel Cuffia Marcel Merow Margaret Amey Maricela Bautch Marilyn Shumpert Mark Haynie Marlin Gryniuk Martin King Mathew Mayton Max Sprau Merrilee Garelick Mildred Colburn Ming Negro Nathan Shuster Neil Helbert Nelson Pinzon Nelson Rayo Nicholas Lowell Nicole Tillman Norman Thatcher Olivia Bridgeman Peggy Loggins Penelope Maize Peter Agee Peter Valadez Priscilla Katz Rachel Marchand Randy Frederickson Raphael Dumaine Raymond Nava Robyn Duvall Rod Schoneman Roger Goldman Ronald Looney Roy Kushner Royce Schrock Scott Obryan Shawn Ault Stanley Cleaver Sydow Sylvia Salazar Sylvia Walther Tabatha Goodsell Tari Windish Tarra Bellantuono Thomas Coleman Tia Drakeford Tina Palomo Tompkins Tony Haight Tyrone Crossett Tyrone Golub Victor Mcduffie Vincent Nowak Viola Mcnamee Wayne Maye Wayne Ulrich Yolanda Hood

Comment on Generate unique initials from a list of names
Select or Download Code
Re: Generate unique initials from a list of names
by TomDLux (Vicar) on Mar 12, 2004 at 13:45 UTC

    If fitting data onto one line is essential

    In your situation, I would look into Soundex and similar name encoding algorithms.

    Normally they are used to match up variant spellings of names, but since they convert a name to a code, yet still have a chancee of recognizing the name from the code.


    Consider multi-line output

    Alternatly, you could look into using format and turning the Consumer section into a multi-line component:

    Date Range Level Consumers -------------------------- ------- ---------------------------------- 13 Mar 2004 to 23 Mar 2004 7 John Wayne, John Dillinger, George W. Bush, Federico Felini, Ian Sutherland, Katherine Hepburn, + Madonna

    No messy encodings, no problems figuring out the identities of the people in the Consumers list, only a minor inconvenience learning to use a format.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      The multi-line approach is probably my most realistic fallback, and I must confess that this has (for me) turned into a bit of an academic exercise, just for fun.

      I did briefly consider a soundex encoding (metaphone actually), but I couldn't find a suitable encoding. Text::Soundex gives results like 'E460' and 'L222', and Text::Metaphone gives results that are in some cases worse than my Initials hack -

      D:\tmp>cat metaphone.pl use Text::Metaphone; print $_,' => ', Metaphone($_),"\n" for map {chomp;$_} <DATA>; __DATA__ Stephen Thrasher Steve Trasher D:\tmp>metaphone.pl Stephen Thrasher => STFN0RXR Steve Trasher => STFTRXR D:\tmp>initials.pl Stephen Thrasher => StTh Steve Trasher => StTr D:\tmp>
Re: Generate unique initials from a list of names
by Limbic~Region (Chancellor) on Mar 12, 2004 at 13:48 UTC
    EdwardG,
    Do you mean something like this:
    #!/usr/bin/perl use strict; use warnings; my %name = map {chomp; $_ => undef} <DATA>; for my $person ( keys %name ) { my $p_case = $person; $p_case =~ s/(\w+)/\u$1/g; $name{$p_case} = delete $name{$person}; } my $finished = 0; while ( ! $finished ) { my $update = 0; for my $person ( keys %name ) { my @shorter = split " " , $person; $_ =~ s/(\w+)\w/$1/ for @shorter; my $new_name = join " " , @shorter; if ( ! exists $name{$new_name} ) { $update = 1; $name{$new_name} = delete $name{$person}; } } $finished = 1 if ! $update; }
    Cheers - L~R

      Nice and short :)

      But I'm uncertain it works as I intend. For example, try these names:

      __DATA__ Adam Harper Alan Harper

      With your code I get this:

      d:\tmp>test.pl $VAR1 = { 'A H' => undef, 'A Ha' => undef };

      Both of these are ambiguous.

        EdwardG,
        You are right - I only had a minimal chance to test it. I think it works to your satisfacation. Sorry it took a while to get back to you - stupid thing called work ;-)
        #!/usr/bin/perl use strict; use warnings; my %name = map {chomp; $_ => undef} <DATA>; my $num_names = 0; for my $person ( keys %name ) { my @names = split " " , $person; $num_names = @names if @names > $num_names; $_ = ucfirst lc $_ for @names; $name{ $person } = \@names; } my $finished = 0; my $index = 0; while ( ! $finished ) { my $update = 0; for my $person ( keys %name ) { next if ! $name{$person}[$index]; if ( $name{$person}[$index] =~ /(\w+)\w/ ) { my $new_name = $1; if ( ! grep { $new_name eq $_ } map { $name{$_}[$index] } grep { @{$name{$_}} == @{$name{$person}} } keys %name ) { $update = 1; $name{$person}[$index] = $new_name; } } } $index++ if ! $update; $finished = 1 if $index > $num_names; } print "$_ => @{ $name{$_} }\n" for keys %name;
        Cheers - L~R
Re: Generate unique initials from a list of names
by ambrus (Abbot) on Mar 12, 2004 at 17:19 UTC

    I've written a slightly related piece of code (not in Perl). It was a patch for bash with which you can display the working directory in the prompt abbreviated, but so that the abbreviation is still unique (unless the file system has changed since the abbrev is computed).

    More information here: http://www.math.bme.hu/~ambrus/pu/compact-pwd

    However, this one creates abbreviations a different way, from A A Milne and A A Milner, it will print A and Ar resp.

    (I originally wrote this patch because of the silly long directory names of Mathematica.)

Re: Generate unique initials from a list of names
by QM (Vicar) on Mar 12, 2004 at 22:23 UTC
    Would it be useful to use initials with trailing digits, and have a legend at the end of (each page of) the report?

    Something like:

    Date Range Level Consumers -------------------------- ------- -------------------------- 13 Mar 2004 to 23 Mar 2004 7 AB, AT, DA, MS, RI, RR1, RR2 24 Mar 2004 to 31 Mar 2004 6 AB, DA, RC, RI, RR1, SJ 01 Apr 2004 to 15 May 2004 3 Alan Po, Rohan Ito, Sam Jones 16 May 2004 to 31 Dec 2004 5 AP, PS, RC, RI, SJ --------------------- RR1: Ron Richards RR2: Renee Rotunda
    And what happens if you have 2 "John J. Smith"s? Do you append their employee numbers?

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      For this report, it is a given that the names are unique.

      And your idea of using numbers is of course workable, only it seems less elegant than trying for unique initials. As I mentioned in the OP, I considered just using numbers but didn't want the additional list.

      Perhaps unrealistic, but if I just wanted a practical solution I'd probably be using MS Project for resource levelling rather than a perl hack :)

Re: Generate unique initials from a list of names
by BrowserUk (Pope) on Mar 13, 2004 at 03:58 UTC

    This code is short, though it loops a bit, and it produces unique, logical (though not necessarily obvious), short abbreviations.

    #! perl -slw use strict; sub mapNM (&@) { my $code = shift; map{ local( *a, *b ) = \( @_[ 0, 1 ] ); $code->(shift); } 0 .. @_ - 2 } sub strcmp{ my( $p, $b ) = (0) x 2; $p++ until $b = substr( $_[ 0 ], $p, 1 ) cmp substr( $_[ 1 ], $p, 1 ); $p * $b; }; my @names = sort map{ join' ', map{ $_ = ucfirst } split '[^a-z0-9]+', lc } <DATA>; print "@$_" for mapNM{ if( $a->[ 0 ] eq $b->[ 0 ] ) { my $n = strcmp $a->[ 1 ], $b->[ 1 ]; if( $n ) { $a->[ 0 ] .= substr( $a->[ 1 ], abs $n, 1 ); $b->[ 0 ] .= substr( $b->[ 1 ], abs $n, 1 ); } } $a; } sort{ $a->[ 0 ] cmp $b->[ 0 ] } map{ [ join('', m[([A-Z])]g), $_ ] } @names; __DATA__ A Robertson A Robinson A A Milne A A Milner [SNIP]

    Output


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

      Your code certainly looks nicer than my attempt, nicer even than L~R's attempt, but it doesn't work in all cases.

      __DATA__ A Remus A Ripper A Robertson A Robinson A Rupee d:\tmp>336330.pl ARe A Remus ARi A Ripper ARe A Robertson ARi A Robinson

      ...some duplicates, and what's happened to Rupee?

        This fixes both those bugs, but there may be other edge cases.

        The fudge wasn't necessary once I fixed the real problem:) Yes. 'ZZZZZZZZZZZ' is a fudge, but I am re-using mapNM which is a utility routine that wasn't specifically design for this case and this is the 'cheap fix' to flush the pump.

        #! perl -slw use strict; sub mapNM (&@) { my $code = shift; map{ local( *a, *b ) = \( @_[ 0, 1 ] ); $code->(shift); # } 0 .. @_ - 2 } 0 .. @_ - 1 } sub strcmp{ my( $p, $b ) = ( $_[2]||0, 0 ); $p++ until $b = substr( $_[ 0 ], $p, 1 ) cmp substr( $_[ 1 ], $p, 1 ); $p * $b; }; my @names = sort map{ join' ', map{ $_ = ucfirst } split '[^a-z0-9]+', + lc } <DATA>; my %abbrev; print "@$_" for mapNM{ if( defined $b ) { my $n = 0; while( exists $abbrev{ $a->[ 0 ] } or $a->[ 0 ] eq $b->[ 0 ] ) { $n = strcmp $a->[ 1 ], $b->[ 1 ], abs($n)+1; if( $n ) { $a->[ 0 ] .= substr( $a->[ 1 ], abs $n, 1 ); $b->[ 0 ] .= substr( $b->[ 1 ], abs $n, 1 ); } } } $abbrev{ $a->[ 0 ] } = undef; $a } sort{ $a->[ 0 ] cmp $b->[ 0 ] } map{ [ join('', m[([A-Z])]g), $_ ] #} @names, 'ZZZZZZZZZZZZ'; } @names;

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://336125]
Approved by b10m
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (4)
As of 2014-10-01 11:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    What is your favourite meta-syntactic variable name?














    Results (8 votes), past polls