regex: seperating parts of non-formatted names

emilford has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am trying to write a script to help out a co-worker who has not yet been enlightened about the power of Perl. I already managed to impress her when I took 5 minutes to write a script that ran for 30 seconds and saved her at least an hour of work. She has a database full of names that follow no specific format, which she needs to separate into:

1) title
2) first name
3) middle initial
4) last name

Some might have all this information, some might not.

I know that this is feasible with a fairly complex regex, which is where I'm running into some problems. I'm sure I could put something together that would work fairly well, but I want to try and write code that will perform appropriately for all cases.

To show that I'm not just asking you guys to solve my problem, I have come up with some ideas that I think need to be incorporated into the regex.

There are multiple possible titles (e.g. LTC, COL, DR, MS, MR, MISS, etc.). Instead of having a long regex testing LTC|DR|MS|MR, would it be possible to toss them into an array and have a portion of the regex be executed code that iterates through each possibility in the array and returns the match? That way, as new titles come up, they can easily be added.
The different parts of the name are separated mostly by spaces: the middle initial could be grabbed with (\w\.), and the first and last names could be grabbed by matching \w between the spaces. Is there a better approach?
Some entries consist of only a last name; a special case for these would lessen the complexity of the regex.

Here's an example of what I'm looking for. Say I had the following names:

Frederick H. Jones
Dr. James T. Taylor
Dr. Mat L. R. Michaels

I'd want to be able to separate this into:
(< > marks chunk tossed into variable)

<Frederick> <H.> <Jones>
<Dr.> <James> <T.> <Taylor>
<Dr.> <Mat> <L. R.> <Michaels>

I'm going to start working on this regex and toy around with different ideas. I'll post what I have completed every so often, but any feedback, ideas, suggestions, code would be appreciated.

Thanks in advance,
Eric


Replies are listed 'Best First'.

Re: regex: seperating parts of non-formatted names
by dws (Chancellor) on Sep 09, 2002 at 18:23 UTC

She has a database full of names that follow no specific format, that she needs to separate down to [title, first name, middle initial, last name]

Welcome to the administrative sub-basement of hell. Depending on the names your friend has to deal with, you might discover that a regex can handle 98%, but that the remaining 2% will cause you to run screaming into the night.

Consider my dear friend Lt. Col. J. Random von Perl-Hacker III. By the scheme your friend is using, Randy's name needs to reduce to <Lt. Col.> <J.> <R.> <von Perl-Hacker>. (And it isn't immediately clear what to do with the "III".) In any large set of unstructured names, you're going to run into a few like this. Good luck handling them with a single regexp.

I think you'll have better luck breaking the name into tokens, providing predicate functions that answer whether a token can be of a particular type, then providing a set of "rules" to match a set of tokens against. This will be slower, but potentially much more accurate, than a regex.


Re: Re: regex: seperating parts of non-formatted names

by jkahn (Friar) on Sep 09, 2002 at 18:48 UTC

will

# tokenize(), is_title(), is_name() and is_acceptable_suffix() are
# helper functions you still have to supply.
sub parse_names {
  my ($erstwhile_name) = @_;
  my (@titleToks, @nameToks, @suffixes);
  my @uncategorized = tokenize($erstwhile_name);
  # check for an empty list first, so is_title() never sees undef
  push (@titleToks, shift @uncategorized)
    until ( not @uncategorized
            or not is_title($uncategorized[0]) );
  if (not @uncategorized) {
    warn "all titles!";
    return;
  }
  push (@nameToks, shift @uncategorized)
     until ( not @uncategorized
             or not is_name($uncategorized[0]) );
  # now probably want to break up @nameToks into first and 
  # last names; this probably involves specific lists like 
  # "van" and "von" and "de" so you attach "de Sade", "van 
  # Gogh" to the last name, but "Robert Louis" to the first 
  # name

  if (@uncategorized) {
    # must be suffixes like "III", "Jr.", etc
    @suffixes = @uncategorized;
    if (not is_acceptable_suffix(@suffixes)) {
      warn "problem with suffixes " . join " ", @suffixes;
    }
  }
  # hand the categorized tokens back to the caller
  return (\@titleToks, \@nameToks, \@suffixes);
}
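The sub above leaves its helper functions and the first/last split open. One possible way to fill them in is sketched below. The word lists, norm(), is_suffix() and split_first_last() are assumptions made for illustration, not part of the original post; the other function names follow the ones used in the sub.

```perl
use strict;
use warnings;

# Hypothetical word lists; extend them as new cases turn up.
my %TITLE    = map { $_ => 1 } qw(lt col ltc dr mr ms miss);
my %SUFFIX   = map { $_ => 1 } qw(jr sr ii iii phd);
my %PARTICLE = map { $_ => 1 } qw(van von de);

# Normalise a token for lookups: lower-case, dots removed.
sub norm { my ($tok) = @_; (my $n = lc $tok) =~ s/\.//g; return $n }

sub tokenize  { my ($name) = @_; return split ' ', $name }
sub is_title  { return exists $TITLE{ norm($_[0]) } }
sub is_suffix { return exists $SUFFIX{ norm($_[0]) } }
# Anything that is neither a known title nor a known suffix counts as a name.
sub is_name   { return !( is_title($_[0]) || is_suffix($_[0]) ) }
# True only if every token passed in is a known suffix.
sub is_acceptable_suffix { return !grep { !is_suffix($_) } @_ }

# Split name tokens into (first names, last name), keeping particles
# such as "von" attached to the last name.
sub split_first_last {
    my @toks = @_;
    return ('', '') unless @toks;
    my @last = (pop @toks);
    unshift @last, pop @toks while @toks && $PARTICLE{ norm($toks[-1]) };
    return ("@toks", "@last");
}

my ($first, $last) = split_first_last(qw(J. Random von Perl-Hacker));
print "<$first> <$last>\n";
```

Lookup tables keep the predicates trivial, so a new title or particle is a one-word change. Names whose particle is not in the list will still be split in the wrong place.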


Re: Re: regex: seperating parts of non-formatted names

by sauoq (Abbot) on Sep 09, 2002 at 23:32 UTC

Consider my dear friend
Lt. Col. J. Random von Perl-Hacker III

Lt. Col. J. Random von Perl-Hacker III Ph.D.

Good luck!

. . . I wonder if there is a unicode glyph for "the artist formerly known as Prince" . . .

-sauoq
"My two cents aren't worth a dime.";


Re: regex: seperating parts of non-formatted names
by kschwab (Vicar) on Sep 09, 2002 at 19:15 UTC

Lingua::EN::NameParse

From the docs:

This module takes as input a person or persons name in free format text such as,
Mr AB & M/s CD MacNay-Smith
MR J.L. D'ANGELO
Estate Of The Late Lieutenant Colonel AB Van Der Heiden

and attempts to parse it. If successful, the name is broken down into components and useful functions can be performed such as :

converting upper or lower case values to name case (Mr AB MacNay )

creating a personalised greeting or salutation (Dear Mr MacNay )

extracting the names individual components (Mr,AB,MacNay )

determining the type of format the name is in (Mr_A_Smith )

If the name cannot be parsed you have the option of cleaning the name of bad characters, or extracting any portion that was parsed and the portion that failed.


Re: regex: seperating parts of non-formatted names
by fruiture (Curate) on Sep 09, 2002 at 18:15 UTC

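A good approach to programming problems in general is "breaking the problem into pieces". This is true for regular expressions as well. So I'd start writing the main regexp like this, grouping each optional part with the separator that follows it so that a missing part does not leave a stray separator behind:

$name = qr{(?:($title)\W+)?(?:($first)\W+)?(?:($middle)\W+)?($last)};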

And now you define each of the variables on their own, of course above this declaration.
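The definitions themselves are not shown in the post, so the following is only a sketch of what the pieces might look like. The sub-patterns, the anchors, and the split_name() wrapper are assumptions for illustration; each optional part is grouped with its trailing separator so that a missing part does not leave a stray separator behind.

```perl
use strict;
use warnings;

# Hypothetical sub-patterns, defined before the main regexp that uses them.
my @titles = qw(LTC COL DR MS MR MISS);
my $alt    = join '|', map { quotemeta } @titles;
my $title  = qr{(?:$alt)\.?}i;                    # known title, optional dot
my $first  = qr{[A-Za-z][A-Za-z'-]+};             # at least two characters
my $middle = qr{[A-Za-z]\.(?:\s*[A-Za-z]\.)*};    # one or more initials
my $last   = qr{[A-Za-z][A-Za-z'-]*};

my $name = qr{^\s*(?:($title)\W+)?(?:($first)\W+)?(?:($middle)\W+)?($last)\s*$};

# Return the parts that were present, in order; empty list if no match.
sub split_name {
    my ($input) = @_;
    my @parts = $input =~ $name or return;
    return grep { defined } @parts;
}

for my $input ('Frederick H. Jones', 'Dr. James T. Taylor',
               'Dr. Mat L. R. Michaels') {
    print join(' ', map { "<$_>" } split_name($input)), "\n";
}
```

Because split_name() drops the parts that did not match, the caller cannot tell which slot a value came from; keep the raw capture list if that matters.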


update: untested, there may be more mistakes...

fruiture


Re: regex: seperating parts of non-formatted names
by Kozz (Friar) on Sep 09, 2002 at 18:25 UTC

Mr. H. L. Mencken (first name is an initial)
Mr. Tim O'Reilly (surname with apostrophe or other punctuation)
Mr. Vincent Van Gogh (last name containing a space)


Re: regex: seperating parts of non-formatted names
by Django (Pilgrim) on Sep 09, 2002 at 19:17 UTC

Breaking the thing into pieces is surely the way to go. The following code works with the included test names, but there may be some names that break it. The fields will contain trailing spaces, which can be removed afterwards.

#!/usr/bin/perl -w
use strict;

my @Data
= ( 'Dr. Foo B. Baz',
    'Ms Bar',
    'Foo Bar',
    'Col Foo Bar',
    'Foo E.G. Bar',
    'Baz',
  )
;
my $Title
= qr/ (?:  LTC
        |  COL
        |  DR
        |  MS
        |  MR
        |  MISS
      )           
  /ix
;
my %Fields;   # parsed fields, one array slot per name
my $i = 0;    # index of the current name
for (@Data) {
  # skip names the pattern cannot handle instead of reusing stale captures
  next unless
  / ( (?: $Title \.?      \s+ )? )
    ( (?: [\w-]+          \s+ )? )
    ( (?: (?: \w\.\s*? )+ \s+ )? )
    (     [\w-]+?         \s* $  )
  /ix;
  ( $Fields{'title'   }[$i],
    $Fields{'name'    }[$i],
    $Fields{'initials'}[$i],
    $Fields{'surname' }[$i],
  ) = ( $1, $2, $3, $4 );
  print "$1:$2:$3:$4\n";
  $i++;
}

__DATA__
Dr. :Foo :B. :Baz
Ms :::Bar
:Foo ::Bar
Col :Foo ::Bar
:Foo :E.G. :Bar
:::Baz

^~Django
"Why don't we ever challenge the spherical earth theory?"


Re: regex: seperating parts of non-formatted names
by rdfield (Priest) on Sep 10, 2002 at 08:36 UTC

dws

The general method of processing started with tokenizing the input data, then ranking the tokens based on their frequency and placement, and then iterating over the result to move the tokens to the correct list (titles, first name, initials, surname, qualifications, etc.).
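The ranking step is not spelled out, so the following is only a guess at its simplest form: count how often each token occurs in first, middle, or last position, then treat tokens that are frequent and always lead a name as probable titles. The function names, the normalisation, and the threshold are all assumptions for illustration.

```perl
use strict;
use warnings;

# Count how often each normalised token occurs in each position.
sub token_stats {
    my @names = @_;
    my %stats;
    for my $name (@names) {
        my @tok = split ' ', $name;
        for my $i (0 .. $#tok) {
            (my $key = lc $tok[$i]) =~ s/\.//g;
            # a single-token name counts as "last" (surname only)
            my $place = $i == $#tok ? 'last'
                      : $i == 0     ? 'first'
                      :               'middle';
            $stats{$key}{$place}++;
            $stats{$key}{total}++;
        }
    }
    return \%stats;
}

# Hypothetical heuristic: a token seen at least $min_count times, and
# only ever in first position, is probably a title.
sub likely_titles {
    my ($stats, $min_count) = @_;
    my @titles = sort grep {
        my $s = $stats->{$_};
        ($s->{first} || 0) >= $min_count && ($s->{first} || 0) == $s->{total}
    } keys %$stats;
    return @titles;
}

my $stats = token_stats(
    'Dr. James T. Taylor', 'Dr. Mat L. R. Michaels',
    'Frederick H. Jones',  'Dr. Foo B. Baz',
);
print join(', ', likely_titles($stats, 2)), "\n";
```

On real data a common first name would pass the same test, so the output is better used as a list of candidates to review by hand than as a final answer.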

rdfield


Re: regex: seperating parts of non-formatted names
by emilford (Friar) on Sep 11, 2002 at 01:20 UTC


Re: regex: seperating parts of non-formatted names
by rir (Vicar) on Sep 09, 2002 at 21:34 UTC

odd

Massaging messy data into formal structures
is often easier if you do not attempt a complete
algorithmic solution. Exploring the data is much of
the problem, so solving it as you explore it is a
possibility.

Often it is easier to solve the problem bit by bit.
Copying out all the two-field lines and solving them
is probably trivial.
This warms you up to do the three field lines, or maybe
you see that the titles are not very varied, and decide
to handle that aspect first.

By the time you get to the hard cases your remaining data
set may be quite small.

The approach I'm proposing is efficient in certain
situations. Your situation may or may not be such.
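As a starting point for this kind of exploration, the names can be grouped by how many whitespace-separated fields they have, so that the easy groups can be handled first. This is a minimal sketch; bucket_by_field_count() is a made-up name, and counting fields by whitespace is an assumption about the data.

```perl
use strict;
use warnings;

# Group names by the number of whitespace-separated fields they contain.
sub bucket_by_field_count {
    my %bucket;
    for my $name (@_) {
        my @fields = split ' ', $name;
        push @{ $bucket{ scalar @fields } }, $name;
    }
    return \%bucket;
}

my $buckets = bucket_by_field_count(
    'Foo Bar', 'Ms Bar', 'Dr. Foo B. Baz', 'Baz',
);
# Report the groups from fewest fields to most.
for my $n (sort { $a <=> $b } keys %$buckets) {
    printf "%d field(s): %s\n", $n, join('; ', @{ $buckets->{$n} });
}
```

Each group can then get its own small, simple rule, and whatever is left over at the end is the part that needs manual attention.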


Re: regex: seperating parts of non-formatted names
by zigdon (Deacon) on Sep 10, 2002 at 12:14 UTC

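use strict;
use warnings;

my @titles = qw/Dr. Col. Miss/;  # list of allowable titles
# quotemeta makes the dots (and any other metacharacters) literal
my $title_reg = join "|", map { quotemeta } @titles;

while (<>) {
  next unless     # skip lines that do not look like a name
  /
     ($title_reg)?  # optional title
     \s*            # skip any spaces
    (\S+)  \s+      # first name, skip following spaces
    ((?:\w\.\s*)+)? # middle initials - any number, optional
    (\S+)           # last name
  /ix;
  # optional groups that did not match are undef; print them as empty
  print join(", ", map { defined $_ ? $_ : "" } $1, $2, $3, $4), "\n";
}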

-- Dan (who forgot to hit submit last night)

