http://www.perlmonks.org?node_id=196369

emilford has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am trying to write a script that will help out a co-worker who has not yet been enlightened about the powers of perl. I already managed to impress her when I took 5 minutes to write a script that ran for 30s and saved her at least an hour of work. She has a database full of names that follow no specific format, which she needs to separate into
1) title 2) first name 3) middle initial 4) last name
Some might have all this information, some might not.

I know that this is feasible with a fairly complex regex, which is where I'm running into some problems. I'm sure I could put something together that would work fairly well, but I want to try and write code that will perform appropriately for all cases.

To show that I'm not just asking you guys to solve my problem, I have come up with some ideas that I think need to be incorporated into the regex.

  1. there are multiple titles that are possible (e.g. LTC, COL, DR, MS, MR, MISS, etc.); instead of having a long regex testing LTC|DR|MS|MR, would it be possible to toss them into an array and have a portion of the regex be executed code that iterates through each possibility in the array and returns the match? That way, as new titles come up, they can easily be added.

  2. the different parts of the name are separated mostly by spaces: the middle initial could be grabbed with (\w\.) and the first and last names could be grabbed based on \w versus spaces. Is there a better approach?

  3. there are certain names that are only last names; there could be a special case for this that would lessen the complexity of the regex.

Here's an example of what I'm looking for. Say I had the following names:
Frederick H. Jones
Dr. James T. Taylor
Dr. Mat L. R. Michaels
I'd want to be able to separate this into:
(< > marks chunk tossed into variable)
<Frederick> <H.> <Jones>
<Dr.> <James> <T.> <Taylor>
<Dr.> <Mat> <L. R.> <Michaels>
I'm going to start working on this regex and toy around with different ideas. I'll post what I have completed every so often, but any feedback, ideas, suggestions, code would be appreciated.

Thanks in advance,
Eric

Replies are listed 'Best First'.
Re: regex: seperating parts of non-formatted names
by dws (Chancellor) on Sep 09, 2002 at 18:23 UTC
    She has a database full of names that follow no specific format, that she needs to separate down to [title, first name, middle initial, last name]

    Welcome to the administrative sub-basement of hell. Depending on the names your friend has to deal with, you might discover that a regex can handle 98%, but that the remaining 2% will cause you to run screaming into the night.

    Consider my dear friend
      Lt. Col. J. Random von Perl-Hacker III
    By the scheme your friend is using, Randy's name needs to reduce to
      <Lt. Col.> <J.> <R.> <von Perl-Hacker>
    (And it isn't immediately clear what to do with the "III".) In any large set of unstructured names, you're going to run into a few like this. Good luck handling them with a single regexp.

    I think you'll have better luck breaking the name into tokens, providing predicate functions that answer whether a token can be of a particular type, then providing a set of "rules" to match a set of tokens against. This will be slower, but potentially much more accurate, than a regex.
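
A minimal sketch of the token / predicate / rule idea described above. The lookup tables, the helper names (norm, is_title, is_suffix, is_particle) and the rule order are all assumptions made for illustration; real data will need longer lists and more rules.

```perl
use strict;
use warnings;

# Hypothetical lookup tables; extend them as new cases turn up in the data.
my %TITLE    = map { $_ => 1 } qw(mr mrs ms miss dr lt col ltc);
my %SUFFIX   = map { $_ => 1 } qw(jr sr ii iii iv phd);
my %PARTICLE = map { $_ => 1 } qw(van von de der la);

# Normalize a token for lookup: lowercase it and strip dots.
sub norm {
    my $t = lc $_[0];
    $t =~ tr/.//d;
    return $t;
}

# Predicates: can this token be of a particular type?
sub is_title    { return exists $TITLE{ norm($_[0]) } }
sub is_suffix   { return exists $SUFFIX{ norm($_[0]) } }
sub is_particle { return exists $PARTICLE{ norm($_[0]) } }

sub parse_name {
    my @tok = split ' ', $_[0];
    my (@title, @suffix, @last);

    # Rule 1: leading tokens that look like titles.
    push @title, shift @tok while @tok && is_title($tok[0]);
    # Rule 2: trailing tokens that look like suffixes.
    unshift @suffix, pop @tok while @tok && is_suffix($tok[-1]);
    # Rule 3: the last remaining token is the surname,
    # together with any particles directly in front of it.
    unshift @last, pop @tok if @tok;
    unshift @last, pop @tok while @tok && is_particle($tok[-1]);
    # Rule 4: the first remaining token is the first name.
    my $first = @tok ? shift @tok : '';
    # Rule 5: whatever is left counts as middle names or initials.
    my @middle = @tok;

    return {
        title  => "@title",
        first  => $first,
        middle => "@middle",
        last   => "@last",
        suffix => "@suffix",
    };
}

my $randy = parse_name('Lt. Col. J. Random von Perl-Hacker III');
print join(' | ', map { "$_=$randy->{$_}" } qw(title first middle last suffix)), "\n";
```

Under these rules Randy comes out as title "Lt. Col.", first "J.", middle "Random", last "von Perl-Hacker", suffix "III". A name that is only a surname ends up with everything but the last field empty.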

      I can't agree enough.
      You will save yourself brain hurt by tokenizing first, so at least you have some idea where the word boundaries are in some reliable way. Then you need the predicate functions, as the previous poster pointed out.
      Perhaps the following snippet makes sense:
      sub parse_names {
          my ($erstwhile_name) = @_;
          my (@titleToks, @nameToks, @suffixes);
          my @uncategorized = tokenize($erstwhile_name);

          # test for remaining tokens first, so is_title() never sees undef
          push(@titleToks, shift @uncategorized)
              while @uncategorized and is_title($uncategorized[0]);

          if (not @uncategorized) {
              warn "all titles!";
              return;
          }

          push(@nameToks, shift @uncategorized)
              while @uncategorized and is_name($uncategorized[0]);

          # now probably want to break up @nameToks into first and
          # last names; this probably involves specific lists like
          # "van" and "von" and "de" so you attach "de Sade", "van
          # Gogh" to the last name, but "Robert Louis" to the first
          # name

          if (@uncategorized) {
              # must be suffixes like "III", "Jr.", etc
              @suffixes = @uncategorized;
          }
          if (not is_acceptable_suffix(@suffixes)) {
              warn "problem with suffixes " . join " ", @suffixes;
          }
      }
      Consider my dear friend
      Lt. Col. J. Random von Perl-Hacker III
      I know him! He finally got that degree... Now he gives his name as
      Lt. Col. J. Random von Perl-Hacker III Ph.D.

      Good luck!

      . . . I wonder if there is a unicode glyph for "the artist formerly known as Prince" . . .

      -sauoq
      "My two cents aren't worth a dime.";
      
Re: regex: seperating parts of non-formatted names
by kschwab (Vicar) on Sep 09, 2002 at 19:15 UTC
    Check out Lingua::EN::NameParse.

    From the docs:

    This module takes as input a person or persons name in free format text such as,

    Mr AB & M/s CD MacNay-Smith
    MR J.L. D'ANGELO
    Estate Of The Late Lieutenant Colonel AB Van Der Heiden

    and attempts to parse it. If successful, the name is broken down into components and useful functions can be performed such as :
    converting upper or lower case values to name case (Mr AB MacNay )
    creating a personalised greeting or salutation (Dear Mr MacNay )
    extracting the names individual components (Mr,AB,MacNay )
    determining the type of format the name is in (Mr_A_Smith )

    If the name cannot be parsed you have the option of cleaning the name of bad characters, or extracting any portion that was parsed and the portion that failed.

Re: regex: seperating parts of non-formatted names
by fruiture (Curate) on Sep 09, 2002 at 18:15 UTC

    A good approach to programming problems in general is "breaking the problem into pieces". This holds for regular expressions too. So I'd start by writing the main regexp like this:

    $name = qr{(?:($title)\W+)?(?:($first)\W+)?(?:($middle)\W+)?($last)};

    And now you define each of the variables on their own, of course above this declaration.
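
One way the pieces might be defined, as a sketch only. The title list, the character classes and the split_name helper are assumptions for illustration, not part of the post above, and the pattern is anchored so that names it cannot handle fail outright instead of matching partially.

```perl
use strict;
use warnings;

# Assumed building blocks; each one can be refined on its own.
my @titles = qw(LTC COL DR MS MR MISS);
my $alt    = join '|',
             map  { quotemeta($_) }
             sort { length($b) <=> length($a) } @titles;

my $title  = qr/(?:$alt)\.?/i;                    # "Dr", "dr.", "COL", ...
my $first  = qr/[A-Za-z][\w'-]+/;                 # at least two characters
my $middle = qr/[A-Za-z]\.(?:\s*[A-Za-z]\.)*/;    # "H." or "L. R."
my $last   = qr/[\w'-]+/;                         # single-token surname

# Each optional part carries its own separator, so it can be absent.
my $name = qr{^\s*(?:($title)\s+)?(?:($first)\s+)?(?:($middle)\s+)?($last)\s*$};

# Returns (title, first, middle, last) with '' for missing parts,
# or an empty list when the name does not fit the pattern.
sub split_name {
    my ($raw) = @_;
    my @parts = $raw =~ $name or return;
    return map { defined $_ ? $_ : '' } @parts;
}

for my $raw ('Frederick H. Jones', 'Dr. James T. Taylor', 'Dr. Mat L. R. Michaels') {
    print join(':', split_name($raw)), "\n";
}
```

Names with a multi-word surname such as "Vincent Van Gogh" do not match this pattern at all, which at least makes them easy to set aside for separate handling.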

    update: untested, there may be more mistakes...

    --
    http://fruiture.de
Re: regex: seperating parts of non-formatted names
by Kozz (Friar) on Sep 09, 2002 at 18:25 UTC
    When writing a regex to handle this, you may want to keep in mind other unique names, such as
      Mr. H. L. Mencken (first name is an initial)
      Mr. Tim O'Reilly (surname with apostrophe or other punctuation)
      Mr. Vincent Van Gogh (last name containing a space)
    Also note that the last two have no middle initial/name.

    If you need to write a regex to handle all of these situations, I do not envy you. ;)
Re: regex: seperating parts of non-formatted names
by Django (Pilgrim) on Sep 09, 2002 at 19:17 UTC

    Breaking the thing into pieces is surely the way to go. My following code works with the included test names, but there may be some names that break it. The fields will contain trailing spaces, which can be removed afterwards.

    #!/usr/bin/perl -w

    @Data = (
        'Dr. Foo B. Baz',
        'Ms Bar',
        'Foo Bar',
        'Col Foo Bar',
        'Foo E.G. Bar',
        'Baz',
    );

    my $Title = qr/ (?: LTC | COL | DR | MS | MR | MISS ) /ix;

    my $i = 0;    # record index, declared outside the loop so it persists
    for (@Data) {
        /
            ( (?: $Title \.? \s+ )? )         # title
            ( (?: [\w-]+ \s+ )? )             # first name
            ( (?: (?: \w\.\s*? )+ \s+ )? )    # initials
            ( [\w-]+? \s* $ )                 # surname
        /ix;

        (
            $Fields{'title'   }[$i],
            $Fields{'name'    }[$i],
            $Fields{'initials'}[$i],
            $Fields{'surname' }[$i],
        ) = ( $1, $2, $3, $4 );
        $i++;

        print "$1:$2:$3:$4\n";
    }

    __DATA__
    Dr. :Foo :B. :Baz
    Ms :::Bar
    :Foo ::Bar
    Col :Foo ::Bar
    :Foo :E.G. :Bar
    :::Baz

    ~Django
    "Why don't we ever challenge the spherical earth theory?"

Re: regex: seperating parts of non-formatted names
by rdfield (Priest) on Sep 10, 2002 at 08:36 UTC
    As dws pointed out, this is a deceptively tricky problem to solve in a reliable way. I did some consultancy work for a company trying to clean up the names and addresses in their customer database (6M customers as I recall) to save money on their bulk postings. The development team assigned to this task had spent 2 years without producing an acceptable solution and the vendor solutions I was evaluating were in the region of GBP250K for the licensing costs alone. Each vendor solution also required additional work to "tailor" the solution to local needs, and even then there was about a 1% "failure to parse" rate.

    The general method of processing started with tokenizing the input data, then ranking the tokens based on their frequency and placement, and then iterating over the result to move the tokens to the correct list (titles, first name, initials, surname, qualifications etc).
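
A much simplified sketch of the "frequency and placement" step, to make the idea concrete. It is not the vendors' method; the function names and the threshold are assumptions. It only tallies where each token occurs and flags tokens that often lead a name as candidate titles.

```perl
use strict;
use warnings;

# For every normalized token, count how often it appears in each
# position class (first, middle, last) across the whole data set.
# A name consisting of a single token is counted as 'last'.
sub position_counts {
    my @names = @_;
    my %count;
    for my $name (@names) {
        my @tok = split ' ', $name;
        for my $i (0 .. $#tok) {
            my $where = $i == $#tok ? 'last'
                      : $i == 0     ? 'first'
                      :               'middle';
            (my $key = lc $tok[$i]) =~ tr/.//d;
            $count{$key}{$where}++;
        }
    }
    return \%count;
}

# Tokens that lead at least $min names are candidate titles
# (assumed heuristic; common first names will show up here too).
sub likely_titles {
    my ($count, $min) = @_;
    my @hits = grep { ($count->{$_}{first} || 0) >= $min } keys %$count;
    return sort @hits;
}

my $counts = position_counts('Dr. Foo B. Baz', 'Dr Jim Taylor', 'Col Foo Bar', 'Foo Bar');
print join(', ', likely_titles($counts, 2)), "\n";
```

On a real data set the candidate list would be reviewed by hand before being used to sort tokens into the title list.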

    rdfield

Re: regex: seperating parts of non-formatted names
by emilford (Friar) on Sep 11, 2002 at 01:20 UTC
    Thank you all for your responses. I knew that this could be quite a difficult problem, but my needs are a bit less substantial. I think we currently have only a couple thousand records, and if it doesn't happen to work exactly right for every single case, those changes can be made by hand. I think I have enough to get where I need to be. Thanks again everyone.

    Eric
Re: regex: seperating parts of non-formatted names
by rir (Vicar) on Sep 09, 2002 at 21:34 UTC
    You could say more about your data set.
    How many lines are we talking about?
    Are the names and titles all English or anglicized?
    How many odd names are there?

    Massaging messy data into formal structures
    is often easier if you do not attempt a complete
    algorithmic solution. Exploring the data is much of
    the problem, so solving it as you explore it is a
    possibility.

    Often it is easier to solve the problem bit by bit.
    Copying out all the two-field lines and solving them
    is probably trivial.
    This warms you up to do the three field lines, or maybe
    you see that the titles are not very varied, and decide
    to handle that aspect first.

    By the time you get to the hard cases your remaining data
    set may be quite small.

    The approach I'm proposing is efficient in certain
    situations. Your situation may or may not be such.
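
A sketch of the bit-by-bit approach, assuming names are split on whitespace. The function names are made up for illustration, and treating every two-field line as "First Last" is itself an assumption that exploring the data may overturn (a line like "Ms Bar" would need the title pass first).

```perl
use strict;
use warnings;

# Group raw names by their number of whitespace-separated fields.
sub bucket_by_fields {
    my %bucket;
    for my $name (@_) {
        my @fields = split ' ', $name;
        push @{ $bucket{ scalar @fields } }, $name;
    }
    return \%bucket;
}

# Solve the easy bucket: two-field lines are taken as "First Last".
sub solve_two_field {
    my ($bucket) = @_;
    my @solved;
    for my $name (@{ $bucket->{2} || [] }) {
        my ($first, $last) = split ' ', $name;
        push @solved, { first => $first, last => $last };
    }
    return @solved;
}

my $buckets = bucket_by_fields('Foo Bar', 'Baz', 'Dr. Foo B. Baz', 'Jim Taylor');
for my $n (sort { $a <=> $b } keys %$buckets) {
    printf "%d field(s): %d name(s)\n", $n, scalar @{ $buckets->{$n} };
}
```

The per-bucket counts show at a glance how much of the data falls into the trivial cases and how much is left for the harder ones.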

Re: regex: seperating parts of non-formatted names
by zigdon (Deacon) on Sep 10, 2002 at 12:14 UTC
    How about something like this?
    @titles = qw/Dr. Col. Miss/;    # list of allowable titles
    $title_reg = join "|", @titles;
    $title_reg =~ s/\./\\./g;       # make sure the dots are literal

    while (<>) {
        /
            ($title_reg)?      # optional title
            \s*                # skip any spaces
            (\S+) \s+          # first name, skip following spaces
            ((?:\w\.\s*)+)?    # middle initials - any number, optional
            (\S+)              # last name
        /iox;
        print "$1, $2, $3, $4\n";
    }
    note that the middle initial will have trailing spaces, and that this will break if the middle name is spelled out.

    -- Dan (who forgot to hit submit last night)