Hello Monks,

I am trying to write a script to help out a co-worker who has not yet been enlightened about the powers of Perl. I already managed to impress her when I took 5 minutes to write a script that ran for 30 seconds and saved her at least an hour of work. She has a database full of names that follow no specific format, which she needs to separate into:

1) title
2) first name
3) middle initial
4) last name

Some might have all this information, some might not.

I know that this is feasible with a fairly complex regex, which is where I'm running into some problems. I'm sure I could put something together that would work fairly well, but I want to try and write code that will perform appropriately for all cases.

To show that I'm not just asking you guys to solve my problem, I have come up with some ideas that I think need to be incorporated into the regex.

There are multiple possible titles (e.g. LTC, COL, DR, MS, MR, MISS, etc.). Instead of having a long regex testing LTC|DR|MS|MR, would it be possible to put them into an array and have a portion of the regex be executed code that iterates through each possibility in the array and returns the match? That way, new titles can easily be added as they come up.
The different parts of the name are mostly separated by spaces: the middle initial could be grabbed with (\w\.) and the first and last names could be grabbed based on \w versus spaces. Is there a better approach?
Certain names are only last names; a special case for these could lessen the complexity of the regex.
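
For the first idea, here is a rough sketch of one way it might work without executing code inside the regex: keep the titles in an array and join them into an alternation that is compiled once with qr//. The @titles list, the leading_title helper and the "Miss Jane Doe" and "Drake Smith" test names are made up for illustration; the optional trailing period and the case-insensitive match are assumptions about the data.

```perl
use strict;
use warnings;

# Hypothetical title list; extend it as new titles turn up.
my @titles = qw(LTC COL DR MS MR MRS MISS);

# Longest first, so MISS is tried before MS and MRS before MR.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @titles;

# Title at the start of the string, optional trailing period,
# case-insensitive. (?!\S) requires whitespace or end of string next,
# so a first name such as "Drake" is not mistaken for "DR".
my $title_re = qr/^\s*((?:$alternation)\.?)(?!\S)/i;

# Returns the leading title, or undef when the name has none.
sub leading_title {
    my ($name) = @_;
    return $name =~ $title_re ? $1 : undef;
}

for my $name ('Dr. James T. Taylor', 'Frederick H. Jones',
              'Miss Jane Doe', 'Drake Smith') {
    my $title = leading_title($name);
    printf "%-20s => %s\n", $name, defined $title ? $title : '(no title)';
}
```

Adding a title then only means adding a word to @titles; the pattern is rebuilt from the array the next time the script runs.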

Here's an example of what I'm looking for. Say I had the following names:

Frederick H. Jones
Dr. James T. Taylor
Dr. Mat L. R. Michaels

I'd want to be able to separate this into:
(< > marks chunk tossed into variable)

<Frederick> <H.> <Jones>
<Dr.> <James> <T.> <Taylor>
<Dr.> <Mat> <L. R.> <Michaels>
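
As a starting point, the split shown above could perhaps be produced without one large regex by breaking each name on whitespace and peeling pieces off both ends. This is only a sketch under some assumptions: the title list is hypothetical, the last whitespace-separated token is always treated as the last name, and everything between the first name and the last name is treated as middle initials. The parse_name helper is a name made up here.

```perl
use strict;
use warnings;

# Hypothetical title list; a whole token must match, period optional.
my $title_re = qr/^(?:MISS|MRS|LTC|COL|DR|MS|MR)\.?$/i;

# Returns a hash ref with the keys title, first, middle and last;
# parts that are missing from the name come back as empty strings.
sub parse_name {
    my ($name) = @_;
    my @tokens = split ' ', $name;
    my %part = (title => '', first => '', middle => '', last => '');
    return \%part unless @tokens;

    # A leading token that looks like a title is peeled off first.
    $part{title} = shift @tokens
        if @tokens > 1 && $tokens[0] =~ $title_re;

    # The final token is taken as the last name; this also covers
    # records that contain nothing but a last name.
    $part{last} = pop @tokens;

    # Of what remains, the first token is the first name and the rest
    # (e.g. "L." and "R.") are joined as the middle initial(s).
    $part{first}  = shift @tokens if @tokens;
    $part{middle} = join ' ', @tokens;

    return \%part;
}

for my $name ('Frederick H. Jones', 'Dr. James T. Taylor',
              'Dr. Mat L. R. Michaels') {
    my $p = parse_name($name);
    print join(' ', map { "<$_>" } grep { length }
               @{$p}{qw(title first middle last)}), "\n";
}
```

Multi-word last names and suffixes such as "Jr." would be misparsed by this sketch and would need extra rules.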

I'm going to start working on this regex and toy around with different ideas. I'll post what I have completed every so often, but any feedback, ideas, suggestions, or code would be appreciated.

Thanks in advance,
Eric

In reply to regex: seperating parts of non-formatted names by emilford
