PerlMonks  

Splitting/joining on different characters

by SirBones (Friar)
on May 25, 2006 at 18:25 UTC ( [id://551667]=perlquestion )

SirBones has asked for the wisdom of the Perl Monks concerning the following question:

Hey wise-guys/gals... :-). Kind of a two part question, the first fine-tuning, is-there-a-better-slicker-Perl-idiom. The second is a request for some split/join guidance.

I've got a list of names that I need to "normalize" according to the following rules:

  • Uppercase first char of each name, lowercase rest:
    SMITH, JOHN H. --becomes--> Smith, John H.
  • Preferred name in parens, same rule applies:
    Smith, John H. (JOHNNY) --becomes--> Smith, John H. (Johnny)
  • Words starting with a "*" should be left alone:
    SMITH, JOHN H. (JOHNNY) *CONTRACTOR* --becomes--> Smith, John H. (Johnny) *CONTRACTOR*

I'm promised (I think) that the items will always be space-separated, although I'm a bit nervous about that one. I think I need to allow for that, just in case. In any event, the following code seems to work if I assume just space delimiters:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @names = ("Foonman, Joseph S. (Joe)",
                 "SMITH, EDWARD",
                 "Perl, Paul M. *CONTRACTOR*",
                 "Jones, Bobby R.",
                 "Ruth, BABE B. *CONTRACTOR*",
                 "CLAUSE, SANTINO (SANTA)",
                );

    for my $n (@names) {
        print join(" ", map( { if (/^\*/) { $_; }
                               elsif (/^\(/) { "(".ucfirst(lc(substr($_,1))); }
                               else { ucfirst(lc($_)); }
                             } split(/ /,$n)))."\n";
    }

Gives me exactly what I want:

    Foonman, Joseph S. (Joe)
    Smith, Edward
    Perl, Paul M. *CONTRACTOR*
    Jones, Bobby R.
    Ruth, Babe B. *CONTRACTOR*
    Clause, Santino (Santa)

Part of my learning process with Perl has been to always ask "Is there a better way?" Any ideas? I'd be particularly interested in people pointing out any weaknesses with this approach, and if there is a more concise way to do it.

And back to question 2: Of course this code doesn't work for Claus,Santa (no space delimiter). I'd appreciate any suggestions for handling the case if someone surprises me with only a comma delimiter. I tried splitting on / |,/ but of course that eliminates all the commas in the output. I'm having difficulty since if I'm splitting on either a space or a comma, I don't know how to tell which one forced the split so that I can use the appropriate character when I re-join the whole thing. I hope that makes sense.

Cheers,
Ken

"This bounty hunter is my kind of scum: Fearless and inventive." --J.T. Hutt

Replies are listed 'Best First'.
Re: Splitting/joining on different characters
by japhy (Canon) on May 25, 2006 at 18:39 UTC
Re: Splitting/joining on different characters
by idsfa (Vicar) on May 25, 2006 at 18:48 UTC

    You do realize that McDarren, brian_d_foy, Sir Tim Berners-Lee and the folks from O'Reilly will be stopping by to talk with you about your mistaken assumptions about how names are capitalized?

    I understand that it may not matter for your particular data set, but it's worth keeping in mind if the results might get back to the user. People get touchy about their names ...


    The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. — Cyrus H. Gordon
      People get touchy about their names ...

      It's a good point, but I work for a BIG COMPANY and they spell our names any way they like. :-) But well taken; I think I need to have a couple of conversations about that.

      Ken

      "This bounty hunter is my kind of scum: Fearless and inventive." --J.T. Hutt
        You might find Lingua::EN::NameCase useful. It includes support for quite a few of the special rules for name capitalization. A quick search on the code seems to indicate it wouldn't handle 'brian d foy' correctly though!

        --Brian

      Names can be very tricky (as in impossibly so) to parse to everyone's satisfaction with a clean set of rules.

      A few examples: Simon Conway Morris: "Conway Morris" is his surname.
      Ludwig van Beethoven: the "van" is not capitalized (but, of course, the Van in Van Morrison is). Similarly, the "da" in da Vinci is not capitalized.

      Of course, the big company you work for doesn't care, until the CEO (with a name like Gerard 't Hooft) gets all annoyed that his name keeps getting turned into Gerard Thooft.


      Added some quotes around Conway Morris and corrected some spelling (or at least miscorrected it in a better looking way)

      emc

      "Being forced to write comments actually improves code, because it is easier to fix a crock than to explain it."
      —G. Steele

        Well, fortunately my CEO isn't likely to be on this list (it's a list of coders who have access to a group of servers), but you've got me thinking. Perhaps I should just punt with

        $name = uc($name);

        That will annoy everyone, although not equally. ;-)

        "This bounty hunter is my kind of scum: Fearless and inventive." --J.T. Hutt
Re: Splitting/joining on different characters
by dragonchild (Archbishop) on May 25, 2006 at 18:36 UTC
    To answer your second question, just normalize the input first. s/,(?=\S)/, /;
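A minimal sketch of that normalization step, wrapped in a hypothetical helper named add_space_after_commas (the name is not from the thread). The /g flag is an addition here so that every comma is handled rather than only the first; the one-liner as posted should be enough when each name contains a single comma.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): insert a space after any comma
# that is immediately followed by a non-whitespace character.
sub add_space_after_commas {
    my ($name) = @_;
    # (?=\S) is a lookahead: it requires a non-whitespace character next
    # but does not consume it, so the following character is left in place.
    $name =~ s/,(?=\S)/, /g;
    return $name;
}

print add_space_after_commas("CLAUS,SANTA"), "\n";      # CLAUS, SANTA
print add_space_after_commas("SMITH, JOHN H."), "\n";   # unchanged: SMITH, JOHN H.
```

Once the input is normalized this way, the space-only split in the original code can stay as it is.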

    Your first question ... I'd rewrite it as so:

    foreach my $name ( @names ) {
        my @n;
        foreach ( split ' ', $name ) {
            # Go through each item, push'ing onto @n
        }
        $name = join ' ', @n;
    }
    At that point, @names has been modified in-place using aliasing.
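One way the skeleton above might be filled in, as a sketch rather than a definitive version. The three branches are taken from the rules in the original question; the sub name normalize_in_place and the sample data are assumptions made for illustration. Note that split ' ' splits on any run of whitespace, so repeated spaces collapse to one when the words are re-joined.

```perl
use strict;
use warnings;

# Hypothetical helper: takes an array reference and rewrites each name in place.
sub normalize_in_place {
    my ($names) = @_;
    foreach my $name (@$names) {          # $name is an alias for the array element
        my @n;
        foreach my $word ( split ' ', $name ) {
            if ( $word =~ /^\*/ ) {
                push @n, $word;           # "*CONTRACTOR*" style words are left alone
            }
            elsif ( $word =~ /^\(/ ) {
                # Preferred name in parens: skip the "(" before fixing the case
                push @n, '(' . ucfirst( lc( substr( $word, 1 ) ) );
            }
            else {
                push @n, ucfirst( lc($word) );
            }
        }
        $name = join ' ', @n;             # assigning to the alias updates the array
    }
    return;
}

my @names = ( "SMITH, EDWARD", "Ruth, BABE B. *CONTRACTOR*", "CLAUSE, SANTINO (SANTA)" );
normalize_in_place( \@names );
print "$_\n" for @names;
```

The explicit if/elsif/else takes more lines than the nested map, but each rule sits on its own branch, which makes a later change easier to review.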

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Splitting/joining on different characters
by dsheroh (Monsignor) on May 25, 2006 at 18:46 UTC
    Regarding question 2, splitting on /([ ,])/ should include the characters it splits on in the returned array, e.g., ('Claus', ',', 'Santa') instead of just ('Claus', 'Santa').

    (The parentheses (to capture the match) are the key to this; the brackets are just there because I prefer to use a character class for single-character matching instead of alternation. /( |,)/ should do the same thing, too.)
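A short sketch of how the capturing split could be combined with the rules from the original question. The helper names fix_piece and normalize_keep_delims are hypothetical. Because the delimiters come back as list elements, the pieces are re-joined with an empty string and each delimiter is passed through untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the capitalization rules to one piece of the split.
sub fix_piece {
    my ($piece) = @_;
    return $piece if $piece =~ /\A[ ,]\z/;    # a captured delimiter: keep as-is
    return $piece if $piece =~ /^\*/;         # "*CONTRACTOR*" style: keep as-is
    return '(' . ucfirst( lc( substr( $piece, 1 ) ) ) if $piece =~ /^\(/;
    return ucfirst( lc($piece) );             # also harmless on empty strings
}

# Hypothetical helper: split on space or comma, keeping the delimiters.
sub normalize_keep_delims {
    my ($name) = @_;
    # The capturing parentheses make split return the delimiters too, so
    # "Claus,Santa" yields ('Claus', ',', 'Santa').
    return join '', map { fix_piece($_) } split /([ ,])/, $name;
}

print normalize_keep_delims("CLAUS,SANTA"), "\n";
print normalize_keep_delims("SMITH, JOHN H. (JOHNNY) *CONTRACTOR*"), "\n";
```

With a comma followed by a space, split produces an empty string between the two delimiters; it passes through the last branch unchanged, so the output keeps the original spacing.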
