Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

Weird "soundex" algorithm

by giulienk (Curate)
on Aug 28, 2003 at 14:59 UTC ( #287394=perlquestion: print w/replies, xml ) Need Help??
giulienk has asked for the wisdom of the Perl Monks concerning the following question:

I need to work out a bunch of data with an algorithm i didn't choose. The algorithm has nothing to do with the "real" soundex algorithm (see Text::Soundex), even though it was called like that where i read it. So from now on i will baptize this algorithm Weirdex; this is described to receive a string in input and then:
  • Strip chars outside [a-zA-Z] range
  • Take all consonants except doubles (that should be reduced to one)
  • Take just the first vowel

larry walllarywl

My first coding of Weirdex is

sub weirdex { local $_ = shift; my ($weirdex, $vowel) = ('', 0); s/[^a-zA-Z]//g; for (split '') { if (/[aeiou]/i) { $vowel++ or $weirdex .= $_; } else { $weirdex .= $_ if substr($weirdex, -1, 1) ne $_; } } return $weirdex; }

I'm interested in more elegant/efficient solutions, only regexp solutions and even Golf/Obfu ones.
Thank you.


Replies are listed 'Best First'.
Re: Weird "soundex" algorithm
by broquaint (Abbot) on Aug 28, 2003 at 15:17 UTC
    sub weirdex { local $_ = shift; tr/a-zA-Z//dc; tr/a-zA-Z//s; m/[aeiou]/g and substr($_, pos) =~ s/[aieuo]//g; $_; } print "$_ - ", weirdex($_), $/ for "giulienk", "larry wall", "etheroskedasticity"; __output__ giulienk - gilnk larry wall - larywl etheroskedasticity - ethrskdstcty
    tr//, m// and s///, pos and substr are your friends :)


      Thanks to yours and jmcnamara's reply: i totally forgot about the power of tr/// modifiers, especially c and s, which i never used before. :)


      Nice. I was unhappy with your vowel stripping though. You use a looping match for offsets into $_ but really, you only have to find the *first* vowel and then no looping is required.

      # m/[aeiou]/g and substr($_, pos) =~ s/[aieuo]//g; /[aeiou]/ and substr( $_, $+[0] ) =~ tr/aeiou//d;
        You use a looping match for offsets into $_ but really ...
        Er, what looping? The /g matches the first vowel then saves the position of the match for substr which the replace then operates on. I didn't want to use the $+ variable because of the overhead it invokes.


Re: Weird "soundex" algorithm
by jmcnamara (Monsignor) on Aug 28, 2003 at 15:20 UTC

    sub weirdex { local $_ = $_[0]; tr/a-zA-Z//s; tr/a-zA-Z//cd; my $i; s/([aeiou])/$1if!$i++/eg; $_; }


Re: Weird "soundex" algorithm
by rcaputo (Chaplain) on Aug 28, 2003 at 15:46 UTC
Re: Weird "soundex" algorithm
by Aristotle (Chancellor) on Aug 28, 2003 at 16:03 UTC
    use strict; use Test::More 'no_plan'; sub weirdex { local $_ = shift; tr/A-Za-z//cd; tr/A-Za-z//s; /[aeiou]/g; substr($_, pos()) =~ s/[aeiou]+//g; return $_; } my %test = ( 'giulienk' => 'gilnk', 'larry wall' => 'larywl', 'etheroskedasticity' => 'ethrskdstcty', ); my ($input, $result); is(weirdex($input), $result, "$input") while ($input, $result) = each +%test; __END__ ok 1 - etheroskedasticity ok 2 - larry wall ok 3 - giulienk 1..3

    Makeshifts last the longest.

Re: Weird "soundex" algorithm
by Anonymous Monk on Aug 28, 2003 at 15:37 UTC
    /[aeiou]/i should be  /[aeiouy]/i.

    Y is a vowel, isn't it?

      Sometimes, so it may not be necessary for this particular algorithm. It would also appear that two Welsh words in the dictionary use 'w' as a vowel.


      Y and W are pseudo-vowels. They represent vowel sounds, but are not vowel letters. This is just one of the many contradictions between spoken and written language. I think W is a vowel in Croatian.
Re: Weird "soundex" algorithm
by giulienk (Curate) on Sep 01, 2003 at 06:56 UTC
    I add a couple comments for posterity:
    • All the solutions (except the original one) have the same problem: they don't care about upper-case vowels. All examples were in lower cases so nobody noticed it.
    • The original implementation was wrong has it would consider consonant separated by a "non-first-vowel" as contigous ones and squash them if they are equal.
    • The illustrious surname "Schwartz" with its consonant-to-vowel ratio of 7.0 render this algorithm pretty useless :)


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://287394]
Approved by jwest
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (9)
As of 2017-10-19 18:27 GMT
Find Nodes?
    Voting Booth?
    My fridge is mostly full of:

    Results (255 votes). Check out past polls.