http://www.perlmonks.org?node_id=121053


in reply to Re: Getting impossible things right (behaviour of keys)
in thread Getting impossible things right (behaviour of keys)

Or even:
foreach my $suffic( sort { length($b) <=> length($a) or $a cmp $b } keys %sufdata)
to sort both on length and alphanumeric

Replies are listed 'Best First'.
Re: Re:{2} Getting impossible things right (behaviour of keys)
by blakem (Monsignor) on Oct 24, 2001 at 14:03 UTC
    Or you could exploit the fact that the regex looks for the leftmost matching string, and do something like:
    my $pattern = join('|',(sort {$a cmp $b} keys %sufdata)); if ($name =~ s/($pattern)$/$sufdata{$1}/o) { print "Suffix $1 -> $name\n"; return; } print " no rule apply -> $name\n";

    -Blake

      If this is going to be at all robust (umm and work as desired, sorry Blakem) I would change the sort to the following:
      my $regex=join '|', map {substr $_,2} sort {$a cmp $b} map {pack "SA*",length($_),quotemeta($_)} keys %su +fdata;
      Your code doesnt actually sort the words by length. (Yes I _am_ deliberately storing the length before I quotemeta it.)

      :-)

      Update
      Thanks to Amoe I reexamined this and realized I missed an opportunity for lazyness that geeky virtue:

      my $regex=join '|', map {substr $_,2} sort map {pack "SA*",length($_),quotemeta($_)} keys %su +fdata;
      Although IIRC perl will optimize the first into the second anyway, it does save about 10 chars or so..
      Oh also for the curious this is more modern form of the Schwartizian Transform which is a very cool trick. Unfortunately I cant remember the name of this version, nor the link to the excellent document I read about it. Hopefully someone that does will post a reply.

      Update2
      Tilly kindly supplied the link (see replies to this post). However the name I had in mind is the GRT or Guttman Rosler Transform.

      DeMerphq / Yves
      --
      Have you registered your Name Space?

        I think the phrase you want is, "packed default".

        It is discussed in this paper on efficient sorting in Perl.

        Ah, but you dont *need* to sort by length... the regex is anchored at the end, so the pattern that matches first from left to right will *already be* the longest match. For instance, look at the following code:
        #!/usr/bin/perl -wT use strict; my $text = 'fedcba'; $text =~ (/(a|ba|cba|dcba)$/); print "Match: $1\n"; =OUTPUT Match: dcba
        It matches on the longest one, even though its at the end of the alternation.... thats because the regex engine works from left to right, and the first one that matches wins. That was the whole point of my post, sorry I wasn't more explicit.

        -Blake

      I don't believe sorting on cmp alone will do it in this case. If you had two keys 'ba' and 'cba', then you're going to get the wrong behavior. As long as you're tying the regex to the end of the string, length should be sufficient.

      -----------------------------------------------------
      Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
      "I can see my house from here!"
      It's not what you know, but knowing how to find it if you don't know that's important

        Good, point... the entire sort is unnecessary. The length is taken care of by the regex engine, and its not possible to have two matches of the same length, so no sorting all all is needed. Therefore, I submit that:
        my $pattern = join('|',(keys %sufdata)); if ($name =~ s/($pattern)$/$sufdata{$1}/o) { print "Suffix $1 -> $name\n"; return; } print " no rule apply -> $name\n";
        Would work the same as my code above.

        -Blake

      Hmm.. from perlre:
      Alternatives are tried from left to right, so the first alternative found for which the entire expression matches, is the one that is chosen. This means that alternatives are not necessarily greedy. For example: when matching `foo|foot' against "barefoot", only the "foo" part will match, as that is the first alternative tried, and it successfully matches the target string.

      So that would take a string random with respect to length.

      Jeroen

      Update: I see. Can you explain that regex-feature?

        Reply to Update.

        Read the paragraph you quoted above very carefully.....

        The regex engine works something like this: it moves from left to right, checking each alternation in order before moving on to the next position in the string.

        Now, lets consider the pattern /(a|b|c)/ against the string 123abc. In the following diagram '^' denotes the current position in the string. The "pointer" gets moved one char to the right after each stanza:

        ($txt = '123abc') =~ /(a|b|c)/; 1. ^123abc - check for a -> fail - check for b -> fail - check for c -> fail 2. 1^23abc - check for a -> fail - check for b -> fail - check for c -> fail 3. 12^3abc - check for a -> fail - check for b -> fail - check for c -> fail 4. 123^abc - check for a -> succeess, $1 becomes 'a'
        Here are a few others, to illustrate the point.
        ($txt = 'barefoot') =~ /(foo|foot)/; 1. ^barefoot - check for foo -> fail - check for foot -> fail 2. b^arefoot - check for foo -> fail - check for foot -> fail 3. ba^refoot - check for foo -> fail - check for foot -> fail 4. bar^efoot - check for foo -> fail - check for foot -> fail 5. bare^foot - check for foo -> success, $1 becomes 'foo'
        Here, 'foo' beats 'foot' because they match at the same spot in the string, and foo is listed first in the alternation. The leftmost match will always win, the order they are listed in the alternation is merely a tie breaker.

        And the one in question (slightly shortened)..

        ($txt = 'arvec') =~ /(ar|ec|vec)$/ 1. ^arvec - check for ar$ -> fail (because of the end-of-string anch +or) - check for ec$ -> fail - check for vec$ -> fail 2. a^rvec - check for ar$ -> fail - check for ec$ -> fail - check for vec$ -> fail 3. ar^vec - check for ar$ -> fail - check for ec$ -> fail - check for vec$ -> success, $1 becomes 'vec'
        Notice how 'ec' *would* match in the next stanza ("arv^ec"), but we never get that far. The first match wins, and in this situation it is the one we want.

        Now, go back and read that paragraph you quoted again... does it make more sense now?

        -Blake

        Add the requirement that it has to be anchored at the end, and you will get the longest match. Try /(ot|foot|refoot)$/ and see which one matches. Bet ya its the longest one.

        -Blake

      I feel ashamed...

      but I donīt understand this code. I would like to before I try if it works or not. (call me a theory fetish monk)

      First I donīt understand why you use cmp for the sort.
      Then (or because of this) I donīt see how we can be sure that the longest suffixes are examined first.

      I could understand what ($name =~ s/($pattern)$/$sufdata{$1}/o) does. And in fact it is elegant to use regexp instead of an own iteration, just the $pattern creation isnīt clear.

      Ciao

        OK, I'll take you through it... it looks like I misled several others with my late night post as well.

        First, lets look at a slightly newer version of the code:

        1: my $pattern = join('|',(keys %sufdata)); 2: if ($name =~ s/($pattern)$/$sufdata{$1}/o) { 3: print "Suffix $1 -> $name\n"; 4: return; 5: } 6: print " no rule apply -> $name\n";
        Our main goal is to construct a pattern that matches a piece at the end of the string. This match needs to be one of the keys in %sufdata, specifically the longest such key that matches.

        So, start with line #1.

        my $pattern = join('|',(keys %sufdata));
        Here we create a list of keys in %sufdata, then join them together with '|' symbols. $pattern is now a string that looks like "a|s|k|ar|ic|ec|ek|er|vec" though the order of the keys is unknown (and unimportant in this specific case).

        In line #2:

        if ($name =~ s/($pattern)$/$sufdata{$1}/o) {
        If we expand $pattern oursleves, we get:
        if ($name =~ s/(a|s|k|ar|ic|ec|ek|er|vec)$/$sufdata{$1}/o) {
        We check for a match, and replace the first one we find with the corresponding value from %sufdata.

        Now here's the trick... the regex engine works from left to right, checking if *any* of those keys match at each position along the way. Since we are anchored at the end, and we use the first one we find, this effectively becomes a search for the longest matching suffix.

        So, because of the "left most matching" behavior of the regex engine, we can feel assured that the first match, will actually be the longest one in this particular instance.

        Lines 3-6 are left as an excercise for the reader. ;-)

        -Blake