Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much

Sorting ip addresses quickly

by DrManhattan (Chaplain)
on Jul 13, 2000 at 22:43 UTC ( #22432=snippet: print w/replies, xml ) Need Help??
Description: I recently set out to sort some ip addresses. Here's what I came up with:

First the straightforward method:

sub by_ip
        # Split the two ip addresses up into octets
        my ($a1, $a2, $a3, $a4) = split /\./, $a;
        my ($b1, $b2, $b3, $b4) = split /\./, $b;

        # Check to see if the first octets are the same
        if ($a1 == $b1) {
                # If the first octets are the same, check
                # the second octets
                if ($a2 == $b2)
                        # Ditto for the third octets
                        if ($a3 == $b3)
                                # If the first 3 octets
                                # of each address are 
                                # the same, return the
                                # comparison of the last
                                # octet
                                $a4 =~ s/^([0-9]+)/$1/;
                                $b4 =~ s/^([0-9]+)/$1/;

                                return $a4 <=> $b4;
                        } else {
                                # 3rd octets were different
                                # so return their comparison
                                return $a3 <=> $b3;
                } else {
            # 2nd octets were different so
                        # return their comparison
                        return $a2 <=> $b2;
        } else {
                # Best case: The first octets were
                # different so we can return their
                # comparison immediately
                return $a1 <=> $b1;

my @sorted = sort by_ip @unsorted;

This works, but it's slow because we have to split both ip addresses and probably make several equality checks for each address comparison. So can we get away with only mangling each ip address once during the whole operation and only making one numeric comparison? Yes, Schwartzian Transform to the rescue:

my @sorted = map  { $_->[0] }
             sort { $a->[1] cmp $b->[1] }
             map  { [$_, sprintf("%03.f%03.f%03.f%03.f", split(/\./, $
+_))] }

Now what's this doing? Let's split it up into three separate statements:

@mapped = map { [$_, sprintf("%03.f%03.f%03.f%03.f", split(/\./, $_))]
+ } @unsorted;
@sorted = sort { $a->[1] cmp $b->[1] } @mapped;
@sorted = map { $_->[0] } @sorted;
  1. The first map command takes the list of ip addresses as input and gives us back a new array of arrayrefs. Now each array element in @mapped looks like [ "", "001002003004" ]. Now we can use the numeric conversion of each ip address to compare it directly with another.
  2. The sort command sorts the members of @mapped by comparing the second element of each (e.g. 001002003004), and leaves us with a sorted array of arrayrefs in @sorted.
  3. The final map command extracts the first element of each member of the sorted array (e.g. ""), and leaves us with a sorted list of ip addresses.

So why is it faster to use the transform? Well, even in the best case when the first octet of each ip address to be sorted is different, &by_ip() has to make two split()s and two numeric comparisons every time it's called. Since Perl's sort() uses a quicksort algorithm, it runs in O(N*log(N)) time. That makes 2N*log(N) split()s and 2N*log(N) numeric comparisons, where N is the number of addresses in the unsorted list. The transform, on the other hand, only split()s each address once and only makes one numeric comparison for each address comparison. N split()s and N*log(N) comparisons is a big win over 2N*log(N) split()s and 2N*log(N) comparisons.

- Matt

Replies are listed 'Best First'.
Sorting ip addresses quicker
by gryng (Hermit) on Jul 14, 2000 at 05:47 UTC
    You are indeed correct DrManhattan however a few small things that I'd like to add to make this code even faster. (Though, I'll readily admit, probably not the fastest!)

    The first thing I noticed was that you said: "Now we can use the numeric conversion of each ip address to compare it ...". This is in regards to your transformation from form to 001002003004 form. However, you use cmp not <=> to compare your ips after this point. cmp is for strings of course, and <=> for numbers. I thought that maybe by simply changing this we could get a little speed up:

    New code:

    my @sorted = map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, int sprintf("%03.f%03.f%03.f%03.f", split(/\. +/, $_))] } @unsorted;

    The two differences are that we are now forcing perl to compare the two ips as numbers not strings -- which should be faster, as a 12 digit binary number will only have 5 bytes to compare, whereas a 12 digit character string would have 12 bytes to compare. Also, concerning space, I added an int infront of the sprintf, in order to force perl to store the ip as a number, and not as a string.
    The result? A slight speed down (6%), I didn't bother to measure the space taken, it should have simply been proportional. However the fact that it got slower shows that the cost difference between 5 comparisons and 12 was not worth the additional cost of converting the 12 character string into a 5 byte number. However it is good to know that it should use less space, which may be more important than 6% speed difference.


    Loading ips Loaded 1000 ips Benchmark: running Fastnum, Faststr, each for at least 10 CPU seconds. +.. Fastnum: 11 wallclock secs (11.28 usr + 0.00 sys = 11.28 CPU) @ 7 +.98/s (n=90) Faststr: 12 wallclock secs (11.34 usr + 0.00 sys = 11.34 CPU) @ 8 +.29/s (n=94)
    Now, the real kicker is, why use sprintf? The first warning is that you are using a number as a string for longer than you need to. The modified code below treats the octets as numbers as soon as they shoot out of the split statement. On top of this sprintf isn't going to be the fastest subroutine to call.

    The second thing is, why multiply by 1000? Ip's can only range up to 255, not 999. So just multiply by 256 (which can be optimized away as a bit-shift instead of a full blown integer).

    Is the cost of string conversion worth the speed increase of removing the sprintf? Remember we are also substituting the sprintf with 4 multiplies. Of course we are also getting one byte back too, 4 bytes to store the ip instead of 5, so slightly faster to compare too (though as we can see the cost of converting can outway the time sorting).

    So here is the newer code, and benchmark below it.

    my @sorted = map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { my ($x,$y)=(0,$_); $x=$_ + $x * 256 for split(/\./, $y); [$y,$x]} @unsorted;
    Loading ips Loaded 1000 ips Benchmark: running Faster, Faststr, each for at least 10 CPU seconds.. +. Faster: 11 wallclock secs (11.09 usr + 0.00 sys = 11.09 CPU) @ 12 +.98/s (n=144) Faststr: 11 wallclock secs (11.24 usr + 0.00 sys = 11.24 CPU) @ 8 +.27/s (n=93)
    Tada, a 57% increase! Much much better. As you can see a subroutine call can make a big difference, they cost alot, plus you often don't get to know all the stuff that goes on inside.

    Well here is one final benchmark, all three running on a larger file. The main idea here is to put more weight on sorting side of the process than the transformation:

    Loading ips Loaded 100000 ips Benchmark: timing 5 iterations of Faster, Fastnum, Faststr... Faster: 58 wallclock secs (57.90 usr + 0.21 sys = 58.11 CPU) Fastnum: 83 wallclock secs (82.57 usr + 0.38 sys = 82.95 CPU) Faststr: 83 wallclock secs (82.74 usr + 0.00 sys = 82.74 CPU)
    Aha! As you can see, now fastnum is now nearly the same speed as faststr. So the difference of saving 7 comps for the cost of string conversion eventually pays off. This is, as DrManhattan mentioned earlier, because you are paying the cost of your comps N*log(N) times, not just N. Also notice that our speedup with the new code is also less drastic now that we have 100 times the ips to sort.

    Anyway, it's been fun,

    Updated: Fixed a small typo (thanks splinky)

      I think you meant

      $x=$_ + $x << 8 for split(/\./, $y);

      rather than

      $x=$_ + $x * 8 for split(/\./, $y);


      Neat! See, this is exactly why open source works.


        Yeah I agree, I think open source is great for learning, and for programming in general. Open forums like this are great places for learning essential techniques as well as those things that aren't taught in (most) books and classes.

        I wonder if anyone can come up with a (significantly) faster perl version of this code. I would be quite surprised if that was the fastest way.


RE: Sorting ip addresses quickly
by davorg (Chancellor) on Jul 14, 2000 at 11:43 UTC

    Sorting IP addresses is the canonical example of the Guttman-Rosler Transform (aka the packed default sort) given in A Fresh Look at Efficient Perl Sorting.

    The paper is well-worth reading in detail, but here is the IP sorting code.

    @out = map substr($_, 4) => sort map pack('C4' => /(\d+)\.(\d+)\.(\d+)\.(\d+)/) . $_ => @in;

    Notice that by careful choice of a 'pack' function they can use the default sort behaviour, rather than writing a custom sort routine.

    The benchmarks in the paper give this version as being about twice as fast as the Schwartzian Transform.

    Update: URL replaced with one that works. Thanks to grinder for pointing it out.


    European Perl Conference - Sept 22/24 2000, ICA, London

      I found that split() benchmarks a little faster than the regex in the following code

      #!/usr/bin/perl -w use strict; use Benchmark; use vars qw/@ip_strings/; # these are all made up I hope @ip_strings = qw( 1.2 +.3.4 10 +.12.14.16); timethese (10000, { 'split-em', q{ my @packed_ips; foreach (@ip_strings) { push @packed_ips, pack 'C4', split /\./, $_; } }, 'regex-em', q{ my @packed_ips; foreach (@ip_strings) { push @packed_ips, pack 'C4', /(\d+)\.(\d+)\.(\d+)\.(\d+) +/; } } } );

      Some typical results:

      Benchmark: timing 10000 iterations of regex-em, split-em... regex-em: 2 wallclock secs ( 3.64 usr + 0.00 sys = 3.64 CPU) split-em: 3 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) bash-2.04$ perl Benchmark: timing 10000 iterations of regex-em, split-em... regex-em: 3 wallclock secs ( 4.03 usr + 0.00 sys = 4.03 CPU) split-em: 3 wallclock secs ( 3.76 usr + 0.00 sys = 3.76 CPU)

      Philosophy can be made out of anything -- or less

RE: Sorting ip addresses quickly
by turnstep (Parson) on Jul 13, 2000 at 23:55 UTC
    Nice ST, but the non-ST example could be shorter:
    @sorted = sort { @a = split /\./, $a; @b = split /\./, $b; $a[0] <=> $b[0] or $a[1] <=> $b[1] or $a[2] <=> $b[2] or $a[3] <=> $b[3]; } @unsorted;
Re: Sorting ip addresses quickly
by salva (Abbot) on May 28, 2007 at 17:23 UTC
      print sort map { s/(\d+)/$1/eg ; $_ } <DATA>; __DATA__
        obviously that doesn't work:
        print sort map { s/(\d+)/$1/eg ; $_ } <DATA>; __DATA__
Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: snippet [id://22432]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (6)
As of 2019-06-24 08:52 GMT
Find Nodes?
    Voting Booth?
    Is there a future for codeless software?

    Results (97 votes). Check out past polls.