Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

How can I do a numeric sort on a substring?

by misterperl (Pilgrim)
on Jun 25, 2021 at 13:27 UTC ( [id://11134273]=perlquestion: print w/replies, xml ) Need Help??

misterperl has asked for the wisdom of the Perl Monks concerning the following question:

I often run into this situation, where I have an array with values like:
a-1 a-2 a-3 .. a-n

and I want to sort on the numeric part using $a <=> $b But instead I jump thru hoops using a regex to get the numbers then save them in another array, etc etc.. I'm pretty sure this can be a 1-line sort? I appreciate the wisdom of the Monks as always!

Replies are listed 'Best First'.
Re: How can I do a numeric sort on a substring?
by hippo (Bishop) on Jun 25, 2021 at 13:41 UTC

    The general case would be to use a Schwartzian Transform but in this simplistic case for small values of n you can just perform the extractions within the sort:

    use strict; use warnings; use Test::More tests => 1; my @in = qw/a-3 a-1 a-2/; my @want = qw/a-1 a-2 a-3/; my @have = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } @in; is_deeply \@have, \@want;

    See also the FAQ: How do I sort an array by (anything)?

    PS. Here's the same thing but with substr:

    use strict; use warnings; use Test::More tests => 1; my @in = qw/a-3 a-1 a-2/; my @want = qw/a-1 a-2 a-3/; my @have = sort { substr ($a, 2) <=> substr ($b, 2) } @in; is_deeply \@have, \@want;

    🦛

      TYVM that's very helpful. Is the [0] there in case there are more than one number matching like a-1-3 ? TY!

        Not specifically. The regex match is context sensitive, so the brackets around it enforce list context and then the [0] pulls out the first value of that list before the <=> gets a chance to enforce scalar context on it. Example:

        #!/usr/bin/env perl use strict; use warnings; use feature 'say'; my $x = 'a-9'; say "Scalar: " . $x =~ /(\d+)/; say "List: " . ($x =~ /(\d+)/); say "First: " . ($x =~ /(\d+)/)[0];

        Read lots more in the Context tutorial.


        🦛

Re: How can I do a numeric sort on a substring?
by haukex (Archbishop) on Jun 25, 2021 at 14:01 UTC
      yes TY, I was able to do the sort, but I'm seeking a 1-liner to do the work this time.
Re: How can I do a numeric sort on a substring?
by syphilis (Archbishop) on Jun 25, 2021 at 14:07 UTC
    I want to sort on the numeric part using $a <=> $b

    Here's my guess:
    use strict; use warnings; my @data = qw(a-2 a-10 a-9 a-8 a-0 a-1 a-3 a-6 a-5 a-4 a-7); my @sorted = map {'a-' . $_} sort {$b <=> $a} map {substr($_, 2, lengt +h($_) -2) } @data; print "@sorted\n"; __END__ Outputs: a-10 a-9 a-8 a-7 a-6 a-5 a-4 a-3 a-2 a-1 a-0
    If you want them to appear in the reverse order, instead do:
    my @sorted = map {'a-' . $_} sort {$a <=> $b} map {substr($_, 2, lengt +h($_) -2) } @data;
    Cheers,
    Rob
      nicely done Rob! ++ vote also love it..
Re: How can I do a numeric sort on a substring? [Benchmark]
by kcott (Archbishop) on Jun 26, 2021 at 05:29 UTC

    G'day misterperl,

    Other than a number of solutions involving modules, there were several pure Perl solutions. I was interested in how these might compare. The results I got were unexpected; I added a lot of variations (currently up to 14) but the same trend continued.

    I had thought that the Schwartzian and Guttman-Rosler transformations would have been the fastest, but that wasn't the case. In fact, those that performed extraction of the number, prior to comparison, within the sort, were the quickest. Given these extractions might have been done multiple times for the same data, I suspect Perl (v5.34.0) is performing some optimisation and perhaps caching results; I would be interested in results others receive using different Perl versions. In case there's concern about the number of tests, the entire benchmark should run in under a minute: a COUNT of zero indicates 3 seconds for each of the 14 tests, plus a few seconds for other overhead.

    Here's a rough breakdown of the results:

    • Schwartzian Transform: slowest. Test names begin with ST.
    • Guttman-Rosler Transform: second slowest but roughly twice as fast as Schwartzian Transform. Test names begin with GRT.
    • syphilis' solution: about 40-50% faster than Guttman-Rosler Transform; removing the length calculation added a small boost. Test names begin with mcs.
    • Other solutions only using sort were the hands-down winners by far. Their speeds varied slightly between runs; the results were too close to call. Test names begin with s.

    Here's a sample run. It might be worth using the "download" link to see this without wrapping.

    Perl & OS: v5.34.0 on cygwin Unordered data (for preamble tests): a-10 a-01 a-22 a-2 a-0 a-3 a-000 a-1 a-12345 a-1 Preamble tests: grt_pack_expr: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 grt_pack_expr_q: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex_anchored: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex_anch_expr_ni: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex_anch_ni: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex_expr_ni: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex_no_index: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 map_cat_substr: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 map_cat_substr_len: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 sort_pack: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 sort_regex: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 sort_regex_anchored: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 sort_substr: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 Legend: GRTpe: grt_pack_expr GRTpeq: grt_pack_expr_q STr: st_regex STra: st_regex_anchored STraen: st_regex_anch_expr_ni STran: st_regex_anch_ni STren: st_regex_expr_ni STrn: st_regex_no_index mcs: map_cat_substr mcsl: map_cat_substr_len sp: sort_pack sr: sort_regex sra: sort_regex_anchored ss: sort_substr Benchmarks: Note: Unordered data extended with 'map "a-$_", shuffle 0..10000' Rate STraen STran STra STrn STren STr GRTp +e GRTpeq mcsl mcs sp sr ss sra STraen 68.2/s -- -0% -1% -2% -3% -5% -54 +% -54% -60% -61% -100% -100% -100% -100% STran 68.6/s 0% -- -0% -1% -3% -4% -54 +% -54% -60% -61% -100% -100% -100% -100% STra 68.6/s 1% 0% -- -1% -2% -4% -53 +% -54% -60% -61% -100% -100% -100% -100% STrn 69.3/s 2% 1% 1% -- -1% -3% -53 +% -54% -59% -60% -100% -100% -100% -100% STren 70.4/s 3% 3% 3% 2% -- -2% -52 +% -53% -59% -60% -100% -100% -100% -100% STr 71.5/s 5% 4% 4% 3% 2% -- -52 +% -52% -58% -59% -100% -100% -100% -100% GRTpe 147/s 116% 115% 115% 113% 109% 106% - +- -1% -13% -16% -100% -100% -100% -100% GRTpeq 149/s 119% 118% 118% 116% 112% 109% 1 +% -- -12% -15% -100% -100% -100% -100% mcsl 170/s 149% 148% 148% 145% 141% 138% 15 +% 14% -- -3% -100% -100% -100% -100% mcs 175/s 157% 156% 155% 153% 149% 145% 19 +% 17% 3% -- -100% -100% -100% -100% sp 263168/s 385629% 383801% 383272% 379410% 373718% 368096% 178350 +% 175994% 154699% 149996% -- -2% -2% -3% sr 268680/s 393708% 391841% 391301% 387358% 381547% 375807% 182088 +% 179682% 157941% 153139% 2% -- -0% -1% ss 269453/s 394842% 392970% 392428% 388474% 382646% 376890% 182612 +% 180200% 158396% 153581% 2% 0% -- -0% sra 270688/s 396651% 394771% 394227% 390254% 384400% 378617% 183450 +% 181026% 159122% 154285% 3% 1% 0% --

    There was a typo in the code; see update below. The original Benchmarks: section is in the spoiler below.

    Benchmarks: Note: Unordered data extended with 'map "a-$_", shuffle 0..10000' Rate STra STr STrn STran STraen STren GRTpe +q GRTpe mcsl mcs sra sp sr ss STra 69.0/s -- -0% -0% -1% -1% -2% -52 +% -52% -58% -60% -100% -100% -100% -100% STr 69.3/s 0% -- -0% -1% -1% -1% -52 +% -52% -58% -60% -100% -100% -100% -100% STrn 69.3/s 0% 0% -- -1% -1% -1% -51 +% -52% -58% -60% -100% -100% -100% -100% STran 69.7/s 1% 1% 1% -- -0% -1% -51 +% -51% -58% -60% -100% -100% -100% -100% STraen 69.7/s 1% 1% 1% 0% -- -0% -51 +% -51% -58% -60% -100% -100% -100% -100% STren 70.1/s 2% 1% 1% 1% 0% -- -51 +% -51% -58% -59% -100% -100% -100% -100% GRTpeq 143/s 107% 106% 106% 105% 105% 104% - +- -0% -14% -17% -100% -100% -100% -100% GRTpe 144/s 108% 107% 107% 106% 106% 105% 0 +% -- -14% -17% -100% -100% -100% -100% mcsl 166/s 141% 140% 140% 138% 138% 137% 16 +% 16% -- -4% -100% -100% -100% -100% mcs 173/s 150% 149% 149% 148% 148% 146% 21 +% 20% 4% -- -100% -100% -100% -100% sra 269609/s 390588% 389108% 388890% 386666% 386573% 384706% 188564 +% 187751% 162119% 156078% -- -1% -1% -1% sp 271738/s 393672% 392181% 391961% 389720% 389626% 387744% 190053 +% 189234% 163400% 157311% 1% -- -1% -1% sr 273197/s 395787% 394288% 394067% 391813% 391719% 389826% 191074 +% 190251% 164278% 158157% 1% 1% -- -0% ss 273197/s 395787% 394288% 394067% 391813% 391719% 389826% 191074 +% 190251% 164278% 158157% 1% 1% 0% --

    Here's the code:

    #!/usr/bin/env perl use strict; use warnings; use namespace::autoclean; use Benchmark 'cmpthese'; use List::Util 'shuffle'; my @unordered = qw{a-10 a-01 a-22 a-2 a-0 a-3 a-000 a-1 a-12345 a-1}; my %expanded_abbrev_for = ( sr => 'sort_regex', STr => 'st_regex', STrn => 'st_regex_no_index', STren => 'st_regex_expr_ni', sra => 'sort_regex_anchored', STra => 'st_regex_anchored', STran => 'st_regex_anch_ni', STraen => 'st_regex_anch_expr_ni', ss => 'sort_substr', mcs => 'map_cat_substr', mcsl => 'map_cat_substr_len', sp => 'sort_pack', GRTpe => 'grt_pack_expr', GRTpeq => 'grt_pack_expr_q', ); my %coderef_for = ( sr => \&sort_regex, STr => \&st_regex, STrn => \&st_regex_no_index, STren => \&st_regex_expr_ni, sra => \&sort_regex_anchored, STra => \&st_regex_anchored, STran => \&st_regex_anch_ni, STraen => \&st_regex_anch_expr_ni, ss => \&sort_substr, mcs => \&map_cat_substr, mcsl => \&map_cat_substr_len, sp => \&sort_pack, GRTpe => \&grt_pack_expr, GRTpeq => \&grt_pack_expr_q, ); print "Perl & OS:\n $^V on $^O\n"; print "Unordered data (for preamble tests):\n @unordered\n"; print "Preamble tests:\n"; my $tests_fmt = " %-22s %s\n"; for my $name (sort keys %coderef_for) { printf $tests_fmt, "$expanded_abbrev_for{$name}:", "@{[$coderef_for{$name}->()]}"; } exit if @ARGV && $ARGV[0] eq '--dry_run'; print "Legend:\n"; my $legend_fmt = " %-7s %s\n"; for my $abbrev (sort keys %expanded_abbrev_for) { printf $legend_fmt, "$abbrev:", $expanded_abbrev_for{$abbrev}; } # Extend @unordered for improved benchmarking push @unordered, map "a-$_", shuffle 0..10000; print "Benchmarks:\n"; print " Note: Unordered data extended with 'map \"a-\$_\", shuffle +0..10000'\n"; my $count = 0; cmpthese $count => \%coderef_for; sub sort_regex { sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } @unordered; } sub st_regex { map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, (/(\d+)/)[0]] } @unordered; } sub st_regex_no_index { map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, /(\d+)/] } @unordered; } sub st_regex_expr_ni { map $_->[0], sort { $a->[1] <=> $b->[1] } map [$_, /(\d+)/], @unordered; } sub sort_regex_anchored { sort { ($a =~ /(\d+)$/)[0] <=> ($b =~ /(\d+)$/)[0] } @unordered; } sub st_regex_anchored { map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, (/(\d+)$/)[0]] } @unordered; } sub st_regex_anch_ni { map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, /(\d+)$/] } @unordered; } sub st_regex_anch_expr_ni { map $_->[0], sort { $a->[1] <=> $b->[1] } map [$_, /(\d+)$/], @unordered; } sub sort_substr { sort { substr($a, 2) <=> substr($b, 2) } @unordered; } sub map_cat_substr { map { 'a-' . $_ } sort { $a <=> $b } map { substr $_, 2 } @unordered; } sub map_cat_substr_len { map { 'a-' . $_ } sort { $a <=> $b } map { substr $_, 2, length($_) - 2 } @unordered; } sub sort_pack { sort { pack(L => substr($a, 2)) cmp pack(L => substr($b, 2)) } @unordered; } sub grt_pack_expr { map substr($_, 4), sort map pack(L => substr($_, 2)) . $_, @unordered; } sub grt_pack_expr_q { map substr($_, 8), sort map pack(Q => substr($_, 2)) . $_, @unordered; }

    Notes on pack:

    1. I'm not any great expert with this function. Happy to receive corrections, advice, etc.
    2. The doco indicates that the "Q" template is not universally available. If you don't have it, you might want to remove &grt_pack_expr_q and the two lines referencing it in %expanded_abbrev_for and %coderef_for (the key is "GRTpeq" in both cases). If you do have it, be aware others, with whom you share your code, might not.

    Update: There was a typo in my original code: the first key/value pair in %coderef_for was sr => \&sort_substr but it should have been sr => \&sort_regex. I've fixed my code locally and edited the code above to reflect this fix.

    I reran the code and have posted a new Benchmarks: section. All of the output before Benchmarks: is unchanged from the original posting. The old Benchmarks: section is now in a spoiler. The results are not particularly different — except sr and ss are no longer identical; although, still too close to call — and my original "rough breakdown of the results" has not changed.

    — Ken

      > I had thought that the Schwartzian and Guttman-Rosler transformations would have been the fastest, but that wasn't the case. In fact, those that performed extraction of the number, prior to comparison, within the sort, were the quickest

      you are only testing with 10 elements in your @unordered array.

      for n elements you have in best case O(n*log(n)) comparisons but O(n) packs and unpacks with ST and GRT, which means the overhead to extract n numbers will account much more for small n.

      use at least n >> 1000 elements for a real benchmark.

      Rule of thumb : the choice of algorithm is almost always neglectable for small data.

      update

      it's like getting the Porsche out of the garage to buy a six-pack of beer just 10m around the corner.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        G'day Rolf,

        "you are only testing with 10 elements in your @unordered array."

        The code only contains one comment which I added for the express purpose of heading off such feedback:

        # Extend @unordered for improved benchmarking push @unordered, map "a-$_", shuffle 0..10000;
        "use at least n >> 1000 elements for a real benchmark"

        Indeed. I used 10,011 elements "for improved benchmarking".

        The preamble tests were to check that all subroutines returned identical results. Each of the preamble tests were only run once and did not involve any benchmarking; they were intended as a sanity check (processing stopped here if --dry_run was used). The data used for this had 10 elements, which I considered sufficient for these tests, and was indicated by:

        Unordered data (for preamble tests):

        — Ken

      my $HAS_Q = !!eval { pack 'Q', 1; 1 }; ...
Re: How can I do a numeric sort on a substring?
by AnomalousMonk (Archbishop) on Jun 25, 2021 at 16:22 UTC

    Here's another example of the Schwartzian mentioned earlier. I've written it as a "one-liner" although that's not the form I'd prefer for production code for reasons of readability/maintainability. I've added more test values in different formats to illustrate the effects of the regex that is used.

    Win8 Strawberry 5.8.9.5 (32) Fri 06/25/2021 11:50:17 C:\@Work\Perl\monks >perl use strict; use warnings; use Test::More tests => 1; 1..1 my @in = qw/a-3 21-2 a-2 bb-10 c--11-1 a-1 def-2-99/; my @want = qw/a-1 a-2 def-2-99 a-3 bb-10 c--11-1 21-2/; my @got = map $_->[0], sort { $a->[1] <=> $b->[1] } map [ $_, /\d+/g ] +, @in; is_deeply \@got, \@want; ^Z ok 1
    Another consideration when sorting is "stability." See the discussion of this here and in the sort pragma. See also Guttman and Rosler's A Fresh Look at Efficient Perl Sorting for lots more info on Perl sorting.


    Give a man a fish:  <%-{-{-{-<

Re: How can I do a numeric sort on a substring?
by Marshall (Canon) on Jun 28, 2021 at 00:58 UTC
    I wouldn't worry about whether this is one line of code or not.
    A simple way is shown below.
    $a and $b are special variables used by sort.
    Here the numeric value at the end of string is used as the primary thing to sort upon.
    If the numeric values are equal, then string comparison is used as a "tie breaker".

    This will work fine performance wise for say 100 things.
    If there are 1,000 things, start to think about something more complex.
    If there are 10,000 things definitely think about more complex idioms if you do this "often".

    I would recommend keeping things simple unless there is an obvious performance reason that requires the sort to be faster.

    use strict; use warnings; my @array = qw(blah-13 yup-09 weird-2 stranger-1 strange-1); print "Input array:\n"; print "$_\n" for @array; @array = sort { my ($atext,$anum) = $a =~ /(\w+)-(\d+)/; my ($btext,$bnum) = $b =~ /(\w+)-(\d+)/; $anum <=> $bnum or $atext cmp $btext }@array; print "\nOutput array:\n"; print "$_\n" for @array; __END__ Input array: blah-13 yup-09 weird-2 stranger-1 strange-1 Output array: strange-1 stranger-1 weird-2 yup-09 blah-13
    An Update: I re-iterate my suggestion about the 10,000 items level. No, I don't present benchmarks of my own, but in my experience, that is where a very noticeable impact of say the ST will become apparent. If your program is interacting with a single user, even quite stunning performance increases might make little difference. A difference of say 20 ms is basically undetectable at the user UI level. If you are writing server level code then performance matters a lot more because it affects the number of users that can be serviced by your machine.

    Another update: A difference of 50 ms (1/20th of a second) will be noticeable by the user in things like audio playback of multiple files. Less than that doesn't make much difference. If your sort takes even 3ms which is a LONG time by modern computer standards, so what? Unless this a server process, I wouldn't worry about it.

Re: How can I do a numeric sort on a substring?
by swl (Parson) on Jun 25, 2021 at 23:43 UTC

    Use natkeysort from Sort::Key::Natural.

    Several of the other responses link to discussions which no doubt mention it, but this saves a bit of extra searching.

    Edit - and now I see this has already been explicitly mentioned in 11134278.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11134273]
Approved by marto
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (8)
As of 2024-04-19 08:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found