How can I do a numeric sort on a substring?

misterperl has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How can I do a numeric sort on a substring? by hippo (Bishop) on Jun 25, 2021 at 13:41 UTC
The general case would be to use a Schwartzian Transform but in this simplistic case for small values of n you can just perform the extractions within the sort: `use strict; use warnings; use Test::More tests => 1; my @in = qw/a-3 a-1 a-2/; my @want = qw/a-1 a-2 a-3/; my @have = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } @in; is_deeply \@have, \@want;` [download] See also the FAQ: How do I sort an array by (anything)? PS. Here's the same thing but with substr: `use strict; use warnings; use Test::More tests => 1; my @in = qw/a-3 a-1 a-2/; my @want = qw/a-1 a-2 a-3/; my @have = sort { substr ($a, 2) <=> substr ($b, 2) } @in; is_deeply \@have, \@want;` [download] 🦛	[reply] [d/l] [select]
Re^2: How can I do a numeric sort on a substring? by misterperl (Pilgrim) on Jun 25, 2021 at 14:01 UTC
TYVM that's very helpful. Is the [0] there in case there are more than one number matching like a-1-3 ? TY!	[reply]
Re^3: How can I do a numeric sort on a substring? by haukex (Archbishop) on Jun 25, 2021 at 14:07 UTC
in case there are more than one number matching like a-1-3 ? Do you mean something like Sort::Key::Natural? See also How do I do a natural sort on an array?	[reply]
Re^4: How can I do a numeric sort on a substring? by LanX (Saint) on Jun 25, 2021 at 16:48 UTC
Re^4: How can I do a numeric sort on a substring? by misterperl (Pilgrim) on Jun 25, 2021 at 14:15 UTC
Re^4: How can I do a numeric sort on a substring? by misterperl (Pilgrim) on Jun 25, 2021 at 14:09 UTC
Re^5: How can I do a numeric sort on a substring? by BillKSmith (Monsignor) on Jun 25, 2021 at 15:56 UTC
Some notes below your chosen depth have not been shown here
Re^3: How can I do a numeric sort on a substring? by hippo (Bishop) on Jun 25, 2021 at 15:12 UTC
Not specifically. The regex match is context sensitive, so the brackets around it enforce list context and then the `[0]` pulls out the first value of that list before the `<=>` gets a chance to enforce scalar context on it. Example: `#!/usr/bin/env perl use strict; use warnings; use feature 'say'; my $x = 'a-9'; say "Scalar: " . $x =~ /(\d+)/; say "List: " . ($x =~ /(\d+)/); say "First: " . ($x =~ /(\d+)/)[0];` [download] Read lots more in the Context tutorial. 🦛	[reply] [d/l] [select]
Re: How can I do a numeric sort on a substring? by haukex (Archbishop) on Jun 25, 2021 at 14:01 UTC
Note this question is very similar to your question How can I sort my array numerically on part of the string?, and many of the solutions from there apply here as well.	[reply]
Re^2: How can I do a numeric sort on a substring? by misterperl (Pilgrim) on Jun 25, 2021 at 14:11 UTC
yes TY, I was able to do the sort, but I'm seeking a 1-liner to do the work this time.	[reply]
Re: How can I do a numeric sort on a substring? by syphilis (Archbishop) on Jun 25, 2021 at 14:07 UTC
I want to sort on the numeric part using `$a <=> $b` Here's my guess: `use strict; use warnings; my @data = qw(a-2 a-10 a-9 a-8 a-0 a-1 a-3 a-6 a-5 a-4 a-7); my @sorted = map {'a-' . $_} sort {$b <=> $a} map {substr($_, 2, lengt +h($_) -2) } @data; print "@sorted\n"; __END__ Outputs: a-10 a-9 a-8 a-7 a-6 a-5 a-4 a-3 a-2 a-1 a-0` [download] If you want them to appear in the reverse order, instead do: `my @sorted = map {'a-' . $_} sort {$a <=> $b} map {substr($_, 2, lengt +h($_) -2) } @data;` [download] Cheers, Rob	[reply] [d/l] [select]
Re^2: How can I do a numeric sort on a substring? by misterperl (Pilgrim) on Jun 25, 2021 at 14:13 UTC
nicely done Rob! ++ vote also love it..	[reply]
Re: How can I do a numeric sort on a substring? [Benchmark] by kcott (Archbishop) on Jun 26, 2021 at 05:29 UTC
G'day misterperl, Other than a number of solutions involving modules, there were several pure Perl solutions. I was interested in how these might compare. The results I got were unexpected; I added a lot of variations (currently up to 14) but the same trend continued. I had thought that the Schwartzian and Guttman-Rosler transformations would have been the fastest, but that wasn't the case. In fact, those that performed extraction of the number, prior to comparison, within the `sort`, were the quickest. Given these extractions might have been done multiple times for the same data, I suspect Perl (v5.34.0) is performing some optimisation and perhaps caching results; I would be interested in results others receive using different Perl versions. In case there's concern about the number of tests, the entire benchmark should run in under a minute: a COUNT of zero indicates 3 seconds for each of the 14 tests, plus a few seconds for other overhead. Here's a rough breakdown of the results: Schwartzian Transform: slowest. Test names begin with `ST`. Guttman-Rosler Transform: second slowest but roughly twice as fast as Schwartzian Transform. Test names begin with `GRT`. syphilis' solution: about 40-50% faster than Guttman-Rosler Transform; removing the `length` calculation added a small boost. Test names begin with `mcs`. Other solutions only using `sort` were the hands-down winners by far. Their speeds varied slightly between runs; the results were too close to call. Test names begin with `s`. Here's a sample run. It might be worth using the "download" link to see this without wrapping. Perl & OS: v5.34.0 on cygwin Unordered data (for preamble tests): a-10 a-01 a-22 a-2 a-0 a-3 a-000 a-1 a-12345 a-1 Preamble tests: grt_pack_expr: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 grt_pack_expr_q: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex_anchored: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex_anch_expr_ni: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex_anch_ni: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex_expr_ni: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 st_regex_no_index: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 map_cat_substr: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 map_cat_substr_len: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 sort_pack: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 sort_regex: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 sort_regex_anchored: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 sort_substr: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a- +12345 Legend: GRTpe: grt_pack_expr GRTpeq: grt_pack_expr_q STr: st_regex STra: st_regex_anchored STraen: st_regex_anch_expr_ni STran: st_regex_anch_ni STren: st_regex_expr_ni STrn: st_regex_no_index mcs: map_cat_substr mcsl: map_cat_substr_len sp: sort_pack sr: sort_regex sra: sort_regex_anchored ss: sort_substr Benchmarks: Note: Unordered data extended with 'map "a-$_", shuffle 0..10000' Rate STraen STran STra STrn STren STr GRTp +e GRTpeq mcsl mcs sp sr ss sra STraen 68.2/s -- -0% -1% -2% -3% -5% -54 +% -54% -60% -61% -100% -100% -100% -100% STran 68.6/s 0% -- -0% -1% -3% -4% -54 +% -54% -60% -61% -100% -100% -100% -100% STra 68.6/s 1% 0% -- -1% -2% -4% -53 +% -54% -60% -61% -100% -100% -100% -100% STrn 69.3/s 2% 1% 1% -- -1% -3% -53 +% -54% -59% -60% -100% -100% -100% -100% STren 70.4/s 3% 3% 3% 2% -- -2% -52 +% -53% -59% -60% -100% -100% -100% -100% STr 71.5/s 5% 4% 4% 3% 2% -- -52 +% -52% -58% -59% -100% -100% -100% -100% GRTpe 147/s 116% 115% 115% 113% 109% 106% - +- -1% -13% -16% -100% -100% -100% -100% GRTpeq 149/s 119% 118% 118% 116% 112% 109% 1 +% -- -12% -15% -100% -100% -100% -100% mcsl 170/s 149% 148% 148% 145% 141% 138% 15 +% 14% -- -3% -100% -100% -100% -100% mcs 175/s 157% 156% 155% 153% 149% 145% 19 +% 17% 3% -- -100% -100% -100% -100% sp 263168/s 385629% 383801% 383272% 379410% 373718% 368096% 178350 +% 175994% 154699% 149996% -- -2% -2% -3% sr 268680/s 393708% 391841% 391301% 387358% 381547% 375807% 182088 +% 179682% 157941% 153139% 2% -- -0% -1% ss 269453/s 394842% 392970% 392428% 388474% 382646% 376890% 182612 +% 180200% 158396% 153581% 2% 0% -- -0% sra 270688/s 396651% 394771% 394227% 390254% 384400% 378617% 183450 +% 181026% 159122% 154285% 3% 1% 0% -- [download] There was a typo in the code; see update below. The original `Benchmarks:` section is in the spoiler below. Benchmarks: Note: Unordered data extended with 'map "a-$_", shuffle 0..10000' Rate STra STr STrn STran STraen STren GRTpe +q GRTpe mcsl mcs sra sp sr ss STra 69.0/s -- -0% -0% -1% -1% -2% -52 +% -52% -58% -60% -100% -100% -100% -100% STr 69.3/s 0% -- -0% -1% -1% -1% -52 +% -52% -58% -60% -100% -100% -100% -100% STrn 69.3/s 0% 0% -- -1% -1% -1% -51 +% -52% -58% -60% -100% -100% -100% -100% STran 69.7/s 1% 1% 1% -- -0% -1% -51 +% -51% -58% -60% -100% -100% -100% -100% STraen 69.7/s 1% 1% 1% 0% -- -0% -51 +% -51% -58% -60% -100% -100% -100% -100% STren 70.1/s 2% 1% 1% 1% 0% -- -51 +% -51% -58% -59% -100% -100% -100% -100% GRTpeq 143/s 107% 106% 106% 105% 105% 104% - +- -0% -14% -17% -100% -100% -100% -100% GRTpe 144/s 108% 107% 107% 106% 106% 105% 0 +% -- -14% -17% -100% -100% -100% -100% mcsl 166/s 141% 140% 140% 138% 138% 137% 16 +% 16% -- -4% -100% -100% -100% -100% mcs 173/s 150% 149% 149% 148% 148% 146% 21 +% 20% 4% -- -100% -100% -100% -100% sra 269609/s 390588% 389108% 388890% 386666% 386573% 384706% 188564 +% 187751% 162119% 156078% -- -1% -1% -1% sp 271738/s 393672% 392181% 391961% 389720% 389626% 387744% 190053 +% 189234% 163400% 157311% 1% -- -1% -1% sr 273197/s 395787% 394288% 394067% 391813% 391719% 389826% 191074 +% 190251% 164278% 158157% 1% 1% -- -0% ss 273197/s 395787% 394288% 394067% 391813% 391719% 389826% 191074 +% 190251% 164278% 158157% 1% 1% 0% -- [download] Here's the code: #!/usr/bin/env perl use strict; use warnings; use namespace::autoclean; use Benchmark 'cmpthese'; use List::Util 'shuffle'; my @unordered = qw{a-10 a-01 a-22 a-2 a-0 a-3 a-000 a-1 a-12345 a-1}; my %expanded_abbrev_for = ( sr => 'sort_regex', STr => 'st_regex', STrn => 'st_regex_no_index', STren => 'st_regex_expr_ni', sra => 'sort_regex_anchored', STra => 'st_regex_anchored', STran => 'st_regex_anch_ni', STraen => 'st_regex_anch_expr_ni', ss => 'sort_substr', mcs => 'map_cat_substr', mcsl => 'map_cat_substr_len', sp => 'sort_pack', GRTpe => 'grt_pack_expr', GRTpeq => 'grt_pack_expr_q', ); my %coderef_for = ( sr => \&sort_regex, STr => \&st_regex, STrn => \&st_regex_no_index, STren => \&st_regex_expr_ni, sra => \&sort_regex_anchored, STra => \&st_regex_anchored, STran => \&st_regex_anch_ni, STraen => \&st_regex_anch_expr_ni, ss => \&sort_substr, mcs => \&map_cat_substr, mcsl => \&map_cat_substr_len, sp => \&sort_pack, GRTpe => \&grt_pack_expr, GRTpeq => \&grt_pack_expr_q, ); print "Perl & OS:\n $^V on $^O\n"; print "Unordered data (for preamble tests):\n @unordered\n"; print "Preamble tests:\n"; my $tests_fmt = " %-22s %s\n"; for my $name (sort keys %coderef_for) { printf $tests_fmt, "$expanded_abbrev_for{$name}:", "@{[$coderef_for{$name}->()]}"; } exit if @ARGV && $ARGV[0] eq '--dry_run'; print "Legend:\n"; my $legend_fmt = " %-7s %s\n"; for my $abbrev (sort keys %expanded_abbrev_for) { printf $legend_fmt, "$abbrev:", $expanded_abbrev_for{$abbrev}; } # Extend @unordered for improved benchmarking push @unordered, map "a-$_", shuffle 0..10000; print "Benchmarks:\n"; print " Note: Unordered data extended with 'map \"a-\$_\", shuffle +0..10000'\n"; my $count = 0; cmpthese $count => \%coderef_for; sub sort_regex { sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } @unordered; } sub st_regex { map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, (/(\d+)/)[0]] } @unordered; } sub st_regex_no_index { map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, /(\d+)/] } @unordered; } sub st_regex_expr_ni { map $_->[0], sort { $a->[1] <=> $b->[1] } map [$_, /(\d+)/], @unordered; } sub sort_regex_anchored { sort { ($a =~ /(\d+)$/)[0] <=> ($b =~ /(\d+)$/)[0] } @unordered; } sub st_regex_anchored { map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, (/(\d+)$/)[0]] } @unordered; } sub st_regex_anch_ni { map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, /(\d+)$/] } @unordered; } sub st_regex_anch_expr_ni { map $_->[0], sort { $a->[1] <=> $b->[1] } map [$_, /(\d+)$/], @unordered; } sub sort_substr { sort { substr($a, 2) <=> substr($b, 2) } @unordered; } sub map_cat_substr { map { 'a-' . $_ } sort { $a <=> $b } map { substr $_, 2 } @unordered; } sub map_cat_substr_len { map { 'a-' . $_ } sort { $a <=> $b } map { substr $_, 2, length($_) - 2 } @unordered; } sub sort_pack { sort { pack(L => substr($a, 2)) cmp pack(L => substr($b, 2)) } @unordered; } sub grt_pack_expr { map substr($_, 4), sort map pack(L => substr($_, 2)) . $_, @unordered; } sub grt_pack_expr_q { map substr($_, 8), sort map pack(Q => substr($_, 2)) . $_, @unordered; } [download] Notes on pack: I'm not any great expert with this function. Happy to receive corrections, advice, etc. The doco indicates that the "`Q`" template is not universally available. If you don't have it, you might want to remove `&grt_pack_expr_q` and the two lines referencing it in `%expanded_abbrev_for` and `%coderef_for` (the key is "`GRTpeq`" in both cases). If you do have it, be aware others, with whom you share your code, might not. Update: There was a typo in my original code: the first key/value pair in `%coderef_for` was `sr => \&sort_substr` but it should have been `sr => \&sort_regex`. I've fixed my code locally and edited the code above to reflect this fix. I reran the code and have posted a new `Benchmarks:` section. All of the output before `Benchmarks:` is unchanged from the original posting. The old `Benchmarks:` section is now in a spoiler. The results are not particularly different — except `sr` and `ss` are no longer identical; although, still too close to call — and my original "rough breakdown of the results" has not changed. — Ken	[reply] [d/l] [select]
Re^2: How can I do a numeric sort on a substring? [Benchmark] by LanX (Saint) on Jun 26, 2021 at 13:28 UTC
> I had thought that the Schwartzian and Guttman-Rosler transformations would have been the fastest, but that wasn't the case. In fact, those that performed extraction of the number, prior to comparison, within the sort, were the quickest you are only testing with 10 elements in your `@unordered` array. for `n` elements you have in best case `O(nlog(n))` comparisons but `O(n)` packs and unpacks with ST and GRT, which means the overhead to extract n numbers will account much more for small n. use at least n >> 1000 elements for a real benchmark. Rule of thumb* : the choice of algorithm is almost always neglectable for small data. update it's like getting the Porsche out of the garage to buy a six-pack of beer just 10m around the corner. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re^3: How can I do a numeric sort on a substring? [Benchmark] by kcott (Archbishop) on Jun 27, 2021 at 02:41 UTC
G'day Rolf, "you are only testing with 10 elements in your `@unordered` array." The code only contains one comment which I added for the express purpose of heading off such feedback: `# Extend @unordered for improved benchmarking push @unordered, map "a-$_", shuffle 0..10000;` [download] "use at least n >> 1000 elements for a real benchmark" Indeed. I used 10,011 elements "`for improved benchmarking`". The preamble tests were to check that all subroutines returned identical results. Each of the preamble tests were only run once and did not involve any benchmarking; they were intended as a sanity check (processing stopped here if `--dry_run` was used). The data used for this had 10 elements, which I considered sufficient for these tests, and was indicated by: `Unordered data (for preamble tests):` [download] — Ken	[reply] [d/l] [select]
Re^4: How can I do a numeric sort on a substring? [Benchmark] by LanX (Saint) on Jun 27, 2021 at 11:56 UTC
Re^5: How can I do a numeric sort on a substring? (context matters) by LanX (Saint) on Jun 27, 2021 at 13:46 UTC
Some notes below your chosen depth have not been shown here
Re^2: How can I do a numeric sort on a substring? [Benchmark] by Anonymous Monk on Jun 26, 2021 at 08:18 UTC
my $HAS_Q = !!eval { pack 'Q', 1; 1 }; ...	[reply]
Re: How can I do a numeric sort on a substring? by AnomalousMonk (Archbishop) on Jun 25, 2021 at 16:22 UTC
Here's another example of the Schwartzian mentioned earlier. I've written it as a "one-liner" although that's not the form I'd prefer for production code for reasons of readability/maintainability. I've added more test values in different formats to illustrate the effects of the regex that is used. `Win8 Strawberry 5.8.9.5 (32) Fri 06/25/2021 11:50:17 C:\@Work\Perl\monks >perl use strict; use warnings; use Test::More tests => 1; 1..1 my @in = qw/a-3 21-2 a-2 bb-10 c--11-1 a-1 def-2-99/; my @want = qw/a-1 a-2 def-2-99 a-3 bb-10 c--11-1 21-2/; my @got = map $_->[0], sort { $a->[1] <=> $b->[1] } map [ $_, /\d+/g ] +, @in; is_deeply \@got, \@want; ^Z ok 1` [download] Another consideration when sorting is "stability." See the discussion of this here and in the sort pragma. See also Guttman and Rosler's A Fresh Look at Efficient Perl Sorting for lots more info on Perl sorting. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re: How can I do a numeric sort on a substring? by Marshall (Canon) on Jun 28, 2021 at 00:58 UTC
I wouldn't worry about whether this is one line of code or not. A simple way is shown below. $a and $b are special variables used by sort. Here the numeric value at the end of string is used as the primary thing to sort upon. If the numeric values are equal, then string comparison is used as a "tie breaker". This will work fine performance wise for say 100 things. If there are 1,000 things, start to think about something more complex. If there are 10,000 things definitely think about more complex idioms if you do this "often". I would recommend keeping things simple unless there is an obvious performance reason that requires the sort to be faster. `use strict; use warnings; my @array = qw(blah-13 yup-09 weird-2 stranger-1 strange-1); print "Input array:\n"; print "$_\n" for @array; @array = sort { my ($atext,$anum) = $a =~ /(\w+)-(\d+)/; my ($btext,$bnum) = $b =~ /(\w+)-(\d+)/; $anum <=> $bnum or $atext cmp $btext }@array; print "\nOutput array:\n"; print "$_\n" for @array; __END__ Input array: blah-13 yup-09 weird-2 stranger-1 strange-1 Output array: strange-1 stranger-1 weird-2 yup-09 blah-13` [download] An Update: I re-iterate my suggestion about the 10,000 items level. No, I don't present benchmarks of my own, but in my experience, that is where a very noticeable impact of say the ST will become apparent. If your program is interacting with a single user, even quite stunning performance increases might make little difference. A difference of say 20 ms is basically undetectable at the user UI level. If you are writing server level code then performance matters a lot more because it affects the number of users that can be serviced by your machine. Another update: A difference of 50 ms (1/20th of a second) will be noticeable by the user in things like audio playback of multiple files. Less than that doesn't make much difference. If your sort takes even 3ms which is a LONG time by modern computer standards, so what? Unless this a server process, I wouldn't worry about it.	[reply] [d/l]
Re: How can I do a numeric sort on a substring? by swl (Parson) on Jun 25, 2021 at 23:43 UTC
Use `natkeysort` from Sort::Key::Natural. Several of the other responses link to discussions which no doubt mention it, but this saves a bit of extra searching. Edit - and now I see this has already been explicitly mentioned in 11134278.	[reply] [d/l]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks

How can I do a numeric sort on a substring?

update