Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Sorting Numbers & Text

by PriNet (Beadle)
on Jul 12, 2012 at 20:41 UTC ( #981489=perlquestion: print w/ replies, xml ) Need Help??
PriNet has asked for the wisdom of the Perl Monks concerning the following question:

Ok, I have an issue *heh* I have these values;
041351920234 Rabbit 0343120 041271024500 0430870 Apple 041460301399
I have tried several "comparators" to "alphabatize" the list(individually)to try to get;
Apple Rabbit 0343120 0430870 041271024500 041351920234 041460301399
To no avail, i tried using gt and lt as comparators but it puts them in the "wrong" order;
Apple Rabbit 0343120 041351920234 0430870 041271024500 041460301399
I tried > and < to compare the values and it doesn't sort the Apple and Rabbit correctly becuase (obviously) perl interpets these two as "0" integers.
Rabbit Apple 0343120 0430870 041271024500 041351920234 041460301399

I'm dealing with both "numbers" and "text" but i can't figure out how to get perl to handle the numbers correctly or the text vica-versa. I don't want to use the "sort { $a->[0] <=> $b->[0] }" approach because I'm using a very large database and don't want to have "not-enough" ram/memory problems.

Anyone have a soloution, routine, or procedure to accomplish what I'm trying to do here? Goal is to create an alphabatized file of items(/numbers).

I did try re-inventing the wheel...But it kept getting stuck on the corners

Comment on Sorting Numbers & Text
Select or Download Code
Re: Sorting Numbers & Text
by johngg (Abbot) on Jul 12, 2012 at 21:21 UTC

    Do the sort using a Schwartzian Transform by storing an indicator on whether the value is alpha or numeric as well as the value itself. The outer sort is alpha then numeric and the inner sort comparator is chosen in a ternary depending on the indicator.

    knoppix@Microknoppix:~$ perl -E ' > my @values = qw{ > 041351920234 > Rabbit > 0343120 > 041271024500 > 0430870 > Apple > 041460301399 > }; > > say for > map { $_->[ 0 ] } > sort { > $a->[ 1 ] <=> $b->[ 1 ] > || > $a->[ 1 ] > ? $a->[ 0 ] <=> $b->[ 0 ] > : $a->[ 0 ] cmp $b->[ 0 ] > } > map { [ $_, m{^\d} ? 1 : 0 ] } > @values;' Apple Rabbit 0343120 0430870 041271024500 041351920234 041460301399 knoppix@Microknoppix:~$

    I hope this is helpful.

    Cheers,

    JohnGG

      I get the same result with:
      #!/usr/bin/env perl use strict; use warnings; use Data::Printer; my @unsorted = qw( 041351920234 Rabbit 0343120 041271024500 0430870 Ap +ple 041460301399); my @sorted = sort { $a <=> $b || $a cmp $b } @unsorted; p @unsorted; p @sorted;

      Is your way more efficient memory-wise? Also, both of our approaches are giving me warnings like: Argument "Rabbit" isn't numeric in numeric comparison (<=>) at....

        It seems that the warnings are due to a precedence problem that can be fixed with parentheses,

        knoppix@Microknoppix:~$ perl -Mstrict -Mwarnings -E ' > my @values = qw{ > 041351920234 > Rabbit > 0343120 > 041271024500 > 0430870 > Apple > 041460301399 > }; > > say for > map { $_->[ 0 ] } > sort { > ( $a->[ 1 ] <=> $b->[ 1 ] ) > || > ( > $a->[ 1 ] > ? $a->[ 0 ] <=> $b->[ 0 ] > : $a->[ 0 ] cmp $b->[ 0 ] > ) > } > map { [ $_, m{^\d} ? 1 : 0 ] } > @values;' Apple Rabbit 0343120 0430870 041271024500 041351920234 041460301399 knoppix@Microknoppix:~$

        Thanks for pointing it out.

        Cheers,

        JohnGG

      I start with a "new" value that isn't in the "list(file)" yet (either alpha or numeric), then start reading out the values one at a time from the sourcefile, compare the two, print the "smallest" back to a tempfile while retaining the largest for the next read... and so on... (kind-of-a-bubble-sort) then swap the files back at the end. its the "comparing" of mixed types that i get stuck on. and i'm worried about running out of memory if i just load into array and <=> sort function (which i may end up doing *heh*) but you have given me an idea about "flagging" the first character as alpha or numeric ... *hmmmm*

      I did try re-inventing the wheel...
      But it kept getting stuck on the corners

        Another approach might be to read chunks of your very large database, perhaps 100k to 500k records at a time, and sort each chunk into its own temporary file. Once you have read and sorted all of the data, do a sort/merge of the temporary files into a final sorted file. My gut feeling is that this would be more efficient than the "two at a time" approach you are taking.

        Cheers,

        JohnGG

        Thinking about it further, read a chunk of your database then sort and print to two temporary files, one for letters, one for numbers. Then the sort/merge of the temporary files will be simpler keeping the two categories separate. Finally you can concatenate the letters and numbers merged files for your results file. In this code I am writing to in-memory scalars rather than disk files just to keep things tidy.

        knoppix@Microknoppix:~$ perl -Mstrict -Mwarnings -E ' > my @values = qw{ > 041351920234 > Rabbit > 0343120 > 041271024500 > 0430870 > Apple > 041460301399 > }; > > my $rsLets = do { \ my $lets }; > open my $letsFH, q{>}, $rsLets or die $!; > my $rsNums = do { \ my $nums }; > open my $numsFH, q{>}, $rsNums or die $!; > > say { $_->[ 1 ] ? $numsFH : $letsFH } $_->[ 0 ] for > sort { > ( $a->[ 1 ] <=> $b->[ 1 ] ) > || > ( > $a->[ 1 ] > ? $a->[ 0 ] <=> $b->[ 0 ] > : $a->[ 0 ] cmp $b->[ 0 ] > ) > } > map { [ $_, m{^\d} ? 1 : 0 ] } > @values; > > say ${ $rsLets }, q{-----------------}; > say ${ $rsNums }, q{-----------------};' Apple Rabbit ----------------- 0343120 0430870 041271024500 041351920234 041460301399 ----------------- knoppix@Microknoppix:~$

        I hope this is of interest.

        Cheers,

        JohnGG

Re: Sorting Numbers & Text
by dasgar (Deacon) on Jul 12, 2012 at 21:29 UTC
    ...because I'm using a very large database and don't want to have "not-enough" ram/memory problems.

    Since you're using data from a "very large database", it would seem that you're doing some kind of SQL query to retrieve the data from that database. If that's true, would it be possible to add a "sort by" clause to the SQL query to get the database to doing the sorting for you?

      uhhmmm.... you know? if i knew how to do sql, i probably wouldn't have half the problems i get *heh*, but alas, i have a brain block everytime i try to learn it... kinda like learning linux, which also would fix alot of my problems *heh*

      I did try re-inventing the wheel...
      But it kept getting stuck on the corners

        ok, heres one to really twist your coaxial pair...

        if (($OldItem > $NewItem) || (($OldItem == $NewItem) && ($OldItem gt $NewItem))) {}

        the first catches and "sorts" numbers... || ... the second checks to see if "alpha" and sorts texts...
        watcha think? i know there "has" to be a flaw...*heh*

        I did try re-inventing the wheel...
        But it kept getting stuck on the corners

      First of all index your database if not done yet, (or vacuum, analyze and refresh your existent indexes if necessary) and then ask your database to sort your results as indicated by dasgar.

      if i knew how to do sql, i probably wouldn't have half the problems i get...

      Don't panic. Take a look to the sql keyword: "order by"

Re: Sorting Numbers & Text
by davido (Archbishop) on Jul 12, 2012 at 23:24 UTC

    It's not as easy as it used to be... and actually it never worked all that well. We've always had to deal with the problem of "Rabbit" being sorted above "apple" because of case issues (that could be dealt with by normalizing the case inside of the sort routine, and then falling back to case-sensitive sorting in case of equality). And we've always had the issue of string comparisons of numbers numerically incorrectly. But now we also have to worry about Unicode issues.

    Additionally, you do want numbers to be sorted numerically, and alpha characters to be sorted alphabetically, I believe. I think that most dictionary style sorts put numbers ahead of alpha characters as well (you seemed to be looking for the opposite, but I'll ignore that for a moment).

    Here's one way to do it that should be Unicode safe:

    use strict; use warnings; use utf8; use Unicode::Collate::Locale; use Scalar::Util 'looks_like_number'; use feature qw/say unicode_strings/; binmode STDOUT, ':utf8'; my @unsorted = qw( 041351920234 Rabbit 0343120 041271024500 000000343119 0430870 Apple 041460301399 ); my $collator = Unicode::Collate::Locale->new(locale => 'en'); my @sorted = sort { ( looks_like_number($a) && looks_like_number($b) && $a <=> $b ) || $collator->getSortKey($a) cmp $collator->getSortKey($b) } @unsorted; say for @sorted;

    Pretty ugly, and still falls back to alphabetical sorting when a given string has non-numeric characters embedded within it.

    I'd love for someone to come along and show a better way to do it, as this just seems messy.


    Dave

Re: Sorting Numbers & Text
by aaron_baugher (Deacon) on Jul 13, 2012 at 01:50 UTC

    This seems a little like cheating, but one way around the conflict would be to split your database lookup in twain:

    select myid from mytable where myid rlike '^[A-Z].*' order by myid; select myid from mytable where myid rlike '^[0-9].*' order by abs(myid +);

    Process them in that order, and you'll have your alpha ones ahead of your numerical ones, and both already sorted -- case-insensitive string sort on the first command, and numeric sort on the second. If you need to do any further sorting on either batch, it should be simpler. Also, letting the database engine order them should be faster than doing it in Perl after fetching them, especially if you put an index on the column in question.

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Sorting Numbers & Text
by kcott (Abbot) on Jul 13, 2012 at 06:18 UTC

    This gives the output you wanted:

    $ perl -Mstrict -Mwarnings -MScalar::Util=looks_like_number -E ' my @values = qw{041351920234 Rabbit 0343120 041271024500 0430870 Apple + 041460301399}; say for sort { looks_like_number($a) && looks_like_number($b) ? $a <=> $b : looks_like_number($a) ? 1 : looks_like_number($b) ? -1 : $a cmp $b } @values; ' Apple Rabbit 0343120 0430870 041271024500 041351920234 041460301399

    -- Ken

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://981489]
Front-paged by NetWallah
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (6)
As of 2014-07-13 14:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (249 votes), past polls