Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

O provider of cookies:

I have a few notes for you, but I'm going to apologize in advance--I don't have enough time to shorten this and clean it up nicely. So this is one of my occasional hideously long nodes, and is a bit rambly, and I'll leave out a few bits that I meant to say. I'll make use of the readonly tag, for those who don't want to see it all. With that disclaimer, here goes...

  • Your code to move index files to the top can fail because your comparison function isn't "stable".

    Near the end of perldoc -f sort, they have the sentence: A stable sort preserves the input order of elements that compare equal. The reason it's important is that many sorting algorithms don't check the relationship of all items, because if you provide a stable comparison function, it can greatly reduce the number of comparisons it has to make. But if you don't provide a stable comparison function, it'll put things in an incorrect order. The problem is this:

    # When sorting lists of files, I want the index file to always come fi +rst. if ($c =~ /^index\./) { return -1; } elsif ($d =~ /^index\./) { return 1; }

    Below are the results of a few calls to your subroutine. As you can see, if the first argument matches /^index\./, then it tells the sort routine to make that item precede the second item, no matter what the second item is!.

    $ perl PM1021473_3_unstable.pl index.0001 afile article -1 index.0001 index.0005 article -1 index.9999 index.0005 article -1

    The code:

    #!/usr/bin/perl # # PM1021473_3_unstable.pl # use strict; use warnings; use HTML::Entities; while (<DATA>) { s/\s+$//; my @args = split /\s+/,$_; push @args, my_sort(@args); printf "%-10s %-10s %-10s %d\n", @args; } ... your sort routine ... __DATA__ index.0001 afile article index.0001 index.0005 article index.9999 index.0005 article
  • The next thing I'd like to mention is that you're doing a good deal of work to make a single sort routine, with a parameter to tell it what to do. I'd suggest splitting your sort routine into smaller pieces.

    At the very least, I'd split it in two chunks:

    sub my_sort_article { my ($c,$d,$type) = @_; $c = lc($c); $d = lc($d); my $t = my_sort_index($c, $d); return $t if $t; # Written with the help of kent/n in #perl on freenode. for ($c, $d) { s/<.+?>//g; # Strip out any html tags. s/\s*\b(A|a|An|an|The|the)(_|\s)//xi; # Strip off leading articles + (in English). decode_entities($_); } if ( $c =~/^((\d|,|\.)+)(.*)$/ && $d =~ /^((\d|,|\.)+)(.*)$/) { # Get the leading number. (my $e = $c) =~ s/^((\d|,|\.)+)(.*)$/$1/; (my $f = $d) =~ s/^((\d|,|\.)+)(.*)$/$1/; # Take any commas out of the number. s/,//g for ($e,$f); # Get the remaining parts of the string. (my $g = $c) =~ s/^((\d|,|\.)+)(.*)$/$3/; (my $h = $d) =~ s/^((\d|,|\.)+)(.*)$/$3/; # First compare the numbers, then compare the remaining parts of t +he string. $e <=> $f || $g cmp $h } else { $c cmp $d; } } sub my_sort_name { # When I sort by name I prefer lastname firstname. # I have not yet written this to account for Sr., Jr., or Roman nume +rals after the last name. my ($c,$d,$type) = @_; $c = lc($c); $d = lc($d); # ?Did you really want this check for name sorting? my $t = my_sort_index($c, $d); return $t if $t; for ($c,$d) { s/\|.+$//; $_ = join(' ', (reverse split(/(?:_|\s)(?=[^_\s]+$)/, $_,2))) if $ +_ !~ /^_/; s/^_//; s/^(A|a|An|an|The|the)(_|\s)//; } return $c cmp $d; }

    It's no harder to call this way, and updating one won't break the other.

    In fact you can go a good deal further. Usually it's best to build with smaller subroutines. In fact, you could easily pull other interesting functions out that you could re-use. By breaking them into small pieces, and giving them good names, it makes your code easier to read and maintain in addition to giving you more bits of useful code for your toolbox.

    For example, splitting out the leading number and tuning it up could be a great little routine you could use in plenty of locations:

    sub split_out_leading_number { my $s = shift; if ( $s =~/^((\d|,|\.)+)(.*)$/) { my ($leading_number, $rest) = ($1,$3); # Take any commas out of the number. $leading_number =~ s/,//g; return ($leading_number, $rest); } die "split_out_leading_number received bogus input '$s'!\n"; }

    And once you have it in your toolbox, writing code using it is a little easier to read. Just take a look:

    sub my_sort_article { my ($c,$d,$type) = @_; $c = lc($c); $d = lc($d); my $t = my_sort_index($c, $d); return $t if $t; # Written with the help of kent/n in #perl on freenode. for ($c, $d) { s/<.+?>//g; # Strip out any html tags. s/\s*\b(A|a|An|an|The|the)(_|\s)//xi; # Strip off leading articles + (in English). decode_entities($_); } if ( $c =~/^((\d|,|\.)+)(.*)$/ && $d =~ /^((\d|,|\.)+)(.*)$/) { my ($num1, $text1) = split_out_leading_number($c); my ($num2, $text2) = split_out_leading_number($d); # First compare the numbers, then compare the remaining parts of t +he string. $num1 <=> $num2 || $text1 cmp $text2 } else { $c cmp $d; } }

    But you want to be careful with your naming. I don't like the name I came up with, but haven't thought of anything better yet. With good naming, you can remove many comments from your code, as it can become more self-documenting.

  • Next, your subroutine does a *lot* of manipulation, and since it will be called multiple times for every string in the collection, you could be doing a lot more work than necessary. For small lists, that's no big deal. But it can add up quickly. There's a technique called the Schwartzian transform that can greatly reduce data manipulation during sorting.

    For a quicksort, each time you double the size of your list, you have to transform *each* item on your list another time. It can grow fairly quickly:

    List sizeTimes each item
    is manipulated
    Total manipulations
    2 1 2
    4 2 4*2=8
    8 3 8*3=24
    16 4 16*4=64
    . . . . . . . . .
    1024 10 1024*10=10,240
    . . . . . . . . .
    1048576 201048576*20=20,971,520
    Nlog(N)/log(2)N*log(N)/log(2)

    In the Schwartzian transform, you simply build a *new* list that's much easier to sort, transforming all your data items once. Then you sort the new list. After the sort, you pull out the original data items to get your sorted list.

    Suppose that you had a large list of words in mixed case, and wanted to sort them alphabetically, irrespective of case. But you want your resulting list to retain the original case of the words. You could do it like:

    my @list = ( qw( apple Alpha aLBAtross aLbAcOre etc... ) ); my @result = sort {uc($a) cmp uc($b)} @original;

    but it would uppercase all items repeatedly. If your list was large enough that could slow your application down. But suppose your list was more like:

    my @list = ( [qw( APPLE apple )], [qw( ALPHA alpha )], [qw( ALBATROSS aLBAtross )], [qw( ALBACORE aLbAcOrE)], etc... );

    That would be much easier to sort: Just sort the array in the order of the first column:

    my @result = sort { $a->[0] cmp $b->[0] } @list;

    And then your list would look like this:

    my @list = ( [qw( ALBACORE aLbAcOrE)], [qw( ALBATROSS aLBAtross )], [qw( ALPHA alpha )], [qw( APPLE apple )], etc... );

    Retrieving the sorted data is simple enough: It's the second column of the array. Reading a discussion of the Schwartzian transform was one of the things that got me hooked on perl. (In fact, it was this article: Resorting to Sorting. It led me to realize how flexible perl can be. It can make sort work of simple list operations. For example, you can use the map function (perldoc -f map) to take your original list and reshape it. So to do our sort we could do it like this:

    my @list = ( qw( apple Alpha aLBAtross aLbAcOre etc... ) ); # Convert our list into the ( [APPLE apple], ... ) form with map: @result = map { [ uc($_), $_ ] } @list; # Now sort the list: @result = sort {$a->[0] cmp $b->[0]} @result; # Now convert the resulting list back into (aLbAcOrE aLBAtross...) for +mat: @result = map { $_->[1] } @result;

    This is kind of wordy, though. Perl lets us pipeline the operations, though, so we can do it in a single statement:

    my @list = ( qw( apple Alpha aLBAtross aLbAcOre etc... ) ); my @result = map {$_->[1] } # .sgnirts desacreppu eht + lla tou pirts # sort {$a->[0] cmp $b->[0]} # neht dna ,sgnirts desacrep +pu eht no tros # map { [ uc($_), $_ ] } # ,gnirts eht fo noisrev desa +creppu eht dda # @list; # + ,tsil a neviG #

    It takes a little getting used to, but just remember that the list operations (sort, map, grep) all take a list on the right side as input, and output a new list on the left side. So when you chain them together, just remember to read them from right to left. ;^D

    If you wanted to, you could even throw a grep function in there to only retain words fulfilling specific criteria. You might only want words longer than four characters for example.

Update: When I started writing this node, there were no responses. But now davido already hit all the the high points. On the bright side, now I have a little reading material for tomorrow. I've not heard of the Guttman-Rosler transform. (Update 2: Well, Guttman-Rosler is described in the node I referenced, so I'm guessing I just forgot about it....)

Update: s/albatrioss/albatross/ig; fixed a broken document link. (update very late: 20140811)

...roboticus

When your only tool is a hammer, all problems look like your thumb.


In reply to Re: RFC: I rewrote a custom sort function, but not sure if it is the best way. by roboticus
in thread RFC: I rewrote a custom sort function, but not sure if it is the best way. by Lady_Aleena

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others meditating upon the Monastery: (9)
    As of 2014-08-21 17:00 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      The best computer themed movie is:











      Results (138 votes), past polls