No such thing as a small change | |
PerlMonks |
Re^3: sort array by dateby BrowserUk (Patriarch) |
on Jan 04, 2007 at 14:43 UTC ( [id://592954]=note: print w/replies, xml ) | Need Help?? |
There are lots of links around, many of them linked from the node I linked to, but if you do not understand the explanations there, the others may also leave you cold. I'll have a go at explaining them, but I doubt I will do much better than others. Simple sorting using sortThe basic way of sorting in perl is to use the built in sort function.
How sort actually works doesn't really matter. That it is about as fast as you are likely to get, is built-in to langauge so always available and understands Perl internals does. So, sorting data when you want to sort according the the entire value of each element of the array or list is easy. It gets a little more complex when you want to sort the elements by a sub field of each element. For example, say you have a set of ids (A123 B421 C987 etc.) and you want to order them by just the numeric part, then you need to extract that part before supplying it to the appropriate comparison function:
This works fine, but each element will be passed to the comparison block multiple times, and that means we are extracting the sort field from each element multiple times, which can slow things down. For more complex cases, like your dates problem, this could be significant. The purposes of the "advanced sorting" mechanisms is to reduce this slowdown by performing the extraction of the significant data from each input element once only. The STOne mechanism, and possibly the simplest to understand, it merlyn's ST (Schwartzian Transform). This works by pre-processing the elements to extract the fields and package them into anonymous arrays. The list of annonymous arrays are then sorted and finally the original elements are recovered from the list of sorted anonymous arrays and stored into the sorted array.
This is quite simple to follow and can be coded in a more compact form, by combining the 3 steps into a single statement and doing away with the intermediate arrays:
This is identical to the previous example, but more compact. It can also be formatted as a single line, though that will usually mean that it extends beyond the boundary at which PM does it's dumb wrapping. It can also be laid out in several different ways to that I've shown, but I find that this makes the three signifcant steps of the algorithm clearly delineated, by indenting the contents of the code blocks just as you would for any other block of code. Eg. The body of if and while statements etc. I like this consistency. The efficiency of the ST comes from only performing the extractions of the field data once. Indexing the anonymous arrays is very efficient and so the sorting and reassembly doesn't costs much extra. It also extends to sorting multiple fields in an obvious way:
Here we've ordered the data according to the numeric field as before, but this time we've used the alpha field to 'tiebreak' values when the numeric values are the same. It's easy to see how this can be further extended to handle as many fields of any type we need. This simple, obvious extension mechanism is the great strength of the ST. The caveatsThe downside of the ST is that for large datasets, the creation and destruction of all the small anonymous arrays can itself prove to be fairly expensive of time and memory. It's also the case that calling back into Perl code via the sort comparison block imposes a significant time cost. (Don't worry about this for the simple cases of sort{ $a <=> } ... and sort{ $b cmp $a } shown above, as Perl has optimisations to recognise these simple cases, and it doesn't actually do a callback at all). But for more complex sorting, these callbacks are expensive, and if you can avoid them, it will save some considerable time. The GRTThis is where the (Guttman Rossler Transform) comes in. The basic idea here is that instead of building an anonymous array in the pre-processing stage, we prepend the field data to the front of the string; sort the strings; then strip the bit we stuck on the front back off. Going back to the simpler example of a field sort above, we can construct the GRT as follows:
A cheatThis is actually something of a cheat of a GRT. It only works because in my carefully constructed example data, the numeric fields all have the same number of digits, with smaller values having leading zeros. Whilst this works fine, and makes the algorithm easy to follow (try adding print statements inside the code blocks to follow what is going on), real life data is rarely so accommodating. Whilst you could use sprintf to add leading zeros and so make your ascii encoded numeric fields sort correctly using the string comparisons (cmp), it turns out that there is a better (faster) way. That way is to use pack to convert the numeric fields into binary values that will sort correctly using a string comparison function. It is convenient that binary encode integers (NOTE:In 'network' format only. That is 'N'&'n' *NOT* 'V'&'v') will sort correctly using a string comparison function. As will floating point data in IEEE formats (the format used internally by Perl on most but not all systems). So, using pack, the previous example becomes:
That may not look so very different from the previous version, but the big advantage is that it will handle any integer values correctly, whereas with the previous version, you need to know how big the largest numeric field is, so that you can add the correct number of leading zeroes. It also trivially extends to handle the two fields example shown above for the ST:
The patternHopefully, the pattern is becoming obvious. Use pack with appropriate templates 'N', 'n', 'd' etc. to prepend the sort field data to the strings in the order or precedence. Then use unpack and the same template as you used for the sort fields, but wrapped in 'x[...]' to strip the prefix off and recover the original elements. Hmmm. Is that any clearer than the existing explanations? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
In Section
Seekers of Perl Wisdom
|
|