Re^2: The sum of absolute differences in the counts of chars in two strings.
by BrowserUk (Patriarch) on Nov 20, 2011 at 07:57 UTC
"I mean, sure, you could think of 'optimizations,' but in a fairly trivial case like this, they could wind up costing more time."

Time and again here, one or another of the monks has spotted a solution to a problem that at first sight appears intractable to optimisation. For example: in Comparing a large set of DNA sequences, the OP describes an apparently simple process of cross-comparing 100,000 x 20-char strings that "takes on the order of days to process".

Various approaches to this problem have been suggested over the years, including the use of any of several well-known fuzzy-matching and edit-distance algorithms: String::Approx, Text::Levenshtein, Text::Soundex, Text::WagnerFischer and others. But these brute-force, char-by-char O(M*N) comparisons are horribly expensive -- even when coded in XS -- and often not well suited to the desired result either, as edit distance is a quite different metric from the one actually required: "are these two strings the same except in at most n places?"

Another frequently offered alternative is the use of regexes, perhaps programmatically generated (see the first sketch below). Whilst these work and beat the edit-distance algorithms hands down, they generally require quite-to-very complex regexes that, of necessity, contain many backtrack points, with the result that their runtime performance is understandably poor.

The root of the problem is the O(N²/2) nature of the round-robin comparison process. The quadratic growth in the number of comparisons required means that even for relatively low numbers of strings -- and in genomic work, 100,000 is low -- every extra microsecond taken performing each comparison adds an hour and a half to the processing time: 100,000 strings means ~100,000²/2 = 5 billion pairings, hence 5 billion extra microseconds, which is 5,000 seconds. And by the time you get to a still relatively small 1 million strings, each extra microsecond will cost you an extra 138 hours. Almost an extra week of waiting, not to mention the ~15 kWh of electricity used in the process.

It would likely not be at all obvious to you that string-wise XOR would form the basis of the most efficient (pure Perl) method of doing this, but its use here reduces days of processing to a few hours (second sketch below). Even less obvious is the likelihood that anyone would offer a solution that reduces the number of comparisons required from 5 billion fuzzy matches to the order of 300,000 O(1) lookups, yet here it is (third sketch below).

Whilst my question has yet to garner anything other than more efficiently implemented brute-force algorithms -- themselves very valuable in the absence of a different solution -- there are still at least a couple of avenues yet unexplored.

So, whilst I couldn't think of any game-changing algorithm or approach to the problem, I'm long enough in the tooth to know that it does no harm to ask, and that asking can often lead to surprises. So, you'll forgive me -- or not -- if I file your "I can't think of anything" non-response under the heading of: "Saw the opportunity to garner a few XP, but couldn't think of anything useful to say, so I'll say just that, and wrap it in some other plausible but meaningless drivel."
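For concreteness, here is a minimal sketch of the kind of programmatically generated fuzzy regex alluded to above, assuming the "same except in at most 1 place" variant of the problem. The fuzzy_regex name and the sample sequences are my own invention, not from the original thread:

#!/usr/bin/perl
use strict;
use warnings;

# Build a regex matching a string with at most one mismatched position,
# by replacing each position in turn with '.' and joining the
# alternatives. (Assumes the input contains no regex metacharacters,
# which holds for DNA sequences; quotemeta otherwise.)
sub fuzzy_regex {
    my $s = shift;
    my @alts;
    for my $i ( 0 .. length( $s ) - 1 ) {
        my $alt = $s;
        substr( $alt, $i, 1 ) = '.';
        push @alts, $alt;
    }
    my $re = join '|', @alts;
    return qr/^(?:$re)$/;
}

my $re = fuzzy_regex( 'ACGTACGTACGTACGTACGT' );
print 'ACGTACGAACGTACGTACGT' =~ $re ? "match\n" : "no match\n";   # match

Each of those 20 alternatives is a backtrack point, which is exactly why this approach bogs down as the string length and the allowed mismatch count grow.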
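And a minimal sketch of the string-wise XOR idiom: XORing two equal-length strings yields a "\0" byte wherever the bytes match, so counting the non-"\0" bytes gives the number of mismatched positions in one C-speed pass, rather than a char-by-char Perl loop. The mismatches wrapper and sample data here are mine; the idiom itself is the one the node describes:

#!/usr/bin/perl
use strict;
use warnings;

# Count the positions at which two equal-length strings differ.
# $x ^ $y is "\0" wherever the bytes match; tr/\0//c counts the
# bytes that are NOT "\0", i.e. the mismatched positions.
sub mismatches {
    my( $x, $y ) = @_;
    return ( $x ^ $y ) =~ tr/\0//c;
}

print mismatches( 'ACGTACGTACGTACGTACGT',
                  'ACGTACGAACGTACGTACGA' ), "\n";   # prints 2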
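As for replacing billions of fuzzy matches with O(1) lookups: the node doesn't spell out the mechanism, but one well-known way to get that kind of reduction -- offered purely as an illustrative assumption, not as the solution referred to -- is to index every string under each of its one-wildcard variants, so that any two strings within one mismatch of each other share at least one hash key, turning the quadratic pairwise pass into a linear number of hash operations:

#!/usr/bin/perl
use strict;
use warnings;

# Index each 20-char string under all 20 of its one-wildcard variants.
# Strings differing in at most one position then land in a common
# bucket. (Duplicate inputs would show up in every bucket; dedup is
# omitted for brevity.)
my @seqs = qw( ACGTACGTACGTACGTACGT ACGTACGAACGTACGTACGT TTTTACGTACGTACGTCCCC );

my %index;
for my $seq ( @seqs ) {
    for my $i ( 0 .. length( $seq ) - 1 ) {
        my $key = $seq;
        substr( $key, $i, 1 ) = '?';
        push @{ $index{ $key } }, $seq;
    }
}

# Any bucket with more than one member holds near-duplicates.
for my $key ( sort keys %index ) {
    my @hits = @{ $index{ $key } };
    print "@hits\n" if @hits > 1;
}

The exact bookkeeping in the solution the node refers to evidently differs (hence its ~300,000 figure), but the principle -- trade fuzzy comparisons for exact hash lookups -- is the same.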
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use every day'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.