Re^2: The sum of absolute differences in the counts of chars in two strings.

by BrowserUk (Patriarch)
on Nov 20, 2011 at 07:57 UTC


in reply to Re: The sum of absolute differences in the counts of chars in two strings.
in thread The sum of absolute differences in the counts of chars in two strings.

I mean, sure, you could think of “optimizations,” but in a fairly trivial case like this, they could wind up costing more time.

Time and again here, one or more of the monks has spotted a solution to a problem that at first sight appears intractable to optimisation.

For example: in Comparing a large set of DNA sequences, the OP describes an apparently simple process of cross-comparing 100,000 x 20-char strings as taking "on the order of days to process".

Various approaches to this problem have been suggested down the years, including any of several well-known fuzzy-matching and edit-distance algorithms: String::Approx, Text::Levenshtein, Text::Soundex, Text::WagnerFischer and others. But these brute-force, char-by-char O(M*N) comparisons are horribly expensive -- even when coded in XS -- and often not that useful for the desired results either, as edit distance is quite a different metric from the one actually required: 'are these two strings the same except in at most n places?'
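By way of illustration, here is a minimal sketch of that required metric (names and sample data are mine; it assumes equal-length strings): count mismatched positions, bailing out as soon as the allowance is exceeded.

    use strict; use warnings;

    ## Count positions where two equal-length strings differ,
    ## giving up as soon as the allowance $n is exceeded.
    sub same_except_n {
        my( $x, $y, $n ) = @_;
        my $diffs = 0;
        for my $i ( 0 .. length( $x ) - 1 ) {
            next if substr( $x, $i, 1 ) eq substr( $y, $i, 1 );
            return 0 if ++$diffs > $n;    ## early out: too many mismatches
        }
        return 1;
    }

    print same_except_n( 'ACGTACGTACGTACGTACGT',
                         'ACGTACGAACGTACGAACGT', 2 ), "\n";    ## prints 1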

Another frequently offered alternative is the use of regexes, perhaps programmatically generated. Whilst these work fine and beat the edit-distance algorithms hands down, they generally require quite-to-very complex regexes that, of necessity, contain many backtracking points, with the result that their runtime performance is understandably very poor.
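One common construction -- a sketch only, with illustrative data -- builds a regex that matches a fixed string with at most one mismatch, by alternating copies of it with a '.' wildcard at each position. Every alternation is a potential backtrack point, which is exactly where the runtime goes:

    use strict; use warnings;

    ## Build a regex matching $needle with at most one mismatched
    ## character. (Real data would want quotemeta; DNA chars are safe.)
    my $needle = 'ACGTACGT';
    my $pat = join '|', map {
        my $v = $needle;
        substr( $v, $_, 1 ) = '.';
        $v;
    } 0 .. length( $needle ) - 1;
    my $re = qr/^(?:$pat)$/;

    print 'ACGTACGA' =~ $re ? "fuzzy match\n" : "no match\n";    ## fuzzy match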

The root of the problem is the O(N²/2) nature of the round-robin comparison process. The quadratic growth in the number of comparisons required means that even for relatively low numbers of strings -- and in genomic work, 100,000 is low -- every extra microsecond taken performing each comparison adds an hour and a half to the processing time.

And by the time you get to a still relatively small 1 million strings, each extra microsecond will cost you an extra 138 hours. Almost an extra week of waiting, not to mention the ~15 kWh of electricity used in the process.
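Those figures are easy to verify with a back-of-envelope calculation (assuming one extra microsecond per comparison):

    use strict; use warnings;

    ## N*(N-1)/2 round-robin comparisons; 1e-6 s extra each.
    for my $n ( 100_000, 1_000_000 ) {
        my $pairs = $n * ( $n - 1 ) / 2;
        my $hours = $pairs * 1e-6 / 3600;
        printf "%9d strings: %.2g comparisons, %.1f extra hours per us\n",
            $n, $pairs, $hours;
    }
    ##    100000 strings: 5e+09 comparisons, 1.4 extra hours per us
    ##   1000000 strings: 5e+11 comparisons, 138.9 extra hours per us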

It would likely not be at all obvious to you that using string-wise XOR would form the basis of the most efficient (pure Perl) method of doing this, but its use here reduces days of processing to a few hours.
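For the curious, the trick (shown here with made-up sample strings) is that XORing two equal-length strings yields a NUL byte at every position where they agree, so tr///c counts the mismatches in a single pass at near-C speed, with no Perl-level loop over characters at all:

    use strict; use warnings;

    my $s1 = 'ACGTACGTACGTACGTACGT';
    my $s2 = 'ACGTACGAACGTACGAACGT';

    ## Equal bytes XOR to "\0"; tr/\0//c counts the non-NUL bytes.
    my $mismatches = ( $s1 ^ $s2 ) =~ tr/\0//c;
    print "$mismatches\n";    ## prints 2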

Even less obvious is the likelihood that anyone would offer a solution that reduces the number of comparisons required from 5 billion fuzzy matches to the order of 300,000 O(1) lookups, yet here it is.
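One plausible scheme for getting there -- a sketch of the pigeonhole trick, not necessarily the exact method used in that thread -- rests on the observation that if two 20-char strings differ in at most 2 positions, then of any 3 fixed, disjoint chunks at least one must match exactly. Hash each string under its 3 chunk keys (~300,000 O(1) insertions for 100,000 strings) and only bucket-mates need a full comparison:

    use strict; use warnings;

    my @seqs = qw( ACGTACGTACGTACGTACGT ACGTACGAACGTACGAACGT TTTTACGTACGTACGTACGT );
    my @chunks = ( [ 0, 7 ], [ 7, 7 ], [ 14, 6 ] );   ## 3 disjoint chunks of 20 chars

    ## Bucket every string under each of its chunk keys.
    my %bucket;
    for my $i ( 0 .. $#seqs ) {
        for my $c ( 0 .. $#chunks ) {
            my( $off, $len ) = @{ $chunks[ $c ] };
            push @{ $bucket{ $c . ':' . substr( $seqs[ $i ], $off, $len ) } }, $i;
        }
    }

    ## Only strings sharing a bucket can be within 2 mismatches.
    my %seen;
    for my $ids ( grep { @$_ > 1 } values %bucket ) {
        for my $i ( 0 .. $#$ids - 1 ) {
            for my $j ( $i + 1 .. $#$ids ) {
                my( $p, $q ) = @{ $ids }[ $i, $j ];
                next if $seen{ "$p,$q" }++;
                my $d = ( $seqs[ $p ] ^ $seqs[ $q ] ) =~ tr/\0//c;
                print "$p <-> $q: $d mismatches\n" if $d <= 2;
            }
        }
    }
    ## prints: 0 <-> 1: 2 mismatches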

Whilst my question has yet to garner anything other than more efficiently implemented brute-force algorithms -- themselves very valuable in the absence of a different solution -- there are still at least a couple of avenues yet unexplored.

So, whilst I couldn't think of any game changing algorithm or approach to the problem, I'm long enough in the tooth to know that it does no harm to ask and can often lead to surprises.

So, you'll forgive me -- or not -- if I file your "I can't think of anything..." non-response under the heading of: "Saw the opportunity to garner a few XP, but couldn't think of anything useful to say, so I'll say just that, and wrap it in some other plausible but meaningless drivel."


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: The sum of absolute differences in the counts of chars in two strings.
by sundialsvc4 (Abbot) on Nov 20, 2011 at 13:31 UTC

    Omitting ... please!! ... that final paragraph of your post, which was entirely unnecessary, I found your comments up to that point extremely interesting, and I immediately followed and read both the link you gave in the second-from-last paragraph and the entire thread.

    Please enlighten me here ... I am quite, quite serious ... what is it about your original description of the problem that I have overlooked? If the “alphabet” is a normal one, say no more than 256 elements, then we have a 256-element array set to all-zero, and we tally each character in constant time, indexed by its value. Then we loop through exactly 256 accumulators, adding them up and setting them back to zero as we go. And then we have our answer.
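    For instance, a minimal sketch of that 256-accumulator approach (illustrative names and data; tally +1 for one string, -1 for the other, then sum the absolute values):

        use strict; use warnings;

        ## One 256-slot tally: chars of $x count up, chars of $y
        ## count down; the sum of absolute slot values is the answer.
        sub count_diff {
            my( $x, $y ) = @_;
            my @tally = ( 0 ) x 256;
            ++$tally[ ord ] for split //, $x;
            --$tally[ ord ] for split //, $y;
            my $sum = 0;
            $sum += abs for @tally;
            return $sum;
        }

        print count_diff( 'abcde', 'abbde' ), "\n";    ## prints 2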

    As do many of us, I have plenty of experience with (differing kinds of) “massive problems” in which very slight gains, repeated hundreds of millions of times if not many times more, make all the difference in the world. Which, having said that, “is neither here nor there.” So, I fully appreciate what you are saying, right down to and including “15 kWh of electricity,” but I simply do not yet see what makes this particular problem “hard.”

    I know that it must be, or you would not have said it. (I would be an idiot to question your knowledge or experience.) But something in the description, I guess, eluded me. What did I overlook? I really do want to know.

      what is it ... that I have overlooked?

      Well, apart from the obvious -- whatever any one of the several hundred other monks might think of that neither you nor I did -- you are asking the wrong question.

      Instead, try asking yourself: "What did I contribute to the thread?".

      And then you might ask yourself the same of Re: windows 7 HardwareID, amongst many others.



        Esteemed colleague, I never intend to “fail to contribute to” any thread that I make or that I comment on ... in a genuinely meaningful way ... and may I cautiously suggest that to hammer me in public with regard to your opinions of my comments ... “does not contribute to the thread?” I have no quarrel with you. I regret that you seem to have an unending quarrel with me. I hope some day to bury that hatchet, because I have never held that weapon in my hand and still don’t.

        The “C” solution that you recently posted is both informative and, as I am sure many of us suspected it would be, straightforward. It might, as you say, appear to be “brute force,” but a digital computer is a brutally powerful machine. The problem space is small and unchanging, and microseconds matter. The overhead of “fancy pants” data structures is pointless and unwanted. The algorithm you wrote runs as fast as it can, such that if you need it to run faster you just have to increase the clock speed. I appreciate you posting the source code of what you came up with.
