CaptainF has asked for the wisdom of the Perl Monks concerning the following question:
Replies are listed 'Best First'.
Re: simple string comparison for efficiency
by GrandFather (Saint) on May 28, 2009 at 21:05 UTC
Tell us about the larger picture. There is likely to be room for improving whatever you are doing at present, or at least for us to target solutions to whatever the bigger problem is. Most gains in performance are not obtained by micro-optimizations such as the one you are asking for. Some tricks that give huge performance gains in some situations will cripple the code in others. That said, you can build masks and then xor the strings to get something like the fast match you are looking for where the strings to be matched are fairly long. Consider:
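The code block that originally followed did not survive extraction. As a hedged reconstruction of the mask-then-xor idea (the sub names `mask` and `match` are mine, not GrandFather's), it might look like:

```perl
use strict;
use warnings;

# Build a mask with NUL at every wildcard ('N') position and 0xFF
# everywhere else. (A reconstruction of the idea, not the original code.)
sub mask {
    my ($str) = @_;
    (my $m = $str) =~ tr/N/\0/;    # 'N' positions -> NUL
    $m =~ tr/\0/\xFF/c;            # every other byte -> 0xFF
    return $m;
}

sub match {
    my ($str1, $str2) = @_;
    return 0 if length $str1 != length $str2;

    # xor leaves NUL wherever the bytes agree; and-ing with both masks
    # also blanks any position where either string holds 'N'. A single
    # tr then counts the surviving (true) mismatches.
    my $diff = ($str1 ^ $str2) & mask($str1) & mask($str2);
    return 0 == ($diff =~ tr/\0//c);
}

print match('ATGCATCGATGN', 'ATGCATNGATGN') ? "match\n" : "differ\n";
# prints "match"
```

The entire comparison is a handful of string-wide opcodes (one `^`, two `&`, one `tr`) instead of a per-character loop, which is where the speed over substr-based scanning comes from.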
If you can cache the masks (say you were matching all strings against all others, for example) then you get a greater gain. True laziness is hard work
by CaptainF (Initiate) on May 29, 2009 at 00:16 UTC
by GrandFather (Saint) on May 29, 2009 at 01:15 UTC
Add the line:
as the third line of sub match. True laziness is hard work
Re: simple string comparison for efficiency
by Corion (Patriarch) on May 28, 2009 at 20:47 UTC
Searching this site for recommendations of BioPerl and/or Fasta will likely give you lots of results. For example, Lower-casing Substrings and Iterating Two Files together has a problem very similar to yours and also a solution. Of course, if the "extreme premium on efficiency" you cite will be backed up by an extreme premium of money, you can achieve better results.
by citromatik (Curate) on May 28, 2009 at 21:29 UTC
"Searching this site for recommendations of BioPerl..."
If you are looking for an efficient solution, stay away from BioPerl.
citromatik
Re: simple string comparison for efficiency
by roboticus (Chancellor) on May 29, 2009 at 14:59 UTC
I think the first thing I'd do is break the problem into a smaller set of problems to reduce the number of comparisons to make. If you have 10,000 strings to compare, a brute-force approach has you making about 1x10^8 comparisons (roughly N^2). But if you can break your files into M roughly equal groups, you've already reduced the comparisons to roughly M * (N/M)^2. For M=32, that's about 3.13x10^6 comparisons, a reduction of about one and a half orders of magnitude. Breaking your data set into multiple small groups will give you a big time savings. (Yes, it's hand-wavy and I made no attempt to be rigorous. Obviously a degenerate case wouldn't do you any good, and some files will appear in multiple groups, of course. But I'm guessing that your data won't fall into the worst case. If you want rigor, please send a check for the appropriate amount, and I'll take care of it--heh, heh!) An approach something like the following might be able to break your list of files into plenty of groups:
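roboticus's snippet was lost from the page. A hypothetical sketch of the partitioning idea (the key length, names, and the 'N'-expansion are my assumptions): bucket each string by its leading bases, and file a wildcard-bearing prefix under every concrete prefix it could match.

```perl
use strict;
use warnings;

my $KEYLEN = 2;    # assumed group-key length; tune for your data

# Expand a prefix into every concrete key it can match: 'N' is a
# wildcard, so it stands for each of the four bases.
sub keys_for {
    my ($prefix) = @_;
    my @keys = ('');
    for my $ch (split //, $prefix) {
        my @bases = $ch eq 'N' ? qw(A C G T) : ($ch);
        @keys = map { my $k = $_; map { $k . $_ } @bases } @keys;
    }
    return @keys;
}

my %group;
for my $str (qw(ATGCAT NTGCAA CCGTAA)) {
    push @{ $group{$_} }, $str for keys_for(substr($str, 0, $KEYLEN));
}

# Only strings sharing a bucket need the expensive full comparison.
print "$_: @{ $group{$_} }\n" for sort keys %group;
```

As the parent post anticipates, a string whose prefix contains 'N' lands in several buckets (here NTGCAA appears under AT, CT, GT and TT).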
(Note: horribly untested, and I probably botched a good bit of the syntax.) There are some improvements you can make, such as removing identical groups, etc. You can also apply this recursively, taking the next N bytes and further partitioning your sets until you run out of text. Another approach you might take is figuring out a coding scheme where you can compute a code for each string and build a hash of similar codes to find similar items. But I couldn't think of a suitable scheme that would work with wildcarded 'N' characters--hence the brute-force approach. ...roboticus
Re: simple string comparison for efficiency
by AnomalousMonk (Archbishop) on Jun 03, 2009 at 02:45 UTC
As one would expect from a C-coded solution, it is faster than the Pure Perl bitwise string differencing approach of GrandFather in Re: simple string comparison for efficiency: about 1000% for strings of 24,000 bases, about 350% for 240,000 bases (on my laptop: YMMV). It uses an approach that, by happy accident, works for the base characters A, C, T and G, and the don't-care character N. It can be extended a bit for other 'bases' (I know other characters are used to represent common base sequences), but sooner or later it will break down. There is a similar (and as yet untested) approach that I can think of that could reliably handle a much wider range of 'bases', but it would require a separate (though I think fairly fast) character-substitution encoding step to convert input files to a form amenable to differencing. Please take a look at the material on my scratchpad and, if you think it is of interest, I can post it in this thread or in a new thread, as you think best.
by BrowserUk (Patriarch) on Jun 03, 2009 at 07:16 UTC
Using XOR and a lookup for this in C code is kinda pointless. The reason the XOR technique works relatively quickly in Perl is that accessing strings character by character in Perl is slow: it involves a function call (substr) for every character, and subroutine calls (even built-ins; both are opcodes) are pretty slow in Perl. My bitwise XOR + tr technique allows you to compare all the bytes in the string using a single opcode, and to do so in a way that leaves behind a detectable signature at non-matching byte positions (which other comparison opcodes do not do), which lets you find where the differences are, or just count them with another single opcode. But in C, accessing the bytes of a string is cheap. So directly comparing the bytes against each other, and each against 'N', in a compound test with short-circuiting boolean operators will be quicker than xoring the bytes and then checking the results against a lookup table. A simple C implementation runs substantially faster than your xor + lookup, especially on short strings.
However, we can do even better, especially on the long strings. On 32- and 64-bit CPUs, accessing and manipulating byte registers is relatively slow compared to manipulating natural register-sized entities. By comparing the strings in 4- or 8-byte chunks, and only comparing bytes when a mismatch is detected, you can achieve further substantial reductions in the number of comparisons made, using 32-bit compares.
I also tried a 64-bit version, which intuition says should be faster still, but that proved not to be so. At first that confused me, but looking more closely at your test data I realised the reason: no single 8-byte compare will succeed, as your test strings ('ATGCATCGATGN' vs. 'ATGCATNGATGN') have differences in every 8-byte unit, so the 64-bit version has to compare 8 bytes for each failure rather than 4 with the 32-bit version. If I modify the strings so that they are substantially similar, then the 64-bit version wins hands down, as expected.
/msg me if you want the modified version of your benchmark.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by AnomalousMonk (Archbishop) on Jun 03, 2009 at 09:42 UTC
I realize I had naively (and unquestioningly) assumed that the look-up approach would allow one to effectively combine multiple operations into one, thus saving time. But now I see that it does not. And that doesn't begin to address the fact that the look-up approach will not scale to register-wide operations! ... and enlightening.