Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Search a hash for closest match

by aquinom (Sexton)
on Nov 01, 2011 at 16:56 UTC ( [id://935157]=perlquestion: print w/replies, xml ) Need Help??

aquinom has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm trying to write an algorithm that can search a genomic coordinate (chromosome number, position) against a list of other coordinates (from eQTL data) to find the nearest coordinate(s) (eQTL) to the search term.

I was planning on using a hash structure such as $hash{$chr}{$pos}='' for my search terms and another $eQTLhash{$chr}{$pos}='...'. Could anyone recommend to me a good algorithm to accomplish this task?

e.g. (also strange case in which more than 1 site would have to be returned)

search term = 1 50

eqtl hash contains 1=>15, 1=>49, 1=>51, 1=>79

should produce output of 49 AND 51 (this is probably a very rare case and if it's too difficult don't worry about creating both outputs).

Replies are listed 'Best First'.
Re: Search a hash for closest match
by BrowserUk (Patriarch) on Nov 01, 2011 at 17:23 UTC

    Hashes don't help with finding nearest, only exact matches.

    You would be better using a HoAs rather than a HoHs:

    $hash{ 1 } = [ 15, 49, 51, 79 ];

    And then use a binary search to find the closest position.

    More information about the details of your data might lead to a more efficient solution.

    Ie. What is the range of the positions? How many chr/pos pairs in your hash? How many lookups do you need to do?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Okay, I'll use a HoA instead I wasn't thinking about the data structure as I was hung up on thinking about how to make an efficient algorithm to do this. I was also thinking about doing a binary search as you suggested.

      The range of positions is going to be quite large(between 1-250M for large chromosomes, 25 chromosomes) for the eQTL set but only 928,000 positions in total. My query list will be probably in the dozens at most. I will need to do a lookup for each item in my query list.

        Given small number of searches and the ~40k values per search, you may well find that a simple linear search using List::MoreUtils::firstidx() is more than adequate for your needs.

        Indeed, you may well find that as the linear search is coded in C, that it beats a binary search in Perl.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Search a hash for closest match
by kennethk (Abbot) on Nov 01, 2011 at 17:24 UTC
    Since you are looking for a closest match and not an exact match, you necessarily need to iterate over the target set. To me, that means you probably want to use a hash of arrays, not a hash of hashes -- see perllol for info on differences between the structures.

    I don't really follow your example, but as best as I can tell, your search algorithm should look something like

    #!/usr/bin/perl -w use strict; my @values = ( 15, 49, 51, 79, ); my $pos = 50; my $diff; my @results; for my $value (@values) { if (not defined $diff or abs($pos - $value) < $diff) { $diff = abs($pos - $value); @results = $value; } elsif (abs($pos - $value) == $diff) { push @results, $value; } } print join "\n", @results;

    Note that there are more efficient search procedures, but this may be good enough before you profile.

      Thanks Kenneth, I had something similar in mind. I think I'm also going to pre-sort the HoA then try to do a binary search to speed up the lookup. Sound like a reasonable approach?

        Profile - like the great Knuth said, "Premature optimization is the root of all evil."

        Based upon the specs you give in Re^2: Search a hash for closest match, I would expect that linear search is not even remotely your optimal solution, and that using Perl's sort (Merge sort) followed by binary search is probably pretty derned good. However, better than fast code is functional code. I would highly recommend developing your script with a very simple bit of logic like this and small data sets, and improving the search algorithm later in the development cycle once it becomes necessary. YMMV.

Re: Search a hash for closest match
by JavaFan (Canon) on Nov 01, 2011 at 17:56 UTC
    Either use a sorted array, and perform binary search (if the dataset is static), or use a balanced search tree (if the set changes after it's build). Both can be build in O(n log n) time, with an O(log n) query time.

    Of course, if you only have o(log n) (that's little-o) queries, it's faster to just do full scans, and leave the list unsorted.

      Thanks. :)
Re: Search a hash for closest match
by Anonymous Monk on Nov 04, 2011 at 12:57 UTC
    Another possibility is to try to match a "handful" of records at the same time as you are passing through the list sequentially. And there might be a pretty good argument against using a fancy-tree lookup structure here ... that argument being "page faults." If you are randomly probing a large structure then you are also randomly incurring page faults. It would therefore be best to make the most of each virtual-memory page once you have incurred the cost of paging it in. Whatever pre-construction testing you do, be sure to do it under realistic loads that actually cause page-fault activity to occur on the target machine. (And if memory is capacious enough that it doesn't, then any ol' algorithm will do.)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://935157]
Approved by marto
Front-paged by planetscape
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (5)
As of 2024-04-23 07:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found