Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

finding tuples

by Anonymous Monk
on Jun 23, 2009 at 18:18 UTC ( [id://774131]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

for analysis I have small sets of nucleotides. They can occur multiple times. The goal is to find all configurations of 4-tuple (set) where all are either the same or sequential alphabetic neighbours for any given set. Bioperl doesn't seem to have anything suitable.

I next started out matching with regex but soon had to abandon that path because regex keep in a string what they have matched (or, same idea, different point-of-view, do not remember that a certain has already been matched earlier) and they also do not deal well with overlaps. For example, on the sorted set AAAAADDDDDEFFGMMSSTVVVVV the only solution is AAAA;ADEF;DDDD;FGMS;MSTV;VVVV, you can see the overlaps at D and MS. Also, I have to be a bit smart about deciding what to match first. Sometimes it is not advisable to form a sameness-tuple because then I destroy a solution involving multiple sequences, for example AADDDEEEEFFFFGGMMMMMMMMMMSTV: I should not pick EEEE or FFFF or else I cannot find ADEF;ADEF;DEFG;EFGM;MMMM;MMMM;MSTV.

Then I tried to bruteforce by enumerating all permutes and checking each 4-substring, but let me tell you, it takes forever even with small sets! Also, there are a lot of conceptual duplicates.

Any advice how to go on? Yes, this is homework. Yes, I will program it myself. Just help me break down the concept into smaller steps.

Replies are listed 'Best First'.
Re: finding tuples
by Roy Johnson (Monsignor) on Jun 23, 2009 at 18:48 UTC
    Your description of the problem is a bit puzzling. It looks like you want to run through the string in order and select four characters from it. After you've taken those four out, you want to go back and get four more from what's left, and so on, until you can't form any more tuples. The rules for selecting 4-tuples are that they must all be the same letter (chosen from a cluster of at least 4), or they must be one letter from each of four consecutive clusters of letters.

    Is that right?

    And then to add to the difficulty, you want to get the largest possible set. I think that is a hard problem. I don't have a solution for you, but I hope I've made your requirements clearer to others.

    Update
    I have come up with a script that does what I think you want, although it does not skip matches in favor of more optimal ones. It just finds the leftmost tuple until they're all gone. (update again: think it's debugged now) Anyway, it's a start.

    #!perl use strict; use warnings; while (<DATA>) { chomp; my ($tuple, @set); print "Starting with $_\n"; while (($tuple) = /((\w)\2\2\2|(\w)(?:\3*)(?!\3)(\w)(?:\4*)(?!\4)(\w +)(?:\5*)(?!\5)(\w))/) { $tuple =~ y///cs; # Only one of any character if (length($tuple) == 1) { $tuple x= 4; } print "Found $tuple!\n"; # Remove tuple for my $char (split //, $tuple) { # If it is the last of its kind, # no more matches across it are possible # I put a space in there, so it won't match \w s/(?<!$char)$char(?!$char)/ / or s/$char//; } print "Next round: $_\n"; } } __DATA__ AAAAADDDDDEFFGMMSSTVVVVV AADDDEEEEFFFFGGMMMMMMMMMMSTV AADEEFFG

    Caution: Contents may have been coded under pressure.

      Your ruleset is basically right, except it's a set, not a string, and order does not matter. I sorted it and made it into a string for convenience because perl regex operate on strings, and there is a substring function, but not a subset function. I repeat back what you wrote with updated wording:

      I want to run through the set and select four characters from it. After I've taken those four out, I want to go back and get four more from what's left, and so on, until I can't form any more tuples. The rules for selecting 4-tuples are that they must all be the same letter (quantity of at least 4), or they must be one letter from each of four consecutive letters.

      I have tried your program, and it does only very little. I do not understand that big regex. I will explain what is missing. For the first data line, the remainder after taking out the same-tuples is ADEFFGMMSSTV. It should be split this way: ADEF;FGMS;MSTV. For the second data line, after taking out the same-tuples the remainder is ADEFFGGMMSTV. From this, no solution can be found anymore, as no matter how you turn it, there are only at most two 4-tuples left that satisfy the second condition and the last 4 are not consecutive. For the third data line, the program says nothing, it correctly recognises this set has no solution, but shouldn't it at least try to remove the obvious 4-tuples ADEF or DEFG?

        There is one algorithm called EM algorithm... it can probably give you more insight since it is related to shot-gun sequencing. I hope it can help Tell me what comes up your way Take care
Re: finding tuples
by johngg (Canon) on Jun 23, 2009 at 20:10 UTC

    I'm not at all sure if this is what is required. It only takes tuples of all the same letter if they are at the start of the string.

    use strict; use warnings; my @sets = qw{ AAAAADDDDDEFFGMMSSTVVVVV AADDDEEEEFFFFGGMMMMMMMMMMSTV }; foreach my $set ( @sets ) { print qq{$set\n}; my @tuples; while ( length $set >= 4 ) { if ( $set =~ m{^(.)\1{3}} ) { push @tuples, substr $set, 0, 4, q{}; } else { my @boundaries = ( 0 ); push @boundaries, pos $set while $set =~ m{(.)(?=.)(?!\1)} +g; last if scalar @boundaries < 3; push @tuples, join q{}, reverse map substr( $set, $_, 1, q{} ), reverse @boundaries[ 0 .. 3 ]; } } print qq{ $_\n} for @tuples; print qq{ Left over: $set\n} if $set; }

    The output.

    AAAAADDDDDEFFGMMSSTVVVVV AAAA ADEF DDDD FGMS MSTV VVVV AADDDEEEEFFFFGGMMMMMMMMMMSTV ADEF ADEF DEFG EFGM MMMM MMMM MSTV

    I hope this is of interest.

    Cheers,

    JohnGG

      Thanks for the program, it gave me insight how to do decomposition in a better way what I have programmed earlier.

      Your program deals well with same-tuples but forms 4-tuples that are not consecutive. Consider the sets ADEFGSTV or AEEGSSVV which have no solution. I'll hack on it to see whether I can extend it to reject those sets.

Re: finding tuples
by moritz (Cardinal) on Jun 23, 2009 at 18:50 UTC
    It's still entirely unclear to me how the tuples must be formed.

    It seems that they can overlap, and don't even have to be substrings of the string you're searching on. What are the restrictions? when are two tuples "conceptual duplicates"? How long are your input strings, typically? Is your character repertoire always limited to A-Z?

    Any advice how to go on?

    First formulate your requirements in precise language, so that anybody how doesn't know about biology can help if you.

      As I said above, it is true they are subsets, not substrings. For example, the single solution of AAAAEEEE is AAAA;EEEE, and EEEE;AAAA is identical to it, just written differently, so it should not count twice. Likewise the order within a subset: AAAA;GMST is identical to AAAA;TSGM and should count only once.

      Duplicates occured when I used permutation. I give a minimal example. A permutation of AADE is AAED. But order does not matter. Therefore the permutation algorithm wasted time.

      Typical set sizes range from 8 to 24 in steps of 4, but mostly longer than shorter.

      The repertoire is limited to ADEFGMSTV.

      I hope this helps your understanding.

        So let me summarize:
        • You have a set of input characters and their repetition counts
        • You look for a maximal set of ordered four-tuples that each consist of either four identical or four completely different letters
        • Only if a character is repeated at least four times in the input it can be used in tuple with four identical characters

        That's the only set of rules that I could extract from what you wrote so far. However that doesn't explain this: Sometimes it is not advisable to form a sameness-tuple because then I destroy a solution involving multiple sequences

        What is the exact rule of how one tuple can "destroy" a solution, if overlap is allowed anyway?

Re: finding tuples
by ikegami (Patriarch) on Jun 23, 2009 at 21:13 UTC

    If I understand correctly, it would be best to break down the input into smaller inputs.

    • If the input has only one of each of 4 subsequent letters (e.g. "A", "D", "E" and "F"), then they must form a tupple if there's a solution. Find the solutions without these letters present, and add this tupple to each of them.

    • If the input has none of a letter (e.g. "G"), then you can find the solutions for everything up to that letter and the solutions for everything above that letter separately, and find their product.

    Update: Oops, the first one isn't quite right. Consider when there is 5 subsequent letters with a count of 1.

      If the input has none of a letter

      That appears to be the key to speed it up. Thanks!

Re: finding tuples (single solution)
by ikegami (Patriarch) on Jun 26, 2009 at 19:58 UTC

    The following finds a solution in O(1) time after a very quick O(N) setup:

    #!/usr/bin/perl use strict; use warnings; my $alphabet = uc('ADEFGMSTV'); my $N = length($alphabet); my $L = $N - 1; my @ltr_lkup = $alphabet =~ /./g; my %idx_lkup = map { substr($alphabet, $_, 1) => $_ } 0..$L; sub tupple { return join '', @ltr_lkup[ sort { $a <=> $b } @_ ]; } { my ($input) = @ARGV or die("usage: $0 string\n"); $input = uc($input); my @counts; while ($input =~ /(.)/g) { defined($idx_lkup{$1}) or die("'$1' not in alphabet\n"); ++$counts[ $idx_lkup{$1} ]; } $counts[$_] ||= 0 for 0..$L; # Avoid warnings. my @extracted; for my $i (0..$L) { if ($counts[$i] >= 4) { $counts[$i] %= 4; push @extracted, tupple( ($i) x 4 ); } if ($counts[$i] > 0) { die("No solution\n") if $i+3 > $L; my $instances = $counts[$i]; for (0..3) { die("No solution\n") if $counts[$i+$_] < $instances; $counts[$i+$_] -= $instances; } push @extracted, tupple( $i+0 .. $i+3 ); } } print(join(';', @extracted), "\n"); }
    $ perl single.pl AAAAADDDDDEFFGMMSSTVVVVV AAAA;ADEF;DDDD;FGMS;MSTV;VVVV $ perl single.pl AADDDEEEEFFFFGGMMMMMMMMMMSTV ADEF;DEFG;EFGM;MMMM;MSTV $ perl single.pl AAAADDDDEEEEFFFFGGGG AAAA;DDDD;EEEE;FFFF;GGGG

    It wouldn't take that much work to make this find all the permutations because the only time you'll have permutations is when you have something of the form

    nnnnooooppppqqqq Solution 1: (nnnn,oooo,pppp,qqqq) Solution 2: (nopq)
Re: finding tuples
by BioLion (Curate) on Jun 23, 2009 at 19:46 UTC

    BioPerl won't help you with this problem, it not being a biological question beyond using residue symbols.
    If i understand correctly though, you want the maximum number of 4 character tuples?
    Then you could try some kind of search path algorithm, and then compare the different paths for the longest?
    You would have to have some way of remembering your searchpath, splitting the path every time you come across somewhere where you could take alternate routes? You have some simple rules already, like if two sequential letters are the same you can either form an 'AAAA' type tuple or an 'ABCD' type, etc...

    There is a lot of info on search path / graph search algorithms in wikipedia etc... e.g. http://en.wikipedia.org/wiki/Dijkstra's_algorithm you just want the opposite of what they usually do...

    I would be interested in seeing what you come up with

    Just a something something...
      you want the maximum number of 4 character tuples?

      Not quite. The maximum number is not interesting, except the tuples make up the whole set. In other words: either a set cannot be completely decomposed, then it has no solution; or in can be completely decomposed in certain ways, then those are the solutions.

      I have read the material about search paths, but I cannot see how it relates to a set. Can you tell me how I make search path out of a set?

        I don't have much experience in this, but i think the first thing you would need to do is work out some kind of tuple set descriptor, probably the positions of the letters if you broke them down into an array. e.g.
        $paths = [$path1, $path2 ... $pathN,];

        With each path being:
        $path = [[1,2,3,4,],[5,8,9,11],[6,10,12,13,],[etc...]...];

        You would then need a way of continuing the search path from a search path stub (i.e. a searchpath that has reached a breakpoint). After that it is a case of being really clear on what could cause a breakpoint...

        I am not being very helpful here, but it is a tough question! I suspect you may need to fork the job out as you go ( Parallel::ForkManager and Introduction to Parallel::ForkManager ) and recombine their output ( IPC::Shareable )

        Just a something something...
Re: finding tuples (all solutions)
by ikegami (Patriarch) on Jun 26, 2009 at 22:42 UTC
    #!/usr/bin/perl use strict; use warnings; my $alphabet = uc('ADEFGMSTV'); my $N = length($alphabet); my $L = $N - 1; my @ltr_lkup = $alphabet =~ /./g; my %idx_lkup = map { substr($alphabet, $_, 1) => $_ } 0..$L; sub tupple { return join '', @ltr_lkup[ sort { $a <=> $b } @_ ]; } sub find { my ($counts) = @_; my %solutions; local *try_straight_tupple = sub { my ($counts, $i, $solution) = @_; my $instances = $counts->[$i]; if ($instances == 0) { process($counts, $i+1, $solution); return; } return if $i+3 > $L || $counts->[$i+1] < $instances || $counts->[$i+2] < $instances || $counts->[$i+3] < $instances; my @counts = @$counts; my @solution = ( @$solution, tupple( $i+0 .. $i+3 ) ); $counts[$i+$_] -= $instances for 0..3; process(\@counts, $i+1, \@solution); }; local *try_quad_tupple = sub { my ($counts, $i, $solution) = @_; try_straight_tupple($counts, $i, $solution); if ($counts->[$i] < 4) { return; } my @counts = @$counts; my @solution = ( @$solution, tupple( ($i) x 4 ) ); while ($counts[$i] >= 4) { $counts[$i] -= 4; try_straight_tupple(\@counts, $i, \@solution); } }; local *process = sub { my ($counts, $i, $solution) = @_; if (!grep $_, @$counts) { ++$solutions{ join(';', @$solution) }; return; } if ($i > $L) { return; } try_quad_tupple($counts, $i, $solution); }; try_quad_tupple($counts, 0, []); return [ sort keys %solutions ]; } { my ($input) = @ARGV or die("usage: $0 string\n"); $input = uc($input); my @counts; while ($input =~ /(.)/g) { defined($idx_lkup{$1}) or die("'$1' not in alphabet\n"); ++$counts[ $idx_lkup{$1} ]; } $counts[$_] ||= 0 for 0..$L; # Avoid warnings. my $solutions = find(\@counts); die "No solution\n" if !@$solutions; print("$_\n") for @$solutions; }
    $ perl all.pl AAAAADDDDDEFFGMMSSTVVVVV AAAA;ADEF;DDDD;FGMS;MSTV;VVVV perl all.pl AADDDEEEEFFFFGGMMMMMMMMMMSTV ADEF;DEFG;EFGM;MMMM;MSTV $ perl all.pl AAAADDDDEEEEFFFFGGGG AAAA;DDDD;EEEE;FFFF;GGGG AAAA;DEFG ADEF;GGGG

    I think this can be optimised for strings with a lot of repetition, but I haven't found out how yet.

Re: finding tuples
by Anonymous Monk on Jun 24, 2009 at 11:12 UTC
    I would try dividing the letters in buckets( like AADEFFVV -> {A=>2,D=>1,E=>1,F=>2,V=>2} and then try to make the maximum combinations, if I get it right, the maximum for a given string is its length/4 so my string here has a maximum of 2 sets, then, with this upper bound in mind I would try make 2 sets, if I can't, try 1 and on..until I find a combination that works and this one will be for sure the maximum possible for this string.

      That's not helpful, I already wrote that I tried bruteforcing and that takes too long for the larger sets.

        its not exactly a bruteforce, but it can't have a straight answer either. any algorithm that solve this kind of problem will use heuristics, you can't do better than this. here is my try, didn't tried my algorithm against the others, don't know which is faster, anyway

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://774131]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (4)
As of 2024-03-29 06:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found