Table shuffling challenge

by glow_gene (Initiate)
on Aug 23, 2013 at 16:42 UTC
glow_gene has asked for the wisdom of the Perl Monks concerning the following question:

I suppose I have more of a challenge than a question. Feel free to tell me this isn't the appropriate place and no one has the time, but if you do, here is my problem:

I have a tab-delimited table with 11 columns and approximately 110,000 rows. It has column headings and the first column is merely a count of the rows (1, 2, 3, 4, 5 etc.). Each entry in the table is either a 1 or a 0. I need to randomly select an entry from columns 2-11, sum up their values and record the sum (will be a number between 0 and 10). I need to do this until all values in the table are gone and no values are used more than once per table. Aaaand here's the kicker: I need to do this 1,000,000 times (ie, repeat for 1,000,000 tables).

Example table:

Head1  Head2  Head3  Head4  Head5  Head6  Head7  Head8  Head9  Head10  Head11
1      0      1      1      0      0      0      0      0      0       1
2      0      0      0      0      0      0      0      1      0       0
3      1      0      0      0      1      0      0      0      1       0
I have code that will successfully complete this task once in about 1 second, but that would mean I would have to run the code for about 11 days to complete 1,000,000 iterations. Not going to happen. 24 hours is doable but multiple days is not. I have two working programs that take about the same amount of time. Both start by saving all variables in each column to separate arrays (excluding headers). One program then shuffles each array and sums up all the nth-numbered values in the array. The other one uses the following subroutine to select values from each array until the arrays are empty:
sub find_and_remove {
    # pulls the referenced array into the subroutine and saves it in
    # variable $reference
    my $reference = $_[0];
    # selects a random index from the array size (note: @col2c is used
    # for every column; this only works because all ten columns have
    # the same length)
    my $random_number = int( rand(@col2c) );
    # saves that-numbered element from the array as $nowvalue
    my $nowvalue = $reference->[$random_number];
    # removes the element from the array so it can't be selected again
    splice( @$reference, $random_number, 1 );
    # returns the selected value
    return $nowvalue;
}
Do you have any suggestions for how to make this subroutine faster? Do you have any suggestions on how to tackle this problem in general? I'm willing to start over! I have benchmarked both programs to try to find the time-culprit, but each step individually takes essentially no time; iterating through the whole table takes a second. (Also, feel free to tell me that this is simply a ridiculous number of values and it is just going to take a long time to complete without a super-computing cluster. That would make me feel better, even if it wouldn't solve my problem.)
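One cheap win, if the order of the remaining elements doesn't matter (and it shouldn't, since you're drawing randomly anyway), is to replace splice() with a swap-and-pop, so removal no longer shifts every element after the removal point. A minimal sketch, assuming the same calling convention as find_and_remove above:

sub find_and_remove_fast {
    my ($reference) = @_;
    # pick a random index within the current size of this array
    my $random_number = int( rand(@$reference) );
    my $nowvalue = $reference->[$random_number];
    # overwrite the chosen slot with the last element, then drop the
    # last element -- O(1) instead of splice's O(n) shift
    $reference->[$random_number] = $reference->[-1];
    pop @$reference;
    return $nowvalue;
}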

UPDATE: Thank you so much for your replies! I'm relatively new to perl and still have a busy afternoon ahead of me, so it may take me some time to try these new approaches/see if they will work. Again, thank you for your speedy responses!

Re: Table shuffling challenge
by zork42 (Monk) on Aug 24, 2013 at 05:00 UTC
    Let me see if I understand. Please say if each question is correct or not...

    Q1: You have 110,000 biomarkers, represented by the 110,000 rows?
    Q2: and 10 different cancerous tissue samples, represented by the 10 columns (ignoring the first column)?

    Q3: The first table represents the results of the tests?
    I'll call this the Test Results Table, or TRT.

    Q4: Are the other 999,999 tables derived from the TRT in some way?

    Q5: So at the moment you're going:
    TRT --> (some function) --> extra 999,999 tables --> process 1,000,000 tables?

    Q6: Would it be quicker to go:
    TRT --> process 1 table, using (some function) internally?

    ===
    I have a tab-delimited table with 11 columns and approximately 110,000 rows. It has column headings and the first column is merely a count of the rows (1, 2, 3, 4, 5 etc.). Each entry in the table is either a 1 or a 0. I need to randomly select an entry from columns 2-11, sum up their values and record the sum (will be a number between 0 and 10). I need to do this until all values in the table are gone and no values are used more than once per table. Aaaand here's the kicker: I need to do this 1,000,000 times (ie, repeat for 1,000,000 tables).
    Q7: What do you mean by "I need to randomly select an entry from columns 2-11, sum up their values and record the sum (will be a number between 0 and 10)."?

    Q8: Do you mean "I need to randomly select a row, sum up the row's 10 columns and record the sum (will be a number between 0 and 10)."?
    I have two working programs that take about the same amount of time. Both start by saving all variables in each column to separate arrays (excluding headers). One program then shuffles each array and sums up all the nth-numbered values in the array. The other one uses the following subroutine to select values from each array until the arrays are empty:
    Q9: What does "sums up all the nth-numbered values in the array" mean?

    Q10: Could you explain each stage in the process by using a short (maybe 10 row) example table?

      Well asked++ The description of the requirements has so far left me completely bewildered as well.

      I suspect there are vast improvements -- 2, 3, even 4 orders of magnitude -- in processing performance to be had for this problem; but we need to understand the problem first.


Re: Table shuffling challenge
by frozenwithjoy (Priest) on Aug 23, 2013 at 18:13 UTC

    UPDATE: Try my original approach, but just shuffle at the very last step!

    use List::Util qw(sum shuffle);

    my %data;
    <DATA>;                                   # skip the header line
    while (<DATA>) {
        my ( $row, @values ) = split;
        $data{$row} = sum @values;
    }
    say $data{$_} for shuffle keys %data;

    Maybe try something like the following. I start by making a shuffled array of row numbers, one for each data row you have. Next I read the file and build a hash: the keys are shifted off the shuffled array and the values are the row sums. Then iterating over the hash (sorted by key) will let you do whatever you want with the permuted row sums.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature 'say';
    use List::Util qw(sum shuffle);

    my %data;
    my @row_nums = shuffle 1 .. 5;

    <DATA>;                                   # skip the header line
    while (<DATA>) {
        my ( $old_row, @values ) = split;
        my $new_row = shift @row_nums;
        $data{$new_row} = sum @values;
    }
    say $data{$_} for sort { $a <=> $b } keys %data;

    __DATA__
    Head1 Head2 Head3 Head4 Head5 Head6 Head7 Head8 Head9 Head10 Head11
    1 0 1 1 0 0 0 0 0 0 1
    2 0 0 0 0 0 0 0 1 0 0
    3 1 0 0 0 1 0 0 0 1 0
    4 0 1 1 1 1 0 0 0 0 1
    5 0 0 0 0 0 0 0 1 0 0

    Quick question: Are you really doing this for 1 million different tables or are you doing 1 million permutations of the same table? If the latter, just read in and sum the records once in the original order. This will result in a 110,000 element array of row sums that you can just shuffle a million times.
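    If it were the latter, a minimal sketch (the filehandle $fh is a hypothetical stand-in for however the file is opened):

    use List::Util qw(sum shuffle);

    my @sums;
    <$fh>;                                    # skip the header line
    while (<$fh>) {
        my ( $row, @values ) = split;
        push @sums, sum @values;
    }
    # each "table" would then be just another shuffle of the same sums
    for my $iteration ( 1 .. 1_000_000 ) {
        my @permuted = shuffle @sums;
        # ... record whatever statistic is needed from @permuted ...
    }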

      Essentially, I need 1,000,000 different tables. They would be related in that if column 1 in the current table has 5 "1"s and 5 "0"s, each version of column 1 in every table would also have 5 "1"s and 5 "0"s but in a different order each time. The same would be true for all 10 columns.

      I apologize if I am not explaining this well. My lack of coding lingo and an inability to communicate without hand gestures is doing me a disservice. If the relevance helps you understand, then:

      Each column is a different cancerous tissue sample. The rows are different biomarkers. A "1" means that tumor has that biomarker, a "0" means it does not. We have several biomarkers that are in all 10 samples and we want to know if this is statistically significant. To do this we need to mix up all the 1s and 0s from each tumor and see, at random, how many times you would get a 1 in every column (i.e., a row value of 10).
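      A minimal sketch of one such permutation, assuming the ten samples are already loaded as array references in @columns (the names are illustrative, not the OP's code): shuffle each column independently, then count the rows that come out all 1s.

      use List::Util qw(shuffle);

      # @columns: ten array references, one per tissue sample, each
      # holding the 110,000 0/1 entries for that sample
      my @shuffled = map { [ shuffle @$_ ] } @columns;
      my $all_ones = 0;
      for my $row ( 0 .. $#{ $shuffled[0] } ) {
          my $sum = 0;
          $sum += $_->[$row] for @shuffled;
          $all_ones++ if $sum == 10;          # a 1 in every column
      }
      # repeat the above for each of the 1,000,000 tables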

        To me, the fact that you're finding it difficult to explain your problem is a real red flag.

        Computers are as dumb as a very dumb thing, so if you can't explain your problem to other human beings how can you expect to successfully write a program? The program will only do exactly what you tell it to do, so you need to do all the critical thinking.

        Also, how will you test your program to understand whether it's producing significant results or just random junk? Just because it doesn't crash doesn't mean that it's working properly.

        Being able to clearly explain your problem is hugely important, as demonstrated by the well-known debugging technique: CardboardProgrammer, a.k.a. Rubber Duck Debugging.

        Perhaps you should start by working with a very small dataset until you've got a good understanding of your problem space.

        You seem to be trying to derive some sort of probabilities, so perhaps there's a way to calculate what you need rather than trying this brute force approach. So maybe some time reading a statistics textbook would be time well spent?
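        For instance, if each column is shuffled independently, the chance that a given row comes out all 1s is the product of each column's fraction of 1s, so the expected number of all-ones rows per shuffled table can be computed directly, without any shuffling at all. A sketch, assuming the per-column counts of 1s have already been tallied into @ones_count:

        my $rows = 110_000;
        my $p    = 1;
        $p *= $_ / $rows for @ones_count;     # P(one row is all 1s)
        my $expected_all_ones = $rows * $p;   # expected all-ones rows per table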

        In the quest for speed I've written this code in a way I wouldn't normally, but hopefully it reflects your requirement. 100 iterations take about a minute on my desktop, so the million would take about 170 hours!! I'll work on speeding it up. -- poj
Re: Table shuffling challenge
by Perlbotics (Chancellor) on Aug 23, 2013 at 17:05 UTC

    Some ideas:

    • keep all lines in memory (110,000 x 11 doesn't seem too big)
    • avoid splice() and other list operations
    • creative use of pack()/unpack()? (see the sketch after this list)
    • try this: shuffle columns (1..10, but not 0) once (List::Util) for each row; then perform operations for one column after another (fixed index: 1..10)
    • maybe combine shuffle and looping for each row with a suitable buffer, so you need to visit all 110,000 rows only once?
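    For instance, the pack()/unpack() idea might look like this: keep each column as a string of chr(0)/chr(1) bytes instead of a Perl array, which is far more compact; the '%32C*' checksum template then sums all the bytes in a single pass. A sketch with a toy column:

    my @bits = ( 0, 1, 1, 0, 0, 1 );    # toy column
    my $col  = pack 'C*', @bits;        # one byte per entry
    my $ones = unpack '%32C*', $col;    # 3 -- the sum of all bytes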

    HTH

Re: Table shuffling challenge
by Laurent_R (Canon) on Aug 23, 2013 at 17:43 UTC

    Each entry in the table is either a 1 or a 0. I need to randomly select an entry from columns 2-11, sum up their values and record the sum (will be a number between 0 and 10).

    If you select an entry randomly, then the result will be either 0 or 1, not a number between 0 and 10. Or did I miss something? Anyway, I don't quite see the connection between what you describe and your explanation of what your program does. So I probably missed something.

    Can you provide the code you are using? We would then be more likely to understand exactly what you are trying to do, and to see whether we can spot an opportunity for performance improvement.

Re: Table shuffling challenge
by protist (Monk) on Aug 23, 2013 at 20:48 UTC

    I would advise avoiding actually removing elements. Instead, keep a multidimensional hash of column and row, and use it to record whether a given index has already been supplied before. Note the use of a closure to make the function stateful; that is the reason for the reset_hash function. You must be able to reset the hash for another run.

    ## just a sketch
    my %columns;    ## to associate columns with column numbers;
                    ## $columns{1} should return a reference to
                    ## the first column list

    {    ## closure cleanliness
        my $seen;    ## keeps track of what we have seen

        ## used to reset hash within closure between runs
        sub reset_hash { undef $seen }

        ## function expects column number
        sub find_and_dont_actually_remove_but_say_we_did {
            my $column_number = shift;
            my $column = $columns{$column_number};    ## kept as a reference
                                                      ## to avoid obscene copy-pasta
            my $rand;
            until ( not $seen->{$column_number}->{ $rand = int( rand(@$column) ) } ) {
                ## nothing to see here
            }
            $seen->{$column_number}->{$rand} = "looks familiar";
            return $column->[$rand];
        }
    }

      Ah, just noticed you need to do this UNTIL EMPTY. My method could be inefficient then, as it may take forever to guess the last few indices. There are many other approaches, but I am going to lunch soon. :D
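      One standard alternative (a sketch, not protist's code): Fisher-Yates shuffle a list of indices once, then hand them out in order -- no guessing already-seen indices, and no splice():

      # shuffle an index list in place, then draw indices in order
      sub fisher_yates_shuffle {
          my ($array) = @_;
          for ( my $i = $#$array; $i > 0; $i-- ) {
              my $j = int( rand( $i + 1 ) );
              @$array[ $i, $j ] = @$array[ $j, $i ];
          }
      }

      my @order = ( 0 .. $#column );    # @column is one data column
      fisher_yates_shuffle( \@order );
      # drawing without replacement is now just: $column[ shift @order ]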

Re: Table shuffling challenge
by abualiga (Scribe) on Aug 24, 2013 at 03:02 UTC

    This sounds a bit like permutation analysis of logical matrices without replacement. Maybe an R package already exists for something similar.

Re: Table shuffling challenge
by code-ninja (Scribe) on Aug 24, 2013 at 06:43 UTC
    OK, I'm at my workplace so I cannot submit the code right now, but I'll post my algorithm at least, because my boss is not around and this post got me ticking... ;) I'll upload my rendition of the code once I get home.

    A Sparse Matrix

    A matrix populated mostly with 0s (or nulls, for objects) is called a sparse matrix; refer to this. Keeping that in mind and re-reading your question, I can represent your data as a sparse matrix (critical analysis of this assumption is welcome).

    To parse a sparse matrix, my algorithm is as follows:

    # ALGORITHM: parseSparse(M)    // M is the given matrix

    // First, we "incrementally" parse the matrix.
    // This enables us to capture only the non-zero entries of each row.
    set sum = 0;
    foreach $row {
        foreach $col {
            if M($row, $col) != 0
                $key = concatenate($row, $col);
                $hash{$key} = M($row, $col);
        }
    }

    // Now that my hash is ready, I can do random access faster.
    /* implement what you want to implement; I will just
       sum all the non-zero elements of M */
    foreach $key (keys %hash) {
        sum += $hash{$key};
    }

    The algorithm essentially creates a 3-tuple (row, column, value) and using a hash makes random access faster as compared to arrays. The wiki link above elucidates more algorithms to represent a sparse matrix.
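    A rough Perl rendition of that algorithm, assuming the matrix is loaded as an array of array references @M (the names are illustrative):

    my %hash;
    for my $row ( 0 .. $#M ) {
        for my $col ( 0 .. $#{ $M[$row] } ) {
            next unless $M[$row][$col];           # capture non-zero entries only
            $hash{"$row,$col"} = $M[$row][$col];  # 3-tuple: "row,col" => value
        }
    }

    my $sum = 0;
    $sum += $_ for values %hash;                  # sum all non-zero elements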

      (critical analysis of this assumption is welcome) ... using a hash makes random access faster as compared to arrays

      Not so. Both hashes and arrays are classed as O(1) for access, but the constant factor is much higher for hashes than for arrays:

      @array = 1 .. 100;;
      @hash{ 0 .. 99 } = @array;;

      cmpthese -1, {
          a => q[ my $t = 0; $t += $_        for @array;     ],
          b => q[ my $t = 0; $t += $hash{$_} for keys %hash; ],
      };;

             Rate     b     a
      b   35255/s    --  -66%
      a  103177/s  193%    --

      Sparse matrices are only really useful when the ratio of zeros to non-zeros is greater than, say, 8:1 and the size runs to hundreds of millions of elements.


