
Re: Table shuffling challenge

by frozenwithjoy (Curate)
on Aug 23, 2013 at 18:13 UTC (#1050712)


in reply to Table shuffling challenge

UPDATE: Try my original approach, but just shuffle at the very last step!

my %data;
<DATA>;
while (<DATA>) {
    my ($row, @values) = split;
    $data{$row} = sum @values;
}
say $data{$_} for shuffle keys %data;

Maybe try something like the following. I start by making a shuffled array of row numbers, one for each data row you have. Next I read the file and build a hash. The keys are shifted off the shuffled array and the values are the row sums. Iterating over the hash (sorted by key) then lets you do whatever you want with the permuted row sums.

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use List::Util qw(sum shuffle);

my %data;
my @row_nums = shuffle 1..5;    # one new row number per data row

<DATA>;                         # skip the header line
while (<DATA>) {
    my ($old_row, @values) = split;
    my $new_row = shift @row_nums;
    $data{$new_row} = sum @values;
}

# Row sums, listed in the new (permuted) row order
say $data{$_} for sort { $a <=> $b } keys %data;

__DATA__
Head1 Head2 Head3 Head4 Head5 Head6 Head7 Head8 Head9 Head10 Head11
1 0 1 1 0 0 0 0 0 0 1
2 0 0 0 0 0 0 0 1 0 0
3 1 0 0 0 1 0 0 0 1 0
4 0 1 1 1 1 0 0 0 0 1
5 0 0 0 0 0 0 0 1 0 0

Quick question: Are you really doing this for 1 million different tables, or are you doing 1 million permutations of the same table? If the latter, just read in and sum the records once, in the original order. This will give you a 110,000-element array of row sums that you can simply shuffle a million times.
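
A minimal sketch of that "sum once, shuffle many times" idea, assuming the whole table fits in memory. The sub names row_sums and permuted_sums are made up for illustration, and the five inline rows are the sample data from the script above rather than a real 110,000-row file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use List::Util qw(sum shuffle);

# Sum each row once, in the original order.
# Each row is an array ref: [row_label, value, value, ...].
sub row_sums {
    my @rows = @_;
    return map { my ($label, @values) = @$_; sum(@values) } @rows;
}

# Return $count independently shuffled copies of the precomputed sums.
# Only the order changes; the sums themselves are never recomputed.
sub permuted_sums {
    my ($count, @sums) = @_;
    return map { [ shuffle @sums ] } 1 .. $count;
}

my @rows = (
    [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1 ],
    [ 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0 ],
    [ 3, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0 ],
    [ 4, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1 ],
    [ 5, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0 ],
);

my @sums = row_sums(@rows);

# Three permutations here; a real run would use a much larger count.
say "@$_" for permuted_sums(3, @sums);
```

Note that this only reorders whole rows, so every row sum stays the same. It does not help if each column has to be shuffled on its own.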


Re^2: Table shuffling challenge
by glow_gene (Initiate) on Aug 23, 2013 at 20:37 UTC

    Essentially, I need 1,000,000 different tables. They would be related in that if column 1 in the current table has 5 "1"s and 5 "0"s, each version of column 1 in every table would also have 5 "1"s and 5 "0"s but in a different order each time. The same would be true for all 10 columns.
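
One way to read that requirement is "shuffle every column independently, so the count of 1s in each column never changes". Here is a rough sketch under that assumption. The subs shuffle_columns and column_totals are hypothetical names, the table is held as an array of row array refs without the row-label column, and the data is the five-row sample from earlier in the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use List::Util qw(sum shuffle);

# Build a new table in which every column of $table has been shuffled
# independently. $table is an array ref of rows (array refs of 0/1 values).
# The input table is left untouched.
sub shuffle_columns {
    my ($table) = @_;
    my $n_rows  = @$table;
    my $n_cols  = @{ $table->[0] };
    my @shuffled = map { [] } 1 .. $n_rows;
    for my $col (0 .. $n_cols - 1) {
        my @column = shuffle map { $_->[$col] } @$table;
        $shuffled[$_][$col] = $column[$_] for 0 .. $n_rows - 1;
    }
    return \@shuffled;
}

# Number of 1s in each column, in column order.
sub column_totals {
    my ($table) = @_;
    my $n_cols = @{ $table->[0] };
    return map { my $c = $_; sum(map { $_->[$c] } @$table) } 0 .. $n_cols - 1;
}

my $table = [
    [ 0, 1, 1, 0, 0, 0, 0, 0, 0, 1 ],
    [ 0, 0, 0, 0, 0, 0, 0, 1, 0, 0 ],
    [ 1, 0, 0, 0, 1, 0, 0, 0, 1, 0 ],
    [ 0, 1, 1, 1, 1, 0, 0, 0, 0, 1 ],
    [ 0, 0, 0, 0, 0, 0, 0, 1, 0, 0 ],
];

my $permuted = shuffle_columns($table);

# The two lines of totals should match, whatever the shuffle produced.
say 'original column totals: ', join ' ', column_totals($table);
say 'permuted column totals: ', join ' ', column_totals($permuted);
say "@$_" for @$permuted;
```

Calling shuffle_columns repeatedly on the same original table would give one new table per call, each with the same per-column counts.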

    I apologize if I am not explaining this well. My lack of coding lingo and an inability to communicate without hand gestures is doing me a disservice. If the relevance helps you understand, then:

    Each column is a different cancerous tissue sample. The rows are different biomarkers. A "1" means that tumor has that biomarker; a "0" means it does not. We have several biomarkers that are in all 10 samples and we want to know if this is statistically significant. To do this we need to mix up all the 1s and 0s from each tumor and see, at random, how many times you would get a 1 in every column (i.e., a row sum of 10).

      In the quest for speed I've written this code in a way I wouldn't normally, but hopefully it reflects your requirement. 100 iterations take about a minute on my desktop, so the million would take about 170 hours! I'll work on speeding it up.
      poj

      Really, to me the fact that you're finding it difficult to explain your problem is a real red flag.

      Computers are as dumb as a very dumb thing, so if you can't explain your problem to other human beings how can you expect to successfully write a program? The program will only do exactly what you tell it to do, so you need to do all the critical thinking.

      Also, how will you test your program to understand if it's producing significant results or just random junk? Just because it doesn't crash doesn't mean that it's working properly.

      Being able to clearly explain your problem is hugely important, as demonstrated by the well-known debugging technique: Rubber Duck Debugging (also known as the Cardboard Programmer).

      Perhaps you should start by working with a very small dataset until you've got a good understanding of your problem space.

      You seem to be trying to derive some sort of probabilities, so perhaps there's a way to calculate what you need rather than trying this brute force approach. So maybe some time reading a statistics textbook would be time well spent?
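
As one possible illustration of calculating rather than simulating: if each column is shuffled independently and uniformly (an assumption about the setup, not something confirmed in the thread), then a given row holds a 1 in column j with probability k_j / n, where k_j is the number of 1s in that column and n is the number of rows. The expected number of rows with a 1 in every column is then n times the product of those fractions. The sub name expected_full_rows and the numbers below are made up for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';

# Expected number of all-1 rows after shuffling each column independently.
# Arguments: number of rows, then the count of 1s in each column.
sub expected_full_rows {
    my ($n_rows, @ones_per_column) = @_;
    my $p_full = 1;    # probability that one particular row is all 1s
    $p_full *= $_ / $n_rows for @ones_per_column;
    return $n_rows * $p_full;
}

# Toy numbers: 10 rows, 3 columns, each column holding five 1s.
say expected_full_rows(10, 5, 5, 5);    # prints 1.25
```

This gives only the average count per shuffled table, not the whole distribution, so it is a sanity check on a simulation rather than a replacement for a proper significance test.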

        I used a much smaller dataset to create and troubleshoot the code I currently have. It does what I want it to do...just much more slowly than I would like. With a pen and paper I can very quickly and succinctly explain my problem, it's just difficult to explain without a visual aid. I do agree that I need to work on my communication skills concerning code; I have only taken one self-taught course and it is an area in which I need to improve.
