PerlMonks  

Re^4: regular expessions question: (replacing words)

by $new_guy (Acolyte)
on Sep 27, 2010 at 14:04 UTC


in reply to Re^3: regular expessions question: (replacing words)
in thread regular expessions question: (replacing words)

Dear Perl monks,

I have a follow-up question. How do I select two columns at random and count ONLY the z's common to both columns?

I would like to repeat this, say, 10 times and finally get the mean of all counts (i.e. over the 10 random selections).

It gets more complicated. In the next round of random selection, I want to pick 3 columns, count the z's common to all of them, and repeat this ten times. I want to continue like this until, say, n = 18 columns, getting the mean at the end of each round. At the moment I have no idea how to go about it. A hint would be really appreciated.

Thanks

Re^5: regular expessions question: (replacing words)
by jethro (Monsignor) on Sep 27, 2010 at 14:59 UTC

    Does your data fit into memory? If not, it gets more complicated (or you just have to wait a long time while the data file is read dozens of times). You would either have to store it in a database or compress it (i.e. 'z' is 1, not-z is 0, so that every element uses just one bit).
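As a rough illustration of the one-bit-per-element idea, here is a minimal sketch that keeps one bit string per column and counts common z's with a bitwise AND. The sample rows and the names @colbits and common_z are made up for this example; vec and the '%32b*' unpack checksum are standard Perl.

```perl
use strict;
use warnings;

# Hypothetical sample: three rows, four columns ('z' or anything else).
my @rows = (
    [ 'z', 'a', 'z', 'z' ],
    [ 'z', 'z', 'b', 'z' ],
    [ 'c', 'z', 'z', 'z' ],
);

# Build one bit string per column: bit $r is 1 when row $r holds a 'z'.
my @colbits;
for my $r (0 .. $#rows) {
    for my $c (0 .. $#{ $rows[$r] }) {
        $colbits[$c] = '' unless defined $colbits[$c];
        vec($colbits[$c], $r, 1) = $rows[$r][$c] eq 'z' ? 1 : 0;
    }
}

# Rows with 'z' in every listed column: AND the bit strings together,
# then count the bits that are still set.
sub common_z {
    my @cols = @_;
    my $and  = $colbits[ shift @cols ];
    $and &= $colbits[$_] for @cols;
    return unpack '%32b*', $and;
}

print common_z(0, 3), "\n";       # columns 0 and 3
print common_z(1, 2, 3), "\n";    # columns 1, 2 and 3
```

This relies on the string form of the & operator, so it assumes the "bitwise" feature (switched on by "use v5.28" and later version declarations) is not enabled.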

    If yes, read the file into an Array of Arrays:

    my @data;
    while (my $organized = <DATA2>) {
        chomp $organized;                  # chomp the line just read, not $_
        $organized =~ s/(\s)\w+/$1z/g;
        push @data, [ split /\s+/, $organized ];
    }

    Now accessing column 5 of line 2 (both counted from 0) is simply $data[2][5].

    To make it easier, split your problem into smaller parts. Create a subroutine that takes an arbitrary number of column numbers as parameters. This subroutine just counts all rows that have a 'z' in all of these columns. You can do that with a loop (over the selected columns) inside a loop (over all rows).

    Once you have that working (test it with some simple data), create another array and add a random column number to it. Then repeatedly add another random column number (one that is not already in the array) and call the subroutine with the array. Do that 18 times.
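The "grow the list one random column at a time" idea could look roughly like this. It is only a sketch: count_all_z is a hypothetical stand-in for the counting subroutine described above (written with grep instead of the nested loop), and the sample data is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: number of rows that have 'z' in all given columns.
sub count_all_z {
    my ($data, @cols) = @_;
    my $count = 0;
    for my $row (@$data) {
        # grep in scalar context returns how many columns hold a 'z'
        $count++ if @cols == grep { $row->[$_] eq 'z' } @cols;
    }
    return $count;
}

# Invented sample data: three rows, four columns.
my @data = (
    [ 'z', 'z', 'z', 'a' ],
    [ 'z', 'b', 'z', 'z' ],
    [ 'z', 'z', 'z', 'z' ],
);
my $ncols = @{ $data[0] };

my (@picked, %seen);
while (@picked < $ncols) {
    my $col = int rand $ncols;
    next if $seen{$col}++;      # already in the array, draw again
    push @picked, $col;
    next if @picked < 2;        # need at least two columns to compare
    printf "%-10s => %d\n", "@picked", count_all_z(\@data, @picked);
}
```

With 18 columns the loop condition would stop at 18 picked columns; the %seen hash is just one simple way to avoid picking a column twice.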

      Hi Jethro,

      Thanks for the explanation!

      Yes, the data fits in memory! And yes, it would be appropriate to say every z is 1 and every non-z is 0.

      I still don't understand! How do I select two columns at random and then count only the z's that occur in the same row of both columns? By count I mean: if a z occurs in column 1 at row 6 and in column 2 at row 6, then my count of z's would be 1. Note that my count will increase as I go down comparing the rows.

      Thanks

        The 0-and-1 encoding would only have been necessary if you needed to compress the data, i.e. to save memory, which you say isn't the case.

        Ok, here is the subroutine that counts the rows with 'z' in all of the specified columns:

        sub countrows {
            # First parameter is a pointer/reference to the data
            # Second parameter is an array of columns numbers
            my ($data,@f)= @_;
            my $count=0;
            foreach my $row (@$data) {
                my $success=1;
                foreach (@f) {
                    if ($row->[$_] ne 'z') {
                        $success=0;
                        last;
                    }
                }
                $count+= $success;
            }
            return $count;
        }

        my @data= ( ['z',4,'z',4,'z'],['z',4,'z',4,4],['z','z','z',4,'z'] );
        print countrows(\@data,0,2),' ',countrows(\@data,1,3,2),' ',countrows(\@data,4),"\n";
        # print 3 0 2

        Now to get an array of random column numbers. To make sure I don't get the same number twice, I generate an array of all numbers up to the number of columns and pick (i.e. extract and delete) random numbers from that array:

        sub randomarray {
            my ($columns,$count)= @_;
            my @all;
            push @all, $_ foreach (0..($columns-1));
            my @randarray;
            while ($count-- >0) {
                push @randarray, splice(@all, int(rand(@all)),1);
            }
            return @randarray;
        }

        print join ' ',randomarray( scalar @{$data[0]} , 2 ),"\n";
        print join ' ',randomarray( scalar @{$data[0]} , 3 ),"\n";
        print join ' ',randomarray( scalar @{$data[0]} , 4 ),"\n";
        # might print
        # 3 4
        # 3 0 1
        # 0 1 4 2

        You see how I cut the problem into smaller pieces that are easier to tackle? Ok, the subroutines are still not trivial, but you should be able to connect them in a sensible way to solve your problem.
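One possible way to connect the two subroutines, offered only as a sketch and not as jethro's own solution: countrows and randomarray are repeated from the post above so the example runs on its own, while mean_count, the 10 repetitions, and the tiny sample data are assumptions made for illustration.

```perl
use strict;
use warnings;

# countrows and randomarray as in the post above.
sub countrows {
    my ($data, @f) = @_;
    my $count = 0;
    foreach my $row (@$data) {
        my $success = 1;
        foreach (@f) {
            if ($row->[$_] ne 'z') { $success = 0; last; }
        }
        $count += $success;
    }
    return $count;
}

sub randomarray {
    my ($columns, $count) = @_;
    my @all = (0 .. $columns - 1);
    my @randarray;
    while ($count-- > 0) {
        push @randarray, splice(@all, int(rand(@all)), 1);
    }
    return @randarray;
}

# Hypothetical helper: mean count over $repeats random picks of $k columns.
sub mean_count {
    my ($data, $k, $repeats) = @_;
    my $ncols = @{ $data->[0] };
    my $total = 0;
    for my $try (1 .. $repeats) {
        $total += countrows($data, randomarray($ncols, $k));
    }
    return $total / $repeats;
}

# Sample data from the post above; the real data would have 18 columns.
my @data = (
    ['z', 4,   'z', 4, 'z'],
    ['z', 4,   'z', 4, 4  ],
    ['z', 'z', 'z', 4, 'z'],
);

my $max = @{ $data[0] };
for my $k (2 .. $max) {
    printf "%2d columns: mean %.2f\n", $k, mean_count(\@data, $k, 10);
}
```

Because the columns are picked at random, the means change from run to run. The only fixed value here is the last line: picking all 5 columns always includes column 3, which has no 'z', so that mean is 0.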
