radnorr has asked for the wisdom of the Perl Monks concerning the following question:
Hi I have a bit of a snag with my perl code. I am trying to take a random unique sampling of a file containing >100,000 sequences. But right now the way the data is listed I can't seem to figure out how to write a good code.
What I have so far.....>
#!/user/bin/perl -w
+
# Use your own path
+
use strict; # always use strict
+
my $file = "josh_vamps_seqs_ALL.fa";
open (FILE, $file) or die "$0: $!"; # printing errors is a Good Thing
+(tm)
my @file = <FILE>; # Store all the lines in an array
+
close FILE;
open OUT, ">rarefaction.fa" or die "$0: $!"; # Caution: will overwrite
+ any existing files
my $i;
for ($i = 0; $i<4266;$i++) { # Start sampling loop
+
my $rand = int(rand($#file));
# rand gives you a random number between 0 and the number of lines rem
+aining in the array, note
# that this number is by necessity dynamic, it will drop 1000, 999, 99
+8, etc.
my $sample = splice @file, $rand, 1; # Cut out the $rand-th line,
+offset 1 means just one line
print OUT $sample;
} # end of for loop
What it gives me ......>
CTTTTCTTCGGACTACTTACAAGGTGTTGCATGGTCGTC
>FW4WBAJ01DVAX5.ICM_PML_Bv6.PML_43_2003_06_09
CGAGTCAACGCGCAGAACCTTACCAACACTTGACATGTTCGTCGCGACTCTAAGAGATTA
TCTCTATGCGCAACGCGAAAACCTTACCTGGCCTTGACATGCATCTCTAAGCGTGTGAAA
>FMS0R7002J2YH1.ICM_CAM_Bv6.CAM_0011_2000_03_26
TGGTGCCTTCGGGAACGCAGTGACAGGTGATGCATGG
AAACCCTCAGAGACTTCGGTTAATGACATGTTTACAGGTGATGCATGGCCGTCG
>E6SXMJY02I00IR.ICM_BMO_Bv6.BMO_0005_2007_09_22
TTCGGTTCGGCCGGACGAAACACAGGTGT
TAGTGCGACGCGAAGAACCTTACCAGGGCTTAAATGTAGTGGGACAGGTCTAGAGATAGA
GGGTGCCCTTCGGGGAATCTAGTGAGAGGTGTTGCATGGCCGTCG
GTGAGCAACGCGCAGAACCTTACCAACCCTTGACATCCTGTGCTACTACCAGAGATGGTA
TACATCTACGCGAAGAACCTTATCTACACTTGACATACAGAGAACTTACCAGAGATGGTT
TGGTGCCTTCGGGAATCTAGTGACAGGTGATGCATGGCTGTCG
CACACCAACGCGAAAAACCTTACCAACACTTGACATGTTCGTCGCGACTCTAAGAGATTA
TTCGGTTCGGCCGGACGAAACACAGGTGTTGCATGGCTGTC
What I actually want .........>
FW4WBAJ01DVAX5.ICM_PML_Bv6.PML_43_2003_06_09
FMS0R7002J2YH1.ICM_CAM_Bv6.CAM_0011_2000_03_26
E6SXMJY02I00IR.ICM_BMO_Bv6.BMO_0005_2007_09_22
What input file looks like ......>
>FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCAACGCGCAGAACCTTACCAGGTCCTGACTTCCTGACTATGGTTATTAGAAATAA
TTTCCTTCAGTTCGGCTGGGTCAGTGACAGGTGATGCATGGCCGTC
>FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCTAACCGATGAACCTTACCTACACTTGACATGCAGAGAACTTTCCAGAGATGGAT
TGGTGCCTTCGGGAACTCTGACACAGGTGATGCATCGCCGTC
>FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCTACGCGAAGAACCTTACCTACACTTGACATACAGAGAACTTACCAGAGATGGTT
TGGTGCCTTCGGGAACTCTGATACAGGTGATGCATGGCTGTC
>FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCAACGCGCAGAACCTTACCAACCCTTGACATCCAGAGAATTTTCTAGAGATAGAT
TTGTGCCTTCGGGAACTCTGTGACAGGTGATGCATGGCTGTC
I don't know if there is a way for the random function of perl to recognize a specific piece of the input say the ">" and then print the line that follows? Any pointers would be greatly appreciated.>
Re: Get random unique lines from file
by BrowserUk (Patriarch) on Aug 17, 2012 at 22:28 UTC
|
This script takes the name of the file and the number of sequences you want and outputs that number of randomly selected sequences.
It is simple, fast and should handle any size of input file with minimal memory usage:
#! perl -slw
use strict;
use List::Util qw[ shuffle ];
$/ = '>';
open FASTA, '<:raw', $ARGV[0] or die $!;
my @seqPosns;
push @seqPosns, tell( FASTA ) while <FASTA>;
@seqPosns = shuffle @seqPosns;
for ( @seqPosns[ 0 .. $ARGV[ 1 ] // 10 ] ) {
seek FASTA, $_, 0;
my $seq = <FASTA>;
chomp( $seq ); chop( $seq );
print '>', $seq;
}
close FASTA;
__END__
C:\test>988096.pl C:/dell/test/LCS/bioMan.fasta 2
>af418682
TTCCACAACTTTCCACCAAGCTCTACAAGATCCCAGAGTCAGGGGCCTGTATTTTCC
TGGGTCTTTTGGGCTTTGCCGCTCCATTTACACAATGTGGTTATCCTGCATTAATGC
ACTTCTTTCCTTCAGTACGAGATCTCCTAGATACCGCCTCAGCTCTATATCGGGAAG
TCAAACAATCCAGATTGGGACTTCAACCCCATCAAGGACCACTGGCCACAAGCCAAC
>ab033557
CTCCACGACATTCCACCAAGCTCTGCTAGATCCCAGAGTGAGGGGCCTTTACTTTCC
TGGGTCTTTTGGGCTTTGCTGCCCCTTTTACACAATGTGGCTATCCTGCCTTAATGC
ACTTCTTTCCTTCCATTCGAGATCTTCTCGACACCGCCTCTGCTCTGTATCGGGAGG
TCAAACAATCCAGATTGGGACTTCAACCCCAACAAGGATCAATGGCCAGAAGCAAAT
>x97850
CTCCACAACTTTCCTCCAAACTCTTCAAGATTCCAGAGTCAGGGCCCTGTACCTTCC
TGGGTCTTTTGGGGTTTGCCGCCCCTTTCACGCAATGTGGATATCCTGCTTTAATGC
ACTTTTTTCCTTCTATTCGAGATCTCCTCGACACCGCCTCTGCTCTGTATCGGGAGG
TCAGAAAATCCAGATTGGGACCTCAACCCGCACAAGGACAACTGGCCGGACGCCAAC
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] [Watch: Dir/Any] [d/l] |
|
for ( @seqPosns[ 0 .. $ARGV[ 1 ] // 10 ] ) {
...blah....}
| [reply] [Watch: Dir/Any] [d/l] |
|
| [reply] [Watch: Dir/Any] |
Re: Get random unique lines from file
by kennethk (Abbot) on Aug 17, 2012 at 22:00 UTC
|
Your post could be served greatly by improved formatting. Specifically, by wrapping your code, input and output in <code> tags, your code and samples wouldn't get mangled, and helpful monks like myself could help you with less effort. See Markup in the Monastery.
Rather than doing your reformatting after the fact, I would make it so that your array of lines (@file) is ready after input. I would accomplish this by slurping the file and then manually splitting where you want breaks. Perhaps something like:
my @file = do {
local $/;
$_ = <FILE>;
split /(?=>)/;
};
As well, rather than randomly splicing, I would think you'd get similar performance and better clarity by using the shuffle method from (the core module) List::Util. Then you could just print them off in the new order, or shift/pop off values as you needed them.
Be aware that these approaches are likely sub-optimal for very large files.
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Get random unique lines from file
by Marshall (Canon) on Aug 17, 2012 at 22:06 UTC
|
The formatting of the code/data and intent was hard for me to understand. But maybe this works for you?:
#!/usr/bin/perl -w
use strict;
use Data::Dumper; # a core utility
use List::Util qw(shuffle); # a core utility
my @data;
while (<DATA>)
{
next if /^\s*$/; #skip blank lines
my ($important) = (split(' ',$_))[0];
push @data, $important;
}
print Dumper \@data; #debug - just to verify the split()
# so @data is about 100,000 entries
# this is a big array, but not that big
# the use of splice() works, but is relatively inefficient
# perhaps just use List::Util shuffle to randomize the
# indicies?
# This will create a list with 100,000 values
# but it will do it efficiently.
# Within the parameters of what you have said,
# 1/4 million array/list elements will work ok, even on
# my wimpy Win XP32 bit machine
foreach my $index (shuffle (0..@data-1))
{
print "$index $data[$index]\n";
}
=prints
VAR1 = [
'FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23',
'FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23',
'FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23',
'FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23'
];
3 FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23
2 FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23
1 FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23
0 FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23
=cut
__DATA__
FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23 ACTGCCAACGCGCAGAACCTTAC
+CAGGTCCTGACTTCCTGACTATGGTTATTAGAAATAA TTTCCTTCAGTTCGGCTGGGTCAGTGACAGG
+TGATGCATGGCCGTC
FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23 ACTGCCTAACCGATGAACCTTAC
+CTACACTTGACATGCAGAGAACTTTCCAGAGATGGAT TGGTGCCTTCGGGAACTCTGACACAGGTGAT
+GCATCGCCGTC
FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23 ACTGCCTACGCGAAGAACCTTAC
+CTACACTTGACATACAGAGAACTTACCAGAGATGGTT TGGTGCCTTCGGGAACTCTGATACAGGTGAT
+GCATGGCTGTC
FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23 ACTGCCAACGCGCAGAACCTTAC
+CAACCCTTGACATCCAGAGAATTTTCTAGAGATAGAT TTGTGCCTTCGGGAACTCTGTGACAGGTGAT
+GCATGGCTGTC
| [reply] [Watch: Dir/Any] [d/l] |
Re: Get random unique lines from file
by roboticus (Chancellor) on Aug 17, 2012 at 22:54 UTC
|
radnorr:
Some time ago, someone here at PM showed a pretty cool way to select a random line from a long file. Basically, as you read the file, you store the line if rand gives you a value lower than 1 / (line number). I tried to generalize it here:
#!/usr/bin/perl
#
# sample_random_lines_from_file.pl <FName> <NumSamples>
use 5.10.1;
use strict;
use warnings;
use autodie;
my @samples;
my $FName = shift // die "missing: <filename> <numsamples>";
my $num = shift // die "missing: <numsamples>";
open my $FH, '<', $FName;
while (<$FH>) {
if ($num/$. > rand) {
my $i = @samples;
if ($i > $num) { $i = rand @samples; }
#print "slot $i, size=" . scalar(@samples) . ", line $.\n";
$samples[$i]=[ $., $_ ];
}
}
print "random samples:\n";
print $$_[1] for sort { $$a[0] <=> $$b[0] } @samples;
I haven't tested it extensively: It works, but I haven't convinced myself that it doesn't have a bias yet. Anyway, the little testing I did was first to generate a file with a million lines in it, and run it a few times:
$ perl -e 'print "$_\n" for 1 .. 1000000' >a_million_lines
marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
29748
135818
143918
164669
216447
245165
267754
404776
419876
487740
893947
marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
163918
434324
435340
534748
596221
611074
677311
682939
719979
842687
998139
There may be a "bias" in it, in that there may be a preference for one end or the other. I haven't played with it enough to determine whether it has a bias, nor figured out a way to correct it if it does. Anyway, the changes I made to adapt the algorithm are rather simple: Instead of having a probability of 1/(line number) as the indicator whether to keep a line, I use (desired num samples)/(line number) as a flag to store the line. Then I select a random slot in the @samples array to stuff the line into (after we gather enough samples to fill @samples).
I hope you find it useful.
...roboticus
When your only tool is a hammer, all problems look like your thumb. | [reply] [Watch: Dir/Any] [d/l] [select] |
|
radnorr:
I had a little more time on my hands, so I tested it a bit. It turns out that there's a slight bias for lines from the start of the file. I haven't done any of the math for it, so I don't know if it'll be easy to fix or not. (The bias is small enough that I'm thinking it might be something as simple as never overwriting the last entry in @sample or something. So it might get an early line and have it stick around until the end.)
I don't really have the time or inclination to run it down at the moment, but if I were going to do so, I'd do a couple things:
- Run my test a couple more times with different numbers of samples, to see if I could easily vary the slope of the bias.
- Make the sample array have *3* slots per entry instead of 2. I'd use the third slot to hold the number of times the slot was overwritten and histogram that as well. Just in case it's something like the fencepost error mentioned earlier.
Anyway, to test it, I did 1000 runs of 25 samples against the million-line file generated in my previous post:
$ for J in {0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,
+8,9}; do ./pm_sample_lines_from_file.pl a_million_lines 25 >t.$J; don
+e
While it was cooking, I quickly put together something to crunch the files and check:
#!/usr/bin/perl
+
#
# pm_sample_histo_test.pl <FNames>
#
# Make a brief histogram to see if there's any obvious bias in the sam
+pler.
# Assumes you're sampling from a file containing a million lines, each
+ of
# which contains just the line number.
#
use 5.10.1;
use strict;
use warnings;
my %H;
for my $FName (@ARGV) {
print "-- $FName --\n";
open my $FH, '<', $FName or die "can't open $FName\n";
while (<$FH>) {
next if /^random/;
my $I = int $_/10000;
$H{$I}++;
}
}
my $max=0;
for (keys %H) { $max = $H{$_} if $H{$_} > $max; }
for my $I (0 .. 99) {
my $bar = substr('*' x int(50 * ($H{$I}/$max)) . ' 'x50, 0, 50);
if ($I > 1 and $I < 98) {
my $avg;
my $sum = 0;
$sum += $H{$_} for $I-2 .. $I+2;
$avg = $sum/5;
substr($bar,int(50 * ($avg/$max)),1)='+';
}
printf "% 3u: (% 3u) %s\n", $I, $H{$I} // 0, $bar;
}
It accumulates the samples from the million-line file into 100 buckets (0..10000 for the first one, 10001..20000 for the second one, etc.). It then plots a histogram, and overlays it with a 5-sample moving average:
$ ./pm_sample_histo_test.pl t.*
0: (321) **************************************************
1: (290) *********************************************
2: (290) *********************************************+
3: (291) *********************************************+
4: (281) ******************************************* +
5: (307) *********************************************+*
6: (290) ********************************************+
7: (283) ********************************************+
8: (280) ******************************************* +
9: (283) ********************************************+
10: (309) ********************************************+***
11: (287) *******************************************+
12: (267) ***************************************** +
13: (253) *************************************** +
14: (292) *****************************************+***
15: (257) **************************************** +
16: (278) ******************************************+
17: (254) *************************************** +
18: (286) ******************************************+*
19: (250) ************************************** +
20: (291) ******************************************+**
21: (277) ******************************************+
22: (270) ******************************************+
23: (262) **************************************** +
24: (261) ****************************************+
25: (263) ***************************************+
26: (246) ************************************** +
27: (237) ************************************ +
28: (261) ***************************************+
29: (245) ************************************** +
30: (265) ***************************************+*
31: (267) ***************************************+*
32: (234) ************************************ +
33: (269) ****************************************+
34: (252) *************************************** +
35: (272) ****************************************+*
36: (259) ****************************************+
37: (243) ************************************* +
38: (266) ***************************************+*
39: (237) ************************************ +
40: (259) ***************************************+
41: (256) ***************************************+
42: (249) ************************************** +
43: (269) ****************************************+
44: (274) *****************************************+
45: (266) ****************************************+
46: (267) ***************************************+*
47: (229) *********************************** +
48: (233) ************************************ +
49: (265) ***************************************+*
50: (247) ************************************** +
51: (281) ****************************************+**
52: (243) ************************************* +
53: (249) ************************************** +
54: (249) ************************************** +
55: (251) **************************************+
56: (262) *************************************+**
57: (228) *********************************** +
58: (217) ********************************* +
59: (262) *************************************+**
60: (260) **************************************+*
61: (225) *********************************** +
62: (263) ****************************************+
63: (297) ****************************************+*****
64: (271) ******************************************+
65: (250) ************************************** +
66: (275) *****************************************+
67: (267) ***************************************+*
68: (260) ***************************************+
69: (224) ********************************** +
70: (256) **************************************+
71: (241) ************************************* +
72: (257) ***************************************+
73: (256) ***************************************+
74: (257) ***************************************+
75: (246) ************************************** +
76: (267) ***************************************+*
77: (267) **************************************+**
78: (232) ************************************ +
79: (239) ************************************* +
80: (252) **************************************+
81: (269) **************************************+**
82: (234) ************************************ +
83: (239) ************************************* +
84: (240) ************************************* +
85: (259) ***************************************+
86: (290) ****************************************+****
87: (249) ************************************** +
88: (256) ***************************************+
89: (235) ************************************ +
90: (242) ************************************* +
91: (261) **************************************+*
92: (245) **************************************+
93: (245) ************************************** +
94: (248) **************************************+
95: (274) ***************************************+**
96: (230) *********************************** +
97: (263) **************************************+*
98: (241) *************************************
99: (236) ************************************
As you can see, at the bottom side there are 290-ish lines selected, and at the top end, there are 250-ish lines selected.
...roboticus
When your only tool is a hammer, all problems look like your thumb. | [reply] [Watch: Dir/Any] [d/l] [select] |
|
| [reply] [Watch: Dir/Any] |
|
BrowserUk:
Ah, well, I know jack about FASTA files, so I didn't consider that. Of course, by changing the reader to accumulate records instead of lines, it could be adapted. Though since there are already a couple working examples from you and Marshall, and since mine has a bias in it, there's no real reason to do so.
I know that *you* know how to do the changes, but if someone tripping across this node in the future wants to do it, you can do so something (untested!) like this:
my @record;
while (<$FH>) {
if (/start of record marker/) {
++$cnt_recs;
if ($num/$cnt_recs > rand) {
my $i=@samples;
if ($i > $num) { $i = rand @samples; }
$samples[$i]=[$cnt_recs, [@record]];
}
}
else {
# Accumulate record
push @record, $_;
}
}
...roboticus
When your only tool is a hammer, all problems look like your thumb. | [reply] [Watch: Dir/Any] [d/l] |
|
|
|
Re: Get random unique lines from file
by Cristoforo (Curate) on Aug 17, 2012 at 21:25 UTC
|
Look at the BioPerl package, (in particular Bio::SeqIO and Bio::Seq).
Bio::SeqIO provides a method for opening a FASTA file and Bio::Seq provides methods to access the particular info you're after.
Chris
Update: I just now realized that doesn't solve the problem of getting the random lines you want. :-( | [reply] [Watch: Dir/Any] |
|
|