Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Counter - number of tags per interval

by baxy77bax (Chaplain)
on May 04, 2013 at 10:16 UTC ( #1032018=perlquestion: print w/ replies, xml ) Need Help??
baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I am in a need for a faster solution. The problem that I am playing with is related to counting. What I have are two files:
1. Tab separated file containing interval coordinates
2. tab separated file containing intervals and some tags
What I need to do is to count how many tags in file 2 there are per interval specified in the first file.

File 1: ID start stop a 12 15 a 12 17 a 13 14 a 14 19 b 10 15 b 12 15 ...
As you can see all possible overlaps are possible.
file 2: Position Tag #a 1 0 2 1 3 0 4 0 ... 10 0 11 1 12 1 13 0 14 0 ... #b 1 0 2 1 3 0 4 0 ... 10 0 11 0 12 1 13 1 14 1 ...
The procedure that I am using goes like this:
#!/usr/bin/perl use strict; use Getopt::Long; use Data::Dumper; my ($optInput1, $optInput2); GetOptions ('i=s' => \$optInput1, 't=s' => \$optInput2); open (IN, "<", $optInput1)|| die "$!"; open (IN1, "<", $optInput2)|| die "$!"; my %hash_pos; # set my intervals while (<IN1>){ chomp; /^([^\t]*)\t([^\t]*)\t([^\t]*)/; $hash_pos{$1}->{$2}->{$3} =0; } close IN1; my %hash_stop; my $p =0; my $id = ""; #count tags per interval while(my $t = <IN>){ chomp($t); if ($t =~/#(.*)\s/){ $p=0; $id = $1; %hash_stop = (); next; } $p++; # sometimes the first column is set to 0 $t =~/^(\d+)\t(\d+)/; # set start and stop boundaries for each interval # if start position has been reached if(exists $hash_pos{$id}->{$p}){ foreach my $o (keys %{$hash_pos{$id}->{$p}}){ $hash_stop{$p}->{$o}=1; } } # foreach interval count ++ if tag == 1 foreach my $keyu (keys %hash_stop){ foreach my $key (keys %{$hash_stop{$keyu}}){ $hash_pos{$id}->{$keyu}->{$key}++ if $2 eq '1'; # if end reached you are don with this # delete it from the hash_stop delete($hash_stop{$keyu}->{$key}) if $p == $key; } } } close IN; print Dumper(\%hash_pos);
but on large files this takes for ever. What i could do is to split file 2 according to its id (a,b,...) and the parallelize the procedure but that is a lot of work. so my question is does anyone has a better idea on how to improve this?

thank you

baxy

UPDATE:

ok.... hm here is the the complete example: i get the right numbers(more or less) file.1

a 12 15 a 12 17 a 13 14 a 14 19 b 10 15 b 12 15
file.2
#a 1 0 2 1 3 0 4 0 5 0 6 1 7 0 8 1 9 1 10 1 11 0 12 0 13 0 14 1 15 1 16 1 17 1 18 1 19 0 20 0 21 0 #b 1 0 2 1 3 0 4 0 5 0 6 1 7 0 8 1 9 1 10 1 11 0 12 1 13 1 14 1 15 1 16 1 17 1 18 1 19 0 20 0 21 0
./test.pl -i file.2 -t file.1
OUTPUT:
$VAR1 = { 'a ' => { '12' => { '17' => 0 } }, 'a' => { '13' => { '14' => 1 }, '12' => { '15' => 2 }, '14' => { '19' => 5 } }, 'b' => { '10' => { '15' => 5 }, '12' => { '15' => 4 } } };
everything looks ok except the first a (which is a bug in my input)

without the mistake in my input file 1 $VAR1 = { 'a' => { '13' => { '14' => 1 }, '12' => { '17' => 4, '15' => 2 }, '14' => { '19' => 5 } }, 'b' => { '10' => { '15' => 5 }, '12' => { '15' => 4 } } };
which is ok

the same code with more consistent variable names. and no obscure comments.

#!/usr/bin/perl use strict; use Getopt::Long; use Data::Dumper; my ($i1, $i2); GetOptions ('i=s' => \$i1, 't=s' => \$i2); open (IN, "<", $i1)|| die "$!"; open (IN1, "<", $i2)|| die "$!"; my %hash1; # set my intervals while (<IN1>){ chomp; /^([^\t]*)\t([^\t]*)\t([^\t]*)/; $hash1{$1}->{$2}->{$3} =0; } close IN1; my %hash2; my $p =0; my $id = ""; #count tags per interval while(<IN>){ chomp; if (/#(.*)/){ $p=0; $id = $1; %hash2 = (); next; } $p++; # sometimes the first column is set to 0 /^(\d+)\t(\d+)/; if(exists $hash1{$id}->{$p}){ foreach my $o (keys %{$hash1{$id}->{$p}}){ $hash2{$p}->{$o}=1; } } foreach my $key1 (keys %hash2){ foreach my $key2 (keys %{$hash2{$key1}}){ $hash1{$id}->{$key1}->{$key2}++ if $2 eq '1'; delete($hash2{$key1}->{$key2}) if $p == $key2; } } } close IN; print Dumper(\%hash1);
UPDATE2:

sorry for deleting my ps comment , i thought it is completely irrelevant since i found "readmore" tags and since it wasn't related to the initially problem.
ok these are some data from some simulation experiments. there is only one file.1 and only one file.2. sizes: there ate approximately 3 mil intervals specified in file.1 and about 50 distinct id's and each id in file.2 has approximately 5-7 * 10^9 lines.
cheers

ps: also the clustering of intervals cannot be assumed.

Comment on Counter - number of tags per interval
Select or Download Code
Re: Counter - number of tags per interval
by hdb (Prior) on May 04, 2013 at 10:42 UTC

    As a concept:

    • Use strings of 0s and 1s that can be interpreted as binary numbers.
    • In file 1 a  12  15 would become "111100000000000" with 1s on positions 12 to 15 (backwards).
    • All "a" tags in file 2 become on string in a similar fashion "...00110...0010, all "b" tags another etc
    • If you combine the strings from file 1 and file 2 with binary "and", you only need to count the 1s in the result.
    I hope I did properly understood your question.

Re: Counter - number of tags per interval
by BrowserUk (Pope) on May 04, 2013 at 10:57 UTC

    Your description of your purpose is very scant; it is not easily discerned from the code; and run against your sample inputs it produces:

    C:\test>1032018 -i 1032018-1.dat -t 1032019-2.dat $VAR1 = { '' => { '' => { '' => 0 } } };

    Which is less than informative. Could you post sufficient (matching) input data to allow it to produce enough output that the purpose becomes clear?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Counter - number of tags per interval
by LanX (Canon) on May 04, 2013 at 10:59 UTC
    From what I can see your code is not only hard to read but also conceptually wrong.

    Your using the 3rd level of %hash_pos at the same time for indicating the interval stop

    foreach my $o (keys %{$hash_pos{$id}->{$p}}){ $hash_stop{$p}->{$o}=1;

    AND for counting

                 $hash_pos{$id}->{$keyu}->{$key}++ if $2 eq '1';

    Thats why each count is interpreted as a new stop, resulting in exponentially growing run-time!

    As another consequence the results should be plain wrong.

    Did you check them before posting?

    Update

    Not sure anymore, this code is really too hard to read!

    Please try more descriptive and consistent variable names!

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: Counter - number of tags per interval
by LanX (Canon) on May 04, 2013 at 12:20 UTC
    The input you're showing is too short for producing performance problems!

    So what are the real dimensions?

    For instance

    • Size of input?
    • Do you apply same "interval-files" to different "point-files"?
    • How many point-files?
    • Are your points always between 1 and 21?

    Knowing these parameters help preprocessing costly parts.

    Considering var names:

    Names like input1, input2, hash1, hash2, key1, key2, p and o really don't improve readability!

    Try descriptive names like $point, $start, $stop, %count and %active.

    And please don't delete comments!

    HTH!

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    update

    I don't know if its much faster, but an approach reading all data into array and counting over slices will at least improve readablity:

    DB<159> use List::Util qw/sum/ => 0 DB<160> \@a => [0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1] DB<161> $start=4;$stop=11 => 11 DB<162> $count{"a"}{$start}{$stop}=sum @a[$start..$stop] => 3

    I'm pretty sure this approach could be translated to counting single bits with vec or pack

Re: Counter - number of tags per interval
by LanX (Canon) on May 04, 2013 at 13:43 UTC
    Please don't reply via updates!

    > sorry for deleting my ps comment , i thought it is completely irrelevant since i found "readmore" tags and since it wasn't related to the initially problem. ok these are some data from some simulation experiments. there is only one file.1 and only one file.2. sizes: there ate approximately 3 mil intervals specified in file.1 and about 50 distinct id's and each id in file.2 has approximately 5-7 * 10^9 lines. cheers

    thats a lot

    > ps: also the clustering of intervals cannot be assumed.

    so you are not restricted to 1..21 !?!

    atomic sub-ranges

    Well one fast way to go is to prepare and count in "atomic intervals", that are intervals whose elements always appear together, i.e. are never divided by other intervals.

    for instance:

    a1 12 15 a2 12 17 a3 13 14 a4 14 19

    so atomic intervals are [12],[13],[14],[15],[16..17],[18..19]

    Since they are disjunct you don't need nested loops to count there appearance. After reading all data for an id, you can calculate the real intervals by summing the atomics up, e.g. $sum_a3 = sum @count[14,15,16,18] (intervals identified by starting point)

    This looks silly for the sample data you provided because its already very dense, you're the one who can judge...

    hulls / implications / lattices

    Another - similar - approach is to precalculate "minimal hulls" and "implications".

    12 => a1 => a2 13 => a3 => a1 14 => [14] => a3,a4 ...

    the hull of a point is found by cutting all ranges including that point, and often a hull is already identical to a range. (e.g. 18 => a4 )

    Following the implications results in a minimum number off add-operations needed!

    Both algorithms might look overly complicated but they scale very well, no matter if ranges are dense or sparse.

    Count sparse points and sum hash-slice

    A simplified (but less efficient) version of above algorithms was already shown with counting points and summing over array-slices.

    If intervals are sparse, i.e. they don't cover many potential points, then rather count in a hash and sum over a hash-slice:

    $sum{a1} = sum @count{12..15}

    Implementation

    ... is left for exercise! =)

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      don't really need and implementation just an idea, and your solutions sound promising. Thank you !

      PS

      and no restriction is not 21 the whole range is between , as i wrote, 5-7 * 10^9 per id and the files are encoded and compressed to reduce space. the exaple i have written and the code is, again, just an example, simplification not to be used in in real case since this would be wasteful (considering space). but if i wrote the real thing i bet nobody would even look at the post. So i tried to be as simple (failed) as possible with my example. ++

      cheers

Re: Counter - number of tags per interval
by Laurent_R (Parson) on May 04, 2013 at 14:30 UTC

    Hi

    I do not know if this will help and also don't know if this will be faster, but I would try to use a different data structure: a hash of arrays with each array containing 3 elements: the interval lower limit, the interval upper limit and the tag count.

    As a (very light) proof of concept, I will simplify the problem by simply forgetting about the a, b et so on, but just using a simple list of ranges. Or, if you prefer, I will deal only with your data related to letter a. Since I do not understand what your tags are about, I will just add 1 to the count when the value of the variable falls within the range.

    This gives me the following session under the Perl debugger:

    DB<49> @c = ([12,15], [12,17], [13,14], [14,19]) DB<50> x @c 0 ARRAY(0x20498b98) 0 12 1 15 1 ARRAY(0x20499588) 0 12 1 17 2 ARRAY(0x2049c428) 0 13 1 14 3 ARRAY(0x2049c488) 0 14 1 19 DB<51> $e = 17 DB<52> $_->[2]++ foreach grep {$e <= $_->[1]} grep {$e >= $_->[0]} @ +c DB<53> x @c 0 ARRAY(0x20498b98) 0 12 1 15 1 ARRAY(0x20499588) 0 12 1 17 2 1 2 ARRAY(0x2049c428) 0 13 1 14 3 ARRAY(0x2049c488) 0 14 1 19 2 1 DB<54> $e = 13; DB<55> $_->[2]++ foreach grep {$e <= $_->[1]} grep {$e >= $_->[0]} @ +c DB<56> x @c 0 ARRAY(0x20498b98) 0 12 1 15 2 1 1 ARRAY(0x20499588) 0 12 1 17 2 2 2 ARRAY(0x2049c428) 0 13 1 14 2 1 3 ARRAY(0x2049c488) 0 14 1 19 2 1 DB<57>
Re: Counter - number of tags per interval
by BrowserUk (Pope) on May 04, 2013 at 15:05 UTC
    50 distinct id's and each id in file.2 has approximately 5-7 * 10^9 lines.

    File 2 has ~350 billion lines?

    What is the maximum coordinate?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      yes that is true (why is this so odd ?? in bioinformatics you have 50-500gb read-mapps (files) and nobody finds that odd :) ). Maximum coordinate ?? well that depends on a simulation. but i would say for id = a, max is app. 5*10^9 i mean i need to find this out by going through the file.
        why is this so odd ?? i

        Not odd. Just important to know as it greatly affects the possible solutions.

        How long does/do you estimate it to take to process those two files with your current code?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Counter - number of tags per interval
by Laurent_R (Parson) on May 04, 2013 at 16:45 UTC

    Hi,

    I thought again about your problem while I was doing something else and it suddenly came to my mind that it might be much faster if you did it just completely the other way around (depending naturally on your actual data).

    Your program takes a lot of time because it has to loop over a large data structure for every single line of your very big file. What you could do instead is to start by reading the big file and summarize its content. My understanding is that you have a lot of repetitive information that you want to count. You could just parse the big file and record into an array or a hash the number of times each value comes up. This means you no longer have to do lengthy loops for each line of the big file

    Once you've read the big file (or possibly just one ID section of it, say the 'a' section), you can use the array or hash summary against the intervals of the other file. This way, the nested loops of your program run against a presumably much smaller summary of the data and it might be much faster.

    Of course, I have made some assumptions on the data which may turn out to be wrong. We have no idea of what your data looks like, which makes it very difficult for us to help you or even offer useful advice. In particular, I have assumed that the data summary could hold in an array or a hash, this may not be possible, it really depends on how repetitive your data is.

      This process would also benefit from being run, say, in a cluster.   If your program accepted as parameters the starting and ending line-numbers that it is to process, then multiple independent instances of the same process, running on different machines, could each summarize a chunk of the file, doing it in the manner just suggested:

      1. Tally all of the unique values that occur in the chunk, with the number of times that they occur.
      2. Process the ranges-file once, totaling the value-counts found within that range.
      3. Output a file that can be merged with all the other cluster outputs once all the cluster instances have finished doing their job.

      The last step will be a “classic merge,” because you know that each process read the ranges-file sequentially, and that it output either zero or one record for each range, in that order.   Thus, the merge process (which generates an output that can also if need be be used as its input ...) might go something like this:

      1. Open the ranges-file and n not-yet-processed input files (including output of the last iteration of this program, if need be).
      2. “Prime the pump” by reading the first record from each input other than the ranges-file.
      3. Now, iterate through the ranges-file.   For each range, examine all of the input-files which are not yet empty to see if they have a count for that range.   If so, add it to the total and read the next record (if any) from that file.   If no files contain the range, the answer for that range is zero.   Continue until the ranges-file is exhausted.
      Once again, merging is a purely-sequential process, and if there are many files to process they can be processed in stages by using the output of one stage as an input to the next iteration.

      Run as many instances as you can, provided that each instance never exhausts the amount of real, not virtual, memory that is available in the machine for it.   Each instance reads each file sequentially.   All of the tallying and range-comparison should occur entirely in memory and should make maximum use of memory as an “instantaneous file.”   Each process should exhibit a large working-set size but no page-outs.

Re: Counter - number of tags per interval
by LanX (Canon) on May 04, 2013 at 22:08 UTC
    Just another idea based on tuning your current algorithm which is quite clever! (Well after being understood - you know my opinion about clear naming ;-)

    After analyzing the numbers you provided, I suppose that intersections of different ranges are comparatively rare and if they appear they stretch over long distances.

    So it doesn't really make sense to go into the count loop for each individual point, just go on counting the "1" till you reach the next "boundary" and add them collectively to your counter instead incrementing them one by one with++.

    The boundaries are nothing else as the sorted list of range starts-1 and stops. In your example for ID "b"

    ID start stop ... b 10 15 b 12 15

    this means [9,11,15].

    and results in the following flow

    0-9 => ignore 10-11 => count 1 x "1" => add 1 to range [10-15] 12-15 => count 4 x "1" => add 4 to range [10-15] and [12-15] 16- => ignore

    As you see, you can also stop counting between ranges and even try to jump forward in the file with seek.

    HTH! =)

    Update: Approximation

    3e6 Intervals/50 IDs *2 Borders = 120000 Borders

    6e9 Points / ID

    => in average 6e9/12e4=50000 points between two adjacent borders, where you can avoid the whole count loop.

    This should lead to a considerable boost!

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: Counter - number of tags per interval
by roboticus (Canon) on May 04, 2013 at 23:31 UTC

    baxy77bax:

    It took a while to figure out what your algorithm was, but I finally doped it out. It seemed overly complex for what you're doing, though. So I coded it up in a form that's a little easier to understand--at least for me! I don't know if it's any faster than yours, or not, so give it a try and let me know how it looks.

    Basically, I just keep a list of active intervals, and increment all of them when I see a '1'. When I hit a column that begins a new interval, I move it from the "waiting" list and into the "current" list. When I hit an end column, I remove all expired lists from the active list. That's pretty much it.

    Please let me know if it's any better for you...

    $ cat 1032018_d.pl #!/usr/bin/perl use warnings; use strict; use Getopt::Long; my ($i1, $i2, %hash1); GetOptions ('i=s' => \$i1, 't=s' => \$i2); open (IN, "<", $i1)|| die "$!"; open (IN1, "<", $i2)|| die "$!"; # Build intervals for each ID while (<IN1>){ chomp; my ($id, $beg, $end) = split /\s+/,$_; push @{$hash1{$id}}, [ $beg, $end, 0 ]; } close IN1; my $p =0; my $id = ""; my @add_intervals; my @del_intervals; my @cur_intervals; #count tags per interval while(<IN>){ s/\s+$//; #print "$.: $_\n"; if (/#(.*)/){ $p=0; $id = $1; # Intervals for this ID @add_intervals = sort {$a->[0] <=> $b->[0]} @{$hash1{$id}}; # List of interval ends my %uniq = map { $_->[1], 0 } @add_intervals; @del_intervals = sort { $a <=> $b } keys %uniq; # Start with no active intervals @cur_intervals = ( ); next; } $p++; # sometimes the first column is set to 0 /^(\d+)\s+(\d+)/ or next; # add new intervals that start on this column while (@add_intervals and ($p == $add_intervals[0][0])) { #print "Adding interval $id:$add_intervals[0][0] .. $add_inter +vals[0][1]\n"; push @cur_intervals, shift @add_intervals; } # Increment all active intervals when we find a hit if (@cur_intervals and $2 eq '1') { ++$_->[2] for @cur_intervals; } # remove ranges that end on this column if (@del_intervals and $p==$del_intervals[0]) { #print "Deleting intervals ending on column $p\n"; shift @del_intervals; @cur_intervals = grep { $_->[1] > $p } @cur_intervals; } } close IN; for my $id (sort keys %hash1) { for my $interval (@{$hash1{$id}}) { print "$id ($interval->[0] .. $interval->[1]) : $interval->[2] +\n"; } } $ perl 1032018_d.pl -i=1032018.file_2 -t=1032018.file_1 a (12 .. 15) : 2 a (12 .. 17) : 4 a (13 .. 14) : 1 a (14 .. 19) : 5 b (10 .. 15) : 5 b (12 .. 15) : 4

    Note: I think you might mean to use the first capture in the last regex as your column ($p), but I didn't change it.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Counter - number of tags per interval (75%+ reduction in runtime)
by BrowserUk (Pope) on May 05, 2013 at 13:21 UTC

    Once it becomes clear that each of your two data files contains half of each of 50 independent data sets, it becomes immediately clear that the main source of time consumption in your processing is doing repeated lookups into your HoHoH.

    Now, given that each section of your file2 (-i), consists of upto 7 billions lines of upto 12 characters, where each line represents a single bit of data; it becomes pretty obvious that you should be inverting the logic of your lookups.

    That is; instead of building a huge, in-memory data structure (I estimate 4GB+ HoHoH), and then comparing each bit position against each of the ~60,000 ranges for that id; you should be building a single bitvector from the 5-7 billion bits in each section (<1GB) and then comparing each range against the small subset of bits it covers using vec.

    In this way, you vastly reduce the memory requirement -- by only holding the 2% of lookup data you need in memory at any given time -- and also the combinatorial multipliers involved by only doing direct lookups for the (small) number of bits in each range against just those bits in the bitvector.

    To build the bitvector from (1 section of) the positions file, you can do:

    my( $id ) = $bitsIn =~ m[^#(.)\s*$] or die 'Bits ID missing'; my $bits = chr(0); $bits x= 75e6; $bitsIn =~ m[(\d+)\s+(\d+)] and vec( $bits, $1, 1 ) = $2 until eof( BITS ) or ( $bitsIn = <BITS> ) =~ m[^#];

    And then to compare the ranges against that bitvector, accumulate the counts and output the results is just:

    my $range = <RANGES>; until( eof( RANGES ) ) { my( $start, $end ) = $range =~ m[^$id\t(\d+)\t(\d+)] or last; my $count = 0; vec( $bits, $_, 1 ) and ++$count for $start .. $end; print "$id ($1 .. $2) : $count"; $range = <RANGES> }

    Putting that together with an outer loop to process the 50 different datasets you get:

    #! perl -slw use strict; open RANGES, '<', '1032018-t.dat' or die $!; open BITS, '<', '1032018-i.dat' or die $!; my $bitsIn = <BITS>; until( eof BITS ) { my( $id ) = $bitsIn =~ m[^#(.)\s*$] or die 'Bits ID missing'; my $bits = chr(0); $bits x= 75e6; $bitsIn =~ m[(\d+)\s+(\d+)] and vec( $bits, $1, 1 ) = $2 until eof( BITS ) or ( $bitsIn = <BITS> ) =~ m[^#]; warn 'Bitvector built; processing ranges : ' . localtime; my $range = <RANGES>; until( eof( RANGES ) ) { my( $start, $end ) = $range =~ m[^$id\t(\d+)\t(\d+)] or last; my $count = 0; vec( $bits, $_, 1 ) and ++$count for $start .. $end; print "$id ($1 .. $2) : $count"; $range = <RANGES> } } close BITS; close RAANGES

    When run on a pair of files containing 1 dataset (600e6 0/1s and a full 60,000 ranges), building the bitstring took 20 minutes and checking the ranges took just 2 minutes.

    Since my (randomly generated; thus possibly non-representative) dataset requires approximately 1/10 the work of one of yours I estimate that this would cut your overall run time to about 1/4.

    Of course, with the simple expedient of pre-processing the datasets into 50 separate file pairs, you could could run 4 (8/16/64/256) datasets concurrently and divide your runtime by the equivalent factor.

    Indeed, if you were to have the preprocessing step construct the bitstrings and save them to file(s), running the ranges against them would only take 2 or 3 minutes times the number of IDs divided by the number of processors you have. No time at all really.

    (And potentially there are ways of reducing the time taken to construct the bitstrings. My current method invokes the regex engine for every bit, which with 7e9 bits is going to some time. In theory at least, the 0 or 1 you are interested in is always the (second)last character on the line which means it could be extracted using substr. And if every bit position is (sequentially) represented in the file, you do not need to parse and interpret the bit-position numbers.)


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Counter - number of tags per interval
by bioinformatics (Friar) on May 06, 2013 at 01:26 UTC

    The fastest way to do this is actually counterintuitive. You're looking to see the number of sequence tags in a given genomic interval. You can do something along the lines of what you've chosen to do, or you can try a different method. Read the sequence file in and add the tag counts to fixed-size genomic intervals. Then, read in the bed file and retrieve the bins (and their counts) that are in the same region. The only limitation on this method here is on the precision, as the bins are of fixed width. I'm attaching code meant for plotting heatmaps, but the idea is similar enough that I'll include code that may help with another downstream task simultaneously. It is older code, and the subroutine approach is a little silly perhaps, but the code is readable :)

    #!/usr/bin/perl use Getopt::Long; use warnings; use strict; # Retrieves the normalized values in each bin in a window # around some peak region. The output is designed to be used # to plot a heatmap in R. # init default variables my $bin_size = 50; my $scaling_factor = 10000000; my $number_of_bins = 80; my $fragment_length = 200; my $bin_count = 0; my %hash = (); my %peaks = (); my %ids = (); sub main { # get the command line input my ($infile, $peaks, $outfile) = @ARGV; my $options = GetOptions("bin_size=i" => \$bin_size, "scal_factor=i" => \$scaling_factor, "bin_num=i" => \$number_of_bins); $bin_count = $fragment_length / $bin_size; # exceptions if (!defined $infile or !defined $peaks or !defined $outfile) { print STDERR "Usage: perl binForHeatmap.pl <infile> <peaks> <o +utfile> \n"; print STDERR "--bin_size <int> --scal_factor <int> --bin_num < +int> \n"; print STDERR "This script produces output that can be plotted +in R.\n"; print STDERR "The input or output filename is missing!\n" and +exit; } # bin the reads and output print STDERR "Generating genomic bins of size $bin_size...\n"; binReads($infile, \%hash); print STDERR "Outputing $number_of_bins bins around the summit of +the sorted peaks...\n"; getBinsForPeaks($peaks, \%hash, $outfile); } sub binReads { my ($file, $hashref) = @_; # open the file and parse # expecting a bowtie format input file! open(IN, "$file") || die "Cannot open $file: $!\n"; my $total = 0; while ( <IN> ) { chomp; my ($id, $strand, $chr, $start, undef) = split("\t", $_); $total++; if ($strand eq '+') { $start -= 75; my $bin = $start - ($start % $bin_size); #$hashref->{$chr}->{$bin} += 1; for (my $i = 0; $i < $bin_count; $i++) { $hashref->{$chr}->{$bin+($bin_size*$i)} += 1; # we ass +ume a fragment size of 200 bp, } } else { $start += 75; my $bin = $start - ($start % $bin_size); #$hashref->{$chr}->{$bin} += 1; for (my $i = 0; $i < $bin_count; $i++) { $hashref->{$chr}->{$bin-($bin_size*$i)} += 1; } } $ids{$id} = 1; # get nearest bin and add 1 #my $bin = $start - ($start % $bin_size); #$hashref->{$chr}->{$bin} += 1; } # count the number of id keys #$total = keys %ids; #print "$total\n" and exit; close IN; # normalize the bins my $multiplier = $scaling_factor / $total; foreach my $chr (keys %$hashref) { foreach my $bin (keys %{$hashref->{$chr}}) { $hashref->{$chr}->{$bin} *= $multiplier; } } } sub getBinsForPeaks { my ($file, $hashref, $outfile) = @_; open(IN, "$file") || die "Cannot open $file: $!\n"; open(OUT, ">$outfile") || die "Cannot open $outfile: $!\n"; my $counter = 1; while ( <IN> ) { chomp; # looking for the .xls output of arem/macs! next if $_ =~ m/#/; next if $_ =~ m/start/; my ($chr, $start, $end, $length, $summit, $score) = split("\t" +, $_); next if !defined $end; # capture that annoying blank line! # normalize the score by the length $score /= $length; $peaks{$counter}{"summit"} = ($start + $summit); $peaks{$counter}{"score"} = $score; $peaks{$counter}{"chrom"} = $chr; $counter++; } close IN; # sort on the score and output bins around the summit foreach my $key (sort {$peaks{$a}{"score"} <=> $peaks{$b}{"score"} +} keys %peaks) { # this sorts backwards, small to large; R plots things backwar +ds, so it plots correctly my $summit = $peaks{$key}{"summit"}; my $chr = $peaks{$key}{"chrom"}; my $score = $peaks{$key}{"score"}; #print STDERR "$chr\t$summit\t$score\n" and sleep(2); my $bin = $summit - ($summit % $bin_size); my $left_bin = $bin - (($number_of_bins / 2) * $bin_size); #my %hash = %$hashref; # prevents annoying output to stdout wh +en no bin exists for (my $i = 0; $i < $number_of_bins; $i++) { my $count = (exists $hashref->{$chr}->{($left_bin + ($i * +$bin_size))}) ? $hashref->{$chr}->{($left_bin + ($i * $bin_size))} : +0; #print STDERR "$count\n" and sleep(2); if ($i == ($number_of_bins - 1)) { print OUT "$count\n"; } else { print OUT "$count\t"; } } } close OUT; } # call main main();
    Bioinformatics

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1032018]
Approved by BrowserUk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (13)
As of 2014-10-21 08:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (99 votes), past polls