Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

Once it becomes clear that each of your two data files contains half of each of 50 independent data sets, it becomes immediately clear that the main source of time consumption in your processing is doing repeated lookups into your HoHoH.

Now, given that each section of your file2 (-i), consists of upto 7 billions lines of upto 12 characters, where each line represents a single bit of data; it becomes pretty obvious that you should be inverting the logic of your lookups.

That is; instead of building a huge, in-memory data structure (I estimate 4GB+ HoHoH), and then comparing each bit position against each of the ~60,000 ranges for that id; you should be building a single bitvector from the 5-7 billion bits in each section (<1GB) and then comparing each range against the small subset of bits it covers using vec.

In this way, you vastly reduce the memory requirement -- by only holding the 2% of lookup data you need in memory at any given time -- and also the combinatorial multipliers involved by only doing direct lookups for the (small) number of bits in each range against just those bits in the bitvector.

To build the bitvector from (1 section of) the positions file, you can do:

my( $id ) = $bitsIn =~ m[^#(.)\s*$] or die 'Bits ID missing'; my $bits = chr(0); $bits x= 75e6; $bitsIn =~ m[(\d+)\s+(\d+)] and vec( $bits, $1, 1 ) = $2 until eof( BITS ) or ( $bitsIn = <BITS> ) =~ m[^#];

And then to compare the ranges against that bitvector, accumulate the counts and output the results is just:

my $range = <RANGES>; until( eof( RANGES ) ) { my( $start, $end ) = $range =~ m[^$id\t(\d+)\t(\d+)] or last; my $count = 0; vec( $bits, $_, 1 ) and ++$count for $start .. $end; print "$id ($1 .. $2) : $count"; $range = <RANGES> }

Putting that together with an outer loop to process the 50 different datasets you get:

#! perl -slw use strict; open RANGES, '<', '1032018-t.dat' or die $!; open BITS, '<', '1032018-i.dat' or die $!; my $bitsIn = <BITS>; until( eof BITS ) { my( $id ) = $bitsIn =~ m[^#(.)\s*$] or die 'Bits ID missing'; my $bits = chr(0); $bits x= 75e6; $bitsIn =~ m[(\d+)\s+(\d+)] and vec( $bits, $1, 1 ) = $2 until eof( BITS ) or ( $bitsIn = <BITS> ) =~ m[^#]; warn 'Bitvector built; processing ranges : ' . localtime; my $range = <RANGES>; until( eof( RANGES ) ) { my( $start, $end ) = $range =~ m[^$id\t(\d+)\t(\d+)] or last; my $count = 0; vec( $bits, $_, 1 ) and ++$count for $start .. $end; print "$id ($1 .. $2) : $count"; $range = <RANGES> } } close BITS; close RAANGES

When run on a pair of files containing 1 dataset (600e6 0/1s and a full 60,000 ranges), building the bitstring took 20 minutes and checking the ranges took just 2 minutes.

Since my (randomly generated; thus possibly non-representative) dataset requires approximately 1/10 the work of one of yours I estimate that this would cut your overall run time to about 1/4.

Of course, with the simple expedient of pre-processing the datasets into 50 separate file pairs, you could could run 4 (8/16/64/256) datasets concurrently and divide your runtime by the equivalent factor.

Indeed, if you were to have the preprocessing step construct the bitstrings and save them to file(s), running the ranges against them would only take 2 or 3 minutes times the number of IDs divided by the number of processors you have. No time at all really.

(And potentially there are ways of reducing the time taken to construct the bitstrings. My current method invokes the regex engine for every bit, which with 7e9 bits is going to some time. In theory at least, the 0 or 1 you are interested in is always the (second)last character on the line which means it could be extracted using substr. And if every bit position is (sequentially) represented in the file, you do not need to parse and interpret the bit-position numbers.)


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Counter - number of tags per interval (75%+ reduction in runtime) by BrowserUk
in thread Counter - number of tags per interval by baxy77bax

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others perusing the Monastery: (5)
    As of 2014-10-22 01:00 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      For retirement, I am banking on:










      Results (112 votes), past polls