 
PerlMonks  

Tough problem! You could switch to reading the file as you go, but that's likely to make your program much slower, since you need to access every line many times.

I'd start by processing the input file into a more efficient representation, something that can be accessed using mmap() from C, for example. If all of your characteristics are boolean (and they are in your example), you can represent them as 15 unsigned 32-bit integers, which gives you room for 480 flags. Add an integer for the patient number and a row takes just 512 bits (64 bytes). You can write the pre-processing code in Perl using pack() or Bit::Vector.
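To make the row layout concrete, here is a rough sketch of the 512-bit record as a C struct, plus a helper that packs one row. The names patient_row and pack_row, and the input format (a string of '0'/'1' characters, one per characteristic), are my assumptions for illustration, not something from the original data. The Perl pre-processor would need to write the same layout, for example with pack() and native 32-bit unsigned integers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FLAG_WORDS 15                  /* 15 x 32 = 480 boolean characteristics */
#define MAX_FLAGS  (FLAG_WORDS * 32)

/* One fixed-size record: 16 x 32 bits = 512 bits = 64 bytes. */
typedef struct {
    uint32_t patient_id;
    uint32_t flags[FLAG_WORDS];
} patient_row;

/* Compile-time check that the layout really is 512 bits. */
static_assert(sizeof(patient_row) == 64, "row must be 512 bits");

/* Build a row from a string of '0'/'1' characters (hypothetical input
 * format). Returns 0 on success, -1 if the string is too long or
 * contains any other character. */
int pack_row(patient_row *row, uint32_t patient_id, const char *bits)
{
    size_t n = strlen(bits);
    if (n > MAX_FLAGS)
        return -1;

    memset(row, 0, sizeof *row);
    row->patient_id = patient_id;

    for (size_t i = 0; i < n; i++) {
        if (bits[i] == '1')
            row->flags[i / 32] |= (uint32_t)1 << (i % 32);
        else if (bits[i] != '0')
            return -1;
    }
    return 0;
}
```

Because every row is the same size, row N starts at byte offset N * 64 in the data file, which is what makes the mmap() approach cheap.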

Then I'd write some Inline::C code to mmap() the data file and provide access to "rows". The code to compare one row to another should also be written in C. It's basically an XOR of the characteristics and a bit-count of the result, so not hard to write at all. I'd definitely look at whether a lookup table can speed things up - perhaps at the 8-bit or 16-bit level. Or you could look at caching comparisons.
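Something along these lines is what I have in mind for the C side; treat it as a sketch, not tested against your data. It assumes the 64-byte row layout described above (one 32-bit patient number plus 15 words of flags), written in the machine's native byte order. The names map_rows, row_distance and init_bit_table are made up for the example. The bit count uses an 8-bit lookup table; a 16-bit table would work the same way with two lookups per word.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define FLAG_WORDS 15

typedef struct {
    uint32_t patient_id;
    uint32_t flags[FLAG_WORDS];
} patient_row;

static_assert(sizeof(patient_row) == 64, "row must be 512 bits");

static uint8_t bits_in_byte[256];

/* Fill the 8-bit lookup table; call once before row_distance(). */
void init_bit_table(void)
{
    for (int i = 1; i < 256; i++)
        bits_in_byte[i] = (uint8_t)((i & 1) + bits_in_byte[i >> 1]);
}

/* Number of characteristics on which two rows differ:
 * XOR each word, then count the set bits one byte at a time. */
unsigned row_distance(const patient_row *a, const patient_row *b)
{
    unsigned count = 0;
    for (int w = 0; w < FLAG_WORDS; w++) {
        uint32_t x = a->flags[w] ^ b->flags[w];
        count += bits_in_byte[x & 0xff]
               + bits_in_byte[(x >> 8) & 0xff]
               + bits_in_byte[(x >> 16) & 0xff]
               + bits_in_byte[x >> 24];
    }
    return count;
}

/* Map a file of packed rows read-only. Returns NULL on failure;
 * on success *n_rows holds the number of rows in the file. */
const patient_row *map_rows(const char *path, size_t *n_rows)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0 ||
        (size_t)st.st_size % sizeof(patient_row) != 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *n_rows = (size_t)st.st_size / sizeof(patient_row);
    return p;
}
```

Once mapped, the rows behave like an ordinary array, so comparing patient i with patient j is just row_distance(&rows[i], &rows[j]). Wrapping these functions with Inline::C is left out here.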

Finally, I'd use Parallel::ForkManager to make it 8-way parallel. Have each worker process take 1/8 of the patient space and write to its own output file. When they've all finished, cat the output files together and you're done.
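The split itself is simple enough to sketch. To stay in the same language as the comparison code, this version uses plain fork() and wait() in C instead of Parallel::ForkManager, so it is a stand-in for the Perl module and not a demonstration of it. The names slice_bounds, run_workers and print_slice are hypothetical, and print_slice is only a placeholder for the real comparison job.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Half-open range [*start, *end) of rows owned by worker w. */
void slice_bounds(size_t n_rows, unsigned n_workers, unsigned w,
                  size_t *start, size_t *end)
{
    assert(n_workers > 0 && w < n_workers);
    *start = n_rows * w / n_workers;
    *end   = n_rows * (w + 1) / n_workers;
}

/* Per-slice job. The real one would compare rows [start, end)
 * against the data set and print its results to `out`. */
typedef void (*slice_job)(size_t start, size_t end, FILE *out);

/* Placeholder job: just records which slice this worker was given. */
void print_slice(size_t start, size_t end, FILE *out)
{
    fprintf(out, "%zu %zu\n", start, end);
}

/* Fork one child per slice; child w writes to "<prefix>.<w>".
 * Returns 0 if every child exited cleanly, -1 otherwise. */
int run_workers(size_t n_rows, unsigned n_workers, const char *prefix,
                slice_job job)
{
    int failed = 0;
    unsigned started = 0;

    for (unsigned w = 0; w < n_workers; w++) {
        pid_t pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {                    /* child process */
            char path[256];
            size_t start, end;
            snprintf(path, sizeof path, "%s.%u", prefix, w);
            FILE *out = fopen(path, "w");
            if (!out)
                _exit(1);
            slice_bounds(n_rows, n_workers, w, &start, &end);
            job(start, end, out);
            _exit(fclose(out) == 0 ? 0 : 1);
        }
        started++;
    }

    /* Parent: wait for every child that was started. */
    for (unsigned i = 0; i < started; i++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) ||
            WEXITSTATUS(status) != 0)
            failed = 1;
    }
    return failed ? -1 : 0;
}
```

A call like run_workers(n_rows, 8, "out", job) leaves out.0 through out.7 to be concatenated afterwards. One thing to watch: if each worker only compares its rows against later rows, equal-sized slices will not mean equal work, so the boundaries may need adjusting.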

I'd be shocked if this didn't run 100x faster than the Perl code you've got now.

-sam


In reply to Re: Huge data file and looping best practices by samtregar
in thread Huge data file and looping best practices by carillonator
