Re: System call doesn't work when there is a large amount of data in a hash

by roboticus (Chancellor)
on Apr 30, 2020 at 20:30 UTC ( [id://11116289] )


in reply to System call doesn't work when there is a large amount of data in a hash

Nicolasd:

It looks like you're already getting some help on solving the problem you posted, so I won't elaborate on that.

However, your program looks like it could be fun to play around in and try to optimize a bit. To that end, I'd like to run the program, but I don't know enough about your field or the terminology to be able to figure out how to come up with a configuration file that will actually run and do something. Can you post a few simple config files that set up some simple runs using the test dataset you provided? If you can do that, I may be able to do some tweaking on your program to improve things a bit, and send a few pull requests your way.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: System call doesn't work when there is a large amount of data in a hash
by Nicolasd (Acolyte) on Apr 30, 2020 at 23:14 UTC
    Hi,

    Improvements are always welcome! Respect if you can read that huge file of messy code! :)
    I did make a new version with some comments in the code; should I upload that one instead? It may be a tiny bit clearer.
    The test datasets also come with config files that are ready to use (feel free to ask any additional questions).

    Those test datasets are very small, so they will run quickly, but most users will have very large datasets (and therefore large hashes).
    Loading all the data (which can be around 600 GB of raw data) into the hashes is relatively slow, but I'm not sure much improvement is possible there.
    A huge improvement would be parallelising the code after the hashes are loaded. I tried that with a few methods, but they either slowed the process down or were impossible because they would duplicate the hash (a similar problem to before).

    I don't know anybody who knows Perl, so I am the only one who has looked at the code; help is always welcome, and if you see something that would greatly improve the speed or memory efficiency, I can add you to the next paper. To make improvements in what the program actually does, I think you need a genetics background.

    Greets

      Hi again,

      I'll just suggest once more that you let go of the idea that you must load all your data into an in-memory hash in order for your program to be fast. For one very fast approach, please look at mce_map_f in MCE::Map (also by the learned marioroy), which is written especially for optimized parallel processing of huge files.
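
      As a purely illustrative sketch (the file name and the per-line work below are placeholders, not anything from your program), an mce_map_f run can look like this:

          use strict;
          use warnings;
          use MCE::Map;

          # one worker per core; MCE picks a sensible number of lines per chunk
          MCE::Map->init( max_workers => 'auto', chunk_size => 'auto' );

          # mce_map_f reads the file in chunks and hands each line to the
          # block in parallel; $_ holds the current line
          my @lengths = mce_map_f {
              chomp( my $line = $_ );
              length $line;    # stand-in for whatever per-line work you need
          } 'reads.txt';

          MCE::Map->finish;

          printf "processed %d lines\n", scalar @lengths;

      Only the current chunk of the file is ever held in memory by a worker, which is what makes this practical on files far larger than RAM.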

      (As an aside, have you profiled your code? I would think that Perl could load data from anywhere (file, database, whatever) faster than a shell call to an external analytical program would return ... or does your program not expect a response?)
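
      By "profiled" I just mean measured where the run time actually goes, rather than guessing. Devel::NYTProf is the usual tool; a minimal run looks something like this (the script and config file names are placeholders):

          # run the script under the profiler; this writes ./nytprof.out
          perl -d:NYTProf assembler.pl config.txt

          # turn the raw profile into a browsable HTML report under ./nytprof/
          nytprofhtml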

      As for your finding that

      "parallelisation of the code after loading the hashes ... turned out slowing down the process or impossible because it would duplicate the hash"
      ... please see MCE::Shared::Hash.
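
      To give a flavour of it, here is a minimal sketch of a shared hash with MCE::Hobo workers; the key names, worker count, and the "work" itself are made up for illustration:

          use strict;
          use warnings;
          use MCE::Hobo;
          use MCE::Shared;

          # the hash lives in MCE::Shared's manager process; workers talk to
          # it over a channel instead of each copying the whole structure
          my $counts = MCE::Shared->hash();

          my @workers = map {
              MCE::Hobo->create( sub {
                  my ($id) = @_;
                  # placeholder work: each worker bumps some counters
                  $counts->incr("kmer_${id}_$_") for 1 .. 1000;
              }, $_ );
          } 1 .. 4;

          $_->join for @workers;

          printf "distinct keys: %d\n", $counts->len;

      The trade-off is that every access crosses a process boundary, so a single lookup is slower than in a plain in-memory hash; batching the work per chunk (as mce_map_f does) is usually what recovers the speed.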

      Hope this helps!


      The way forward always starts with a minimal test.
        Hi,

        I think I tried MCE::Map a few years ago, but I will check it to be sure. I have tried many methods, which is why I am convinced about the big hash, but I could be wrong of course, as there is much of Perl I don't know.
        Small differences in access speed make a big difference here, because the script has to access the hash millions of times (I actually build 3 hashes), so some alternatives work fine at first sight but slow down a lot on large datasets.
        Similar software (in C++ or Python) usually needs even more memory than mine (although it uses a different, graph-based method, so it's hard to compare).

        (As an aside, have you profiled your code? I would think that Perl could load data from anywhere (file, database, whatever) faster than a shell call to an external analytical program would return ... or does your program not expect a response?)
        Sorry, I don't understand the question; is this about the system call? And I guess I didn't profile the code, as I don't know what that means. :)

        I think I tried this one (MCE::Shared::Hash) and it turned out to be too slow, but again I need to verify that. I will check if I can find the code; otherwise I will try it again.
        Thanks
