PerlMonks  

Re^4: MCE: How to access variables globally

by biohisham (Priest)
on Dec 21, 2021 at 06:27 UTC ( [id://11139790]=note )


in reply to Re^3: MCE: How to access variables globally
in thread MCE: How to access variables globally

I appreciate the comprehensive reply, Marshall. I elaborated on the data and the actual code in one of the responses to this thread.

The current project that is blowing the RAM is the smallest benchmark before scaling to the wider dataset, and I can foresee issues with reading and retrieving despite MCE coming to the rescue (for now). Hence, I see the case for SQL/SQLite (which has stayed outside my toolbox for ages because of the bittersweet relationship I had with Oracle DBs).

The next step is to statistically analyse the final data through R. I am kinda apprehensively curious about how R might take reading the file.

In fact, the DBM::Deep::Cookbook documents the slow performance:

Because DBM::Deep is a concurrent datastore, every change is flushed to disk immediately and every read goes to disk. This means that DBM::Deep functions at the speed of disk (generally 10-20ms) vs. the speed of RAM (generally 50-70ns), or at least 150-200x slower than the comparable in-memory datastructure in Perl.

There are several techniques you can use to speed up how DBM::Deep functions.

Put it on a ramdisk

The easiest and quickest mechanism to making DBM::Deep run faster is to create a ramdisk and locate the DBM::Deep file there. Doing this as an option may become a feature of DBM::Deep, assuming there is a good ramdisk wrapper on CPAN.

Work at the tightest level possible

It is much faster to assign the level of your db that you are working with to an intermediate variable than to re-look it up every time.
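
To make the "tightest level" advice concrete, here is a minimal sketch. The nested keys (genes, chr1) and the file name are made up for illustration and are not from my real data; it only assumes DBM::Deep is installed.

```perl
use strict;
use warnings;
use DBM::Deep;
use File::Temp qw(tempdir);

# Throwaway directory so the demo file is cleaned up afterwards.
my $dir = tempdir(CLEANUP => 1);
my $db  = DBM::Deep->new("$dir/demo.db");

# Hypothetical nested structure, purely for illustration.
$db->{genes}{chr1} = {};

# Slower: every statement walks genes -> chr1 again from the top level.
$db->{genes}{chr1}{"g$_"} = $_ for 1 .. 100;

# Faster: look the level up once, then work through the intermediate variable.
my $chr1 = $db->{genes}{chr1};
$chr1->{"g$_"} = $_ * 2 for 1 .. 100;

my $value = $db->{genes}{chr1}{g42};
print "g42 = $value\n";
```

Both loops still go to disk for every write; the second one just skips the repeated lookups on the way down.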


Something or the other, a monk since 2009

Replies are listed 'Best First'.
Re^5: MCE: How to access variables globally
by Marshall (Canon) on Dec 21, 2021 at 11:07 UTC
    This is a separate post with a program to use the DB from the previous post.
    You could make a Perl hash table to calculate a histogram of, say, the State1 column. However, we can just ask SQL to do that for us and give us the results in descending order of frequency. The DB will figure out how to do this within the memory that it has to work with.

    My "SQL kung fu" is not very advanced, but some really amazing things can be done with SQL when coded by an expert.
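
A minimal sketch of that kind of histogram query, assuming DBI and DBD::SQLite are installed. Only the State1 column name comes from this thread; the table name (samples), the id column and the sample values are placeholders, and an in-memory DB stands in for the real file.

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite DB for the demo; a real run would name the DB file instead.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });

# Hypothetical table; only the State1 column name comes from the thread.
$dbh->do("CREATE TABLE samples (id INTEGER, State1 TEXT)");
my $insert = $dbh->prepare("INSERT INTO samples (id, State1) VALUES (?, ?)");
my $id = 0;
$insert->execute(++$id, $_) for qw(A B A C A B);

# Let the DB count each distinct State1 value and sort by frequency.
# The second sort key only makes ties come out in a stable order.
my $histogram = $dbh->selectall_arrayref(
    "SELECT State1, COUNT(*) AS n FROM samples
     GROUP BY State1 ORDER BY n DESC, State1");

printf "%-6s %d\n", @$_ for @$histogram;
```

The Perl-hash equivalent would be one pass with `$count{$state}++`, which is fine until the data no longer fits in RAM.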

Re^5: MCE: How to access variables globally
by Marshall (Canon) on Dec 21, 2021 at 09:28 UTC
    Ok, I looked at the data that you referenced. Below is some sample code to create an SQLite DB from the data that you showed. I have no idea whatsoever of what this data means, so I took some guesses. Each column has to have a unique name, and I just put a guess in for each heading.

    You can hardly blink fast enough before this code finishes. I would estimate that this code will take about 15*10 seconds, or 150 seconds, to create a 15M line table. Ok, 2 1/2 minutes for table creation. Run this with your 15M line data set and see how long it takes on your machine.

    15M rows is "not big" as these things go. Reads are going to be much faster than writes. I could perhaps make the table creation run 2x as fast, but to what point? I think the real question is what processing of the data do you want once the table is created? I still don't understand that part.

    Forget about MCE stuff for the time being. I have a 4 core machine. With a very compute bound job, I can use 4 cores and get the job done maybe 3.8x faster. At this point, focus on the order of magnitude improvements and getting something at a small scale to produce the result you want.
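
For readers without the original listing, a table-creation step of this kind can be sketched roughly as follows. This is not the code from the post: the table name, the Score column and the tab-separated sample lines are assumptions, and only State1 is a column name mentioned in the thread. It assumes DBI and DBD::SQLite are installed.

```perl
use strict;
use warnings;
use DBI;

# In-memory DB for the demo; use "dbname=some_file.sqlite" for real data.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

# Hypothetical layout: each column needs a unique name.
$dbh->do("CREATE TABLE samples (id INTEGER, State1 TEXT, Score REAL)");

# Prepare once, execute many times.
my $insert = $dbh->prepare(
    "INSERT INTO samples (id, State1, Score) VALUES (?, ?, ?)");

# Stand-in for reading the real input file line by line.
my @lines = ("1\tA\t0.5", "2\tB\t1.5", "3\tA\t2.5");

# One transaction around all inserts: committing per row is what makes
# bulk loads into SQLite slow.
$dbh->begin_work;
for my $line (@lines) {
    my @fields = split /\t/, $line;
    $insert->execute(@fields);
}
$dbh->commit;

my ($count) = $dbh->selectrow_array("SELECT COUNT(*) FROM samples");
print "rows loaded: $count\n";
```

For a 15M line file the loop would read from a filehandle instead of an array, so that the whole input never has to sit in memory.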
