Re^2: Best way to store/access large dataset?

by Speed_Freak (Sexton)
on Jun 22, 2018 at 14:18 UTC


in reply to Re: Best way to store/access large dataset?
in thread Best way to store/access large dataset?

Thanks for the response! I'm currently playing with your code, trying to get it to work on my dataset. At the moment it just returns to the prompt after about 2.5 minutes without displaying anything. (I vaguely remember something about there being an issue with creating a text file in Windows and then trying to read it in while working in Linux?)
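If it is the Windows/Linux line-ending issue, I guess stripping a possible carriage return before splitting would be cheap insurance. A rough sketch, assuming tab-separated input (the file name is a placeholder):

    use strict;
    use warnings;

    open my $fh, '<', 'data.txt' or die "open: $!";   # placeholder name
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;              # strips "\r\n" and "\n" alike
        my @fields = split /\t/, $line;    # then split as usual
        # ... per-line processing goes here ...
    }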

Once I figure out what I'm doing wrong, I'm going to try modifying it to create individual totals for each attribute by category. The end output would list each attribute in the first column, with the categories across the top and the totals for each attribute in each category filling in the table (see the sketch after the example below).

Like so:

#Table  Square  Circle  Triangle  Rectangle
1       4       4       0         4
2       4       4       0         0
3       0       0       4         4
4       0       4       4         4
5       0       0       4         0
6       0       0       0         4
7       0       4       0         4
8       4       4       0         4
9       0       0       0         4
10      0       0       4         0
11      0       0       4         4
12      4       4       4         0
13      0       4       0         0
14      0       4       0         4
15      0       4       0         0
16      4       0       0         0
17      4       0       0         0
18      0       4       0         0
19      4       4       4         4
20      0       4       4         4
21      0       0       0         4
22      4       4       4         4
23      4       4       4         4
24      0       0       0         0
25      0       0       0         0
26      4       4       4         0
27      0       0       4         0
28      3       0       0         4
29      0       0       0         4
30      3       0       0         4
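Something like this hash-of-hashes tally is what I have in mind for producing that table. Just a sketch with illustrative names and sample data, not working code from the script above:

    use strict;
    use warnings;

    # Assumes the counts were collected during the scan as
    # $totals{$attribute}{$category} = $count. The one entry of sample
    # data here is only so the sketch runs stand-alone.
    my %totals     = ( 1 => { Square => 4, Circle => 4, Rectangle => 4 } );
    my @categories = qw(Square Circle Triangle Rectangle);

    print join( "\t", '#Table', @categories ), "\n";
    for my $attr ( sort { $a <=> $b } keys %totals ) {
        print join( "\t", $attr,
            map { $totals{$attr}{$_} // 0 } @categories ), "\n";
    }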

The ultimate goal of this will be to pull data from a database, create the binaries on the fly through a series of calculations, and then use this script to determine the next series of data points to pull from the database. (This serves as a filter.) But with the database connections in mind, it seems like using threads to speed this up would not be recommended. So do you see a way to fork this? Or would forking not help in this case? I think I read that forking will chew up some more memory, but I can handle that overhead. (I have 20 cores/40 threads and 192 GB of RAM to work with.)
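For reference, the kind of forking I have in mind would look something like this Parallel::ForkManager sketch. The input list and the worker sub are placeholders, not my actual code:

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @files = glob '*.txt';                    # placeholder input list
    my $pm    = Parallel::ForkManager->new(20);  # e.g. one worker per core

    for my $file (@files) {
        $pm->start and next;   # parent: spawn a child, move to next file
        # child: if a database is involved, connect HERE -- a DBI handle
        # should not be shared across a fork.
        process_file($file);   # hypothetical per-file worker
        $pm->finish;           # child exits
    }
    $pm->wait_all_children;

    sub process_file { my ($file) = @_; print "$$ processed $file\n" }  # stub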

Replies are listed 'Best First'.
Re^3: Best way to store/access large dataset?
by erix (Prior) on Jun 22, 2018 at 19:53 UTC

    The ultimate goal of this will be to pull data from a database, create the binaries on the fly through a series of calculations, and then use this script to determine the next series of data points to pull from the database. (This serves as a filter.)

    I have to ask:

    "ultimate goal is pulling data from a database"? Then why were you talking about these .txt files in the OP?

    "creating the binaries"? What are "binaries"?

    Why pull data from a database to do "calculations" (apparently external from the db) when a database can do efficient calculations for you?

      I missed this response, but I think I've answered the questions throughout the thread. If not, I'll give it a shot now.

      The database doesn't exist yet, and I need to do the work as a proof of concept. So once the database exists, the script will be changed to point there instead of the files.
      Binaries are just a presence/absence representation of an attribute. They are calculated from a series of raw values by evaluating the relationships of those values a few different ways.
      I'm all for the database doing the calculations if it can. I'm in completely unfamiliar territory here, so recommendations are appreciated.
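      For example, if the totals came from a single table of observations, I gather the whole summation could be one GROUP BY inside the database rather than a Perl loop. A sketch only; the DSN, table, and column names are made up:

          use strict;
          use warnings;
          use DBI;

          my ( $user, $pass ) = ( 'user', 'pass' );   # placeholders
          my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', $user, $pass,
                                  { RaiseError => 1 } );

          # One query replaces the per-file tallying done in Perl.
          my $sth = $dbh->prepare(q{
              SELECT attribute, category, SUM(present) AS total
              FROM   observations
              GROUP  BY attribute, category
          });
          $sth->execute;
          while ( my ( $attr, $cat, $total ) = $sth->fetchrow_array ) {
              print "$attr\t$cat\t$total\n";
          }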

Re^3: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Jun 22, 2018 at 15:29 UTC

    The problem lies in my actual file names and the way the column variable is assigned. Unfortunately, I have a couple of file-name formats:

    Type 1 = combinationoftextnumbersandcharacters.fileextension
    Type 2 = combinationoftextnumbersandcharacters.combinationoftextnumbersandcharacters.fileextension

    In either case, only the first block is needed. The second block in Type 2 can be ignored, as can the file extension in both.
    I'm going to look at regular expressions and try to make that work.
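    Something like this capture of everything before the first dot looks like it would cover both formats (using the same $fileext variable as in my snippet below):

        my ($column) = $fileext =~ /^([^.]+)/;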

      I was able to read in the ID file by doing the following:

      my @split_names = split( /\./, $fileext );
      my $column = $split_names[0];

      But that only creates problems in the follow-on summation block.
