in reply to Best way to store/access large dataset?

You still haven't clearly said what the (filtering) query would look like.

So, after enlarging your attributes.txt to 1 million rows, I used PostgreSQL as a front-end to read that whole file (via file_fdw).

Reading and "calculating" the whole file takes 2 seconds.

This is the SQL I used:

select attribute as attr
     , _1_file_ext + _4_file_ext + _13_file_ext + _16_file_ext as square
     , _2_file_ext + _5_file_ext + _11_file_ext + _12_file_ext as triangle
     , _3_file_ext + _6_file_ext + _7_file_ext + _10_file_ext as circle
     , _8_file_ext + _9_file_ext + _14_file_ext + _15_file_ext as rectangle
from public.atts
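For anyone without a PostgreSQL instance at hand, the same per-category sums can be sketched in plain Python. This is only an illustration under assumptions: it supposes attributes.txt is tab-delimited with a header row whose column names match those in the query, and that every value is a 0/1 flag. The function name category_counts and the sample row are made up.

```python
import csv
import io

# Item-to-category grouping copied from the SQL query.
GROUPS = {
    "square": (1, 4, 13, 16),
    "triangle": (2, 5, 11, 12),
    "circle": (3, 6, 7, 10),
    "rectangle": (8, 9, 14, 15),
}

def category_counts(lines):
    """Yield one dict per attribute with the per-category sums of its flags."""
    # Assumption: tab-delimited input with a header row.
    for row in csv.DictReader(lines, delimiter="\t"):
        counts = {"attr": row["attribute"]}
        for name, items in GROUPS.items():
            counts[name] = sum(int(row[f"_{i}_file_ext"]) for i in items)
        yield counts

# Made-up sample: one attribute that is set for items 1, 2 and 4 only.
header = ["attribute"] + [f"_{i}_file_ext" for i in range(1, 17)]
flags = ["1" if i in (1, 2, 4) else "0" for i in range(1, 17)]
sample = "\t".join(header) + "\n" + "\t".join(["attr_a"] + flags) + "\n"

result = list(category_counts(io.StringIO(sample)))
```

For a real file, pass an open file handle instead of the io.StringIO sample.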

The more interesting part would probably be the WHERE clause, or possibly an ORDER BY clause, that you would need, but from what you've said so far I can't tell what that would look like.

UPDATE: I had the column names in the wrong order; that is fixed now.

Re^2: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Jun 22, 2018 at 16:33 UTC

    The post-filtering would be (in words) something like: for category 1 (square, circle, whatever), list all attributes that occur more than 75% of the time in the items listed in category 1, but less than 25% of the time in the items listed in categories 2, 3, 4, and 5, and less than 5% of the time in the items listed under category 6. (Ultimately each category will be set up with its own set of variables for custom-tailored percentages for each comparison.)

    The end output would be a list of attributes by category that are unique to that category.
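The filter described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not a definitive implementation: the four categories and their item numbers are borrowed from the column grouping in the SQL of the parent post (the real data would have more categories), each attribute is assumed to carry one 0/1 flag per item, and the names CATEGORIES, fraction_present, unique_attributes and flags_for are hypothetical.

```python
# Assumed item-to-category mapping, taken from the SQL in the parent post.
CATEGORIES = {
    "square": (1, 4, 13, 16),
    "triangle": (2, 5, 11, 12),
    "circle": (3, 6, 7, 10),
    "rectangle": (8, 9, 14, 15),
}

def fraction_present(flags, items):
    """Share of the given (1-based) items whose flag is 1."""
    return sum(flags[i - 1] for i in items) / len(items)

def unique_attributes(rows, target, min_in=0.75, max_out=None, default_max_out=0.25):
    """List attributes frequent in `target` and rare in every other category.

    rows maps attribute name -> list of 0/1 flags, one per item.
    max_out optionally maps a category name to its own upper limit.
    """
    max_out = max_out or {}
    selected = []
    for attr, flags in rows.items():
        # Must occur in more than min_in of the target category's items.
        if fraction_present(flags, CATEGORIES[target]) <= min_in:
            continue
        # Must stay below the limit in every other category.
        if all(
            fraction_present(flags, items) < max_out.get(cat, default_max_out)
            for cat, items in CATEGORIES.items()
            if cat != target
        ):
            selected.append(attr)
    return selected

def flags_for(items):
    """Build a 16-item flag list with 1 at the given item numbers."""
    return [1 if i in items else 0 for i in range(1, 17)]

# Made-up sample data.
rows = {
    "attr_a": flags_for({1, 4, 13, 16}),     # in every square item, nowhere else
    "attr_b": flags_for({1, 4, 13}),         # exactly 75% of square: not more than 75%
    "attr_c": flags_for({1, 4, 13, 16, 2}),  # also in 25% of triangle items
}

strict = unique_attributes(rows, "square")
relaxed = unique_attributes(rows, "square", max_out={"triangle": 0.5})
```

The max_out dictionary is one way to model the per-category percentages mentioned above, for example a 5% limit for one category and 25% for the rest.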

    EDIT: And I think my issues with speed aren't here yet. I am anticipating them, though, as this transitions from reading from a fixed file to gathering a series of raw values from the database and calculating the binary values for the attributes.