Best way to store/access large dataset?

by Speed_Freak (Sexton)
on Jun 21, 2018 at 22:25 UTC
Speed_Freak has asked for the wisdom of the Perl Monks concerning the following question:

Hoping not to get flogged here, but I wanted to post the question tonight so maybe I would have a starting point when I come back in tomorrow.
I'm an absolute newb when it comes to programming, but I think Perl will be good for what I'm trying to do... and I have the books, so I'm hoping to fumble through this.

I'm trying to load in data from two files. One file has the category ID for each item I am interested in (e.g., item.1.ext = square, item.2.ext = circle, etc.).
The second file contains all the attributes for these items. Each attribute has a binary yes/no represented by 1 or 0. My files can have a couple hundred items, with a million attributes for each item.

What I am looking to do is find a good way to process through the attributes by category and score them. I was thinking that I would read in the files and attempt to create a count for each group, and then use the number of times the attribute was present in a category set over the number of items in that category to create a series of scoring criteria. (For example, which attributes occur in one category more than 75% of the time, but less than 25% of the time in any other category. Basically looking for category-unique attributes.)

But as I've learned with Perl, there are 7000 different ways to skin a cat, so I'm up for any suggestions. I'm trying to make this a fairly quick process because it will be repeated OFTEN. (Datasets will be ~200 items, 4-10 categories, and 1 million attributes.)

Data example below.

#ID's File
ID
1.file.ext    Square
2.file.ext    Triangle
3.file.ext    Circle
4.file.ext    Square
5.file.ext    Triangle
6.file.ext    Circle
7.file.ext    Circle
8.file.ext    Rectangle
9.file.ext    Rectangle
10.file.ext   Circle
11.file.ext   Triangle
12.file.ext   Triangle
13.file.ext   Square
14.file.ext   Rectangle
15.file.ext   Rectangle
16.file.et    Square

#Attributes
attribute  1.file.ext  2.file.ext  3.file.ext  4.file.ext  5.file.ext  6.file.ext  7.file.ext  8.file.ext  9.file.ext  10.file.ext  11.file.ext  12.file.ext  13.file.ext  14.file.ext  15.file.ext  16.file.et
1    1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 1
2    1 0 1 1 0 1 1 0 0 1 0 0 1 0 0 1
3    0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0
4    0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 0
5    0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0
6    0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0
7    0 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0
8    1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 1
9    0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0
10   0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0
11   0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0
12   1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 1
13   0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0
14   0 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0
15   0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0
16   1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1
17   1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1
18   0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0
19   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
20   0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 0
21   0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0
22   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
23   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
24   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
25   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26   1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 1
27   0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0
28   0 0 0 1 0 0 0 1 1 0 0 0 1 1 1 1
29   0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0
30   0 0 0 1 0 0 0 1 1 0 0 0 1 1 1 1
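A minimal sketch of the counting step described above, assuming the ID list and the attribute matrix sit in two separate whitespace-delimited files shaped like the sample data; the file names ids.txt and attributes.txt are placeholders:

use strict;
use warnings;

my ( %category, %count, %cat_size );

# Map each item (column name) to its category, and count items per category.
open my $ids, '<', 'ids.txt' or die "Can't open ids.txt: $!";
while (<$ids>) {
    chomp;
    next if /^#/ or !/\S/;              # skip the section label and blanks
    my ( $item, $cat ) = split;
    next unless defined $cat;           # skip the bare "ID" header line
    $category{$item} = $cat;
    $cat_size{$cat}++;
}
close $ids;

# The header line of the attribute file names the item behind each column.
open my $att, '<', 'attributes.txt' or die "Can't open attributes.txt: $!";
my @items;
while (<$att>) {
    chomp;
    next if /^#/ or !/\S/;
    if ( !@items ) {                    # header row: "attribute 1.file.ext ..."
        ( undef, @items ) = split;
        next;
    }
    my ( $attr, @flags ) = split;
    for my $i ( 0 .. $#flags ) {
        $count{$attr}{ $category{ $items[$i] } } += $flags[$i];
    }
}
close $att;

# %count now holds attribute => category => number of items with a 1, and
# %cat_size holds category => number of items, which is all the 75%/25%
# scoring criteria need.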

Replies are listed 'Best First'.
Re: Best way to store/access large dataset?
by NetWallah (Canon) on Jun 22, 2018 at 00:16 UTC
    This is essentially about a data structure to hold the data you need.

    Here is one way to do it ... I was not clear on what analysis you wanted, so it is a simple sum:

    use strict;
    use warnings;

    my %info;
    open my $id_file,"<","data01.txt" or die "NO ID FILE: $!";
    while(<$id_file>){
        chomp;
        my ($fileext,$name) = split;
        my ($column) = $fileext=~m/^(\d+)/ or next;
        $info{$fileext}{NAME}      = $name;
        $info{$fileext}{COLUMN}    = $column;
        $info{$fileext}{ATT_COUNT} = 0;
    }
    close $id_file;

    my @column_att;
    for (keys %info) {
        $column_att[ $info{$_}{COLUMN} ] = \$info{$_}{ATT_COUNT};  # For efficiency
    }

    open my $attr_file,"<","data02.txt" or die "NO ATTR FILE: $!";
    while (<$attr_file>){
        chomp;
        next unless m/^\d+\s+[01\s]+$/;
        my @atts = split;
        for (my $i=1; $i<=$#atts; $i++){
            ${$column_att[$i]} += $atts[$i];
        }
        #$. < 4 and print "$. --- \n", Print_info();
    }
    close $attr_file;

    Print_info();
    exit 0;
    #--------------------------
    sub Print_info{
        for (sort {$info{$a}{COLUMN} <=> $info{$b}{COLUMN}} keys %info){
            print "$_ \t",$info{$_}{ATT_COUNT}, " \t $info{$_}{NAME}\n";
        }
    }
    OUTPUT:
    1.file.ext    10    Square
    2.file.ext    12    Triangle
    3.file.ext    15    Circle
    4.file.ext    12    Square
    5.file.ext    12    Triangle
    6.file.ext    15    Circle
    7.file.ext    15    Circle
    8.file.ext    17    Rectangle
    9.file.ext    17    Rectangle
    10.file.ext   15    Circle
    11.file.ext   12    Triangle
    12.file.ext   12    Triangle
    13.file.ext   12    Square
    14.file.ext   17    Rectangle
    15.file.ext   17    Rectangle
    16.file.et    12    Square


      Thanks for the response! I'm currently playing with your code, trying to get it to work on my dataset. Currently it just returns to the prompt after about 2.5 minutes without displaying anything. (I vaguely remember something about there being an issue with creating a text file in Windows and then trying to read it in while working in Linux?)

      Once I figure out what I'm doing wrong, I'm going to attempt modifying it to create individual totals for each attribute by category. So the end output would be a table with each attribute in the first column, the categories listed across the top, and the totals for each attribute in each category filling in the cells.

      Like so:

      #Table   Square   Circle   Triangle   Rectangle
      1        4        4        0          4
      2        4        4        0          0
      3        0        0        4          4
      4        0        4        4          4
      5        0        0        4          0
      6        0        0        0          4
      7        0        4        0          4
      8        4        4        0          4
      9        0        0        0          4
      10       0        0        4          0
      11       0        0        4          4
      12       4        4        4          0
      13       0        4        0          0
      14       0        4        0          4
      15       0        4        0          0
      16       4        0        0          0
      17       4        0        0          0
      18       0        4        0          0
      19       4        4        4          4
      20       0        4        4          4
      21       0        0        0          4
      22       4        4        4          4
      23       4        4        4          4
      24       0        0        0          0
      25       0        0        0          0
      26       4        4        4          0
      27       0        0        4          0
      28       3        0        0          4
      29       0        0        0          4
      30       3        0        0          4
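      For what it's worth, once the per-category totals are sitting in a hash-of-hashes (attribute => category => count), printing a table like that is only a few lines; a sketch with made-up toy counts:

      use strict;
      use warnings;

      # Toy counts: attribute => category => number of items in that
      # category that have a 1 for the attribute (values invented here).
      my %count = (
          1  => { Square => 4, Circle => 4, Triangle => 0, Rectangle => 4 },
          28 => { Square => 3, Circle => 0, Triangle => 0, Rectangle => 4 },
      );

      my @cats = qw(Square Circle Triangle Rectangle);
      print join( "\t", '#Table', @cats ), "\n";
      for my $attr ( sort { $a <=> $b } keys %count ) {
          print join( "\t", $attr, map { $count{$attr}{$_} // 0 } @cats ), "\n";
      }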

      The ultimate goal of this will be to pull data from a database and create the binaries on the fly through a series of calculations, and then use this script to determine the next series of data points to pull from the database. (This serves as a filter.) But with the database connections in mind, it seems like using threads to speed this up would not be recommended. So do you see a way to fork this? Or would forking not help in this case? I think I read that forking will chew up some more memory, but I think I can handle that overhead. (I have 20 cores/40 threads and 192GB RAM to work with.)
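      One possible shape for forking the counting, sketched with Parallel::ForkManager under the assumption that the attribute matrix has been pre-split into chunk files (count_chunk is a hypothetical helper that runs the counting loop over one chunk); whether the fork overhead actually pays off depends on where the real bottleneck turns out to be:

      use strict;
      use warnings;
      use Parallel::ForkManager;

      # Hypothetical pre-split pieces of the attribute matrix, one file each.
      my @chunks = glob 'attributes.chunk.*';

      my %count;                                  # attribute => category => total
      my $pm = Parallel::ForkManager->new(20);    # number of worker processes

      # Merge each child's partial counts as it exits.
      $pm->run_on_finish( sub {
          my ( $pid, $exit, $ident, $signal, $core, $partial ) = @_;
          for my $attr ( keys %$partial ) {
              $count{$attr}{$_} += $partial->{$attr}{$_}
                  for keys %{ $partial->{$attr} };
          }
      } );

      for my $chunk (@chunks) {
          $pm->start and next;                    # parent keeps looping
          my %partial = count_chunk($chunk);      # hypothetical: same counting
                                                  # loop, restricted to this chunk
          $pm->finish( 0, \%partial );            # ship partial counts back
      }
      $pm->wait_all_children;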

        The ultimate goal of this will be to pull data from a database and create the binaries on the fly through a series of calculations, and then use this script to determine the next series of data points to pull from the database. (This serves as a filter.)

        I have to ask:

        "ultimate goal is pulling data from a database"? Then why were you talking about these .txt files in the OP?

        "creating the binaries"? What are "binaries"?

        Why pull data from a database to do "calculations" (apparently external from the db) when a database can do efficient calculations for you?

        The problem lies in my actual file names and the way the column variable is assigned. I have a couple of file-name formats, unfortunately...

        Type 1 = combinationoftextnumbersandcharacters.fileextension
        Type 2 = combinationoftextnumbersandcharacters.combinationoftextnumbersandcharacters.fileextension

        In either case, only the first block is needed. The second block in Type 2 can be ignored as well as the file extension for both.
        I'm going to look at regular expressions and try to make that work.
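        A small sketch of the regular-expression part: capturing everything before the first dot covers both formats (the two example names below are made up). Note that NetWallah's code also uses the captured block as a numeric column index, so if the first block is no longer a plain number, the column position would have to come from the attribute file's header order instead.

        use strict;
        use warnings;

        # Keep only the first dot-delimited block of each file name.
        for my $fileext ( 'abc123xyz.ext', 'abc123xyz.run42.ext' ) {
            my ($block) = $fileext =~ m/^([^.]+)\./ or next;
            print "$fileext -> $block\n";
        }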

Re: Best way to store/access large dataset?
by erix (Parson) on Jun 22, 2018 at 16:14 UTC

    You still haven't really clearly said what the (filtering) query would look like.

    So after enlarging your attributes.txt to 1 million rows I used postgres as a front-end to read that whole file (via file_fdw).

    To read and "calculate" the whole file takes 2 seconds.

    This is the SQL I used:

    select attribute as attr
         , _1_file_ext + _4_file_ext + _13_file_ext + _16_file_ext as square
         , _2_file_ext + _5_file_ext + _11_file_ext + _12_file_ext as triangle
         , _3_file_ext + _6_file_ext + _7_file_ext  + _10_file_ext as circle
         , _8_file_ext + _9_file_ext + _14_file_ext + _15_file_ext as rectangle
    from public.atts

    A more interesting part would probably be a WHERE-clause, or possibly an ORDER BY clause, that you would need but I don't know how that would look from what you've said so far.

    UPDATE: I typoed the order of the column names so fixed that.

      The post-filtering would be (in words) something like: for category 1 (square, circle, whatever), list all attributes that occur greater than 75% of the time in the items listed in category 1, but less than 25% of the time in the items listed in categories 2, 3, 4, and 5, and less than 5% of the time in the items listed under category 6. (Ultimately each category will be set up with its own set of variables for custom-tailored percentages for each comparison.)

      The end output would be a list of attributes by category that are unique to that category.

      EDIT: And I think my issues with speed aren't here yet. I am anticipating it though as this transitions from reading from a set file, to gathering a series of raw values from the database and calculating the binary for the attributes.
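      A sketch of that filter in Perl, assuming the per-category counts and category sizes have already been gathered; the hashes, thresholds, and numbers below are invented for illustration:

      use strict;
      use warnings;

      # Invented example data: counts per attribute per category, and the
      # number of items in each category.
      my %count    = ( 12 => { Square => 35, Circle => 3, Triangle => 1 } );
      my %cat_size = ( Square => 40, Circle => 40, Triangle => 40 );

      my $target    = 'Square';
      my %max_other = ( Circle => 0.25, Triangle => 0.05 );   # per-category caps

      ATTR:
      for my $attr ( sort { $a <=> $b } keys %count ) {
          my $in_target = ( $count{$attr}{$target} // 0 ) / $cat_size{$target};
          next unless $in_target > 0.75;                  # >75% in the target
          for my $other ( keys %max_other ) {
              my $frac = ( $count{$attr}{$other} // 0 ) / $cat_size{$other};
              next ATTR if $frac >= $max_other{$other};   # too common elsewhere
          }
          print "attribute $attr passes the thresholds for $target\n";
      }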

Re: Best way to store/access large dataset?
by Veltro (Pilgrim) on Jun 22, 2018 at 18:13 UTC

    I would suggest you try a two-step approach.

    1. Get your data structures as you want them

    2. Process your data

    For example, build something like this first to get the data. (I show you a rather simple script here, but if the data is already coming from a database, I wonder why you don't pour it into the correct format there already...)

    use strict ;
    use warnings ;
    use Data::Dumper ;

    my $data  = () ;
    my $attrs = () ;

    sub getData {
        my ( $fileName, $type ) = split /\t/, $_[1] ;
        push @{$data}, $type unless !defined $fileName ;
    }

    sub getAttrs {
        my @attrs = split /\t/, $_[1] ;
        shift @attrs ;
        push @{$attrs}, \@attrs unless !defined $attrs[0] ;
    }

    # Gather data
    my $context = 0 ;
    my $counter = -1 ;
    while(<DATA>) {
        chomp ;
        if ( $_ =~ /ID\'s/ )      { $context = 1 ; $counter = -1 ; next ; }
        if ( $_ =~ /Attributes/ ) { $context = 2 ; $counter = -1 ; next ; }
        if ( $context == 1 && $counter == -1 ) {
            ++$counter ;
            next ;
        } elsif ( $context == 1 && $counter > -1 ) {
            getData($counter, $_) ;
            ++$counter ;
        }
        if ( $context == 2 && $counter == -1 ) {
            ++$counter ;
            next ;
        } elsif ( $context == 2 && $counter > -1 ) {
            getAttrs($counter, $_) ;
            ++$counter ;
        }
    } ;

    foreach ( @{$data} )       { print $_ . " " ; } print "\n" ;
    foreach ( @{$attrs->[0]} ) { print $_ . " " ; } print "\n" ;

    __DATA__

    Once you have collected your data, then move on to your algorithm. In the following example I have reduced the amount of input data to keep the output small, and I use hashes for their behavior. Further, I don't know what exactly you want with the 25/75% thingy, but you can easily add another counter to this algorithm and count the times a 0 is encountered. I would work from there if you want some statistical calculation or something.

    my @data = qw(Square Triangle Circle Square Triangle Circle Circle Rectangle
                  Rectangle Circle Triangle Triangle Square Rectangle Rectangle Square) ;
    $data  = \@data ;
    $attrs = [
        [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
        [1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    ] ;

    my @result;
    for( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
        my %subres ;
        @subres{@{$data}} = ( 0 ) x @{$attrs->[0]} ;
        for( my $i = 0 ; $i < @{$attrs->[$j]} ; ++$i ) {
            if ( $attrs->[$j][$i] == 1 ) {
                ++$subres{ $data->[$i] } ;
            }
        } ;
        push @result, \%subres ;
    }
    print Dumper(\@result) ;

    The output is:

    $VAR1 = [
              {
                'Square' => 4,
                'Circle' => 4,
                'Rectangle' => 4,
                'Triangle' => 0
              },
              {
                'Rectangle' => 0,
                'Triangle' => 0,
                'Circle' => 4,
                'Square' => 4
              }
            ];
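    For the "count the zeroes too" idea mentioned above, one way (a self-contained toy, trimmed down even further) is to keep two tallies per category:

    use strict ;
    use warnings ;
    use Data::Dumper ;

    my $data  = [ qw(Square Triangle Circle Square) ] ;   # trimmed item list
    my $attrs = [ [ 1, 0, 1, 0 ] ] ;                      # one attribute row

    my @result ;
    for my $row ( @{$attrs} ) {
        my ( %ones, %zeroes ) ;
        for my $i ( 0 .. $#{$row} ) {
            $row->[$i] ? ++$ones{ $data->[$i] } : ++$zeroes{ $data->[$i] } ;
        }
        push @result, { ones => \%ones, zeroes => \%zeroes } ;
    }
    print Dumper( \@result ) ;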

      I think I'm butchering this trying to get it to work. Is this the right approach to get my data in?

      use strict ;
      use warnings ;
      use Data::Dumper ;

      open my $data,"<","ID_file.txt" or die "NO ID FILE: $!";
      open my $attrs,"<","Attribute_file.txt" or die "NO ATTR FILE: $!";

      sub getdata {
          while(<$data>){
              my ( $fileName, $type ) = split /\t/, $_[1] ;
              push @{$data}, $type unless !defined $fileName ;
          }
      }

      sub getattrs {
          while(<$attrs>){
              my @attrs = split /\t/, $_[1] ;
              shift @attrs ;
              push @{$attrs}, \@attrs unless !defined $attrs[0] ;
          }
      }

      I know this isn't right, just not sure why.

        Try it with this; make sure that you remove the first two lines from both your data files or it won't work:

        use strict ;
        use warnings ;
        use Data::Dumper ;

        open my $dataIn1, "<", "ID_file.txt"        or die "NO ID FILE: $!";
        open my $dataIn2, "<", "Attribute_file.txt" or die "NO ATTR FILE: $!";

        my $data  = () ;
        my $attrs = () ;

        sub getdata {
            my ( $fileName, $type ) = split /\t/, $_[1] ;
            push @{$data}, $type unless !defined $fileName ;
        }

        sub getattrs {
            my @attrs = split /\t/, $_[1] ;
            shift @attrs ;
            push @{$attrs}, \@attrs unless !defined $attrs[0] ;
        }

        while( <$dataIn1> ) {
            chomp ;
            # In my previous example I used a counter which is not
            # available here, so that is why the first value is 0
            getdata( 0, $_ ) ;
        }

        while( <$dataIn2> ) {
            chomp ;
            getattrs( 0, $_ ) ;
        }

        print Dumper( $data ) ;
        print Dumper ( $attrs ) ;
Re: Best way to store/access large dataset?
by BrowserUk (Pope) on Jun 22, 2018 at 15:38 UTC

    So the first file is just a lookup of the (up to 200) names for each column; and the second file has one column per name, where each line contains one boolean attribute for each item, and there can be 1 million attributes?

    And you want to basically count the number of true attributes for each item?



      Yes, and sum the attribute counts by the assigned category from the first file.

      So if I have 200 items spread equally across 5 categories, I need a sum value for each of the 1 million attributes as they appear in the items grouped by category. So attribute 1 was found 6 times in the 40 items listed as category 1, 12 times in the 40 items listed as category 2...etc, for each category and each attribute.

Re: Best way to store/access large dataset?
by sundialsvc4 (Abbot) on Jun 22, 2018 at 13:15 UTC

    Whereas, what immediately pops into my mind is that you should use a database ... for example, “SQLite,” which is a universally-available database system that uses simple files – no database server.   While you might (or might not) find yourself having to go through some calisthenics to get the data into a proper database format, once you have done so it appears to me that many of your subsequent activities can readily be accomplished using database JOINs and summary features such as GROUP BY.   Furthermore, once you do get your data into that form, many tools – including any ubiquitous “spreadsheet” – might be put to the task.   (Data volumes and relationships such as you describe are “thoroughly un-exciting” to these database technologies.   They do this sort of thing for breakfast.   “A million rows?”   Yawn ...)
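    To make that concrete, here is a minimal sketch of the SQLite route via DBI/DBD::SQLite; the table layout (an items table plus one row per attribute/item observation) and all names here are invented for illustration, not a recommendation of a particular schema:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( "dbi:SQLite:dbname=attrs.db", "", "",
                            { RaiseError => 1, AutoCommit => 1 } );

    # Invented schema: items(item, category) and one row per
    # (attribute, item) observation with a 0/1 flag.
    $dbh->do(q{CREATE TABLE IF NOT EXISTS items (item TEXT PRIMARY KEY, category TEXT)});
    $dbh->do(q{CREATE TABLE IF NOT EXISTS observations (attribute INTEGER, item TEXT, present INTEGER)});

    # ... bulk-load the two files here (omitted) ...

    # Per-attribute, per-category totals via GROUP BY -- the summary
    # table discussed elsewhere in the thread.
    my $sth = $dbh->prepare(q{
        SELECT o.attribute, i.category, SUM(o.present) AS hits, COUNT(*) AS n_items
        FROM   observations o
        JOIN   items i ON i.item = o.item
        GROUP  BY o.attribute, i.category
    });
    $sth->execute;
    while ( my ($attr, $cat, $hits, $n) = $sth->fetchrow_array ) {
        printf "%s\t%s\t%d/%d\n", $attr, $cat, $hits, $n;
    }
    $dbh->disconnect;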

    In your grander investigations, I would also point you to the novel open-source statistical programming language, “R.”   It can do a great many useful things in ways that are a breath of fresh air versus other packages such as SPSS® and SAS®.

    So, if it seems that I seem to be nudging you away from Perl, actually I am not.   In my opinion, Perl is (still) the locomotive-of-choice for handling data files.   But, do not automatically jump to the conclusion that you will find yourself writing custom programs to do this entire job, because quite-likely you won’t have to.

      SQLite is the last thing that should come to mind when dealing with large datasets. I can't imagine why you would even think about using SQLite over Postgres, or, depending on your needs, loading it up into S3 and querying it using Athena, or DynamoDB. There's a wealth of technology out there made specifically for processing vast amounts of data and running calculations on it, relatively cheaply. SQLite is not one of them.


        I wish I knew more on this subject. If it makes a difference, there won't be any writing going on to the database during all of this; it is purely read. And from what little I've read, Postgres isn't recommended when you're looking for speed in purely read operations?
        The database doesn't exist outside of its current test iteration, so there is still time to change it. But I wouldn't know what would be a better option.

      EDIT: I realized that your response was to the title of the thread. So I should clarify that I was leaning towards what was the best way to read that database data into Perl to manipulate it, and whether it would need to be stored in a file instead of memory.

      The database that is (will be soon) housing the data is MariaDB. And I think getting that data will be fairly easy. The interface will allow the user to select the items and categories of interest. Which then will trigger the script to create the attribute list by using a series of qualifiers in the SELECT statement. (I'm way oversimplifying, but the database isn't ready for me to even start figuring out how that's going to look.)

      That initial pull of data will be around 1.8 billion calculations if the qualifiers are relatively simple. The qualifiers are user definable, so they could range from simple greater than/less than, to various combinations of percentages of different values from the database.
      Following that comes this script, which will ultimately perform an additional ~49 million calculations on the summary table to find the unique attributes. (A chain of greater-than/less-than qualifiers based on attribute count and category count for each attribute in each category.)

      While a spreadsheet can indeed handle this second lift, it takes quite a while, and isn't automated. (All of my proof of concept work has been done in 64 bit excel, which takes about 45 minutes to apply all of the calculations.)

      I've had a colleague trying to tackle this in R as well, but he's having limited success due to the data size and R's memory usage. I know he is making headway, but it's not his primary task. And my limited knowledge of Perl is still 100-fold more than my nonexistent knowledge of R.

      I may be wrong, but I see the whole chain of scripts taking quite a bit of time, so I'm wanting to streamline as much as possible in anticipation of having users stack up query requests, with each request being unique.
