PerlMonks  

Best way to store/access large dataset?

by Speed_Freak (Sexton)
on Jun 21, 2018 at 22:25 UTC (#1217139)
Speed_Freak has asked for the wisdom of the Perl Monks concerning the following question:

Hoping not to get flogged here, but I wanted to post the question tonight so maybe I would have a starting point when I come back in tomorrow.
I'm an absolute newb when it comes to programming, but I think Perl will be good for what I'm trying to do... and I have the books, so I'm hoping to fumble through this.

I'm trying to load in data from two files. One file has the category ID for each item I am interested in. (eg. item.1.ext = square, item.2.ext = circle, etc.)
The second file contains all the attributes for these items. Each attribute has a binary yes/no represented by 1 or 0. My files can have a couple hundred items, with a million attributes for each item.

What I am looking to do is find a good way to process through the attributes by category and score them. I was thinking that I would read in the files and attempt to create a count for each group, and then use the number of times the attribute was present in a category set over the number of items in that category to create a series of scoring criteria. (Like which attributes occur in a category more than 75% of the time, but less than 25% of the time in any other category. Basically looking for category-unique attributes.)
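The counting step described above might be sketched like this in Perl; the item names, categories, and 0/1 attribute rows below are made-up stand-ins for the two files, not real data:

```perl
use strict;
use warnings;

# Toy stand-ins for the two files (hypothetical values):
# item name => category, plus one 0/1 row per attribute.
my %category = (
    '1.file.ext' => 'Square',
    '2.file.ext' => 'Triangle',
    '3.file.ext' => 'Circle',
    '4.file.ext' => 'Square',
);
my @items     = ('1.file.ext', '2.file.ext', '3.file.ext', '4.file.ext');
my @attr_rows = (
    [1, 0, 1, 1],   # attribute 1
    [1, 1, 0, 1],   # attribute 2
);

# Number of items per category, needed for the percentage step.
my %cat_size;
$cat_size{$_}++ for values %category;

# $count[attribute]{category} = items in that category carrying the attribute
my @count;
for my $att (0 .. $#attr_rows) {
    for my $i (0 .. $#items) {
        $count[$att]{ $category{ $items[$i] } } += $attr_rows[$att][$i];
    }
}

# Fraction of Square items that carry attribute 1:
my $frac = ($count[0]{Square} // 0) / $cat_size{Square};
printf "attr 1 in Square: %d/%d = %.2f\n",
    $count[0]{Square} // 0, $cat_size{Square}, $frac;
```

From these fractions, the 75%/25% comparison is just a threshold test per category.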

But as I've learned with Perl, there are 7000 different ways to skin a cat, so I'm up for any suggestions. I'm trying to make this a fairly quick process because it will be repeated OFTEN. (Datasets will be ~200 items, 4-10 categories, and 1 million attributes.)

data example in readmore.

#ID's File
ID
1.file.ext	Square
2.file.ext	Triangle
3.file.ext	Circle
4.file.ext	Square
5.file.ext	Triangle
6.file.ext	Circle
7.file.ext	Circle
8.file.ext	Rectangle
9.file.ext	Rectangle
10.file.ext	Circle
11.file.ext	Triangle
12.file.ext	Triangle
13.file.ext	Square
14.file.ext	Rectangle
15.file.ext	Rectangle
16.file.et	Square

#Attributes
attribute  1.file.ext 2.file.ext 3.file.ext 4.file.ext 5.file.ext 6.file.ext 7.file.ext 8.file.ext 9.file.ext 10.file.ext 11.file.ext 12.file.ext 13.file.ext 14.file.ext 15.file.ext 16.file.et
1   1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 1
2   1 0 1 1 0 1 1 0 0 1 0 0 1 0 0 1
3   0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0
4   0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 0
5   0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0
6   0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0
7   0 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0
8   1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 1
9   0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0
10  0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0
11  0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0
12  1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 1
13  0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0
14  0 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0
15  0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0
16  1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1
17  1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1
18  0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0
19  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
20  0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 0
21  0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0
22  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
23  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
24  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
25  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26  1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 1
27  0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0
28  0 0 0 1 0 0 0 1 1 0 0 0 1 1 1 1
29  0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0
30  0 0 0 1 0 0 0 1 1 0 0 0 1 1 1 1

Replies are listed 'Best First'.
Re: Best way to store/access large dataset?
by NetWallah (Canon) on Jun 22, 2018 at 00:16 UTC
    This is essentially about a data structure to hold the data you need.

    Here is one way to do it. I was not clear on what analysis you wanted, so it is a simple sum:

    use strict;
    use warnings;

    my %info;
    open my $id_file, "<", "data01.txt" or die "NO ID FILE: $!";
    while (<$id_file>) {
        chomp;
        my ($fileext, $name) = split;
        my ($column) = $fileext =~ m/^(\d+)/ or next;
        $info{$fileext}{NAME}      = $name;
        $info{$fileext}{COLUMN}    = $column;
        $info{$fileext}{ATT_COUNT} = 0;
    }
    close $id_file;

    my @column_att;
    for (keys %info) {
        $column_att[ $info{$_}{COLUMN} ] = \$info{$_}{ATT_COUNT};  # For efficiency
    }

    open my $attr_file, "<", "data02.txt" or die "NO ATTR FILE: $!";
    while (<$attr_file>) {
        chomp;
        next unless m/^\d+\s+[01\s]+$/;
        my @atts = split;
        for (my $i = 1; $i <= $#atts; $i++) {
            ${ $column_att[$i] } += $atts[$i];
        }
        #$. < 4 and print "$. --- \n", Print_info();
    }
    close $attr_file;

    Print_info();
    exit 0;

    #--------------------------
    sub Print_info {
        for (sort { $info{$a}{COLUMN} <=> $info{$b}{COLUMN} } keys %info) {
            print "$_ \t", $info{$_}{ATT_COUNT}, " \t $info{$_}{NAME}\n";
        }
    }
    OUTPUT:
    1.file.ext   10  Square
    2.file.ext   12  Triangle
    3.file.ext   15  Circle
    4.file.ext   12  Square
    5.file.ext   12  Triangle
    6.file.ext   15  Circle
    7.file.ext   15  Circle
    8.file.ext   17  Rectangle
    9.file.ext   17  Rectangle
    10.file.ext  15  Circle
    11.file.ext  12  Triangle
    12.file.ext  12  Triangle
    13.file.ext  12  Square
    14.file.ext  17  Rectangle
    15.file.ext  17  Rectangle
    16.file.et   12  Square

                    Memory fault   --   brain fried

      Thanks for the response! I'm currently playing with your code, trying to get it to work on my dataset. Currently it just returns to the prompt after about 2.5 minutes without displaying anything. (I vaguely remember something about there being an issue creating a text file in Windows and then trying to read it in while working in Linux?)

      Once I figure out what I'm doing wrong, I'm going to attempt modifying it to create individual totals for each attribute by category. So the end output would be a list of each attribute in the first column, then each category would be listed across the top, then the totals for each attribute in each category would fill in the table.

      Like so:

      #Table  Square  Circle  Triangle  Rectangle
      1       4       4       0         4
      2       4       4       0         0
      3       0       0       4         4
      4       0       4       4         4
      5       0       0       4         0
      6       0       0       0         4
      7       0       4       0         4
      8       4       4       0         4
      9       0       0       0         4
      10      0       0       4         0
      11      0       0       4         4
      12      4       4       4         0
      13      0       4       0         0
      14      0       4       0         4
      15      0       4       0         0
      16      4       0       0         0
      17      4       0       0         0
      18      0       4       0         0
      19      4       4       4         4
      20      0       4       4         4
      21      0       0       0         4
      22      4       4       4         4
      23      4       4       4         4
      24      0       0       0         0
      25      0       0       0         0
      26      4       4       4         0
      27      0       0       4         0
      28      3       0       0         4
      29      0       0       0         4
      30      3       0       0         4

      The ultimate goal of this will be to pull data from a database, create the binaries on the fly through a series of calculations, and then use this script to determine the next series of data points to pull from the database. (This serves as a filter.) But with the database connections in mind, it seems like using threads to speed this up would not be recommended. So do you see a way to fork this? Or would forking not help in this case? I think I read that forking will chew up some more memory, but I think I can handle that overhead. (I have 20 cores/40 threads and 192 GB RAM to work with.)

        The ultimate goal of this will be to pull data from a database, create the binaries on the fly through a series of calculations, and then use this script to determine the next series of data points to pull from the database. (This serves as a filter.)

        I have to ask:

        "ultimate goal is pulling data from a database"? Then why were you talking about these .txt files in the OP?

        "creating the binaries"? What are "binaries"?

        Why pull data from a database to do "calculations" (apparently external from the db) when a database can do efficient calculations for you?

        The problem lies in my actual file names and the way the column variable is assigned. I have a couple types of file formats unfortunately...

        Type 1 = combinationoftextnumbersandcharacters.fileextension
        Type 2 = combinationoftextnumbersandcharacters.combinationoftextnumbersandcharacters.fileextension

        In either case, only the first block is needed. The second block in Type 2 can be ignored as well as the file extension for both.
        I'm going to look at regular expressions and try to make that work.
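        One way this could look: capture everything before the first dot, which handles both formats at once (the file names below are invented examples, not real data):

        ```perl
        use strict;
        use warnings;

        # Keep only the first dot-separated block of each file name; both
        # "block.ext" and "block.middle.ext" shapes reduce to "block".
        my %first_block;
        for my $name ('abc123XYZ.ext', 'abc123XYZ.def456.ext') {
            my ($first) = $name =~ /^([^.]+)\./;   # everything before the first dot
            $first_block{$name} = $first;
            print "$name -> $first\n";
        }
        ```

        `[^.]+` is "one or more non-dot characters", so the second block and the extension fall away regardless of how many dots follow.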

Re: Best way to store/access large dataset?
by erix (Parson) on Jun 22, 2018 at 16:14 UTC

    You still haven't really clearly said what the (filtering) query would look like.

    So after enlarging your attributes.txt to 1 million rows I used postgres as a front-end to read that whole file (via file_fdw).

    To read and "calculate" the whole file takes 2 seconds.

    This is the SQL I used:

    select attribute as attr
         , _1_file_ext + _4_file_ext + _13_file_ext + _16_file_ext as square
         , _2_file_ext + _5_file_ext + _11_file_ext + _12_file_ext as triangle
         , _3_file_ext + _6_file_ext + _7_file_ext  + _10_file_ext as circle
         , _8_file_ext + _9_file_ext + _14_file_ext + _15_file_ext as rectangle
      from public.atts

    A more interesting part would probably be a WHERE-clause, or possibly an ORDER BY clause, that you would need but I don't know how that would look from what you've said so far.

    UPDATE: I typoed the order of the column names so fixed that.

      The post filtering would be (in words) something like: for category 1 (square, circle, whatever), list all attributes that occur greater than 75% of the time in the items listed in category 1, but less than 25% of the time in the items listed in categories 2, 3, 4, and 5, and less than 5% of the time in the items listed under category 6. (Ultimately each category will be set up with its own set of variables for custom-tailored percentages for each comparison.)

      The end output would be a list of attributes by category that are unique to that category.
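      Given per-category hit counts and category sizes, that filter could be sketched as follows; the counts, sizes, and thresholds below are hypothetical, and every non-target category shares one threshold here for brevity:

      ```perl
      use strict;
      use warnings;

      # Hypothetical per-category hit counts for three attributes,
      # with four items in every category.
      my %cat_size = ( Square => 4, Circle => 4, Triangle => 4 );
      my %hits = (
          1 => { Square => 4, Circle => 0, Triangle => 0 },  # Square-unique
          2 => { Square => 4, Circle => 3, Triangle => 0 },  # too common in Circle
          3 => { Square => 2, Circle => 0, Triangle => 0 },  # not common enough
      );

      my $target = 'Square';
      my @unique;
      ATTR: for my $attr ( sort { $a <=> $b } keys %hits ) {
          for my $cat ( keys %cat_size ) {
              my $frac = ( $hits{$attr}{$cat} // 0 ) / $cat_size{$cat};
              if ( $cat eq $target ) {
                  next ATTR unless $frac > 0.75;   # must dominate the target
              }
              else {
                  next ATTR unless $frac < 0.25;   # must be rare elsewhere
              }
          }
          push @attr, () if 0;  # no-op; keeps ATTR label reachable for clarity
          push @unique, $attr;
      }
      print "unique to $target: @unique\n";
      ```

      Per-category thresholds (the 5% case for category 6) would just replace the two constants with a lookup keyed on category name.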

      EDIT: And I think my issues with speed aren't here yet. I am anticipating it though as this transitions from reading from a set file, to gathering a series of raw values from the database and calculating the binary for the attributes.

Re: Best way to store/access large dataset?
by Veltro (Pilgrim) on Jun 22, 2018 at 18:13 UTC

    I would suggest you try a two step approach.

    1. Get your data structures as you want them

    2. Process your data

    For example, build something like this first to get the data. (I show a rather simple script here, but I wonder: if the data is already coming from a database, why don't you pour it into the correct format there already...)

    use strict;
    use warnings;
    use Data::Dumper;

    my $data  = ();
    my $attrs = ();

    sub getData {
        my ( $fileName, $type ) = split /\t/, $_[1];
        push @{$data}, $type unless !defined $fileName;
    }

    sub getAttrs {
        my @attrs = split /\t/, $_[1];
        shift @attrs;
        push @{$attrs}, \@attrs unless !defined $attrs[0];
    }

    # Gather data
    my $context = 0;
    my $counter = -1;
    while (<DATA>) {
        chomp;
        if ( $_ =~ /ID\'s/ )      { $context = 1; $counter = -1; next; }
        if ( $_ =~ /Attributes/ ) { $context = 2; $counter = -1; next; }
        if    ( $context == 1 && $counter == -1 ) { ++$counter; next; }
        elsif ( $context == 1 && $counter > -1 )  { getData($counter, $_);  ++$counter; }
        if    ( $context == 2 && $counter == -1 ) { ++$counter; next; }
        elsif ( $context == 2 && $counter > -1 )  { getAttrs($counter, $_); ++$counter; }
    }

    foreach ( @{$data} )        { print $_ . " "; }
    print "\n";
    foreach ( @{$attrs->[0]} )  { print $_ . " "; }
    print "\n";

    __DATA__

    Once you have collected your data, then move on to your algorithm. In the following example I have reduced the amount of input data to reduce the output, and I use hashes for their behavior. Further, I don't know exactly what you want with the 25/75% thingy, but you can easily add another counter to this algorithm and count the times a 0 is encountered. I would work from there if you want some statistical calculation or something.

    my @data = qw(Square Triangle Circle Square Triangle Circle Circle Rectangle
                  Rectangle Circle Triangle Triangle Square Rectangle Rectangle Square);
    $data  = \@data;
    $attrs = [
        [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
        [1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    ];

    my @result;
    for ( my $j = 0; $j < @{$attrs}; ++$j ) {
        my %subres;
        @subres{ @{$data} } = ( 0 ) x @{ $attrs->[0] };
        for ( my $i = 0; $i < @{ $attrs->[$j] }; ++$i ) {
            if ( $attrs->[$j][$i] == 1 ) {
                ++$subres{ $data->[$i] };
            }
        }
        push @result, \%subres;
    }
    print Dumper(\@result);

    The output is:

    $VAR1 = [
              {
                'Square'    => 4,
                'Circle'    => 4,
                'Rectangle' => 4,
                'Triangle'  => 0
              },
              {
                'Rectangle' => 0,
                'Triangle'  => 0,
                'Circle'    => 4,
                'Square'    => 4
              }
            ];

      I think I'm butchering this trying to get it to work. Is this the right approach to get my data in?

      use strict;
      use warnings;
      use Data::Dumper;

      open my $data,  "<", "ID_file.txt"        or die "NO ID FILE: $!";
      open my $attrs, "<", "Attribute_file.txt" or die "NO ATTR FILE: $!";

      sub getdata {
          while (<$data>) {
              my ( $fileName, $type ) = split /\t/, $_[1];
              push @{$data}, $type unless !defined $fileName;
          }
      }

      sub getattrs {
          while (<$attrs>) {
              my @attrs = split /\t/, $_[1];
              shift @attrs;
              push @{$attrs}, \@attrs unless !defined $attrs[0];
          }
      }

      I know this isn't right, just not sure why.

        Try it with this; make sure that you remove the first two lines from both of your data files or it won't work:

        use strict;
        use warnings;
        use Data::Dumper;

        open my $dataIn1, "<", "ID_file.txt"        or die "NO ID FILE: $!";
        open my $dataIn2, "<", "Attribute_file.txt" or die "NO ATTR FILE: $!";

        my $data  = ();
        my $attrs = ();

        sub getdata {
            my ( $fileName, $type ) = split /\t/, $_[1];
            push @{$data}, $type unless !defined $fileName;
        }

        sub getattrs {
            my @attrs = split /\t/, $_[1];
            shift @attrs;
            push @{$attrs}, \@attrs unless !defined $attrs[0];
        }

        while ( <$dataIn1> ) {
            chomp;
            # In my previous example I used a counter, which is not
            # available here; that is why the first value is 0
            getdata( 0, $_ );
        }

        while ( <$dataIn2> ) {
            chomp;
            getattrs( 0, $_ );
        }

        print Dumper( $data );
        print Dumper( $attrs );
Re: Best way to store/access large dataset?
by BrowserUk (Pope) on Jun 22, 2018 at 15:38 UTC

    So the first file is just a lookup of the (up to 200) names for each column; and the second file has one column per name, and each line contains 1 boolean attribute for each item, and there can be 1 million attributes?

    And you want to basically count the number of true attributes for each item?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

      Yes, and sum the attribute counts by the assigned category from the first file.

      So if I have 200 items spread equally across 5 categories, I need a sum value for each of the 1 million attributes as they appear in the items grouped by category. So attribute 1 was found 6 times in the 40 items listed as category 1, 12 times in the 40 items listed as category 2...etc, for each category and each attribute.
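      That per-attribute, per-category accumulation might look something like this; the column-to-category mapping and the two attribute rows below are toy values standing in for the real files:

      ```perl
      use strict;
      use warnings;

      # Category of each item column (hypothetical: two Squares, two Circles),
      # and attribute rows shaped like the attributes file: number, then flags.
      my @col_cat = qw(Square Square Circle Circle);
      my @rows = (
          "1\t1 0 1 1",
          "2\t0 0 1 0",
      );

      # $sum{attribute}{category} = total hits for that attribute in that category
      my %sum;
      for my $row (@rows) {
          my ($attr, @flags) = split ' ', $row;       # whitespace split, tab included
          $sum{$attr}{ $col_cat[$_] } += $flags[$_] for 0 .. $#flags;
      }

      # One output line per attribute, one column per category.
      for my $attr ( sort { $a <=> $b } keys %sum ) {
          print join( "\t", $attr,
              map { $sum{$attr}{$_} // 0 } qw(Square Circle) ), "\n";
      }
      ```

      With ~200 columns and a million rows this stays a single streaming pass, so memory holds only the running sums, not the whole attribute matrix.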


Node Type: perlquestion [id://1217139]
Approved by Athanasius
Front-paged by haukex