PerlMonks
Re^3: Best way to store/access large dataset?

by Veltro (Hermit)
on Jun 22, 2018 at 20:30 UTC ( [id://1217258] )


in reply to Re^2: Best way to store/access large dataset?
in thread Best way to store/access large dataset?

Try it with this; make sure that you remove the first two lines from both of your data files or it won't work:

use strict ;
use warnings ;
use Data::Dumper ;

open my $dataIn1, "<", "ID_file.txt" or die "NO ID FILE: $!";
open my $dataIn2, "<", "Attribute_file.txt" or die "NO ATTR FILE: $!";

my $data = () ;
my $attrs = () ;

sub getdata {
    my ( $fileName, $type ) = split /\t/, $_[1] ;
    push @{$data}, $type unless !defined $fileName ;
}

sub getattrs {
    my @attrs = split /\t/, $_[1] ;
    shift @attrs ;
    push @{$attrs}, \@attrs unless !defined $attrs[0] ;
}

while( <$dataIn1> ) {
    chomp ;
    # In my previous example I used a counter which is not
    # available here, so that is why the first value is 0
    getdata( 0, $_ ) ;
}

while( <$dataIn2> ) {
    chomp ;
    getattrs( 0, $_ ) ;
}

print Dumper( $data ) ;
print Dumper( $attrs ) ;

Replies are listed 'Best First'.
Re^4: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Jun 25, 2018 at 14:33 UTC

    I can't quite visualize it, but what you're doing is assigning each file its category name and carrying that forward, right? And then it just counts up the "hits" in each category for each attribute.

    Just for general knowledge, on a 3/4 size data set, it takes approximately 16 minutes before the dumper starts printing to screen. That's where I was wondering if this was the type of thing that could be forked? Also, is $j an arbitrary variable, or is it special? And $i is a special variable right? I was hoping to shoehorn the attribute ID into the data structure in order to use it in an output at the end of this.

    This works:

    use strict ;
    use warnings ;
    use Data::Dumper ;

    open my $dataIn1, "<", "Attribute_ID.txt" or die "NO ID FILE: $!";
    open my $dataIn2, "<", "Attributes.txt" or die "NO ATTR FILE: $!";

    my $data = () ;
    my $attrs = () ;

    sub getdata {
        my ( $fileName, $type ) = split /\t/, $_[1] ;
        push @{$data}, $type unless !defined $fileName ;
    }

    sub getattrs {
        my @attrs = split /\t/, $_[1] ;
        shift @attrs ;
        push @{$attrs}, \@attrs unless !defined $attrs[0] ;
    }

    while( <$dataIn1> ) { chomp ; getdata( 0, $_ ) ; }
    while( <$dataIn2> ) { chomp ; getattrs( 0, $_ ) ; }

    my @result;
    for( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
        my %subres ;
        @subres{@{$data}} = ( 0 ) x @{$attrs->[0]} ;
        for( my $i = 0 ; $i < @{$attrs->[$j]} ; ++$i ) {
            if ( $attrs->[$j][$i] == 1 ) {
                ++$subres{ $data->[$i] } ;
            }
        }
        push @result, \%subres ;
    }
    print Dumper( \@result ) ;
      ...what you're doing is assigning each file it's category name and carrying that forward right?...

      I'm not really assigning anything. In your example, each row of the ID file corresponds to exactly one column in Attributes. So I used this to keep the code simple:

      1.file.ext Square --> corresponds to column 2 in Attributes
      2.file.ext Triangle --> corresponds to column 3 in Attributes
      ...
      16.file.ext Square --> corresponds to column 17 in Attributes

      ...Also, is $j an arbitrary variable, or is it special? And $i is a special variable right?...

      There is nothing 'special' about $i and $j; they are just used to traverse the data array and the multi-dimensional attrs array. In this case I used $j to address each attribute set in attrs, and $i to address each element in data and each individual attribute of the sub-sets inside attrs.

      ...I was hoping to shoehorn the attribute ID into the data structure in order to use it in an output at the end of this...

      To get the ID in the data set, you can make these changes. I'm just adding it to the final result set with the key 'ID' in this case. (Line number followed by: < = remove and > = add):

      18 <     shift @attrs ;
      35 >     $subres{ID} = $attrs->[0] ;
      36 <     for( my $i = 0 ; $i < @{$attrs->[$j]} ; ++$i ) {
      36 >     for( my $i = 1 ; $i < @{$attrs->[$j]} ; ++$i ) {
      38 <     ++$subres{ $data->[$i]} ;
      38 >     ++$subres{ $data->[$i-1]} ;

      Line 18 removed the row ID from the attribute set, so we no longer do that. That means the for loop must start at index 1 instead of 0 (line 36). However, the indexing in data has not changed, so we have to subtract 1 with $i-1 (line 38).

      ...If I get rid of the first line in the second file, I'll lose the file name associated with the binary...

      I'm not sure what you mean by this association. Is it the order of appearance inside data that changes? If so, I suggest a small piece of code that alters that order based on the column order inside Attributes.
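      As a sketch of what I mean (the file names and categories here are made up for illustration), you could sort the ID rows on the numeric prefix of the file name before building data, so that index 0 always corresponds to column 2 of Attributes, index 1 to column 3, and so on:

```perl
use strict;
use warnings;

# Hypothetical rows of the ID file, possibly out of order.
my @rows = (
    "3.file.ext\tCircle",
    "1.file.ext\tSquare",
    "2.file.ext\tTriangle",
);

# Sort on the numeric prefix of the file name, then keep only
# the category column, so positions match the Attributes columns.
my @data =
    map  { ( split /\t/ )[1] }
    sort { ( $a =~ /^(\d+)/ )[0] <=> ( $b =~ /^(\d+)/ )[0] } @rows;

# @data is now ( "Square", "Triangle", "Circle" )
```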

      ...Which would make them not be able to be grouped by category?...

      What needs to be grouped? Do you have examples?

      ...And it's probably also important to point out that the attribute numbers aren't arbitrary, they are defined....

      What do you mean by defined? In your example you show attributes that are binary; they are either 0 or 1. If there is something specific that needs to be done, can you try to visualize that?

        I am getting the following field in the dumper, 'ID' => $VAR1->[0]{'ID'}, added to the top of the output list for each attribute:

        $VAR1 = [
            {
                'ID' => [ '1', 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1 ],
                'Circle' => 4,
                'Triangle' => 0,
                'Rectangle' => 4,
                'Square' => 4
            },
            {
                'ID' => $VAR1->[0]{'ID'},
                'Circle' => 4,
                'Triangle' => 0,
                'Rectangle' => 0,
                'Square' => 4
            },

        I am using the following script:

        use strict ;
        use warnings ;
        use Data::Dumper ;

        open my $dataIn1, "<", "Attributes_ID.txt" or die "NO ID FILE: $!";
        open my $dataIn2, "<", "Attributes.txt" or die "NO ATTR FILE: $!";

        my $data = () ;
        my $attrs = () ;

        sub getdata {
            my ( $fileName, $type ) = split /\t/, $_[1] ;
            push @{$data}, $type unless !defined $fileName ;
        }

        sub getattrs {
            my @attrs = split /\t/, $_[1] ;
            #shift @attrs ;
            push @{$attrs}, \@attrs unless !defined $attrs[0] ;
        }

        while( <$dataIn1> ) { chomp ; getdata( 0, $_ ) ; }
        while( <$dataIn2> ) { chomp ; getattrs( 0, $_ ) ; }

        my @result;
        for( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
            my %subres ;
            @subres{@{$data}} = ( 0 ) x @{$attrs->[0]} ;
            $subres{ID} = $attrs->[0] ;
            for( my $i = 1 ; $i < @{$attrs->[$j]} ; ++$i ) {
                if ( $attrs->[$j][$i] == 1 ) {
                    ++$subres{ $data->[$i-1] } ;
                }
            }
            push @result, \%subres ;
        }
        print Dumper( \@result ) ;

        I'll keep looking at that to see why it isn't carrying the attribute ID forward. But I wanted to ask some more questions, and answer yours!
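        Editor's note, as a guess from reading the posted code: $subres{ID} = $attrs->[0] stores a reference to the whole first row on every pass through the loop, which is why Dumper prints the repeats as $VAR1->[0]{'ID'}. Taking the first element of row $j instead ($attrs->[$j][0]) would give each attribute its own ID. A minimal illustration with made-up rows:

```perl
use strict;
use warnings;

# Two hypothetical attribute rows: first element is the ID.
my @attrs = ( [ 1, 1, 0 ], [ 2, 0, 1 ] );

# Same reference to the first row every time (the symptom above).
my @ids_wrong = map { $attrs[0] }      0 .. $#attrs;

# One scalar ID per row: ( 1, 2 ).
my @ids_right = map { $attrs[$_][0] }  0 .. $#attrs;
```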

        Within the code you have written, is it possible to move the attribute ID "outside" of the grouped data? As in:

        {5} { 'Circle' => 0, 'Triangle' => 0, 'Rectangle' => 0, 'Square' => 0 },
        {6} { 'Circle' => 0, 'Triangle' => 0, 'Rectangle' => 0, 'Square' => 0 },
        {7} { 'Circle' => 4, 'Triangle' => 4, 'Rectangle' => 0, 'Square' => 4 },

        The above may answer your question about the attribute ID's being defined. The numbers in the left column of the attribute demo dataset (1-30) are identifiers for that attribute. They could be names, or serial numbers...etc. But they are how I identify that attribute so I can look at it later. At the end of this, I actually need a list of the attributes that pass a True/False statement based on a series of percentages.

        This is where the grouping comes in. Your script groups the datasets by category by shifting the category in place of the file name, which works great, as I am not so concerned with carrying the file name forward. The categories in this case are "Square", "Circle", "Rectangle", and "Triangle". What I would need to do then is look at each attribute. So for attribute 7 in the code block above, I would have a series of True/False statements, one per category: "Does this attribute occur in Circle more than 50% of the time, and less than 10% of the time in Triangle, Rectangle, and Square?" Then I would ask the same question about that attribute for the next category: "Does this attribute occur in Square more than 50% of the time, and less than 10% of the time in Triangle, Rectangle, and Circle?" And so on for each unique category identified in the Attributes_ID file.

        At the end of that, I would generate a list of attributes that scored "True" for each category.
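        Editor's note: the True/False test described above can be sketched directly against the per-category counts the earlier script already produces. The counts, totals, and thresholds below are made up for illustration; in practice %hits would be one %subres entry and %total the number of files per category:

```perl
use strict;
use warnings;

# Hypothetical counts for one attribute, and files per category.
my %hits  = ( Circle => 4, Triangle => 0, Rectangle => 0, Square => 0 );
my %total = ( Circle => 4, Triangle => 4, Rectangle => 4, Square => 4 );

# True if the attribute occurs in $target more than 50% of the time
# and less than 10% of the time in every other category.
sub passes {
    my ( $target, $hits, $total ) = @_;
    return 0 unless $hits->{$target} / $total->{$target} > 0.5;
    for my $cat ( keys %$hits ) {
        next if $cat eq $target;
        return 0 unless $hits->{$cat} / $total->{$cat} < 0.1;
    }
    return 1;
}

# Which categories does this attribute score "True" for?
my @true_for = grep { passes( $_, \%hits, \%total ) } sort keys %hits;
# With these made-up numbers, only Circle passes (4/4 vs 0 elsewhere).
```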

        Which brings me to my two questions.
        1) Would it be better to "melt" this data? Create a four column data structure that consists of 1)File Name 2)Category Name, 3)Attribute ID, 4)Binary Value.

        FILE        CATEGORY  ATTRIB  SCORE
        1.file.ext  Square    1       1
        2.file.ext  Triangle  1       0
        3.file.ext  Circle    1       1
        4.file.ext  Square    1       1
        5.file.ext  Triangle  1       0
        etc...
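        Editor's note: option 1 (the "melt") is straightforward to sketch from the structures the earlier script builds; the categories and rows below are made up, and the file name could be carried along the same way if it were kept alongside the category list:

```perl
use strict;
use warnings;

# Hypothetical in-memory versions of the two files: @data holds the
# category per column, @attrs holds [ ID, score, score, ... ] rows.
my @data  = qw( Square Triangle Circle );
my @attrs = ( [ 1, 1, 0, 1 ], [ 2, 0, 0, 1 ] );

# Melt into long-format rows of ( category, attribute ID, score ).
my @melted;
for my $row (@attrs) {
    my ( $id, @scores ) = @$row;
    push @melted, [ $data[$_], $id, $scores[$_] ] for 0 .. $#scores;
}
# @melted now holds one row per (file column, attribute) pair,
# e.g. [ 'Square', 1, 1 ], [ 'Triangle', 1, 0 ], ...
```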

        Or, 2) Would it be better to do this one line at a time with the True/False qualifiers built into the loop? As in, read in the first attribute row, with the categories, and evaluate the attribute for each category, storing that True/False for each category before moving to the next attribute? (It would be good to note that the categories change, but are defined in the attribute_ID file. So it would be based on unique entries there.)

        Down the rabbit hole!

Re^4: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Jun 25, 2018 at 13:10 UTC

    If I get rid of the first line in the second file, I'll lose the file name associated with the binary. Which would make them not be able to be grouped by category? And it's probably also important to point out that the attribute numbers aren't arbitrary, they are defined. I can always sort my input file so they are listed in order which would be a workaround if I can't carry the numbers forward.
