I am getting the field 'ID' => $VAR1->[0]{'ID'}, added at the top of the output for every attribute after the first one:
$VAR1 = [
          {
            'ID' => [
                      '1',
                      1,
                      0,
                      1,
                      1,
                      0,
                      1,
                      1,
                      1,
                      1,
                      1,
                      0,
                      0,
                      1,
                      1,
                      1,
                      1
                    ],
            'Circle' => 4,
            'Triangle' => 0,
            'Rectangle' => 4,
            'Square' => 4
          },
          {
            'ID' => $VAR1->[0]{'ID'},
            'Circle' => 4,
            'Triangle' => 0,
            'Rectangle' => 0,
            'Square' => 4
          },
I am using the following script:
use strict;
use warnings;
use Data::Dumper;

open my $dataIn1, "<", "Attributes_ID.txt" or die "NO ID FILE: $!";
open my $dataIn2, "<", "Attributes.txt"    or die "NO ATTR FILE: $!";

my $data;     # category for each file, in column order
my $attrs;    # one array ref per attribute row

sub getdata {
    my ( $fileName, $type ) = split /\t/, $_[1];
    push @{$data}, $type if defined $fileName;
}

sub getattrs {
    my @attrs = split /\t/, $_[1];
    #shift @attrs ;
    push @{$attrs}, \@attrs if defined $attrs[0];
}

while ( <$dataIn1> ) {
    chomp;
    getdata( 0, $_ );
}

while ( <$dataIn2> ) {
    chomp;
    getattrs( 0, $_ );
}

my @result;
for ( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
    my %subres;
    @subres{ @{$data} } = ( 0 ) x @{ $attrs->[0] };
    $subres{ID} = $attrs->[0];
    for ( my $i = 1 ; $i < @{ $attrs->[$j] } ; ++$i ) {
        if ( $attrs->[$j][$i] == 1 ) {
            ++$subres{ $data->[ $i - 1 ] };
        }
    }
    push @result, \%subres;
}
print Dumper( \@result );
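As a side note for anyone following along: from what I can tell, Data::Dumper prints $VAR1->[0]{'ID'} whenever the exact same reference appears more than once in the structure, instead of repeating its contents. A minimal self-contained demo, unrelated to my data files:

```perl
use strict;
use warnings;
use Data::Dumper;

# Two hashes hold the *same* array reference under 'ID'.
my $row = [ 1, 0, 1 ];
my @result = (
    { ID => $row, Circle => 4 },
    { ID => $row, Circle => 0 },    # same reference as above
);

# The first occurrence is dumped in full; the second is printed as a
# cross-reference of the form $VAR1->[0]{'ID'} rather than a copy.
print Dumper( \@result );
```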
I'll continue looking at that to see if I can see why it isn't carrying the attribute ID forward. But I wanted to ask some more questions, and answer yours!
Within the code you have written, is it possible to move the attribute ID "outside" of the grouped data? As in:
{5} {
    'Circle' => 0,
    'Triangle' => 0,
    'Rectangle' => 0,
    'Square' => 0
},
{6} {
    'Circle' => 0,
    'Triangle' => 0,
    'Rectangle' => 0,
    'Square' => 0
},
{7} {
    'Circle' => 4,
    'Triangle' => 4,
    'Rectangle' => 0,
    'Square' => 4
},
The above may answer your question about the attribute IDs being defined. The numbers in the left column of the attribute demo dataset (1-30) are identifiers for each attribute. They could be names, serial numbers, etc., but they are how I identify an attribute so I can look at it later. At the end of this, I actually need a list of the attributes that pass a True/False test based on a series of percentages.
This is where the grouping comes in. Your script groups the datasets by category, shifting the category into place of the file name, which works great, as I am not concerned with carrying the file name forward. The categories in this case are "Square," "Circle," "Rectangle," and "Triangle." What I need to do then is look at each attribute. So for attribute 7 in the code block above, I would ask a series of True/False questions, one per category: "Does this attribute occur in Circle more than 50% of the time, and less than 10% of the time in Triangle, Rectangle, and Square?" Then I would ask the same question about that attribute for the next category: "Does this attribute occur in Square more than 50% of the time, and less than 10% of the time in Triangle, Rectangle, and Circle?" And so on for each unique category identified in the Attributes_ID file.
At the end of that, I would generate a list of attributes that scored "True" for each category.
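To make sure I'm describing the test precisely, here is a sketch of the check I mean for a single attribute. The category counts and totals below are made-up numbers, not my real data:

```perl
use strict;
use warnings;

# Hypothetical counts for one attribute: in how many files of each
# category it occurs (invented values for illustration).
my %count = ( Circle => 4, Triangle => 0, Rectangle => 0, Square => 0 );
# Hypothetical totals: how many files belong to each category overall.
my %total = ( Circle => 6, Triangle => 5, Rectangle => 5, Square => 6 );

# "Does this attribute occur in $target more than 50% of the time,
#  and less than 10% of the time in every other category?"
sub passes {
    my ( $target, $count, $total ) = @_;
    return 0 unless $count->{$target} / $total->{$target} > 0.50;
    for my $cat ( keys %$count ) {
        next if $cat eq $target;
        return 0 unless $count->{$cat} / $total->{$cat} < 0.10;
    }
    return 1;
}

for my $cat ( sort keys %count ) {
    printf "%-10s %s\n", $cat, passes( $cat, \%count, \%total ) ? "True" : "False";
}
```

With these invented numbers only Circle scores True (4/6 is above 50%, and the attribute never appears elsewhere).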
Which brings me to my two questions.
1) Would it be better to "melt" this data, i.e. create a four-column data structure consisting of 1) file name, 2) category name, 3) attribute ID, 4) binary value?
FILE        CATEGORY  ATTRIB  SCORE
1.file.ext  Square    1       1
2.file.ext  Triangle  1       0
3.file.ext  Circle    1       1
4.file.ext  Square    1       1
5.file.ext  Triangle  1       0
etc...
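If melting is the way to go, I imagine building it something like this (the file names, categories, and rows below are invented placeholders, not my real files):

```perl
use strict;
use warnings;

# Hypothetical inputs: one category and file name per column, as they
# would come from Attributes_ID.txt, plus two attribute rows where the
# first field is the attribute ID and the rest are 0/1 scores.
my @categories = ( 'Square', 'Triangle', 'Circle' );
my @files      = ( '1.file.ext', '2.file.ext', '3.file.ext' );
my @attr_rows  = (
    [ 1, 1, 0, 1 ],
    [ 2, 0, 1, 1 ],
);

# Melt into the long format: one record per file/attribute pair.
my @melted;
for my $row (@attr_rows) {
    my ( $id, @scores ) = @$row;
    for my $i ( 0 .. $#scores ) {
        push @melted, {
            FILE     => $files[$i],
            CATEGORY => $categories[$i],
            ATTRIB   => $id,
            SCORE    => $scores[$i],
        };
    }
}

# Each record now carries everything needed for per-category percentages.
printf "%-12s %-10s %-7s %s\n", @{$_}{qw(FILE CATEGORY ATTRIB SCORE)} for @melted;
```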
Or, 2) would it be better to do this one line at a time, with the True/False qualifiers built into the loop? That is, read in the first attribute row with the categories, evaluate the attribute for each category, and store that True/False for each category before moving to the next attribute? (It is worth noting that the categories can change, but they are defined in the Attributes_ID file, so it would be based on the unique entries there.)
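Option 2, as I picture it, would look something like this. The categories, thresholds, and inline rows are invented placeholders standing in for my Attributes_ID.txt and Attributes.txt:

```perl
use strict;
use warnings;

# Hypothetical per-file categories, as read from Attributes_ID.txt.
my @categories = ( 'Square', 'Triangle', 'Circle', 'Square' );
my %total;                     # files per category
$total{$_}++ for @categories;

# Invented rows standing in for Attributes.txt: ID, then 0/1 per file.
my @lines = ( "7\t1\t0\t0\t1", "8\t0\t1\t0\t0" );

my %passed;    # category => list of attribute IDs that scored True
for my $line (@lines) {
    my ( $id, @scores ) = split /\t/, $line;

    # Tally this one attribute per category, then test it immediately,
    # so only the running results are kept in memory.
    my %count;
    $count{ $categories[$_] } += $scores[$_] for 0 .. $#scores;

    for my $target ( sort keys %total ) {
        next unless ( $count{$target} // 0 ) / $total{$target} > 0.50;
        next if grep { $_ ne $target
                       && ( $count{$_} // 0 ) / $total{$_} >= 0.10 } keys %total;
        push @{ $passed{$target} }, $id;
    }
}

print "$_: @{ $passed{$_} }\n" for sort keys %passed;
```

With these invented rows, attribute 7 scores True for Square only and attribute 8 scores True for Triangle only.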
Down the rabbit hole!