Count the Duplicate Entries and make them uniq

slayedbylucifer has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am working on CSV and need to manipulate it as follows. Please note that these are test files I created for explanation on this thread. However the essence of the problem is the same.

I have below test.csv:

OS,RELEASE,LIKE
Ubuntu,Warty,No
Ubuntu,Hoary,No
Ubuntu,Breezy,Yes
Ubuntu,Breazy,Yes
Fedora,Yarrow,Yes
Fedora,Stentz,No
Fedora,Yarrow,Yes
Fedora,Yarrow,Yes
Windows,XP PRO,Yes
Windows,XP PRO,Yes
Windows,XP Home,No
Windows,XP PRO,Yes
[download]

Please note that there are rows which has duplicate entries.

I want to create report.csv as follows by referring above table:

OS,RELEASE,Yes,No
Ubuntu,Warty,0,1
Ubuntu,Hoary,0,1
Ubuntu,Breezy,2,0
Fedora,Yarrow,3,0
Fedora,Stentz,0,1
Windows,XP PRO,3,0
Windows,XP Home,0,1
[download]

Meaning, e.g. I need to create entries for OS and RELEASE and count the times It was liked as "YES" or "NO".

So this is the flow I am following:

1. Create hash of array with "OS name" as hash value and "Array reference" to the array containing RELASE versions for that OS as the values.

2. Run a "SELECT" query by looping over this Hash and print the corresponding the RELEASE and LIKE values

3. unique/count the output received in step 2 and push it to the report.csv

I am stuck at step no. 3. I am not able to unique and count the entries which were received from step 2. I would not say that my code is not working. i would rather say I am not able to think of a way to make it work. Basically, I am not able to think of the logic how to get this done.

Sorry for the long and confusing explanation. I am pasting my code below but I am pretty sure it would add to more confusion. Also you may find lots of bugs in this code

#!/usr/bin/perl -w

use strict;
use Data::Dumper;
use DBI;

my @os;
my %duplicate;
my %hash1;

my $dbh = DBI->connect ("dbi:CSV:") or die "Cannot connect to the CSV 
+file: $DBI::errstr()";
$dbh->{RaiseError} = 1;
$dbh->{TraceLevel} = 0;

my $table = "test.csv";

my $query = "select OS from $table";
my $sth = $dbh->prepare ($query);
$sth->execute();

while ( my $row_ref = $sth->fetch )
{
    (my $os = $row_ref->[0]);
    push (@os, $os) unless $duplicate{$os}++ ;
}

foreach my $operating_system (@os) 
{
    my @array;
    my $query = "select RELEASE from $table where OS = ?";
    my $sth = $dbh->prepare ($query);
    $sth->execute($operating_system);
    while (my $ref = $sth->fetch )
    {
        my $var = $ref->[0];
        push (@array, $var) unless $duplicate{$var}++ ;  ### removing 
+the duplicate release names
    }
    $hash1{$operating_system} = \@array;   
}

my @new;

foreach my $operating_system (@os) 
{ 
     foreach my $arr_ref ($hash1{$operating_system})
     {
         my $arr_index = @{$hash1{$operating_system}}-1; 
         for (0..$arr_index)
         {
            my $release = $hash1{$operating_system}->[$_];
            my $query = "select LIKE from $table where OS = ? and RELE
+ASE = ?";
            my $sth = $dbh->prepare ($query);
            $sth->execute ($operating_system, $release);
            while ( my $row_ref = $sth->fetch )
            {
               push (@new, $release.",".$hash{$row_ref->[0]}); 
            }
            $sth->finish ();
         }
     }
}

#############  >>>> from here below I need to figure out the unique en
+tries and count the times it were repeated and push it to report.csv

foreach my $var (@new)
{
  my $count = 0;
  if ( ! $duplicate{$var}++ )
  {
      print $var.$count."\n";
  }
}
$sth->finish();
$dbh->disconnect()
[download]

Thanks.

Comment on Count the Duplicate Entries and make them uniq Select or Download Code

Replies are listed 'Best First'.
Re: Count the Duplicate Entries and make them uniq by Corion (Patriarch) on Aug 29, 2012 at 11:37 UTC
Why are you trying to do a unique count in Perl instead of using SQL to do that? See SQL count and/or SQL count distinct. Also, SQL has the `DISTINCT` keyword if you need to find the unique values for a column. See SQL DISTINCT. Also see perlfaq4 on "duplicate" on how to find/remove duplicates.	[reply] [d/l]
Re: Count the Duplicate Entries and make them uniq by choroba (Cardinal) on Aug 29, 2012 at 12:13 UTC
In a real database, you would be able to do something like `select distinct OS, RELEASE, (select count(1) from $table as y where x.OS=y.OS and x.RELEASE=y.RELEA +SE and y.LIKE='Yes'), (select count(1) from $table as y where x.OS=y.OS and x.RELEASE=y.RELEA +SE and y.LIKE='No') from $table as x;` [download] For some reason, it works in SQLite, but CSV returns `DBD::CSV::db prepare failed: Unbalanced parentheses! at .../perl/lib/p +erl5/SQL/Statement.pm line 88` [download] لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l] [select]
Re: Count the Duplicate Entries and make them uniq by scorpio17 (Canon) on Aug 29, 2012 at 13:33 UTC
A good rule of thumb is to always make the database do as much of the work as possible. For example, instead of reading data and then sorting it, add an 'ORDER BY' clause to your SELECT statement and read the already sorted data. In this case, you don't really need a 'database' at all. Reading the columns of data from a text file would look something like this: use strict; my %data; my $headers = <DATA>; while ( my $line = <DATA>) { chomp $line; my ($os,$release,$like) = split(/,/,$line); ++$data{$os}{$release}{ uc $like }; } for my $os (sort keys %data) { for my $release (sort keys %{ $data{$os} } ) { my $yes = $data{$os}{$release}{'YES'} \|\| '0'; my $no = $data{$os}{$release}{'NO'} \|\| '0'; print "$os, $release, $yes, $no\n"; } } __DATA__ OS,RELESASE,LIKE Ubuntu,Warty,No Ubuntu,Hoary,No Ubuntu,Breezy,Yes Ubuntu,Breazy,Yes Fedora,Yarrow,Yes Fedora,Stentz,No Fedora,Yarrow,Yes Fedora,Yarrow,Yes Windows,XP PRO,Yes Windows,XP PRO,Yes Windows,XP Home,No Windows,XP PRO,Yes [download] If you really want to use a database, then you could pull the data out like this: `my $sth = $dbh->prepare("select OS, RELEASE, LIKE from $table"); $sth->execute(); while ( my ($os, $release, $like) = $sth->fetchrow ) { ++$data{$os}{$release}{$like}; }` [download]	[reply] [d/l] [select]
Re^2: Count the Duplicate Entries and make them uniq by slayedbylucifer (Scribe) on Aug 29, 2012 at 15:19 UTC
Thank. I am working on it. Will keep you posted. Thanks for your time.	[reply]
Re^2: Count the Duplicate Entries and make them uniq by slayedbylucifer (Scribe) on Aug 30, 2012 at 04:25 UTC
Hello corpio17, Thanks for your response. However, for the time being, I am good with the solution provided by philiprbrenan. However, I will definitely explore what you have suggested as it is completelly works without the SQL statements. Thanks.	[reply]
Re: Count the Duplicate Entries and make them uniq by philiprbrenan (Monk) on Aug 29, 2012 at 13:39 UTC
Build a data structure from the relational layout you have provided aggregating the Yes and No entries and then walk the resulting structure to print the desired results. For safety you could add some data validation. use feature ":5.14"; use warnings FATAL => qw(all); use strict; use Data::Dump qw(dump pp); my @l = split /\n/, << 'END'; Ubuntu,Warty,No Ubuntu,Hoary,No Ubuntu,Breezy,Yes Ubuntu,Breazy,Yes Fedora,Yarrow,Yes Fedora,Stentz,No Fedora,Yarrow,Yes Fedora,Yarrow,Yes Windows,XP PRO,Yes Windows,XP PRO,Yes Windows,XP Home,No Windows,XP PRO,Yes END my $h; for(@l) {my @w = split /,/; $h->{$w[0]}{$w[1]}{$w[2]}++; } pp($h); for my $o(sort keys %$h) {for my $r(sort keys %{$h->{$o}}) {my %yn = %{$h->{$o}{$r}}; my ($yes, $no) = ($yn{Yes}//0, $yn{No}//0); say "$o\t$r\t$yes\t$no"; } } [download] Produces { Fedora => { Stentz => { No => 1 }, Yarrow => { Yes => 3 } }, Ubuntu => { Breazy => { Yes => 1 }, Breezy => { Yes => 1 }, Hoary => { No => 1 }, Warty => { No => 1 }, }, Windows => { "XP Home" => { No => 1 }, "XP PRO" => { Yes => 3 } }, } Fedora Stentz 0 1 Fedora Yarrow 3 0 Ubuntu Breazy 1 0 Ubuntu Breezy 1 0 Ubuntu Hoary 0 1 Ubuntu Warty 0 1 Windows XP Home 0 1 Windows XP PRO 3 0	[reply] [d/l]
Re^2: Count the Duplicate Entries and make them uniq by slayedbylucifer (Scribe) on Aug 29, 2012 at 15:20 UTC
Thanks for your time. I am working on it. I will update this thread with my result.	[reply]
Re^2: Count the Duplicate Entries and make them uniq by slayedbylucifer (Scribe) on Aug 30, 2012 at 04:20 UTC
Hello philiprbrenan, Your solution is working perfectly the way I wanted. thank you very-much for your time and response. I had to tweak your script a little bit in order to make is work with my actual CSV but the whole logic that you have used remains the same. Thanks you very much.	[reply]
Re: Count the Duplicate Entries and make them uniq by Kenosis (Priest) on Aug 29, 2012 at 15:13 UTC
Here's another option: `use Modern::Perl; my %hash; map { /(.+)?,(.+)/; $hash{$1}{$2}++ } <DATA>; say 'OS,RELEASE,Yes,No'; say "$_," . ( $hash{$_}{Yes} \|\| 0 ) . ',' . ( $hash{$_}{No} \|\| 0 ) for sort keys %hash; __DATA__ Ubuntu,Warty,No Ubuntu,Hoary,No Ubuntu,Breezy,Yes Ubuntu,Breezy,Yes Fedora,Yarrow,Yes Fedora,Stentz,No Fedora,Yarrow,Yes Fedora,Yarrow,Yes Windows,XP PRO,Yes Windows,XP PRO,Yes Windows,XP Home,No Windows,XP PRO,Yes` [download] Output: `OS,RELEASE,Yes,No Fedora,Stentz,0,1 Fedora,Yarrow,3,0 Ubuntu,Breezy,2,0 Ubuntu,Hoary,0,1 Ubuntu,Warty,0,1 Windows,XP Home,0,1 Windows,XP PRO,3,0` [download]	[reply] [d/l] [select]
Re^2: Count the Duplicate Entries and make them uniq by slayedbylucifer (Scribe) on Aug 30, 2012 at 04:22 UTC
Hello Kenosis, Thanks for your respose. your suggestion is shorted thant other solutions on this thread. However, for the time being, I am good with the solution proviede by philiprbrenan. However, i will definitely explore what you ahve sugested as it is totally new for me and it will be a great learning for me.	[reply]
Re^3: Count the Duplicate Entries and make them uniq by slayedbylucifer (Scribe) on Aug 30, 2012 at 04:24 UTC
Please excuse the my typos.	[reply]
Re^3: Count the Duplicate Entries and make them uniq by Kenosis (Priest) on Aug 30, 2012 at 04:36 UTC
You're most welcome! philiprbrenan's solution is certainly a good one, indeed!	[reply]

Back to Seekers of Perl Wisdom