http://www.perlmonks.org?node_id=990445

slayedbylucifer has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am working on CSV and need to manipulate it as follows. Please note that these are test files I created for explanation on this thread. However the essence of the problem is the same.

I have below test.csv:

OS,RELEASE,LIKE Ubuntu,Warty,No Ubuntu,Hoary,No Ubuntu,Breezy,Yes Ubuntu,Breazy,Yes Fedora,Yarrow,Yes Fedora,Stentz,No Fedora,Yarrow,Yes Fedora,Yarrow,Yes Windows,XP PRO,Yes Windows,XP PRO,Yes Windows,XP Home,No Windows,XP PRO,Yes

Please note that there are rows which has duplicate entries.

I want to create report.csv as follows by referring above table:

OS,RELEASE,Yes,No Ubuntu,Warty,0,1 Ubuntu,Hoary,0,1 Ubuntu,Breezy,2,0 Fedora,Yarrow,3,0 Fedora,Stentz,0,1 Windows,XP PRO,3,0 Windows,XP Home,0,1

Meaning, e.g. I need to create entries for OS and RELEASE and count the times It was liked as "YES" or "NO".

So this is the flow I am following:

1. Create hash of array with "OS name" as hash value and "Array reference" to the array containing RELASE versions for that OS as the values.

2. Run a "SELECT" query by looping over this Hash and print the corresponding the RELEASE and LIKE values

3. unique/count the output received in step 2 and push it to the report.csv

I am stuck at step no. 3. I am not able to unique and count the entries which were received from step 2. I would not say that my code is not working. i would rather say I am not able to think of a way to make it work. Basically, I am not able to think of the logic how to get this done.

Sorry for the long and confusing explanation. I am pasting my code below but I am pretty sure it would add to more confusion. Also you may find lots of bugs in this code

#!/usr/bin/perl -w use strict; use Data::Dumper; use DBI; my @os; my %duplicate; my %hash1; my $dbh = DBI->connect ("dbi:CSV:") or die "Cannot connect to the CSV +file: $DBI::errstr()"; $dbh->{RaiseError} = 1; $dbh->{TraceLevel} = 0; my $table = "test.csv"; my $query = "select OS from $table"; my $sth = $dbh->prepare ($query); $sth->execute(); while ( my $row_ref = $sth->fetch ) { (my $os = $row_ref->[0]); push (@os, $os) unless $duplicate{$os}++ ; } foreach my $operating_system (@os) { my @array; my $query = "select RELEASE from $table where OS = ?"; my $sth = $dbh->prepare ($query); $sth->execute($operating_system); while (my $ref = $sth->fetch ) { my $var = $ref->[0]; push (@array, $var) unless $duplicate{$var}++ ; ### removing +the duplicate release names } $hash1{$operating_system} = \@array; } my @new; foreach my $operating_system (@os) { foreach my $arr_ref ($hash1{$operating_system}) { my $arr_index = @{$hash1{$operating_system}}-1; for (0..$arr_index) { my $release = $hash1{$operating_system}->[$_]; my $query = "select LIKE from $table where OS = ? and RELE +ASE = ?"; my $sth = $dbh->prepare ($query); $sth->execute ($operating_system, $release); while ( my $row_ref = $sth->fetch ) { push (@new, $release.",".$hash{$row_ref->[0]}); } $sth->finish (); } } } ############# >>>> from here below I need to figure out the unique en +tries and count the times it were repeated and push it to report.csv foreach my $var (@new) { my $count = 0; if ( ! $duplicate{$var}++ ) { print $var.$count."\n"; } } $sth->finish(); $dbh->disconnect()

Thanks.

Replies are listed 'Best First'.
Re: Count the Duplicate Entries and make them uniq
by Corion (Patriarch) on Aug 29, 2012 at 11:37 UTC

    Why are you trying to do a unique count in Perl instead of using SQL to do that? See SQL count and/or SQL count distinct.

    Also, SQL has the DISTINCT keyword if you need to find the unique values for a column. See SQL DISTINCT.

    Also see perlfaq4 on "duplicate" on how to find/remove duplicates.

Re: Count the Duplicate Entries and make them uniq
by choroba (Cardinal) on Aug 29, 2012 at 12:13 UTC
    In a real database, you would be able to do something like
    select distinct OS, RELEASE, (select count(1) from $table as y where x.OS=y.OS and x.RELEASE=y.RELEA +SE and y.LIKE='Yes'), (select count(1) from $table as y where x.OS=y.OS and x.RELEASE=y.RELEA +SE and y.LIKE='No') from $table as x;
    For some reason, it works in SQLite, but CSV returns
    DBD::CSV::db prepare failed: Unbalanced parentheses! at .../perl/lib/p +erl5/SQL/Statement.pm line 88
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Count the Duplicate Entries and make them uniq
by scorpio17 (Canon) on Aug 29, 2012 at 13:33 UTC

    A good rule of thumb is to always make the database do as much of the work as possible. For example, instead of reading data and then sorting it, add an 'ORDER BY' clause to your SELECT statement and read the already sorted data.

    In this case, you don't really need a 'database' at all. Reading the columns of data from a text file would look something like this:

    use strict; my %data; my $headers = <DATA>; while ( my $line = <DATA>) { chomp $line; my ($os,$release,$like) = split(/,/,$line); ++$data{$os}{$release}{ uc $like }; } for my $os (sort keys %data) { for my $release (sort keys %{ $data{$os} } ) { my $yes = $data{$os}{$release}{'YES'} || '0'; my $no = $data{$os}{$release}{'NO'} || '0'; print "$os, $release, $yes, $no\n"; } } __DATA__ OS,RELESASE,LIKE Ubuntu,Warty,No Ubuntu,Hoary,No Ubuntu,Breezy,Yes Ubuntu,Breazy,Yes Fedora,Yarrow,Yes Fedora,Stentz,No Fedora,Yarrow,Yes Fedora,Yarrow,Yes Windows,XP PRO,Yes Windows,XP PRO,Yes Windows,XP Home,No Windows,XP PRO,Yes

    If you really want to use a database, then you could pull the data out like this:

    my $sth = $dbh->prepare("select OS, RELEASE, LIKE from $table"); $sth->execute(); while ( my ($os, $release, $like) = $sth->fetchrow ) { ++$data{$os}{$release}{$like}; }
      Thank. I am working on it. Will keep you posted. Thanks for your time.
      Hello corpio17, Thanks for your response. However, for the time being, I am good with the solution provided by philiprbrenan. However, I will definitely explore what you have suggested as it is completelly works without the SQL statements. Thanks.
Re: Count the Duplicate Entries and make them uniq
by philiprbrenan (Monk) on Aug 29, 2012 at 13:39 UTC

    Build a data structure from the relational layout you have provided aggregating the Yes and No entries and then walk the resulting structure to print the desired results. For safety you could add some data validation.

    use feature ":5.14"; use warnings FATAL => qw(all); use strict; use Data::Dump qw(dump pp); my @l = split /\n/, << 'END'; Ubuntu,Warty,No Ubuntu,Hoary,No Ubuntu,Breezy,Yes Ubuntu,Breazy,Yes Fedora,Yarrow,Yes Fedora,Stentz,No Fedora,Yarrow,Yes Fedora,Yarrow,Yes Windows,XP PRO,Yes Windows,XP PRO,Yes Windows,XP Home,No Windows,XP PRO,Yes END my $h; for(@l) {my @w = split /,/; $h->{$w[0]}{$w[1]}{$w[2]}++; } pp($h); for my $o(sort keys %$h) {for my $r(sort keys %{$h->{$o}}) {my %yn = %{$h->{$o}{$r}}; my ($yes, $no) = ($yn{Yes}//0, $yn{No}//0); say "$o\t$r\t$yes\t$no"; } }

    Produces

    {
      Fedora  => { Stentz => { No => 1 }, Yarrow => { Yes => 3 } },
      Ubuntu  => {
                   Breazy => { Yes => 1 },
                   Breezy => { Yes => 1 },
                   Hoary  => { No => 1 },
                   Warty  => { No => 1 },
                 },
      Windows => { "XP Home" => { No => 1 }, "XP PRO" => { Yes => 3 } },
    }
    Fedora	Stentz	0	1
    Fedora	Yarrow	3	0
    Ubuntu	Breazy	1	0
    Ubuntu	Breezy	1	0
    Ubuntu	Hoary	0	1
    Ubuntu	Warty	0	1
    Windows	XP Home	0	1
    Windows	XP PRO	3	0
    
      Thanks for your time. I am working on it. I will update this thread with my result.
      Hello philiprbrenan, Your solution is working perfectly the way I wanted. thank you very-much for your time and response. I had to tweak your script a little bit in order to make is work with my actual CSV but the whole logic that you have used remains the same. Thanks you very much.
Re: Count the Duplicate Entries and make them uniq
by Kenosis (Priest) on Aug 29, 2012 at 15:13 UTC

    Here's another option:

    use Modern::Perl; my %hash; map { /(.+)?,(.+)/; $hash{$1}{$2}++ } <DATA>; say 'OS,RELEASE,Yes,No'; say "$_," . ( $hash{$_}{Yes} || 0 ) . ',' . ( $hash{$_}{No} || 0 ) for sort keys %hash; __DATA__ Ubuntu,Warty,No Ubuntu,Hoary,No Ubuntu,Breezy,Yes Ubuntu,Breezy,Yes Fedora,Yarrow,Yes Fedora,Stentz,No Fedora,Yarrow,Yes Fedora,Yarrow,Yes Windows,XP PRO,Yes Windows,XP PRO,Yes Windows,XP Home,No Windows,XP PRO,Yes

    Output:

    OS,RELEASE,Yes,No Fedora,Stentz,0,1 Fedora,Yarrow,3,0 Ubuntu,Breezy,2,0 Ubuntu,Hoary,0,1 Ubuntu,Warty,0,1 Windows,XP Home,0,1 Windows,XP PRO,3,0
      Hello Kenosis, Thanks for your respose. your suggestion is shorted thant other solutions on this thread. However, for the time being, I am good with the solution proviede by philiprbrenan. However, i will definitely explore what you ahve sugested as it is totally new for me and it will be a great learning for me.
        Please excuse the my typos.

        You're most welcome! philiprbrenan's solution is certainly a good one, indeed!