Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw

Advances SQL calculations in CSV file

by DrAxeman (Scribe)
on Aug 31, 2005 at 16:50 UTC ( #488148=perlquestion: print w/replies, xml ) Need Help??

DrAxeman has asked for the wisdom of the Perl Monks concerning the following question:

I have an issue that I'm not sure how to approach. I apologize in advance if I do not explain my delima clearly the first time. I'm not sure how to explain it...

I have CSV files that contain statistical data. I need to be able to extract this data in a format that gives me the TOP 5% (highest value) for certain columns, along with certain the data contained in the same rows. What's got me stumped, is that I don't need the TOP 5% of the table, but the top 5% of each server having data collected on it. I also need the TOP 5% of multiple columns. Let me provide an example:

My data looks similar to this (just LOTS more of it). There are a lot more columns, so I trimmed this down a little. The first row is the headers:
PDHCSV40EasternDaylightTime240,ERWWCOMMUNITIES_MemoryPagessec,ERWWCOMM +UNITIES_NetworkInterfaceCompaqEthernetFastEthernetAda pterModuleBytesTotalsec,ERWWCOMMUNITIES_NetworkInterfaceMSTCPLoopbacki +nterfaceBytesTotalsec,ERWWCOMMUNITIES_NetworkInterfac eCompaqEthernetFastEthernetAdapterModuleCurrentBandwidth,ERWWCOMMUNITI +ES_NetworkInterfaceMSTCPLoopbackinterfaceCurrentBandw idth,ERWWCOMMUNITIES_PhysicalDisk0CPCTDiskTime,ERWWCOMMUNITIES_Physica +lDisk1DPCTDiskTime,ERWWCOMMUNITIES_PhysicalDiskTotalP CTDiskTime,ERWWCOMMUNITIES_PhysicalDiskTotalAvgDiskQueueLength,ERWWCOM +MUNITIES_ProcessorTotalPCTProcessorTime,ERWWDC1_Memor yPagessec,ERWWDC1_NetworkInterfaceIntelPRO100NetworkConnectionBytesTot +alsec,ERWWDC1_NetworkInterfaceMSTCPLoopbackinterfaceB ytesTotalsec,ERWWDC1_NetworkInterfaceIntelPRO100NetworkConnectionCurre +ntBandwidth,ERWWDC1_NetworkInterfaceMSTCPLoopbackinte rfaceCurrentBandwidth,ERWWDC1_PhysicalDisk0CPCTDiskTime,ERWWDC1_Physic +alDiskTotalPCTDiskTime,ERWWDC1_PhysicalDiskTotalAvgDi skQueueLength,ERWWDC1_ProcessorTotalPCTProcessorTime 06302005141157.588,16.581489221609786,644.14319700475187,129.796028843 +00556,100000000,10000000,4.6331155655937869e-009,4.42 86442834463436e-009,4.5308799245200652e-009,9.0617598490401305e-011,99 +.985420635482711,0.35448523206811611,3025.96021122622 05,2095.5728628079773,100000000,10000000,2.8398789187023344e-009,2.839 +8789187023344e-009,2.8398789187023345e-011,99.9902273 56518915,1.7378886565248808,1690.2294964462301,13.292340672916287,1000 +00000,10000000,0.00015427048318971214,0.0001542704831 8971214,1.5427048318971215e-006,99.990256224867423

The first column is a date/time stamp. The rest of the columns are server name & statistic collected. For each column with "ProcessorTotalPCT" in it, I need the TOP 5 % numbers along with the data from the same column. Then I also need the same thing done with for the columns with "MemoryPagessec" in it. I would also like the data split into to files. One for CPU and one for the RAM info.

Final output should look similar to:

Date/time,servername,data,data,data,... Date/time,servername,data,data,data,... Date/time,servername,data,data,data,... Date/time,servername,data,data,data,...

Thanks in advance. I hope that I wasn't too confusing.

Replies are listed 'Best First'.
Re: Advances SQL calculations in CSV file
by InfiniteSilence (Curate) on Aug 31, 2005 at 17:18 UTC
    Aha, DrAxeman, the problem is in the way you are growing your data!

    From looking at your earlier posts I see that you have been progressively working toward this dataset. In a previous post you had a similar problem, but you must have only been considering one server. Now you appear to have multiple servers and you decided to MUNGE the column names with the SERVER NAME.

    Instead you should add a column with the SERVERNAME and solve the problem with SQL the same way you did in your earlier posts.

    Celebrate Intellectual Diversity

Re: Advances SQL calculations in CSV file
by Roger (Parson) on Aug 31, 2005 at 17:12 UTC
    So what have you tried so far? Where is your Perl question? This post is just a "work request", it is not a genuine Perl question.

    You can use split to split your record if you want to do it the easy way, or you can use Text::CSV_XS module to do a more capable splitting, or you can use DBD::CSV to read the CSV file via SQL. There are many many ways to do it.

      Sorry. I didn't mean for that to sound like a job request. I was more concerned with properly describing my needs that I was at formating my request for suggestions.

      I have other scripts that calculate averages for these columns, and I will include that code here. I would like to use DBD::CSV on this, but lack the SQL experience to properly script this with out looping over and over in the file.
      # Connect to the database, (the directory containing our csv file( +s)) my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_eol=\n;"); # Associate our csv file with the table name 'results' $dbh->{'csv_tables'}->{'results'} = { 'file' => "$CWD/$name"}; my $sth = $dbh->prepare("SELECT * FROM results WHERE 1=0"); $sth->execute; my @origcols = @{$sth->{NAME}}; my @cols; # = @origcols; shift @origcols; # Eliminate First Column foreach ( @origcols ) { push ( @cols , $_ ) unless /Bandwidth.*|MSTCPLoop.*/ ; # Remove + unwanted columns }; my $avgSQL = 'SELECT ' . join(', ', map { "avg($_) \n" } @cols ) . ' FROM results'; my @avgs = $dbh->selectrow_array($avgSQL); my %avgsHash; @avgsHash{ @cols } = @avgs; # uses a hash slice to populate %avgs +Hash for ( @cols ) { (my $changed = $_) =~ s/_/","/; printf STATS ("\"%s\",%f\n", $changed, $avgsHash{$_}); }

      Honestly, I have only made a couple of attempts at this. I'm really stumped on how to format the SQL query. I think that is someone could point me in the direction the SQL query should look like, I'd be able to create it.
        SQL is the unnecessary complexity you don't want in such a simple task. Besides DBD::CSV implements a very limited subset of SQL.

        What you should do is just to have a single pass over your file reading a line at the time, build your hash table of statistics along the way while splitting the records, and print out the stats at the end.

        my %stats; open MYFILE, "<data.txt" or die "Can not open file: $!"; chomp(my $heading = <MYFILE>); my @col = split /,/, $heading; while (my $line = <MYFILE>) { # build a hash for the input record chomp($line); my @rec = split /,/, $line; my %rec; @rec{@col} = @rec; # collect stats $stats{$rec{rec_id}}{BAL} += $rec{bal}; $stats{$rec{rec_id}}{NUM} ++; } # print the stats here for my $id (keys %rec) { if ($rec{$id}{NUM}) { print "$id => ", $rec{$id}{BAL}/$rec{$id}{NUM}, "\n" } } close MYFILE;
Re: Advances SQL calculations in CSV file
by jZed (Prior) on Aug 31, 2005 at 17:36 UTC
    DBD::CSV supports GROUP BY clauses, aggregate functions like SUM() and MAX(), and user-defined functions which can perform calculations on row values. See SQL::Statement::Syntax for details. But first, it looks like you better read some SQL references on how to design your data structure and how to query your data to produce the results you want.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://488148]
Approved by kirbyk
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (3)
As of 2020-10-25 09:12 GMT
Find Nodes?
    Voting Booth?
    My favourite web site is:

    Results (249 votes). Check out past polls.