PerlMonks  

which data structure do I need for this grouping problem?

by Anonymous Monk
on Sep 04, 2018 at 13:39 UTC ( #1221691=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!
I am trying to think of how to implement the following:
My task is, if I have a tab-separated file, like the following example lines:
nick 20/5/1950 one
john 18/2/1980 two
nick 19/6/1978 three
nick 20/5/1950 four
nick 12/9/2000 five
john 15/6/1997 six
nick 20/5/1950 seven
How would I group the lines using BOTH name & date as "key", i.e. the output would be:
nick 20/5/1950 one four seven
john 18/2/1980 two
nick 19/6/1978 three
nick 12/9/2000 five
john 15/6/1997 six

because nick, on 20/5/1950 has 3 measurements.
I know how I would group only using the name:
use strict;
use warnings;

my %res;
while (<>) {
    my ( $name, $rest ) = split /\t/;
    push @{ $res{$name} }, $rest;
}
for ( sort keys %res ) {
    print "$_ ", join( "|", @{ $res{$_} } ), "\n";
}

but now I need to also take the date into account. Can you help me?

Replies are listed 'Best First'.
Re: which data structure do I need for this grouping problem?
by 1nickt (Abbot) on Sep 04, 2018 at 14:32 UTC

    Hi, I'd suggest you build a hash keyed by name and then a sub-hash per name keyed by the date, with the values stored in an array. (To lovers of acronyms this would be a HOHOA (hash of hashes of arrays)).

    use strict;
    use warnings;
    use feature 'say';

    my %result;
    for my $line (<DATA>) {
        chomp $line;
        my ( $name, $date, $val ) = split ' ', $line;
        push @{ $result{ $name }{ $date } }, $val;
    }
    for my $name ( keys %result ) {
        say $name;
        for my $date ( keys %{ $result{ $name } } ) {
            say "\t$date: @{ $result{ $name }{ $date } }";
        }
    }
    __DATA__
    nick 20/5/1950 one
    john 18/2/1980 two
    nick 19/6/1978 three
    nick 20/5/1950 four
    nick 12/9/2000 five
    john 15/6/1997 six
    nick 20/5/1950 seven
    Output:
    $ perl 1221691.pl
    john
        15/6/1997: six
        18/2/1980: two
    nick
        19/6/1978: three
        20/5/1950: one four seven
        12/9/2000: five

    Hope this helps!
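The loops above iterate over keys %result, whose order is not defined, so the groups may not come out in the order shown in the question. A minimal sketch of one way to keep first-seen order, assuming whitespace-separated fields as in the code above; the helper name group_in_input_order and the @order array are illustrative only, not from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): groups "name date value" lines
# by (name, date) and remembers the order in which each pair first appeared.
sub group_in_input_order {
    my @lines = @_;
    my ( %result, @order );
    for my $line (@lines) {
        my ( $name, $date, $val ) = split ' ', $line;
        # Record the pair only the first time it is seen.
        push @order, [ $name, $date ] unless exists $result{$name}{$date};
        push @{ $result{$name}{$date} }, $val;
    }
    # Build one output line per pair, in first-seen order.
    return map {
        my ( $name, $date ) = @$_;
        join ' ', $name, $date, @{ $result{$name}{$date} };
    } @order;
}

my @sample = (
    'nick 20/5/1950 one',
    'john 18/2/1980 two',
    'nick 19/6/1978 three',
    'nick 20/5/1950 four',
    'nick 12/9/2000 five',
    'john 15/6/1997 six',
    'nick 20/5/1950 seven',
);
print "$_\n" for group_in_input_order(@sample);
```

With the sample lines this should print the five groups in the order requested in the question, starting with "nick 20/5/1950 one four seven".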


    The way forward always starts with a minimal test.

      and that would be a HoTHel !

      edit: he said tab-separated so '\t' probably?

Re: which data structure do I need for this grouping problem?
by haukex (Bishop) on Sep 04, 2018 at 13:53 UTC

    Here's one way: If there is a character that you know doesn't occur in either the name or the date (like a tab), you can use that to separate the two and make a single hash key out of it. Note that in the sample data that you've posted here, you don't have any tabs, so I've had to guess that all of your columns are separated by tabs. However, in that case, my ( $name, $rest ) = split /\t/; is only grabbing the first two columns. Also, you probably want to chomp your lines.

    use strict;
    use warnings;

    # Note: the fields in __DATA__ are separated by literal tab characters.
    my %res;
    while (<DATA>) {
        chomp;
        my ( $name, $date, @rest ) = split /\t/;
        push @{ $res{"$name\t$date"} }, @rest;
    }
    for my $key ( sort keys %res ) {
        my ( $name, $date ) = split /\t/, $key, 2;
        print "$name,$date:", join( "|", @{ $res{$key} } ), "\n";
    }
    __DATA__
    nick	20/5/1950	one
    john	18/2/1980	two	two and a half
    nick	19/6/1978	three
    nick	20/5/1950	four
    nick	12/9/2000	five
    john	15/6/1997	six
    nick	20/5/1950	seven	eight

    Output:

    john,15/6/1997:six
    john,18/2/1980:two|two and a half
    nick,12/9/2000:five
    nick,19/6/1978:three
    nick,20/5/1950:one|four|seven|eight

    In the above code, using \t to separate the two parts of the hash key is safe: because of the split /\t/, I know that none of the strings will contain tabs. If you choose a separator that might occur in the strings, such as |, you may want to add a check like die $name if $name=~/\|/; die $date if $date=~/\|/; to play it safe. Alternatively, you can use a separator that is very unlikely to appear, like $/ or \0 (but again, if you want to code defensively, check for its presence anyway). (Update: $/ is the input record separator, which chomp removes for you. Also made a minor fix to the latter two regexes.)

    Update 2: I should also mention that using a module like Text::CSV is generally better for reading this kind of data, because it handles things like quoted fields and escaped characters for you (also install Text::CSV_XS for speed).

      There is a built-in mechanism for this: see perldoc -v '$;'.
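A minimal sketch of what that mechanism looks like in practice: writing $res{ $name, $date } makes Perl join the two subscripts with $; (the subscript separator, "\034" by default) to form a single key. The helper name group_with_subsep and the array-ref record layout are assumptions for illustration only. As far as I know, recent feature bundles (use v5.36 and later) switch this syntax off, so treat it as a sketch for code that does not enable them.

```perl
use strict;
use warnings;

# Hypothetical helper: groups [name, date, value] records using Perl's
# multidimensional hash emulation, where $h{$x, $y} means $h{ join($;, $x, $y) }.
sub group_with_subsep {
    my @records = @_;
    my %res;
    for my $rec (@records) {
        my ( $name, $date, $val ) = @$rec;
        push @{ $res{ $name, $date } }, $val;
    }
    return \%res;
}

my $res = group_with_subsep(
    [ 'nick', '20/5/1950', 'one' ],
    [ 'john', '18/2/1980', 'two' ],
    [ 'nick', '20/5/1950', 'four' ],
);

# Split each combined key back into its two parts on the separator.
my $sep = quotemeta($;);
for my $key ( sort keys %$res ) {
    my ( $name, $date ) = split /$sep/, $key, 2;
    print "$name,$date:", join( '|', @{ $res->{$key} } ), "\n";
}
```

The same caveat as for any other separator applies: if a name or date could contain the $; character, the combined keys become ambiguous.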
      Interesting approach, thank you!
Re: which data structure do I need for this grouping problem?
by kevbot (Priest) on Sep 05, 2018 at 04:41 UTC
    Hello,

    I see that you have already received good replies. Here is another way to perform this task. The Data::Table module has many useful methods for manipulating tabular data. In this case, the group method is applicable.

    The data.tsv file contains the following tab-delimited data

    nick	20/5/1950	one
    john	18/2/1980	two
    nick	19/6/1978	three
    nick	20/5/1950	four
    nick	12/9/2000	five
    john	15/6/1997	six
    nick	20/5/1950	seven
    This code will group the data, and prepare the concatenated values.
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Data::Table;

    # Load input data from tsv file
    # The first argument is the file name
    # The second argument specifies that there is no header row (in this case
    # the Data::Table object that is created will have auto-generated column
    # names of col1, col2, etc.)
    my $dt = Data::Table::fromTSV('data.tsv', 0);

    print "The input table is:\n";
    print $dt->tsv, "\n\n";

    # Group by 'col1' and 'col2'
    my $output_t = $dt->group(
        ['col1', 'col2'],   # columns to group by
        ['col3'],           # Columns to perform calculation on
        [ \&join_vals ],    # Apply join_vals function to values found in 'col3'
        ['values']          # Put the joined values into these columns
    );

    print "The output table is:\n";
    print $output_t->tsv, "\n\n";

    sub join_vals {
        my @data = @_;
        return join("|", @data);
    }

    exit;
    The output should be,
    The input table is:
    col1	col2	col3
    nick	20/5/1950	one
    john	18/2/1980	two
    nick	19/6/1978	three
    nick	20/5/1950	four
    nick	12/9/2000	five
    john	15/6/1997	six
    nick	20/5/1950	seven

    The output table is:
    col1	col2	values
    nick	20/5/1950	one|four|seven
    john	18/2/1980	two
    nick	19/6/1978	three
    nick	12/9/2000	five
    john	15/6/1997	six
Re: which data structure do I need for this grouping problem?
by afoken (Canon) on Sep 05, 2018 at 17:20 UTC

    The standard solution for handling files containing character-separated values (CSV), including tabulator separated values, is to use Text::CSV and - if possible - its accelerating companion Text::CSV_XS. It is "the standard" because it not only splits (and joins) on the separating character(s), but also handles quoting, escaping, and all of those nasty edge cases you can find in CSV files.
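A minimal sketch of how the grouping from this thread might look with Text::CSV, assuming tab-separated input with name and date in the first two columns. The helper name group_tsv and the in-memory sample data are illustrative only; the constructor options used (sep_char, binary, auto_diag) and getline are the module's documented interface.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical helper: reads tab-separated rows from an open filehandle and
# groups every column after the first two by (name, date), in first-seen order.
sub group_tsv {
    my ($fh) = @_;
    my $csv = Text::CSV->new( { sep_char => "\t", binary => 1, auto_diag => 1 } );
    my ( %res, @order );
    while ( my $row = $csv->getline($fh) ) {
        my ( $name, $date, @rest ) = @$row;
        # A nested hash avoids having to pick a key separator at all.
        push @order, [ $name, $date ] unless exists $res{$name}{$date};
        push @{ $res{$name}{$date} }, @rest;
    }
    return map { [ @$_, @{ $res{ $_->[0] }{ $_->[1] } } ] } @order;
}

# An in-memory filehandle stands in for a real file here.
my $input = "nick\t20/5/1950\tone\njohn\t18/2/1980\ttwo\nnick\t20/5/1950\tfour\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
print join( "\t", @$_ ), "\n" for group_tsv($fh);
close $fh;
```

For a real file, the in-memory handle would be replaced by an ordinary open on the file name.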

    If you are used to working with relational databases and DBI, try DBD::CSV. It sits on top of Text::CSV and allows you to treat CSV files like database tables in a relational database. In other words: You can use SQL to work directly with CSV files.

    All of those modules are currently maintained by our helpful Tux.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
