Cool way to parse Space Separated Value and CSV files

As a programmer and teacher of the Perl programming language, I often get destabilizing questions. In one of the last classes I gave, while I was talking about hashes, someone asked me "What is it used for? When would I ever need that?" Of course, for me (and you too, probably) hashes are quite practical, but asked on the spot, I didn't know what to say, so I talked about the %ENV hash and made an example with it.

Today I found an interesting use for hashes. I wish I had thought of it during my class but I didn't, so I would like to share it with you for the benefit of newer Perl programmers.

Imagine you have to read a Space Separated Value file or Comma Separated Value (CSV) file. It's easy because the fields are always in the same order. For example:

# firstname lastname age
joe builder 9
bob plumber 66
dora squarepants 10
diego simpson 11

You can do this:

open( my $l, '<', 'file' ) or die "Error : $!";
my @lines = <$l>;
close( $l );

foreach my $line ( @lines ) {
  
  # Skipping if the line is empty or a comment
  next if ( $line =~ /^\s*$/ );
  next if ( $line =~ /^\s*#/ );
  
  my ($firstname, $lastname, $age) = split( /\s+/, $line );
  
  # then do whatever you have to
}

But then someday someone gives you a new file with the fields in a different order, plus extra fields you don't need. Here is the new file:

# lastname firstname age gender phone
mcgee bobby 27 M 555-555-5555
kincaid marl 67 M 555-666-6666
hofhazards duke 22 M 555-696-6969

What do you do? Do you change your code with an if statement? Do you alter the file to change the order of the fields and remove the extra fields? No! You use hashes!

Here is the solution:

open( my $l, '<', 'file' ) or die "Error : $!";
my @lines = <$l>;
close( $l );

my @keys = split( /\s+/, $lines[0] );
shift( @keys ); # to remove the # as the first field

foreach my $line ( @lines ) {
  
  # Skipping if the line is empty or a comment
  next if ( $line =~ /^\s*$/ );
  next if ( $line =~ /^\s*#/ );
  
  my %hash;
  @hash{ @keys } = split( /\s+/, $line );
  
  # then do whatever you have to
}

Note that the first line in the file is important: it gives you the order of the fields. Even if it's not there when you receive the file, you can easily add it. Note the @hash{ } syntax. This is called a slice. You are slicing the hash, which lets you access a list of elements from the hash at once. The @keys array contains the keys in the same order as they are written at the top of the file. Therefore, doing @hash{ @keys } is like doing @hash{ qw(lastname firstname age gender phone) } or @hash{ 'lastname', 'firstname', 'age', 'gender', 'phone' }, except that it keeps working when the fields in a file are not in the same order as in the previous file.

The split of the line returns a list so doing this:

@hash{ @keys } = split( /\s+/, $line );

is the same as this:

@hash{'lastname', 'firstname', 'age', 'gender', 'phone' } = split( /\s+/, $line );

or this:

($hash{'lastname'}, $hash{'firstname'}, $hash{'age'}, $hash{'gender'}, $hash{'phone'}) = split( /\s+/, $line );

Also, if the file contains fields you don't need, you can simply ignore them. As long as all the required fields are there, your code will always work.
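To make that concrete, here is a minimal sketch that runs the same loop over both file layouts shown above. The parse_records helper and the in-memory sample strings are only there for illustration (they are not part of the code above), and it uses split ' ' rather than split /\s+/ so that leading whitespace on a line does not produce an empty first field.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the text of a file whose first line is a
# "# field1 field2 ..." header and returns a list of hash references.
sub parse_records {
    my ($text) = @_;
    my ( $header, @lines ) = split /\n/, $text;

    # Drop the leading "#" and keep the field names in file order.
    my ( undef, @keys ) = split ' ', $header;

    my @records;
    foreach my $line (@lines) {

        # Skipping if the line is empty or a comment
        next if $line =~ /^\s*$/;
        next if $line =~ /^\s*#/;

        my %hash;
        @hash{@keys} = split ' ', $line;    # hash slice pairs keys with values
        push @records, \%hash;
    }
    return @records;
}

my $old_layout = "# firstname lastname age\n"
               . "joe builder 9\n"
               . "bob plumber 66\n";
my $new_layout = "# lastname firstname age gender phone\n"
               . "mcgee bobby 27 M 555-555-5555\n";

# The same loop handles both layouts; the extra fields are simply unused.
foreach my $person ( parse_records($old_layout), parse_records($new_layout) ) {
    print "$person->{firstname} $person->{lastname} is $person->{age}\n";
}
```

The loop only ever asks for firstname, lastname and age by name, so the column order and the extra gender and phone columns never matter to it.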

I hope this will be useful for you someday! Good luck!

A for will get you from A to Z; a while will get you everywhere.

-- greengaroo


Re: Cool way to parse Space Separated Value and CSV files
by Anonymous Monk on Apr 10, 2013 at 07:18 UTC

Hashes are dictionaries:
$age{Peter} is pronounced age-of-Peter
$price{hammer} is pronounced price-of-hammer
$definition{dictionary} is pronounced definition-of-dictionary
More on this type of thing in Re^3: highest value in hash (virtual teddybear)

Speaking of csv and dictionary-of-@fields :) you could even use the technique with fixed-width records :)
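A rough sketch of the fixed-width variant, using unpack in place of split. The column widths (12, 10 and 3 characters), the field names and the parse_fixed helper are assumptions made up for this example; a real file would need its own template.

```perl
use strict;
use warnings;

# Assumed layout (hypothetical): lastname is 12 characters wide,
# firstname 10 characters, age 3 characters.
my @keys     = qw( lastname firstname age );
my $template = 'A12 A10 A3';    # "A" strips trailing spaces when unpacking

# Hypothetical helper: one fixed-width line in, one hash reference out.
sub parse_fixed {
    my ($line) = @_;
    my %hash;
    @hash{@keys} = unpack $template, $line;    # same hash slice as with split
    return \%hash;
}

# Build sample records with sprintf so the padding is exact.
my @lines = map { sprintf '%-12s%-10s%-3s', @$_ }
    ( [qw( mcgee bobby 27 )], [qw( kincaid marl 67 )] );

foreach my $line (@lines) {
    my $person = parse_fixed($line);
    print "$person->{firstname} $person->{lastname} is $person->{age}\n";
}
```

Only the right-hand side of the assignment changes; the hash slice on the left is the same idea as in the original post.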

Examples at

Re: hash_reference from file, Re^2: hash_reference from file
join - join two files according to a common key
part - split up files according to column value
Merge two files with similar column entries
Opening multiple files ( csvpaste.pl )
Open multiple file handles?
X, Y Table structure
How to extract the particular residues from PDB files(text csv split hash bioperl)
Select only desired features from a text (text csv split hash bioperl)
Sort on Table headers (text csv split hash bioperl)
I need help joining tab-delimited files/tables!
split then join based on common value in field
Smart way to read a file vertically?
reformatting tab delimited file
edit a CSV and "in-place" replacement
Re: Reversing a mysql table, transpose-tsv -- invert a tab-delimited table

Text::CSV

examples/csv2xls Script to convert CSV files to M$Excel

examples/csv-check Script to check a CSV file/stream

examples/csvdiff Script to show diff between sorted CSV files

examples/parser-xs.pl Parse CSV stream, be forgiving on bad lines

[CSV hash ] CSV hash: Best way to match a hash with large CSV file; perl hash to CSV using Text::CSV_XS; Issue parsing CSV into hashes?; Veriable Length Array/Hash derived from CSV to populate an XML; extracting data from CSV files and making hash of hashes; Re^2: build hash from csv file; Encoding a hash in perl before saving it as a CSV file; hash from CSV-like structure; Read the csv file to a hash....; Parsing CSV into a hash; build hash from csv file; Converting a CSV list to a list of hashrefs naming the fields

merging csv files into a third file preserving column & row


Re^2: Cool way to parse Space Separated Value and CSV files

by greengaroo (Hermit) on Apr 10, 2013 at 13:12 UTC

Hashes are dictionaries

That is a damn good explanation! I will use it in my class! I never thought of it that way! Thank you very much!



Re: Cool way to parse Space Separated Value and CSV files
by johngg (Canon) on Apr 12, 2013 at 22:53 UTC

Since you state that the first line in the file is important, it might be as well to treat it differently by assigning it to a separate scalar variable. Also, when you split the header to get the column names, you could save having to do the shift by assigning the first value to undef, which can act as a sort of programmatic bit bucket.

use strict;
use warnings;

use 5.014;

use Data::Dumper;

open my $inFH, q{<}, \ <<EOD or die qq{open: < HEREDOC: $!\n};
# lastname firstname age gender phone
mcgee bobby 27 M 555-555-5555
   
kincaid marl 67 M 555-666-6666
# comment
hofhazards duke 22 M 555-696-6969
EOD

my( $header, @lines ) = <$inFH>;

close $inFH or die qq{close: < HEREDOC: $!\n};

my( undef, @keys ) = split m{\s+}, $header;

foreach my $line ( @lines )
{
    next if $line =~ m{(?x) ^ \s* (?: (?-x:#) | $ )};
    my %hash;
    @hash{ @keys } = split m{\s+}, $line;

    print Data::Dumper->Dumpxs( [ \ %hash ], [ qw{ *hash } ] );
}

The output.

%hash = (
          'firstname' => 'bobby',
          'lastname' => 'mcgee',
          'phone' => '555-555-5555',
          'age' => '27',
          'gender' => 'M'
        );
%hash = (
          'firstname' => 'marl',
          'lastname' => 'kincaid',
          'phone' => '555-666-6666',
          'age' => '67',
          'gender' => 'M'
        );
%hash = (
          'firstname' => 'duke',
          'lastname' => 'hofhazards',
          'phone' => '555-696-6969',
          'age' => '22',
          'gender' => 'M'
        );

The technique falls to pieces somewhat when your space-separated files have fields or headers containing spaces, or when a CSV file has commas dotted around inside its fields. You would then reach for something like Text::CSV.
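For what it is worth, a minimal sketch of the same header-driven idea with Text::CSV might look like the following. It assumes Text::CSV is installed from CPAN (it is not a core module), and the sample data, including the quoted field with an embedded comma, is made up for illustration.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# Made-up sample data; the first record has a comma inside a quoted field.
my $data = <<'EOD';
lastname,firstname,age
"mcgee, jr",bobby,27
kincaid,marl,67
EOD

open my $fh, '<', \$data or die "open: $!\n";

my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );

# Use the first row as the field names ...
$csv->column_names( @{ $csv->getline($fh) } );

# ... then read every following row as a hash reference keyed by them.
my @records;
while ( my $row = $csv->getline_hr($fh) ) {
    push @records, $row;
    print "$row->{firstname} / $row->{lastname} / $row->{age}\n";
}

close $fh or die "close: $!\n";
```

Here the module does the quoting-aware splitting, and getline_hr takes the place of the hand-written hash slice.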

I hope this is of interest.

Cheers,

JohnGG


Re^2: Cool way to parse Space Separated Value and CSV files

by Anonymous Monk on May 21, 2018 at 07:23 UTC

How can I take the input from the user rather than hard-coding it in this solution?


Re^3: Cool way to parse Space Separated Value and CSV files

by Corion (Patriarch) on May 21, 2018 at 07:35 UTC

I assume that you want to run the program on a user-specified file instead of the hardcoded file?

Command line parameters are available in the @ARGV array, see perlvar.

You can read user input from STDIN, like my $filename = <STDIN>; and then chomp the value to remove the trailing newline.

Once you have the filename, modify the open statement to use a filename instead of opening a here-document.
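A small sketch of those steps, assuming you want the first command-line argument to win and a prompt on STDIN as the fallback. The pick_filename helper is a name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: prefer the first command-line argument,
# otherwise ask on the given input handle (STDIN by default).
sub pick_filename {
    my ( $args, $in ) = @_;
    $in //= \*STDIN;

    return $args->[0] if @$args;

    print "File to read: ";
    my $filename = <$in>;
    die "No filename given\n" unless defined $filename;
    chomp $filename;    # remove the trailing newline from the typed input
    return $filename;
}
```

In the program above you would then call my $filename = pick_filename( \@ARGV ); and change the open to open my $inFH, q{<}, $filename or die qq{open: $filename: $!\n};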

