
Re: Extracting information from file to Hash

by tobyink (Canon) on Jan 18, 2012 at 21:52 UTC


in reply to Extracting information from file to Hash

Here's a quick stab at a solution. It works with your example data...

use strict;
use Data::Dumper;

my %hash;

# One capture group per field; every value is optional because it may be empty.
my $regexp = qr{ ^
    \s* Collection \s* => \s* (\d+)?
    \s* ImageCount \s* => \s* (\d+)?
    \s* Status     \s* => \s* (\w+)?
    \s* Missing    \s* => \s* ([\d,]+)?
    \s* Modified   \s* => \s* ([\d/]+\s[\d:]+)?
    \s* $}x;

while (defined(my $line = <DATA>)) {
    chomp $line;
    my %linehash;
    if ($line =~ $regexp) {
        %linehash = (
            Collection => $1,
            ImageCount => $2,
            Status     => $3,
            Missing    => $4,
            Modified   => $5,
        );
    }
    # Skip lines that did not match or that carry no collection id.
    next unless defined $linehash{Collection};
    $hash{ $linehash{Collection} } = \%linehash;
}

my @sorted_by_status = sort { $a->{Status} cmp $b->{Status} } values %hash;
print Dumper \@sorted_by_status;

__DATA__
Collection=>168245 ImageCount=>6 Status=>SI Missing=>1,3 Modified=>01/18/2012 11:14:30
Collection=>161745 ImageCount=>6 Status=>I Missing=>2,3 Modified=>01/18/2012 11:16:38
Collection=>162451 ImageCount=>6 Status=>SC Missing=> Modified=>01/20/2012 11:16:38
Collection=>117481 ImageCount=>8 Status=>C Missing=> Modified=>01/18/2011 7:16:38

It would be nice if the regular expression could be made less specific, but some features of your data format make that tricky (e.g. the fact that the value following "=>" can be a zero-length string).
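For illustration only, here is a rough sketch of what a less specific parser might look like. It assumes that keys are plain \w+ words, that pairs are separated by whitespace, and that a value never contains something that looks like " word=>". The parse_line helper is a made-up name, not part of the solution above.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one "Key=>value Key=>value ..." line into a hash ref.
# Assumes keys match \w+ and values never contain whitespace followed by "word=>".
sub parse_line {
    my ($line) = @_;
    my %record;
    # Each value runs (lazily) up to the next " key=>" or to the end of the line,
    # so an empty value such as "Missing=>" yields an empty string.
    while ($line =~ /(\w+)=>(.*?)(?=\s+\w+=>|\s*\z)/g) {
        $record{$1} = $2;
    }
    return \%record;
}

my $rec = parse_line('Collection=>162451 ImageCount=>6 Status=>SC Missing=> Modified=>01/20/2012 11:16:38');
printf "%s=[%s]\n", $_, $rec->{$_} for sort keys %$rec;
```

The lookahead is what keeps the space inside the timestamp from ending the value early: "11:16:38" is not followed by "=>", so the lazy match keeps going until the end of the line.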

Replies are listed 'Best First'.
Re^2: Extracting information from file to Hash
by bart (Canon) on Jan 18, 2012 at 22:07 UTC
    It would be nice if the regular expression could be made less specific, but some features of your data format make that tricky (e.g. the fact that the value following "=>" can be a zero-length string).
    However, I think it's pretty much guaranteed that there will be whitespace between the key/value pairs, yet you use /\s*/. Also, nowhere do I see it specified that there may even be whitespace around the "=>"; you just made that up. As this file looks to be computer generated, I sincerely doubt that this will ever be the case. Finally, the only place where I see whitespace inside a column value is in the final column of the line: the timestamp.

    In short: I think this regex will do:

    /^Collection=>(\S*) \s+ ImageCount=>(\S*) \s+ Status=>(\S*) \s+ Missing=>(\S*) \s+ Modified=>(.*\S) /x

    And if you do this:

    my %r = /^ (Collection)=>(\S*) \s+ (ImageCount)=>(\S*) \s+ (Status)=>(\S*) \s+ (Missing)=>(\S*) \s+ (Modified)=>(.*\S) /x;
    you even get a nice hash record out of it, though it is restricted to one match per line (with /g, the match in list context would behave differently).
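As a small self-contained check of the hash trick above, the following sketch applies the same regex to one of the sample lines. It matches against an explicit $line variable rather than $_, which is an assumption made only to keep the example standalone.

```perl
use strict;
use warnings;

my $line = 'Collection=>168245 ImageCount=>6 Status=>SI Missing=>1,3 Modified=>01/18/2012 11:14:30';

# In list context the match returns its captures in order: key, value, key, value, ...
# Assigning that list to a hash pairs them up. A failed match leaves %r empty.
my %r = $line =~ /^ (Collection)=>(\S*) \s+ (ImageCount)=>(\S*) \s+ (Status)=>(\S*) \s+ (Missing)=>(\S*) \s+ (Modified)=>(.*\S) /x;

print "$_ => $r{$_}\n" for sort keys %r;
```

Because a non-matching line yields an empty hash, a caller can test for success with something like "next unless %r;" before storing the record.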
