PerlMonks  

Extracting information from file to Hash

by hok_si_la (Curate)
on Jan 18, 2012 at 21:31 UTC (#948626=perlquestion)
hok_si_la has asked for the wisdom of the Perl Monks concerning the following question:

Good localtime friends,

Given a large 1000+ line file with the following format (flexible here):
    Collection=>168245 ImageCount=>6 Status=>SI Missing=>1,3 Modified=>01/18/2012 11:14:30
    Collection=>161745 ImageCount=>6 Status=>I Missing=>2,3 Modified=>01/18/2012 11:16:38
    Collection=>162451 ImageCount=>6 Status=>SC Missing=> Modified=>01/20/2012 11:16:38
    Collection=>117481 ImageCount=>8 Status=>C Missing=> Modified=>01/18/2011 7:16:38
    ...
What would be the best way to extract all the information into a hash so that I can perform sorts based on Collection, Status, and Modified?

Some caveats:
1) Collection would be the db equivalent of a unique id.
2) I cannot use a database or non-core modules.
3) I will dynamically build web pages displaying collection information in table rows, based on a timeframe the user requests, so I will have to play with dates a bit.
4) The file in the format above is created in the background, so anything I can do to speed up fetching/sorting (changes to the file format, etc.) would be worth considering.

Thanks for your advice,
hok

Re: Extracting information from file to Hash
by JavaFan (Canon) on Jan 18, 2012 at 21:47 UTC
    What would be the best way to extract all the information into a hash so that I can perform sorts based on Collection, Status, and Modified?
    If you want to sort, hashes just get in your way. I'd use an array of arrays, with each line represented by a 5-element array: collection id, image count, status, missing, and modified. Depending on what you need to do with the dates, I'd keep them as is, or convert them to UTC timestamps.
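A minimal sketch of the array-of-arrays idea, using only core modules. The sample lines, the column constants, and the choice of `timegm` (which treats the timestamps as UTC) are assumptions for illustration; the date parsing assumes MM/DD/YYYY, as the sample data suggests.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);    # core module

# Hypothetical sample lines in the format from the question.
my @lines = (
    'Collection=>168245 ImageCount=>6 Status=>SI Missing=>1,3 Modified=>01/18/2012 11:14:30',
    'Collection=>161745 ImageCount=>6 Status=>I Missing=>2,3 Modified=>01/18/2012 11:16:38',
    'Collection=>117481 ImageCount=>8 Status=>C Missing=> Modified=>01/18/2011 7:16:38',
);

# Column positions within each row (names are illustrative).
use constant { COLLECTION => 0, IMAGECOUNT => 1, STATUS => 2, MISSING => 3, MODIFIED => 4 };

my @rows;
for my $line (@lines) {
    # One capture per column; skip lines that do not match.
    my @f = $line =~ /^Collection=>(\S*)\s+ImageCount=>(\S*)\s+Status=>(\S*)\s+Missing=>(\S*)\s+Modified=>(.*\S)/
        or next;

    # Assumption: dates are MM/DD/YYYY hh:mm:ss and can be treated as UTC.
    my ($mon, $day, $year, $h, $m, $s) =
        $f[MODIFIED] =~ m{^(\d+)/(\d+)/(\d+)\s+(\d+):(\d+):(\d+)\s*\z}
        or next;
    $f[MODIFIED] = timegm($s, $m, $h, $day, $mon - 1, $year);    # epoch seconds

    push @rows, \@f;
}

# Oldest first: a plain numeric comparison on the timestamp column.
my @by_modified = sort { $a->[MODIFIED] <=> $b->[MODIFIED] } @rows;

print join("\t", @$_), "\n" for @by_modified;
```

Sorting on another column is the same one-liner with a different index, using `cmp` for the Status strings and `<=>` for the numeric columns.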
Re: Extracting information from file to Hash
by tobyink (Abbot) on Jan 18, 2012 at 21:52 UTC

    Here's a quick stab at a solution. It works with your example data...

    use strict;
    use Data::Dumper;

    my %hash;

    # One pattern for the whole line. Every value is optional because a
    # field such as Missing may be empty.
    my $regexp = qr{ ^ \s*
        Collection \s* => \s* (\d+)?             \s*
        ImageCount \s* => \s* (\d+)?             \s*
        Status     \s* => \s* (\w+)?             \s*
        Missing    \s* => \s* ([\d,]+)?          \s*
        Modified   \s* => \s* ([\d/]+\s[\d:]+)?  \s*
    $}x;

    while (defined(my $line = <DATA>)) {
        chomp $line;
        my %linehash;
        if ($line =~ $regexp) {
            %linehash = (
                Collection => $1,
                ImageCount => $2,
                Status     => $3,
                Missing    => $4,
                Modified   => $5,
            );
        }
        next unless defined $linehash{Collection};

        # Index each record by its unique Collection id.
        $hash{ $linehash{Collection} } = \%linehash;
    }

    my @sorted_by_status = sort { $a->{Status} cmp $b->{Status} } values %hash;
    print Dumper \@sorted_by_status;

    __DATA__
    Collection=>168245 ImageCount=>6 Status=>SI Missing=>1,3 Modified=>01/18/2012 11:14:30
    Collection=>161745 ImageCount=>6 Status=>I Missing=>2,3 Modified=>01/18/2012 11:16:38
    Collection=>162451 ImageCount=>6 Status=>SC Missing=> Modified=>01/20/2012 11:16:38
    Collection=>117481 ImageCount=>8 Status=>C Missing=> Modified=>01/18/2011 7:16:38

    It would be nice if the regular expression could be made less specific, but some features of your data format make that tricky (e.g. the fact that the value following "=>" can be a zero-length string).
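One possible way to make the pattern less specific, offered only as a sketch: match any `Key=>value` pair and let the value run up to the next `Word=>` or the end of the line, so empty values and values with spaces both work. The helper name `parse_pairs` is hypothetical, and the approach assumes that no value itself contains something that looks like ` Word=>`.

```perl
use strict;
use warnings;

# Hypothetical helper: parse any "Key=>value Key=>value ..." line without
# naming the keys in the pattern.
sub parse_pairs {
    my ($line) = @_;
    my %rec;

    # A value runs (non-greedily) up to the next " Word=>" or the end of
    # the line, so it may be empty or contain spaces.
    while ($line =~ /(\w+)=>(.*?)(?=\s+\w+=>|\s*\z)/g) {
        $rec{$1} = $2;
    }
    return \%rec;
}

my $rec = parse_pairs(
    'Collection=>162451 ImageCount=>6 Status=>SC Missing=> Modified=>01/20/2012 11:16:38'
);
print "$_ = '$rec->{$_}'\n" for sort keys %$rec;
```

The trade-off is that this version no longer validates the line: a malformed record simply yields fewer keys, so the caller would have to check that the expected keys are present.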

      It would be nice if the regular expression could be made less specific, but some features of your data format make that tricky (e.g. the fact that the value following "=>" can be a zero-length string).
      However, I think it's pretty much guaranteed that there will be whitespace between the key/value pairs, yet you use /\s*/. Also, nowhere do I see it specified that there may even be whitespace around the "=>"; you just made that up. As this file looks to be computer generated, I sincerely doubt that this will ever be the case. Finally, the only place I see whitespace inside a column value is in the final column of the line: the timestamp.

      In short: I think this regex will do:

      /^Collection=>(\S*) \s+ ImageCount=>(\S*) \s+ Status=>(\S*) \s+ Missing=>(\S*) \s+ Modified=>(.*\S) /x

      And if you do this:

      my %r = /^ (Collection)=>(\S*) \s+ (ImageCount)=>(\S*) \s+ (Status)=>(\S*) \s+ (Missing)=>(\S*) \s+ (Modified)=>(.*\S) /x;
      you even get a nice hash record out of it, although this is restricted to one match per line (with /g in list context you would get the captures of every match back, which behaves differently).
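To illustrate the capture-to-hash trick above, here is a small sketch. The sample line is made up to follow the question's format, and the match is bound to `$line` rather than `$_`; everything else is the pattern as given.

```perl
use strict;
use warnings;

# Hypothetical sample line in the format from the question.
my $line = 'Collection=>162451 ImageCount=>6 Status=>SC Missing=> Modified=>01/20/2012 11:16:38';

# Each (key)=>(value) pair contributes two captures, so the list returned
# by the match alternates key, value, key, value, ... and can be assigned
# straight to a hash.
my %r = $line =~ /^ (Collection)=>(\S*) \s+
                    (ImageCount)=>(\S*) \s+
                    (Status)=>(\S*)     \s+
                    (Missing)=>(\S*)    \s+
                    (Modified)=>(.*\S) /x;

# A failed match returns an empty list, leaving the hash empty.
die "line did not match\n" unless %r;

print "$_ = '$r{$_}'\n" for sort keys %r;
```

An empty field such as Missing comes back as an empty string rather than undef, because `(\S*)` still participates in the match.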
Re: Extracting information from file to Hash
by BrowserUk (Pope) on Jan 18, 2012 at 22:15 UTC

    If you use an array of hashes, the sorting becomes quite intuitive:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    my @data;
    while( <DATA> ) {
        my( $col, $cnt, $stat, $miss, $mod ) = m[
            ^
            Collection=>(\d+) \s+
            ImageCount=>(\d+) \s+
            Status=>(\w+ ) \s+
            Missing=>( [1-9,]+ )? \s+
            Modified=>( .+ )
            $
        ]x or warn "Bad format at line $.\n" and next;

        ## The sample dates are month/day/year
        my( $mon, $day, $year, $hrs, $min, $sec )
            = $mod =~ m[(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)]
            or warn "Bad date format in line $." and next;

        push @data, {
            Collection => $col,
            ImageCount => $cnt,
            Status     => $stat,
            Missing    => [ split ',', $miss||'' ],
            ## Zero-padded yyyy/mm/dd hh:mm:ss sorts correctly as a string
            Modified   => sprintf(
                "%4d/%02d/%02d %02d:%02d:%02d",
                $year, $mon, $day, $hrs, $min, $sec
            ),
        };
    }

    ##pp 'original order', \@data;

    ## Sort data by image count descending and modified date ascending
    my @ordered = sort {
        $data[ $b ]{ ImageCount } <=> $data[ $a ]{ ImageCount }
     || $data[ $a ]{ Modified } cmp $data[ $b ]{ Modified }
    } 0 .. $#data;

    print 'Sorted by Image count descending and modified date ascending';
    pp $data[ $_ ] for @ordered;

    __DATA__
    Collection=>168245 ImageCount=>6 Status=>SI Missing=>1,3 Modified=>01/18/2012 11:14:30
    Collection=>161745 ImageCount=>6 Status=>I Missing=>2,3 Modified=>01/18/2012 11:16:38
    Collection=>162451 ImageCount=>6 Status=>SC Missing=> Modified=>01/20/2012 11:16:38
    Collection=>117481 ImageCount=>8 Status=>C Missing=> Modified=>01/18/2011 7:16:38

    Produces:

    C:\test>junk53
    Sorted by Image count descending and modified date ascending
    {
      Collection => 117481,
      ImageCount => 8,
      Missing    => [],
      Modified   => "2011/01/18 07:16:38",
      Status     => "C",
    }
    {
      Collection => 168245,
      ImageCount => 6,
      Missing    => [1, 3],
      Modified   => "2012/01/18 11:14:30",
      Status     => "SI",
    }
    {
      Collection => 161745,
      ImageCount => 6,
      Missing    => [2, 3],
      Modified   => "2012/01/18 11:16:38",
      Status     => "I",
    }
    {
      Collection => 162451,
      ImageCount => 6,
      Missing    => [],
      Modified   => "2012/01/20 11:16:38",
      Status     => "SC",
    }
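For the timeframe requirement in the question, one option is to filter an array of hashes with plain string comparisons. This is only a sketch: the records and the helper name `in_timeframe` are hypothetical, and it assumes Modified has already been normalised to a zero-padded year/month/day string, which sorts chronologically as text.

```perl
use strict;
use warnings;

# Hypothetical records with Modified already normalised to a zero-padded
# "YYYY/MM/DD hh:mm:ss" string.
my @data = (
    { Collection => 168245, Status => 'SI', Modified => '2012/01/18 11:14:30' },
    { Collection => 161745, Status => 'I',  Modified => '2012/01/18 11:16:38' },
    { Collection => 162451, Status => 'SC', Modified => '2012/01/20 11:16:38' },
    { Collection => 117481, Status => 'C',  Modified => '2011/01/18 07:16:38' },
);

# Hypothetical helper: records modified within [$from, $to], oldest first.
sub in_timeframe {
    my ($from, $to, @records) = @_;
    return sort { $a->{Modified} cmp $b->{Modified} }
           grep { $_->{Modified} ge $from && $_->{Modified} le $to } @records;
}

my @hits = in_timeframe('2012/01/18 00:00:00', '2012/01/18 23:59:59', @data);
print "$_->{Collection}\t$_->{Status}\t$_->{Modified}\n" for @hits;
```

Because the comparison is purely textual, no date module is needed at query time; the cost of normalising the date is paid once, when each line is parsed.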

