PerlMonks
Parsing a Large file with no reason

by mrras25 (Acolyte)
on Jan 28, 2010 at 19:17 UTC (#820221=perlquestion)
mrras25 has asked for the wisdom of the Perl Monks concerning the following question:

I need help with parsing a file and placing certain elements into a hash of hashes. The file I am parsing is always different, depending on which host it was pulled from; essentially it is the output of lsattr, with an lsdev -C command run after it. All I care about are the lines that read like:

.............
---- lsattr -El vgbkup ----
auto_on y N/A True
conc_auto_on n N/A True
conc_capable n N/A True
gbl_pbufs_ppv 0 N/A True
gbl_pbufs_pvg 0 N/A True
timestamp 49188ced308665bd N/A True
vg_pbufs_ppv 0 N/A True
vgserial_id 00c0b42000004c000000011a5f8b013e N/A False
---- lsattr -El vgfrd01 ----
auto_on y N/A True
conc_auto_on n N/A True
conc_capable n N/A True
gbl_pbufs_ppv 0 N/A True
gbl_pbufs_pvg 0 N/A True
timestamp 4ae7197814a6d91a N/A True
vg_pbufs_ppv 0 N/A True
vgserial_id 00c0b42000004c000000011a0d27a4bc N/A False
..............
---- lsattr -El loglv02 ----
copies 1 N/A True
inter m N/A True
intra m N/A True
label None N/A True
lvserial_id 00c0b42000004c000000011a5f8b013e.2 N/A False
relocatable y N/A True
size 1 N/A True
strictness y N/A True
stripe_size N/A True
stripe_width 0 N/A True
type jfs2log N/A True
upperbound 64 N/A True
---- lsattr -El lvbackup ----
copies 1 N/A True
inter x N/A True
intra im N/A True
label /backup N/A True
lvserial_id 00c0b42000004c000000011a5f8b013e.1 N/A False
relocatable y N/A True
size 1200 N/A True
strictness y N/A True
stripe_size N/A True
stripe_width 0 N/A True
type jfs2 N/A True
upperbound 64 N/A True
...........

The hash would go something like this:

$host_info{VG}{$vgname}{$vgid}{LV}=$lvlabel;

I need the vgname, which is vgfrd01, and the vgserial_id, which is 00c0b42000004c000000011a0d27a4bc. Then I need to map the label of each lv to the matching vgserial number, which is the same id with a dot-suffixed reference at the end.
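For concreteness, here is the target structure hand-filled from the sample data above (my sketch: lvbackup's lvserial_id, 00c0b42000004c000000011a5f8b013e.1, shares its base id with vgbkup's vgserial_id, so /backup maps to that vg):

```perl
use strict;
use warnings;
use Data::Dumper;

my %host_info;

# Hand-built from the sample data: the part of lvserial_id before the
# dot matches a vgserial_id, which is what ties an LV label to its VG.
$host_info{VG}{vgbkup}{'00c0b42000004c000000011a5f8b013e'}{LV} = '/backup';

print Dumper \%host_info;
```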

Re: Parsing a Large file with no reason
by afoken (Parson) on Jan 28, 2010 at 19:32 UTC

    Is this a job offer or did you just forget to show us what you tried so far? For a job offer, you forgot to tell us how much you are willing to pay, and you posted in the wrong place.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Not a job offer - I'm already on the job. The only thing I can come up with is doing it in batches: first grab the lv's, parse out the lvgroup_id, and place that into a hash; then step through the hash, split the lvgroup_id, and step through the file again to get the vggroup_id. But I still think there must be an easier way. I was hoping to learn how to find a line and step back up a few lines to get the information I need, or to find the line "-El lv" and then, until the next "-El lv", do such-and-such - but I am not sure of the right syntax.

        mrras25:

If you really want to step back a few lines, then you can just keep a buffer of the last few lines read. However, I'd suggest just parsing out the elements as you find them, and then insert them when you determine they're "interesting". If you find that it's not an interesting record, clear your list of elements and keep on going. Something like this [1]:

my %largerHash;   # Place to accumulate interesting records
my %elements;     # Place to accumulate data into records

while (<DATA>) {
    if (/^(yabba|dabba|doo)\s+(.*)/) {
        # We only care about some of the fields
        $elements{$1} = $2;
    }
    elsif (/End of record ID:\s+(.*)/) {
        if ($1 =~ /foo/) {
            # Interesting record (starts with foo) so, store it
            $largerHash{$1} = { %elements };
        }
        # Since we found end of record, clear our workspace
        %elements = ();
    }
}

__DATA__
Record 1
scooby 7
dooby 8
yabba Fred
dabba Wilma
End of record ID: cupcake
Record 2
doo not fold spindle staple or mutilate
dabba Barney
yabba Dino
End of record ID: foobar

        In this example, we collect a couple of fields in record 1, but at the end of the record, we find that nothing was interesting, so we discard the elements we collected. Then we collect more items and at the end of the record, we find that it's interesting, so we add the elements to the larger hash that you want to process after parsing the data.
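The "keep a buffer of the last few lines" idea mentioned above could be sketched like this (my illustration, reading from an in-memory sample so it stands alone; names are mine):

```perl
use strict;
use warnings;

my $sample = <<'END';
---- lsattr -El vgbkup ----
auto_on y N/A True
timestamp 49188ced308665bd N/A True
vgserial_id 00c0b42000004c000000011a5f8b013e N/A False
END

open my $fh, '<', \$sample or die $!;   # stands in for the real file handle

my @buffer;      # sliding window of the most recent lines
my $keep = 10;   # how far back we might need to look
my @found;       # recovered (vg header line, serial id) pairs

while (my $line = <$fh>) {
    chomp $line;
    push @buffer, $line;
    shift @buffer if @buffer > $keep;   # drop the oldest line

    # When the interesting line arrives, the earlier lines are still in
    # @buffer: $buffer[-2] is the previous line, and so on.
    if ($line =~ /^vgserial_id\s+(\S+)/) {
        my ($header) = grep { /-El\s+vg/ } @buffer;
        push @found, [ $header, $1 ] if defined $header;
    }
}

print "$_->[0] => $_->[1]\n" for @found;
```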

        Note 1: Untested and quite possibly bad syntax, as I've been wrestling a bunch of .Net and C++ code for the last couple of weeks.

        ...roboticus

        Insert witty banter here.

Re: Parsing a Large file with no reason
by pileofrogs (Priest) on Jan 28, 2010 at 21:59 UTC

    I don't know if I understand your question, but you could grab the vgname and stick it in a variable like $now_vgname, do the same with the label, and then build your hash when you get the id. E.g.:

    ...
    my ($now_vgname, $now_label);
    while ( <$input_handle> ) {
        if ( /lsattr -El (\w+)/ ) {
            $now_vgname = $1;
            next;
        }
        if ( /label\s+([\/\w]+)/ ) {
            $now_label = $1;
            next;
        }
        if ( /lvserial_id\s+([\w\.]+)/ ) {
            $host_info{VG}->{$now_vgname}->{$1}->{LV} = $now_label;
            next;
        }
    }

    This code is just to demonstrate the idea, it probably won't work as is.

    --Pileofrogs

Re: Parsing a Large file with no reason
by mrras25 (Acolyte) on Jan 28, 2010 at 23:23 UTC

    I was just trying to bounce some ideas off people - however, this is what I came up with. It's crude and runs slowly on large files (the one file I am testing against is 80,000+ lines long). If someone sees something I can do differently, I am open to suggestions.

    #!/usr/bin/perl
    use strict;
    use warnings;
    no warnings 'uninitialized';
    use Data::Dumper;
    use Tie::File;

    my $base = $ARGV[0];
    open(FILE, $base) || die "Unable to locate file: $!\n";
    my (@searray, @flarray);
    tie(@flarray, 'Tie::File', $base);

    while(<FILE>) {
        my ($start, $end);
        chomp;
        if($_ =~ /-El\s+vg/../vgserial_id/) {
            $start = (split /\s+/, $_)[3] if($_ =~ /-El/);
            $end   = (split /\s+/, $_)[1] if($_ =~ /vgserial_id/);
        }
        if(defined $start) { push(@searray, $start); } else { $start = ''; }
        if(defined $end)   { push(@searray, $end);   } else { $end   = ''; }
    }

    my %hash_ref = @searray;
    #print Dumper \%hash_ref;

    foreach my $hkey (keys %hash_ref) {
        my $hvalue = $hash_ref{$hkey};
        my $count = 0;
        for (my $i = 0; $i < @flarray; $i++) {
            next unless $flarray[$i] =~ /$hvalue/;
            next if($flarray[$i] =~ /vgserial_id/);
            my ($mc, $lvsip) = (($i-1), ($i-5));
            my $mount = (split /\s+/, $flarray[$mc])[1];
            my $lvnam = (split /\s+/, $flarray[$lvsip])[3];
            next if($mount =~ /None/);
            my $size  = (split /\s+/, $flarray[$i+2])[1];  # size line sits two lines below lvserial_id
            print "$i: VG: $hkey : MOUNT: $mount : LV_name: $lvnam : SIZE: $size\n";
        }
    }
      It is so slow on large files because for each matching record you loop through all 80,000 lines; so if you had 4000 matching records, you would have 4000 * 80000 = 320,000,000 iterations. I think there must be a better method.

      And I don't know if you can 'tie' the same file (as an array) while opening it for reading (both at the same time).
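One way to avoid those rescans is to index the file in a single pass: record each vg's serial id and each lv's base id in hashes, then join the two with hash lookups. A sketch of that idea (mine, not from the thread), run against a tiny in-memory sample:

```perl
use strict;
use warnings;

my $sample = <<'END';
---- lsattr -El vgbkup ----
vgserial_id 00c0b42000004c000000011a5f8b013e N/A False
---- lsattr -El lvbackup ----
label /backup N/A True
lvserial_id 00c0b42000004c000000011a5f8b013e.1 N/A False
END

open my $fh, '<', \$sample or die $!;   # stands in for the real file

my (%vg_of, %lv_label);                 # vg id -> vg name; base lv id -> label
my ($cur_name, $cur_label);

# One pass over the file collects everything we need.
while (my $line = <$fh>) {
    if ($line =~ /-El\s+(\S+)/)               { $cur_name  = $1; next }
    if ($line =~ /^label\s+(\S+)/)            { $cur_label = $1; next }
    if ($line =~ /^vgserial_id\s+(\S+)/)      { $vg_of{$1} = $cur_name; next }
    if ($line =~ /^lvserial_id\s+(\S+)\.\d+/) { $lv_label{$1} = $cur_label }
}

# A hash lookup per lv replaces a full rescan of all 80,000 lines.
for my $id (sort keys %lv_label) {
    next unless exists $vg_of{$id};
    print "VG $vg_of{$id} ($id) -> LV label $lv_label{$id}\n";
}
```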

      Note that I set the input record separator, $/, to "---- lsattr " (with a space following lsattr) to read one record at a time.
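The effect of a multi-character $/ can be seen in isolation: each read returns everything up to and including the separator string, and chomp then strips it. A minimal sketch (my toy data):

```perl
use strict;
use warnings;

my $text = "---- lsattr -El aaa ----\nfoo 1\n---- lsattr -El bbb ----\nbar 2\n";
open my $fh, '<', \$text or die $!;   # read from an in-memory string

my @records;
{
    local $/ = "---- lsattr ";        # one lsattr block per read
    while (my $rec = <$fh>) {
        chomp $rec;                    # chomp removes the trailing "---- lsattr "
        push @records, $rec;
    }
}

# The first "record" is whatever precedes the first separator (here, empty),
# which is why code using this trick skips records not starting with -El.
print scalar(@records), " reads\n";
```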

      Not seeing more sample data, I made a guess at what might work and it did work with your sample data. But again, it's difficult to tell.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Data::Dumper;

      my $base = $ARGV[0] or die "Must supply a filename to open. $!";
      open my $fh, "<", $base or die "Unable to locate file: $!\n";

      my %data;
      {
          local $/ = "---- lsattr ";
          while (<$fh>) {
              chomp;
              next unless /^-El\s+(\S+)/;
              my $vg = $1;
              next unless /^label\s+(\S+)/m;
              my $label = $1;
              next if $label eq "None";
              next unless /^lvserial_id\s+(\S+)/m;
              my $name = $1;
              next unless /^size\s+(\d+)/m;
              my $size = $1;
              @{ $data{ $vg } }{ qw/ label name size / } = ($label, $name, $size);
              #print "VG: $vg : MOUNT: $label : LV_name: $name : SIZE: $size\n";
          }
      }
      print Dumper \%data;
      Update: The data structure created above will only work if there is exactly one record for each sought key ($vg). If there is more than one record with the same key, the structure will contain only the last fields parsed from the file, silently giving you incorrect results.

      That said, I would need to know more about your file to be able to suggest a suitable data structure.
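One conventional fix for the overwrite problem (my sketch, not from the thread) is to push each record onto an array ref keyed by the vg name, so duplicates accumulate instead of clobbering each other:

```perl
use strict;
use warnings;

my %data;

# Two parsed records sharing the same vg key, as repeated blocks would produce:
for my $rec ( { vg => 'vgbkup', label => '/backup',  size => 1200 },
              { vg => 'vgbkup', label => '/backup2', size => 300  } ) {
    # push keeps every record; a plain hash-slice assignment would
    # silently keep only the last one
    push @{ $data{ $rec->{vg} } }, { label => $rec->{label}, size => $rec->{size} };
}

printf "%d records kept for vgbkup\n", scalar @{ $data{vgbkup} };
```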

Re: Parsing a Large file with no reason
by BrowserUk (Pope) on Jan 29, 2010 at 02:58 UTC

    This will extract the id & label from each block much more quickly than your code, by reading one block at a time. I didn't understand your hash, so I've left that to you.

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    $/ = "\n----";  ## read 1 multiline record at a time

    while( <DATA> ) {
        my( $label, $id ) = m[ -El \s+ (\S+) .+? serial_id \s+ ( \S+ ) ]xms
            or die 'Bad data';
        print "$label : $id";
    }

    __DATA__
    ---- lsattr -El vgbkup ----
    auto_on y N/A True
    conc_auto_on n N/A True
    conc_capable n N/A True
    gbl_pbufs_ppv 0 N/A True
    gbl_pbufs_pvg 0 N/A True
    timestamp 49188ced308665bd N/A True
    vg_pbufs_ppv 0 N/A True
    vgserial_id 00c0b42000004c000000011a5f8b013e N/A False
    ---- lsattr -El vgfrd01 ----
    auto_on y N/A True
    conc_auto_on n N/A True
    conc_capable n N/A True
    gbl_pbufs_ppv 0 N/A True
    gbl_pbufs_pvg 0 N/A True
    timestamp 4ae7197814a6d91a N/A True
    vg_pbufs_ppv 0 N/A True
    vgserial_id 00c0b42000004c000000011a0d27a4bc N/A False
    ---- lsattr -El loglv02 ----
    copies 1 N/A True
    inter m N/A True
    intra m N/A True
    label None N/A True
    lvserial_id 00c0b42000004c000000011a5f8b013e.2 N/A False
    relocatable y N/A True
    size 1 N/A True
    strictness y N/A True
    stripe_size N/A True
    stripe_width 0 N/A True
    type jfs2log N/A True
    upperbound 64 N/A True
    ---- lsattr -El lvbackup ----
    copies 1 N/A True
    inter x N/A True
    intra im N/A True
    label /backup N/A True
    lvserial_id 00c0b42000004c000000011a5f8b013e.1 N/A False
    relocatable y N/A True
    size 1200 N/A True
    strictness y N/A True
    stripe_size N/A True
    stripe_width 0 N/A True
    type jfs2 N/A True
    upperbound 64 N/A True

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
