Pulling specific data from a large text file

TStanley has asked for the wisdom of the Perl Monks concerning the following question:

I have been tasked with retrieving some information from three different files. Each file is the output from a script that collects basic information from specific directories/files (the system in question is a Stratus V Series server, running VOS). The info I need to collect from all three is the same, but there are some differences in the actual output files. An example of the file that I am reading from is below:

operator.ccstores logged in on %demoulas_prod#m1 at 14-06-09 11:01:06 EDT.

Welcome.


set_terminal_parameters: Invalid I/O control opcode specified.
>ccdem>testops>dfs01.cm
change_current_dir >ccdem>files

ls -all -full

Files: 61, Blocks: 449820

w        514 seq       03-08-14 00:05:28  binfile
w          3 seq       14-01-08 04:45:49  caldar-master
w          3 seq       14-06-08 20:16:26  card-master
w          3 seq       97-05-20 06:49:15  ccfields
w        444 seq       14-06-08 20:15:59  charge-master
w        218 seq       14-06-08 00:13:41  chkovr3-in
w          2 stm       14-06-02 11:30:44  chkstr-in
w       1590 seq       14-06-08 20:16:22  comments
w          0 seq       96-11-12 13:32:42  doncert-empty
w          0 stm       96-11-12 13:32:42  doncert-old
w          3 seq       14-06-04 14:18:25  eft-payroll-work
w       2342 seq       14-06-08 00:19:47  full-neg-file
w          1 seq       14-06-08 22:38:58  GCP.out
w       6780 seq       14-06-06 13:57:24  GCP.out-old
w          1 stm       14-06-08 08:25:36  gftcdtot
w        178 rel-71    14-06-08 20:16:38  gift-balance
w        868 rel-100   14-06-08 22:38:35  gift-bulk
w         19 stm       14-06-08 18:31:48  gift-detail
w       2233 rel-217   14-06-08 20:16:26  gift-donated
w         32 rel-54    08-06-23 22:58:44  gift-invalid
w         51 rel-30    14-05-31 00:53:36  gift-journal
w       3090 rel-177   14-06-08 22:38:35  gift-name
w          5 stm       12-11-27 22:24:52  gift-new-detail
w          6 seq       14-06-08 18:40:27  gift080-out
w          6 seq       14-06-08 18:40:26  gift080-work
w     108568 rel-64    14-06-08 20:16:42  giftcard-hist
w          2 stm       14-06-04 14:18:35  gifts-in
w          2 stm       12-11-09 09:46:50  gifts-in-good
w          2 stm       14-06-04 12:02:37  gifts-in-old
w          2 stm       14-05-28 16:32:35  gifts-in-older
w          2 stm       14-05-28 15:01:49  gifts-in-oldest
w       2581 seq       14-06-02 08:54:36  hold-print-req
w          1 seq       09-12-01 07:01:00  limit-file

Directories: 0

Links: 22

14-06-03 17:01:50  CClog.14-05-26 -> %demoulas_prod#d02>ccdem>CClog.14-05-26
14-06-03 17:01:50  CClog.14-05-27 -> %demoulas_prod#d02>ccdem>CClog.14-05-27
14-06-03 17:01:50  CClog.14-05-28 -> %demoulas_prod#d02>ccdem>CClog.14-05-28
14-06-03 17:01:50  CClog.14-05-29 -> %demoulas_prod#d02>ccdem>CClog.14-05-29
14-06-03 17:01:50  CClog.14-05-30 -> %demoulas_prod#d02>ccdem>CClog.14-05-30
14-06-03 17:01:50  CClog.14-05-31 -> %demoulas_prod#d02>ccdem>CClog.14-05-31
14-06-03 17:01:50  CClog.14-06-01 -> %demoulas_prod#d02>ccdem>CClog.14-06-01
14-06-03 17:01:50  CClog.14-06-02 -> %demoulas_prod#d02>ccdem>CClog.14-06-02
14-06-03 17:01:50  CClog.14-06-03 -> %demoulas_prod#d02>ccdem>CClog.14-06-03
14-06-03 17:01:50  CClog.14-06-04 -> %demoulas_prod#d02>ccdem>CClog.14-06-04
14-06-04 01:20:11  CClog.14-06-05 -> %demoulas_prod#d02>ccdem>CClog.14-06-05
14-06-05 01:20:12  CClog.14-06-06 -> %demoulas_prod#d02>ccdem>CClog.14-06-06


dfs gift-bulk -count_keys
name:                      %demoulas_prod#d01>ccdem>files>gift-bulk
file organization:         relative file
last used at:              14-06-09 10:43:38 EDT
last modified at:          14-06-08 22:38:35 EDT
last saved at:             14-06-08 20:15:58 EDT
time created:              08-09-25 06:02:41 EDT
transaction file:          yes
safety switch:             no
audit:                     no
dynamic extents:           no
extent size:               1
record size:               100
last record:               18912
blocks used:               472
num indexes:               3
allocation size:           1
mode:                      w
author:                    operator.ccstores
tag type:                  0
tag version:               0
record count:              1
data byte count:           100

 index name:               bulkconf_index
 key components:           1,8
 type:                     embedded_key
 collation:                ascii
 data type:                nonvarying string
 ascending:                yes
 duplicates:               no
 null keys:                no
 extent index:             no
 automatic update:         yes
 dynamic extents:          no
 extent_size:              1
 open options:             
 blocks:                   125
 number of keys:           18911

 index name:               bulkcard_index
 key components:           9,32
 type:                     embedded_key
 collation:                ascii
 data type:                nonvarying string
 ascending:                yes
 duplicates:               yes
 null keys:                yes
 extent index:             no
 automatic update:         yes
 dynamic extents:          no
 extent_size:              1
 open options:             
 blocks:                   269
 number of keys:           18911

 index name:               _deleted_record_index
 dynamic extents:          no
 extent_size:              1
 open options:             
 blocks:                   2
 number of keys:           1

dfs gift-name -count_keys
name:                      %demoulas_prod#d01>ccdem>files>gift-name
file organization:         relative file
last used at:              14-06-09 10:43:38 EDT
last modified at:          14-06-08 22:38:35 EDT
last saved at:             14-06-08 20:15:58 EDT
time created:              09-03-11 05:58:41 EDT
transaction file:          yes
safety switch:             no
audit:                     no
dynamic extents:           no
extent size:               1
record size:               177
last record:               54756
blocks used:               2410
num indexes:               3
allocation size:           1
mode:                      w
author:                    operator.ccstores
tag type:                  0
tag version:               0
record count:              34027
data byte count:           6022779

 index name:               giftname-number-end
 key components:           17,16
 type:                     embedded_key
 collation:                ascii
 data type:                nonvarying string
 ascending:                yes
 duplicates:               no
 null keys:                no
 extent index:             no
 automatic update:         yes
 dynamic extents:          no
 extent_size:              1
 open options:             
 blocks:                   450
 number of keys:           54755

 index name:               giftname-org-key
 key components:           56,40
 type:                     embedded_key
 collation:                ascii
 data type:                nonvarying string
 ascending:                yes
 duplicates:               yes
 null keys:                no
 extent index:             no
 automatic update:         yes
 dynamic extents:          no
 extent_size:              1
 open options:             
 blocks:                   228
 number of keys:           54755

 index name:               _deleted_record_index
 dynamic extents:          no
 extent_size:              1
 open options:             
 blocks:                   2
 number of keys:           1

Please note I truncated the example file, as the original takes up about 20 printed pages. In each of the files, I need to retrieve the following information:

The ls -all -full file listing at the beginning
Name of the file
Record Size
Last Record
Data Byte Count
Index Names (if the file has any)

Here is what I have so far:

#!C:\Perl64\bin\perl
use strict;
use warnings;

my $DFS1= "dfs01.out";
my $DFS2= "dfs02.out";
my $DFS3= "dfs03.out";
my $DFS_Report = "DFS_Report.html";
my ($IN,$OUT);

         
open ($IN,"<","$DFS1") || die "Can not open $DFS1: $!\n";
open ($OUT,">","$DFS_Report") || die "Can not open $DFS_Report: $!\n";

print $OUT "<html>\n<head><title>Stratus V Series DFS Report</title></head>\n<body>\n";

GetDFSdata($DFS1);

open ($IN,"<","$DFS2") || die "Can not open $DFS2: $!\n";

GetDFSdata($DFS2);

open ($IN,"<","$DFS3") || die "Can not open $DFS3: $!\n";

GetDFSdata($DFS3);

print $OUT "</body>\n</html>\n";
close $OUT;
##############################################################
sub GetDFSdata{
  my $report = shift @_;
  my @lsl;
  my $start=qr{^name:\s+(.*)};
  my %Hash;
  my @array;
  my @indexes;
   
  print $OUT "<h2>$report</h2>\n";
   
  #Get the ls -l listing
  while(<$IN>){
    if(/(^w.*)/){
      push @lsl,$1;
    }else{
      next;
    }
  }

  print $OUT "<h3>File Listing</h3>\n";
  foreach my $l(@lsl){
    my @a = split /\s+/,$l;
    print $OUT "$a[5]<br>\n";
  }
  
  #Get the list of file names and indexes
  print $OUT "<h3>File Name and Index List</h3>\n";
  while(<$IN>){
    chomp;
    next if /^operator/;
    next if /^[w|W].*/;
    next if /^ls.*/;
    next if /^[Files|Directories|Links].*/;
    next if /^\d{2}-\d{2}-\d{2}.*/;
    next if /^dfs.*/;
    
    if(m/$start/){
      $Hash{$1}=\@array;
    }elsif(m/^record size:\s+\d+/){
      push @array,$_;
    }elsif(m/^last record:\s+\d+/){
      push @array,$_;
    }elsif(m/^data byte count:\s+\d+/){
      push @array,$_;
    }elsif(m/^\s+index name:\s+(\w.*)/){
      push @indexes,$1;
    }
 
    foreach my $key(keys %Hash){
      print $OUT "File: $key<BR>\n";
      print $OUT "\tRecord Size: $Hash{$key}[0]<BR>\n";
      print $OUT "\tLast Record: $Hash{$key}[1]<BR>\n";
      print $OUT "\tData Byte Count: $Hash{$key}[2]<BR>\n";
      my $str = join ',',@indexes;
      print $OUT "Index Names: $str\n";
    }
  }
  close $IN;
}

The file listing itself prints correctly, but I get no output from the second loop, the one that is supposed to collect the file info and index names.

TStanley
--------
People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell


Re: Pulling specific data from a large text file by tangent (Parson) on Jun 13, 2014 at 01:27 UTC
In your first while(<$IN>) loop you run through each line of the file until the end; after that loop $IN is exhausted, so the second loop has nothing left to read. You need a check that breaks out of the first loop once you have the info you require, something like:

while (<$IN>) {
    if (/(^w.*)/) {
        push @lsl,$1;
    }
    elsif (/^Directories/) {
        last;
    }
}

In your second while(<$IN>) loop you print out the contents of %Hash for each found line: the first time %Hash will have only one item, the second time two, and so on. You need to take the print out of that loop. Also, everything is being added to the single @array and to the single @indexes, so the information of all keys will be printed for each key. Here is a different way to do it:

sub GetDFSdata {
    #...
    #Get the list of file names and indexes
    my $name;
    while(<$IN>){
        chomp;
        next if /^operator/;
        #...etc.
        if (m/$start/) {
            $name = $1;
        }
        elsif (m/^record size:\s+\d+/) {
            $Hash{$name}{'record_size'} = $_;
        }
        elsif (m/^last record:\s+\d+/){
            $Hash{$name}{'last_record'} = $_;
        }
        elsif (m/^data byte count:\s+\d+/) {
            $Hash{$name}{'data_byte_count'} = $_;
        }
        elsif (m/^\s+index name:\s+(\w.*)/) {
            push(@{ $Hash{$name}{'indexes'} }, $1);
        }
    }
    close $IN;
    foreach my $key (keys %Hash) {
        print $OUT "File: $key<BR>\n";
        print $OUT "\tRecord Size: $Hash{$key}{'record_size'}<BR>\n";
        print $OUT "\tLast Record: $Hash{$key}{'last_record'}<BR>\n";
        print $OUT "\tData Byte Count: $Hash{$key}{'data_byte_count'}<BR>\n";
        my $str = join ',', @{ $Hash{$key}{'indexes'} };
        print $OUT "Index Names: $str\n";
    }
}
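
The point about the exhausted filehandle can be seen in isolation with a minimal sketch. It reads from an in-memory filehandle standing in for dfs01.out (the shortened sample text and the helper name read_listing are assumptions made for illustration, not part of the original script) and leaves the first loop at the Directories line, so a second read still finds lines.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for a dfs*.out file.
my $sample = <<'END';
w        514 seq       03-08-14 00:05:28  binfile
w          3 seq       14-01-08 04:45:49  caldar-master

Directories: 0

name:                      %demoulas_prod#d01>ccdem>files>gift-bulk
record size:               100
END

# Collect the "w ..." listing lines and stop at the Directories line,
# so that the rest of the handle is left unread for the caller.
sub read_listing {
    my ($fh) = @_;
    my @listing;
    while (my $line = <$fh>) {
        chomp $line;
        if    ($line =~ /^w\s/)         { push @listing, $line }
        elsif ($line =~ /^Directories/) { last }
    }
    return @listing;
}

open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my @listing = read_listing($fh);
my @rest    = <$fh>;    # whatever a second loop would still be able to read
close $fh;

printf "%d listing lines, %d lines left for the second loop\n",
    scalar @listing, scalar @rest;
```

Without the `last`, the first loop would run to end of file and @rest would be empty, which is the situation described in the original post.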
Re^2: Pulling specific data from a large text file by TStanley (Canon) on Jun 13, 2014 at 15:08 UTC
After implementing your above suggestions, here is what the file name/index list looks like:

File:
Record Size:
Last Record:
Data Byte Count: 68256
Index Names: bulkconf_index, bulkcard_index, _deleted_record_index, giftname-number-end, giftname-org-key, _deleted_record_index, _deleted_record_index, card_index, giftcard-hist-key, giftcard-hist-sold-key, _deleted_record_index, giftbal_index, _deleted_record_index, pr-check-index, _deleted_record_index, card_index, giftdon_number_index, giftdon_cat_index, giftdon_location_index, _deleted_record_index, print_req_index, _deleted_record_index, _deleted_record_index, ob-key, com_index, _deleted_record_index, chg_index1, date_index, store-index, 1, zip_index, vzip_index, gift-number, gift-date-redeemed, _deleted_record_index, reasons-index

It looks like it is collecting all of the index names in the file, but not associating them with a file name. As for the number in the data byte count field, I can't tell where it is coming from, as none of the numbers match it. The output I am trying to get would ultimately look like:

dfs01.out
File Listing
..list of file names here..
File Name and Index List
File: %demoulas_prod#d01>ccdem>files>gift-bulk
Record Size: 100
Last Record: 18912
Data Byte Count: 100
Index Names: bulkconf_index, bulkcard_index, _deleted_record_index

File: %demoulas_prod#d01>ccdem>files>gift-name
Record Size: 177
Last Record: 54756
Data Byte Count: 6022779
Index Names: giftname-number-end, giftname-org-key, _deleted_record_index
etc...
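
One possible explanation for the empty fields, assuming the `next if` filters from the original post were still in place: `[Files|Directories|Links]` is a bracketed character class, not an alternation, so `/^[Files|Directories|Links].*/` skips any line whose first character is one of those letters (or `|`). That would skip the `name:`, `record size:` and `last record:` lines but let `data byte count:` and the indented `index name:` lines through, which fits the symptoms. The sketch below compares the class with a grouped alternation; the sample lines come from the dfs output in the question, and the two sub names are made up for illustration.

```perl
use strict;
use warnings;

# The filter as written in the original post: a character class,
# which matches on the first character only.
sub skipped_by_class {
    return $_[0] =~ /^[Files|Directories|Links].*/ ? 1 : 0;
}

# What was probably intended: a grouped alternation of whole words.
sub skipped_by_alternation {
    return $_[0] =~ /^(?:Files|Directories|Links)\b/ ? 1 : 0;
}

my @lines = (
    'Files: 61, Blocks: 449820',
    'name:                      %demoulas_prod#d01>ccdem>files>gift-bulk',
    'record size:               100',
    'last record:               18912',
    'data byte count:           100',
    ' index name:               bulkconf_index',
);

# Show, for each sample line, which filter would skip it (1) or keep it (0).
for my $line (@lines) {
    printf "%-20.20s class=%d alternation=%d\n",
        $line, skipped_by_class($line), skipped_by_alternation($line);
}
```

If that is what happened, $name was never set, so every value landed under a single empty key and the data byte count shown would simply be the last one in the file.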
Re: Pulling specific data from a large text file by TStanley (Canon) on Jun 13, 2014 at 18:22 UTC
OK, had to make some changes, but I got what I wanted. I am showing just the while loop that has been driving me nuts. Here are the changes:

while(<$IN>){
    chomp;
    if (m/$start/) {
        $name = $1;
        next;
    }
    elsif (m/^record size:\s+(\d+)/) {
        $Hash{$name}{'record_size'} = $1;
        next;
    }
    elsif (m/^last record:\s+(\d+)/){
        $Hash{$name}{'last_record'} = $1;
        next;
    }
    elsif (m/^data byte count:\s+(\d+)/) {
        $Hash{$name}{'data_byte_count'} = $1;
        next;
    }
    elsif (m/^\s+index name:\s+(\w.*)/) {
        push(@{ $Hash{$name}{'indexes'} }, $1);
        next;
    }else{
        next;
    }
}
close $IN;
foreach my $key (keys %Hash) {
    print $OUT "File: $key<BR>\n";
    print $OUT "\tRecord Size: $Hash{$key}{'record_size'}<BR>\n" if defined $Hash{$key}{'record_size'};
    print $OUT "\tLast Record: $Hash{$key}{'last_record'}<BR>\n" if defined $Hash{$key}{'last_record'};
    print $OUT "\tData Byte Count: $Hash{$key}{'data_byte_count'}<BR>\n" if defined $Hash{$key}{'data_byte_count'};
    if (defined $Hash{$key}{'indexes'}){
        my $str = join ', ', @{ $Hash{$key}{'indexes'} };
        print $OUT "Index Names: $str<BR><BR>\n";
    }else{
        print $OUT "<BR><BR>\n";
    }
}
} # end of GetDFSdata

I ran the hash through Data::Dumper and was seeing all of the data, but was having an issue printing. I then noticed that some fields were missing in the data, so I went back and verified that they didn't exist in the actual files. Once I did that, I put in the defined checks and everything worked.
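
For comparison, here is a minimal sketch of the same extraction that keeps the files in the order they appear, since `keys %Hash` returns them in no particular order. It is not the code used in this thread: the sub name parse_dfs, the list-of-hashes layout, and the shortened sample text are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Parse the "dfs <file> -count_keys" blocks from an open filehandle.
# Returns a list of hash references, one per file, in file order.
sub parse_dfs {
    my ($fh) = @_;
    my (@files, $current);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^name:\s+(.*\S)/) {
            # A "name:" line starts a new file record.
            $current = { name => $1, indexes => [] };
            push @files, $current;
        }
        elsif (!defined $current) {
            next;    # still in the login banner or the ls listing
        }
        elsif ($line =~ /^(record size|last record|data byte count):\s+(\d+)/) {
            my ($field, $value) = ($1, $2);
            $field =~ tr/ /_/;    # "record size" becomes "record_size"
            $current->{$field} = $value;
        }
        elsif ($line =~ /^\s+index name:\s+(\S+)/) {
            push @{ $current->{indexes} }, $1;
        }
    }
    return @files;
}

# Hypothetical, shortened sample in the shape of the dfs output above.
my $sample = <<'END';
dfs gift-bulk -count_keys
name:                      %demoulas_prod#d01>ccdem>files>gift-bulk
record size:               100
last record:               18912
data byte count:           100

 index name:               bulkconf_index
 key components:           1,8
 index name:               bulkcard_index
 index name:               _deleted_record_index
END

open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my @files = parse_dfs($fh);
close $fh;

for my $file (@files) {
    print "File: $file->{name}\n";
    for my $field (qw(record_size last_record data_byte_count)) {
        printf "  %s: %s\n", $field, $file->{$field} if defined $file->{$field};
    }
    print "  indexes: ", join(', ', @{ $file->{indexes} }), "\n";
}
```

Because each record starts with an empty indexes list, the print loop needs a defined check only for the three numeric fields.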
