Keeping the first occurence of a pattern, and removing the other occurences

TStanley has asked for the wisdom of the Perl Monks concerning the following question:

I have the following report:

^LSTORE 001
===============
DEPT:    PRODUCE
                                                               EXTEND 
+ MRKDWN   REASON                 EXT.  MRKDWN
ITEM    DESCRIPTION                SIZE      QTY   WGT  RETAIL RETAIL 
+ RETAIL   CD DESCRIPTION                 LOSS  VENDOR

0300008 28OZ FRT PLATTER/DIP      00028OZ     1    0.0    8.99   8.99 
+   8.99   01 DAMAGED/UNSALEABLE          0.00  102827 
0080948 EXPRESS FANCY GREENS      00007OZ     6    0.0    2.99  17.94 
+  17.94   01 DAMAGED/UNSALEABLE          0.00  103128 
0321855 CLAMSHL HYDRO BOSTON      00COUNT     1    0.0    1.99   1.99 
+   1.99   01 DAMAGED/UNSALEABLE          0.00  104040 
0058309 12OZ MONTEREY MROOM       00012OZ     1    0.0    2.29   2.29 
+   2.29   01 DAMAGED/UNSALEABLE          0.00  105524 
0058309 12OZ MONTEREY MROOM       00012OZ     1    0.0    2.29   2.29 
+   2.29   01 DAMAGED/UNSALEABLE          0.00  105524 
0084448 10OZ SPINACH PACK 12      00010OZ     1    0.0    1.69   1.69 
+   1.69   01 DAMAGED/UNSALEABLE          0.00  107505 

        REASON CODE TOTAL:                                            
+  35.19                                  0.00

              DEPT TOTALS:                                            
+  35.19                                  0.00


^LSTORE 002
===============
DEPT:    PRODUCE

0084508 2LB STRAWBERRIES          00002LB    20    0.0    3.69  73.80 
+  73.80   01 DAMAGED/UNSALEABLE          0.00  101224     



DEPT:    PRODUCE

                                                               EXTEND 
+ MRKDWN   REASON                  EXT. MRKDWN
ITEM  DESCRIPTION                   SIZE    QTY    WGT  RETAIL RETAIL 
+ RETAIL   CD DESCRIPTION                LOSS   VENDOR
[download]

What I am trying to get rid of, is the second or more occurences of the department name. The first instance after the store name is fine, but the others are not. My thought is to split the file into records based on the form feed character, then work on each record. So the initial code would be:

#!/usr/bin/perl -w
use strict;

open my $IN,"<","ISC001" or die "Can not open ISC001: $!\n";
open my $OUT,">","ISC-OUT2" or die "Can not open ISC-OUT2: $!\n";

$/="^L";

while(<$IN>){
  ......
}

close $IN;
close $OUT;
[download]

One thing consistent through the file is that the occurences of the department name that I need to remove occur before the header lines, so I'm guessing a regex similar to:

$_ =~s|DEPT:\s+PRODUCE\n{2,}(\s{63}EXTEND\s+MRKDWN\s+REASON\s+EXT.  MR
+KDWN\n)|$1|g;
[download]

would do what I need. Am I heading in the right direction with this guess, or am I going in the wrong direction?

TStanley
--------
People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell

Comment on Keeping the first occurence of a pattern, and removing the other occurences Select or Download Code

Replies are listed 'Best First'.
Re: Keeping the first occurence of a pattern, and removing the other occurences by thenaz (Beadle) on Aug 04, 2011 at 16:09 UTC
Use a look-behind assertion to delete all occurrences that don't follow a store name heading. `s/(?<!===\n)DEPT:\s+PRODUCE\n+/\n/g;` [download] Of course, I am assuming the input has UNIX-style line endings (newline character only). If not, you may need to replace "\n" with "\r\n" where appropriate.	[reply] [d/l]
Re: Keeping the first occurence of a pattern, and removing the other occurences by johngg (Canon) on Aug 04, 2011 at 16:55 UTC
Bear in mind that the "`^L`" that you are seeing is probably a text representation of a form feed control character so you might need to change your input record separator to something like `"\cL"`, IIRC. Cheers, JohnGG	[reply] [d/l] [select]
Re: Keeping the first occurence of a pattern, and removing the other occurences by Marshall (Canon) on Aug 05, 2011 at 04:11 UTC
I wouldn't split anything or make things more complicated than necessary. keep a %seen hash and start it over at the top of each page or STORE. If there are other "only want to see it once per store" things (like maybe column headers), just add a line similar to the DEPT line code. `my %seen; while (<DATA>) { %seen = () if (/^\s*STORE/); #start over for each STORE #just /\f/ might be fine too next if $seen{$_}; $seen{$_}++ if /^DEPT/; #no more DEPT lines print; }` [download]	[reply] [d/l]


go ahead... be a heretic
	PerlMonks