http://www.perlmonks.org?node_id=1074598


in reply to Reading file and matching lines

G'day Jalcock501,

You asked a very similar question, with a very similar title, using very similar data, in "Search file for certain lines".

Here's a cut-down version (with appropriate modifications) of the technique I provided in that thread (Re: Search file for certain lines):

#!/usr/bin/env perl

use strict;
use warnings;

local $/ = "\nh";

print "Block $.\n", /^(E.*?)^G/ms ? $1 : "Error\n" while <DATA>;

__DATA__
hblah
Qblah
Eblock_1_line_1
Eblock_1_line_2
Gblah
hblah
Qblah
Gblah
hblah
Qblah
Eblock_3_line_1
Eblock_3_line_2
Gblah

Output:

Block 1
Eblock_1_line_1
Eblock_1_line_2
Block 2
Error
Block 3
Eblock_3_line_1
Eblock_3_line_2

In Re: Search file for certain lines, I provided an explanation of the code as well as links to more detailed documentation. I've introduced no new concepts here: if there's something you don't understand here, go back to the earlier post for more information.

-- Ken
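
The one-line read loop in the post above is fairly dense, so the same block-at-a-time technique can also be spelled out more verbosely. What follows is only a sketch under stated assumptions: the helper name extract_e_lines and the in-memory filehandle are illustrative and do not appear in the original post, and the defined-or operator (//) needs Perl 5.10 or later.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read $fh one block at a
# time and return, for each block, the E lines that precede the first G line,
# or undef when the block has no such lines.
sub extract_e_lines {
    my ($fh) = @_;
    local $/ = "\nh";    # a record ends at a newline followed by 'h'
    my @blocks;
    while (my $block = <$fh>) {
        # /m lets ^ match at the start of each line; /s lets . match newlines
        push @blocks, $block =~ /^(E.*?)^G/ms ? $1 : undef;
    }
    return @blocks;
}

# Same sample data as the __DATA__ section above, one record per line.
my $sample = join "\n", qw{
    hblah Qblah Eblock_1_line_1 Eblock_1_line_2 Gblah
    hblah Qblah Gblah
    hblah Qblah Eblock_3_line_1 Eblock_3_line_2 Gblah
}, '';
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my $n = 0;
for my $e_lines (extract_e_lines($fh)) {
    print "Block ", ++$n, "\n", $e_lines // "Error\n";
}
```

Returning undef for a block without E lines keeps the "Error" decision with the caller, which may be easier to adapt than printing inside the loop.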

Re^2: Reading file and matching lines
by Jalcock501 (Sexton) on Feb 13, 2014 at 15:50 UTC
    Hi Kcott

    I completely forgot about that thread; thank you for reminding me.

    I do, however, have a quick question: if I want to check for duplicate G entries within the same scope (i.e. between E and h records), how would I do it?

    I have some example code I tried but it just prints all G records.

    my $lines;
    if (/^G/) {
        next if ($lines eq $_);
        $lines = $_;
        print $_;
    }

    Here is the example data I'm using:

    E123456789
    G123456798 ignore this as this is the first instance of G record in scope
    h12345
    E1234567
    E7899874
    G123456798 even though this is the same ignore as its first instance
    G123456789 ignore this as it is different from previous G record
    G123465798 should flag duplicate here because it is the same first G record in scope!!!
    h1245

      Firstly, you have no duplicates in any (of what you're calling) "scope". G123465798 is not a duplicate of G123456798: you've transposed the 5 and the 6. I've fixed this in the example below.

      There's a standard idiom for checking for duplicates in this sort of scenario. Use a hash (often called %seen) that has as its keys whatever identifier you're checking. While processing, if the key exists, it's a duplicate, so skip/flag/etc. as appropriate; if the key doesn't exist, it's unique, so use it and then add it to the hash (usually done with a postfix increment).

      Here's an example using your fixed data:

      #!/usr/bin/env perl -l

      use strict;
      use warnings;

      my @data = (
          [ qw{E123456789 G123456798 h12345} ],
          [ qw{E1234567 E7899874 G123456798 G123456789 G123456798 h1245} ],
      );

      for my $scope (@data) {
          my %seen;

          for my $identifier (@$scope) {
              print $identifier unless $seen{$identifier}++;
          }
      }

      Output:

      E123456789
      G123456798
      h12345
      E1234567
      E7899874
      G123456798
      G123456789
      h1245

      -- Ken
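
The %seen idiom in the reply above works on data already grouped into one array per scope. When the records arrive as a flat list of lines, as in the question, one possible adaptation is to reset %seen at each h record. This is a minimal sketch, assuming that an h record closes a scope and that only the leading identifier of each G line matters; the helper name duplicate_g_records is hypothetical and not part of the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return the G identifiers that repeat within a scope.
# Assumption: a line starting with 'h' ends the current scope.
sub duplicate_g_records {
    my @lines = @_;
    my (%seen, @duplicates);
    for my $line (@lines) {
        if ($line =~ /^h/) {
            %seen = ();    # new scope: forget the G records seen so far
            next;
        }
        # Only G records are checked; any trailing text on the line is ignored.
        my ($id) = $line =~ /^(G\S*)/ or next;
        push @duplicates, $id if $seen{$id}++;
    }
    return @duplicates;
}

# The corrected identifiers from the reply above, flattened into one list.
my @records = qw{
    E123456789 G123456798 h12345
    E1234567 E7899874 G123456798 G123456789 G123456798 h1245
};
print "Duplicate: $_\n" for duplicate_g_records(@records);
```

With this data, only the second G123456798 in the second scope is reported, because the h12345 record clears %seen before that scope begins.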