Re: Extract Tags between Two strings

by Athanasius (Chancellor)
on Jun 11, 2012 at 14:58 UTC

in reply to Extract Tags between Two strings

The following makes a single pass over the input file. I have used the in-file DATA section for convenience; you will, of course, have to change this to read from your data file:

use strict;
use warnings;
use feature 'say';    # say() is not available unless enabled

my (%tags_to_match, @extracted_tags);
my $in_matching = 0;
my $tag_prefix  = 'bbc_';
my $tag_regex   = qr{ ( $tag_prefix \w+ _ \d+ ) }x;

while (my $line = <DATA>) {
    if ($in_matching) {
        # Inside a [start]/[end] block: record each tag to be extracted.
        if ($line =~ / ^ \s* \[ end \] \s* $ /x) {
            $in_matching = 0;
        }
        elsif ($line =~ $tag_regex) {
            $tags_to_match{ $1 }++;
        }
    }
    elsif ($line =~ / ^ \s* \[ start \] \s* $ /x) {
        $in_matching = 1;
    }
    elsif ($line =~ $tag_regex) {
        # Outside the block: keep the tag only if the block listed it.
        # A hash lookup replaces the loop over keys %tags_to_match.
        push @extracted_tags, $1 if exists $tags_to_match{ $1 };
    }
}

say "\@extracted_tags = ", join(', ', @extracted_tags);

__DATA__
[start]
bbc_arc_001
bbc_arc_002
abc_arc_001
[end]
bbc_arc_001
bbc_arc_002
bbc_arc_003
bbc_arc_004

This should work provided the tags to be extracted always appear after the start/end block in which they are specified. If this is not the case for your input file, you will need to make two passes over the file: the first to read the contents of the start/end block(s), the second to extract the specified tags.
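If you do need the two-pass version, here is a minimal sketch of how it might look. The sub name extract_tags_two_pass and the in-memory sample data are my own inventions for illustration, and I am assuming the same 'bbc_' tag format guessed at above. The filehandle must be seekable (an ordinary file is; a pipe is not).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the tags which are listed in a [start]/[end]
# block and which also occur outside such blocks, in order of appearance.
sub extract_tags_two_pass {
    my ($fh) = @_;    # must be a seekable filehandle
    my $tag_regex = qr{ ( bbc_ \w+ _ \d+ ) }x;    # assumed tag format

    # Pass 1: collect the tags listed inside the block(s).
    my (%wanted, $in_block);
    while (my $line = <$fh>) {
        if    ($line =~ / ^ \s* \[ start \] \s* $ /x) { $in_block = 1 }
        elsif ($line =~ / ^ \s* \[ end \] \s* $ /x)   { $in_block = 0 }
        elsif ($in_block && $line =~ $tag_regex)      { $wanted{$1} = 1 }
    }

    # Pass 2: rewind, then extract the wanted tags found outside the block(s).
    seek($fh, 0, 0) or die "Cannot rewind input: $!";
    $in_block = 0;
    my @extracted;
    while (my $line = <$fh>) {
        if    ($line =~ / ^ \s* \[ start \] \s* $ /x) { $in_block = 1 }
        elsif ($line =~ / ^ \s* \[ end \] \s* $ /x)   { $in_block = 0 }
        elsif (!$in_block && $line =~ $tag_regex) {
            push @extracted, $1 if $wanted{$1};
        }
    }
    return @extracted;
}

# Made-up sample in which a wanted tag appears BEFORE the block.
my $data = join "\n", qw( bbc_arc_003 bbc_arc_001 [start] bbc_arc_001
                          bbc_arc_002 [end] bbc_arc_002 ), '';
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
print join(', ', extract_tags_two_pass($fh)), "\n";
close $fh;
```

With this sample the single-pass code would miss bbc_arc_001, because it occurs before the block that names it; the two-pass version picks it up.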

Also note that your regex may not be doing what you wanted. [a-zA-Z]*[0-9]*_* means: zero or more letters, followed by zero or more digits, followed by zero or more underscores. In my code I use a regex which is a guess at what was intended.
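To see the problem concretely, here is a small sketch (the sub name first_match and the test strings are mine, chosen only for illustration). Because every part of the pattern is optional, it succeeds on any string at all, including the empty string, and on a real tag it stops after the first underscore.

```perl
use strict;
use warnings;

# Hypothetical helper: returns whatever the original pattern matches first.
sub first_match {
    my ($text) = @_;
    return $text =~ / ( [a-zA-Z]* [0-9]* _* ) /x ? $1 : undef;
}

for my $text ('bbc_arc_001', '!!!', '') {
    my $match = first_match($text);
    print "'$text' matched '$match'\n";
}
```

For 'bbc_arc_001' the match is just 'bbc_' (three letters, no digits, one underscore); for '!!!' and for the empty string the match succeeds with an empty result, so the pattern never rejects anything.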


Athanasius <°(((><contra mundum

