Extract Data between Tags

ppremkumar has asked for the wisdom of the Perl Monks concerning the following question:

Team

I need help with fixing the below problem, for which I am unable to find a solution.

I am trying to write a program to extract all data within the tag "BIB."

The problem is this: When my find code is this

while ($data1 =~ m{(<BIB>.*</BIB>)}gx)

the output comes as

<BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameron—conceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>
Total occurrences of <BIB> is 1

which is not what I want.

When my find code is changed to this

while ($data1 =~ m{(<BIB>)}gx)

I get something closer: at least the number of matches equals the total number of "BIB" entries in the text.

What I want is this, each entry saved as an array value:

<BIB>Falco (2012)</BIB>

<BIB>Falco, 2012</BIB>

<BIB>ICMRT, 2012</BIB>

use strict;
use 5.14.2;

my $bib_count = 0;
my $INPUT_REF_FH;
my @text_found;
open $INPUT_REF_FH,"<:utf8", "ch01.txt";
binmode STDOUT, ':utf8';
while(<$INPUT_REF_FH>){
    my $data1 = $_;
    while ($data1 =~ m{(<BIB>.*</BIB>)}gx){
        $bib_count += 1;
#        print "$&\n";
        push @text_found, ${^MATCH}; 
    };
};
foreach (@text_found){
    print "$_\n";
};
print "Total occurrences of <BIB> is $bib_count";
close $INPUT_REF_FH;

INPUT TEXT:

In fact, <BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameron—conceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).


Replies are listed 'Best First'.
Re: Extract Data between Tags by tmharish (Friar) on Mar 05, 2013 at 18:23 UTC
Try the non-greedy match, like so (note the added question mark): `while ($data1 =~ m{(<BIB>.*?</BIB>)}gx)`
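To see the difference on the sample sentence, here is a minimal sketch. The variable names `$line`, `@greedy`, and `@lazy` are mine, and the text is shortened from the question's input; it is an illustration, not the poster's program.

```perl
use strict;
use warnings;

# Shortened version of the sample line from the question.
my $line = 'In fact, <BIB>Falco (2012)</BIB> today ... way '
         . '(<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).';

# Greedy: .* runs to the end of the line, then backtracks only as far
# as the LAST </BIB>, so the whole span comes back as a single match.
my @greedy = $line =~ m{(<BIB>.*</BIB>)}g;

# Non-greedy: .*? stops at the FIRST </BIB> that follows each <BIB>.
my @lazy = $line =~ m{(<BIB>.*?</BIB>)}g;

print scalar(@greedy), " greedy match\n";
print scalar(@lazy), " non-greedy matches:\n";
print "  $_\n" for @lazy;
```

In list context a `/g` match returns every captured group, which also removes the need for the inner `while` loop and a separate counter.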
Re^2: Extract Data between Tags by ppremkumar (Novice) on Mar 05, 2013 at 18:27 UTC
Thank you. It worked. I am pretty new to regex and Perl, so I appreciate your help.
Re: Extract Data between Tags by Your Mother (Archbishop) on Mar 05, 2013 at 19:05 UTC
Your code has a couple of gotchas in it, even in the fixed version. If the `<BIB>` elements contain record separators (normally newlines), your matches will fail, twice over: first, reading the file line by line will split such a record across two passes of your `while(<$INPUT_REF_FH>){}`; second, `.` does not match newlines in regular expressions by default. Add an `s` flag to your regex to match it. Also, `.*` matches nothing quite happily; use `.+` unless you really want empty `<BIB>` elements. The use of the `x` is meaningless in your regex. I know this is sometimes a recommended default but to me it's distracting noise, akin to someone wasting your time with the code equivalent of "Made you look."

This is a little idiomatic but it addresses the issues:

use strictures;
use open qw( :std :utf8 );

my $corpus = do { local $/; <DATA> };

my @bibs;
push @bibs, $corpus =~ m{<BIB>(.+?)</BIB>}sg;

s/[^\S ]+/ /g for @bibs; # Normalize whitespace.

if ( @bibs ) {
    print "Found...\n";
    print "\t $_\n" for @bibs;
}
else {
    print "No love.\n";
}

__DATA__
In fact, <BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameron—conceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).

Output:

Found...
	 Falco (2012)
	 Falco, 2012
	 ICMRT, 2012

Related reading:
Perl Idioms Explained - my $string = do { local $/; <FILEHANDLE> };
perldoc open
Death to Dot Star!
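The newline and empty-match points can be checked in isolation. The following is a small sketch on made-up input (the strings and the names `$text`, `@without_s`, `@with_s`, and `@empty` are assumptions for illustration, not taken from the poster's file):

```perl
use strict;
use warnings;

# Hypothetical input: the first entry wraps across a newline.
my $text = "See <BIB>Falco,\n2012</BIB> and <BIB>ICMRT, 2012</BIB>.";

# Without /s, "." cannot cross the newline, so the wrapped entry is missed.
my @without_s = $text =~ m{<BIB>(.+?)</BIB>}g;

# With /s, "." also matches "\n", so both entries are found.
my @with_s = $text =~ m{<BIB>(.+?)</BIB>}sg;

# Turn any whitespace other than a plain space (here the newline) into a space.
s/[^\S ]+/ /g for @with_s;

# .*? is satisfied by zero characters, so an empty element still matches
# and yields an empty capture.
my @empty = '<BIB></BIB>' =~ m{<BIB>(.*?)</BIB>}g;

print "without /s: ", scalar(@without_s), "\n";
print "with /s:    ", scalar(@with_s), "\n";
print "empty captures from .*?: ", scalar(@empty), "\n";
```

Note that the `s` flag only helps when the whole text is in one string, as with the `do { local $/; ... }` slurp; a line-by-line loop never sees both halves of a wrapped entry at once.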
Re^2: Extract Data between Tags by ppremkumar (Novice) on Mar 11, 2013 at 06:29 UTC
Thank you, @YourMother.

1. "... `.` does not match newlines in regular expressions by default. Add an `s` flag to your regex to match it." However, I validate the input to make sure each <BIB> element sits within a single line, which is the correct way to tag the files I have to use.

2. "The use of the x is meaningless in your regex." Yes, I agree; I carried it over from another expression of mine that needed multiple lines and comments in the search.

3. Thanks to you, I have started to use ".+?" instead of ".*?".
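For contrast, this is roughly the kind of pattern where the `x` flag does earn its keep: one spread over several lines with comments. It is a sketch only; the sample string and the name `@bibs` are assumptions, and the layout is one possible style rather than anything from the thread.

```perl
use strict;
use warnings;

# Sample text taken from the end of the question's input line.
my $line = '(<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>)';

# With /x, unescaped whitespace and "#" comments inside the pattern are ignored.
my @bibs = $line =~ m{
    <BIB>      # opening tag
    ( .+? )    # entry text, as short as possible
    </BIB>     # closing tag
}gx;

print "$_\n" for @bibs;
```

Under `x`, a literal space that must be matched has to be written as `\ ` or `[ ]`, which is one more reason not to add the flag to a one-line pattern that does not need it.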

