Re: Extract Data between Tags

in reply to Extract Data between Tags

Your code has a couple of gotchas in it, even in the fixed version. If the <BIB/>s contain record separators (normally newlines), your matches will fail, for two reasons. First, reading the file line by line splits such a record across two passes of your while(<$INPUT_REF_FH>){} loop. Second, . does not match newlines in regular expressions by default; add the s flag to your regex so it does. Also, .* quite happily matches nothing, so unless you really want empty <BIB/>s, use .+ instead. The x flag is meaningless in your regex. I know it is sometimes recommended as a default, but to me it's distracting noise, akin to someone wasting your time with the code equivalent of "Made you look."
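To make the newline and empty-match points concrete, here is a minimal sketch. The sample string is made up for illustration (it is not from your file): one citation wraps across a newline and one tag pair is empty.

```perl
use strict;
use warnings;

# Hypothetical sample: one citation wraps across a newline, one pair is empty.
my $text = "See <BIB>ICMRT,\n2012</BIB> and <BIB></BIB>.";

# Without /s, . cannot cross the newline, so the wrapped citation is missed
# and only the empty pair matches.
my @no_s = $text =~ m{<BIB>(.*?)</BIB>}g;

# With /s the wrapped citation matches, but .*? still accepts the empty pair.
my @star = $text =~ m{<BIB>(.*?)</BIB>}sg;

# .+? requires at least one character between the tags.
my @plus = $text =~ m{<BIB>(.+?)</BIB>}sg;

printf "no /s: %d, /s with .*?: %d, /s with .+?: %d\n",
    scalar @no_s, scalar @star, scalar @plus;
# prints "no /s: 1, /s with .*?: 2, /s with .+?: 1"
```

One caveat with .+?: if an empty pair is followed by another pair later in the text, the match can run on from the empty opening tag to the later closing tag, so it is worth checking the captures if empty tags are possible in your data.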

This is a little idiomatic but it addresses the issues:

use strictures;
use open qw( :std :utf8 );

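# Slurp all of DATA at once so a <BIB> spanning lines stays in one string.
my $corpus = do { local $/; <DATA> };

my @bibs;
# /s lets . match newlines; .+? requires at least one character between the tags.
push @bibs, $corpus =~ m{<BIB>(.+?)</BIB>}sg;
s/[^\S ]+/ /g for @bibs; # Collapse newlines, tabs, etc. to a single space.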

if ( @bibs )
{
    print "Found...\n";
    print "\t* $_\n" for @bibs;
}
else
{
    print "No love.\n";
}

__DATA__
In fact, <BIB>Falco (2012)</BIB> today Louise is hardly isolated. More
than 5 million babies have been born using the procedure, which has
become almost routine. And at the age of 28, Louise became a mother
herself, giving birth to a baby boy name Cameron—conceived, by the
way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT,
2012</BIB>).

Found...
    * Falco (2012)
    * Falco, 2012
    * ICMRT, 2012


Replies are listed 'Best First'.

Re^2: Extract Data between Tags
by ppremkumar (Novice) on Mar 11, 2013 at 06:29 UTC

Thank you, @YourMother.

1. "first fail is the file reading line by line will break the records into two passes of your while(<$INPUT_REF_FH>){} and then . does not match newlines in regular expressions by default. Add an s flag to your regex to match it."----However, I validate the input to make sure each <BIB> tag pair sits within a single line, which is the correct way to tag the files I have to use.

2. "The use of the x is meaningless in your regex"----Yes, I agree; I carried it over from another expression of mine that required multiple lines and comments in the searches.

3. Thanks to you, I have started to use ".+?" instead of ".*?"
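
For the single-line case described in point 1, a per-line loop can still work. The following is only a sketch under stated assumptions: the input text is invented, an in-memory filehandle stands in for $INPUT_REF_FH, and the tag-count check is a hypothetical form of the validation mentioned above, not the actual one.

```perl
use strict;
use warnings;

# Hypothetical input; an in-memory filehandle stands in for $INPUT_REF_FH.
my $input = "In fact, <BIB>Falco (2012)</BIB> today.\n"
          . "Old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).\n"
          . "A broken line with <BIB>no closing tag.\n";
open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";

my ( @bibs, @bad_lines );
while ( my $line = <$fh> ) {
    # Count opening and closing tags; a mismatch suggests a tag spans lines.
    my $opens  = () = $line =~ m{<BIB>}g;
    my $closes = () = $line =~ m{</BIB>}g;
    if ( $opens != $closes ) {
        push @bad_lines, $.;    # $. is the current input line number
        next;
    }
    # No /s needed here: every tag pair is known to sit on one line.
    push @bibs, $line =~ m{<BIB>(.+?)</BIB>}g;
}
close $fh;

print "Found: $_\n" for @bibs;
print "Unbalanced tags on line $_\n" for @bad_lines;
```

With this sample, the three well-formed citations are collected and line 3 is reported as unbalanced rather than being silently skipped.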

