http://www.perlmonks.org?node_id=1021894


in reply to Extract Data between Tags

Your code has a couple of gotchas in it, even in the fixed version. If the <BIB/>s contain record separators (normally newlines), your matches will fail, twice over. First, reading the file line by line splits such a record across two passes of your while(<$INPUT_REF_FH>){} loop. Second, . does not match newlines in regular expressions by default; add the s flag to your regex so that it does. Also, .* matches nothing quite happily, so use .+ unless you really want empty <BIB/>s. The x flag is meaningless in your regex. I know it is sometimes a recommended default, but to me it's distracting noise, akin to someone wasting your time with the code equivalent of "Made you look."

This is a little idiomatic but it addresses the issues:

use strictures;
use open qw( :std :utf8 );

my $corpus = do { local $/; <DATA> };

my @bibs;
push @bibs, $corpus =~ m{<BIB>(.+?)</BIB>}sg;

s/[^\S ]+/ /g for @bibs; # Normalize whitespace.

if ( @bibs )
{
    print "Found...\n";
    print "\t* $_\n" for @bibs;
}
else
{
    print "No love.\n";
}

__DATA__
In fact, <BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameron—conceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).

Output:

Found...
	* Falco (2012)
	* Falco, 2012
	* ICMRT, 2012
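
To see the regex gotchas described above in isolation, here is a minimal sketch. The sample string in $text is made up for illustration (it is not from the original poster's data): its second <BIB> spans a line break and its third is empty. It only uses core Perl, so plain strict and warnings stand in for strictures.

```perl
use strict;
use warnings;

# Hypothetical sample: the second <BIB> contains a newline, the third is empty.
my $text = "<BIB>Falco (2012)</BIB> and <BIB>ICMRT,\n2012</BIB> and <BIB></BIB>";

# Without /s, "." stops at the newline, so the multi-line record is skipped.
my @no_s = $text =~ m{<BIB>(.+?)</BIB>}g;

# With /s, "." also matches the newline, so the multi-line record is found.
my @with_s = $text =~ m{<BIB>(.+?)</BIB>}sg;

# ".*?" additionally accepts the empty <BIB></BIB>.
my @star = $text =~ m{<BIB>(.*?)</BIB>}sg;

printf "%d without /s, %d with /s, %d with .*?\n",
    scalar @no_s, scalar @with_s, scalar @star;
```

On this sample the three counts should come out as 1, 2 and 3. Whether the empty match is wanted depends on your data; the point is only that the choice between .+? and .*? decides it.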

Re^2: Extract Data between Tags
by ppremkumar (Novice) on Mar 11, 2013 at 06:29 UTC

    Thank you, @YourMother.

    1. "first fail is the file reading line by line will break the records into two passes of your while(<$INPUT_REF_FH>){} and then . does not match newlines in regular expressions by default. Add an s flag to your regex to match it." However, I validate the input to make sure each pair of <BIB> tags sits within a single line, which is the correct way to tag the files I have to work with.

    2. "The use of the x is meaningless in your regex" Yes, I agree; I carried it over from another expression of mine that needed multiple lines and comments in the search.

    3. Thanks to you, I have started to use ".+?" instead of ".*?".