PerlMonks  

Re: Extract Data between Tags

by Your Mother (Canon)
on Mar 05, 2013 at 19:05 UTC (#1021894)


in reply to Extract Data between Tags

Your code has a couple of gotchas in it, even in the fixed version. If the <BIB/>s contain record separators (normally newlines), your matches will fail, and for two reasons. First, reading the file line by line splits such a record across two passes of your while(<$INPUT_REF_FH>){} loop. Second, . does not match newlines in regular expressions by default; add an s flag to your regex so that it does. Also, .* quite happily matches nothing, so prefer .+ unless you really want empty <BIB/>s. The use of the x flag is meaningless in your regex. I know it is sometimes a recommended default, but to me it's distracting noise, akin to someone wasting your time with the code equivalent of "Made you look."

This is a little idiomatic, but it addresses the issues:

    use strictures;
    use open qw( :std :utf8 );

    my $corpus = do { local $/; <DATA> };
    my @bibs;
    push @bibs, $corpus =~ m{<BIB>(.+?)</BIB>}sg;
    s/[^\S ]+/ /g for @bibs; # Normalize whitespace.

    if ( @bibs ) {
        print "Found...\n";
        print "\t* $_\n" for @bibs;
    }
    else {
        print "No love.\n";
    }

    __DATA__
    In fact, <BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameronóconceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).

    Found...
        * Falco (2012)
        * Falco, 2012
        * ICMRT, 2012
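
To see the two regex gotchas in isolation, here is a minimal sketch. The two input strings are made up for illustration and are not from the original question; one has a newline inside the tags, the other has empty tags.

```perl
use strict;
use warnings;

# Hypothetical inputs, chosen only to isolate each behaviour.
my $multiline = "<BIB>Falco\n(2012)</BIB>";   # record spans two lines
my $empty     = "<BIB></BIB>";                # record with no content

# Without /s, "." stops at the newline, so the record is not matched.
my @without_s = $multiline =~ m{<BIB>(.+?)</BIB>}g;

# With /s, "." also matches newlines and the record is captured.
my @with_s = $multiline =~ m{<BIB>(.+?)</BIB>}sg;

# ".*?" accepts an empty record; ".+?" needs at least one character.
my @star = $empty =~ m{<BIB>(.*?)</BIB>}sg;
my @plus = $empty =~ m{<BIB>(.+?)</BIB>}sg;

printf "without /s: %d match(es)\n", scalar @without_s;   # 0
printf "with /s:    %d match(es)\n", scalar @with_s;      # 1
printf ".*? empty:  %d match(es)\n", scalar @star;        # 1
printf ".+? empty:  %d match(es)\n", scalar @plus;        # 0
```

One caveat worth knowing: when an empty <BIB></BIB> is followed by another <BIB> later in the same string, .+? does not simply skip the empty pair. It can start at the empty opening tag and run on to the next closing tag, so check for empty tags separately if your data may contain them.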

Re^2: Extract Data between Tags
by ppremkumar (Novice) on Mar 11, 2013 at 06:29 UTC

    Thank you, @YourMother.

    1. "first fail is the file reading line by line will break the records into two passes of your while(<$INPUT_REF_FH>){} and then . does not match newlines in regular expressions by default. Add an s flag to your regex to match it." However, I validate the input to make sure that each pair of <BIB> tags sits on a single line, which is the correct way to tag the files I have to use.

    2. "The use of the x is meaningless in your regex" Yes, I agree; I carried it over from another expression of mine that required multiple lines and comments in the searches.

    3. Thanks to you, I have started to use ".+?" instead of ".*?".
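
The single-line validation mentioned in point 1 is not shown in the thread. As a rough sketch of what it might look like, assuming a simple count of opening and closing tags per line is enough (the helper name bibs_on_line and the sample lines are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the <BIB> contents found on one line, and
# dies if the line's opening and closing tag counts differ, which would
# suggest a record that spans more than one line.
sub bibs_on_line {
    my ($line) = @_;
    my $opens  = () = $line =~ m{<BIB>}g;    # count of opening tags
    my $closes = () = $line =~ m{</BIB>}g;   # count of closing tags
    die "Unbalanced <BIB> tags: $line" if $opens != $closes;
    return $line =~ m{<BIB>(.+?)</BIB>}g;    # no /s needed within one line
}

# Sample lines standing in for what while (<$INPUT_REF_FH>) would yield.
my @lines = (
    "In fact, <BIB>Falco (2012)</BIB> today Louise is hardly isolated.\n",
    "(<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).\n",
);

my @bibs = map { bibs_on_line($_) } @lines;
print "$_\n" for @bibs;
```

The count check only catches a tag whose partner is missing from the line; it does not verify tag order or nesting, so treat it as a starting point rather than a full validator.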
