Extract Data between Tags

by ppremkumar (Novice)
on Mar 05, 2013 at 18:18 UTC

ppremkumar has asked for the wisdom of the Perl Monks concerning the following question:

Team

I need help with fixing the below problem, for which I am unable to find a solution.

I am trying to write a program to extract all data within the tag "BIB."

The problem is this: When my find code is this

while ($data1 =~ m{(<BIB>.*</BIB>)}gx)

the output comes as

<BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameron—conceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>
Total occurrences of <BIB> is 1

which is not what I want.

When my find code is changed to this

while ($data1 =~ m{(<BIB>)}gx)

I get something closer; at least the reported count matches the number of <BIB> tags in the text.

What I want is this, each entry saved as an array value:

<BIB>Falco (2012)</BIB>

<BIB>Falco, 2012</BIB>

<BIB>ICMRT, 2012</BIB>

    use strict;
    use 5.14.2;

    my $bib_count = 0;
    my $INPUT_REF_FH;
    my @text_found;

    open $INPUT_REF_FH,"<:utf8", "ch01.txt";
    binmode STDOUT, ':utf8';

    while(<$INPUT_REF_FH>){
        my $data1 = $_;
        while ($data1 =~ m{(<BIB>.*</BIB>)}gx){
            $bib_count += 1;
            # print "$&\n";
            push @text_found, ${^MATCH};
        };
    };

    foreach (@text_found){
        print "$_\n";
    };

    print "Total occurrences of <BIB> is $bib_count";
    close $INPUT_REF_FH;

INPUT TEXT:

In fact, <BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameron—conceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).

Replies are listed 'Best First'.
Re: Extract Data between Tags
by tmharish (Friar) on Mar 05, 2013 at 18:23 UTC
    Try the non-greedy match, like so (note the added question mark):
    while ($data1 =~ m{(<BIB>.*?</BIB>)}gx)
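A minimal sketch of the difference, for anyone following along. The sample line is shortened from the input text in the question, and the variable names are only illustrative:

```perl
use strict;
use warnings;

# Shortened sample line, adapted from the question's input text.
my $line = 'In fact, <BIB>Falco (2012)</BIB> today (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).';

# Greedy: .* runs to the end of the line, then backtracks to the LAST </BIB>,
# so everything from the first <BIB> to the last </BIB> is a single match.
my @greedy = $line =~ m{(<BIB>.*</BIB>)}g;

# Non-greedy: .*? stops at the FIRST </BIB> that follows each <BIB>.
my @lazy = $line =~ m{(<BIB>.*?</BIB>)}g;

print scalar(@greedy), " greedy match\n";   # 1 greedy match
print "$_\n" for @lazy;                     # one line per <BIB>...</BIB> pair
```

In list context with the g flag, the match returns every capture at once, so the results can go straight into an array without an explicit counter loop.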

      Thank you. It worked.

      I am pretty new to regex and Perl, so I appreciate your help.

Re: Extract Data between Tags
by Your Mother (Archbishop) on Mar 05, 2013 at 19:05 UTC

    Your code has a couple of gotchas in it, even in the fixed version. If the <BIB/>s contain record separators (normally newlines), your matches will fail, twice over: first, reading the file line by line splits such a record across two passes of your while(<$INPUT_REF_FH>){}; second, . does not match newlines in regular expressions by default. Add the s flag to your regex so that it does. Also, .* matches nothing quite happily, so use .+ unless you really want empty <BIB/>s. The x flag is meaningless in your regex. I know it is sometimes a recommended default, but to me it's distracting noise, akin to someone wasting your time with the code equivalent of "Made you look."

    This is a little idiomatic but it addresses the issues:

    use strictures;
    use open qw( :std :utf8 );

    my $corpus = do { local $/; <DATA> };

    my @bibs;
    push @bibs, $corpus =~ m{<BIB>(.+?)</BIB>}sg;

    s/[^\S ]+/ /g for @bibs; # Normalize whitespace.

    if ( @bibs ) {
        print "Found...\n";
        print "\t* $_\n" for @bibs;
    }
    else {
        print "No love.\n";
    }

    __DATA__
    In fact, <BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameron—conceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).

    Output:

    Found...
            * Falco (2012)
            * Falco, 2012
            * ICMRT, 2012
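To see the newline and empty-tag points in isolation, here is a small sketch. The input string is hypothetical (it is not from the original file): one tag wraps across a line break and one is empty.

```perl
use strict;
use warnings;

# Hypothetical input: the first tag contains a newline, the second is empty.
my $text = "See <BIB>Falco,\n2012</BIB> and <BIB></BIB>.";

# Without the s flag, . cannot cross the "\n", so the wrapped tag is missed.
my @no_s = $text =~ m{<BIB>(.+?)</BIB>}g;

# With the s flag, . matches newlines too; .+? still needs at least one character.
my @with_s = $text =~ m{<BIB>(.+?)</BIB>}sg;

# .*? also accepts the empty tag, capturing an empty string for it.
my @star = $text =~ m{<BIB>(.*?)</BIB>}sg;

printf "no s flag: %d, s flag with .+?: %d, s flag with .*?: %d\n",
    scalar @no_s, scalar @with_s, scalar @star;
# prints "no s flag: 0, s flag with .+?: 1, s flag with .*?: 2"
```

One caveat worth knowing: .+? only skips an empty tag when no later closing tag follows it. In a string such as `<BIB></BIB> x <BIB>a</BIB>`, the pattern would start at the empty tag and capture `</BIB> x <BIB>a`, so input that may contain empty tags probably deserves a separate check.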

      Thank you, @YourMother.

      1. "first fail is the file reading line by line will break the records into two passes of your while(<$INPUT_REF_FH>){} and then . does not match newlines in regular expressions by default. Add an s flag to your regex to match it."----However, I validate input to make sure the <BIB> tags are within a single line, which is the correct way to tag files I have to use.

      2. "The use of the x is meaningless in your regex"----Yes, I agree; I carried it over from another expression of mine that needed multiple lines and comments in the search pattern.

      3. Thanks to you, I have started to use ".+?" instead of ".*?"
