PerlMonks  

Extract Data between Tags

by ppremkumar (Novice)
on Mar 05, 2013 at 18:18 UTC

ppremkumar has asked for the wisdom of the Perl Monks concerning the following question:

Team

I need help with fixing the below problem, for which I am unable to find a solution.

I am trying to write a program to extract all data within the tag "BIB."

The problem is this: When my find code is this

while ($data1 =~ m{(<BIB>.*</BIB>)}gx)

the output comes as

<BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameron—conceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>
Total occurrences of <BIB> is 1

which is not what I want.

When my find code is changed to this

while ($data1 =~ m{(<BIB>)}gx)

I get something closer; at least the number of items within the "BIB" tag matches the total number of items within "BIB."

What I want is this, each entry saved as an array value:

<BIB>Falco (2012)</BIB>

<BIB>Falco, 2012</BIB>

<BIB>ICMRT, 2012</BIB>

use strict;
use 5.14.2;

my $bib_count = 0;
my $INPUT_REF_FH;
my @text_found;

open $INPUT_REF_FH, "<:utf8", "ch01.txt";
binmode STDOUT, ':utf8';

while (<$INPUT_REF_FH>) {
    my $data1 = $_;
    while ($data1 =~ m{(<BIB>.*</BIB>)}gx) {
        $bib_count += 1;
        # print "$&\n";
        push @text_found, ${^MATCH};
    }
}

foreach (@text_found) {
    print "$_\n";
}

print "Total occurrences of <BIB> is $bib_count";
close $INPUT_REF_FH;

INPUT TEXT:

In fact, <BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameron—conceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).

Replies are listed 'Best First'.
Re: Extract Data between Tags
by tmharish (Friar) on Mar 05, 2013 at 18:23 UTC
    Try the non-greedy match like so (note the added question mark):
    while ($data1 =~ m{(<BIB>.*?</BIB>)}gx)
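
The difference between the two quantifiers can be seen side by side in a minimal sketch. The sample line is a shortened copy of the input text from the question; the variable names `@greedy` and `@lazy` are only illustrative.

```perl
use strict;
use warnings;

# Shortened copy of the input line from the question.
my $line = 'In fact, <BIB>Falco (2012)</BIB> today ... way '
         . '(<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).';

# Greedy: .* runs to the end of the line, then backtracks to the LAST </BIB>,
# so everything from the first <BIB> to the last </BIB> is one match.
my @greedy = $line =~ m{(<BIB>.*</BIB>)}g;

# Non-greedy: .*? stops at the FIRST </BIB> after each <BIB>.
my @lazy = $line =~ m{(<BIB>.*?</BIB>)}g;

print scalar(@greedy), " greedy match(es)\n";
print scalar(@lazy), " non-greedy match(es):\n";
print "  $_\n" for @lazy;
```

In list context with /g the match returns every captured group, so the inner while loop and the counter are not strictly needed; the size of the array is the count.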

      Thank you. It worked.

      I am pretty new to regex and Perl, so I appreciate your help.

Re: Extract Data between Tags
by Your Mother (Archbishop) on Mar 05, 2013 at 19:05 UTC

    Your code has a couple of gotchas in it, even in the fixed version. If the <BIB/>s contain record separators (normally newlines), your matches will fail (twice); first fail is the file reading line by line will break the records into two passes of your while(<$INPUT_REF_FH>){} and then . does not match newlines in regular expressions by default. Add an s flag to your regex to match it. Also, .* matches nothing quite happily; use .+ instead unless you really want empty <BIB/>s. The use of the x is meaningless in your regex. I know this is sometimes a recommended default but to me it's distracting noise, akin to someone wasting your time with the code equivalent of "Made you look."

    This is a little idiomatic but it addresses the issues:

    use strictures;
    use open qw( :std :utf8 );

    my $corpus = do { local $/; <DATA> };

    my @bibs;
    push @bibs, $corpus =~ m{<BIB>(.+?)</BIB>}sg;

    s/[^\S ]+/ /g for @bibs; # Normalize whitespace.

    if ( @bibs ) {
        print "Found...\n";
        print "\t* $_\n" for @bibs;
    }
    else {
        print "No love.\n";
    }

    __DATA__
    In fact, <BIB>Falco (2012)</BIB> today Louise is hardly isolated. More than 5 million babies have been born using the procedure, which has become almost routine. And at the age of 28, Louise became a mother herself, giving birth to a baby boy name Cameron—conceived, by the way, in the old-fashioned way (<BIB>Falco, 2012</BIB>; <BIB>ICMRT, 2012</BIB>).

    Output:

    Found...
        * Falco (2012)
        * Falco, 2012
        * ICMRT, 2012
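
The two newline-related failures described in this reply can be reproduced in isolation. The sketch below assumes a hypothetical input in which one citation is wrapped across two lines (the thread's real input has no such wrap), and reads it through an in-memory filehandle to stand in for the file.

```perl
use strict;
use warnings;

# Hypothetical input: the first citation is wrapped across two lines.
my $text = "See <BIB>Falco,\n2012</BIB> and <BIB>ICMRT, 2012</BIB>.\n";

# Whole string, no /s: "." cannot cross the newline, so the wrapped entry is missed.
my @without_s = $text =~ m{<BIB>(.+?)</BIB>}g;

# Whole string, with /s: "." also matches "\n", so both entries are captured.
my @with_s = $text =~ m{<BIB>(.+?)</BIB>}sg;

# Line by line: even with /s the wrapped entry is lost, because its opening
# and closing tags never appear in the same record.
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my @per_line;
while ( my $line = <$fh> ) {
    push @per_line, $line =~ m{<BIB>(.+?)</BIB>}sg;
}
close $fh;

printf "without /s: %d, with /s: %d, line by line: %d\n",
    scalar @without_s, scalar @with_s, scalar @per_line;
```

Only the combination of slurping the input and using /s finds both entries; either fix alone still drops the wrapped one.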


      Thank you, @YourMother.

      1. "first fail is the file reading line by line will break the records into two passes of your while(<$INPUT_REF_FH>){} and then . does not match newlines in regular expressions by default. Add an s flag to your regex to match it."----However, I validate input to make sure the <BIB> tags are within a single line, which is the correct way to tag files I have to use.

      2. "The use of the x is meaningless in your regex"----Yes, I agree; I carried it over from another expression of mine that required multiple lines and comments in the searches.

      3. Thanks to you, I have started to use ".+?" instead of ".*?"
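
One caveat on point 3 is worth checking before relying on it. The sketch below compares `.*?` and `.+?` on a hypothetical line that contains an empty <BIB></BIB> pair. The third pattern, using the negated class `[^<]+`, is an assumption of this example rather than something proposed in the thread, and it only works if no other markup is nested inside <BIB>. It is written with /x to show the multi-line, commented style mentioned in point 2.

```perl
use strict;
use warnings;

# Hypothetical input containing one empty element.
my $text = 'See <BIB></BIB> and <BIB>Falco, 2012</BIB>.';

# .*? accepts the empty element and returns an empty string for it.
my @star = $text =~ m{<BIB>(.*?)</BIB>}g;

# .+? must consume at least one character, so from the empty element it
# runs on through the first </BIB> to the next closing tag.
my @plus = $text =~ m{<BIB>(.+?)</BIB>}g;

# Assumed alternative: forbid "<" inside the captured text.
my $no_markup = qr{
    <BIB>        # opening tag
    ( [^<]+ )    # one or more characters, none of them "<"
    </BIB>       # closing tag
}x;
my @strict = $text =~ /$no_markup/g;

print "star:   [", join( '] [', @star ),   "]\n";
print "plus:   [", join( '] [', @plus ),   "]\n";
print "strict: [", join( '] [', @strict ), "]\n";
```

If the input is validated so that empty elements never occur, as described in point 1 for line breaks, `.+?` and `.*?` return the same results.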
