PerlMonks  

multiple OR match fails

by zzgulu (Novice)
on Jan 31, 2012 at 02:42 UTC ( #950858=perlquestion )
zzgulu has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file that contains many sections. I want to extract certain sections together with their content. Why does the code below stop after it finds the first matched section (Findings) instead of going on to look for the other section (Complications)? If I add more sections to the OR list, it still finds only the first match and ignores the rest. I really appreciate your wisdom!

    ....
    .....
    while(<IN>) {
        undef ($/);
        $string=$_;
        $string =~m/(FINDINGS|COMPLICATIONS|(:)(.*?)(^[A-Z])/sgm;
        print "processing $file\n";
        print OUT "$1$2\t$3";
    }exit;
    ....
    ......

Re: multiple OR match fails
by JavaFan (Canon) on Jan 31, 2012 at 02:52 UTC
    It only finds one match because you only ask it to match once. Use a while loop if you want to find all the matches.
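A minimal sketch of the while-loop idea, assuming a small made-up input string (the sample text and the sub name `matched_titles` are illustrative, not from the original post). In scalar context, `//g` resumes from where the previous match ended, so each pass through the loop yields the next match:

```perl
use strict;
use warnings;

# Hypothetical sample text; only the section names come from the question.
my $text = "FINDINGS: a\nCOMPLICATIONS: b\n";

# Collect every section title that matches, one per loop iteration.
sub matched_titles {
    my ($string) = @_;
    my @titles;
    # /g in scalar context continues after the previous match each time.
    while ($string =~ /^(FINDINGS|COMPLICATIONS):/mg) {
        push @titles, $1;
    }
    return @titles;
}

print join(', ', matched_titles($text)), "\n";
```

With a plain `if` instead of `while`, only the first title would be collected.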
Re: multiple OR match fails
by InfiniteSilence (Curate) on Jan 31, 2012 at 03:16 UTC

    Would have been nice to include at least some piece of your input file. Your regex looks like it is missing some parentheses or something. Also, if your code is so short you might consider one-lining it:

    ~linux> perl -ne 'if(m/((FINDINGS|COMPLICATIONS):(.*?)([A-Z]+))/sgm){print qq|$2$3\t$4\n|};' foo7.txt

    Celebrate Intellectual Diversity

      if (//g) should be while (//g).
Re: multiple OR match fails
by jwkrahn (Monsignor) on Jan 31, 2012 at 03:33 UTC
    while(<IN>) { undef ($/); $string=$_;

    Because you undef $/ inside the loop, the first time through the loop $_ will contain only the first line of the file, and the second time through $_ will contain all the rest of the file.

    Did you really want to process the file in two chunks like that?

      yes, The input record separator, newline by default. $/ may be set to a value longer than one character in order to match a multi-character delimiter. If $/ is undefined, no record separator is matched, and <FILEHANDLE> will read everything to the end of the current file in one line.
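To make the two-chunk behaviour concrete, here is a small sketch that reads from an in-memory filehandle; the sample data and the sub names `chunks_with_undef_in_loop` and `slurp` are assumptions for illustration. The first sub mimics the original loop, the second undefines `$/` before the first read so the whole content arrives in one piece:

```perl
use strict;
use warnings;

my $data = "line one\nline two\nline three\n";

# Mimics the original loop: $/ is undefined only after the first read,
# so the first chunk is one line and the second chunk is everything else.
sub chunks_with_undef_in_loop {
    my ($content) = @_;
    local $/ = "\n";    # start from the default separator, restored on return
    open my $in, '<', \$content or die "can't open in-memory file: $!";
    my @chunks;
    while (my $chunk = <$in>) {
        undef $/;       # only affects reads that happen after this point
        push @chunks, $chunk;
    }
    close $in;
    return @chunks;
}

# Slurp mode: undefine $/ (locally) before reading anything.
sub slurp {
    my ($content) = @_;
    open my $in, '<', \$content or die "can't open in-memory file: $!";
    local $/;           # undef for the rest of this sub only
    my $all = <$in>;
    close $in;
    return $all;
}

printf "chunks when undef is inside the loop: %d\n",
    scalar(chunks_with_undef_in_loop($data));
printf "slurp returns the whole content: %s\n",
    (slurp($data) eq $data ? 'yes' : 'no');
```

Using `local $/` keeps the change from leaking into later reads elsewhere in the program.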

Re: multiple OR match fails
by lune (Monk) on Jan 31, 2012 at 13:33 UTC
    The obvious part of your question is how to return all matches from a regex match.

    That can easily be done like this (I simplified your regex, as the missing parenthesis makes it unclear what you really want):

    while(<STDIN>) {
        # see previous answer
        #undef ($/);
        $string=$_;
        my @matches = ($string =~ m/(FINDINGS|COMPLICATIONS|:.*)/g);
        print STDOUT "@matches \n";
    }

    echo "FINDINGS COMPLICATIONS :something" | t.pl

    However, from your question it seems that what you really want is not just a list of matches but some sort of parsing, e.g. extracting the text of the section "FINDINGS" etc.

    To answer this, it would be necessary to know where a section ends. If this is not what you wanted, please clarify.

      Thank you very much for your input, and sorry for the typo; one parenthesis was missing from the code. My text files are operative notes, and each note consists of sections that start with a title at the beginning of a line, all in upper case and ending in a colon. Sections are usually separated by an empty line, although this may not always be the case. The input directory contains 1000 files, and my intention is to write the files to an output directory with only the designated matched sections (title + content). As recommended, adding a while loop around my regex match seems to have fixed the issue, but please do advise me if you find other issues in the code. I seldom write code, but since I work with text files, regexes are a powerful help for occasional data extraction. I am sure there are much easier ways to code what I coded below. This is a sample input file:

      PREOPERATIVE DIAGNOSIS: Left invasive cancer, positive margins.

      TITLE OF OPERATION:

      1. Left needle-localized segmental mastectomy.

      2. intraoperative axillary lymphatic mapping.

      3. lymphadenectomy.

      ANESTHESIA: General.

      INDICATIONS FOR SURGERY: Invasive carcinoma with positive margins and residual calcifications.

      COMPLICATIONS : None.

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $indir;
      my $file;
      my $new;
      my $string;
      my $outdir;
      $indir = 'C:/input';
      $outdir ='C:/output';
      if(-d $indir) {
          opendir(DIR, $indir) or die "can't open $!";
      }
      while ($file=readdir(DIR)) {
          my $fullpath=$indir.'/'.$file;
          open IN, "$indir/$file";
          $new= "$outdir/$file";
          open OUT, ">$new";
          while(<IN>) {
              undef ($/);
              $string=$_;
              while ($string =~m/(FINDINGS|COMPLICATIONS)(:)(.*?)(^[A-Z])/sgm) {
                  print "processing $file\n";
                  print OUT "$1$2\t$3";
              }
          }
          close IN;
          close OUT;
      }
      closedir(DIR);
      exit;
        Since you asked for comments, I'll make a few:
        - main improvement is to make better indenting
        - if(-d $indir) was unnecessary
        - when you do a readdir, this returns only the names (not full paths) and this will include any directories (including the . and .. ones!). It is common to use a grep to filter out the stuff that you don't want.
        - always check whether any kind of file operation succeeded or not
        - declare variables when you actually use them the first time.
        I didn't actually run this so excuse me if I made a mistake.
        #!/usr/bin/perl
        use strict;
        use warnings;

        my $indir = 'C:/input';
        my $outdir ='C:/output';

        opendir(DIR, $indir) or die "can't open directory $indir $!";

        foreach my $file (grep{-f "$indir/$_"}readdir DIR) {
            open IN, '<', "$indir/$file" or die "can't open $indir/$file $!";
            my $new= "$outdir/$file";
            open OUT, '>', $new or die "can't open $new for output $!";
            while (my $string = <IN>) {
                undef ($/);
                while ($string =~m/(FINDINGS|COMPLICATIONS)(:)(.*?)(^[A-Z])/sgm) {
                    print "processing $file\n";
                    print OUT "$1$2\t$3";
                }
            }
            close IN;
            close OUT;
        }
        closedir(DIR);
        update: these "close" statements aren't strictly necessary; all file handles will get closed when your program exits. When you open IN for the next file, this automatically closes the current IN file (if there is one). exit() wasn't necessary, so I took it out.
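Pulling the thread's suggestions together, here is one possible sketch of the section extraction itself. It is not the code anyone in the thread posted. It assumes, based on the description of the notes, that a section title sits at the start of a line, is all upper case (spaces allowed), and is followed by a colon, possibly with a space before it as in the "COMPLICATIONS : None." sample line. The sub name `extract_sections` and the sample note are illustrative. A lookahead ends each section at the next title or at the end of the text, so the last section in a file is not lost:

```perl
use strict;
use warnings;

# Return [title, body] pairs for the wanted sections of one note.
# Assumption: a title is an all-caps phrase at the start of a line,
# followed by an optional space and a colon.
sub extract_sections {
    my ($text, @wanted) = @_;
    my $titles = join '|', map { quotemeta } @wanted;
    my @found;
    while ($text =~ /^($titles)[ \t]*:(.*?)(?=^[A-Z][A-Z \t]*:|\z)/msg) {
        my ($title, $body) = ($1, $2);
        $body =~ s/^\s+|\s+\z//g;    # trim leading and trailing whitespace
        push @found, [$title, $body];
    }
    return @found;
}

# Made-up note in the shape of the sample input above.
my $note = "FINDINGS: Residual calcifications.\n\n"
         . "ANESTHESIA: General.\n\n"
         . "COMPLICATIONS : None.\n";

for my $section (extract_sections($note, 'FINDINGS', 'COMPLICATIONS')) {
    my ($title, $body) = @$section;
    print "$title:\t$body\n";
}
```

For real files, the whole note would be read in one piece first (with `$/` undefined before the read) and then passed to the sub. A body line that itself looks like an all-caps title followed by a colon would end the section early, so the pattern may need tightening for real notes.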

Node Type: perlquestion [id://950858]
Approved by planetscape