PerlMonks  

Question about regex.

by that_perl_guy (Initiate)
on Sep 29, 2020 at 02:10 UTC ( #11122313=perlquestion )

that_perl_guy has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I would be extremely thankful if the monks here could help me understand regex in Perl.

Suppose I have a text file that has these lines:

    This is line one.
    Line Two is this.
    Third line starts here.
    This is line four.

    This is line five.
    This is line six.
    This is the seventh line.
    This is line eight.

If the record contains the word "third" or "four", I want it to print the whole record, meaning the stuff between the empty lines, not just the lines with those words in them. But I am not able to write it correctly. Here is what I have tried:

    use strict;
    use warnings;

    open my $fh, "+<", "testlines.txt";
    while (<$fh>) {
        if ($_=~ /(third | four)/si) {
            chomp;
            local $/ = "\n\n";
            print "line is: $_\n";
        }
    }

And it prints:

    >perl regex.pl
    line is: Third line starts here.
    line is: This is line four.

But what I want is:

    This is line one.
    Line Two is this.
    Third line starts here.
    This is line four

Where am I going wrong? Please guide. Please note, this is just an example. Depending on the data supplied the other lines in the file may contain some different words.

Not sure if this matters, but I'm on Windows 10 with Strawberry Perl version 5.32.

Replies are listed 'Best First'.
Re: Question about regex.
by GrandFather (Saint) on Sep 29, 2020 at 03:07 UTC

    NetWallah identified your immediate problem, but there is also an issue with your regex. The spaces inside the pattern (after third and before four) are literal, so they must match spaces in the string you are matching against. On top of that, you don't prevent matching words that contain third or four (fourth, for example). You can fix both issues like this:

        use strict;
        use warnings;

        local $/ = "\n\n";

        while (<DATA>) {
            if (/\b(third|four)\b/si) {
                chomp;
                print "line is: $_\n";
            }
        }

        __DATA__
        This is line one.
        Line Two is this.
        Third line starts here.
        This is line four.

        This is line five going fourth.
        This is line six.
        This is the seventh line.
        This is line eight.

    Prints:

        line is: This is line one.
        Line Two is this.
        Third line starts here.
        This is line four.

    The \b matches at word boundaries, so four no longer matches inside fourth.
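    As a rough illustration of the difference between the two patterns, here is a minimal sketch. The helper names loose_match and strict_match and the sample strings are made up for this example; they are not part of the original program.

```perl
use strict;
use warnings;

# Return 1 if the original pattern (with literal spaces) matches, else 0.
sub loose_match  { my ($s) = @_; return $s =~ /(third | four)/i ? 1 : 0 }

# Return 1 if the word-boundary pattern matches, else 0.
sub strict_match { my ($s) = @_; return $s =~ /\b(third|four)\b/i ? 1 : 0 }

# Hypothetical sample strings, not taken from testlines.txt.
my @samples = (
    "Third line starts here.",
    "This is line four.",
    "going fourth",   # contains " four" followed by "th"
    "a third",        # "third" at the very end, no trailing space
);

for my $s (@samples) {
    printf "%-26s loose=%d strict=%d\n", $s, loose_match($s), strict_match($s);
}
```

    The last two samples are where the patterns disagree: the loose pattern accepts "going fourth" and rejects "a third", while the \b version does the opposite.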

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Question about regex.
by NetWallah (Canon) on Sep 29, 2020 at 02:45 UTC
    A little line shuffling and error checking fixes this:
        use strict;
        use warnings;

        open (my $fh, "<", "testlines.txt") or die "$!";
        local $/ = "\n\n";
        while (<$fh>) {
            if ($_=~ /(third | four)/si) {
                chomp;
                print "line is: $_\n";
            }
        }
        close $fh;

        Code>perl test15.pl
        line is: This is line one.
        Line Two is this.
        Third line starts here.
        This is line four.

                    "Imaginary friends are a sign of a mental disorder if they cause distress, including antisocial behavior. Religion frequently meets that description"

Re: Question about regex.
by Cristoforo (Curate) on Sep 29, 2020 at 03:36 UTC
    Here is another approach. It keeps the blank lines after the text (if you need to separate records like they are in the testlines.txt file).

    It reads the file to be parsed from the command line following the program invocation. ( perl regex.pl testlines.txt )

    It also places the local statement and the code inside a block to limit the changes to $/ to the block. Although this is not necessary for this small program, in a larger program, this will restore $/ to \n after the block is finished for any code that could follow.

        #!/usr/bin/perl
        use strict;
        use warnings;

        @ARGV == 1 or print "Usage: perl $0 testlines.txt\n" and exit;

        {
            local $/ = "\n\n";
            print grep /\b(?:third|four)\b/i, <>;
        }

    My command prompt:

        C:\Old_Data\perlp>perl test3.pl testlines.txt
        This is line one.
        Line Two is this.
        Third line starts here.
        This is line four.
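    The scoping effect of local inside a bare block can be checked with a small sketch like the one below. It assumes an in-memory string in place of testlines.txt, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical in-memory "file" so the sketch runs without testlines.txt.
my $text = "one\ntwo\n\nthree\nfour\n";

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $first_record;
{
    local $/ = "\n\n";        # record separator changed only inside this block
    $first_record = <$fh>;    # reads "one\ntwo\n\n"
}

# Outside the block $/ is "\n" again, so this reads a single line.
my $next_line = <$fh>;        # reads "three\n"

print "record: $first_record";
print "line:   $next_line";
```

    The first read returns a whole record, the second only one line, because the old value of $/ is restored as soon as the block ends.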
Re: Question about regex. ($INPUT_RECORD_SEPARATOR)
by LanX (Cardinal) on Sep 29, 2020 at 11:12 UTC
    > Where am I going wrong?

    Short-answer:

    You need to set $/ (aka $INPUT_RECORD_SEPARATOR) before you start reading the <input>, not inside the loop.

    You already got this from the previous longer answers, but I wanted to make it clear for those who didn't manage to dig thru all the code.
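    A minimal sketch of why the placement matters, assuming an in-memory string instead of the real file. The sub names chunks_set_inside and chunks_set_before are hypothetical and simply count how many chunks the read operator returns in each case.

```perl
use strict;
use warnings;

# Hypothetical sample data: two records separated by one empty line.
my $text = "This is line one.\nThird line starts here.\n\nThis is line five.\n";

# $/ is changed inside the loop body, after the read has already happened.
sub chunks_set_inside {
    my ($data) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my $count = 0;
    while (my $chunk = <$fh>) {    # $/ is still "\n" for this read
        local $/ = "\n\n";         # too late, and undone when the body ends
        $count++;
    }
    return $count;
}

# $/ is changed before the first read.
sub chunks_set_before {
    my ($data) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    local $/ = "\n\n";
    my $count = 0;
    while (my $chunk = <$fh>) {    # each read now returns a whole record
        $count++;
    }
    return $count;
}

print "set inside the loop: ", chunks_set_inside($text), " chunks\n";   # 4 (lines)
print "set before the loop: ", chunks_set_before($text), " chunks\n";   # 2 (records)
```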

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: Question about regex.
by haukex (Bishop) on Sep 29, 2020 at 18:11 UTC

    A couple of points that haven't been mentioned yet:

    • Perl's input record separator $/ has a special "paragraph mode" when you set local $/=""; that will split the input on one or more blank lines.
    • GrandFather mentioned the spaces in the regex. If you wanted to format your regex nicely and have whitespace ignored, you could use the /x modifier (perlre).
    • Cristoforo's suggestion is nice and short, but has the disadvantage that it reads the entire file into memory before grepping it.
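    The three points above can be combined in one short sketch: paragraph mode, a pattern spaced out with /x, and a loop that holds only one record in memory at a time. The sample data and the sub name matching_records are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical sample data; note the three blank lines between the records.
my $text = join "\n",
    "This is line one.",
    "Third line starts here.",
    "", "", "",
    "This is line five going fourth.",
    "This is line six.",
    "";

sub matching_records {
    my ($data) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

    local $/ = "";                  # paragraph mode: one or more blank lines end a record
    my @found;
    while (my $record = <$fh>) {    # one record in memory at a time
        chomp $record;              # in paragraph mode chomp strips all trailing newlines
        push @found, $record
            if $record =~ / \b (?: third | four ) \b /xi;   # /x: whitespace in the pattern is ignored
    }
    return @found;
}

print "record is:\n$_\n\n" for matching_records($text);
```

    With "\n\n" as the separator, the extra blank lines would show up at the start of the next record; paragraph mode swallows them.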
      "local $/="";"
      "If you wanted to format your regex nicely and have whitespace ignored, you could use the /x modifier"

      Anyone else feel a dissonance here? I'd format the assignment (yes, it is an assignment) as:

      local $/ = "";

      At first glance the assignment looks like $/= "" to me and that's not right. An eye blink later it turns into $ /= "" which doesn't work either. It's only after two glances and a hard look that it resolves into $/ = "";. Using white space to reduce cognitive load is a Good Thing.

      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
        Anyone else feel a dissonance here?

        Yes, you do have a point, I didn't need to be stingy with those two spaces. Though I did say "If you wanted to format your regex nicely", and not "you should format your regex nicely" (and no one else in the thread used /x either) ;-)

        Edit: missed a word on c&p

Re: Question about regex.
by that_perl_guy (Initiate) on Sep 29, 2020 at 04:14 UTC

    wow, you folks are amazing. Thank you all! I tried to understand regex by reading a book but you guys have a knack for explaining things so clearly.

Node Type: perlquestion [id://11122313]
Approved by GrandFather
Front-paged by Corion