Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

[Solved]Need to extract a particular block of lines between two patterns

by chengchl (Novice)
on Nov 09, 2017 at 00:39 UTC ( #1202989=perlquestion: print w/replies, xml ) Need Help??
chengchl has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have seen the post here about "Getting lines in a file between two patterns" (http://www.perlmonks.org/?node_id=979946)

I was curious that what if I need to extract only the second block of lines that meet the pattern?

Is there something that we can do via (/START/ .. /END) pattern matching? Many thanks

That is:

I have a text file -

abc efg ... START lines not to be extracted END START lines to be extracted END START lines not to be extracted END ...

I think I find a way. Thanks guys

my $count = 0; 11 while (<$fh_r>) { 12 if (/START/ .. /END/) { 13 $count++ if /START/; 14 print if ($count == 2); 15 } 16 }

--- Updated on Nov. 10, 15pm PST Thank you so much guys. I read through all your answers and appreciate your help. Thank you all and have a nice day.

Replies are listed 'Best First'.
Re: [Solved]Need to extract a particular block of lines between two patterns
by Cristoforo (Curate) on Nov 09, 2017 at 03:34 UTC
    This solution makes use of the next if 1 .. /^END$/; flip flop operator. The 1 is the first line of the file to the first END. The next time it encounters a START END block, it performs the actions in the code.
    #!/usr/bin/perl use strict; use warnings; while (<>) { next if 1 .. /^END$/; if (/^START$/ .. /^END$/) { next if /^START$/; last if /^END$/; print; } }
    This works for the data sample you provided.

      Hi Cristoforo,

      Thank you so much for your help. Please correct me if I understand it wrong - the code will skip the first START ... END pattern but will output the all the following patterns right? That is to say, the third START .. END pattern will be printed out as well even if it's not wanted?

      Thank you so much for the help again!

        Yes, it will skip the first block. It will exit the while loop (last if /^END$/) when it reaches the END for the second block.
Re: Need to extract a particular block of lines between two patterns
by kcott (Chancellor) on Nov 10, 2017 at 09:20 UTC

    G'day chengchl,

    Welcome to the Monastery.

    Here's a generic solution for your problem. It handles:

    • Extraction of any block (i.e. there's no hard-coded or constant block number).
    • Extraction of multiple blocks.
    • Blocks of lines actually containing (plural) lines.
    • Rogue START or END tokens within START-END blocks.
    • Specification of wanted blocks in any order.
    • Invalid block specifications (e.g. out of range and non-integer identifiers).

    In production code, you may want to add some form of validation and sanity checking, such that the function is short-circuited if no valid blocks are specified (which could mean not even having to open the input file).

    The following shows the technique (specifically for testing via the command line); you'll need to adapt this to your needs (e.g. change <DATA> to <$fh_r>). I've embedded test data to check all the things I've said it handles; you should create your own test data, which more realistically reflects your actual data, and use that for any proof-of-concept or regression tests.

    #!/usr/bin/env perl use strict; use warnings; my %print_block = map { $_ => 1 } @ARGV; my $found_block = 0; while (<DATA>) { next unless /^START$/ .. /^END$/; ++$found_block, next if /^START$/; next if /^END$/; print if $print_block{$found_block}; } __DATA__ ... line BEFORE any wanted blocks ... START block A line 1 block A line 2 with rogue END token block A line 3 block A line 4 with rogue START token block A line 5 END ... line BETWENN any wanted blocks ... START block B line 1 block B line 2 with rogue START token block B line 3 block B line 4 with rogue END token block B line 5 END ... line BETWENN any wanted blocks ... START block C line 1 block C line 2 with rogue END token block C line 3 block C line 4 with rogue START token block C line 5 END ... line BETWENN any wanted blocks ... START block D line 1 block D line 2 with rogue START token block D line 3 block D line 4 with rogue END token block D line 5 END ... line AFTER any wanted blocks ...

    Some example test runs (the script name is pm_1202989_flip_flop_selection.pl):

    $ pm_1202989_flip_flop_selection.pl $ pm_1202989_flip_flop_selection.pl 99 $ pm_1202989_flip_flop_selection.pl A B C $ pm_1202989_flip_flop_selection.pl 1 block A line 1 block A line 2 with rogue END token block A line 3 block A line 4 with rogue START token block A line 5 $ pm_1202989_flip_flop_selection.pl 1 4 block A line 1 block A line 2 with rogue END token block A line 3 block A line 4 with rogue START token block A line 5 block D line 1 block D line 2 with rogue START token block D line 3 block D line 4 with rogue END token block D line 5 $ pm_1202989_flip_flop_selection.pl 3 4 2 # NOTE: specified order irre +levant block B line 1 block B line 2 with rogue START token block B line 3 block B line 4 with rogue END token block B line 5 block C line 1 block C line 2 with rogue END token block C line 3 block C line 4 with rogue START token block C line 5 block D line 1 block D line 2 with rogue START token block D line 3 block D line 4 with rogue END token block D line 5 $

    [Side note: As you're new here, you may have been surprised by certain responses. You can safely ignore these; a quick perusal of the "Worst Nodes" page should explain why.]

    — Ken

      Hi Ken

      Thank you so much for the help and the kind side note. I really appreciate it!

Re: [Solved]Need to extract a particular block of lines between two patterns
by runrig (Abbot) on Nov 09, 2017 at 21:50 UTC
    Use the return value of the flip-flop:
    my $count; while (<$fh_r>) { if (my $status = /START/ .. /END/) { $count++ if $status == 1; print if $count == 2; } }

      Hi Runrig,

      Thank you so much for the clear explanation! It works perfect on my side. Out of curiosity, I also modified the code to print out the $status each line of the matched patterns - print "$status\t$count\t$_" if $count == 2; And I got the output results as:

      1 2 START 2 2 lines to be extracted 3E0 2 END

      Do you by any chance know why the third line is 3E0 and what does that stand for? Thank you in advance!

        ... why the third line is 3E0 and what does that stand for?

        From the discussion of Range Operators in scalar context (the "flip-flop" operator):

        The right operand is not evaluated while the operator is in the "false" state, and the left operand is not evaluated while the operator is in the "true" state. ... The value returned is either the empty string for false, or a sequence number (beginning with 1) for true. The sequence number is reset for each range encountered. The final sequence number in a range has the string "E0" appended to it, which doesn't affect its numeric value, but gives you something to search for if you want to exclude the endpoint.
        [Emphases added]
        So IOW, the final sequence number matches  qr{ \A \d+ E0 \z }xms See also The Scalar Range Operator ("Exluding Markers" section) and Flipin good, or a total flop? (specifically Re: Flipin good, or a total flop?), both in the Monastery's Tutorials section.


        Give a man a fish:  <%-{-{-{-<

Re: [Solved]Need to extract a particular block of lines between two patterns
by sundialsvc4 (Abbot) on Nov 09, 2017 at 20:52 UTC

    “Thank you, all, for once-again clarifying your by-now ... (have you actually bothered to notice this?) ... “who really cares anymore anyway,“ completely inbred, completely political ... positions?

    I already understand that there are a double-handful of you, who, upon this web-site about which an ever-shrinking number of individuals actually cares about anyway, continue to determinedly maintain a vendetta about a specific Monk.   (So far as I can tell, “at the exclusion of each and every other such participant here.”)

    But let’s look at the source code, shall we?

    my $count = 0; 11 while (<$fh_r>) { 12 if (/START/ .. /END/) { 13 $count++ if /START/; 14 print if ($count == 2); 15 } 16 }

    Do you see the very-obvious bug in the code?   Obviously you do.   There is only one point in time when $count will be exactly equal to 2, and anyone could spot the flaw immediately.

    Therefore, I, in entirely good faith, spelled out the basics of an alternative algorithm which not only would actually work, but which would be future-maintainable.   (As a manager right now of more than a dozen people who would individually qualify as “Perl Monks,” I happen to think that my point-of-view has certain weight.   Not to mention having written computer software for now just over four decades and counting ... how old are you?)

    Nevertheless, I daresay that you, in your still-determined pursuit of your still-pointless vendettas from, by now, “a full decade ago,” never once bothered to consider a single thing that I had actually said to the OP, except for the fact that sundialsvc4 had been the one to post it.   In so doing, you frankly completely-voided your assumed position of authority with regards to what the OP actually had to say – this being the entire reason that (s)he came here in the first place.

    So, do you not think that it is well past the appointed time(!) to “just let bygones be bygones?”   I rather think that most of the c-u-s-t-o-m-e-r-s which this website somehow still attracts would agree with me.   To them, I daresay, you are today just making noise.

      Do you see the very-obvious bug in the code? Obviously you do.

      Sorry, but I don't. Could you explain or demonstrate the bug with an SSCCE?

      There is only one point in time when $count will be exactly equal to 2, and anyone could spot the flaw immediately.

      That seems to be the OP's specification, "extract only the second block of lines that meet the pattern" (emphasis mine). Note how the OP's sample input data explicitly says "lines not to be extracted". I've added an extra block to the input for testing:

      use warnings; use strict; my $count = 0; while (<DATA>) { if (/START/ .. /END/) { $count++ if /START/; print if ($count == 2); } } __DATA__ abc efg ... START lines NOT to be extracted 1 END START lines to be extracted END START lines NOT to be extracted 2 END START lines NOT to be extracted 3 END

      Outputs:

      START lines to be extracted END

      ... which seems to me to meet the specification. Cristoforo's code also works as posted, except that it does not output the START and END markers.

      Whereas your code, once I fix all the syntax errors and change "do something with the line" to a print, outputs for the same sample input:

      lines to be extracted lines NOT to be extracted 3

      Update: Added the last two paragraphs, and minor edits for clarity.

      To start with, I'm very sorry, but I really don't have any vendetta against you (nor against any other monk here for that matter). Maybe some people around here do, I don't know for sure, but not me.

      My post was very factual: besides being syntactically wrong, the code you posted (once fixed) wasn't going to do what the OP was asking for. That's it. Nothing to add. Period.

      I would have stopped here if it weren't for your response.

      But your answer calls for additional comments.

      Don't make yourself yourself a victim of others. Here, my personal opinion was that your code was wrong, and I explained in which respects I thought your code suggestion was wrong. Leaving aside the syntax mistakes, you code (if fixed syntactically) doesn't do what what the OP asked for. I only reported that fact. Maybe, after all, I was wrong and your code was right (I really don't think so), but your answer just complains about people not being nice with you, but does not actually discuss in any way my objections to your code.

      Now, I would add that, since in your initial post you insisted quite a bit on thorough testing, perhaps you could check at the very least whether your code could compile flawlessly (I am willing to suppose and accept that: perhaps, you misunderstood the OP's original intent). If you can't do it, at least, ask one of your boys to do it for you.

      I am sure that I have made once in a while silly mistakes in code samples I have posted here or on some other forums; in that case, I would say: "Sorry, I made a mistake," or, maybe, "Sorry, not enough coffee this morning," or some other sentence trying to apologize, or whatever. But, at least, I would recognize my mistake and not accuse people of doing that to me because they don't like me.

      In the case in point, even the Perl compiler appears not to like you--or rather what you write. Yeah, maybe you've got a good reason to become paranoid. Even machines are against you. Or, perhaps, you should question yourself.

      Finally, I have serious doubts about whether you actually know Perl. Your syntax mistakes are glaring. "//" instead of "#" for comments, "false" and "true" for Boolean values, that's pretty damning. Yes, you should learn that "false" has a true value.

      Oh, and BTW, I agree with haukex and don't see "the very-obvious bug" in the OP's code (but I do see several in yours), but that's secondary to what I want to say here.

      Yes, there are probably monks around here who don't like you, or, rather, don't like what you write. But ask yourself: maybe they have some good reasons for that.

      obvious troll is obvious
      As a manager right now of more than a dozen people who would individually qualify as “Perl Monks,” ...

      Yeah, I've had nightmares | managers like that.


      Give a man a fish:  <%-{-{-{-<

      Are you drunk? High? Have mental health issues? You don't know perl basics, stop telling people on a perl forum that you do. You fall into several of these categories: https://en.wikipedia.org/wiki/Victim_playing
      If you want to help people, stop posting trash like this. Stop posting when you don't know anything at all about the topic under discussion (computers, perl, sanity)

      have you actually bothered to notice this

      you reply to wrong node again didnt you notice?

      Flush it

      All your responses come after the thread was marked "Solved"

      Go away Mike Robinson you spammer

Re: [Solved]Need to extract a particular block of lines between two patterns
by sundialsvc4 (Abbot) on Nov 09, 2017 at 01:54 UTC

    If this code works for you – if strenuous testing reveals that it does in fact work in all cases (I suspect that it doesn’t ...) – then of course count your blessings and move on.   But I would much rather see logic that very clearly expresses the actual algorithm and does so in a way that can be easily tested and patched.

    To my way of thinking, the logic is a classic Finite-State Automaton (FSA) which at any time can be said to be in one of three “states”:

    1. Is not within a block of lines that should be considered for extraction.
    2. Is in a block of lines that should be extracted.
    3. Is in a block of lines that should not be extracted.
    Furthermore, the transition between states #2 and #3 “flip-flops” each time.

    One way to do something like this – and this is un-tested – would look something like this, specifically including all comments:

    my $in_a_block = false; my $extract_me = false; // state for FIRST block while (<$fh_r>) { if (/^START$/) { $in_a_block = true; } elsif (/^END$/) { $in_a_block = false; $extract_me = ( $extract_me ? false : true ); // for NEXT time } elseif ($in_a_block && $extract_me) { // do something with the line ($_) } }

    My argument for this approach is that one can readily see how it works and that it does (I think ...) work.   Furthermore, if when the logic needs to be changed, perhaps by someone else, it is possible to do it with relatively little disruption.

    And, I would certainly insist that anyone on my team must create a series of automatic (Test::Most) tests to demonstrate that the logic actually does work and that, over time, it continues to do so.   This would establish that the solution is correct, that it is trustworthy, and that any future changes which introduced bugs would be quickly caught.

      Hi sundialsvc4,

      chengchl's question was essentially: how can I use the flip flop operator and pick up only the second START .. END block of text.

      Even assuming your suggested code were correct Perl code (which it isn't) and would run, it still wouldn't do what the OP was asking for, but only mimic the flip flop operator.

      The flip flop operator has a somewhat surprising edge-case behavior (when exactly is it reset) which might justify writing explicit code, but your code (if it worked) would not do what was asked for.

      Update: s/would do what/would not do what/;. Thanks to hippo for noticing the missing negation at the end of my last sentence above. Now fixed.

      Thank you for your javascript solution to a perl question. When I do run it in perl I get:
      Bareword found where operator expected near "// state" elseif should be elsif syntax error near "// state " Bareword "false" not allowed while "strict subs" in use Bareword "false" not allowed while "strict subs" in use Bareword "false" not allowed while "strict subs" in use Bareword "true" not allowed while "strict subs" in use

      And, I would certainly insist that anyone on my team must create a series of automatic (Test::Most) tests to demonstrate that the logic actually does work
      I know someone in your team who doesn't test their code

      Code that does not even compile, trash words, nothing of value, as usual. The ravings of a sad, mad old man.

        The ravings of a sad, mad old man.

        Be nice. Your time is gonna come.


        The way forward always starts with a minimal test.
        Spam is spam

      If this code works for you ... NOT PERL

      Go away

      Mike Robinson unqualified to give perl advice

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1202989]
Approved by beech
Front-paged by kcott
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (7)
As of 2018-02-22 19:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    When it is dark outside I am happiest to see ...














    Results (298 votes). Check out past polls.

    Notices?