Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Multi Line Regex Matches...

by tj999 (Novice)
on Sep 03, 2017 at 06:06 UTC ( #1198601=perlquestion: print w/replies, xml ) Need Help??
tj999 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks!

I'm struggling with multi line matches, hoping someone can offer advice.

I have text files with data that looks like this:
123,-->456,
456,-->234,
234,-->789,
789,-->123,

I am trying to print the first three digits of the each line followed by the second three digits of the following line to my OUTFILE on a single line. So the output for the four lines above would be:
123,-->234,
456,-->789,
234,-->123,

From what I've been reading I need to use the /s or /m modifier to match a newline. I've been trying to use /s to have "." match a newline. My code looks like this:

@files=glob"SBNUM_*.txt"; open OUTFILE, ">v2test.txt" or die "Could not create outfile!"; foreach $ARG(@files){ open INFILE, "<$ARG" or die "CANNOTOPENINFILE"; while (<INFILE>) { if (m/^(\d\d\d),-->(\d\d\d),.(\d\d\d),-->(\d\d\d),/s) { print OUTFILE "\n$1,-->$4,"; } } }

I have tried a few variations but so far nothing works for me. Any advice or suggestions are greatly appreciated!

TJ

Replies are listed 'Best First'.
Re: Multi Line Regex Matches...
by Athanasius (Chancellor) on Sep 03, 2017 at 06:32 UTC

    Hello tj999,

    When you find a match, you don’t want to consume the whole matched part of the string, since then you’ll miss every second match. You want to consume only the first half (i.e., the first line) of the match, and start looking again at the beginning of the next line. For this you need a positive lookahead assertion, which prevents the second half of the matched text from being consumed:

    use strict; use warnings; my $in = '123,-->456, 456,-->234, 234,-->789, 789,-->123, '; print "$1,-->$2\n" while $in =~ / (\d{3}) ,--> \d{3}, \s+ (?= \d{3} ,--> (\d{3}) ) /g +sx; # | lookahead assertion |

    Output:

    16:28 >perl 1814_SoPW.pl 123,-->234 456,-->789 234,-->123 16:28 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Multi Line Regex Matches...
by haukex (Abbot) on Sep 03, 2017 at 07:59 UTC

    If you want to match across multiple lines, you need to keep multiple lines in memory, but the problem is your while (<INFILE>) only reads one line at a time. So for Athanasius's solution to work, you need to read the entire file into memory ("slurp"), which you can do like this:

    open my $ifh, '<', $infile or die "$infile: $!"; my $data = do { local $/; <$ifh> }; close $ifh;

    Note that you really should Use strict and warnings, and some other more modern features like lexical filehandles and the three-argument open like I am showing here.

    Another approach is to read the file line-by-line, and keep some state in memory. Since you are reading from multiple input files, I suspect you might want to treat them as a continuous stream of input? If so, then for the above approach, you'd have to slurp all the files into memory and concatenate them, which might be a lot, depending on the size of the input files. But that's not necessary for the line-by-line approach:

    use warnings; use strict; use File::Glob ':bsd_glob'; my @infiles = sort glob 'SBNUM_*.txt'; my $outfile = 'v2test.txt'; open my $ofh, '>', $outfile or die "$outfile: $!"; my ($prevleft,$prevright); for my $infile (@infiles) { open my $ifh, '<', $infile or die "$infile: $!"; while (<$ifh>) { my ($left,$right) = /^(\d{3}),-->(\d{3}),$/ or die "Couldn't match '$_'"; if (defined $prevleft) { print $ofh "$prevleft,-->$right,\n"; } ($prevleft,$prevright) = ($left,$right); } close $ifh; } close $ofh;

    Or, just for fun, the same code as a oneliner:

    perl -nle '/^(\d{3}),-->(\d{3}),$/||die;defined$x&&print"$x,-->$2,";$x +=$1' SBNUM_*.txt >v2test.txt
Re: Multi Line Regex Matches...
by BillKSmith (Vicar) on Sep 03, 2017 at 14:05 UTC
    You may prefer this line-by-line approach:
    C:\Users\Bill\forums\monks>type tj999.pl use strict; use warnings; open my $INFILE, '<', \<< 'END_INFILE'; 123,-->456, 456,-->234, 234,-->789, 789,-->123, END_INFILE my $old_line = <$INFILE>; while (my$new_line = <$INFILE>) { my $sample = $old_line.$new_line; $sample =~ /^(\d{3}).*(\d{3})\,\s*$/s; print "$1,-->$2\n"; $old_line = $new_line } C:\Users\Bill\forums\monks>perl tj999.pl 123,-->234 456,-->789 234,-->123
    Bill

      Thank you all so much for your help!

      I really appreciate the info, you have all given me a lot to work with.

      THANK YOU EVERYONE!!!

Re: Multi Line Regex Matches...
by kcott (Chancellor) on Sep 03, 2017 at 23:35 UTC

    G'day tj999,

    Reading your entire file into a string and then stepping through that string with a regex could be problematic. Given you know the exact record structure, I'd question whether a regex is even an appropriate tool in this case. Perhaps something as simple as the following would suffice.

    #!/usr/bin/env perl use strict; use warnings; my $sep = ',-->'; my $sep_len = length $sep; my $tail = ",\n"; my $last; while (<DATA>) { my ($first, undef, $second) = unpack "A3 A$sep_len A3"; print $last, $sep, $second, $tail if defined $last; $last = $first; } __DATA__ 123,-->456, 456,-->234, 234,-->789, 789,-->123,

    Output:

    123,-->234, 456,-->789, 234,-->123,

    See also:

    • unpack — the function I've used here.
    • pack — documentation on the templates.
    • perlpacktut — a tutorial on pack and unpack.

    [I was going to comment on various aspects of your code. I can see haukex has already addressed this; so, rather than repeating it, I'll just say I concur with all the points he's made.]

    — Ken

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1198601]
Approved by karlgoethebier
Front-paged by Corion
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (7)
As of 2018-06-23 16:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?



    Results (125 votes). Check out past polls.

    Notices?