http://www.perlmonks.org?node_id=208384

Rich36 has asked for the wisdom of the Perl Monks concerning the following question:

In a piece of code that I'm working on, I've got some report data that consists of a scalar that contains data about tags in a document and what line the tags are on. The data is organized so that there is a line number, followed by a colon, then the report data which can span multiple lines. What I'm trying to do is split the data up so that I've got a hash with the line numbers as the keys and the data as the values. The problem that I'm running into is when the data for a line number spans multiple lines ("\n" delimited). I'm attempting to use a while loop and regex to split the data up, but it's not exactly working right. Any help would be greatly appreciated...

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $info;
$info .= $_ while(<DATA>);

my %lines;
while($info =~ m/(\d+)\: (.+?)/gms) {
    $lines{$1} = $2;
}

print Dumper (%lines);

__DATA__
3: Tag <test> found
Tag <test> found
5: Tag <test> found
7: Tag <test> found
14: Tag <test> found
16: Tag <test> found
18: Tag <test> found
21: Tag <test> found
25: Tag <test> found
27: Tag <test> found
29: Tag <test> found
32: Tag <test> found
34: Tag <test> found
49: Tag <test> found
80: Tag <test> found
98: Tag <test> found
Tag <test> found

«Rich36»

Re: Matching over multiple lines in a scalar
by Enlil (Parson) on Oct 27, 2002 at 22:18 UTC
    I am assuming you want to slurp the whole file into the variable $info. It might be easier to do this as follows:
    { local $/ = undef; $info = <DATA>; }
    For more info on $\ look at perlvar

    Apart from this, I have changed the regex a little bit to do more of what I think you want it to do. Here is the code.

    use strict;
    use warnings;
    use Data::Dumper;

    my $info;
    {
        local $/ = undef;
        $info = <DATA>;
    }

    my %lines;
    while ($info =~ m/(\d+)\: (.+?)\n(?=\d)/gs) {
        $lines{$1} = $2;
    }

    print Dumper (%lines);

    __DATA__
    3: Tag <test> found
    Tag <test> found
    5: Tag <test> found
    7: Tag <test> found
    14: Tag <test> found
    16: Tag <test> found
    18: Tag <test> found
    21: Tag <test> found
    25: Tag <test> found
    27: Tag <test> found
    29: Tag <test> found
    32: Tag <test> found
    34: Tag <test> found
    49: Tag <test> found
    80: Tag <test> found
    98: Tag <test> found
    Tag <test> found
    and here is the output:
    $VAR1 = '29';
    $VAR2 = 'Tag <test> found';
    $VAR3 = '21';
    $VAR4 = 'Tag <test> found';
    $VAR5 = '7';
    $VAR6 = 'Tag <test> found';
    $VAR7 = '14';
    $VAR8 = 'Tag <test> found';
    $VAR9 = '80';
    $VAR10 = 'Tag <test> found';
    $VAR11 = '32';
    $VAR12 = 'Tag <test> found';
    $VAR13 = '16';
    $VAR14 = 'Tag <test> found';
    $VAR15 = '49';
    $VAR16 = 'Tag <test> found';
    $VAR17 = '25';
    $VAR18 = 'Tag <test> found';
    $VAR19 = '3';
    $VAR20 = 'Tag <test> found
    Tag <test> found';
    $VAR21 = '34';
    $VAR22 = 'Tag <test> found';
    $VAR23 = '18';
    $VAR24 = 'Tag <test> found';
    $VAR25 = '27';
    $VAR26 = 'Tag <test> found';
    $VAR27 = '5';
    $VAR28 = 'Tag <test> found';
    The regex m/(\d+)\: (.+?)\n(?=\d)/gs looks for a number, then lazily matches up to a newline, but only if the first thing after that newline is a digit. The (?=\d) is a zero-width lookahead: it checks for the digit without consuming it, so the next match can start at that digit.

    Well, I think this is what you want.

    UPDATE: I meant to have placed the Dumper(\%lines) as BrowserUK has done below, instead of just Dumper(%lines). Don't know what I was thinking.

    -enlil

      my $info; { local $/ = undef; $info = <DATA>; }
      can be reduced down to:
      my $info = do {local $/;<DATA>};
      UPDATE:
      Well shucks ... chromatic has already said that in this thread (and thanks for the doobie doobie do, Aristotle).

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)
      

      Thanks very much - that's exactly what I was looking for. I need to study up on ?=...

      The only problem with using that regex is that it doesn't capture the last line of input, because nothing beginning with a digit follows the final record. The easy way around that was to add $info .= '00:';.


      «Rich36»
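
      One way to avoid appending a sentinel such as '00:' is to let the lookahead also accept the end of the string. What follows is only a minimal sketch, not code from the thread: the sample string in $info is made up for illustration, and it assumes the data ends with a newline and that continuation lines never begin with a digit.

      ```perl
      use strict;
      use warnings;

      # Hypothetical sample data: record 3 spans two lines, record 5 is last.
      my $info = "3: first\nsecond\n5: third\n";

      my %lines;
      # (?=\d|\z) succeeds either before the next record's number or at the
      # very end of the string, so the final record is captured too.
      while ($info =~ m/(\d+): (.+?)\n(?=\d|\z)/gs) {
          $lines{$1} = $2;
      }

      print "$_ => [$lines{$_}]\n" for sort { $a <=> $b } keys %lines;
      ```

      Because the lookahead is zero-width, the digit that starts the next record is left in place for the next iteration of the loop.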
      { local $/ = undef; $info = <DATA>; }
      For more info on $\ look at perlvar

      I think that you mean "info on $/ look ...", at least that variable exists elsewhere. :)

Re: Matching over multiple lines in a scalar
by BrowserUk (Pope) on Oct 27, 2002 at 22:31 UTC

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my $info;
    $info .= $_ while(<DATA>);

    my %lines;
    while($info =~ m/(?:^|\n)(\d+)\:(.+?)(?=(?:\n\d)|$)/gs) {
        $lines{$1} = $2;
    }

    print Dumper (\%lines);

    __DATA__
    3: Tag <test> found
    Tag <test> found
    5: Tag <test> found
    7: Tag <test> found
    14: Tag <test> found
    16: Tag <test> found
    18: Tag <test> found
    21: Tag <test> found
    25: Tag <test> found
    27: Tag <test> found
    29: Tag <test> found
    32: Tag <test> found
    34: Tag <test> found
    49: Tag <test> found
    80: Tag <test> found
    98: Tag <test> found
    Tag <test> found

    Gives

    c:\test>208384
    $VAR1 = {
              '29' => ' Tag <test> found',
              '21' => ' Tag <test> found',
              '7' => ' Tag <test> found',
              '14' => ' Tag <test> found',
              '80' => ' Tag <test> found',
              '32' => ' Tag <test> found',
              '16' => ' Tag <test> found',
              '49' => ' Tag <test> found',
              '25' => ' Tag <test> found',
              '3' => ' Tag <test> found
    Tag <test> found',
              '98' => ' Tag <test> found
    Tag <test> found',
              '34' => ' Tag <test> found',
              '18' => ' Tag <test> found',
              '27' => ' Tag <test> found',
              '5' => ' Tag <test> found'
            };

    c:\test>

    Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

      Thanks for the help on this. By the way - just out of curiosity - why pass a hash reference to Data::Dumper instead of just the hash?


      «Rich36»

        If you pass the hash, it gets flattened and passed as a list of 30 separate scalars, and the association between key => value pairs is lost. By passing the reference, Data::Dumper knows it's a hash and outputs it as such, with the key => value pairs clearly associated and shown as part of a compound entity.

        In fact, you can write this output to a file, read it back later and eval the string to reconstruct the hash in memory. This is often used as a poor man's DB.

        Run the program both ways to see the difference.
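
        A small sketch of the difference, including the eval round trip. The two-entry hash is a made-up stand-in for %lines, and the counting regex is only there to make the difference visible:

        ```perl
        use strict;
        use warnings;
        use Data::Dumper;

        # Hypothetical two-entry hash standing in for %lines.
        my %lines = (3 => 'first', 5 => 'third');

        # Passing the hash itself: Dumper receives four separate scalars.
        my $flat = Dumper(%lines);
        my $flat_vars = () = $flat =~ /^\$VAR\d+ = /mg;

        # Passing a reference: Dumper receives one hash and emits one $VAR1.
        my $dump = Dumper(\%lines);
        my $ref_vars = () = $dump =~ /^\$VAR\d+ = /mg;

        print "flattened: $flat_vars variables, reference: $ref_vars variable\n";

        # Round trip: the dump is Perl source that assigns to $VAR1, so a
        # string eval with a lexical $VAR1 in scope rebuilds the hash.
        # Only eval dumps that come from a trusted source.
        my $VAR1;
        eval $dump;
        die "eval failed: $@" if $@;
        print "restored 3 => $VAR1->{3}\n";
        ```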


        Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy
Re: Matching over multiple lines in a scalar
by gjb (Vicar) on Oct 27, 2002 at 22:23 UTC

    I'd suggest a slightly different approach that has the advantage that one can read line by line so that there's no need to have all data in memory (which is nice if you've a lot of data).

    #!perl
    use strict;

    my %data;
    my ($key, $data);
    while (<DATA>) {
        chomp($_);
        if (/^(\d+):\s*(.+)$/) {
            $data{$key} = $data if defined $key;
            $key = $1;
            $data = $2;
        } else {
            $data .= " $_";
        }
    }
    $data{$key} = $data if defined $key;

    foreach my $key (sort {$a <=> $b} keys %data) {
        print "$key: '$data{$key}'\n";
    }

    __DATA__
    3: Tag <test> found 1
    Tag <test> found 2
    5: Tag <test> found 3
    7: Tag <test> found 4
    14: Tag <test> found 5
    16: Tag <test> found 6
    18: Tag <test> found 7
    21: Tag <test> found 8
    25: Tag <test> found 9
    27: Tag <test> found 10
    29: Tag <test> found 11
    32: Tag <test> found 12
    34: Tag <test> found 13
    49: Tag <test> found 14
    80: Tag <test> found 15
    98: Tag <test> found 16
    Tag <test> found 17

    (Struck out:) Essentially, this is a finite state machine with two states, new-line and continue-line, represented by the if and the else part with the variable $key playing the role of state variable.

    Essentially, this is a finite state machine with three states, initial, new-line and continue-line, the last two represented by the if and the else part with the variable $key playing the role of state variable distinguishing between the initial (undef) and the other two states.

    (I modified the data slightly to be able to check that the data actually ends up with the right key in the hash.)

    Hope this helps, -gjb-

    Update: this explanation is more precise than the version I struck out.

      The data's actually coming in as a scalar from another sub and isn't significantly large, so I don't think it's worth it to split it and then deal with it that way. That's a cool approach to the problem, though, and I may end up adopting it if the data gets too large. Thanks.


      «Rich36»
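
      If the report arrives as a scalar, the line-by-line approach can still be applied without restructuring the data, for example by opening a read-only filehandle on the scalar (supported since Perl 5.8). A minimal sketch under that assumption; $info and its contents are made up for illustration:

      ```perl
      use strict;
      use warnings;

      # Hypothetical report text as it might arrive from another sub.
      my $info = "3: first\nsecond\n5: third\n";

      # Open an in-memory filehandle that reads from the scalar.
      open my $fh, '<', \$info or die "Cannot open in-memory handle: $!";

      my (%data, $key);
      while (my $line = <$fh>) {
          chomp $line;
          if ($line =~ /^(\d+):\s*(.+)$/) {
              # A new record starts: remember its key and first line.
              $key = $1;
              $data{$key} = $2;
          } elsif (defined $key) {
              # Continuation line: append to the current record.
              $data{$key} .= " $line";
          }
      }
      close $fh;

      print "$_: '$data{$_}'\n" for sort { $a <=> $b } keys %data;
      ```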
Re: Matching over multiple lines in a scalar
by chromatic (Archbishop) on Oct 27, 2002 at 22:48 UTC

    Why use zero-width assertions, when you're already using the /m flag? I like this:

    use Data::Dumper;

    my $info = do { local $/; <DATA> };

    my %lines;
    while($info =~ m/(\d+)\: (.+?)$/gm) {
        $lines{$1} = $2;
    }

    print Dumper (\%lines);

    Of course, this also has an appeal:

    my %lines = $info =~ /(\d+): (.+?)$/gm;

    Passing a reference to Dumper allows Data::Dumper to dump the entire data structure without listifying it first.

    Update: I miscopied the test data. Oops. Zero-width assertions are the way to go. :)

      Why use zero-width assertions, when you're already using the /m flag?

      To catch the broken lines. Yours is a very elegant construction, which I intend to steal, but I don't think that snippet meets the original requirements as it is. If you knew the tags wouldn't contain numerals, which I doubt, you could change it to:

      my $info = do { local $/; <DATA> };
      my %lines = $info =~ /(\d+): ([^\d]+)/gs;

      but otherwise I can't see an alternative to the (?:^|\n).

      btw, is there any way to catch the matched values during a split? it would make this nice and tidy.
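
      For what it's worth, split does return the matched text when the pattern contains capturing parentheses: each captured group is inserted into the result list between the fields. A rough sketch of how that could be used here, assuming every record starts at the beginning of a line with digits, a colon and a space; the sample $info is hypothetical:

      ```perl
      use strict;
      use warnings;

      # Hypothetical sample data: record 3 spans two lines.
      my $info = "3: first\nsecond\n5: third\n";

      # The captured line number is returned between the fields, so the
      # list is ('', '3', "first\nsecond\n", '5', "third\n"). The leading
      # empty field comes from the match at the very start of the string
      # and is discarded by assigning it to undef.
      my (undef, %lines) = split /^(\d+): /m, $info;

      # chomp on a hash removes the trailing newline from each value.
      chomp %lines;

      print "$_ => [$lines{$_}]\n" for sort { $a <=> $b } keys %lines;
      ```

      Note that a continuation line which itself begins with digits, a colon and a space would be treated as a new record.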

      update: damnation. redundant again.

      another update: I couldn't resist shrinking gjb's cheaper version and introducing a useful but quite unrequested array reference:

      my ($key, %data);
      for (<DATA>) {
          /^(?:(\d+):\s*)*(.+)$/;
          push @{ $data{ $key = $1 || $key } }, $2;
      }

        Take it one step further still and do away with the temporary scalar. This works.

        my %lines = do { local $/; <DATA> =~ m/(?:^|\n)(\d+)\:(.+?)(?=(?:\n\d)|$)/gs };

        Now I'll wait for sauoq to reduce the regex to 3 chars and a twiddle and we've got a golf solution.:^)


        Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

      That doesn't work for the first and last 'lines' which actually each consist of 2 lines.


      Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy