PerlMonks  

Regex to fix up records, some multiline fields, some not

by butchie3980 (Acolyte)
on Aug 20, 2013 at 08:59 UTC (#1050152, perlquestion)
butchie3980 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I have a batch of data dumps that are shipped to me from another site, and I want to convert the data to XML. I'm using regexes to grab the fields. Most fields are on one line, but sometimes a record has a multiline field, and I'm struggling to get that case working. Each record is pulled (using Tie::File) into a multi-line scalar called $currentrecord. Below is an example of the code I've tried, with sample data.

    if ($currentrecord =~ m/^field2(.*)\nfield3/mi) {
        $field2data = $1;
    }

Here are two examples of the data encountered:

    record 1:
    field1: data 1 monday
    field2: data 2 monday
    field3: data 3 monday

    record 2:
    field1: data 1 tuesday
    field2: data 2 tuesday
    tuesday details line 1
    tuesday details line 2
    field3: data 3 tuesday

The above approach isn't working when field2 has multiple lines. How can I catch both record styles?

UPDATE
OK, I tested all of the responses to this posting, and they were all effective. Thank you so much for your help.

Re: Regex to fix up records, some multiline fields, some not
by McA (Priest) on Aug 20, 2013 at 09:09 UTC

    What is the rule to determine that a row is a continued line?

    McA
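
One possible answer to that question, as a minimal sketch: assume that any line starting with `fieldN:` opens a new field and that every other line continues the field before it. The sample data suggests this rule but the original post does not state it, and the helper name `parse_record_by_lines` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into a hash of field name => value.
# Assumed rule: a line matching /^field\d+:/ starts a new field; any other
# line is a continuation of the field that precedes it.
sub parse_record_by_lines {
    my ($record) = @_;
    my (%fields, $current);
    for my $line (split /\n/, $record) {
        if ($line =~ /^(field\d+):\s*(.*)$/i) {
            $current = lc $1;              # remember which field we are in
            $fields{$current} = $2;
        }
        elsif (defined $current) {
            $fields{$current} .= "\n$line"; # continuation line
        }
        # lines before the first field (e.g. "record 2:") are skipped
    }
    return \%fields;
}

my $fields = parse_record_by_lines(<<'END');
record 2:
field1: data 1 tuesday
field2: data 2 tuesday
tuesday details line 1
tuesday details line 2
field3: data 3 tuesday
END

print "$fields->{field2}\n";
```

If the real rule differs (for example, continuation lines are always indented), only the first pattern would need to change.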

Re: Regex to fix up records, some multiline fields, some not
by Athanasius (Monsignor) on Aug 20, 2013 at 09:10 UTC

    You need to add an /s modifier to the regex:

    #! perl
    use strict;
    use warnings;

    our $/ = '';

    while (my $currentrecord = <DATA>)
    {
        if ($currentrecord =~ m/^field2(.*)\nfield3/msi)
        {
            my $field2data = $1;
            print "Found \$field2data = $field2data\n";
        }
    }

    __DATA__
    record 1:
    field1: data 1 monday
    field2: data 2 monday
    field3: data 3 monday

    record 2:
    field1: data 1 tuesday
    field2: data 2 tuesday
    tuesday details line 1
    tuesday details line 2
    field3: data 3 tuesday

    Output:

    19:07 >perl 692_SoPW.pl
    Found $field2data = : data 2 monday
    Found $field2data = : data 2 tuesday
    tuesday details line 1
    tuesday details line 2
    19:08 >

    See perlre#Modifiers.
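
To see why /s matters here, this small sketch runs the same pattern with and without it. The record string is a cut-down version of the sample data, and the variable names are only for illustration.

```perl
use strict;
use warnings;

my $record = "field2: data 2 tuesday\ntuesday details line 1\nfield3: data 3 tuesday\n";

# /m only: '.' stops at "\n", so (.*) cannot span the continuation line
# and the "\nfield3" that follows never lines up.
my $without_s = $record =~ m/^field2(.*)\nfield3/mi ? 1 : 0;

# /m and /s: '.' also matches "\n", so (.*) can run up to "\nfield3".
my ($with_s) = $record =~ m/^field2(.*)\nfield3/msi;

print "without /s: ", ($without_s ? "match" : "no match"), "\n";
print "with /s: $with_s\n";
```

Note that with /s the greedy (.*) runs to the last "\nfield3" in the string. If that text could occur more than once in a record, the non-greedy (.*?) is probably the safer choice.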

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Regex to fix up records, some multiline fields, some not
by Utilitarian (Vicar) on Aug 20, 2013 at 09:13 UTC
    Try the unfold method of Text::LineFold

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
Re: Regex to fix up records, some multiline fields, some not
by Eily (Deacon) on Aug 20, 2013 at 12:37 UTC

    Instead of having one regex for each field, you can use the /g modifier to step from one field to the next, and use the (?=EXPR) lookahead syntax to check that what follows your field is another field and not more data:

    use Data::Dumper;

    my $regex = qr/
        ^field(\d+):        # find a line starting by 'field' and capture its number
        (.*?)\n?            # find the smallest string before the next
        (?=^field\d+:|\z)   # line starting by 'field' or end of record. Rewind just before that point after the match.
    /msx; # ^ matches beginning of line, . matches \n and spaces and comments are ignored in the regex

    my %result;
    my $count = 1;
    { # block to limit the scope of local
        local $/ = ""; # records are separated by empty lines
        while(<DATA>)
        {
            my %hash;
            while(/$regex/g)
            {
                $hash{"field$1"} = $2;
            }
            $result{"record ".$count++} = \%hash;
        }
    }

    print Dumper \%result;

    __DATA__
    field1: data 1 monday
    field2: data 2 monday
    field3: data 3 monday

    field1: data 1 tuesday
    field2: data 2 tuesday
    tuesday details line 1
    tuesday details line 2
    field3: data 3 tuesday

    Output:

    $VAR1 = {
              'record 1' => {
                              'field1' => ' data 1 monday',
                              'field2' => ' data 2 monday',
                              'field3' => ' data 3 monday
    '
                            },
              'record 2' => {
                              'field1' => ' data 1 tuesday',
                              'field2' => ' data 2 tuesday
    tuesday details line 1
    tuesday details line 2',
                              'field3' => ' data 3 tuesday'
                            }
            };
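
Since the original goal was XML, here is a rough sketch of how one of those per-record hashes might be written out. The `<record>` wrapper, the use of field names as tag names, and both helper names are assumptions, not anything specified in the thread. For real data a proper XML module would be more robust than hand-rolled escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: render one record (hash of field name => value) as XML.
# The <record> element and tag-per-field layout are assumptions.
sub record_to_xml {
    my ($fields) = @_;
    my $xml = "<record>\n";
    for my $name (sort keys %$fields) {
        # trim the leading space and any trailing newline left by the regex
        (my $value = $fields->{$name}) =~ s/^\s+|\s+$//g;
        $xml .= "  <$name>" . xml_escape($value) . "</$name>\n";
    }
    return $xml . "</record>\n";
}

print record_to_xml({
    field1 => ' data 1 tuesday',
    field2 => " data 2 tuesday\ntuesday details line 1\ntuesday details line 2",
    field3 => ' data 3 tuesday',
});
```

The trim step deals with the leading space and the stray trailing newline visible in the Dumper output above.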
