Contextual find and replace large config file

by Veltro (Hermit)
on Jan 02, 2019 at 13:01 UTC (#1227916)

Veltro has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I hope someone can help me with some ideas on this.

Quite often I end up working with big text files (~500k lines) containing configuration data that I want to change using a Perl program. The data files often don't have any official format. The structure of these kinds of files is often similar, and the content could look something like the following examples:

    #ObjectType1
    Param1: 8
    Param2: SomeText
    #ObjectType1.NestedObject
    Param1: 3
    Param2: SomeText
    #ObjectType1
    ...
    #ObjectType2
    ...

or

    ObjectType1
    {
      Param1 = 8
      Param2 = SomeText
      NestedObject
      {
        Param1 = 3
        Param2 = SomeText
      }
    }
    ObjectType2
    {
      ...
    }
    ObjectType1
    {
      ...
    }

Most of the time I want to do something like changing the values of parameters for a certain object type while leaving all the other lines in the data file untouched. A very simplistic approach that I have used is shown in the next code example (for the second data example). It reads the file line by line, keeps track of which 'context' it is currently reading, and acts depending on that context. It works fine (as long as the format does not change too much); however, the more complex the things I want to do, the more these kinds of snippets tend to become complex and difficult to maintain.

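    use strict ;
    use warnings ;

    my $file = "test" ;

    open (my $fhi, "<", $file . ".dat" ) or die "Cannot open $file.dat\n" ;
    open (my $fho, ">", $file . "_out.dat" ) or die "Cannot open $file" . "_out.dat\n" ;

    my $context = "" ;   # name of the object currently being read, "" if none

    while ( my $line = <$fhi> ) {
        chomp $line ;
        # entering an object of the type we want to change
        if ( $line =~ /ObjectType1/ ) {
            $context = "ObjectType1" ;
        }
        # a closing brace ends the context
        # (to anchor at the start of a line, the anchor is ^, as in /^\}/)
        if ( $line =~ /$\}/ ) {
            $context = "" ;
        }
        if ( $context eq "ObjectType1" ) {
            # inside the wanted object: replace the known parameters
            if ( $line =~ /Param1/ ) {
                print $fho "Param1 = 0\n" ;
            } elsif ( $line =~ /Param2/ ) {
                print $fho "Param2 = SomeOtherText\n" ;
            } else {
                print $fho $line . "\n" ;
            }
        } else {
            # outside the wanted object: copy the line unchanged
            print $fho $line . "\n" ;
        }
    }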

Does anyone know of a better or more generic way to do this kind of thing? I am looking for a very simple approach (search and replace, not reading the entire data file into memory) where I can flexibly define a formula that is applied to a parameter within the scope of the context it is in.

Thanks, Veltro

edit: /\}/ => /$\}/

Re: Contextual find and replace large config file
by haukex (Bishop) on Jan 02, 2019 at 17:27 UTC
    It works fine (as long as the format does not change too much), however the more complex things that I want to do these kind of snippets tend to become very complex and difficult to maintain. ... I am looking for a very simple approach (search and replace, not reading the entire data file to memory)

    It depends a lot on how much you can trust how strict the configuration file format is. For example, if you can be absolutely certain that, like in your example, the opening and closing braces are always on a line by themselves, then it'd be possible to implement a fairly simple line-by-line parser that keeps the names of the current sections on a stack, so that you can differentiate between different nested sections that happen to have the same name - I'm thinking something like the following:
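    A minimal sketch of the stack idea (not haukex's original code), assuming that every brace sits alone on its own line and that the section name is the last non-brace line before an opening brace. The names `rewrite_lines` and `%changes` are hypothetical, as is the convention of joining the section stack with "/" to form a path:

```perl
use strict;
use warnings;

# Hypothetical rule table: "Section/Path" => { ParamName => new value }
my %changes = (
    'ObjectType1' => { Param1 => 0, Param2 => 'SomeOtherText' },
);

# Rewrite lines one at a time; @stack holds the names of the open sections.
sub rewrite_lines {
    my ($in, $out) = @_;          # input and output filehandles
    my @stack;
    my $prev;                     # last plain line seen: a possible section name
    while ( my $line = <$in> ) {
        if ( $line =~ /^\h*\{\h*$/ ) {                     # "{" on its own line
            push @stack, defined $prev ? $prev : '';
        }
        elsif ( $line =~ /^\h*\}\h*$/ ) {                  # "}" on its own line
            die "Unbalanced '}' at line $.\n" unless @stack;
            pop @stack;
        }
        elsif ( $line =~ /^(\h*)(\w+)(\h*=\h*)(.*)$/ ) {   # Name = Value
            my $rule = $changes{ join '/', @stack };
            $line = "$1$2$3$rule->{$2}\n" if $rule && exists $rule->{$2};
        }
        else {
            ($prev) = $line =~ /^\h*(.*?)\h*$/;            # remember trimmed name
        }
        print {$out} $line;
    }
    die "Unclosed section(s): @stack\n" if @stack;
}

# Demo on an in-memory file; real use would pass handles opened on disk.
my $demo = "ObjectType1\n{\n  Param1 = 8\n  NestedObject\n  {\n    Param1 = 3\n  }\n}\n";
open my $in, '<', \$demo or die "Cannot open in-memory input: $!";
rewrite_lines( $in, \*STDOUT );
```

    Because the rule is looked up by the full path, a nested section that merely shares the name ObjectType1 (say ObjectType2/Foo/ObjectType1) is left alone, and only one line is held in memory at a time.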

    But once things start getting more complex, I'd recommend a "real" parser instead. You can check the Config:: namespace to see if there happen to be any modules that match your config format. 500k lines isn't all too much to read into memory at once, IMO, unless you're running on some really memory-restricted machine. In the worst case, you can write a parser yourself, e.g. using the m/\G.../gc technique (there's one example in the Perl docs in perlop under "\G assertion"), or using a full grammar (Parse::RecDescent, Regexp::Grammars, Marpa::R2, ...).

    Here's a solution using m/\G.../gc, followed by a Regexp::Grammars example (the latter only parses, it doesn't do the replacement). In both, I've made some assumptions about the file format, such as that a Name = Value pair must appear on a single line by itself, that the section names may or may not contain whitespace, and so on (I've chosen slightly different rules in both). What I like about these kinds of solutions is that they're "just" regular expressions, and as long as one can deal with those, it should hopefully be understandable.
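    As an illustration of the m/\G.../gc technique only (a sketch, not haukex's code): the whole text is held in one string, and each alternative tries to match exactly at the current position, with /c keeping that position when an alternative fails. The function name `rewrite_string`, its arguments, and the format rules (section name on one line, "{" alone on the next, Name = Value on a single line) are assumptions:

```perl
use strict;
use warnings;

# Replace values of "Name = Value" lines inside the section path $want.
sub rewrite_string {
    my ($text, $want, $changes) = @_;
    my @stack;                    # names of the currently open sections
    my $out = '';
    pos($text) = 0;
    while ( pos($text) < length $text ) {
        if ( $text =~ /\G ( \h* ([\w\h]*\w) \h* \n \h* \{ \h* (?:\z|\n) )/xgc ) {
            push @stack, $2;      # "Name" line followed by a "{" line
            $out .= $1;
        }
        elsif ( $text =~ /\G ( \h* \} \h* (?:\z|\n) )/xgc ) {
            die "Unbalanced '}'\n" unless @stack;
            pop @stack;           # "}" line closes the innermost section
            $out .= $1;
        }
        elsif ( $text =~ /\G ( \h* (\w+) \h* = \h* ) ([^\n]*) (\z|\n)/xgc ) {
            my ($pre, $name, $val, $eol) = ($1, $2, $3, $4);
            $val = $changes->{$name}
                if join('/', @stack) eq $want && exists $changes->{$name};
            $out .= "$pre$val$eol";
        }
        else {
            # Anything else (comments, blank lines, ...) is copied unchanged.
            $text =~ /\G ( [^\n]* (?:\z|\n) )/xgc
                or die "No progress at offset " . pos($text) . "\n";
            $out .= $1;
        }
    }
    die "Unclosed section(s): @stack\n" if @stack;
    return $out;
}

print rewrite_string( "ObjectType1\n{\n  Param1 = 8\n}\n",
                      'ObjectType1', { Param1 => 0 } );
```

    Unlike a line-by-line loop, this needs the whole input in memory, which fits the remark above that 500k lines is usually not too much to read at once.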

      This is great stuff haukex

      I think that using Regexp::Grammars is probably the best solution; however, I am getting a YACC feeling and think this kind of thing is programming on an entirely different level. So currently I am looking at your second approach, which I think will offer me the flexibility that I am looking for.

      Actually I think this will help me to take this even one step further and build a more advanced configuration which will allow me to specify a filter and formulas to act on parameters. And for this I am thinking along the same lines as LanX (using a cache, separating functionality into functions, etc.).

      I understand about 95% of the code, but I am still struggling with some of the regex items which are:

      • Why (?:\z|\n) and not just \z when \z is 'up to and including \n'
      • Why \h*\n* and not \s*

      Thanks for your elaborate post

        Why (?:\z|\n) and not just \z when \z is 'up to and including \n'

        Not quite: \z only ever matches at the very end of the string, whereas \Z also matches before the newline at the end of the string, and the /m modifier changes the meaning of $ so that it matches before every newline or at the end of the string. When I want to express "match up to the end of this line", I sometimes prefer (?:\z|\n) over $ with /m because the former explicitly consumes the \n.

        Why \h*\n* and not \s*

        Because /\s*/ would also match e.g. \t\n\t, which causes a following /^.../ to no longer match, since /\s*/ consumed the \t at the beginning of the line.
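        A small check of that point, as a sketch: it assumes the following pattern is anchored with ^ under /m and continued with \G, and the sample string is made up for illustration.

```perl
use strict;
use warnings;

my $text = "A\t\n\tB\n";    # line 1 ends with a tab, line 2 starts with one

# \s* runs past the newline AND the tab that begins line 2,
# so ^ (under /m) can no longer match at the current position.
$text =~ /\GA\s*/gc or die "first match failed";
my $after_s = pos $text;
my $s_ok    = $text =~ /\G^\h*B/mgc ? 1 : 0;

# \h*\n* stops right after the newline, i.e. exactly at the start of line 2.
pos($text) = 0;
$text =~ /\GA\h*\n*/gc or die "second match failed";
my $after_h = pos $text;
my $h_ok    = $text =~ /\G^\h*B/mgc ? 1 : 0;

print "after \\s*:     pos=$after_s, next line matched: $s_ok\n";
print "after \\h*\\n*: pos=$after_h, next line matched: $h_ok\n";
```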

        Update: Regarding the first point:

        $ perl -MData::Dump -e 'dd split /($)/m, "x\ny\nz"'
        ("x", "", "\ny", "", "\nz")
        $ perl -MData::Dump -e 'dd split /(\z|\n)/m, "x\ny\nz"'
        ("x", "\n", "y", "\n", "z")
Re: Contextual find and replace large config file
by tybalt89 (Prior) on Jan 02, 2019 at 19:26 UTC

    "The data files often don't have any official format." -> Then it's hopeless and you should give up. :)

    Or

    The following program works for your test case #2 (and some things you might have missed). You should only have to change the "configuration section" to alter different things, after, of course, fixing it to actually read and write files.

    If it doesn't work on one of your large files, please show a small failed test case, and we'll see what we can do :)

    #!/usr/bin/perl

    # https://perlmonks.org/?node_id=1227916

    use strict;
    use warnings;

    ##################### configuration section

    my $section = 'ObjectType1';
    my %changes = ( Param1 => 0, Param2 => 'SomeOtherText', Param3 => 'Foobar');

    ##################### end configuration section

    my $allkeys = join '|', keys %changes;
    my $pattern = qr/\b($allkeys)\b/;

    local $/ = "\n}\n";
    while( <DATA> )
      {
      if( /\b$section\b/ )
        {
        my @context;
        print $& while
          @context && $context[-1] eq $section &&
            /\G(\h*$pattern = ).*\n/gc ? "$1$changes{$2}\n" =~ /.*/s :
          @context && /\G\h*\}\n/gc ? pop @context :
          /\G\h*([\w ]+)\n\h*\{\n/gc ? push @context, $1 :
          /\G.*\n/gc;
        }
      else
        {
        print;
        }
      }

    __DATA__
    ObjectType1
    {
      Param1 = 8
      NestedObject
      {
        Param1 = 3
        Param2 = SomeText
      }
      Param2 = SomeText
    }
    ObjectType2
    {
      Foo
      {
        Param1 = StaySame
        ObjectType1
        {
          Param3 = ReplaceThis
        }
      }
    }
    ObjectType1
    {
      ...
    }

    Outputs:

    ObjectType1
    {
      Param1 = 0
      NestedObject
      {
        Param1 = 3
        Param2 = SomeText
      }
      Param2 = SomeOtherText
    }
    ObjectType2
    {
      Foo
      {
        Param1 = StaySame
        ObjectType1
        {
          Param3 = Foobar
        }
      }
    }
    ObjectType1
    {
      ...
    }

    I'm also curious about benchmark times vs any other solution (since I'm not going to generate a 500000 line test file).

Re: Contextual find and replace large config file
by kschwab (Vicar) on Jan 02, 2019 at 18:34 UTC

    "Does anyone know of a better or more generic way to do these kind of things?"

    There are lots of choices for config files. JSON and YAML are popular. Your second example is pretty close to JSON already. It would look like this as JSON:

    {
      "ObjectType1": {
        "Param1": 8,
        "Param2": "SomeText",
        "NestedObject": {
          "Param1": 3,
          "Param2": "SomeText"
        }
      },
      "ObjectType2": {
        "Param1": 10
      }
    }

    There are Perl modules to parse JSON, some of them streaming, if you really can't load it all into memory. There's also a really nice command-line utility called "jq".
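    For illustration, a minimal sketch of changing parameters in such a JSON config with the core module JSON::PP. It reads the whole document into memory (it is not one of the streaming parsers mentioned above), and the `canonical` option sorts the keys on output, so the original key order is not preserved:

```perl
use strict;
use warnings;
use JSON::PP;

my $json_text = <<'END_JSON';
{
  "ObjectType1": {
    "Param1": 8,
    "Param2": "SomeText",
    "NestedObject": { "Param1": 3, "Param2": "SomeText" }
  },
  "ObjectType2": { "Param1": 10 }
}
END_JSON

my $json   = JSON::PP->new->canonical->pretty;
my $config = $json->decode($json_text);

# Change only the parameters directly under ObjectType1;
# NestedObject and ObjectType2 are left as they are.
$config->{ObjectType1}{Param1} = 0;
$config->{ObjectType1}{Param2} = 'SomeOtherText';

print $json->encode($config);
```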

    Note that JSON doesn't support comments, which is probably the biggest complaint about it as a configuration file format.

      > Note that JSON doesn't support comments, which is probably the biggest complaint about it as a configuration file format.

      I never noticed this - most probably because I never came into a situation to need it.

      What surprises me is that JSON historically started as an eval'ed JS object, so why did they skip the comment feature?

      Especially since CSS inherited JS comments too.

      So I did some research to find out that Douglas Crockford disabled it deliberately, because he wanted to prevent people from hiding data there. ...

      ... well, Douglas again. :/

      Anyway, for config purposes I'd try splitting the data up into multiple JSON chunks and commenting them, or resort to YAML, which allows JSON as a subset.

      --- # Comment
      { "name": "John Smith", "age": 33 }

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery
      FootballPerl is like chess, only without the dice

      "... JSON doesn't support comments..."

      It does if you treat them as data:

      #!/usr/bin/env perl

      use strict;
      use warnings;
      use JSON::Tiny qw(decode_json encode_json);
      use Data::Dump;

      my $conf = encode_json {
          foo     => qw(bar),
          nose    => qw(cuke),
          comment => qw(RTFM)
      };

      my $hash = decode_json($conf);

      dd $hash;

      __END__

      { comment => "RTFM", foo => "bar", nose => "cuke" }

      Best regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

      perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

        Comments in most languages can appear anywhere where insignificant whitespace is possible. Your approach can't transform structures that comment both on the keys and values, as in
        { "name" /* represented as "shortname" in the DB */ : "John Doe" /* full name */,
        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Contextual find and replace large config file
by LanX (Sage) on Jan 02, 2019 at 21:52 UTC
    These are several different questions

    First let me warn you that your code has an error

    This will fail if you don't care about indentation:

    if ( $line =~ /ObjectType1/ ) {
        $context = "ObjectType1" ;
    }
    if ( $line =~ /\}/ ) {
        $context = "" ;
    }

    Here you would rather test for /$\}/ at the line's start!

    My suggestions

    • separate parsing of syntax logic from processing of semantic logic
    • parse all lines of an object into a cache ( a string or nested hashes) before handling it
    • with nested objects use recursion
    • keep track of the indentation level, like counting open and closed braces
    • you should handle parsing errors in case the input is corrupted
    • use functions and packages instead of piling up if cases
    • use a function dispatcher if you need to handle semantics of different "ObjectTypes"

    Like this you will get reusable and maintainable code!
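    To make the last two suggestions concrete, here is a minimal sketch of a function dispatcher. The table name `%handler_for`, the function `process_object`, and the particular changes are hypothetical; each handler receives the already-parsed parameters of one object as a hash reference:

```perl
use strict;
use warnings;

# One handler per object type; each changes the parameters in place.
my %handler_for = (
    ObjectType1 => sub {
        my ($obj) = @_;
        $obj->{Param1} = 0;
        $obj->{Param2} = 'SomeOtherText';
    },
    ObjectType2 => sub {
        my ($obj) = @_;
        $obj->{Param1} *= 2 if defined $obj->{Param1};
    },
);

# Look up the handler for the type; unknown types pass through untouched.
sub process_object {
    my ($type, $obj) = @_;
    my $handler = $handler_for{$type} or return $obj;
    $handler->($obj);
    return $obj;
}

my $changed = process_object( 'ObjectType1', { Param1 => 8, Param2 => 'SomeText' } );
print "$_ = $changed->{$_}\n" for sort keys %$changed;
```

    Adding support for a new object type then means adding one entry to the table rather than another branch in a chain of if statements.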

    edit

    some may miss example code, but you got a generic answer for a generic question.

    Feel free to pick some points and ask for clarification.

    Cheers Rolf

      Hi LanX,

      Yes, I was actually aware of that error; since you mentioned it, I have edited the OP.

      It is not strictly necessary to provide example code (plus, others have already done so); I am just trying to redesign some code and find a different approach. So your generic answer is welcome, of course.

      The only thing I am unsure about is your first suggestion (separate parsing...semantic logic). What do you mean by that? Do you mean parsing and gathering data first and then splitting the processing of that data into different function blocks, or something else?

      Thanks, Veltro

        > (separate parsing...semantic logic). What do you mean with that? 

        Your two examples seem to hold the same information (semantics) while having different formats (syntax).

        So it is better to write parsers for the different formats which "cache" them in an intermediate format. These parsers should be ignorant of the meaning and concentrate only on correctness.

        The semantics - the meaning of the data - could be handled by one central module which only operates on the intermediate format. This module could be reused for all formats.

        A possible intermediate format could be nested hashes

        $cache = {
            ObjectType1 => {
                Param1       => 8,
                Param2       => "SomeText",
                NestedObject => {
                    Param1 => 3,
                    Param2 => "SomeText"
                }
            }
        };

        Of course this highly depends on the nature of your data, like

        • does order matter?
        • are repeated elements allowed?
        Using nested arrays may be better then°

        And after transforming your data you can also have emitter modules to write it to a new output file.

        This way you can even transform between different formats, or add new ones.
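        A minimal sketch of such an emitter, assuming the nested-hash intermediate format and the brace syntax of the second example in the question. The function name `emit_braces` is hypothetical, and the keys are sorted because a plain hash does not remember the original order (one of the reasons nested arrays may be the better choice):

```perl
use strict;
use warnings;

# Turn a nested hash into brace-format text, two spaces per nesting level.
sub emit_braces {
    my ($node, $indent) = @_;
    $indent = '' unless defined $indent;
    my $out = '';
    for my $key ( sort keys %$node ) {
        my $value = $node->{$key};
        if ( ref($value) eq 'HASH' ) {
            # A nested object: name, opening brace, contents, closing brace.
            $out .= $indent . $key . "\n" . $indent . "{\n"
                  . emit_braces( $value, $indent . '  ' )
                  . $indent . "}\n";
        }
        else {
            $out .= "$indent$key = $value\n";
        }
    }
    return $out;
}

my $cache = {
    ObjectType1 => {
        Param1       => 8,
        Param2       => 'SomeText',
        NestedObject => { Param1 => 3, Param2 => 'SomeText' },
    },
};

# The "semantic" step works on the intermediate format only ...
$cache->{ObjectType1}{Param1} = 0;

# ... and the emitter writes it back out.
print emit_braces($cache);
```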

        HTH! :)

        edit

        NB: this approach is also useful when handling only one input format, because you can cleanly separate the code and hence maintain it much better.

        update

        °) or a mix of hashes and arrays. Or even using Perl objects blessing elements into different "ObjectTypes", ...

        Cheers Rolf

Re: Contextual find and replace large config file
by tybalt89 (Prior) on Jan 03, 2019 at 15:32 UTC

    Here's a version that uses a HoH to allow multiple changes to multiple contexts in one pass.

    #!/usr/bin/perl

    # https://perlmonks.org/?node_id=1227916

    use strict;
    use warnings;
    $SIG{__WARN__} = sub {die @_};

    ##################### configuration section

    my %changes = (
      ObjectType1 => { Param1 => 0, Param2 => 'SomeOtherText' },
      ObjectType4 => { Param3 => 'Replacement' },
      Foo => { Param2 => 'FooChanged' },
      );

    ##################### end configuration section

    my $allcontexts = join '|', sort keys %changes;
    my $contextpattern = qr/\b($allcontexts)\b/;
    my %patterns;
    for my $section (keys %changes)
      {
      my $all = join '|', keys %{ $changes{$section} };
      $patterns{$section} =qr/\b($all)\b/;
      }

    local $/ = "\n}\n";
    while( <DATA> )
      {
      if( /$contextpattern/ )
        {
        my @context;
        print $& while
          @context && $patterns{$context[-1]} &&
            /\G(\h*$patterns{$context[-1]} = ).*\n/gc ?
            "$1$changes{$context[-1]}{$2}\n" =~ /.*/s :
          @context && /\G\h*\}\n/gc ? pop @context :
          /\G\h*([\w ]+)\n\h*\{\n/gc ? push @context, $1 :
          /\G.*\n/gc;
        }
      else
        {
        print;
        }
      }

    __DATA__
    ObjectType1
    {
      Param1 = 8
      NestedObject
      {
        Param1 = 3
        Param2 = SomeText
      }
      Param2 = SomeText
    }
    ObjectType2
    {
      Foo
      {
        Param1 = StaySame
        Param2 = FooChange
        ObjectType4
        {
          Param1 = DoNotReplaceThis
          Param3 = ReplaceThis
        }
      }
    }
    ObjectType1
    {
      Param1 = ReplaceThis
      Param3 = DoNotReplaceThis
      Foo
      {
        Param1 = StaySame
        ObjectType4
        {
          Param1 = DoNotReplaceThis
          Param3 = ReplaceThis
        }
      }
    }
Re: Contextual find and replace large config file
by trippledubs (Deacon) on Jan 08, 2019 at 19:41 UTC
    Not sure if this is too much or too little for you to plug in, if you need such a thing, but it was fun to learn some Parse::RecDescent. I could not figure out how to get the array list as the hash I wanted except by using unroll. Each parsing module requires its own learning investment, as I found just browsing Regexp::Grammars from haukex's answer.
    #!/usr/bin/env perl

    use strict;
    use warnings;
    use Parse::RecDescent;
    use Data::Dumper;

    $::RD_ERRORS = 1;
    $::RD_WARN = 1;
    $::RD_HINT = 1;
    #$::RD_TRACE = 1;
    #$::RD_AUTOACTION = q { print Dumper \@item };

    my $grammar = q{
        {
            use Data::Dumper;
            sub unroll {
                my @list = @{$_[0]};
                my $unrolled;
                for my $href (@list) {
                    for my $key (keys %{$href}) {
                        $unrolled->{$key} = $href->{$key};
                    }
                }
                return $unrolled;
            };
        }
        Expression: Object(s) { $return = unroll($item[1]) }
        Object: String '{' Param(s) '}' { $return = { $item[1] => unroll($item[3]) } }
        Param: String '=' String { $return = { $item[1] => $item[3] } }
            | Object(s) { $return = unroll($item[1]) }
        String: /[\w\d]+/ { $return = $item[1] }
    };

    my $parser = Parse::RecDescent->new($grammar);
    my $text = do { undef $/; <DATA> };
    my $tree = $parser->Expression($text) or die $!;
    $tree->{ObjectType1}{NestedObject}{DeeplyNested}{Param60} = 'tuna';
    print Dumper $tree;

    __DATA__
    ObjectType1
    {
        Param1 = 8
        Param2 = SomeText
        NestedObject
        {
            Param1 = 3
            Param2 = MoreText
            DeeplyNested
            {
                Param50 = 500
                Param60 = squid
            }
        }
    }
    ObjectType2
    {
        Param1 = 3
        Param2 = 40
    }
Re: Contextual find and replace large config file
by Veltro (Hermit) on Jan 05, 2019 at 11:35 UTC

    Thanks again for your input everyone.

    With your help I am now able to change a foreign datafile like:

    # comment
    GlobalParam = 1
    Object Type1
    {
      Param1 = Foo
      NestedObject
      {
        Param 1 = Bar
      }
      # just another comment
    }
    # comment
    ObjectType2
    {
      Param1 = Quz = z
      Param2 = 3
      NestedObjectX
      {
        Param1 = Baz
        NestedObjectZ
        {
          Param1 = Baz
        }
      }
      NestedObjectY
      {
        Param1 = 5
      }
    }

    by applying a filter like:

    [
      [
        # Filter
        {
          'Object Type1' => {
            'Param1' => [ "Foo" ],
          },
          'GlobalParam' => [ '1' ],
          # 'Junk' => [ 'more junk' ], # Will break the filter
        },
        # Changes
        {
          'Object Type1' => {
            'NestedObject' => {
              'Param 1' => "\"Box\"",
            },
          },
        }
      ],
      [
        # Filter
        {
          'Object Type1' => {
            'Param1' => [ "Foo" ],
            'NestedObject' => {
              'Param 1' => [ "Box" ],
            },
          },
          # 'GlobalParam' => [ '2' ], # Will disable this filter,
          #                           # but first filter is still
          #                           # applied
        },
        # Changes
        {
          'Object Type1' => {
            'NestedObject' => {
              'Param 1' => "\$curVal . \" Baz\"",
            },
          },
        }
      ],
      [
        # Filter
        {
          'ObjectType2' => {
            'Param2' => [ '1', '2', '3' ],
          },
        },
        # Changes
        {
          'ObjectType2' => {
            'NestedObjectY' => {
              'Param1' => "\$curVal * 2",
            },
          },
        }
      ],
    ] ;

    Which changes the configured parameters into:

    # comment
    GlobalParam = 1
    Object Type1
    {
      Param1 = Foo
      NestedObject
      {
        Param 1 = Box Baz
      }
      # just another comment
    }
    # comment
    ObjectType2
    {
      Param1 = Quz = z
      Param2 = 3
      NestedObjectX
      {
        Param1 = Baz
        NestedObjectZ
        {
          Param1 = Baz
        }
      }
      NestedObjectY
      {
        Param1 = 10
      }
    }
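    The formula strings in the "Changes" parts of the filter ("\$curVal * 2", "\$curVal . \" Baz\"") look like Perl expressions over a variable $curVal. How the actual program evaluates them is not shown here; one plausible sketch, assuming the filter file is trusted, is a string eval with a lexical $curVal in scope. The function name `apply_formula` is hypothetical:

```perl
use strict;
use warnings;

# Evaluate a formula string with the current value available as $curVal.
# String eval runs arbitrary code, so this is only for trusted filters.
sub apply_formula {
    my ($formula, $curVal) = @_;     # $curVal is visible to the eval'ed string
    my $new = eval $formula;
    die "Bad formula '$formula': $@" if $@;
    return $new;
}

print apply_formula( "\$curVal * 2",        5     ), "\n";
print apply_formula( "\$curVal . \" Baz\"", 'Box' ), "\n";
print apply_formula( "\"Box\"",             'Bar' ), "\n";
```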

    edit 2019 Jan 07: Without further testing of this particular program, I have removed a '^' from my $re_comment = qr/ ^ \h* \# [^\n]* \n / ; and from qr/ (?<pre> ^\h* ) because it was killing the performance of this program.

    code if you want:

Node Type: perlquestion [id://1227916]
Approved by Corion
Front-paged by kschwab