http://www.perlmonks.org?node_id=1227935


in reply to Contextual find and replace large config file

It works fine (as long as the format does not change too much), however the more complex things that I want to do these kind of snippets tend to become very complex and difficult to maintain. ... I am looking for a very simple approach (search and replace, not reading the entire data file to memory)

It depends a lot on how much you can trust how strict the configuration file format is. For example, if you can be absolutely certain that, like in your example, the opening and closing braces are always on a line by themselves, then it'd be possible to implement a fairly simple line-by-line parser that keeps the names of the current sections on a stack, so that you can differentiate between different nested sections that happen to have the same name - I'm thinking something like the following:

use warnings; use strict; # $sep needs to be some char guaranteed not to be in the section name # (not the most elegant solution, but this isn't meant to be) my $sep = "\n"; my $target_sec = "ObjectType1${sep}NestedObject"; my $target_param = "Param1"; my $replace_val = "Hello"; my ($secname,@sec_stack,$cur_sec); while (<DATA>) { next if /^\s*#/ || !/\S/; if ( /^ (\s*) (.+?) \s* = \s* (.+?) \s* $/x ) { die "section name without a following block" if defined $secname; my ($indent,$param,$val) = ($1,$2,$3); if ($cur_sec eq $target_sec && $param eq $target_param) { $_ = $indent.$param.' = '.$replace_val."\n" } } elsif ( /^ \s* (\w+) \s* $/x ) { die "two section names following one another" if defined $secname; $secname = $1; } elsif ( /^ \s* \{ \s* $/x ) { die "'{' without name" unless defined $secname; push @sec_stack, $secname; $cur_sec = join $sep, @sec_stack; $secname = undef; } elsif ( /^ \s* \} \s* $/x ) { die "section name without a following block" if defined $secname; die "'}' without '{'" unless @sec_stack; pop @sec_stack; $cur_sec = join $sep, @sec_stack; } else { die "Failed to parse '$_'" } print $_; } __DATA__ # comment ObjectType1 { Param1 = Foo NestedObject { Param1 = Bar } # just another comment } ObjectType2 { Param1 = Quz NestedObject { Param1 = Baz } }

But once things start getting more complex, I'd recommend a "real" parser instead. You can check the Config:: namespace to see if there happen to be any modules that match your config format. 500k lines isn't all too much to read into memory at once, IMO, unless you're running on some really memory-restricted machine. In the worst case, you can write a parser yourself, e.g. using the m/\G.../gc technique (there's one example in the Perl docs in perlop under "\G assertion"), or using a full grammar (Parse::RecDescent, Regexp::Grammars, Marpa::R2, ...).

Here's a solution using m/\G.../gc, followed by a Regexp::Grammars example (the latter only parses, it doesn't do the replacement). In both, I've made some assumptions about the file format, such as that a Name = Value pair must appear on a single line by itself, that the section names may or may not contain whitespace, and so on (I've chosen slightly different rules in both). What I like about these kind of solutions is that they're "just" regular expressions, and as long as one can deal with those, it should hopefully be understandable.

use warnings; use strict; use Data::Compare qw/Compare/; my @target_block = ('Object Type1','NestedObject'); my $target_param = 'Param 1'; my $new_val = 'Hello!'; my $data = do { local $/; <DATA> }; my @stack; pos($data)=0; while ( pos($data)<length($data) ) { use re '/msx'; my $repl; if ( $data=~m{\G ^ \h* \# [^\n]* (?:\z|\n) }gc ) {} # comment, nothing to do elsif ( $data=~m{\G \s* ( \w(?:[\w\h]*\w)? ) \s* \{ \h*\n* }gc ) { push @stack, $1 } elsif ( $data=~m{\G (?<pre> ^\h* ) (?<name> [^\n=]+?) (?<mid> \h*=\h* ) (?<value> [^\n]+? ) (?<post> \h*(?:\z|\n) ) }gc ) { if ( Compare(\@stack,\@target_block) && $+{name} eq $target_param ) { $repl = $+{pre}.$+{name}.$+{mid}.$new_val.$+{post}; } } elsif ( $data=~m{\G \s* \} \h*\n* }gc ) { die "'}' with no opening '{'?" unless @stack; pop @stack; } else { die "Failed to parse at: \"" .substr($data, pos $data, 50)."...\"" } print $repl//substr($data, $-[0], $+[0]-$-[0]); } __DATA__ # comment Object Type1 { Param1 = Foo NestedObject { Param 1 = Bar } # just another comment } # comment ObjectType2 { Param1 = Quz = z NestedObject { Param1 = Baz } }
use warnings; use strict; use Regexp::Grammars; my @blockstack; my $grammar = do { use Regexp::Grammars; qr{ \A (?: <.comment> | <[confblock]> )* \z <rule: confblock> ^ <blockname=([^\s\{\}=]+)> \{ (?: <[param]> | <[confblock]> | <.comment> )* \} <.ws> <rule: param> ^ <name=([^\n=]+?)> = <value=([^\n]+?)> (?:\n|\z) <token: comment> ^ \h* \# [^\n]* (?:\n|\z) }xms }; my $data = do { local $/; <DATA> }; $data =~ $grammar or die "failed to parse"; my %conf = %/; #/ use Data::Dump; dd \%conf; __DATA__ # comment ObjectType1 { Param1 = Foo NestedObject { Param1 = Bar } # just another comment } # comment ObjectType2 { Param1 = Quz NestedObject { Param1 = Baz } }

Replies are listed 'Best First'.
Re^2: Contextual find and replace large config file
by Veltro (Hermit) on Jan 03, 2019 at 10:57 UTC

    This is great stuff haukex

    I think that using Regexp::Grammars is probably the best solution, however I am getting this YACC feeling over me and think this kind of thing is programming on an entire different level. So currently I am looking at your second approach which I think will offer me the flexibility that I am looking for.

    Actually I think this will help me to take this even one step further and build a more advanced configuration which will allow me to specify a filter and formulas to act on parameters. And for this I am thinking in the same lines as LanX (using a cache, separate functionality in functions etc. etc.).

    I understand about 95% of the code, but I am still struggling with some of the regex items which are:

    • Why (?:\z|\n) and not just \z when \z is 'up to and including \n'
    • Why \h*\n* and not \s*

    Thanks for your elaborate post

      Why (?:\z|\n) and not just \z when \z is 'up to and including \n'

      Not quite, \z only ever matches at the very end of the string, whereas \Z also matches before the newline at the end of the string, and the meaning of $ is changed by the /m modifier to match before every newline or at the end of the string. When I want to express "match up to the end of this line", I sometimes prefer (?:\z|\n) over $+/m because the former explicitly consumes the \n.

      Why \h*\n* and not \s*

      Because /\s*/ would also match e.g. \t\n\t, which causes a following /^.../ to no longer match, since /\s*/ consumed the \t at the beginning of the line.

      Update: Regarding the first point:

      $ perl -MData::Dump -e 'dd split /($)/m, "x\ny\nz"' ("x", "", "\ny", "", "\nz") $ perl -MData::Dump -e 'dd split /(\z|\n)/m, "x\ny\nz"' ("x", "\n", "y", "\n", "z")