PerlMonks  

Replace string in one file using input from another file

by lewars (Initiate)
on Feb 07, 2018 at 16:24 UTC (#1208637=perlquestion)
lewars has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,

I have two files: a lookup file and a master data file.

I need to use the lookup file to replace every matching reference in the master data file with its corresponding value.

Lookup.csv (Tab delimited but I can change this easily)

Ref00004 https://dealerportal4.xx.com/siteminderagent/forms/xx.fcc;ACS=0
Ref00005 https://sso.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0
Ref00006 https://secure3.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0
Ref00007 https:///siteminderagent/cert/smgetcred.scc?cert
Ref00008 https://secure4.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0
Ref00009 https://vbos-uat.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0

Master data file (file is 38 MB):

<Property Name="CA.SM::AuthScheme.IsUsedbyAdmin">
  <BooleanValue>false</BooleanValue>
</Property>
<Property Name="CA.SM::AuthScheme.Desc">
  <StringValue>TCP portal auth scheme</StringValue>
</Property>
<Property Name="CA.SM::AuthScheme.Level">
  <NumberValue>5</NumberValue>
</Property>
<Property Name="CA.SM::AuthScheme.IsTemplate">
  <BooleanValue>false</BooleanValue>
</Property>
<Property Name="CA.SM::AuthScheme.Param">
  <LinkValue><XREF>Ref00005</XREF></LinkValue>
</Property>
<Property Name="CA.SM::AuthScheme.Library">

Final file (replaced):

<Property Name="CA.SM::AuthScheme.IsUsedbyAdmin">
  <BooleanValue>false</BooleanValue>
</Property>
<Property Name="CA.SM::AuthScheme.Desc">
  <StringValue>TCP portal auth scheme</StringValue>
</Property>
<Property Name="CA.SM::AuthScheme.Level">
  <NumberValue>5</NumberValue>
</Property>
<Property Name="CA.SM::AuthScheme.IsTemplate">
  <BooleanValue>false</BooleanValue>
</Property>
<Property Name="CA.SM::AuthScheme.Param">
  <LinkValue><XREF>https://sso.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0</XREF></LinkValue>
</Property>
<Property Name="CA.SM::AuthScheme.Library">

Thanks in advance.

Replies are listed 'Best First'.
Re: Replace string in one file using input from another file
by AnomalousMonk (Chancellor) on Feb 07, 2018 at 17:16 UTC

    Try something like this:

    Notes:
    • This approach slurps the entire master file into memory, so it should work fine with a 38 MB or even 380 MB file, but will not scale to larger file sizes indefinitely.
    • The regex for matching references assumes the reference string is always bounded by a non-\w character. If this is not the case, adjust as needed.
    • The substitution replaces Ref00004-like strings anywhere and everywhere in the file. If you need this replacement done, e.g., only between certain tags, adjust the match regex as needed or perhaps use an XML parser.
    • The example code only print-s to standard out; adjust as needed.
    • Update: No validation is done on the content of the lookup.dat file. It might be wise to consider this.
    • Update: I think the regex for extracting URLs from the lookup data file will support embedded whitespace in the URL, but I haven't tested this. Caveat Programmor.
    • Update: The regex for extracting reference placeholders and URLs from records in the lookup file is very naive. For instance,  \S+ matches a reference placeholder. Personally, I would feel better with a more specific match, maybe something like
          qr{ (?<! [[:alpha:]]) Ref \d{5} (?! \d) }xms
      Likewise, I'm sure there are canned regexes for matching URLs available.

    Update: For a good discussion of the technique used above to build the  $rx_ref regex matching object, see Building Regex Alternations Dynamically by haukex.
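
The code that accompanied this reply is not preserved above. A minimal sketch of the approach the notes describe (build a hash from the lookup records, build one alternation regex from its keys, slurp the master text, substitute everywhere) might look like the following. The sub names `parse_lookup` and `replace_refs` and the in-memory sample data are assumptions made for illustration, not AnomalousMonk's original code; only `$rx_ref` is a name taken from the reply.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lookup file and master file.
my $lookup_text =
      "Ref00004\thttps://dealerportal4.xx.com/siteminderagent/forms/xx.fcc;ACS=0\n"
    . "Ref00005\thttps://sso.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0\n";
my $master_text = "<LinkValue><XREF>Ref00005</XREF></LinkValue>\n";

# Build a placeholder => URL hash from the lookup records.
sub parse_lookup {
    my ($text) = @_;
    my %url_for;
    for my $line (split /\n/, $text) {
        # Naive extraction: the first run of non-whitespace is the placeholder,
        # the rest of the line (trailing whitespace trimmed) is the URL.
        my ($ref, $url) = $line =~ m{ \A \s* (\S+) \s+ (\S .*?) \s* \z }xms
            or next;
        $url_for{$ref} = $url;
    }
    return \%url_for;
}

# Replace every known placeholder anywhere in $text.
sub replace_refs {
    my ($text, $url_for) = @_;
    return $text unless %$url_for;    # an empty alternation would match everywhere

    # Longest keys first, so a longer placeholder wins over one that prefixes it.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b }
        keys %$url_for;
    my $rx_ref = qr{ \b (?: $alternation ) \b }xms;

    (my $out = $text) =~ s{ ($rx_ref) }{$url_for->{$1}}xmsg;
    return $out;
}

print replace_refs($master_text, parse_lookup($lookup_text));
```

With real files, `$lookup_text` and `$master_text` would be read from disk instead; slurping is reasonable at 38 MB, as noted above.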


    Give a man a fish:  <%-{-{-{-<

      Thanks so much! This works flawlessly!

Re: Replace string in one file using input from another file
by Laurent_R (Canon) on Feb 07, 2018 at 17:12 UTC
    Generally, the idea would be to load the Lookup.csv file into a hash at start, and then to process the other file and make changes where needed.

    Are the changes always to be made in

    <LinkValue><XREF>...</XREF></LinkValue>
    tags, or do they have to be done also in other tags?
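
If the answer is that only those tags need changing, the hash idea can be sketched roughly as below. This is an illustration under assumptions, not tested against the real 38 MB file: it assumes each `<LinkValue><XREF>...</XREF></LinkValue>` group sits on a single line, that the lookup is tab-delimited, and that replacement values need no XML escaping. The sub names `load_lookup` and `fix_line` and the in-memory sample data are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for Lookup.csv and the master data file.
my $lookup_data =
      "Ref00004\thttps://dealerportal4.xx.com/siteminderagent/forms/xx.fcc;ACS=0\n"
    . "Ref00005\thttps://sso.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0\n";
my $master_data = <<'END_XML';
<Property Name="CA.SM::AuthScheme.Param">
<LinkValue><XREF>Ref00005</XREF></LinkValue>
</Property>
END_XML

# Load the lookup into a hash: placeholder => replacement.
sub load_lookup {
    my ($fh) = @_;
    my %lookup;
    while (my $line = <$fh>) {
        chomp $line;
        my ($ref, $url) = split /\t/, $line, 2;
        next unless defined $url && length $ref;    # skip blank or malformed records
        $lookup{$ref} = $url;
    }
    return \%lookup;
}

# Rewrite one line, touching only <LinkValue><XREF>...</XREF></LinkValue>.
# Unknown placeholders are left unchanged.
sub fix_line {
    my ($line, $lookup) = @_;
    $line =~ s{(<LinkValue><XREF>)([^<]+)(</XREF></LinkValue>)}
              {$1 . ($lookup->{$2} // $2) . $3}ge;
    return $line;
}

open my $lookup_fh, '<', \$lookup_data or die "Cannot open lookup data: $!";
my $lookup = load_lookup($lookup_fh);
close $lookup_fh;

# Process the master data one line at a time, so memory use stays small.
open my $master_fh, '<', \$master_data or die "Cannot open master data: $!";
while (my $line = <$master_fh>) {
    print fix_line($line, $lookup);
}
close $master_fh;
```

The `//` (defined-or) operator needs Perl 5.10 or later. With real files, the two `open` calls would take file names instead of references to strings.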
Re: Replace string in one file using input from another file
by tybalt89 (Priest) on Feb 07, 2018 at 18:17 UTC
    #!/usr/bin/perl

    # http://perlmonks.org/?node_id=1208637

    use strict;
    use warnings;
    use Inline::Files;

    my %replace = map /^(\S+)\s+(\S.*)$/, <LOOKUP>;

    while( <MASTER> )
      {
      s#\bRef\d+# $replace{$&} // $& #ge;
      print;
      }

    __LOOKUP__
    Ref00004 https://dealerportal4.xx.com/siteminderagent/forms/xx.fcc;ACS=0
    Ref00005 https://sso.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0
    Ref00006 https://secure3.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0
    Ref00007 https:///siteminderagent/cert/smgetcred.scc?cert
    Ref00008 https://secure4.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0
    Ref00009 https://vbos-uat.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0
    __MASTER__
    <Property Name="CA.SM::AuthScheme.IsUsedbyAdmin">
      <BooleanValue>false</BooleanValue>
    </Property>
    <Property Name="CA.SM::AuthScheme.Desc">
      <StringValue>TCP portal auth scheme</StringValue>
    </Property>
    <Property Name="CA.SM::AuthScheme.Level">
      <NumberValue>5</NumberValue>
    </Property>
    <Property Name="CA.SM::AuthScheme.IsTemplate">
      <BooleanValue>false</BooleanValue>
    </Property>
    <Property Name="CA.SM::AuthScheme.Param">
      <LinkValue><XREF>Ref00005</XREF></LinkValue>
    </Property>
    <Property Name="CA.SM::AuthScheme.Library">
Re: Replace string in one file using input from another file
by Jenda (Abbot) on Feb 08, 2018 at 23:25 UTC

    I have to repeat what the previous node already said: DO NOT TREAT XML FILES AS TEXT. Use a library that actually understands the format.

    In this case one of the options is XML::Rules in the filtering mode. You would read the CSV into a hash and then process the XML file with something like

    use XML::Rules;

    # %references is the hash read from the CSV: placeholder => URL.
    my $filter = XML::Rules->new(
        style => 'filter',
        rules => {
            'XREF' => sub {
                # Perl's defined-or operator is //, not ??
                return $references{$_[1]->{_content}}
                    // "Unknown reference $_[1]->{_content}";
            },
        },
    );
    $filter->filterfile($source_path, $result_path);

    If you do not want to process all <XREF> tags, but only those within <LinkValue> you can change the code to something like this:
    my $filter = XML::Rules->new(
        style => 'filter',
        rules => {
            'XREF' => {
                qr{/LinkValue$} => sub {
                    # Perl's defined-or operator is //, not ??
                    return $references{$_[1]->{_content}}
                        // "Unknown reference $_[1]->{_content}";
                },
                # or, for only <Property><LinkValue><XREF>:
                # qr{/Property/LinkValue$} => sub {
                #     return $references{$_[1]->{_content}}
                #         // "Unknown reference $_[1]->{_content}";
                # },
            },
        },
    );

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: Replace string in one file using input from another file
by Anonymous Monk on Feb 07, 2018 at 18:24 UTC
    The master file is XML and should therefore be manipulated with XML tools, not edited directly as though it were a text file. At 38 MB, you can afford to read the entire file into memory using a tool like XML::LibXML and then manipulate the structure internally. Then write out the modified XML, preferably to a new file, so that the original input is not corrupted if you make a mistake.
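
A rough sketch of that suggestion with XML::LibXML (a CPAN module, not part of core Perl) is shown below. Several things here are assumptions: the posted fragment has no single root element, so a made-up `<Root>` wrapper is used to make the sample well-formed; the real file is assumed to use no XML namespaces; and the sub name `replace_xrefs`, the lookup hash contents, and the output file name in the final comment are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;    # CPAN module; must be installed separately

# Hypothetical lookup data; in practice this would be read from the lookup file.
my %url_for = (
    Ref00005 => 'https://sso.xx.com/siteminderagent/forms/xx.fcc;ACS=0;REL=0',
);

# Sample input. <Root> is an assumed wrapper, not taken from the real file.
my $xml = <<'END_XML';
<Root>
  <Property Name="CA.SM::AuthScheme.Param">
    <LinkValue><XREF>Ref00005</XREF></LinkValue>
  </Property>
</Root>
END_XML

# Replace the text of every <LinkValue><XREF> whose content is a known
# placeholder. Returns the number of elements changed.
sub replace_xrefs {
    my ($doc, $url_for) = @_;
    my $changed = 0;
    for my $xref ($doc->findnodes('//LinkValue/XREF')) {
        my $ref = $xref->textContent;
        next unless exists $url_for->{$ref};
        $xref->removeChildNodes;               # drop the old text node
        $xref->appendText($url_for->{$ref});   # the library handles escaping
        $changed++;
    }
    return $changed;
}

my $doc = XML::LibXML->load_xml(string => $xml);
replace_xrefs($doc, \%url_for);
print $doc->toString;

# To write a new file rather than print (hypothetical file name):
# $doc->toFile('master.replaced.xml');
```

Because the library serializes the text itself, characters such as `&` in a replacement URL are escaped correctly, which a plain text substitution would not do.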

Node Type: perlquestion [id://1208637]
Approved by Corion