Search and replace on a 130000+ line file.

by brendonc (Novice)
on May 11, 2001 at 21:56 UTC

brendonc has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I have two files. One is a list of computer names and IP addresses:

    10.1.3.1:frank
    10.1.3.2:john
    10.1.3.3:mis23455
    ...

And the other file looks like this (a DJB data file):

    ...
    +host-1-3-1-10:10.1.3.1
    +host-2-3-1-10:10.1.3.2
    +host-3-3-1-10:10.1.3.3
    ...

I want to take the first file, loaded with ip:machinename, and search the second file based on the IP address. Then, I want to replace the +host-* part with the proper machine name for that IP address. The finished data file (file 2) would look like this:

    ...
    +frank:10.1.3.1
    +john:10.1.3.2
    +mis23455:10.1.3.3
    ...

In other words, I want to use the data in the first file to search and replace data in the second. The data file is 130000+ lines long. What would be the best way to process such a large file? I'm quite stuck at this point, so any help would be useful.
Thanks.

Replies are listed 'Best First'.
(Ovid) Re: Search and replace on a 130000+ line file.
by Ovid (Cardinal) on May 11, 2001 at 22:21 UTC
    Assumptions:
    • The mapping of IP addresses to computer names is in a relatively small file named 'ip.dat'.
    • You don't want to overwrite your data file.
    • You trust Ovid's untested code :)
    use strict;
    use warnings;

    my $ips      = 'ip.dat';
    my $in_data  = 'DJB.dat';
    my $out_data = 'DJB_new.dat';
    my %ip_map;

    # you should ask for a shared lock in case someone updates this
    open IPS, "< $ips" or die "Couldn't open $ips for reading: $!";
    while ( <IPS> ) {
        my ( $ip, $name ) = split /:/, $_, 2;
        warn "$ip is already mapped to $ip_map{ $ip }" if exists $ip_map{ $ip };
        chomp $name;
        $ip_map{ $ip } = $name;
    }
    close IPS;

    open IN_DATA,  "< $in_data"  or die "Couldn't open $in_data for reading: $!";
    open OUT_DATA, "> $out_data" or die "Couldn't open $out_data for writing: $!";
    while ( <IN_DATA> ) {
        # grab the +host-* token and the IP from lines like "+host-1-3-1-10:10.1.3.1"
        if ( my ( $replace, $ip ) = /^\+([^:]+):([\d.]+)/ ) {
            # \Q...\E so the dashes and dots in the token match literally
            s/\Q$replace\E/$ip_map{ $ip }/ if exists $ip_map{ $ip };
        }
        print OUT_DATA $_;
    }
    close IN_DATA;
    close OUT_DATA;

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: Search and replace on a 130000+ line file.
by ncw (Friar) on May 11, 2001 at 22:27 UTC
    Here is my quick and nasty solution. This loads the first file into a hash, so it may use up a lot of memory; though if it, too, is 130,000 lines, that is only a few megabytes.

    Should run fast! (Tested also ;-)

    use strict;

    die "Syntax: $0 <names> < djb-file > output-file" unless @ARGV == 1;

    # read the ip:name file into a hash keyed by IP address
    my %name_to_ip;
    open(IN, "<" . shift) or die;
    while (<IN>) {
        chomp;
        my ($ip, $name) = split /:/;
        $name_to_ip{$ip} = $name;
    }
    close IN or die;

    # stream the DJB file, swapping in the machine name where the IP is known
    while (<>) {
        chomp;
        my ($name, $ip) = split /:/;
        # keep the leading '+' that the DJB lines carry on the host token
        $name = "+$name_to_ip{$ip}" if $name_to_ip{$ip};
        print $name, ":", $ip, "\n";
    }
      For whatever reason, the line:
      open(IN, "<".shift) or die;
      doesn't work for me with ActiveState Perl 5.6 on an NT workstation when I call the above program as:
      perl tempo.pl ip.dat djb.dat

      However, the program works great when I replace that line with:

      open(IN, "<ip.dat") or die;
      and call it as:
      perl tempo.pl djb.dat
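
      One way to avoid building the "<ip.dat" string by hand is the three-argument form of open with a lexical filehandle, which takes the mode and the filename separately; a minimal sketch, still assuming the names file is the first argument:

      # three-arg open with a lexical handle; shift pulls the names file off @ARGV
      open(my $in, '<', shift @ARGV) or die "Couldn't open names file: $!";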
Re: Search and replace on a 130000+ line file.
by jepri (Parson) on May 11, 2001 at 22:11 UTC
    If you have the disc space then the simplest option is a stream edit, kinda like this (untested code):

    open (INFILE, "<infile.txt") or die "Can't read infile.txt: $!";
    open (OUTFILE, ">outfile.txt") or die "Can't write outfile.txt: $!";
    while (<INFILE>) {
        # $_ will hold one line of your file, and the loop will (slowly)
        # go through your entire file.
        # Mangle $_ here, e.g.
        #     s/^\+host-1-3-1-10:/$ipnum/;
        # print it into the other file
        print OUTFILE $_;
    }
    close INFILE;
    close OUTFILE;

    Or variations on the above, to suit.

    Update: I forgot about the IP bit. Depending on how many there are, you could push them into a hash or use some database solution, as tilly didn't quite mention in the chatbox; a tied-hash sketch along those lines follows below.

    ____________________
    Jeremy
    I didn't believe in evil until I dated it.
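
    For the database route, a tied hash keeps the lookup table on disk rather than in RAM. A minimal sketch using the standard DB_File module; the filenames here are placeholders, not anything from the thread:

    use strict;
    use DB_File;

    # tie the lookup table to an on-disk Berkeley DB file ('ip.db' is
    # a placeholder name) so it never has to fit in memory
    tie my %ip_map, 'DB_File', 'ip.db' or die "Couldn't tie ip.db: $!";

    # populate it once from the ip:name file
    open(IPS, "<ip.dat") or die "Couldn't open ip.dat: $!";
    while (<IPS>) {
        chomp;
        my ($ip, $name) = split /:/, $_, 2;
        $ip_map{$ip} = $name;
    }
    close IPS;

    # lookups now go to the disk file instead of memory
    print $ip_map{'10.1.3.1'}, "\n";    # frank, given the sample data

    untie %ip_map;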

      If the substitution table is too large to load into a hash in its entirety, you could load in a portion of it, process the second file, and load in further portions until the whole table has been applied. This way, you would have to make n passes over the secondary file, where the substitution file is loaded in n pieces (a rough sketch appears after this reply).

      This is, of course, assuming that you can't fit the substitution file entirely into RAM in a hash, and that using a tied hash is out of the question because of speed concerns. However, 130K lookups in a tied hash cannot take that long, so it might be a viable solution. If you were processing a file with 130M lines, though, you would have to think of something else.

      A curious observation, though, is that in your example the "output" file is the same as the input file, just reorganized. I'm sure this is because of simplification on your part.
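
      A rough sketch of the multi-pass idea; the chunk size and filenames are assumptions, and it presumes every data line looks like token:ip as in the samples above:

      use strict;

      my $chunk = 50_000;    # mappings held in RAM per pass (placeholder)
      my $names = 'ip.dat';
      # note: the swap below overwrites the input file from the second pass on,
      # so run this against a copy of the original data file
      my ($in, $out) = ('DJB.dat', 'DJB.tmp');

      open(NAMES, "<$names") or die "Couldn't open $names: $!";
      until (eof(NAMES)) {
          # load the next slice of the substitution table
          my %ip_map;
          while (keys(%ip_map) < $chunk and defined(my $line = <NAMES>)) {
              chomp $line;
              my ($ip, $name) = split /:/, $line, 2;
              $ip_map{$ip} = $name;
          }

          # one full pass over the data file using just this slice
          open(IN,  "<$in")  or die "Couldn't open $in: $!";
          open(OUT, ">$out") or die "Couldn't open $out: $!";
          while (<IN>) {
              chomp;
              my ($name, $ip) = split /:/;
              # replace the +host-* token when this slice knows the IP
              $name = "+$ip_map{$ip}" if exists $ip_map{$ip};
              print OUT "$name:$ip\n";
          }
          close IN;
          close OUT;

          # the output of this pass is the input of the next
          ($in, $out) = ($out, $in);
      }
      close NAMES;
      print "Finished result is in $in\n";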
Re: Search and replace on a 130000+ line file.
by Sifmole (Chaplain) on May 11, 2001 at 22:09 UTC
    I think in this case I would have to ask myself, "What are the criteria that define 'best'?" Is it to minimize memory usage? Processing time? Does it have to remain pure Perl?

    Can you answer these? It would help me provide a suitable suggestion.

Re: Search and replace on a 130000+ line file.
by brendonc (Novice) on May 12, 2001 at 00:02 UTC
    Thanks everyone! I'm now on the right track.
