PerlMonks  

file merge with config(parameter) files

by hyans.milis (Novice)
on Mar 02, 2013 at 13:16 UTC ( #1021412=perlquestion )
hyans.milis has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

Currently I am unsure how best to merge two big files and one config (parameter) file. The first file has around 20 million records, the second around 400K, and the config file around 50 records.

Given those sizes, I need your advice on whether to do the merge in Perl (using hashes) or in an SQL database (PL/SQL) to make it faster.

Logic:

1. Get the 1st, 4th and 6th columns from file1.

2. Compare the 4th and 6th columns against the config file to find a mapping.

3. If a mapping is found in step 2, compare against file2 by:

a. comparing the 1st column of file1 with the 2nd column of file2

b. comparing the 2nd column of the config file with the 1st column of file2

4. Merge into a new file by:

a. replacing the 4th column of file1 with the 1st column of the config file

b. replacing the 6th column of file1 with the 5th column of the config file

c. adding the 3rd & 4th columns from file2

first file : customer,subscriber,cust_code,custplan,status,plan_rate,remaining_plan,quote
623192729079,510993192729079,19,22,0,90,282051608,
623192728769,510993192728769,19,22,0,210,84950715,
623192729901,510993192729901,19,70,0,82,
623192609007,510993192609007,19,70,0,85,
623416771429,510993416771429,19,1,0,95,
622319309157,510992319309157,19,22,0,30,
623192724581,510993192724581,19,75,0,4,
622319381619,510992319381619,19,70,0,5,
622198575655,510992198575655,19,1,0,5,
623192724589,510993192724589,19,70,0,6,
622743581281,510992743581281,19,71,0,8,
2nd file : prod,subscriber,startdate,expdate
SUPERG30^623192729079^20121225120308^20130124120308
SUPERM30^623192728769^20130105185053^20130204185053
SUPERB30^623192729901^20121209150053^20130108150053
SUPERC30^623192609007^20130102111208^20130201111208
config file : ccode prod custplan plan_rate new_plan
5 SUPERG30 22 90 100_1d_unlim
10 SUPERM30 22 210 1g_15d_unlim
15 SUPERB30 70 82 700_30d_lim10
25 SUPERC30 70 85 1g_7d_lim5
expected results : customer,subscriber,cust_code,ccode,status,new_plan,remaining_plan,quote,startdate,expdate
623192729079,510993192729079,19,5,0,100_1d_unlim,282051608,20121225120308,20130124120308,
623192728769,510993192728769,19,10,0,1g_15d_unlim,84950715,20130105185053,20130204185053,
623192729901,510993192729901,19,15,0,700_30d_lim10,,20121209150053,20130108150053,
623192609007,510993192609007,19,25,0,1g_7d_lim5,,20130102111208,20130201111208,
623416771429,510993416771429,19,1,0,unlim,,,,
622319309157,510992319309157,19,22,0,unlim,,,,
623192724581,510993192724581,19,75,0,unlim,,,,
622319381619,510992319381619,19,70,0,unlim,,,,
622198575655,510992198575655,19,1,0,unlim,,,,
623192724589,510993192724589,19,70,0,unlim,,,,
622743581281,510992743581281,19,71,0,unlim,,,,
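The steps above can be sketched as a small Perl routine. This is a hypothetical sketch with a fragment of the sample data inlined as hashes (real code would read the three files); the `unlim` fallback for unmatched rows is taken from the expected results, and the handling of the empty quote column follows the headers rather than the slightly inconsistent sample rows.

```perl
use strict;
use warnings;

# config, keyed on "custplan,plan_rate" -> [ccode, prod, new_plan]
my %config = (
    '22,90'  => [ 5,  'SUPERG30', '100_1d_unlim' ],
    '22,210' => [ 10, 'SUPERM30', '1g_15d_unlim' ],
);
# file2, keyed on subscriber -> [prod, startdate, expdate]
my %file2 = (
    '623192729079' => [ 'SUPERG30', '20121225120308', '20130124120308' ],
);

sub merge_line {
    my ($line) = @_;
    my @f   = split /,/, $line, -1;              # step 1: file1 columns
    my $map = $config{"$f[3],$f[5]"};            # step 2: match custplan + plan_rate
    my $f2  = $file2{ $f[0] };                   # step 3a: file1 col 1 vs file2 col 2
    if ($map && $f2 && $f2->[0] eq $map->[1]) {  # step 3b: prod must match too
        $f[3] = $map->[0];                       # step 4a: ccode
        $f[5] = $map->[2];                       # step 4b: new_plan
        push @f, $f2->[1], $f2->[2];             # step 4c: startdate, expdate
    }
    else {
        $f[5] = 'unlim';                         # default seen in the expected results
        push @f, '', '';
    }
    return join ',', @f;
}
```

Every file1 line then costs two hash lookups, which is what makes the 20-million-record pass feasible.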
Please advise.

Re: file merge with config(parameter) files
by moritz (Cardinal) on Mar 02, 2013 at 13:47 UTC
    due to this, need your help whether we can use perl (using hash) for merging files or using sql database (pl sql) to make it faster.

    It should be quite possible to do it in Perl. Keep the config file and the relevant parts of the smaller file (file2) in memory, and process the larger file (file1) line by line.

    Of course you can do it in SQL too. If the records already exist in a database, that's probably the simpler approach. If not, importing them will likely take as long as doing the whole processing in Perl.
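    The "keep file2 in memory" part can be sketched like this (a minimal illustration, assuming file2's `^`-separated layout from the samples above; `load_file2` is a hypothetical helper name):

```perl
use strict;
use warnings;

# Read the ^-separated file2 once into a hash keyed on its subscriber
# column, so each file1 line later costs a single O(1) hash lookup
# instead of a rescan of file2.
sub load_file2 {
    my ($fh) = @_;
    my %by_subscriber;
    while (my $line = <$fh>) {
        chomp $line;
        my ($prod, $subscriber, $start, $end) = split /\^/, $line;
        $by_subscriber{$subscriber} = [ $prod, $start, $end ];
    }
    return \%by_subscriber;
}
```

    While reading file1, `$by_subscriber->{$customer}` then yields the matching [prod, startdate, expdate] record, or undef when there is none.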

      Hi Moritz,

      My idea is to process the first file line by line. Could you explain what it means to store the smaller file (file2) in memory? Is there a sample of this? I'm a newbie in Perl. Thanks.

Re: file merge with config(parameter) files
by ww (Bishop) on Mar 02, 2013 at 22:57 UTC
    And the code you're having trouble with is where?

    If you didn't program your executable by toggling in binary, it wasn't really programming!

      Hi, here is the code I created following Moritz's suggestion:

      1. Put file2 into a hash.

      2. Process file1 line by line, mapping against the config file.

      Then search line by line, comparing the 1st column of file1 with the 2nd column of file2.

      Please advise whether I can optimize the code further, since the first file contains a huge number of records.

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Load file2 (^-separated) into a hash keyed on its 2nd column (subscriber).
      my %file2;
      open(my $data2, '<', "file2.txt") or die "Could not open 'file2.txt' $!\n";
      while (my $line = <$data2>) {
          chomp $line;
          my ($prod, $subscriber, $start, $end) = split /\^/, $line, -1;
          $file2{$subscriber} = [ $prod, $start, $end ]
              unless exists $file2{$subscriber};
      }
      close($data2);

      # Load the config file (tab-separated) into a hash keyed on
      # "custplan\tplan_rate", instead of shelling out to grep per line.
      my %config;
      open(my $datamap, '<', "mapping.txt") or die "Could not open 'mapping.txt' $!\n";
      while (my $line = <$datamap>) {
          chomp $line;
          my ($ccode, $prod, $custplan, $plan_rate, $new_plan) = split /\t/, $line, -1;
          $config{"$custplan\t$plan_rate"} = [ $ccode, $prod, $new_plan ];
      }
      close($datamap);

      # Process the big file1 line by line; each record needs only hash lookups.
      open(my $data1, '<', "file1.txt") or die "Could not open 'file1.txt' $!\n";
      open(my $out,   '>', "merged.txt") or die "Could not open 'merged.txt' $!\n";
      while (my $line = <$data1>) {
          chomp $line;
          my @fields   = split /,/, $line, -1;
          my $customer = $fields[0];
          my $mapping  = $config{"$fields[3]\t$fields[5]"};
          if ($mapping) {
              my $f2 = $file2{$customer};
              if ($f2 && $f2->[0] eq $mapping->[1]) {
                  $fields[3] = $mapping->[0];    # ccode
                  $fields[5] = $mapping->[2];    # new_plan
                  $fields[8] = $f2->[1];         # startdate
                  $fields[9] = $f2->[2];         # expdate
                  print $out join(',', @fields), "\n";
                  next;
              }
              warn "We did not locate entry in hash table\n";
          }
          $fields[5] = "unlim";
          print $out join(',', @fields), "\n";
      }
      close($data1);
      close($out);

Node Type: perlquestion [id://1021412]
Approved by mbethke