Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"

Merge Purge

by krazken (Scribe)
on Mar 22, 2002 at 05:34 UTC ( [id://153491]=perlquestion: print w/replies, xml ) Need Help??

krazken has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone have or know of an easy way to do matchgroup/overlay type of flat file I/O? Example:
##layout is as follows.. ID|name|address|city|state|zip|phone|matchkey 1|krazken|123 Main|BFE|AR|72210|555-2345|1 2|kraken||||||1 3|krayken||||555-2345|1
Here I have 3 distinct records. I have done something to match them together as noted by the matchkey. I know that they are a duplicate record even though there is variation in the name. My question is does anyone have a good way of grouping these records together so that I can populate missing data where fields are missing? Basically I am wanting to do an overlay for those who have heard of that before... my output should look like
1|krazken|123 Main|BFE|AR|72210|555-2345|1 2|kraken|123 Main|BFE|AR|72210|555-2345|1 3|krayken|123 Main|BFE|AR|72210|555-2345|1
I have tried anonymous hashes on the matchkey, and that works ok for small stuff, but when you have flat files that have millions of records in it, this gets expensive in a hurry, and I usually run out of memory. I have tried tie'ing my hashes with DB_File to save memory, but that is just too dang slow. So I was wondering if anyone has done this type of stuff in perl, and if so, how did you do it?

TIA krazken

Replies are listed 'Best First'.
(jeffa) Re: Merge Purge
by jeffa (Bishop) on Mar 22, 2002 at 06:07 UTC
    Seems rather odd ... but interesting:
    use strict; my (@line,@last); while (<DATA>) { @line = split('\|',$_,8); @last = map { $line[$_] || $last[$_] } (0..$#line); print join('|',@last); } __DATA__ 1|krazken|123 Main|BFE|AR|72210|555-2345|1 2|kraken||||||1 3|krayken|||||555-2345|1
    So basically, read the line and use the value found on the previous (last) line if one was not found on this line. Only works if the first line contains no blank values. (and you only had six pipes on the third line - i added a seventh to my example.)


    (the triplet paradiddle with high-hat)
Re: Merge Purge
by shotgunefx (Parson) on Mar 22, 2002 at 06:02 UTC
    You probably should be using a database, but if you are dead set against that for some reason...
    My first thought is how have you determined that this record matches another? By looking at other records? If so, why break the "clumping" into a seperate pass?

    Tailoring a better solution will depend on what your data is. What changes, what doesn't etc. Should all output records contain all fields that every other matching record contains?

    If you on nix or are using cygwin, maybe you should make match the first key in the file and then pipe it into GNU sort (Good at handling large files and pretty fast.) Then all you reads should be sequential and you'll only have to hold the current match data in memory. Hope this helps.


    "To be civilized is to deny one's nature."
      I would use a database for this, but I already have it in a flat file, and the program that assigns that matchkey runs on a flat file as well, so instead of wasting time trying to load millions of records into a database, I just work on the flat file. Plus the file is already sorted on the matchkey coming out the previous program. I probably need to take the approach of taking advantage of the fact that the file is sorted and read until my matchkey changes then process that matchgroup and then read the next. But, there are times when I /try/ to write flexible code to where it wouldn't matter if the file was sorted or not. I would like it to work either way. make sense?
        I think with your DB_File approach, the biggest problem is one read/write for every record. I had a similar problem with a search index for 5,000,000 books. The thing took around 18 hours to finish. Taking advantage or sorting it and working with the current record cut it down to 17 minutes.

        One thing I thought of (I don't know if someone else has done it. Couldn't find it at the time.) was to subclass tie DB_File and make a hash that wouldn't always read and write on every access. It would have an intermidiate cache. If you implemented caching behavior like this, it would probably speed it up an order of magnitude when the data was fairly sorted and still work about the same for the general case all while being nice and generic.


        "To be civilized is to deny one's nature."
Re: Merge Purge
by abstracts (Hermit) on Mar 22, 2002 at 06:20 UTC

    What you need is a hash of hashes with one of the values is an array of names. The keys of the hash are the matchkeys and the values would be references to arrays holding the values.

    #!/usr/bin/perl -w use strict; use Data::Dumper; my @FLDS = qw(ID name address city state zip phone matchkey); my @RFLDS = qw(address city state zip phone); my @UFLDS = qw(ID name); my %records; for(<DATA>){ chomp; my %rec; @rec{@FLDS} = split/\|/; my %urec; @urec{@UFLDS} = @rec{@UFLDS}; my $key = $rec{matchkey}; defined($key) or die "This record has no key"; push @{ $records{$key}{'records'} }, \%urec; for(@RFLDS){ $records{$key}{$_} = $rec{$_} unless exists $records{$key}{$_}; } } for my $key (sort keys %records){ my %master = %{$records{$key}}; for(@{ $master{records} }){ my %rec = (%master, %$_); $rec{matchkey} = $key; print join('|', @rec{@FLDS}) . "\n"; } } print Dumper(\%records); __DATA__ 1|krazken|123 Main|BFE|AR|72210|555-2345|1 2|kraken||||||1 3|krayken|||||555-2345|1
    I believe this is as effecient a data structure as one could come up with.

    Here is a dump of the data structure to make you believe what the code is doing:

    1|krazken|123 Main|BFE|AR|72210|555-2345|1 2|kraken|123 Main|BFE|AR|72210|555-2345|1 3|krayken|123 Main|BFE|AR|72210|555-2345|1 $VAR1 = { '1' => { 'state' => 'AR', 'zip' => '72210', 'address' => '123 Main', 'city' => 'BFE', 'phone' => '555-2345', 'records' => [ { 'ID' => '1', 'name' => 'krazken' }, { 'ID' => '2', 'name' => 'kraken' }, { 'ID' => '3', 'name' => 'krayken' } ] } };
    Update: As others have already suggested, if you have millions of records, you really should consider an SQL based Database. Dumping the flat file to the database should be simple and it does avoid all the redundant information you're trying to add to your flat file.

    Update2: The code presented assumes nothing about the order at which records appear on the file. The runtime and memory requirements can be reduces dramatically IF there are some assumptions about the records (e.g. Records with data always appear before records with missing data. Or records with the same matchkey always appear grouped together.) If you sort your data beforehand, then you can achieve better runtime.

    Hope this helps,,,


      A compromise might be to make a pass over the file once and make a DB_File database of the canonical information. Then run over the file again, and when there's missing fields consult the db for the information. That way you reduce the amount of info you have to keep in memory, at the cost of running over the file twice.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://153491]
Approved by root
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (6)
As of 2024-05-21 07:00 GMT
Find Nodes?
    Voting Booth?

    No recent polls found