PerlMonks
Re: File Handling for Duplicate Records

by Thelonius (Priest)
on Dec 22, 2006 at 11:27 UTC [id://591299]


in reply to File Handling for Duplicate Records

If you're on a Unix-ish system (having sort and join)--or cygwin on Windows--you can do this with a few lines of shell:
    perl -ne '
        if (!/^2/) {
            $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
            print "$k|$_";
        }' file1 | sort -t "|" -k 1,1 >file1.sorted

    # This code assumes the fields are in the same place in file2
    # as they are in file1, but if not, you'll have to change this.
    perl -ne '
        $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
        print "$k\n";
        ' file2 | sort -t "|" -k 1,1 >file2.sorted
    # I am only outputting the key here since you don't seem
    # to be doing anything with the rest of 'line2'

    join -t '|' file1.sorted file2.sorted | cut -d '|' -f 2 >duplicates
With the input of file1:
    3 110582 SFCA 4158675309 041414041421
    3 060784 NYNY 2125552368 190159204657
    3 121906 RANC 9195551234 123401123620
and file2:
3 110582 SFCA 4158675309 041414041421
your program and mine both produced the output:
3 110582 SFCA 4158675309 041414041421
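If you want to try the mechanics of the key|line / sort / join / cut pipeline without fixed-width data, here is a self-contained sketch. The file names and two-column records are made up for illustration, and the key is simply the first whitespace-separated field rather than the substr() offsets above.

```shell
#!/bin/sh
# Demo of the key|line -> sort -> join -> cut pipeline on made-up data.
set -e
cd "$(mktemp -d)"

cat >file1 <<'EOF'
A001 first record
B002 second record
C003 third record
EOF

cat >file2 <<'EOF'
B002
EOF

# Prefix each full line of file1 with its key and a "|" delimiter,
# then sort on the key field.
awk '{print $1 "|" $0}' file1 | sort -t '|' -k 1,1 >file1.sorted

# file2 already holds bare keys; sort them the same way.
sort file2 >file2.sorted

# Join on the key, then strip the key prefix to recover the original lines.
join -t '|' file1.sorted file2.sorted | cut -d '|' -f 2 >duplicates

cat duplicates
```

Running this prints the one record of file1 whose key appears in file2.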

Notes:

  • Make sure you use a delimiter character (I used "|") that's not in the data. You're not limited to printable characters.
  • Strictly speaking, there could be some difference in the output of the two programs. You truncated line1 at 210 characters; I don't. If line1 matches more than one line in file2, I produce multiple lines of output; you produce only one. Our output is also in a different order.
  • You could save time if you know one of the files is already sorted. For example, maybe file2 doesn't change each run. You can also merge two sorted files using sort -m.
  • If you want the lines that are not duplicates, use join -v
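As a quick illustration of that flag (file names and keys made up): plain join prints the keys common to both sorted files, join -v 1 prints the ones only in the first, and join -v 2 the ones only in the second.

```shell
#!/bin/sh
# join vs. join -v on two tiny sorted key files (made-up data)
set -e
cd "$(mktemp -d)"

printf 'a\nb\nc\n' >left     # keys in file 1, sorted
printf 'b\nd\n'    >right    # keys in file 2, sorted

join      left right         # keys present in both files
join -v 1 left right         # keys only in 'left'
join -v 2 left right         # keys only in 'right'
```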

For example, say you have a new file, newdata, and a file, alreadyprocessed, which corresponds to my file2.sorted above; that is, it's just the keys in sorted order. You could do this:

    perl -ne '
        if (!/^2/) {
            $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
            print "$k|$_";
        }' newdata | sort -t "|" -k 1,1 >newdata.sorted

    join -t '|' -v 1 newdata.sorted alreadyprocessed >needsprocessing
    cut -d '|' -f 2 needsprocessing >processinput

    # Then do the processing
    # ...
    # ...

    # If everything runs okay
    cut -d '|' -f 1 needsprocessing | sort -m - alreadyprocessed >mergeout
    mv alreadyprocessed alreadyprocessed.bak
    mv mergeout alreadyprocessed
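To see that bookkeeping end to end on toy data, here is a self-contained run of the same steps; the records are made up and whole lines serve as keys, so the substr()/cut steps are omitted.

```shell
#!/bin/sh
# Toy run of the newdata/alreadyprocessed bookkeeping, with whole
# lines as keys so the sketch stays short.
set -e
cd "$(mktemp -d)"

printf 'aaa\nbbb\nccc\n' >newdata
printf 'bbb\n'           >alreadyprocessed   # already-sorted keys

sort newdata >newdata.sorted
# -v 1: lines of newdata.sorted with no match in alreadyprocessed
join -v 1 newdata.sorted alreadyprocessed >needsprocessing

# ... process needsprocessing here ...

# If everything ran okay, merge the new keys back into the sorted key file.
sort -m needsprocessing alreadyprocessed >mergeout
mv alreadyprocessed alreadyprocessed.bak
mv mergeout alreadyprocessed
```

After the run, needsprocessing holds the two previously unseen records, and alreadyprocessed holds all three keys in sorted order.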

Replies are listed 'Best First'.
Re^2: File Handling for Duplicate Records
by sgt (Deacon) on Dec 22, 2006 at 15:31 UTC

    What about comm? Or am I missing something? Of course, if file1 needs transforming first, use perl or whatever filter.

    # comm -12 <(sort file1) <(sort file2) > dups.out

    If the 'cmd <(cmd1 ...) ...' notation is not supported by your shell, it stands for the two-step process 'cmd1 ... > temp1; cmd temp1 ...', cmd1 being a "filter".

    or in other "words" ;)

    % stephan@armen (/home/stephan)
    % cat dat1
    3 110582 SFCA 4158675309 041414041421
    3 060784 NYNY 2125552368 190159204657
    3 121906 RANC 9195551234 123401123620
    % stephan@armen (/home/stephan)
    % cat dat2
    3 110582 SFCB 2258675309 041414041421
    3 110582 SFCA 4158675309 041414041421
    % stephan@armen (/home/stephan)
    % sort dat1 > dat1.sorted
    % stephan@armen (/home/stephan)
    % sort dat2 > dat2.sorted
    % stephan@armen (/home/stephan)
    % comm -12 dat1.sorted dat2.sorted
    3 110582 SFCA 4158675309 041414041421
    hth --stephan, just another unix hacker,
