
Best way to search file

by insta.gator (Novice)
on Apr 15, 2015 at 15:44 UTC ( #1123519=perlquestion )

insta.gator has asked for the wisdom of the Perl Monks concerning the following question:

I am a super novice Perl programmer.

I have 2 delimited files. File 1 is pipe delimited and contains names with social security numbers. File 2 is comma delimited and contains names, socials and a unique identifier that is alphanumeric. There is a maximum of 1 row of data for each person in file 2 (there could be none). There is 1 or more rows of data for each person in file 1 -- could be 1, or there may be 30 -- and the rows are typically interspersed throughout the file (not together). What I need to do is grab the SSN from file 1, find that SSN in file 2, grab the corresponding unique identifier from file 2 and plug it into a field in the record in file 1. I have a Perl script that works, but it is very, very slow. I am splitting each file 1 line into an array, doing the lookups and replaces and then writing the line to a file, repeating until I have processed all the records in file 1.

Sorry to be so long winded. Any suggestions on how to improve (speed up) would be greatly appreciated.

I would be happy to post my code but it is probably too long for this small box.

Replies are listed 'Best First'.
Re: Best way to search file
by jeffa (Bishop) on Apr 15, 2015 at 16:24 UTC

    Storing SSN data plain text? Shame! As someone who has had their identity stolen, I strongly urge you to consider using an alternate identifier. Seriously. Now is the time.

    Perhaps you can construct 2 sample data files -- minus any personal information, of course -- and post a portion of your algorithm along with them. By breaking your problem down, you often find your own solution; at the very least, you have a better chance of receiving meaningful answers sooner.



      Storing SSN data plain text? Shame!

      Isn't that a "crime" now?

Re: Best way to search file
by choroba (Cardinal) on Apr 15, 2015 at 16:26 UTC
    Store the mapping of social numbers to identifiers in a hash:
    #! /usr/bin/perl
    use warnings;
    use strict;

    my %alphanum;
    open my $COMMA, '<', '' or die $!;
    while (<$COMMA>) {
        my ($social, $id) = (split /,/)[1, 2];
        $alphanum{$social} = $id;
    }

    open my $VBAR, '<', '' or die $!;
    while (<$VBAR>) {
        chomp;
        my $social = (split /\|/)[1];
        if (exists $alphanum{$social}) {
            print "$_|$alphanum{$social}";
        } else {
            print "$_\n";
        }
    }
Re: Best way to search file
by Laurent_R (Canon) on Apr 15, 2015 at 17:31 UTC
    If your process is so slow, it is quite likely because you are scanning the full content of file2 for each line of file1.

    If this is the case, then you will find that storing file2 in a hash before starting to process file1 makes the process dramatically faster. And the larger file2 is, the greater the speed gain.

    As mentioned by sundialsvc4, the only limit to that is that if file2 is so big that the hash will take all the memory, then the hash is no longer a solution. (It depends on your system, but with today's typical RAM, my experience is that the limit could be somewhere between 5 and 15 million lines for file2.)

    In that case, I would really recommend sorting both files and reading them sequentially in parallel. In my experience this is, for huge files, much faster than using a database. The only downside of this approach is that the algorithm for reading 2 files in parallel can be a bit tricky, with quite a few edge cases to be taken care of.
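    The parallel read of two sorted files could be sketched roughly as follows. This is an illustration, not the OP's code: the field positions (SSN first in both files), fixed-width numeric SSNs (so string comparison with lt works), and pre-sorted inputs are all assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a sorted-merge: both filehandles must already be sorted by
# SSN (assumed to be the first field in each file). Because file1 can
# repeat an SSN, the current file2 line is NOT consumed on a match.
sub merge_sorted {
    my ($fh1, $fh2, $out) = @_;
    my $line2 = <$fh2>;
    while (my $line1 = <$fh1>) {
        chomp $line1;
        my $ssn1 = (split /\|/, $line1)[0];
        # advance file2 until its SSN is >= the current file1 SSN
        while (defined $line2 && (split /,/, $line2)[0] lt $ssn1) {
            $line2 = <$fh2>;
        }
        my $id = '';
        if (defined $line2) {
            chomp(my $copy = $line2);
            my ($ssn2, $candidate) = (split /,/, $copy)[0, 1];
            $id = $candidate if $ssn2 eq $ssn1;
        }
        print {$out} "$line1|$id\n";
    }
}
```

    With both inputs sorted, each line is touched exactly once, so the total cost is dominated by the sorts themselves.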

    Je suis Charlie.

      Thanks for the response. I have yet to come across a file 2 that is greater than 100,000 lines, so I think that I will look at loading file2 into a hash and searching that. Now I just have to figure out how to do that... :-)

      Thanks for the help. Depending on my results, I may reach out for more assistance.

        Now I just have to figure out how to do that... :-)
        choroba gave you the basic idea of the hash solution in this post: Re: Best way to search file in answer to your OP. But feel free to come back if you encounter implementation problems.


        Feel free to reach out, but I doubt that you will have any trouble with it, once you’ve studied the previous example.   (If you do, don’t waste your own time:   ask.)

        Also:   when you load the data into your hash, you should not take for granted that your input file is error-free.   As you load the hash, I would recommend that you test whether the key already exists() in the hash, and die() if it does.   “Trust, but verify.”

        The data volumes that you indicate certainly seem to be appropriate for the use of a hash, and that’s the way I would pursue it.
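        A minimal sketch of that load-and-verify step (assuming, as in choroba's example above, that the SSN is the second comma-separated field and the identifier the third):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load the SSN-to-identifier map from file2, dying on a duplicate SSN
# so a corrupt input file is caught early ("trust, but verify").
sub load_map {
    my ($fh) = @_;
    my %alphanum;
    while (my $line = <$fh>) {
        chomp $line;
        my ($social, $id) = (split /,/, $line)[1, 2];
        die "duplicate SSN '$social' at input line $.\n"
            if exists $alphanum{$social};
        $alphanum{$social} = $id;
    }
    return \%alphanum;
}
```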

Re: Best way to search file
by sundialsvc4 (Abbot) on Apr 15, 2015 at 17:16 UTC

    A lot depends on just how big the two files are.   If they are of a size that would comfortably(!) fit in memory, a hash would do nicely.   If they are larger, consider either sorting the two files (which will reduce the problem to a simple “merge”), or perhaps use an SQLite database file.   If you do the latter, your problem becomes an INNER JOIN.

    “Two identically sorted files” is the old-school technique ... that’s literally what they were doing with all those tape drives, in the days of yore ... but it is a good one, especially if one or both of the files are already sorted and can stay that way.   The entire operation can be performed using one sequential pass through both files, no matter how large they are.   The price paid is the cost of sorting.   (That cost is amortized if the file, known to be sorted and kept sorted, can then be reused in the future.   A sequential file can be sequentially updated by a sorted transaction file that is applied to it by appropriate code, and this also occurs in one sequential pass.)

    If you use a hash, the operative word is “comfortably.”   If the hash is so large that the operating-system starts paging, a hash can perform exceptionally badly because it makes fairly-random references to memory addresses.   Hashes exhibit the opposite of the “locality of reference” behavior upon which efficient virtual-memory depends.

    If you use SQLite, then once again you are paying a stiff file-copying price ... unless you can use the file multiple times in multiple runs ... say, using it as your master-file instead of your present flat-file #1.

      (...) or perhaps use an SQLite database file. If you do the latter, your problem becomes an INNER JOIN.
      Or even easier, if the files are small enough to fit in ram, use DBD::RAM which supports both file formats.
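      For completeness, the SQLite route might look roughly like this: a sketch using DBI with DBD::SQLite, where the table layout and sample rows are invented, and a LEFT JOIN is used rather than a plain INNER JOIN so that file1 rows without a match are kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Skip gracefully if DBI/DBD::SQLite are not installed.
BEGIN {
    eval { require DBI; require DBD::SQLite; 1 }
        or do { warn "DBI/DBD::SQLite not available, skipping\n"; exit 0 };
}

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });

# Invented layout: one table per input file.
$dbh->do('CREATE TABLE file1 (ssn TEXT, rest TEXT)');
$dbh->do('CREATE TABLE file2 (ssn TEXT, id TEXT)');

my $ins1 = $dbh->prepare('INSERT INTO file1 VALUES (?, ?)');
$ins1->execute(@$_) for ['111', 'a'], ['111', 'b'], ['222', 'c'];
my $ins2 = $dbh->prepare('INSERT INTO file2 VALUES (?, ?)');
$ins2->execute('111', 'X1');

# LEFT JOIN keeps every file1 row; COALESCE blanks the missing ids.
my $rows = $dbh->selectall_arrayref(
    q{SELECT f1.ssn, f1.rest, COALESCE(f2.id, '')
        FROM file1 AS f1
        LEFT JOIN file2 AS f2 ON f1.ssn = f2.ssn
       ORDER BY f1.rowid});
print join('|', @$_), "\n" for @$rows;
```

      The price, as noted above, is loading the tables on every run unless the database file is kept and reused as the master copy.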

Node Type: perlquestion [id://1123519]
Approved by Corion