PerlMonks  

Re^3: Best way to search file

by sundialsvc4 (Abbot)
on Apr 15, 2015 at 22:57 UTC ( [id://1123574]=note )


in reply to Re^2: Best way to search file
in thread Best way to search file

Feel free to reach out, but I doubt that you will have any trouble with it, once you’ve studied the previous example.   (If you do, don’t waste your own time:   ask.)

Also: when you load the data into your hash, do not take for granted that your input file is free of errors. As you load the hash, I would recommend that you test whether the key already exists() in the hash, and die() if it does. “Trust, but verify.”
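The check described above might look something like the following. This is only a sketch: `load_ssn_hash` is a hypothetical name, and the `","` separator and the `[4,2]` field positions are borrowed from the loop posted in the reply below, so adjust them to your real layout.

```perl
use strict;
use warnings;

# Hypothetical helper: load SSN => AOID pairs from an open filehandle,
# refusing to silently overwrite a key that has already been seen.
# Assumption: fields are separated by '","' with the SSN at index 4
# and the AOID at index 2.
sub load_ssn_hash {
    my ($fh) = @_;
    my %ssnhash;
    while (my $line = <$fh>) {
        chomp $line;
        my ($ssn, $aoid) = (split /","/, $line)[4, 2];

        # A short or malformed line leaves one of these undefined.
        die "Malformed record at input line $.\n"
            unless defined $ssn && defined $aoid;

        # Report the line number only, never the SSN itself.
        die "Duplicate SSN at input line $.\n"
            if exists $ssnhash{$ssn};

        $ssnhash{$ssn} = $aoid;
    }
    return \%ssnhash;
}
```

Dying on the first duplicate is the strict choice. If you would rather see every problem in one run, you could collect the offending line numbers in an array and report them all at the end.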

The data volumes that you indicate certainly seem to be appropriate for the use of a hash, and that’s the way I would pursue it.

Replies are listed 'Best First'.
Re^4: Best way to search file
by insta.gator (Novice) on Apr 16, 2015 at 18:19 UTC

    Thanks Sundial. A couple of questions.

    I was able to get the hash created and working properly. Now I need to take care of some details. Depending on the type of file that I am using, the SSN may or may not have hyphens in it. How would you strip the hyphens while loading the hash? This is what I have now:

    while (<$HRDATA>) {
        my ($ssn, $aoid) = (split /","/)[4, 2];
        $ssnhash{$ssn} = $aoid;
    }

    Basic I am sure but I am just learning.

    Secondly, again depending on file type, the SSN may be in field 2 or field 4 of file 2. The file type where the SSN is in field 2 has a header line at the top. The only way that I can see to programmatically know which is which is to query the first line of the file. Once I know that, I can tweak my code to load the SSN into the hash from the proper field. Does that make sense? Any thoughts on a better way?
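One way to sketch the "look at the first line" idea is below. The real header text is not shown in this thread, so the test used here is an assumption: a header line is taken to be any first line in which no field looks like an SSN. `ssn_index_from_first_line` is a hypothetical name, and "field 2" and "field 4" are treated as zero-based indexes, matching the `[4,2]` slice in the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which field holds the SSN by inspecting
# the first line of the file.
# Assumption: a header line contains no field that looks like an SSN
# (nine digits, optionally hyphenated 3-2-4); a data line does.
sub ssn_index_from_first_line {
    my ($first_line) = @_;
    chomp $first_line;

    my @fields = split /","/, $first_line;
    s/^"|"$//g for @fields;    # drop the quote left on the first/last field

    # grep in scalar context counts the SSN-shaped fields.
    my $looks_like_data = grep { /^\d{3}-?\d{2}-?\d{4}$/ } @fields;

    # Header present => SSN in field 2; no header => SSN in field 4.
    return $looks_like_data ? 4 : 2;
}
```

Keep in mind that when the first line turns out to be a header it should be skipped, and when it is not a header it is a real record and still has to go into the hash.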

    Thanks!!

      One way to strip out the "-" characters is like this:
      #!/usr/bin/perl
      use strict;
      use warnings;

      foreach my $ssn (qw(123-45-6789 987654321)) {
          my $digits = $ssn;
          $digits =~ s/-//g;    # remove every hyphen
          print "$ssn \t$digits\n";
      }
      __END__
      prints:
      123-45-6789     123456789
      987654321       987654321
      I am not sure of the best way to handle this "sometimes field 2 vs 4" without seeing a few example lines of these databases. Don't post any real SSNs!

      As mentioned before, your HUGE performance gain will come from processing each of the 2 files only once. Process file 2 first to build a structure in memory, then process file 1 line by line. Each file should be read only once.
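The two-pass flow described above could be sketched as follows. The layout of file 1 has not been shown in this thread, so the sketch assumes it uses the same `","` separator and takes the field positions as parameters. `build_lookup` and `match_lines` are hypothetical names. Hyphens are stripped on both sides so that both SSN formats end up as the same hash key.

```perl
use strict;
use warnings;

# Pass 1: read file 2 once and build an SSN => AOID lookup in memory.
# Hyphens are removed so "123-45-6789" and "123456789" share one key.
sub build_lookup {
    my ($fh, $ssn_idx, $aoid_idx) = @_;
    my %aoid_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($ssn, $aoid) = (split /","/, $line)[$ssn_idx, $aoid_idx];
        next unless defined $ssn && defined $aoid;   # skip short lines
        $ssn =~ s/-//g;
        $aoid_for{$ssn} = $aoid;
    }
    return \%aoid_for;
}

# Pass 2: read file 1 once, line by line, and look each SSN up in the hash.
# Assumption: file 1 is also '","' separated, with the SSN at $ssn_idx.
sub match_lines {
    my ($fh, $ssn_idx, $aoid_for) = @_;
    my @matches;
    while (my $line = <$fh>) {
        chomp $line;
        my $ssn = (split /","/, $line)[$ssn_idx];
        next unless defined $ssn;
        $ssn =~ s/-//g;
        push @matches, [ $ssn, $aoid_for->{$ssn} ]
            if exists $aoid_for->{$ssn};
    }
    return \@matches;
}
```

Each hash lookup takes roughly constant time, so the total work grows with the combined size of the two files instead of with their product, which is where the speedup over rescanning file 2 for every line of file 1 comes from.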
