
Comparing large files

by Herbert37 (Novice)
on Feb 11, 2014 at 19:04 UTC ( #1074474=perlquestion )
Herbert37 has asked for the wisdom of the Perl Monks concerning the following question:

Okay, part deux. Hash look-ups are the way to solve this problem. I withdraw the question.

Okay, sorting both files appears to be the only answer I can find... Is there another?

I have two large (10Mg plus) files of words: one contains essentially just words, the other words and their pronunciations.

I want to discover and store how many of the words in the file without pronunciations have their pronunciations in the other file.

The first way, and only way, I can think of to do this is to check each word in the file without pronunciations against every word in the pronunciation file, but that will result in 10M X 10M comparisons and be very costly in time.

Is there another way? I believe I have seen one, but cannot remember it.

Thank you.

Replies are listed 'Best First'.
Re: Comparing large files
by BrowserUk (Pope) on Feb 11, 2014 at 19:30 UTC
    I have two large (10Mg plus)

    10 Milligram files? Must be light words :)

    Assuming you mean 10 million, and you have at least 2GB of ram, then:

    Load the first file of words into a hash.

    Read the second file line by line and check if the word is in the hash.

    Don't forget to chomp the newlines.
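    Those steps might look like this in Perl -- a minimal sketch, assuming the pronunciation file is tab-separated with the word first (the file names and sample data are made up; substitute your own):

```perl
use strict;
use warnings;

# Hypothetical file names -- substitute your own.
my $pron_file  = 'pronunciations.txt';   # "word<TAB>pronunciation" per line
my $words_file = 'words.txt';            # one word per line

# Demo setup (replace with your real files): write tiny sample inputs.
open my $out, '>', $pron_file or die $!;
print {$out} "cat\tK AE T\ndog\tD AO G\nfish\tF IH SH\n";
close $out;
open $out, '>', $words_file or die $!;
print {$out} "cat\ndog\nbird\n";
close $out;

# Step 1: load the pronunciation file into a hash keyed by word.
my %has_pron;
open my $pron_fh, '<', $pron_file or die "Cannot open $pron_file: $!";
while ( my $line = <$pron_fh> ) {
    chomp $line;                          # don't forget the newlines
    my ($word) = split /\t/, $line, 2;
    $has_pron{$word} = 1;
}
close $pron_fh;

# Step 2: read the word list line by line and count hash hits.
my $matched = 0;
open my $words_fh, '<', $words_file or die "Cannot open $words_file: $!";
while ( my $word = <$words_fh> ) {
    chomp $word;
    $matched++ if exists $has_pron{$word};
}
close $words_fh;

print "$matched of the words have pronunciations\n";   # 2 with the demo data
```

    One pass over each file, so the cost is linear in the total number of lines rather than the product of the two file sizes.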

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Comparing large files
by LanX (Bishop) on Feb 11, 2014 at 19:30 UTC
    If 10 "Mg" is just the file size, that should come to roughly 1e6 words.

    IIRC each hash entry carries roughly 100 bytes of overhead, so putting all the words in a hash should be feasible even on my puny netbook.

    Parse the pronunciation-file line by line and build a lookup hash.

    Then parse the other file per line and look for missing entries.

    If you really have RAM problems, try splitting the hash into several disjoint ones (say, one for every 10% of the file) and parse the second file once per hash.

    Shouldn't take longer than seconds (at most minutes)
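    A sketch of that chunked variant, slicing the pronunciation file by line ranges so each pass builds a hash from one slice only (file names and sample data are made up, and each word is assumed to appear only once in the pronunciation file):

```perl
use strict;
use warnings;

my $pron_file  = 'pronunciations.txt';   # "word<TAB>pronunciation" per line
my $words_file = 'words.txt';            # one word per line
my $passes     = 10;                     # number of disjoint slices

# Demo setup (replace with your real files): write tiny sample inputs.
open my $out, '>', $pron_file or die $!;
print {$out} "cat\tK AE T\ndog\tD AO G\nfish\tF IH SH\n";
close $out;
open $out, '>', $words_file or die $!;
print {$out} "cat\ndog\nbird\n";
close $out;

# Count the pronunciation lines once so the file splits into equal slices.
my $total = 0;
open my $fh, '<', $pron_file or die $!;
$total++ while <$fh>;
close $fh;
my $chunk = int( $total / $passes ) + 1;

my $matched = 0;
for my $pass ( 0 .. $passes - 1 ) {
    # Build a hash from this slice of the pronunciation file only.
    my %slice;
    open my $pfh, '<', $pron_file or die $!;
    while ( my $line = <$pfh> ) {
        next unless int( ( $. - 1 ) / $chunk ) == $pass;
        chomp $line;
        my ($word) = split /\t/, $line, 2;
        $slice{$word} = 1;
    }
    close $pfh;

    # Scan the word list once per slice; the slices are disjoint, so no
    # word is double-counted across passes.
    open my $wfh, '<', $words_file or die $!;
    while ( my $word = <$wfh> ) {
        chomp $word;
        $matched++ if exists $slice{$word};
    }
    close $wfh;
}
print "$matched matches\n";
```

    The trade-off is re-reading both files once per slice in exchange for holding only a fraction of the hash in memory at a time.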

    HTH! :)

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: Comparing large files
by wjw (Priest) on Feb 11, 2014 at 19:41 UTC
    First I would start the other way around: look for words in the words-only file that match those with pronunciations, on the assumption that the smaller set is the one with pronunciations. That is, pronunciation-words -> words instead of words -> pronunciation-words.

    Next, I think I would look for uniqueness. With a 10Mg file, it is hard to imagine that some words are not in there more than once. That could reduce the whole thing substantially. (Guess I could be way wrong there... but...)

    The other thing I might look at if this is not a one-off type thing is using a database if
    one is handy.

    Otherwise: pumping the entries into a simple hash like
    $words{$word} = $pronunciation does a lot of this for you.
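    For instance (a sketch with made-up sample entries standing in for lines read from the pronunciation file; the comma separator is an assumption):

```perl
use strict;
use warnings;

my %words;   # word => pronunciation

# Stand-in for lines read from the pronunciation file.
my @lines = ( "cat,K AE T", "dog,D AO G", "cat,K AE T" );

for my $line (@lines) {
    my ( $word, $pronunciation ) = split /,/, $line, 2;
    $words{$word} = $pronunciation;   # duplicate words simply collapse
}

# A lookup is then a single hash access.
print exists $words{cat} ? "cat: $words{cat}\n" : "cat: not found\n";
```

    Note how the duplicate "cat" entry is absorbed for free: hash keys are unique, so de-duplication comes with the data structure.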

    Hope that is somewhat helpful..

    ...the majority is always wrong, and always the last to know about it...
    Insanity: Doing the same thing over and over again and expecting different results.

      Ah, in too much of a hurry...

      Turns out I am trying to replace a PostgreSQL database that already sort of does what I want (I created it myself), but I believe that Perl is a far more flexible and powerful tool.

      I can store my hashes, I believe, and once I have done that, I can do whatever I want... in my vague ideation.

      Any feelings about that? And thanks once again for great help

        I don't really see what you are trying to do but if you mention PostgreSQL: perhaps the hstore data type can help? It's a hash-like data type that is indexable (for read-mostly data use the GIN index).

        Of course, perl hashes will be much faster if memory serves ;-)

        If your words are in your data base already along with your pronunciations, then your job just got much easier.

        I am just guessing here, as I don't know the db schema you have...
        The approach remains the same whether you decide to use the DB or not.
        A database view will give you a long term solution. Combine that with Perl and you can do anything you want very quickly...
        Based on your last post, you have found what you were looking for which is what counts!

        Have a good one...

Re: Comparing large files
by Laurent_R (Canon) on Feb 11, 2014 at 23:32 UTC
    Hmm, not that easy... If your files are 10 MB, or probably even 100 MB, then a hash lookup is definitely the best solution. If your files are dozens of GB, then sorting them and comparing is the only solution I can think of. Between these two sizes, it's your call.
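    For the too-big-for-RAM case, once both files are sorted (e.g. with the system's sort utility), a single merge pass counts the matches without loading either file into memory -- a sketch assuming one sorted word per line (the file names and demo data are made up):

```perl
use strict;
use warnings;

# Demo setup (replace with your real, pre-sorted files):
open my $t, '>', 'words.sorted' or die $!;
print {$t} "ant\ncat\ndog\n";
close $t;
open $t, '>', 'pron_words.sorted' or die $!;
print {$t} "bee\ncat\ndog\nfox\n";
close $t;

# Merge walk over two sorted files: advance whichever side is behind.
open my $a_fh, '<', 'words.sorted'      or die $!;
open my $b_fh, '<', 'pron_words.sorted' or die $!;

my $a = <$a_fh>;
my $b = <$b_fh>;
my $matched = 0;

while ( defined $a and defined $b ) {
    chomp( my $wa = $a );
    chomp( my $wb = $b );
    if    ( $wa lt $wb ) { $a = <$a_fh> }
    elsif ( $wa gt $wb ) { $b = <$b_fh> }
    else {                               # equal: a match
        $matched++;
        $a = <$a_fh>;
        $b = <$b_fh>;
    }
}
print "$matched matches\n";
```

    Memory use is constant (one line from each file at a time), so this works no matter how large the files are, at the cost of the up-front sort.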
Re: Comparing large files
by Herbert37 (Novice) on Feb 12, 2014 at 05:15 UTC
    Thank all of you very much. Really and truly superb help. Thanks. As to Mg, well, I have written mg more often than Mb over the course of time, and I think Freud will give me a pass. Thanks again.

Node Type: perlquestion [id://1074474]
Approved by BrowserUk