PerlMonks  

Re: Comparing large files

by wjw (Deacon)
on Feb 11, 2014 at 19:41 UTC (#1074487)


in reply to Comparing large files

First, I would turn the comparison around: take each word that has a pronunciation and look
for it in the words-only file, rather than the reverse. The assumption is that the set with
pronunciations is the smaller of the two.
pronunciation-words -> words
instead of
words -> pronunciation-words

Next, I think I would check for uniqueness. With a 10 MB file, it is hard to imagine that some
words do not appear more than once. Removing duplicates could shrink the whole job substantially.
(I guess I could be way wrong there, but...)
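
A minimal sketch of that uniqueness check. The one-word-per-line layout, the case folding, and the sample data are all my assumptions, since the original question's file format is not shown here:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines read from the large word file.
my @lines = ("apple\n", "Banana\n", "apple\n", "cherry\n", "banana\n");

my %seen;      # word => number of times we have met it
my @unique;    # first occurrence of each word, in input order

for my $line (@lines) {
    chomp(my $word = $line);
    $word = lc $word;           # assumption: case does not matter
    next if $seen{$word}++;     # skip anything already seen
    push @unique, $word;
}

print scalar(@lines), " lines, ", scalar(@unique), " unique words\n";
```

If the duplicate count turns out to be small, this step costs little and can simply be dropped.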

The other thing I might look at, if this is not a one-off job, is using a database if
one is handy.

Otherwise, loading the data into a simple hash like
$words{$word} = $pronunciation does a lot of this for you.
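
Roughly what I have in mind, as a sketch only. The tab-separated "word, pronunciation" layout and the sample entries are made up for illustration; in practice both lists would be read from the files:

```perl
use strict;
use warnings;

# Hypothetical data: the smaller set, one "word<TAB>pronunciation" per line.
my @pron_lines = ("cat\tK AE T", "dog\tD AO G");
# Hypothetical data: the larger, words-only set.
my @word_lines = ("bird", "cat", "dog", "emu");

# Load the smaller set into the hash once.
my %words;
for my $line (@pron_lines) {
    my ($word, $pronunciation) = split /\t/, $line, 2;
    $words{$word} = $pronunciation;
}

# Walk the larger set; each check is a single hash lookup.
my (@matched, @missing);
for my $word (@word_lines) {
    if (exists $words{$word}) {
        push @matched, "$word => $words{$word}";
    } else {
        push @missing, $word;
    }
}

print "matched: $_\n" for @matched;
print "no pronunciation: $_\n" for @missing;
```

Only the smaller set has to sit in memory; the larger file can be read line by line with a while loop instead of an array.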

Hope that is somewhat helpful..

...the majority is always wrong, and always the last to know about it...
Insanity: Doing the same thing over and over again and expecting different results.


Re^2: Comparing large files
by Herbert37 (Novice) on Feb 12, 2014 at 20:44 UTC

    Ah, in too much of a hurry...

    It turns out I am trying to replace a PostgreSQL database that already sort of does what I want (I created it myself), but I believe that Perl is a far more flexible and powerful tool.

    I can store my hashes, I believe, and once I have done that, I can do whatever I want... in my vague ideation.
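
    One way storing a hash could look, as a sketch using the core Storable module. The sample entries and the temporary file are placeholders of mine, not anything from the actual data:

    ```perl
    use strict;
    use warnings;
    use Storable qw(store retrieve);
    use File::Temp qw(tempfile);

    # Hypothetical sample entries standing in for the real word list.
    my %words = (cat => 'K AE T', dog => 'D AO G');

    # A temporary file keeps the sketch self-contained; a real script
    # would use a fixed path of its own choosing.
    my ($fh, $file) = tempfile(SUFFIX => '.sto', UNLINK => 1);
    close $fh;

    store(\%words, $file);            # write the hash to disk
    my $restored = retrieve($file);   # read it back as a hash reference

    print "$_ => $restored->{$_}\n" for sort keys %$restored;
    ```

    Storable files are a Perl-specific binary format, so this suits a Perl-only workflow rather than sharing data with other tools.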

    Any feelings about that? And thanks once again for the great help.

      I don't really see what you are trying to do, but since you mention PostgreSQL: perhaps the hstore data type can help? It's a hash-like data type that is indexable (for read-mostly data, use a GIN index).

      Of course, Perl hashes will be much faster, if memory serves ;-)

      If your words are already in your database along with your pronunciations, then your job just got much easier.

      I am just guessing here, as I don't know the DB schema you have...
      The approach remains the same whether you decide to use the DB or not.
      A database view will give you a long-term solution. Combine that with Perl and you can do anything you want very quickly...
      Based on your last post, you have found what you were looking for, which is what counts!

      Have a good one...
