
hash of unique words

by Ninke (Novice)
on Apr 22, 2013 at 16:23 UTC (#1029916=perlquestion)
Ninke has asked for the wisdom of the Perl Monks concerning the following question:

Dear perlmonks!

Once again I have a question. Usually, when I write up a problem for the forum, I think of smaller examples and adjust the code, and sometimes I work out the answer myself just by describing the task in writing. But this time I can't get a very simple thing to work, and I have been stuck on it for two days.

This concerns sentence alignment. I have two files, one in English and one in another language, with the same number of lines; each line is a translation of the corresponding line in the other file (the number of words per line may differ). Like this:

>>>FILE-EN>>>
The cat sees the dog
The rat is in the cat
The cat runs
>>>FILE-RU>>>
Koshka vidit sobaku
Krisa v koshke
Koshka bezhit

For each English word in FILE-EN I need to count the unique Russian words that it could hypothetically be aligned to, that is, the distinct words on the Russian side of every sentence pair in which the English word occurs. For example, the word "The" occurs in every sentence and can be aligned to any Russian word, so $uniform{"The"} should be 7 (the word 'Koshka' occurs twice), but I get $uniform{"The"} = 8 because repeated words are counted each time.

So far I can only calculate the count that includes repeated words. What should I use: a hash of arrays of unique words? Or some trick with hashes? I have commented out the stuff I tried (collecting only the unique foreign words); it does not work :)

#!/usr/bin/perl
use strict;
use utf8;
use warnings;
use Data::Dumper;

open ENGLISH, "corpus.e" or die $!;
open FOREIGN, "corpus.f" or die $!;

my @sents_en;
my @sents_f;

while (<ENGLISH>){
    chomp;
    push @sents_en, $_;
}
while (<FOREIGN>){
    chomp;
    push @sents_f, $_;
}

my %uniform;
my $k; #index of english/foreign sentence
for ($k = 0; $k <= $#sents_en; $k++){
    my @words_en;
    my @words_f;
    @words_en = map { split / / } $sents_en[$k];
    @words_f  = map { split / / } $sents_f[$k];
    my $j;
    for ($j = 0; $j <= $#words_en; $j++ ){
        my $i;
        my %seen;
        for ($i = 0; $i <= $#words_f; $i++){
            #$seen{$words_f[$i]}++; #TRY TO COUNT UNIQUE WORDS
            if ( defined( $uniform{ $words_en[$j] } ) ) { # and !$seen{$words_f[$i]}) ) {
                $uniform{ $words_en[$j] } ++;
            }
            else {
                $uniform{ $words_en[$j]} = 1;
            }
        }
    }
}
print Dumper \%uniform;

These are the numbers I get:

$VAR1 = {
          'the' => 6,
          'rat' => 3,
          'is' => 3,
          'cat' => 8,
          'dog' => 3,
          'in' => 3,
          'runs' => 2,
          'sees' => 3,
          'The' => 8
        };

...and I need the counts for unique words. Thank you in advance and sorry for too many letters:)

Replies are listed 'Best First'.
Re: hash of unique words
by choroba (Chancellor) on Apr 22, 2013 at 16:37 UTC
    Rather than hash of arrays, use a hash of hashes. At the end, replace each inner hash with its number of keys:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Data::Dumper;

    open my $ENGLISH, '<', 'corpus.e' or die $!;
    open my $FOREIGN, '<', 'corpus.f' or die $!;

    chomp(my @sents_en = <$ENGLISH>);
    chomp(my @sents_f  = <$FOREIGN>);

    my %uniform;
    for my $sentence_index (0 .. $#sents_en) {
        my @words_en = split ' ', $sents_en[$sentence_index];
        my @words_f  = split ' ', $sents_f[$sentence_index];
        for my $word_index (0 .. $#words_en) {
            $uniform{ $words_en[$word_index] }{$_}++ for @words_f;
        }
    }

    for my $word (keys %uniform) {
        $uniform{$word} = keys %{ $uniform{$word} };
    }
    print Dumper \%uniform;
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
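
A self-contained sketch of the same hash-of-hashes idea, run on the sample sentences from the question rather than on corpus.e and corpus.f. The name %seen is hypothetical, and the counts are collected into a separate %uniform instead of overwriting the inner hashes; this is one possible arrangement, not the only one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample sentences from the question, held in memory instead of read from files.
my @sents_en = ('The cat sees the dog', 'The rat is in the cat', 'The cat runs');
my @sents_f  = ('Koshka vidit sobaku',  'Krisa v koshke',        'Koshka bezhit');

# Hypothetical name: English word => { foreign word => times seen together }
my %seen;
for my $i (0 .. $#sents_en) {
    my @words_f = split ' ', $sents_f[$i];
    for my $word_en (split ' ', $sents_en[$i]) {
        # A repeated foreign word only bumps an existing key,
        # so the set of keys stays unique.
        $seen{$word_en}{$_}++ for @words_f;
    }
}

# Number of distinct foreign words per English word.
my %uniform = map { $_ => scalar keys %{ $seen{$_} } } keys %seen;

print "$_ => $uniform{$_}\n" for sort keys %uniform;
```

With this data, "The" comes out as 7 and "runs" as 2, because 'Koshka' is stored as a single key however many times it is seen.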
      Choroba, thanks very much, that does exactly what I want. A nice trick with {$_} for @words_f, which cut the number of lines in half :) Though I don't fully understand the magic yet, especially how a hash of hashes ($uniform{ $words_en[$word_index] }{$_}) turns into a one-dimensional hash: $uniform{$word}. I just need to use it in practice and then I'll get it :)
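
The step that turns the hash of hashes into a flat hash can be looked at in isolation. The sketch below uses a single English word and a made-up list of foreign words (an assumption for illustration only): each value of %uniform starts out as a reference to an inner hash, and the assignment then replaces that reference with a plain number, because keys in scalar context returns the number of keys.

```perl
use strict;
use warnings;

my %uniform;

# The inner hash springs into existence on first use (autovivification).
$uniform{The}{$_}++ for qw(Koshka vidit sobaku Koshka bezhit);

# Here $uniform{The} is a reference to
# { Koshka => 2, vidit => 1, sobaku => 1, bezhit => 1 }.
my $type_before = ref $uniform{The};    # 'HASH'

# Scalar assignment puts keys in scalar context, so it yields a count (4),
# and that count overwrites the hash reference stored under 'The'.
$uniform{The} = keys %{ $uniform{The} };

print "before: $type_before, after: $uniform{The}\n";
# prints "before: HASH, after: 4"
```

So the hash never changes dimension by itself: the final loop simply throws the inner hashes away and keeps only their sizes.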
