http://www.perlmonks.org?node_id=1067263


in reply to Aligning text and then perform calculations

Here's another option:

use strict;
use warnings;
use Text::ASCIITable;
use open qw(:std :utf8);

my %hash;
my $tb = Text::ASCIITable->new();
$tb->setCols( 'WordF1', 'WordF2', 'Difference' );

while (<>) {
    # Skip the two header lines of the table.
    next if $. < 3;

    # Capture each word and its value from the piped table cells.
    push @{ $hash{$1} }, $2 while /\|\s+(\w+)\s+\|\s+([.\d]+)/g;
}

for my $word ( keys %hash ) {
    if ( @{ $hash{$word} } == 2 ) {
        # Word appeared in both files: store the difference.
        $hash{$word} = $hash{$word}->[0] - $hash{$word}->[1];
    }
    else {
        # Word appeared in only one file: discard it.
        delete $hash{$word};
    }
}

for my $word ( sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
    $tb->addRow( $word, $word, sprintf( '%0.05f', $hash{$word} ) );
}

print $tb;

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

.--------------------------------------.
| WordF1     | WordF2     | Difference |
+------------+------------+------------+
| politici   | politici   | 0.01940    |
| referendum | referendum | 0.01726    |
| verità     | verità     | 0.01454    |
| scandalo   | scandalo   | 0.00978    |
| consenso   | consenso   | 0.00887    |
| vergogna   | vergogna   | 0.00592    |
'------------+------------+------------'

The script first builds a hash of arrays (HoA), pairing each word with its associated value(s). Next, it iterates through the hash, deleting key/value pairs for words that occur in only one file and pairing each remaining word with the calculated difference. Lastly, it builds the table, sorting the rows by descending Difference, since your original table displayed words in descending percentage. Use $hash{$a} <=> $hash{$b} if you want the rows in ascending Difference.
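To see the two sort directions concretely, here's a minimal sketch using made-up words and differences (not from the real data):

```perl
use strict;
use warnings;

# Hypothetical word => difference pairs, for illustration only.
my %diff = ( alpha => 0.5, beta => 0.1, gamma => 0.9 );

# Descending Difference, as the script above does:
my @desc = sort { $diff{$b} <=> $diff{$a} } keys %diff;
print "@desc\n";    # gamma alpha beta

# Ascending Difference, with $a and $b swapped:
my @asc = sort { $diff{$a} <=> $diff{$b} } keys %diff;
print "@asc\n";     # beta alpha gamma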

You said, "I have processed two text files..." I (somehow) get the impression that each of the two files contains a corpus which was processed to generate your original table (perhaps you sent a program a list of files to analyze), rather than the files merely holding word/value pairs. Is this correct? If not, and you do have word/value pairs in those files, consider the file-based solutions offered.

Hope this helps!

Edit: Below is a script which takes two files containing the two data sets you posted earlier. It's just slightly modified from the script above:

use strict;
use warnings;
use Text::ASCIITable;
use open qw(:std :utf8);

my %hash;
my $tb = Text::ASCIITable->new();
$tb->setCols( 'WordF1', 'WordF2', 'Difference' );

while (<>) {
    # Take the first and last whitespace-separated fields of each line.
    my ( $word, $val ) = (split)[ 0, -1 ];
    push @{ $hash{$word} }, $val;
}

for my $word ( keys %hash ) {
    if ( @{ $hash{$word} } == 2 ) {
        # Word appeared in both files: store the difference.
        $hash{$word} = $hash{$word}->[0] - $hash{$word}->[1];
    }
    else {
        # Word appeared in only one file: discard it.
        delete $hash{$word};
    }
}

for my $word ( sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
    $tb->addRow( $word, $word, sprintf( '%0.06f', $hash{$word} ) );
}

print $tb;
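The main change from the first script is the parsing line: (split)[ 0, -1 ] splits $_ on whitespace and takes a slice of the first and last fields, so it works even if extra columns sit between the word and its value. A minimal sketch with a made-up input line:

```perl
use strict;
use warnings;

# Hypothetical data line: word first, value last, an extra field between.
local $_ = "consenso 123 0.008922";

my ( $word, $val ) = (split)[ 0, -1 ];
print "$word $val\n";    # consenso 0.008922
```

With a plain two-column line, index 0 and index -1 simply pick the two fields.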

Usage: perl script.pl inFile1 inFile2 [>outFile]

Output on your datasets:

.----------------------------------------------------.
| WordF1            | WordF2            | Difference |
+-------------------+-------------------+------------+
| consensi          | consensi          | 0.000626   |
| disonesti         | disonesti         | 0.000507   |
| antidemocratico   | antidemocratico   | 0.000102   |
| antidemocraticità | antidemocraticità | 0.000029   |
| antidemocratica   | antidemocratica   | -0.000014  |
| antidemocratici   | antidemocratici   | -0.000017  |
| consensuali       | consensuali       | -0.000040  |
| antidemocratiche  | antidemocratiche  | -0.000130  |
| consensuale       | consensuale       | -0.000230  |
| consenso          | consenso          | -0.008922  |
'-------------------+-------------------+------------'

Re^2: Aligning text and then perform calculations
by epimenidecretese (Acolyte) on Dec 17, 2013 at 14:56 UTC
    You said, "I have processed two text files..." I (somehow) get the impression that each of the two files contains a corpus which was processed to generate your original table (perhaps you sent a program a list of files to analyze), rather than the files merely holding word/value pairs. Is this correct?

    You got it right. I am doing some NLP. I have two corpora, which I tokenized, and then I simply ran some queries over them. Thank you very much for your help.

      You're most welcome, epimenidecretese!