PerlMonks  

Re: Aligning text and then perfom calculations

by Kenosis (Priest)
on Dec 15, 2013 at 22:37 UTC (#1067263)


in reply to Aligning text and then perfom calculations

Here's another option:

    use strict;
    use warnings;
    use Text::ASCIITable;
    use open qw(:std :utf8);

    my %hash;
    my $tb = Text::ASCIITable->new();
    $tb->setCols( 'WordF1', 'WordF2', 'Difference' );

    # Build a hash of arrays: word => [ value, ... ]
    while (<>) {
        next if $. < 3;    # skip the first two input lines
        push @{ $hash{$1} }, $2 while /\|\s+(\w+)\s+\|\s+([.\d]+)/g;
    }

    # Keep only words with exactly two values and store their difference
    for my $word ( keys %hash ) {
        if ( @{ $hash{$word} } == 2 ) {
            $hash{$word} = $hash{$word}->[0] - $hash{$word}->[1];
        }
        else {
            delete $hash{$word};
        }
    }

    # Add the rows in descending order of difference
    for my $word ( sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
        $tb->addRow( $word, $word, sprintf( '%0.05f', $hash{$word} ) );
    }

    print $tb;

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

    .--------------------------------------.
    | WordF1     | WordF2     | Difference |
    +------------+------------+------------+
    | politici   | politici   |    0.01940 |
    | referendum | referendum |    0.01726 |
    | verità     | verità     |    0.01454 |
    | scandalo   | scandalo   |    0.00978 |
    | consenso   | consenso   |    0.00887 |
    | vergogna   | vergogna   |    0.00592 |
    '------------+------------+------------'

The script initially creates a hash of arrays (HoA), pairing the word with the associated value(s). Next, it iterates through the hash, removing key/value pairs for those words occurring in only one file, then pairs the word with the calculated difference. Lastly, it builds the table, sorting the rows in descending Difference, since your original table displayed words in descending percentage. Use $hash{$a} <=> $hash{$b} if you want the rows shown in ascending Difference.
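Those three steps can also be sketched with core Perl only, leaving out Text::ASCIITable. This is just an illustration, not the script above: the sample values and the sub names build_hoa, differences, and sorted_words are made up, and the real numbers would come from your input.

```perl
use strict;
use warnings;

# Hypothetical word/value pairs; real data would come from the input file.
my @sample = (
    [ consenso   => 0.5 ],
    [ politici   => 0.75 ],
    [ consenso   => 0.25 ],
    [ politici   => 0.25 ],
    [ referendum => 0.5 ],    # seen only once, so it gets dropped
);

# Step 1: hash of arrays, word => [ values in the order seen ]
sub build_hoa {
    my @pairs = @_;
    my %hoa;
    push @{ $hoa{ $_->[0] } }, $_->[1] for @pairs;
    return \%hoa;
}

# Step 2: keep only words with exactly two values; store first minus second
sub differences {
    my ($hoa) = @_;
    my %diff;
    for my $word ( keys %$hoa ) {
        my @vals = @{ $hoa->{$word} };
        next unless @vals == 2;
        $diff{$word} = $vals[0] - $vals[1];
    }
    return \%diff;
}

# Step 3: words ordered by descending difference
sub sorted_words {
    my ($diff) = @_;
    return sort { $diff->{$b} <=> $diff->{$a} } keys %$diff;
}

my $diff = differences( build_hoa(@sample) );
printf "%-12s %.5f\n", $_, $diff->{$_} for sorted_words($diff);
```

Unlike the script above, this sketch builds a second hash for the differences instead of overwriting the array references in place. That is only a readability choice; the result is the same.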

You said, "I have processed two text files..." I (somehow) get the impression that each of the two files contains a corpus that was processed to generate your original table (perhaps you sent a program a list of files to analyze)--this, instead of the files merely holding word/value pairs. Is this correct? If not, and those files do hold word/value pairs, consider the offered file-based solutions.

Hope this helps!

Edit: Below is a script which takes two files containing the two data sets you posted earlier. It's just slightly modified from the script above:

    use strict;
    use warnings;
    use Text::ASCIITable;
    use open qw(:std :utf8);

    my %hash;
    my $tb = Text::ASCIITable->new();
    $tb->setCols( 'WordF1', 'WordF2', 'Difference' );

    # Build a hash of arrays from the first and last field of each line
    while (<>) {
        my ( $word, $val ) = (split)[ 0, -1 ];
        push @{ $hash{$word} }, $val;
    }

    # Keep only words with exactly two values and store their difference
    for my $word ( keys %hash ) {
        if ( @{ $hash{$word} } == 2 ) {
            $hash{$word} = $hash{$word}->[0] - $hash{$word}->[1];
        }
        else {
            delete $hash{$word};
        }
    }

    # Add the rows in descending order of difference
    for my $word ( sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
        $tb->addRow( $word, $word, sprintf( '%0.06f', $hash{$word} ) );
    }

    print $tb;

Usage: perl script.pl inFile1 inFile2 [>outFile]

Output on your datasets:

    .----------------------------------------------------.
    | WordF1            | WordF2            | Difference |
    +-------------------+-------------------+------------+
    | consensi          | consensi          |   0.000626 |
    | disonesti         | disonesti         |   0.000507 |
    | antidemocratico   | antidemocratico   |   0.000102 |
    | antidemocraticità | antidemocraticità |   0.000029 |
    | antidemocratica   | antidemocratica   |  -0.000014 |
    | antidemocratici   | antidemocratici   |  -0.000017 |
    | consensuali       | consensuali       |  -0.000040 |
    | antidemocratiche  | antidemocratiche  |  -0.000130 |
    | consensuale       | consensuale       |  -0.000230 |
    | consenso          | consenso          |  -0.008922 |
    '-------------------+-------------------+------------'
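The main change in the second script is how each line is parsed: (split)[ 0, -1 ] takes the first and last whitespace-separated fields of the line. Here's a minimal sketch of just that step, assuming each line starts with the word and ends with its value. The sample line and the helper name first_and_last are hypothetical:

```perl
use strict;
use warnings;

# Return the first and last whitespace-separated fields of a line.
sub first_and_last {
    my ($line) = @_;
    return ( split ' ', $line )[ 0, -1 ];
}

# Hypothetical line layout: word first, value last, anything in between ignored.
my ( $word, $val ) = first_and_last("consenso  42  0.25\n");
print "$word => $val\n";
```

A blank line has no fields, so both variables would come back undefined. If your files might contain blank lines, something like next unless defined $word; inside the while loop is one way to skip them.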


Re^2: Aligning text and then perfom calculations
by epimenidecretese (Acolyte) on Dec 17, 2013 at 14:56 UTC
    You said, "I have processed two text files..." I (somehow) get the impression that each of the two files contain a corpus which underwent processing resulting in generating your original table (perhaps you sent a program a list of files to analyze)--this, instead of merely having word/value pairs in those two files. Is this correct?

    You got it right. I am doing some NLP. I have two corpora, which I tokenized and then simply ran some queries over. Thank you very much for your help.

      You're most welcome, epimenidecretese!
