http://www.perlmonks.org?node_id=1067251

epimenidecretese has asked for the wisdom of the Perl Monks concerning the following question:

Ciao guys, I'm trying to solve a problem with some text I have. I have processed two text files and now I have something like this:

| wordF1     | percentageF1 | wordF2       | percentageF2 |
|------------+--------------+--------------+--------------|
| politici   | 0.0489       | politici     | 0.0295       |
| referendum | 0.0238       | consenso     | 0.0126       |
| verità     | 0.0198       | referendum   | 0.00654      |
| scandalo   | 0.0112       | verità       | 0.00526      |
| vergogna   | 0.00723      | tradizionali | 0.00343      |
| corrotto   | 0.00439      | tradizione   | 0.00266      |
| scandali   | 0.00394      | tradizioni   | 0.00234      |
| consenso   | 0.00373      | tradizionale | 0.0022       |
| corrotti   | 0.00332      | scandalo     | 0.00142      |
| propaganda | 0.0027       | vergogna     | 0.00131      |
|------------+--------------+--------------+--------------|

What I am trying to do is align the words (I understand I need some kind of string comparison, but I don't know how to do it), keeping file1 as the reference: if a word is present in file1 but not in file2, the whole row should be deleted. Once this is done, I would like to compute the differences of the percentages (f1-f2).

At the end I would like something like this:

| wordF1     | wordF2     | difference |
|------------+------------+------------|
| politici   | politici   | +0.5       |
| referendum | referendum | +0.126     |
| verità     | verità     | +0.006     |
| ...        | ...        | ...        |
|------------+------------+------------|

I was trying to do this in awk, but after many attempts I gave up. If somebody could help, I'd be very happy.

One of Crete's own prophets has said it: 'Cretans are always liars, evil brutes, lazy gluttons'.
He has surely told the truth.

Re: Aligning text and then perfom calculations
by choroba (Cardinal) on Dec 15, 2013 at 21:42 UTC
    The following code transforms the first table into the second. It might be easier, though, to skip the creation of the first table and do the calculations right when processing the two text files.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Aligning text and then perfom calculations
by GrandFather (Saint) on Dec 15, 2013 at 21:37 UTC

    Use the split technique kcott showed you in Re: Condition on multiple lines and big file to split your line up into an array. Try writing the code for yourself and if you run into trouble come back for more help.

    So far you've asked for and been given a lot of fish. Now it's time you learned to fish for yourself.

    True laziness is hard work

      I went out fishing and brought something home. I reshaped the data to fit the previous script, and now I can print the rows that match, but I still can't skip the ones that don't.

      Thank you very much for pointing me in the right direction. I'd be happy to figure it out myself if you could give me one more tip along these lines.

      What I can't figure out is how to sort before comparing and printing, so that I get the words that are present in both lists but are not aligned.

      #!/usr/bin/env perl
      use strict;
      use warnings;

      while (<DATA>) {
          my ($f1, $f2, $perc1, $perc2) = (split)[0,-3,2,-1];
          if ($f1 eq $f2) {
              print $f1, ($perc1-$perc2), "\n";
          }
          else {
              next;
          }
      }
      print "\n";
      __DATA__
      antidemocratica 8 0.000274459 antidemocratica 58 0.000288782
      antidemocratiche 1 3.43074e-05 antidemocratiche 33 0.000164307
      antidemocratici 4 0.00013723 antidemocratici 31 0.000154349
      antidemocraticità 1 3.43074e-05 antidemocraticità 1 4.979e-06
      antidemocratico 14 0.000480303 antidemocratico 76 0.000378404
      antidemocratico.questa 1 3.43074e-05 consensi 74 0.000368446
      consensi 29 0.000994914 consenso 2543 0.0126616
      consenso 109 0.00373951 consensocrazia 1 4.979e-06
      consensuale 2 6.86148e-05 consensuale 60 0.00029874
      consensuali 1 3.43074e-05 consensuali 15 7.4685e-05
      consensus 2 6.86148e-05 consensualmente 9 4.4811e-05
      corrotto 128 0.00439135 disonesta 7 3.4853e-05
      disonesti 19 0.00065184 disonesti 29 0.000144391

      OUTPUT:

      antidemocratica-1.4323e-05
      antidemocratiche-0.0001299996
      antidemocratici-1.7119e-05
      antidemocraticità2.93284e-05
      antidemocratico0.000101899
      consensuale-0.0002301252
      consensuali-4.03776e-05
      disonesti0.000507449


        This is more efficient if you have the data available as two files. Build a lookup table (hash) using the first file then consult it while reading the second file:

        #!/usr/bin/env perl
        use strict;
        use warnings;

        my $f1 = <<F1;
        antidemocratica 8 0.000274459
        antidemocratiche 1 3.43074e-05
        antidemocratici 4 0.00013723
        antidemocraticità 1 3.43074e-05
        antidemocratico 14 0.000480303
        antidemocratico.questa 1 3.43074e-05
        consensi 29 0.000994914
        consenso 109 0.00373951
        consensuale 2 6.86148e-05
        consensuali 1 3.43074e-05
        consensus 2 6.86148e-05
        corrotto 128 0.00439135
        disonesti 19 0.00065184
        F1

        my $f2 = <<F2;
        antidemocratica 58 0.000288782
        antidemocratiche 33 0.000164307
        antidemocratici 31 0.000154349
        antidemocraticità 1 4.979e-06
        antidemocratico 76 0.000378404
        consensi 74 0.000368446
        consenso 2543 0.0126616
        consensocrazia 1 4.979e-06
        consensuale 60 0.00029874
        consensuali 15 7.4685e-05
        consensualmente 9 4.4811e-05
        disonesta 7 3.4853e-05
        disonesti 29 0.000144391
        F2

        my %f1Words;

        open my $fIn, '<', \$f1;
        while (<$fIn>) {
            chomp;
            my ($word, $num, $value) = split;
            $f1Words{$word} = $value;
        }
        close $fIn;

        open $fIn, '<', \$f2;
        while (<$fIn>) {
            chomp;
            my ($word, $num, $value) = split;
            next if ! exists $f1Words{$word};
            print "$word ", $f1Words{$word} - $value, "\n";
        }
        close $fIn;

        Prints:

        antidemocratica -1.4323e-005
        antidemocratiche -0.0001299996
        antidemocratici -1.7119e-005
        antidemocraticità 2.93284e-005
        antidemocratico 0.000101899
        consensi 0.000626468
        consenso -0.00892209
        consensuale -0.0002301252
        consensuali -4.03776e-005
        disonesti 0.000507449

        If you only have the combined rows available then you need two lookup tables. Populate the tables in the file input loop, then loop over the keys from one of the tables to generate the output:

        #!/usr/bin/env perl
        use strict;
        use warnings;

        my %f1Entries;
        my %f2Entries;

        while (<DATA>) {
            my ($f1, $f2, $perc1, $perc2) = (split)[0, -3, 2, -1];
            $f1Entries{$f1} = $perc1;
            $f2Entries{$f2} = $perc2;
        }

        for my $f2 (sort keys %f2Entries) {
            next if ! exists $f1Entries{$f2};
            print "$f2 ", $f1Entries{$f2} - $f2Entries{$f2}, "\n";
        }

        __DATA__
        antidemocratica 8 0.000274459 antidemocratica 58 0.000288782
        antidemocratiche 1 3.43074e-05 antidemocratiche 33 0.000164307
        antidemocratici 4 0.00013723 antidemocratici 31 0.000154349
        antidemocraticità 1 3.43074e-05 antidemocraticità 1 4.979e-06
        antidemocratico 14 0.000480303 antidemocratico 76 0.000378404
        antidemocratico.questa 1 3.43074e-05 consensi 74 0.000368446
        consensi 29 0.000994914 consenso 2543 0.0126616
        consenso 109 0.00373951 consensocrazia 1 4.979e-06
        consensuale 2 6.86148e-05 consensuale 60 0.00029874
        consensuali 1 3.43074e-05 consensuali 15 7.4685e-05
        consensus 2 6.86148e-05 consensualmente 9 4.4811e-05
        corrotto 128 0.00439135 disonesta 7 3.4853e-05
        disonesti 19 0.00065184 disonesti 29 0.000144391

        prints:

        antidemocratica -1.4323e-005
        antidemocratiche -0.0001299996
        antidemocratici -1.7119e-005
        antidemocraticità 2.93284e-005
        antidemocratico 0.000101899
        consensi 0.000626468
        consenso -0.00892209
        consensuale -0.0002301252
        consensuali -4.03776e-005
        disonesti 0.000507449
Re: Aligning text and then perfom calculations
by shmem (Chancellor) on Dec 15, 2013 at 22:01 UTC

    The data you present is perfect for a hash (see perldata), since each row consists of a key (e.g. "politici") and a value (0.0489). You have two files with that structure, so you set up two hashes. You can then iterate over the keys of one hash with keys and see if it is present in the other hash with exists. Then access the corresponding values of both hashes and do your calculation.

    for my $key (keys %left) {
        if ( exists $right{$key} ) {
            # the original post had "my result" here; the sigil is required
            my $result = $left{$key} - $right{$key};
            print "$key | $result\n";
        }
    }

    Storing the key/value tuples from the files into the hashes %left and %right is left as an exercise to the reader.
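    For anyone who wants to check their answer to that exercise, here is a minimal sketch of one way to fill the two hashes. It assumes each input line looks like "word count percentage" (the layout used in the data posted elsewhere in this thread); the sub names load_percentages and differences and the in-memory sample strings are made up for illustration and stand in for the two real files.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read "word count percentage"
# lines from a filehandle and return a hash ref of word => percentage.
sub load_percentages {
    my ($fh) = @_;
    my %table;
    while ( my $line = <$fh> ) {
        my ( $word, undef, $value ) = split ' ', $line;
        next unless defined $value;    # skip blank or short lines
        $table{$word} = $value;
    }
    return \%table;
}

# Words present in both tables, mapped to left minus right.
sub differences {
    my ( $left, $right ) = @_;
    my %diff;
    for my $key ( keys %$left ) {
        next unless exists $right->{$key};
        $diff{$key} = $left->{$key} - $right->{$key};
    }
    return \%diff;
}

# In-memory sample data standing in for the two files (assumed values).
my $left_text  = "politici 10 0.0489\ncorrotto 3 0.00439\n";
my $right_text = "politici 7 0.0295\nconsenso 5 0.0126\n";

open my $left_fh,  '<', \$left_text  or die "Cannot open left data: $!";
open my $right_fh, '<', \$right_text or die "Cannot open right data: $!";

my $diff = differences( load_percentages($left_fh), load_percentages($right_fh) );
print "$_ | $diff->{$_}\n" for sort keys %$diff;
```

    With real files you would open them by name instead of by reference to a string; only politici survives here because it is the only word present on both sides.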

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Aligning text and then perfom calculations
by Kenosis (Priest) on Dec 15, 2013 at 22:37 UTC

    Here's another option:

    use strict;
    use warnings;
    use Text::ASCIITable;
    use open qw(:std :utf8);

    my %hash;
    my $tb = Text::ASCIITable->new();
    $tb->setCols( 'WordF1', 'WordF2', 'Difference' );

    while (<>) {
        next if $. < 3;
        push @{ $hash{$1} }, $2 while /\|\s+(\w+)\s+\|\s+([.\d]+)/g;
    }

    for my $word ( keys %hash ) {
        if ( @{ $hash{$word} } == 2 ) {
            $hash{$word} = $hash{$word}->[0] - $hash{$word}->[1];
        }
        else {
            delete $hash{$word};
        }
    }

    for my $word ( sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
        $tb->addRow( $word, $word, sprintf( '%0.05f', $hash{$word} ) );
    }

    print $tb;

    Usage: perl script.pl inFile [>outFile]

    The last, optional parameter directs output to a file.

    Output on your dataset:

    .--------------------------------------.
    | WordF1     | WordF2     | Difference |
    +------------+------------+------------+
    | politici   | politici   | 0.01940    |
    | referendum | referendum | 0.01726    |
    | verità     | verità     | 0.01454    |
    | scandalo   | scandalo   | 0.00978    |
    | consenso   | consenso   | 0.00887    |
    | vergogna   | vergogna   | 0.00592    |
    '------------+------------+------------'

    The script initially creates a hash of arrays (HoA), pairing the word with the associated value(s). Next, it iterates through the hash, removing key/value pairs for those words occurring in only one file, then pairs the word with the calculated difference. Lastly, it builds the table, sorting the rows in descending Difference, since your original table displayed words in descending percentage. Use $hash{$a} <=> $hash{$b} if you want the rows shown in ascending Difference.
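    One thing to watch with the hash-of-arrays approach: values are pushed in the order they are encountered, so for a word whose wordF2 cell appears on an earlier row than its wordF1 cell (consenso in the sample table), the subtraction runs F2 minus F1. That is why consenso comes out as 0.00887 in the output above, whereas f1-f2 would be negative. A minimal sketch that keeps the two columns in separate hashes, so the difference is always F1 minus F2, might look like the following. The sub name parse_table is made up for illustration, and the sample is cut down to three data rows from the original table.

```perl
use strict;
use warnings;

# Hypothetical helper: split table rows of the form
# "| wordF1 | percF1 | wordF2 | percF2 |" into one hash per file.
sub parse_table {
    my @rows = @_;
    my ( %f1, %f2 );
    for my $row (@rows) {
        # Header and rule lines have no numeric cells, so they fail to match.
        my ( $w1, $p1, $w2, $p2 ) =
            $row =~ /^\|\s*(\S+)\s*\|\s*([\d.]+)\s*\|\s*(\S+)\s*\|\s*([\d.]+)\s*\|/
            or next;
        $f1{$w1} = $p1;
        $f2{$w2} = $p2;
    }
    return ( \%f1, \%f2 );
}

my @rows = (
    '| wordF1 | percentageF1 | wordF2 | percentageF2 |',
    '|------------+--------------+--------------+--------------|',
    '| politici | 0.0489 | politici | 0.0295 |',
    '| referendum | 0.0238 | consenso | 0.0126 |',
    '| consenso | 0.00373 | tradizionale | 0.0022 |',
);

my ( $f1, $f2 ) = parse_table(@rows);

# Keep only words present in both columns; always subtract F2 from F1.
my %diff;
for my $word ( keys %$f1 ) {
    next unless exists $f2->{$word};
    $diff{$word} = $f1->{$word} - $f2->{$word};
}

for my $word ( sort { $diff{$b} <=> $diff{$a} } keys %diff ) {
    printf "%-12s %+.5f\n", $word, $diff{$word};
}
```

    On these three rows, politici comes out at +0.01940 and consenso at -0.00887. referendum is dropped because, in this cut-down sample, it appears only in the F1 column.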

    You said, "I have processed two text files..." I (somehow) get the impression that each of the two files contain a corpus which underwent processing resulting in generating your original table (perhaps you sent a program a list of files to analyze)--this, instead of merely having word/value pairs in those two files. Is this correct? If not, and you do have these word/value pairs in those files, consider the offered file solutions.

    Hope this helps!

    Edit: Below is a script which takes two files containing the two data sets you posted earlier. It's just slightly modified from the script above:

      You said, "I have processed two text files..." I (somehow) get the impression that each of the two files contain a corpus which underwent processing resulting in generating your original table (perhaps you sent a program a list of files to analyze)--this, instead of merely having word/value pairs in those two files. Is this correct?

      You got it right. I am doing some NLP. I have two corpora, which I tokenized and then simply ran some queries over. Thank you very much for your help.

        You're most welcome, epimenidecretese!

Re: Aligning text and then perfom calculations
by soonix (Canon) on Dec 16, 2013 at 09:06 UTC

    Your OP probably is solved by now, but may I suggest some alterations?

    a) If the "name" columns are to be equal, you would need only one of them,

    b) if a word is present in one of the files, and missing in the other, you could assume this as a value of zero, resulting in

    corrotto   +0.00439
    tradizioni -0.00234
    ...
    Update: c) and do the formatting/alignment of decimal points (and sorting, like Kenosis in Re: Aligning text and then perfom calculations) after doing the calculations
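    The three suggestions above could be combined roughly as follows. This is only a sketch under stated assumptions: the two hashes are filled by hand with a few values from the table in the original question, the sub name signed_differences is invented for illustration, and the five-decimal signed format is just one possible choice.

```perl
use strict;
use warnings;

# A few word => percentage pairs copied from the table in the question.
my %f1 = ( politici => 0.0489, corrotto   => 0.00439, consenso => 0.00373 );
my %f2 = ( politici => 0.0295, tradizioni => 0.00234, consenso => 0.0126 );

# Hypothetical helper: difference over the union of words,
# counting a word that is missing from one side as 0 (suggestion b).
sub signed_differences {
    my ( $left, $right ) = @_;
    my %diff;
    for my $word ( keys %$left, keys %$right ) {
        $diff{$word} = ( $left->{$word} // 0 ) - ( $right->{$word} // 0 );
    }
    return \%diff;
}

my $diff = signed_differences( \%f1, \%f2 );

# Sort and format only after all calculations are done (suggestion c),
# with a single name column (suggestion a).
my @lines = map { sprintf '%-12s %+.5f', $_, $diff->{$_} }
    sort { $diff->{$b} <=> $diff->{$a} or $a cmp $b } keys %$diff;

print "$_\n" for @lines;
```

    Keeping the raw numbers in the hash and calling sprintf only at print time means the sort compares real values, not formatted strings.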