Mordan has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I am trying to print the expressions matched by a regex. I asked a related question earlier and made a bit of a hash of it, so I am starting afresh.

I am using tagger and want to be able to display the tags found in the regex.

$NUM   = get_exp('cd');
$GER   = get_exp('vbg');
$ADJ   = get_exp('jj[rs]*');
$PART  = get_exp('vbn');
$NN    = get_exp('nn[sp]*');
$NNP   = get_exp('nnp');
$PREP  = get_exp('in');
$DET   = get_exp('det');
$PAREN = get_exp('[lr]rb');
$QUOT  = get_exp('ppr');
$SEN   = get_exp('pp');
$WORD  = get_exp('\p{IsWord}+');

I can display the text I input all tagged (code below), but what I want to do is display the count of tags. So like:

The code below outputs tagged text, but I can't get it to tabulate the tags. My attempts, such as print $tag and print $GER, don't work.

Also, I heard that tagger has problems accepting input from files rather than from text written in the code. Has anyone else heard that?

#!/usr/bin/env perl
use Lingua::EN::Tagger qw(add_tags);

my $postagger = new Lingua::EN::Tagger;
my $text      = "the quick brown fox jumped over the lazy dog";
my $tagged    = $postagger->add_tags($text);
print $tagged, "\n";

Replies are listed 'Best First'.
Re: Best way to print variables in regex
by Kenosis (Priest) on Jan 20, 2014 at 23:22 UTC

    Perhaps the following will be helpful:

    use strict;
    use warnings;
    use Lingua::EN::Tagger qw(add_tags);

    my %tags;
    my $postagger = new Lingua::EN::Tagger;
    my $text      = "the quick brown fox jumped over the lazy dog";
    my $tagged    = $postagger->add_tags($text);
    print $tagged, "\n\n";

    $tags{ uc $1 }++ while $tagged =~ m!<([^/]+?)>!g;
    print "$_: $tags{$_}\n" for sort keys %tags;


    Output:

    <det>the</det> <jj>quick</jj> <jj>brown</jj> <nn>fox</nn> <vbd>jumped</vbd> <in>over</in> <det>the</det> <jj>lazy</jj> <nn>dog</nn>

    DET: 2
    IN: 1
    JJ: 3
    NN: 2
    VBD: 1
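
    For anyone who wants to try the counting step without installing the tagger, here is a minimal sketch that runs on an already-tagged string. The helper name count_tags and the sample string are made up for illustration, and the pattern is a slightly stricter variant of the one above (it also excludes '>' inside the tag name), not the exact original.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::Tagger): count the opening
# tags in an already-tagged string and return a hash reference of totals.
sub count_tags {
    my ($tagged) = @_;
    my %count;
    # [^/>]+ cannot start with '/', so closing tags such as </nn> are skipped,
    # and it cannot run past a '>', so each match is a single tag name.
    $count{ uc $1 }++ while $tagged =~ m{<([^/>]+)>}g;
    return \%count;
}

# Made-up sample in the same format that add_tags() produces above.
my $tagged = '<det>the</det> <jj>lazy</jj> <nn>dog</nn> '
           . '<vbd>chased</vbd> <det>the</det> <nn>fox</nn>';

my $count = count_tags($tagged);
print "$_: $count->{$_}\n" for sort keys %$count;
```

    Returning a hash reference keeps the counts for each phrase separate, which helps if several phrases are tabulated one after another.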

      Thank you Kenosis, your method seems the most straightforward. Thanks everyone who answered here and on the other thread.

      Are there any recommendations on how best to put this into a spreadsheet? I want to run this on a few phrases, so I think it would be a good idea to put the results in a spreadsheet in a consistent way rather than copy and paste from the terminal. So DET values would always be in column 1, IN in column 2, and so on.

        One way would be to create a CSV file and then import that into your spreadsheet:
        my $filename = '/path/to/file.csv';
        open (my $fh, '>', $filename) or die "Could not open $filename, $!";
        my @headers = qw( DET IN JJ NN VBD );
        print $fh join(',',@headers) . "\n";
        # then, for each of your phrases
        print $fh join(',', map($tags{$_} || 0, @headers) ) . "\n";
        close $fh;
        However, if you intend to do the tagging at different times, you will need a way to update the data. You could use Spreadsheet::WriteExcel, but it has a learning curve and is probably overkill. Alternatively, you can keep your spreadsheet data as a CSV file and append to that file, or use Tie::Array::CSV to append:
        use Tie::Array::CSV;
        my $filename = '/path/to/file.csv';
        tie my @file, 'Tie::Array::CSV', $filename;
        # (this bit has been fixed - see comment below)
        # for each of your phrases
        my @row = map { $tags{$_} || 0 } @headers;
        push(@file,\@row);
        untie @file;
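
        The plain "append to that file" option can also be done with core Perl only. A rough sketch, assuming a fixed header list: the file name tag_counts.csv, the helper csv_row and the sample counts are all made up for illustration, with the sample hashes standing in for the %tags hash built for each phrase.

```perl
use strict;
use warnings;

# Fixed column order, so DET is always column 1, IN column 2, and so on.
my @headers = qw( DET IN JJ NN VBD );

# Hypothetical helper: turn a hash of tag counts into one CSV line.
# Tags missing from the hash are written as 0.
sub csv_row {
    my ($counts) = @_;
    return join(',', map { $counts->{$_} || 0 } @headers) . "\n";
}

# Made-up file name for illustration.
my $filename = 'tag_counts.csv';

# Write the header line only when the file is missing or empty.
my $need_header = !( -s $filename );

# '>>' opens for appending, so rows from earlier runs are kept.
open(my $fh, '>>', $filename) or die "Could not open $filename: $!";
print {$fh} join(',', @headers) . "\n" if $need_header;

# Made-up counts standing in for the %tags hash of each phrase.
my @phrases = (
    { DET => 2, IN => 1, JJ => 3, NN => 2, VBD => 1 },
    { DET => 1, NN => 1, VBD => 1 },
);
print {$fh} csv_row($_) for @phrases;
close($fh) or die "Could not close $filename: $!";
```

        Because the columns come from @headers rather than from whichever tags happen to appear, every row lines up even when a phrase has no IN or JJ at all. Tags that are not in @headers are silently dropped, so the list needs to cover every tag you care about.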
Re: Best way to print variables in regex
by jethro (Monsignor) on Jan 20, 2014 at 23:24 UTC

    Where does get_exp come from? It isn't mentioned in tagger's documentation.

    Just looked at the documentation and it seems there are methods that return hashes with occurrence frequencies ready to use. Especially get_nouns and get_proper_nouns seem to offer just what you want:

    "get_proper_nouns TAGGED_TEXT

    Given a POS-tagged text, this method returns a hash of all proper nouns and their occurrence frequencies...."

Re: Best way to print variables in regex
by tangent (Vicar) on Jan 21, 2014 at 00:00 UTC
    I have posted a reply to your other question which may help here as well (update: ignore this - Kenosis's solution above is far more elegant).
Re: Best way to print variables in regex
by Anonymous Monk on Jan 21, 2014 at 02:48 UTC