Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re^3: problem count the number of words (updated)

by AnomalousMonk (Archbishop)
on Dec 27, 2018 at 20:16 UTC ( [id://1227766]=note: print w/replies, xml ) Need Help??


in reply to Re^2: problem count the number of words
in thread problem count the number of words

Here's an example of counting defined sets of "words" (which can be tricky to define) based on the technique described in the Building Regex Alternations Dynamically article by haukex. If you can figure out how to get the contents of your positive and negative word data files into the corresponding arrays (and if my notion of what you want is anywhere near what you actually want), you may be on your way.

Note that the code is set up for case-insensitive matching and counting: the negative word "fourscore" matches "FoUrScOrE" in the example sentence, and so on. Note, again, that the concept of a "word" can be slippery, so the use of the  \b boundary assertion, among other details, may not be appropriate.

c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le "my @positive = qw(nation conceived liberty created equal foo); my @negative = qw(fourscore SEVEN fOrTh fathers continent bar); ;; my $sentence = 'FoUrScOrE and seven years ago ' . 'our fathers brought forth, on this continent, ' . 'a new nation, conceived in liberty, and dedicated ' . 'to the proposition that all men are created equal. ' . 'Repeat seven nation fathers nation.' ; ;; my %pos = map { lc($_) => 0 } @positive; my $rx_pos = make_regex(\%pos); print 'for debug: positive rx: ', $rx_pos; ;; my %neg = map { lc($_) => 0 } @negative; my $rx_neg = make_regex(\%neg); print 'for debug: negative rx: ', $rx_neg; ;; my %other; my $rx_undefined = qr{ (?! $rx_pos | $rx_neg) }xms; my $rx_word = qr{ \b [[:alpha:]]+ \b }xms; ;; ++$pos { lc $_ } for $sentence =~ m{ $rx_pos }xmsg; ++$neg { lc $_ } for $sentence =~ m{ $rx_neg }xmsg; ++$other{ lc $_ } for $sentence =~ m{ $rx_undefined $rx_word }xmsg; ;; dd \%pos; dd \%neg; dd \%other; ;; ;; sub make_regex { my ($hr_wordlist) = @_; ;; my ($rx) = map qr{ (?i) \b (?: $_) \b }xms, join '|', map quotemeta, reverse sort keys %$hr_wordlist ; ;; return $rx; } " for debug: positive rx: (?msx-i: (?i) \b (?: nation|liberty|foo|equal| +created|conceived) \b ) for debug: negative rx: (?msx-i: (?i) \b (?: seven|fourscore|forth|fat +hers|continent|bar) \b ) { conceived => 1, created => 1, equal => 1, foo => 0, liberty => 1, na +tion => 3 } { bar => 0, continent => 1, fathers => 2, forth => 1, fourscore => 1, +seven => 2 } { a => 1, ago => 1, all => 1, "and" => 2, are => 1, brought => 1, dedicated => 1, in => 1, men => 1, new => 1, on => 1, our => 1, proposition => 1, repeat => 1, that => 1, the => 1, this => 1, to => 1, years => 1, }

Update: In the  make_regex() function, the lines
    reverse sort
    map  quotemeta,
are swapped | were swapped (fixed); they should be
    map  quotemeta,
    reverse sort
i.e., sort-ing, either lexically or by length, should be done on the raw strings before the quotemeta step.


Give a man a fish:  <%-{-{-{-<

Replies are listed 'Best First'.
Re^4: problem count the number of words
by GHMON (Novice) on Dec 28, 2018 at 09:37 UTC

    Hi

    TNX bro i could find the my problem , my problem was about these section *foreach (my $word = <$inwp>) { $countp{$word}++ ; }* , *foreach (my $word = <$nwt>) {$countn{$word}++ ; }* because i split the words by /\n/ that my code didn't need to this separator

    my new code

    #Hi Codder use warnings ; use strict ; #use DBI ; use utf8 ; use Encode ; my $numlinep = 0 ; my $traincountp = 0 ; my $pt = '/root/Positive.txt' ; my $ptt = '/root/positivetrain1.txt' ; my $pwt = '/root/Positive2.txt' ; my $ntw = '/root/Negative2.txt' ; my $pttt = '/root/positivetest.txt' ; open (my $in , "<:encoding(utf8)" , "$pt") or die "$pt: $!" ; while (my $line = <$in>) { $numlinep++ ; } close $in ; open ($in , "<:encoding(utf8)" , "$pt") or die "$pt: $!" ; while (my $linep = <$in>) { my $inp ; if ($traincountp <= (0.7*$numlinep)){ open ($inp , ">>" , "$ptt") or die "$ptt: $!" ; print $inp $linep , "\n" ; } if ($traincountp > (0.7*$numlinep)){ open (my $inpt , ">>" , "$pttt") or die "$pttt: $!" ; print $inpt $linep , "\n" ; } $traincountp++ ; } close $in ; my $numlinet = 0 ; my $traincountn = 0 ; my $nt = '/root/Negative.txt' ; my $ntt = '/root/negativetrain1.txt' ; my $nttt = '/root/negativetest.txt' ; open (my $it , "<:encoding(utf8)" , "$nt") or die "$nt: $!" ; while (my $line = <$it>) { $numlinet++ ; } close $it ; open ($it , "<:encoding(utf8)" , "$nt") or die "$nt: $!" ; while (my $linen = <$it>) { my $itn ; if ($traincountn <= (0.7*$numlinet)){ open ($itn , ">>" , "$ntt") or die "$ntt: $!" ; print $itn $linen , "\n" ; } if ($traincountn > (0.7*$numlinet)) { open (my $ittn , ">>" , "$nttt") or die "$nttt: $!" ; print $ittn $linen , "\n" ; } $traincountn++ ; } close $it ; my $numlinepw = 0 ; my %countp = () ; open (my $inw , "<:encoding(utf8)" , "$ptt") or die "$ptt: $!" ; open (my $inwp , "<:encoding(utf8)" , "$pwt") or die "$pwt: $!" ; while (<$inw>) { my @pwords ; my @ptw ; my $elementp ; my $countp ; #@pwords = split (/\n/ , $inwp) ; #push @ptw , @pwords ; foreach (my $word = <$inwp>) { $countp{$word}++ ; } while ( ( my $kp , my $vp) = each %countp ) { open (my $hashp , ">>" , 'wordsbagp.txt') ; print $hashp "$kp = $vp\n" ; #print "$kp => $vp\n" ; #print "$kp" , "\n" ; #print "$vp" , "\n" ; #print "$kp" , "\n" , "$vp" , "\n" ; } $numlinepw++ ; } #print "$numlinepw" , "\n" ; my $numlinenw = 0 ; my %countn = () ; open (my $itw , "<:encoding(utf8)" , "$ntt") or die "$ntt: $!" ; open (my $nwt , "<:encoding(utf8)" , "$ntw" ) or die "$ntw: $!" ; while (<$itw>) { $numlinenw++ ; my @nwords ; my @ntw ; my $elementn ; my $countn ; #@nwords = split (/\n/ , $nwt) ; #push @ntw , @nwords ; foreach (my $word = <$nwt>) { $countn{$word}++ ; } while ( ( my $kn , my $vn ) = each %countn ) { open (my $hashn , '>>' , 'wordsbagn.txt') or die $! ; print $hashn "$kn = $vn\n" ; #print "$kn => $vn\n" ; #print "$kn" , "\n" ; #print "$vn" , "\n" ; #print "$kn" , "\n" , "$vn" , "\n" ; } } print 'Finish First Section' , "\n" ;
      foreach (my $word = <$inwp>) { $countp{$word}++ ; }

      You appear to be just counting the words in the words file Positive2.txt and I can't see where you are spitting the sentences in Positive.txt into words.

      Perhaps this will help you progess

      #!/usr/bin/perl use strict; # create positive word hash my $pos_words = '/root/Positive2.txt'; open my $fh_in,'<:encoding(utf8)',$pos_words or die "Could not open $pos_words : $!"; my %pos_word = map { s/^\s+|\s+$//g; $_=>1 } <$fh_in>; close $fh_in; # open positive word count output file my $pos_outfile = '/root/wordsbagp.txt'; open my $fh_out,'>',$pos_outfile or die "Could not open $pos_outfile\n"; # read positive sentence file my $pos_text = '/root/Positive.txt'; open my $fh_in,'<:encoding(utf8)',$pos_text or die "Could not open $pos_text : $!"; my @train = <$fh_in>; close $fh_in; # split into 2 arrays 70% / 30% my $offset = int 0.7 * @train; push my @test, splice @train,$offset; my $lineno = 0; for my $line (@train){ ++$lineno; # split sentence into words chomp($line); my @words = split /\b/,$line; # count positive words only my %count = (); for my $word (@words){ ++$count{$word} if exists $pos_word{$word}; } # print results print $fh_out "\nline $lineno: $line\n"; for my $word (sort keys %count){ printf $fh_out " %-10s => %d\n",$word,$count{$word}; } }; close $fh_in; close $fh_out;
      poj

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1227766]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (2)
As of 2026-04-18 07:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    hippoepoptai's answer Re: how do I set a cookie and redirect was blessed by hippo!
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.