PerlMonks  

Re^3: Finding multiword units in a corpus

by tybalt89 (Monsignor)
on Nov 18, 2022 at 16:55 UTC ( [id://11148246]=note )


in reply to Re^2: Finding multiword units in a corpus
in thread Finding multiword units in a corpus

TIMTOWTDI

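#!/usr/bin/perl

use strict; # https://perlmonks.org/?node_id=11148202
use warnings;
use List::AllUtils qw( rev_nsort_by );

my $corpusfile = '/tmp/d.11148202.corpus'; # FIXME filename
my $wordfile = '/tmp/d.11148202.words'; # FIXME filename

# Map each lowercased word (or multiword unit) to its id(s).
# Ids of repeated words are appended, each with a leading space.
my %words2ids;
{
  local @ARGV = $wordfile;
  while( <> )
    {
    my ($key, $value) = split /[\t\n]/;
    $words2ids{lc $key} .= " $value";
    }
}

# One case-insensitive alternation, longest entries first, so that a
# multiword unit wins over a shorter word it starts with.
my $pat = do { local $" = '|';
  qr/(@{[ map quotemeta, rev_nsort_by { length } keys %words2ids ]})/i};

# Print each corpus line with the id(s) inserted after every match.
# \K keeps the matched text; /e evaluates the replacement as code.
my %found;
{
  local @ARGV = $corpusfile;
  print s/\b$pat\K/ $found{lc $1}++; $words2ids{lc $1} /ger while <>;
}

delete @words2ids{ keys %found }; # not found
local $, = "\n";
print '',"---------------- Not Found:", sort(keys %words2ids), '';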

Outputs:

Lokho udebe <ZUL-SIL-0016-n> kukwenze isilomo.
Ukuzihlola izinyo <ZUL-SIL-0018-n> <ZUL-SIL-0018-n-other> kungahlenga izinyo lomhlathi <ZUL-SIL-0019-n> yakho.
Amakhala agxiza amafinyila.
Ulimi <ZUL-SIL-0017-n> amafutha ulimi <ZUL-SIL-0017-n> wonke ULIMI <ZUL-SIL-0017-n> amabheringi.
Sebenzisa amafutha ulimi <ZUL-SIL-0017-n>.
Zama ukugwema ukudla okuncinca udebe <ZUL-SIL-0016-n>.
---------------- Not Found:
ingemuva lomqala
umphimbo
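Why the keys are sorted longest first: Perl tries alternation branches left to right and takes the first one that matches, so a short word listed ahead of a multiword unit that begins with it would always win. Below is a minimal sketch of that effect, using only core Perl. The inline hash, the sample sentence, and the helper names make_pattern and tag are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical inline data standing in for the word file (term => id string).
my %words2ids = (
    'izinyo'           => ' <ZUL-SIL-0018-n>',
    'izinyo lomhlathi' => ' <ZUL-SIL-0019-n>',
);

# Build a case-insensitive alternation from the terms, in the order given.
sub make_pattern {
    my $alt = join '|', map quotemeta, @_;
    return qr/($alt)/i;
}

# Insert the id string right after each match; \K keeps the matched text,
# /e evaluates the replacement as code, /r returns the modified copy.
sub tag {
    my ($pat, $text) = @_;
    return $text =~ s/\b$pat\K/$words2ids{lc $1}/ger;
}

my $text = 'izinyo lomhlathi yakho';

my $longest_first  = make_pattern( sort { length $b <=> length $a } keys %words2ids );
my $shortest_first = make_pattern( sort { length $a <=> length $b } keys %words2ids );

my $good = tag( $longest_first,  $text );
my $bad  = tag( $shortest_first, $text );

print "$good\n";   # izinyo lomhlathi <ZUL-SIL-0019-n> yakho
print "$bad\n";    # izinyo <ZUL-SIL-0018-n> lomhlathi yakho
```

With the shortest-first ordering the multiword unit is never tagged, because the one-word branch matches first at the same position.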
