http://www.perlmonks.org?node_id=11148202

veg_running has asked for the wisdom of the Perl Monks concerning the following question:

I have code that searches for words from a list in a large corpus of tokenised sentences and then assigns a unique ID to those words if it finds them. I would like to upgrade this code to also match multi-word units in the corpus.

My tag set is a simple two-column, tab-separated file. The first column holds the word (or multi-word unit) to find and the second column holds the tag to assign to it:

udebe	<ZUL-SIL-0016-n>
ulimi	<ZUL-SIL-0017-n>
izinyo	<ZUL-SIL-0018-n>
izinyo lomhlathi	<ZUL-SIL-0019-n>
ingemuva lomqala	<ZUL-SIL-0024-n>
umphimbo	<ZUL-SIL-0025-n>

The output I require is also a text file and looks like this (produced with the current code below):

Lokho akusoze <ZUL-SIL-1364-b> kukwenze isilomo .
Ukuzihlola amabele <ZUL-SIL-1234-n> kungahlenga impilo <ZUL-SIL-0238-n> yakho .
Amakhala agxiza amafinyila <ZUL-SIL-0095-n> .
Gcoba <ZUL-SIL-1484-v> amafutha <ZUL-SIL-0572-n> kuwo wonke amabheringi .
Sebenzisa amafutha <ZUL-SIL-0572-n> afanelekile .
Zama <ZUL-SIL-0296-n> ukugwema ukudla <ZUL-SIL-0569-n> okuncinca amafutha <ZUL-SIL-0572-n> .

My code currently looks like this:

use strict;
use warnings;
use feature 'say';    # say() is used below and needs this under strict

my $corpusname = "GoldStandardCorpus.Original.MG.2022-11-10";

# Load the tag set: word (or multi-word unit) => tag, keyed in lower case
my %words2ids;
open my $lemmas, "<", $corpusname.".tagset.txt" or die $!;
while (my $line = <$lemmas>) {
    chomp($line);
    my ($word, $id) = split "\t", $line;
    $words2ids{ lc($word) } = $id;
}

# Annotate the corpus token by token and count the hits per word
my %freq;
open my $output, ">", $corpusname.".possible-annotation.txt" or die $!;
open my $corpus, "<", $corpusname.".txt" or die $!;
while (my $line = <$corpus>) {
    chomp($line);
    my @tokens = split ' ', $line;
    foreach my $token (@tokens) {
        my $lct = lc $token;
        if (my $id = $words2ids{ $lct }) {
            $freq{$lct}++;
            $token .= " $id";
        }
    }
    say { $output } "@tokens";
}

# List the tag set entries that never occurred in the corpus
open my $notfound, ">", $corpusname.".tags-not-found.txt" or die $!;
foreach my $word (sort keys(%words2ids)) {
    next if exists $freq{$word};
    say { $notfound } "$word\t$words2ids{$word}";
}

Any suggestions would be greatly appreciated! I am thinking some sort of sliding window to search for strings of words, but have no idea how to implement this. Thank you!
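To make the sliding-window idea concrete, here is a rough sketch of how it might look, not a tested solution for the full corpus. At each token position it tries the widest window first and shrinks it until a tag set entry matches, so "izinyo lomhlathi" would win over plain "izinyo". The names `annotate_line` and `max_unit_length` are made up for illustration, and the sketch assumes the tag set keys are lower-cased with single spaces between words, as in the loading loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: number of words in the longest tag set entry.
sub max_unit_length {
    my ($words2ids) = @_;
    my $max = 1;
    for my $key (keys %$words2ids) {
        my @parts = split ' ', $key;
        $max = @parts if @parts > $max;
    }
    return $max;
}

# Hypothetical helper: annotate one tokenised line, preferring the longest
# unit that starts at each position (greedy, leftmost-longest matching).
sub annotate_line {
    my ($line, $words2ids, $max_len, $freq) = @_;
    my @tokens = split ' ', $line;
    my @out;
    my $i = 0;
    while ($i < @tokens) {
        # The window cannot run past the end of the line.
        my $longest = @tokens - $i;
        $longest = $max_len if $longest > $max_len;

        my $matched = 0;
        for (my $n = $longest; $n >= 1; $n--) {
            my @window = @tokens[$i .. $i + $n - 1];
            my $key = lc(join ' ', @window);
            next unless exists $words2ids->{$key};
            $freq->{$key}++;
            # Keep the original tokens and put the tag after the last one.
            push @out, @window, $words2ids->{$key};
            $i += $n;
            $matched = 1;
            last;
        }
        next if $matched;
        push @out, $tokens[$i++];    # no entry starts here; copy the token
    }
    return join ' ', @out;
}

# Small demonstration with two entries from the tag set above and a
# made-up input line (assumption: not taken from the real corpus).
my %tags = (
    'izinyo'           => '<ZUL-SIL-0018-n>',
    'izinyo lomhlathi' => '<ZUL-SIL-0019-n>',
);
my %freq;
print annotate_line('Izinyo lomhlathi izinyo .', \%tags,
                    max_unit_length(\%tags), \%freq), "\n";
# prints: Izinyo lomhlathi <ZUL-SIL-0019-n> izinyo <ZUL-SIL-0018-n> .
```

In the existing script, the body of the corpus loop would then shrink to a single call to `annotate_line`, with the maximum unit length computed once after the tag set is loaded. Because a matched window is skipped as a whole, overlapping units are never both tagged; whether that is the desired behaviour depends on the annotation scheme.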