Hi all, I need help adapting this code to assign multiple tags to tokens in a corpus. The code currently prints only the last tag assigned to a token. When I took a closer look at my tagset, I realised that the same token can have multiple tags, and I would like to assign all of them so that a human can check them later.
An example of the input:
Abanye abahlanu balimale kanzima .
Abanye abanazo izinkomo noma izimbuzi .
Abaziphathi ngendlela ekhombisa ukufuna ukudlala .
Abazithamundayo bathi uzimisele ngokuliphika icala .
Abazi ukuthi bakhuluma ngani .
abazonikwa amalungelo okudoba ezinga elincane .
Ahluleke ngisho ukukuphulula umhlane njengasemihleni .
Ahluleke ukuzibamba, azidedele inkaba .
Ahluleke ukuzibamba uMaMthembu .
An example of the tagset:
icala <ZUL-SIL-0035-n>
ukhalo <ZUL-SIL-0036-n>
inkaba <ZUL-SIL-0037-n>
inkaba <ZUL-SIL-0038-n>
isisu <ZUL-SIL-0039-n>
isisu <ZUL-SIL-0040-n>
isibeletho <ZUL-SIL-0041-n>
umhlane <ZUL-SIL-0042-n>
iqolo <ZUL-SIL-0043-n>
izinqe <ZUL-SIL-0044-n>
umdidi <ZUL-SIL-0045-n>
umphambili <ZUL-SIL-0046-n>
amasende <ZUL-SIL-0047-n>
inkomo <ZUL-SIL-0048-n>
ubhontshisi <ZUL-SIL-0049-n>
ingalo <ZUL-SIL-0050-n>
ukuthi bakhuluma <ZUL-SIL-1800-n>
An example of the output I hope to get (with a tab separating tags where two were assigned):
Abanye abahlanu balimale kanzima .
Abanye abanazo izinkomo <ZUL-SIL-0048-n> noma izimbuzi .
Abaziphathi ngendlela ekhombisa ukufuna ukudlala .
Abazithamundayo bathi uzimisele ngokuliphika icala <ZUL-SIL-0035-n> .
Abazi ukuthi bakhuluma <ZUL-SIL-1800-n> ngani .
abazonikwa amalungelo isisu <ZUL-SIL-0039-n>\t<ZUL-SIL-0040-n> ezinga elincane .
Ahluleke ngisho ukukuphulula umhlane <ZUL-SIL-0042-n> njengasemihleni .
Ahluleke ukuzibamba, azidedele inkaba <ZUL-SIL-0037-n>\t<ZUL-SIL-0038-n> .
Ahluleke ukuzibamba uMaMthembu .
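The tab-separated tags above suggest that each word needs to map to a list of tags rather than to a single tag. Here is a minimal sketch of that grouping, assuming the tagset lines are tab-separated; the `@tagset_lines` array and the `%tags_for` name are hypothetical stand-ins, not part of my script.

```perl
use 5.016;
use warnings;

# Hypothetical in-memory tagset lines (word<TAB>tag), copied from the example tagset.
my @tagset_lines = (
    "inkaba\t<ZUL-SIL-0037-n>",
    "inkaba\t<ZUL-SIL-0038-n>",
    "umhlane\t<ZUL-SIL-0042-n>",
);

# Map each case-folded word to an array of all its tags, in file order.
my %tags_for;
for my $line (@tagset_lines) {
    my ($word, $tag) = split /\t/, $line;
    push @{ $tags_for{fc $word} }, $tag;
}

# Join the tags for a word with a tab, as in the desired output.
my $inkaba  = join "\t", @{ $tags_for{inkaba} };
my $umhlane = join "\t", @{ $tags_for{umhlane} };
say "inkaba $inkaba";
say "umhlane $umhlane";
```

With a hash of arrays, a word that appears twice in the tagset keeps both tags instead of the second replacing the first.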
The code I currently have:
#!/usr/bin/env perl
use 5.016;
use warnings;
use autodie;

my $corpusname = 'GFSEBcorpus.zul_selected-sentences_original';

# Read the tagset: one "word<TAB>tag" pair per line.
my %words2ids;
{
    open my $fh, '<', "$corpusname.example.tagset.txt";
    while (<$fh>) {
        chomp;
        my ($text, $token) = split /\t/;
        # A repeated word overwrites its earlier tag, so only the last one survives.
        $words2ids{fc $text} = $token;
    }
}

# Build one alternation of all tagset words, longest first.
my $alt = join '|', sort {
    length($b) <=> length($a)
} map fc, keys %words2ids;
my $re = qr{(?i:($alt))};

# Annotate the corpus, recording which tagset words were found.
my %found;
{
    open my $in_fh,  '<', "$corpusname.txt";
    open my $out_fh, '>', "$corpusname.possible-annotation_example.txt";
    while (<$in_fh>) {
        s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg;
        print $out_fh $_;
    }
}

# Write out the tagset entries that never matched.
delete @words2ids{keys %found};
{
    open my $fh, '>', "$corpusname.tags-not-found_example.txt";
    for (sort keys %words2ids) {
        say $fh "$_\t$words2ids{$_}";
    }
}
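One direction that might work for the substitution step is to look up a list of tags per match and join them with a tab. This is only a sketch under assumptions: `%tags_for` is a hypothetical word-to-tags map filled in by hand here, and `annotate` is a hypothetical helper, neither of which exists in my script. Like the code above, it matches tagset words anywhere in a line, including inside longer words.

```perl
use 5.016;
use warnings;

# Hypothetical word-to-tags map; in the real script it would come from the tagset file.
my %tags_for = (
    'inkaba' => [ '<ZUL-SIL-0037-n>', '<ZUL-SIL-0038-n>' ],
    'icala'  => [ '<ZUL-SIL-0035-n>' ],
);

# Longest words first; quotemeta keeps any regex metacharacters in a word literal.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) }
    keys %tags_for;
my $re = qr{(?i:($alt))};

# Hypothetical helper: append every tag of each matched word, tab-separated.
sub annotate {
    my ($line) = @_;
    $line =~ s/$re/ "$1 " . join("\t", @{ $tags_for{fc $1} }) /eg;
    return $line;
}

say annotate('Ahluleke ukuzibamba, azidedele inkaba .');
say annotate('Abazithamundayo bathi uzimisele ngokuliphika icala .');
```

The regex and the `s///eg` loop stay the same as in my current code; only the lookup changes from a single value to a joined list.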
Thank you for the help!