Tagging a corpus with multiple tags

by veg_running (Initiate)
on Nov 24, 2022 at 12:10 UTC (#11148352, perlquestion)

veg_running has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I need help adapting this code to assign multiple tags to tokens in a corpus. The code currently prints only the last tag listed for a token. On closer inspection of my tagset, I realised that the same token can have multiple tags, and I would like to assign all of them so that a human can check them later.

An example of the input:

Abanye abahlanu balimale kanzima . Abanye abanazo izinkomo noma izimbuzi . Abaziphathi ngendlela ekhombisa ukufuna ukudlala . Abazithamundayo bathi uzimisele ngokuliphika icala . Abazi ukuthi bakhuluma ngani . abazonikwa amalungelo okudoba ezinga elincane . Ahluleke ngisho ukukuphulula umhlane njengasemihleni . Ahluleke ukuzibamba, azidedele inkaba . Ahluleke ukuzibamba uMaMthembu .

An example of the tagset (one entry per line, with a tab between the text and its tag):

icala	<ZUL-SIL-0035-n>
ukhalo	<ZUL-SIL-0036-n>
inkaba	<ZUL-SIL-0037-n>
inkaba	<ZUL-SIL-0038-n>
isisu	<ZUL-SIL-0039-n>
isisu	<ZUL-SIL-0040-n>
isibeletho	<ZUL-SIL-0041-n>
umhlane	<ZUL-SIL-0042-n>
iqolo	<ZUL-SIL-0043-n>
izinqe	<ZUL-SIL-0044-n>
umdidi	<ZUL-SIL-0045-n>
umphambili	<ZUL-SIL-0046-n>
amasende	<ZUL-SIL-0047-n>
inkomo	<ZUL-SIL-0048-n>
ubhontshisi	<ZUL-SIL-0049-n>
ingalo	<ZUL-SIL-0050-n>
ukuthi bakhuluma	<ZUL-SIL-1800-n>

An example of the output I hope to get (with a tab separating tags where two were assigned):

Abanye abahlanu balimale kanzima . Abanye abanazo izinkomo <ZUL-SIL-0048-n> noma izimbuzi . Abaziphathi ngendlela ekhombisa ukufuna ukudlala . Abazithamundayo bathi uzimisele ngokuliphika icala <ZUL-SIL-0035-n> . Abazi ukuthi bakhuluma <ZUL-SIL-1800-n> ngani . abazonikwa amalungelo isisu <ZUL-SIL-0039-n>\t<ZUL-SIL-0040-n> ezinga elincane . Ahluleke ngisho ukukuphulula umhlane <ZUL-SIL-0042-n> njengasemihleni . Ahluleke ukuzibamba, azidedele inkaba <ZUL-SIL-0037-n>\t<ZUL-SIL-0038-n> . Ahluleke ukuzibamba uMaMthembu .

The code I currently have:

#!/usr/bin/env perl

use 5.016;
use warnings;
use autodie;

my $corpusname = 'GFSEBcorpus.zul_selected-sentences_original';

my %words2ids;
{
    open my $fh, '<', "$corpusname.example.tagset.txt";
    while (<$fh>) {
        chomp;
        my ($text, $token) = split /\t/;
        $words2ids{fc $text} = $token;
    }
}

my $alt = join '|', sort { length($b) <=> length($a) } map fc, keys %words2ids;
my $re = qr{(?i:($alt))};

my %found;
{
    open my $in_fh, '<', "$corpusname.txt";
    open my $out_fh, '>', "$corpusname.possible-annotation_example.txt";
    while (<$in_fh>) {
        s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg;
        print $out_fh $_;
    }
}

delete @words2ids{keys %found};

{
    open my $fh, '>', "$corpusname.tags-not-found_example.txt";
    for (sort keys %words2ids) {
        say $fh "$_\t$words2ids{$_}";
    }
}
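The reason only the last tag survives is the assignment `$words2ids{fc $text} = $token;`, which overwrites any tag stored earlier for the same text. One possible adaptation, shown here only as a minimal sketch and not tested on the full corpus, is to keep a list of tags per text and join that list with tabs in the replacement. The subroutine names (`load_tagset`, `build_regex`, `tag_line`) are made up for illustration, the tagset is passed in as a list of lines rather than read from a file, and the `quotemeta` call is an extra precaution that the code above does not have.

```perl
#!/usr/bin/env perl
use 5.016;      # enables say and fc
use warnings;

# Build a map: case-folded text => array ref of ALL tags seen for it.
# push appends, so a second "inkaba" line no longer overwrites the first.
sub load_tagset {
    my @lines = @_;
    my %words2ids;
    for my $line (@lines) {
        chomp $line;
        my ($text, $tag) = split /\t/, $line;
        next unless defined $tag;              # skip malformed lines
        push @{ $words2ids{ fc $text } }, $tag;
    }
    return \%words2ids;
}

# Longest alternatives first, as in the original code.
# quotemeta (an addition) protects any regex metacharacters in the tagset.
sub build_regex {
    my ($words2ids) = @_;
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) }
        keys %$words2ids;
    return qr{(?i:($alt))};
}

# Return the matched text followed by a space and its tab-joined tags.
sub annotate {
    my ($match, $words2ids, $found) = @_;
    my $key = fc $match;
    $found->{$key}++;
    return "$match " . join("\t", @{ $words2ids->{$key} });
}

# Tag one line; $found collects the keys that matched at least once.
sub tag_line {
    my ($line, $re, $words2ids, $found) = @_;
    $line =~ s/$re/annotate($1, $words2ids, $found)/eg;
    return $line;
}

# Small demonstration with entries taken from the example tagset.
{
    my $ids = load_tagset(
        "icala\t<ZUL-SIL-0035-n>\n",
        "inkaba\t<ZUL-SIL-0037-n>\n",
        "inkaba\t<ZUL-SIL-0038-n>\n",
    );
    my $re = build_regex($ids);
    my %found;
    say tag_line('Ahluleke ukuzibamba, azidedele inkaba .', $re, $ids, \%found);

    # Entries that never matched, one line per text with all of its tags.
    delete @{$ids}{ keys %found };
    say "$_\t", join("\t", @{ $ids->{$_} }) for sort keys %$ids;
}
```

With the two `inkaba` entries above, the demonstration sentence comes out with both tags after `inkaba`, separated by a tab, and `icala` is reported as not found. Like the original, this sketch matches substrings without word boundaries, which is what lets `inkomo` tag `izinkomo`.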

Thank you for the help!

Replies are listed 'Best First'.
Re: Tagging a corpus with multiple tags
by Corion (Patriarch) on Nov 24, 2022 at 12:18 UTC

Node Type: perlquestion [id://11148352]
Approved by marto