String tags

rajaman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a list of tags with their index locations with respect to a given string. For example:

String='Titles consisting of a single word are discouraged, and in most cases are disallowed outright.'

tag: tag category: tag-id: start index location: end index location
consisting of: cat1: id1: 7: 20
discouraged: cat1: id2: 39: 50
most cases: cat2: id3: 59: 69

Using the given tag index locations, I want to tag the string as:

Titles (cat1: id1)consisting of(cat1) a single word are (cat1: id2)discouraged(cat1), and in (cat2: id3)most cases(cat2) are disallowed outright.

I tried iterating over the tag index positions and applying the 'substr' function to the string on each iteration. The problem is that after each replacement the indexes of the remaining characters shift. Please suggest an efficient way of doing this.

Thanks!

Re: String tags by tybalt89 (Monsignor) on Sep 25, 2017 at 23:49 UTC
Do it from back to front.

```
#!/usr/bin/perl

# http://perlmonks.org/?node_id=1200077

use strict;
use warnings;

my $string = 'Titles consisting of a single word are discouraged, and in most cases are disallowed outright.';
my @tags = split /\n/, <<END;
consisting of: cat1: id1: 7: 20
discouraged: cat1: id2: 39: 50
most cases: cat2: id3: 59: 69
END

print "$string\n";

for (reverse @tags)
  {
  my ($text, $cat, $id, $start, $end) = split /: /;
  substr $string, $start, $end - $start, "($cat: $id)$text($cat)";
  }

print "$string\n";
```

Outputs:

```
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Titles (cat1: id1)consisting of(cat1) a single word are (cat1: id2)discouraged(cat1), and in (cat2: id3)most cases(cat2) are disallowed outright.
```
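The back-to-front loop above relies on the tag list already being in ascending order of start index. A minimal sketch of the same idea that does not depend on input order follows; it sorts the records by start offset, highest first, before calling the four-argument `substr`. The helper name `tag_string` is hypothetical (it does not appear in the thread), and the sketch assumes the tagged spans do not overlap and that the end index is exclusive, as in the question's data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: wrap each tagged span,
# working from the highest start offset down so that the offsets of
# spans still to be processed are never disturbed.
# Assumes spans do not overlap and that 'end' is exclusive.
sub tag_string {
    my ($string, @tags) = @_;

    # Each record becomes [ text, category, id, start, end ].
    my @records = map { [ split /:\s*/, $_ ] } @tags;

    for my $rec (sort { $b->[3] <=> $a->[3] } @records) {
        my ($text, $cat, $id, $start, $end) = @$rec;

        # Take the span from the string itself rather than trusting $text.
        my $found = substr $string, $start, $end - $start;
        substr $string, $start, $end - $start, "($cat: $id)$found($cat)";
    }
    return $string;
}

my $string = 'Titles consisting of a single word are discouraged, and in most cases are disallowed outright.';

# Deliberately out of order, to show that the sort does the work.
my @tags = (
    'most cases: cat2: id3: 59: 69',
    'consisting of: cat1: id1: 7: 20',
    'discouraged: cat1: id2: 39: 50',
);

print tag_string($string, @tags), "\n";
```

Sorting costs a little extra for a list that is already ordered, but it removes a silent failure mode: with an unsorted list, a plain `reverse` would corrupt every span located after an earlier insertion.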
Re^2: String tags by rajaman (Sexton) on Sep 28, 2017 at 17:50 UTC
Great, these solutions work great. Thanks tybalt89, Anonymous Monk.
Re^3: String tags by AnomalousMonk (Archbishop) on Sep 28, 2017 at 18:54 UTC
Please let me (not the Anonymous Monk) know if you need the whitespace-insensitive, file-slurp version; I will post.

Give a man a fish: `<%-{-{-{-<`
Re: String tags by AnomalousMonk (Archbishop) on Sep 26, 2017 at 00:56 UTC
Here's a variation that doesn't worry about offsets at all, but goes after the target words or phrases themselves.

```
c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my @tags = (
   'consisting of: cat1: id1: 7: 20',
   'discouraged: cat1: id2: 39: 50',
   'most cases: cat2: id3: 59: 69',
   );
 ;;
 my %targets;
 ;;
 for my $tag (@tags) {
   my ($string, $cat, $id) = split /:\s+/, $tag;
   $targets{$string} = [ $cat, $id ];
   }
 ;;
 my ($rx_target) =
   map qr{ \b (?: $_) \b }xms,
   join ' | ',
   map quotemeta,
   reverse sort
   keys %targets
   ;
 ;;
 my $string =
   'Titles consisting of a single word are discouraged, ' .
   'and in most cases are disallowed outright.'
   ;
 print qq{'$string'};
 ;;
 $string =~ s{ ($rx_target) }
             {($targets{$1}[0]: $targets{$1}[1])$1($targets{$1}[0])}xmsg;
 print qq{'$string'};
 "
'Titles consisting of a single word are discouraged, and in most cases are disallowed outright.'
'Titles (cat1: id1)consisting of(cat1) a single word are (cat1: id2)discouraged(cat1), and in (cat2: id3)most cases(cat2) are disallowed outright.'
```

Some important caveats:

A target phrase like `"most cases"` does not match a `"most  cases"` (two spaces) or `"most\ncases"` source substring, because the target requires an exact match with a single space; the given substrings, while similar, are not exactly equal. This can be dealt with fairly easily if file-slurp processing of the source text is used. However, file slurping does not scale well to large (say, more than a few hundred megabytes) files.

If the `"most\ncases"` case above is encountered in line-by-line processing of source text (which does scale to enormous files), handling becomes more tricky, but can still be done.

Update: This approach does not handle nested tags.

Update: In almost every case, use of the full-featured Text::CSV module is preferable to the naive use of split that I have in my example code.

Give a man a fish: `<%-{-{-{-<`
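The first caveat above can be illustrated with a minimal sketch of a whitespace-tolerant match. This is not the version AnomalousMonk offered to post. It is one possible reading of the idea, assuming the whole source text is already in memory as a single string (as with a file slurp) and that the keys of `%targets` use single spaces between words. The helper names `wrap_match` and `tag_phrases` are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Phrase => [ category, id ]; keys are assumed to be single-spaced.
my %targets = (
    'consisting of' => [ 'cat1', 'id1' ],
    'discouraged'   => [ 'cat1', 'id2' ],
    'most cases'    => [ 'cat2', 'id3' ],
);

# One alternative per phrase: each word is quoted with quotemeta and the
# words are joined with \s+, so any run of spaces or newlines is accepted.
# Longer phrases come first so they win over a shorter phrase that is a prefix.
my @alternatives;
for my $phrase (sort { length($b) <=> length($a) } keys %targets) {
    my @words = map { quotemeta($_) } split ' ', $phrase;
    push @alternatives, join '\s+', @words;
}
my $alternation = join '|', @alternatives;
my $rx_target   = qr{ \b (?: $alternation ) \b }xms;

# Hypothetical helper: look up the matched text under its
# whitespace-normalized form, but emit it exactly as it was matched.
sub wrap_match {
    my ($match) = @_;
    (my $key = $match) =~ s/\s+/ /g;
    my ($cat, $id) = @{ $targets{$key} };
    return "($cat: $id)$match($cat)";
}

# Hypothetical helper: tag every target phrase found in the text.
sub tag_phrases {
    my ($text) = @_;
    $text =~ s{ ($rx_target) }{ wrap_match($1) }xmsge;
    return $text;
}

print tag_phrases("Titles consisting  of a single word are discouraged,\n"
                . "and in most\ncases are disallowed outright."), "\n";
```

The matched text is written back with its original whitespace intact; only the hash lookup uses the normalized form. Like the phrase-based code above, this sketch does not handle nested tags.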

