String tags

by rajaman (Sexton)
on Sep 25, 2017 at 23:24 UTC

rajaman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a list of tags with their index locations with respect to a given string. For example:

String='Titles consisting of a single word are discouraged, and in most cases are disallowed outright.'

tag: tag category: tag-id: start index location: end index location
consisting of: cat1: id1: 7: 20
discouraged: cat1: id2: 39: 50
most cases: cat2: id3: 59: 69

Using the given tag index locations, I want to tag the string as:

Titles (cat1: id1)consisting of(cat1) a single word are (cat1: id2)discouraged(cat1), and in (cat2: id3)most cases(cat2) are disallowed outright.

I tried to solve this by iterating over the tag index positions and calling 'substr' on the string at each iteration. The problem is that every insertion shifts the indexes of all the characters after it, so the remaining tag positions no longer line up. Please suggest an efficient way of doing this.

Thanks!
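
To make the index shift concrete, here is a minimal sketch that works front to back and compensates with a running count of inserted characters. It is only one possible approach, not something taken from the replies; the `$shift` variable and the array-of-arrays layout for the tags are illustrative assumptions, and it assumes the tags are sorted by start offset and do not overlap.

```perl
use strict;
use warnings;

my $string = 'Titles consisting of a single word are discouraged, '
           . 'and in most cases are disallowed outright.';

# Each record: [ category, id, start, end ], with end exclusive,
# listed in ascending order of start offset.
my @tags = (
    [ 'cat1', 'id1',  7, 20 ],
    [ 'cat1', 'id2', 39, 50 ],
    [ 'cat2', 'id3', 59, 69 ],
);

my $shift = 0;    # number of characters inserted so far
for my $tag (@tags) {
    my ($cat, $id, $start, $end) = @$tag;
    my $open  = "($cat: $id)";
    my $close = "($cat)";

    # Insert the closing marker first, so the opening position is
    # still valid when the opening marker goes in.
    substr $string, $end   + $shift, 0, $close;
    substr $string, $start + $shift, 0, $open;

    $shift += length($open) + length($close);
}

print "$string\n";
```

Every original offset is corrected by the total length of the markers inserted before it, which is exactly the bookkeeping that processing the tags in reverse order avoids.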

Replies are listed 'Best First'.
Re: String tags
by tybalt89 (Monsignor) on Sep 25, 2017 at 23:49 UTC

    Do it from back to front.

    #!/usr/bin/perl

    # http://perlmonks.org/?node_id=1200077

    use strict;
    use warnings;

    my $string = 'Titles consisting of a single word are discouraged, and in most cases are disallowed outright.';
    my @tags = split /\n/, <<END;
    consisting of: cat1: id1: 7: 20
    discouraged: cat1: id2: 39: 50
    most cases: cat2: id3: 59: 69
    END

    print "$string\n";
    for (reverse @tags)
      {
      my ($text, $cat, $id, $start, $end) = split /: /;
      substr $string, $start, $end - $start, "($cat: $id)$text($cat)";
      }
    print "$string\n";

    Outputs:

    Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
    Titles (cat1: id1)consisting of(cat1) a single word are (cat1: id2)discouraged(cat1), and in (cat2: id3)most cases(cat2) are disallowed outright.
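
The loop above works because the tag records happen to be listed in ascending order of start offset, so reversing the list visits the rightmost tag first. If that ordering cannot be relied on, a hedged variation is to sort the records by start offset explicitly. The sketch below assumes non-overlapping tags and end-exclusive offsets; the out-of-order input is made up for illustration.

```perl
use strict;
use warnings;

my $string = 'Titles consisting of a single word are discouraged, '
           . 'and in most cases are disallowed outright.';

# Same records as in the thread, deliberately out of order.
my @tags = (
    'most cases: cat2: id3: 59: 69',
    'consisting of: cat1: id1: 7: 20',
    'discouraged: cat1: id2: 39: 50',
);

# Split each record into fields, then order by start offset, highest first.
my @sorted =
    sort { $b->[3] <=> $a->[3] }
    map  { [ split /:\s*/, $_ ] }
    @tags;

for my $tag (@sorted) {
    my (undef, $cat, $id, $start, $end) = @$tag;

    # Take the tagged text from the string itself rather than the record.
    my $text = substr $string, $start, $end - $start;
    substr $string, $start, $end - $start, "($cat: $id)$text($cat)";
}

print "$string\n";
```

Edits made at higher offsets never move the characters at lower offsets, so no offset correction is needed.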
      Great, these solutions work well. Thanks, tybalt89 and Anonymous Monk.

        Please let me (not the Anonymous Monk) know if you need the whitespace-insensitive, file-slurp version; I will post.


        Give a man a fish:  <%-{-{-{-<

Re: String tags
by AnomalousMonk (Archbishop) on Sep 26, 2017 at 00:56 UTC

    Here's a variation that doesn't worry about offsets at all, but goes after the target words or phrases themselves.

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my @tags = (
       'consisting of: cat1: id1: 7: 20',
       'discouraged: cat1: id2: 39: 50',
       'most cases: cat2: id3: 59: 69',
       );
     ;;
     my %targets;
     ;;
     for my $tag (@tags) {
         my ($string, $cat, $id) = split /:\s+/, $tag;
         $targets{$string} = [ $cat, $id ];
         }
     ;;
     my ($rx_target) =
         map qr{ \b (?: $_) \b }xms,
         join ' | ',
         map quotemeta,
         reverse sort
         keys %targets
         ;
     ;;
     my $string =
         'Titles consisting of a single word are discouraged, ' .
         'and in most cases are disallowed outright.'
         ;
     print qq{'$string'};
     ;;
     $string =~
         s{ ($rx_target) }
          {($targets{$1}[0]: $targets{$1}[1])$1($targets{$1}[0])}xmsg;
     print qq{'$string'};
    "
    'Titles consisting of a single word are discouraged, and in most cases are disallowed outright.'
    'Titles (cat1: id1)consisting of(cat1) a single word are (cat1: id2)discouraged(cat1), and in (cat2: id3)most cases(cat2) are disallowed outright.'

    Some important caveats:

    • A target phrase like "most cases" will not match a source substring such as "most  cases" (two spaces) or "most\ncases", because the pattern requires exactly one space between the words. This can be dealt with fairly easily if the source text is slurped from the file in one piece. However, file slurping does not scale well to large files (say, more than a few hundred megabytes).
    • If the "most\ncases" case above is encountered in line-by-line processing of the source text (which does scale to enormous files), handling becomes trickier, but it can still be done.
    • Update: This approach does not handle nested tags.
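
    For the first caveat, one possible whitespace-tolerant variation (a sketch only, assuming the whole source text is already in memory) is to join the words of each phrase with \s+ instead of a literal space. The `wrap_match` helper and the sample text with an embedded newline and a run of spaces are made up for illustration.

```perl
use strict;
use warnings;

my %targets = (
    'consisting of' => [ 'cat1', 'id1' ],
    'discouraged'   => [ 'cat1', 'id2' ],
    'most cases'    => [ 'cat2', 'id3' ],
);

# One alternative per phrase: quote each word, join the words with \s+.
# Longer phrases come first so they win over any shorter phrase
# that happens to be a prefix of them.
my $alternation =
    join '|',
    map  { join '\s+', map { quotemeta $_ } split ' ', $_ }
    sort { length($b) <=> length($a) }
    keys %targets;
my $rx_target = qr{ \b (?: $alternation ) \b }xms;

# Hypothetical helper: look up the matched phrase after collapsing
# whitespace runs, but keep the original whitespace in the output.
sub wrap_match {
    my ($match) = @_;
    my $key = join ' ', split ' ', $match;
    my ($cat, $id) = @{ $targets{$key} };
    return "($cat: $id)$match($cat)";
}

my $text = "Titles consisting\nof a single word are discouraged, "
         . "and in most   cases\nare disallowed outright.";

$text =~ s{ ($rx_target) }{ wrap_match($1) }xmsge;

print "$text\n";
```

    The matched text is copied through unchanged, so line breaks inside a phrase survive the tagging. Like the original, this sketch does not handle nested tags.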

    Update: In almost every case, use of the full-featured Text::CSV module is preferable to the naive use of split that I have in my example code.


