Remove all duplicates after regex capture

Maire has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks. I'm hoping that you can help me with what is probably quite a simple problem, but which is completely stumping me!

I'm trying to work with text files which are fairly disorganized in terms of structure, and which have been stored in a hash. What I want from each file is to extract the line of text which begins "title:#" and ends with a "#" and store this text in a scalar to be used later.

The problem arises because, within some of the text files, there are multiple lines which begin and end with "title#" and "#" respectively. What distinguishes the "titles" I want is that they only ever appear in each text file once, whereas the "titles" I do not want appear at least twice (but sometimes three or four times) in the same text file.

So this is the basic script that I am using, which prints out all the titles

use warnings;
use strict;

my %mycorpus = (
    a => "<blah blah blah 
blah
title:#this is text I want 1#
blah blah blah",

    b => "blah
title:#this is text I do not want# blah
title:#this is text I want 2#
blah
title:#this is text I do not want#
blah",

    c => "blah blah
title:#this is text I want 3#
title:#this is text I do not want#
title:#this is text I do not want#
title:#this is text I do not want#
blah",
);


foreach my $filename (sort keys %mycorpus) {
        my $titles = '';


         while ($mycorpus{$filename} =~ /title:#(.*?)#/g){
                $titles = $1;        
    
            print  "$titles \n";    
            }
            }
[download]

The script above, obviously, prints out all 8 captured lines which begin with "title", but my desired output is:

this is text I want 1
this is text I want 2
this is text I want 3
[download]

I thought it might work to add a line in after the titles have been captured which (was supposed to!) remove all lines which appeared multiple times, but my attempts at this failed

use warnings;
use strict;

my %mycorpus = (
    a => "<blah blah blah 
blah
title:#this is text I want 1#
blah blah blah",

    b => "blah
title:#this is text I do not want# blah
title:#this is text I want 2#
blah
title:#this is text I do not want#
blah",

    c => "blah blah
title:#this is text I want 3#
title:#this is text I do not want#
title:#this is text I do not want#
title:#this is text I do not want#
blah",
);


foreach my $filename (sort keys %mycorpus) {
        my $titles = '';


         while ($mycorpus{$filename} =~ /title:#(.*?)#/g){
                $titles = $1;
                        $titles =~ s/(\b\w+\b)(?:\s*\1)+/$1/g; 
    
            print  "$titles \n";    
            }
            }
[download]

I hope that makes sense: I have been working on this for about 9 hours now, so I'm a bit frazzled! Any help/pointers here would be very much appreciated! Thank you!

Comment on Remove all duplicates after regex capture Select or Download Code

Replies are listed 'Best First'.
Re: Remove all duplicates after regex capture by haukex (Archbishop) on Aug 19, 2018 at 10:58 UTC
Here's a solution that doesn't depend on the order of the lines. It finds all the `title`s using a regex and counts them using a hash, and then selects the one that appears exactly once, `warn`ing or `die`ing if there isn't exactly one. You haven't specified a few things about the `title:` line, like whether there can be `#`'s in the titles, and what kind of text might appear after the closing `#`. use warnings; use strict; my %mycorpus = ( a => "<blah blah blah blah title:#this is text I want 1# blah blah blah", b => "blah title:#this is text I do not want# title:#this is text I want 2# blah title:#this is text I do not want# blah", c => "blah blah title:#this is text I do not want# title:#this is text I want 3# title:#this is text I do not want# title:#this is text I do not want# blah", ); for my $filename (sort keys %mycorpus) { my %titles; $titles{$1}++ while $mycorpus{$filename} =~ m{ ^ title:\# (.*) \# }xmg; my @once = grep { $titles{$_}==1 } sort keys %titles; die "No title found in $filename" unless @once; warn "More than one title found in $filename" if @once>1; my $title = $once[0]; print "$title\n"; } __END__ this is text I want 1 this is text I want 2 this is text I want 3 [download]	[reply] [d/l] [select]
Re^2: Remove all duplicates after regex capture by Maire (Scribe) on Aug 20, 2018 at 06:44 UTC
Brilliant, thank you very much! (Also, thanks for including examples of how to implement "die" and "warn": I rarely use these signals myself, but having just run your code on my data, it allowed me to identify a major error in the formatting of some of the data that would have caused a major headache later!).	[reply]
Re: Remove all duplicates after regex capture by FreeBeerReekingMonk (Deacon) on Aug 19, 2018 at 09:59 UTC
Edit: note the flaw pointed out by Hautex. I did not understand the question correctly, and did not check uniqueness of the correct title. How about this loop? `foreach my $filename (sort keys %mycorpus) { my $titles = ''; my $counter = 0; while ($mycorpus{$filename} =~ /title:#(.?)#\s$/gm){ if($counter++){ last if $counter++; # skip the rest of the matches # can also be used to print warnings about multiple titles # and check $1 against $titles if they are the same, or not }else{ $titles = $1; # first match, we can store it, print "$titles \n"; # or print it out } } }` [download] the output is `this is text I want 1 this is text I want 2 this is text I want 3` [download] You can also replace the while with an if, and then it just matches the first title# . `foreach my $filename (sort keys %mycorpus) { my $titles = ''; if ($mycorpus{$filename} =~ /title:#(.?)#\s$/m){ $titles = $1; print "$titles \n"; } }` [download] The output is the same. I think you wanted the multiline regexp modifier to match a newline inside your filedump string. edit: better structure to allow more post-work (commented what can be done there). Did also remove the /g (go) modifier in the "if" example as it is not needed there.	[reply] [d/l] [select]
Re^2: Remove all duplicates after regex capture by Maire (Scribe) on Aug 19, 2018 at 10:17 UTC
This works perfectly! Thank you very much for your help/time. I feel a bit daft after seeing how (relatively) simple the solution actually was, but I've learned a lot from your code here. Thanks! EDIT: Ah, thanks for the reworking! It was probably my sleep-drived incoherant question that caused the confusion! Thanks again.	[reply]
Re^3: Remove all duplicates after regex capture by haukex (Archbishop) on Aug 19, 2018 at 10:49 UTC
Note that FreeBeerReekingMonk's solution only works because it relies on the `blah` on the line `"title:#this is text I do not want# blah"`, and it only grabs the first `title:`. If I remove the `blah` or reorder the lines in the third example, it does not work.	[reply] [d/l] [select]
Re^4: Remove all duplicates after regex capture by FreeBeerReekingMonk (Deacon) on Aug 19, 2018 at 19:28 UTC
Re: Remove all duplicates after regex capture by AnomalousMonk (Archbishop) on Aug 19, 2018 at 17:22 UTC
This is essentially the same approach haukex uses, but with some different features: Titles are normalized before being added to the `%titles` duplication detection hash according to their normalized form. This means that, in this implementation, `'This is a title'` and `'This IS A Title'` are the same for the purpose of rejecting duplicate titles. This may or may not be what Maire wants; it's just an example of what's possible. The extent of normalization is easily adjusted. Titles are allowed to wrap from one line to the next. This applies to both unique and duplicate titles. Again, Maire may not want this, and it's easily turned off. A title may not be the only thing on a line. Another thing that's easy to change. A `'*'` is used in place of the original `'#'` character to delimit titles. This is done only to demonstrate that any character, even a regex metacharacter, could be used as a delimiter. (It wouldn't be that hard to change the regexes to accommodate multi-character delimiter sequences.) Script `remove_dup_lines_1.pl`: Read more... (2 kB) Output: `c:\@Work\Perl\monks\Maire>perl remove_dup_lines_1.pl a: 'this is text I want 1' b: 'this is text I want one' c: 'this is text I want A' c: 'this is text I want over multiple lines B' c: 'this is text I want C'` [download] Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^2: Remove all duplicates after regex capture by Maire (Scribe) on Aug 20, 2018 at 06:31 UTC
Thank you very much for this (and also for your very clear explanations!).	[reply]
Re: Remove all duplicates after regex capture by Marshall (Canon) on Aug 20, 2018 at 01:53 UTC
Another idea for you: I think it is straightforward and "simple" in concept. #!/usr/bin/perl use strict; use warnings; my %mycorpus = ( a => "<blah blah blah blah title:#this is text I want 1# blah blah blah", b => "blah title:#this is text I do not want# blah title:#this is text I want 2# blah title:#this is text I do not want# blah", c => "blah blah title:#this is text I want 3# title:#this is text I do not want# title:#this is text I do not want# title:#this is text I do not want# blah", ); foreach my $filename (sort keys %mycorpus) { foreach my $line (split /\n/,$mycorpus{$filename}) { if ($line =~ /^title:#(.)#\s$/) { print "$line\n"; last; } } } __END__ title:#this is text I want 1# title:#this is text I want 2# title:#this is text I want 3# [download]	[reply] [d/l]
Re^2: Remove all duplicates after regex capture by Maire (Scribe) on Aug 20, 2018 at 06:59 UTC
Thank you!	[reply]


"be consistent"
	PerlMonks