Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Remove all duplicates after regex capture

by Maire (Scribe)
on Aug 19, 2018 at 09:45 UTC ( [id://1220606]=perlquestion: print w/replies, xml ) Need Help??

Maire has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks. I'm hoping that you can help me with what is probably quite a simple problem, but which is completely stumping me!

I'm trying to work with text files which are fairly disorganized in terms of structure, and which have been stored in a hash. What I want from each file is to extract the line of text which begins "title:#" and ends with a "#" and store this text in a scalar to be used later.

The problem arises because, within some of the text files, there are multiple lines which begin and end with "title#" and "#" respectively. What distinguishes the "titles" I want is that they only ever appear in each text file once, whereas the "titles" I do not want appear at least twice (but sometimes three or four times) in the same text file.

So this is the basic script that I am using, which prints out all the titles

use warnings; use strict; my %mycorpus = ( a => "<blah blah blah blah title:#this is text I want 1# blah blah blah", b => "blah title:#this is text I do not want# blah title:#this is text I want 2# blah title:#this is text I do not want# blah", c => "blah blah title:#this is text I want 3# title:#this is text I do not want# title:#this is text I do not want# title:#this is text I do not want# blah", ); foreach my $filename (sort keys %mycorpus) { my $titles = ''; while ($mycorpus{$filename} =~ /title:#(.*?)#/g){ $titles = $1; print "$titles \n"; } }

The script above, obviously, prints out all 8 captured lines which begin with "title", but my desired output is:
this is text I want 1 this is text I want 2 this is text I want 3

I thought it might work to add a line in after the titles have been captured which (was supposed to!) remove all lines which appeared multiple times, but my attempts at this failed

use warnings; use strict; my %mycorpus = ( a => "<blah blah blah blah title:#this is text I want 1# blah blah blah", b => "blah title:#this is text I do not want# blah title:#this is text I want 2# blah title:#this is text I do not want# blah", c => "blah blah title:#this is text I want 3# title:#this is text I do not want# title:#this is text I do not want# title:#this is text I do not want# blah", ); foreach my $filename (sort keys %mycorpus) { my $titles = ''; while ($mycorpus{$filename} =~ /title:#(.*?)#/g){ $titles = $1; $titles =~ s/(\b\w+\b)(?:\s*\1)+/$1/g; print "$titles \n"; } }

I hope that makes sense: I have been working on this for about 9 hours now, so I'm a bit frazzled! Any help/pointers here would be very much appreciated! Thank you!

Replies are listed 'Best First'.
Re: Remove all duplicates after regex capture
by haukex (Archbishop) on Aug 19, 2018 at 10:58 UTC

    Here's a solution that doesn't depend on the order of the lines. It finds all the titles using a regex and counts them using a hash, and then selects the one that appears exactly once, warning or dieing if there isn't exactly one. You haven't specified a few things about the title: line, like whether there can be #'s in the titles, and what kind of text might appear after the closing #.

    use warnings; use strict; my %mycorpus = ( a => "<blah blah blah blah title:#this is text I want 1# blah blah blah", b => "blah title:#this is text I do not want# title:#this is text I want 2# blah title:#this is text I do not want# blah", c => "blah blah title:#this is text I do not want# title:#this is text I want 3# title:#this is text I do not want# title:#this is text I do not want# blah", ); for my $filename (sort keys %mycorpus) { my %titles; $titles{$1}++ while $mycorpus{$filename} =~ m{ ^ title:\# (.*) \# }xmg; my @once = grep { $titles{$_}==1 } sort keys %titles; die "No title found in $filename" unless @once; warn "More than one title found in $filename" if @once>1; my $title = $once[0]; print "$title\n"; } __END__ this is text I want 1 this is text I want 2 this is text I want 3
      Brilliant, thank you very much! (Also, thanks for including examples of how to implement "die" and "warn": I rarely use these signals myself, but having just run your code on my data, it allowed me to identify a major error in the formatting of some of the data that would have caused a major headache later!).
Re: Remove all duplicates after regex capture
by FreeBeerReekingMonk (Deacon) on Aug 19, 2018 at 09:59 UTC
    Edit: note the flaw pointed out by Hautex. I did not understand the question correctly, and did not check uniqueness of the correct title.

    How about this loop?

    foreach my $filename (sort keys %mycorpus) { my $titles = ''; my $counter = 0; while ($mycorpus{$filename} =~ /title:#(.*?)#\s*$/gm){ if($counter++){ last if $counter++; # skip the rest of the matches # can also be used to print warnings about multiple titles # and check $1 against $titles if they are the same, or not }else{ $titles = $1; # first match, we can store it, print "$titles \n"; # or print it out } } }

    the output is

    this is text I want 1 this is text I want 2 this is text I want 3

    You can also replace the while with an if, and then it just matches the first title# .

    foreach my $filename (sort keys %mycorpus) { my $titles = ''; if ($mycorpus{$filename} =~ /title:#(.*?)#\s*$/m){ $titles = $1; print "$titles \n"; } }

    The output is the same. I think you wanted the multiline regexp modifier to match a newline inside your filedump string.

    edit: better structure to allow more post-work (commented what can be done there). Did also remove the /g (go) modifier in the "if" example as it is not needed there.

      This works perfectly! Thank you very much for your help/time. I feel a bit daft after seeing how (relatively) simple the solution actually was, but I've learned a lot from your code here. Thanks! EDIT: Ah, thanks for the reworking! It was probably my sleep-drived incoherant question that caused the confusion! Thanks again.

        Note that FreeBeerReekingMonk's solution only works because it relies on the blah on the line "title:#this is text I do not want# blah", and it only grabs the first title:. If I remove the blah or reorder the lines in the third example, it does not work.

Re: Remove all duplicates after regex capture
by AnomalousMonk (Archbishop) on Aug 19, 2018 at 17:22 UTC

    This is essentially the same approach haukex uses, but with some different features:

    • Titles are normalized before being added to the  %titles duplication detection hash according to their normalized form. This means that, in this implementation,
          'This is a title'
      and
          'This  IS   A   Title'
      are the same for the purpose of rejecting duplicate titles. This may or may not be what Maire wants; it's just an example of what's possible. The extent of normalization is easily adjusted.
    • Titles are allowed to wrap from one line to the next. This applies to both unique and duplicate titles. Again, Maire may not want this, and it's easily turned off.
    • A title may not be the only thing on a line. Another thing that's easy to change.
    • A  '*' is used in place of the original  '#' character to delimit titles. This is done only to demonstrate that any character, even a regex metacharacter, could be used as a delimiter. (It wouldn't be that hard to change the regexes to accommodate multi-character delimiter sequences.)
    Script remove_dup_lines_1.pl: Output:
    c:\@Work\Perl\monks\Maire>perl remove_dup_lines_1.pl a: 'this is text I want 1' b: 'this is text I want one' c: 'this is text I want A' c: 'this is text I want over multiple lines B' c: 'this is text I want C'


    Give a man a fish:  <%-{-{-{-<

      Thank you very much for this (and also for your very clear explanations!).
Re: Remove all duplicates after regex capture
by Marshall (Canon) on Aug 20, 2018 at 01:53 UTC
    Another idea for you:
    I think it is straightforward and "simple" in concept.
    #!/usr/bin/perl use strict; use warnings; my %mycorpus = ( a => "<blah blah blah blah title:#this is text I want 1# blah blah blah", b => "blah title:#this is text I do not want# blah title:#this is text I want 2# blah title:#this is text I do not want# blah", c => "blah blah title:#this is text I want 3# title:#this is text I do not want# title:#this is text I do not want# title:#this is text I do not want# blah", ); foreach my $filename (sort keys %mycorpus) { foreach my $line (split /\n/,$mycorpus{$filename}) { if ($line =~ /^title:#(.*)#\s*$/) { print "$line\n"; last; } } } __END__ title:#this is text I want 1# title:#this is text I want 2# title:#this is text I want 3#
      Thank you!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1220606]
Approved by Athanasius
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (5)
As of 2024-04-23 21:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found