PerlMonks  

Split and print hash based on regex

by Maire (Scribe)
on Mar 27, 2018 at 14:04 UTC ( #1211848=perlquestion )

Maire has asked for the wisdom of the Perl Monks concerning the following question:

Good afternoon all,

I have a hash containing thousands of lines of text from hundreds of different files. Every time a certain phrase appears in the hash (in the SSCCE below, "This is"), I want to create a new txt file which prints both the phrase and all subsequent text until we reach the next "This is".

So, for instance, I had hoped that the script below would create six text files (named UserA_1, UserA_2 etc.) where the first file contained the text "This is line 1 from text 1 another line here which should be included in the text file with the above line.", the second file contained the text "This is line 2 from text 1", and so on.

However, although the script below creates the 6 new text files (and names them appropriately), it does not actually print anything into the files.

    #!/usr/bin/perl
    use strict;
    use warnings;

    #SSCCE:
    my %mycorpus = (
        text1 => "This is line 1 from text 1
    another line here which should be included in the text file with the above line.
    This is line 2 from text 1
    This is line 3 from text 1",
        text2 => "This is line 1 from text 2
    This is line 2 from text 2
    another line here which should be included in the text file with the above line.
    This is line 3 from text 2",
    );

    my $count = 1;
    foreach my $filename (sort keys %mycorpus) {
        my $outfile;
        while ($mycorpus{$filename} =~ /This is/g) {
            close $outfile if $outfile;
            open $outfile, '>', "UserA_$count.txt" or die "could not open";
            $count++;
            print {$outfile} $_;
        }
    }

I have been working on this script for nearly a week, but I can't spot my mistake(s), and thus I would be very grateful for any help.

EDIT:
I probably should have mentioned in my original post that my code here is based on a more basic script that I use to split and print text NOT stored in a hash. This script (reproduced as an SSCCE below) works successfully and returns the desired output.

    my $count = 1;
    my $outfile;
    while (<DATA>) {
        if ( my($regex) = /This is/g) {
            close $outfile if $outfile;
            open $outfile, '>', "UserA$1_$count.txt" or die "could not open 'UserA$regex.txt' $!";
            $count++;
        }
        print {$outfile} $_;
    }
    __DATA__
    This is line 1 from text 1
    another line here which should be included in the text file with the above line.
    This is line 2 from text 1
    This is line 3 from text 1
    This is line 1 from text 2
    This is line 2 from text 2
    another line here which should be included in the text file with the above line.
    This is line 3 from text 2

Replies are listed 'Best First'.
Re: Split and print hash based on regex
by AnomalousMonk (Bishop) on Mar 27, 2018 at 18:53 UTC

    WRT your first SSCCE:   while does not automatically assign the result of its CONDITION evaluation to  $_ (in contrast to the
        while (<FILEHANDLE>) { do_something_with($_); }
    special case):

        c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
         "foreach my $filename (qw(a b c)) {
            dd 'before while loop, $filename is', $filename;
            while ($filename) {
              dd 'in while loop, $_ is', $_;
              last;
              }
            }
         "
        ("before while loop, \$filename is", "a")
        ("in while loop, \$_ is", undef)
        ("before while loop, \$filename is", "b")
        ("in while loop, \$_ is", undef)
        ("before while loop, \$filename is", "c")
        ("in while loop, \$_ is", undef)
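
The same contrast can be sketched as two small subs. This is only an illustration, not code from the thread: the sub names `topic_in_readline_loop` and `topic_in_match_loop` are made up here, and the in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: record what $_ holds on each pass of a readline loop.
sub topic_in_readline_loop {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @seen;
    local $_;                 # do not clobber the caller's $_
    while (<$fh>) {           # special case: while (defined($_ = <$fh>))
        chomp;
        push @seen, $_;       # $_ is the line just read
    }
    close $fh;
    return @seen;
}

# Hypothetical helper: record what $_ holds on each pass of a match loop.
sub topic_in_match_loop {
    my ($text) = @_;
    my @seen;
    local $_;                 # start from an undefined $_
    while ($text =~ /This is/g) {
        push @seen, $_;       # still undef: the match never assigns to $_
    }
    return @seen;
}
```

Called on a string with two occurrences of "This is", the second sub should return two undefined values, which is why `print {$outfile} $_;` in the first SSCCE writes nothing useful.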


    Give a man a fish:  <%-{-{-{-<

      Great, thank you!
Re: Split and print hash based on regex
by choroba (Archbishop) on Mar 27, 2018 at 14:11 UTC
    print {$outfile} $_;

    Where do you populate $_?

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      This is a very good question, thanks! And I'm guessing from your response that this may lie at the heart of the problem? The original script that I am working with (reproduced now in an edit to my original post) used very similar syntax successfully, but I need to think about how the original script manages to populate $_ and my modified script doesn't.
        while (<DATA>)

        is equivalent to

        while ($_ = <DATA>)

        which is interpreted as

        while (defined($_ = <DATA>))

        So that's how $_ is populated in the original script.

        There's another question, though: How $1 is populated. Note that the matching uses =, not =~, so it's equivalent to

        my($regex) = ($_ =~ /This is/g)
        where the parentheses after my enforce the list context on the match, but without a capture group in the regex, there's no way to populate $1.
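
A minimal sketch of that point, assuming a made-up helper named `match_report` (it is not part of the original script): the list-context match hands the matched text to the variable on the left, while `$1` is only set when the pattern contains a capture group.

```perl
use strict;
use warnings;

# Hypothetical helper: run the pattern in list context, with or without a
# capture group, and report what the assignment received and what $1 holds.
sub match_report {
    my ($line, $with_group) = @_;
    my $re = $with_group ? qr/(This is)/ : qr/This is/;

    # The parentheses around $first force list context, as my($regex) does.
    my ($first) = $line =~ /$re/g;

    my $group = $1;           # copy $1 before leaving the sub
    return ($first, $group);
}
```

For a line such as "This is line 1", both variants should return "This is" as the first value, but only the variant with the capture group returns a defined second value.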

Re: Split and print hash based on regex
by Cristoforo (Curate) on Mar 27, 2018 at 20:10 UTC
    Here is a possible solution that makes use of the three-argument open with a reference to a scalar in place of the filename, which opens a filehandle on the string held in that scalar. This works if all the data is in a hash.
        #!/usr/bin/perl
        use strict;
        use warnings;

        #SSCCE:
        my %mycorpus = (
            text1 => "This is line 1 from text 1
        another line here which should be included in the text file with the above line.
        This is line 2 from text 1
        This is line 3 from text 1",
            text2 => "This is line 1 from text 2
        This is line 2 from text 2
        another line here which should be included in the text file with the above line.
        This is line 3 from text 2",
        );

        my $count = 1;
        foreach my $filename (sort keys %mycorpus) {
            my $outfile;
            open my $fh, '<', \$mycorpus{$filename} or die $!;
            while (<$fh>) {
                chomp;
                if (/^This is/) {
                    close $outfile if $outfile;
                    my $out = "UserA_$count.txt";
                    open $outfile, '>', $out or die "could not open '$out' for writing $!";
                    $count++;
                }
                print $outfile $_, "\n" if $outfile;
            }
        }
    Edit: added conditional to print command ('if $outfile')

    Edit2: The solution offered by tybalt89 (Re: Split and print hash based on regex) is better than this one. His does not rely on the identifying phrase being at the front of a line of text. The post by jh is also better than this one.
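
The in-memory filehandle idea can be tried without writing any files. The following is a sketch under stated assumptions: `chunks_by_line` is a hypothetical name, and it returns the chunks as a list instead of printing them to `UserA_N.txt` files.

```perl
use strict;
use warnings;

# Hypothetical helper: read a string line by line through an in-memory
# filehandle and start a new chunk at every line that begins with $phrase.
sub chunks_by_line {
    my ($text, $phrase) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @chunks;
    while (my $line = <$fh>) {
        if ($line =~ /^\Q$phrase\E/) {
            push @chunks, $line;        # phrase at line start: new chunk
        }
        elsif (@chunks) {
            $chunks[-1] .= $line;       # continuation of the current chunk
        }
        # lines before the first phrase are skipped, like 'if $outfile' above
    }
    close $fh;
    return @chunks;
}
```

Like the solution above, this only recognises the phrase at the start of a line, which is the limitation noted in Edit2.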

      Ah, very nice solution, thanks! I wasn't aware that one could "open" part of a hash in this way: that tip will save me a lot of time in the future!
Re: Split and print hash based on regex
by tybalt89 (Prior) on Mar 27, 2018 at 21:53 UTC
        #!/usr/bin/perl
        use strict;
        use warnings;

        #SSCCE:
        my %mycorpus = (
            text1 => "This is line 1 from text 1
        another line here which should be included in the text file with the above line.
        This is line 2 from text 1
        This is line 3 from text 1",
            text2 => "This is line 1 from text 2
        This is line 2 from text 2
        another line here which should be included in the text file with the above line.
        This is line 3 from text 2",
        );

        my $count = 1;
        foreach my $filename (sort keys %mycorpus) {
            for ( $mycorpus{$filename} =~ /This is(?:(?!This is).)*/sg ) {
                my $outputname = 'UserA_' . $count++ . '.txt';
                open my $outfile, '>', $outputname or die "$! opening $outputname";
                print $outfile "$_\n"; # \n only if desired
                close $outfile;
            }
        }

        # for testing file contents
        system "more UserA* | cat";
      Thanks!
Re: Split and print hash based on regex
by jh (Beadle) on Mar 27, 2018 at 16:05 UTC
    Considering you use the word "split" in the title of your post, it's funny you aren't using split to process the text.
        our $all_text = join "", <ARGV>;    # files, STDIN, etc.
        our $key_phrase = "This is ";       # should not be hard-coded
        our $base_name = "UserA_";
        our $ext = ".txt";

        our @bits = split m/\Q$key_phrase\E/, $all_text;

        # if line 1 data includes the key phrase, element 1 will be empty:
        shift @bits if $all_text =~ m/^\Q$key_phrase\E/;

        my $count = 1;
        foreach my $bit (@bits) {
            # suggest padding the index number so files sort correctly
            my $filename = sprintf "%s%2.2d%s", $base_name, $count++, $ext;
            open FILE, ">", $filename
                or die "Could not write to \"$filename\": $!\n";
            print FILE "$key_phrase$bit";   # put back what split() excised
            close FILE;
        }
    This solution assumes that you can read all the data into memory, of course, but unless it's a million lines or an ongoing TCP/IP connection or something, I rarely have issues with that.
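
As a possible variation on the split approach (a sketch, not part of the reply above), splitting on a zero-width lookahead leaves the key phrase attached to each piece, so nothing has to be put back afterwards. The sub name `split_keeping_phrase` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text immediately before each occurrence of
# $phrase. The lookahead (?=...) matches a position, not characters, so the
# phrase itself stays at the front of each returned piece.
sub split_keeping_phrase {
    my ($text, $phrase) = @_;
    return split /(?=\Q$phrase\E)/, $text;
}
```

A zero-width match at the very start of the string does not produce an empty leading field, so no `shift` is needed when the text begins with the phrase. If there is text before the first occurrence, it comes back as the first element and the caller has to decide whether to keep it.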
      Thanks for this. I've never (successfully!) worked with the split function before, but your script exemplifies it in a way that I (as a relative newbie) can understand, thanks!
Re: Split and print hash based on regex
by bliako (Prior) on Mar 27, 2018 at 15:29 UTC

    Your regex does not capture anything. Shouldn't it be capturing from one "This is" to the next "This is"?

      What I was trying to do is ask it to look for the "This is" and then print that and everything else until the next "This is" (as opposed to capturing the text as such, if that makes sense!). This method worked successfully in the original script (not using hashes), which I've now reproduced above. However, I will look into using a capturing regex to see if I can get this modified script to work successfully that way instead, thanks!
