Re: question about finding strings (regexes and slurping files)

Hello SaraBetsy and welcome to the monastery and to the wonderful world of Perl!

You already got some good advice, so I just want to clarify a few things.

> should grab the 25 characters before and after it

That is not what the regex you posted does: it grabs from 0 to 25 chars before and after the string. As already said, the gsix modifiers must go outside the regular expression, after the closing delimiter: m/.../gsix
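To illustrate the point about modifiers, here is a minimal sketch (the sample text and the word "gene" are made up for illustration, they are not from your post). Letters placed before the closing delimiter are part of the pattern, while letters after it act as modifiers:

```perl
use strict;
use warnings;

my $text = 'The GENE is here';    # hypothetical sample text

# Modifier letter inside the delimiters: it is a literal pattern character,
# so this looks for the five characters "genei" and fails.
my $inside = $text =~ /genei/ ? 1 : 0;

# Modifier after the closing delimiter: case-insensitive match succeeds.
my $outside = $text =~ /gene/i ? 1 : 0;

print "inside: $inside, outside: $outside\n";    # prints "inside: 0, outside: 1"
```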

Let's adapt your regex to match 0-3 chars before and after the letter X, using /.{0,3}X.{0,3}/ against some strings:

# regex /.{0,3}X.{0,3}/
#
# string       matched part

123X123         123X123
12X123          12X123
1X123           1X123
X123            X123
X123456         X123

Now compare this with the output of the /.{3}X.{3}/ regex against the same set of strings:

# regex /.{3}X.{3}/
#
# string       matched part

123X123         123X123
12X123          -no match-
1X123           -no match-
X123            -no match-
X123456         -no match-

In fact the second version requires exactly 3 chars on each side of X, so it matches only when at least 3 chars are available both before and after X.
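If you want to reproduce the two tables yourself, a small script along these lines should do it. This is only a sketch: the helper name matched_part is my own invention, not something from your code.

```perl
use strict;
use warnings;

my @strings = qw(123X123 12X123 1X123 X123 X123456);

# Return the part of $string matched by $pattern,
# or a placeholder when nothing matches.
sub matched_part {
    my ($pattern, $string) = @_;
    return $string =~ /($pattern)/ ? $1 : '-no match-';
}

# First the {0,3} version, then the {3} version.
for my $pattern ( '.{0,3}X.{0,3}', '.{3}X.{3}' ) {
    print "# regex /$pattern/\n";
    printf "%-15s %s\n", $_, matched_part($pattern, $_) for @strings;
    print "\n";
}
```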

Now a little note about slurping files. When you slurp a file, its whole content goes directly into memory, probably with some overhead, so 100 MB of file data will use at least 100 MB of RAM. As you will be working in bioinformatics, possibly with big files, it is better to understand this early.

If you process the file one line at a time, memory consumption is minimal. The diamond operator <> is a powerful beast in Perl and, like many other things in Perl, it behaves differently depending on the context in which it is used.

# open my $fh, '<', $file_path or die "unable to read $file_path: $!";

# list context: every line goes into the array
my @all_lines = <$fh>;


# scalar context: just the next line goes into a scalar (<> acts as an iterator here)
my $line = <$fh>;

# so, to read a file one line at a time:
while (defined( my $line = <$fh> )) {
    # process $line here
}

See How to read in large files
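Putting the two points together, here is a minimal sketch of reading line by line while collecting up to 25 chars of context around each hit. The sample data and the search string 'foo' are placeholders of mine, and the in-memory filehandle only keeps the example self-contained; a real script would open a file path instead.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for a real file.
my $data = "aaa foo bbb\nno hit here\nfoo at start and foo again\n";
open my $fh, '<', \$data or die "unable to open in-memory file: $!";

my $target = 'foo';    # placeholder search string
my @hits;

while ( defined( my $line = <$fh> ) ) {
    chomp $line;
    # /g in list context returns every non-overlapping match on the line;
    # \Q...\E protects any regex metacharacters in $target.
    push @hits, $line =~ /(.{0,25}\Q$target\E.{0,25})/g;
}
close $fh;

print "$_\n" for @hits;
```

Note that the matches do not overlap: when a second occurrence of the string falls inside the context of the first one (as in the third sample line), it is swallowed by that match and not reported separately.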

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.


Re^2: question about finding strings (regexes and slurping files)
by ww (Archbishop) on Nov 28, 2017 at 16:18 UTC

First, EXCELLENT POINTS in Discipulus' reply above.

Refresher: OP wants to capture a specific word and some words around it (no definition of why) in any lines of a moderately large text file which contain the specific word -- for some sort of corpus analysis.

Now, for another approach to the regex, we can define $word (to capture the 'nearby words' OP wanted) in terms of alpha-content rather than by character-counting. (As done here, we must insert single spaces between the words in the regex itself, but we could have included most of them in the definition of $word):

 #!/usr/bin/perl
 use strict;
 use warnings;
 use 5.024;
 
 # 1204380
 
 my $string = 'tryna';
 my $word = qr/[a-z]+/i;
 #           any word composed solely of letters a-z, upper or lower case;
 #           the separating whitespace is supplied in the regex below
 my (@slurp, $line, @found, $found);
 #           declare vars; bad practice to do as globals, but simpler to read
 
 @slurp = <DATA>;      # read every line of __DATA__ into the array @slurp
 for $line (@slurp) {  # walk through @slurp line by line
     # Match only if there are two $word instances before $string and a
     # space following the second $word after $string
     if ( $line =~ /($word\s$word\s$string\s$word\s$word\s)/gix ) {
         # on a match, push the captured text (plus a visual marker) to @found
         push @found, "\tmatch: $1";
         print "full original line with a match: $line\n";
     }
 }
 
 for $found(@found) {
     say $found;
 }
 
 __DATA__
 
 123 abcde this sentence has foo bar tryna much too long for my taste  CONTAINS MATCH
 this doesn't have the magic phrase 123456 7890 abcd3e fc.
 much too long for my taste but tryno tryna foo bar baz   CONTAINS MATCH
 work was put into the tryna document which shows good work CONTAINS MATCH
 problems with our out of town and other tryna that never show up  CONTAINS MATCH
 Tryna fill to gully and TRYNA upside of big Pine CONTAINS MATCH TWICE BUT ...
    ...FAILS ON Ln 15 BECUZ THERE IS NO $word NOR ANY SPACE ...
    ...PRECEEDING THE FIRST INSTANCE OF $string!
 no searchstring here
 endit

And here is the output (the full lines are redundant to OP's stated needs but are included for clarity):

F:\PMonks\>1204380.pl
full original line with a match:  123 abcde this sentence has foo bar tryna much too long for my taste  CONTAINS MATCH

full original line with a match:  much too long for my taste but tryno tryna foo bar baz   CONTAINS MATCH

full original line with a match:  work was put into the tryna document which shows good work CONTAINS MATCH

full original line with a match:  problems with our out of town and other tryna that never show up  CONTAINS MATCH

full original line with a match:  Tryna fill to gully and TRYNA upside of big Pine CONTAINS MATCH TWICE BUT ...

        match: foo bar tryna much too
        match: but tryno tryna foo bar
        match: into the tryna document which
        match: and other tryna that never
        match: gully and TRYNA upside of

F:\PMonks>
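As the output shows, the first 'Tryna' in the last matching line is missed because no words precede it. One possible variation (my own sketch, not part of the script above) makes the surrounding words optional with {0,2} and collects every hit on a line by using /g in list context. The sample line is taken from the __DATA__ section, without its trailing annotation:

```perl
use strict;
use warnings;

my $string = 'tryna';
my $word   = qr/[a-z]+/i;

my $line = 'Tryna fill to gully and TRYNA upside of big Pine';

# Up to two words before and after $string; \b keeps $string a whole word.
# With one capture group, /g in list context returns every captured match.
my @found = $line =~ /((?:$word\s+){0,2}\b$string\b(?:\s+$word){0,2})/gi;

print "match: $_\n" for @found;
# prints:
# match: Tryna fill to
# match: gully and TRYNA upside of
```

The trade-off is that a hit near the start or end of a line now comes back with fewer than two context words, so whether this suits the corpus analysis depends on what OP wants to count.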

Spirit of the Monastery
