PerlMonks  

Re: question about finding strings (regexes and slurping files)

by Discipulus (Canon)
on Nov 28, 2017 at 08:21 UTC


in reply to question about finding strings?

Hello SaraBetsy and welcome to the monastery and to the wonderful world of Perl!

You already got some good advice, so I just want to clarify a few things.

> should grab the 25 characters before and after it

That is not what the regex you posted does: it grabs from 0 to 25 characters before and after the string. As already said, the gsix modifiers must go outside the regular expression, after the closing delimiter: m/.../gsix
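To make the placement of the modifiers concrete, here is a minimal sketch. The sample text and the search word 'tryna' are assumptions chosen for illustration; they are not the original poster's data or regex.

```perl
use strict;
use warnings;

# Hypothetical sample text; the embedded newline shows the effect of /s.
my $text   = "aaa bbb\nccc TRYNA ddd eee";
my $target = 'tryna';    # assumed search string

# The modifiers follow the closing delimiter:
#   g = find all matches        s = let . match newlines too
#   i = ignore case             x = ignore whitespace inside the pattern
my @hits = $text =~ m/ ( .{0,25} \Q$target\E .{0,25} ) /gsix;

print "[$_]\n" for @hits;
```

In list context with /g, the match returns the captured group of every hit, so @hits holds one entry per occurrence together with up to 25 characters of context on each side.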

Let's use your regex to match 0-3 chars before and after the letter X using: /.{0,3}X.{0,3}/ against some strings:

    # regex /.{0,3}X.{0,3}/
    #
    # string     matched part
    123X123      123X123
    12X123       12X123
    1X123        1X123
    X123         X123
    X123456      X123

Now compare that with the output of the /.{3}X.{3}/ regex against the same set of strings:

    # regex /.{3}X.{3}/
    #
    # string     matched part
    123X123      123X123
    12X123       -no match-
    1X123        -no match-
    X123         -no match-
    X123456      -no match-

In fact, the second version needs exactly 3 characters directly before and after X, so a string matches only if it has at least 3 characters on each side of X.
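The two result listings above can be reproduced with a short script. This is only a sketch: the helper name matched_part is made up for the example, and the test strings are the ones from the listings.

```perl
use strict;
use warnings;

my @strings = qw(123X123 12X123 1X123 X123 X123456);

# Return the part of $string matched by $regex, or '-no match-' on failure.
# (matched_part is a hypothetical helper name.)
sub matched_part {
    my ($regex, $string) = @_;
    return $string =~ $regex ? $& : '-no match-';
}

for my $pattern ('.{0,3}X.{0,3}', '.{3}X.{3}') {
    my $regex = qr/$pattern/;
    print "# regex /$pattern/\n";
    printf "%-10s %s\n", $_, matched_part($regex, $_) for @strings;
    print "\n";
}
```

A longer string such as 1234X1234 still matches the second regex, but only the part 234X123 is captured, because .{3} takes exactly three characters on each side.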

Now a little note about slurping files. When you slurp a file, all of it goes directly into memory, probably with some overhead, so 100 MB of file data will use at least 100 MB of RAM. Since you will be working in bioinformatics, possibly with big files, it's better to understand this early.

If you process the file one line at a time, the memory consumption is minimal. The diamond operator <> is a powerful beast in Perl and, like many other things in Perl, it acts differently depending on the context it is used in.

    open my $fh, '<', $file_path or die "unable to read $file_path: $!";

    # list context: every line goes in the array
    my @all_lines = <$fh>;

    # scalar context: just the next line goes into a scalar
    # (<> acts as an iterator here)
    my $line = <$fh>;

    # so to read a file one line at a time:
    while (defined( my $line = <$fh> )) {
        # process $line here
    }

See How to read in large files
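Putting the two topics together, here is a sketch of searching a file line by line while keeping only the hits in memory. The in-memory filehandle, the word TARGET and the 10-character window are assumptions for illustration; with a real file you would open the path instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a large file: an in-memory filehandle,
# so the sketch runs without touching the disk.
my $data = "first line without the word\n"
         . "some text before TARGET and some text after\n"
         . "last line\n";
open my $fh, '<', \$data or die "unable to open in-memory file: $!";

my @found;
while ( defined( my $line = <$fh> ) ) {
    chomp $line;
    # collect every hit on this line with up to 10 chars of context;
    # $. holds the current line number of the filehandle
    while ( $line =~ /(.{0,10}TARGET.{0,10})/g ) {
        push @found, "line $.: $1";
    }
}
close $fh;

print "$_\n" for @found;
```

Only one line is held in memory at a time, so the cost depends on the longest line and the number of hits rather than on the size of the file. The price is that a match can never span a line break.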

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Re^2: question about finding strings (regexes and slurping files)
by ww (Archbishop) on Nov 28, 2017 at 16:18 UTC

    First, EXCELLENT POINTS in Discipulus' reply above.

    Refresher: OP wants to capture a specific word and some words around it (no definition of why) in any lines of a moderately large text file which contain the specific word -- for some sort of corpus analysis.

    Now, for another approach to the regex, we can define $word (to capture the 'nearby words' OP wanted) in terms of alpha-content rather than by character-counting. (As done here, we must insert single spaces between the words in the regex itself, but we could have included most of them in the definition of $word):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use 5.024;

        # 1204380

        my $string = 'tryna';

        # any word comprised solely of letters a-z, UC or LC
        # (the separating spaces are written in the regex below)
        my $word = qr/[a-z]+/i;

        # declare vars; bad practice to do as globals, but simpler to read
        my (@slurp, $line, @found, $found);

        @slurp = <DATA>;        # read each line of __DATA__ into @slurp

        for $line (@slurp) {    # read thru array @slurp line by line
            # Match only if there are two $word instances before $string and a
            # space following the second $word after $string
            if ( $line =~ /($word\s$word\s$string\s$word\s$word\s)/gix ) {
                # When the regex above matches, push the match
                # (+ a visual marker) to @found
                push @found, "\tmatch: $1";
                print "full original line with a match: $line\n";
            }
        }

        for $found (@found) {
            say $found;
        }

        __DATA__
        123 abcde this sentence has foo bar tryna much too long for my taste CONTAINS MATCH
        this doesn't have the magic phrase
        123456 7890 abcd3e fc.
        much too long for my taste but tryno tryna foo bar baz CONTAINS MATCH
        work was put into the tryna document which shows good work CONTAINS MATCH
        problems with our out of town and other tryna that never show up CONTAINS MATCH
        Tryna fill to gully and TRYNA upside of big Pine CONTAINS MATCH TWICE BUT ...
        ...FAILS ON Ln 15 BECUZ THERE IS NO $word NOR ANY SPACE ...
        ...PRECEEDING THE FIRST INSTANCE OF $string!
        no searchstring here
        endit

    And here is the output (the full lines are redundant to OP's stated needs but are included for clarity):

        F:\PMonks\>1204380.pl
        full original line with a match: 123 abcde this sentence has foo bar tryna much too long for my taste CONTAINS MATCH
        full original line with a match: much too long for my taste but tryno tryna foo bar baz CONTAINS MATCH
        full original line with a match: work was put into the tryna document which shows good work CONTAINS MATCH
        full original line with a match: problems with our out of town and other tryna that never show up CONTAINS MATCH
        full original line with a match: Tryna fill to gully and TRYNA upside of big Pine CONTAINS MATCH TWICE BUT ...
                match: foo bar tryna much too
                match: but tryno tryna foo bar
                match: into the tryna document which
                match: and other tryna that never
                match: gully and TRYNA upside of
        F:\PMonks>
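One possible variation on the word-based regex above, so that an occurrence at the very start or end of a line is also reported. This is a sketch, not the code from the reply: it makes the surrounding words optional (0 to 2 on each side), allows runs of whitespace with \s+, and collects every hit on a line in a while loop. The sub name context_matches is made up for the example.

```perl
use strict;
use warnings;

my $string = 'tryna';
my $word   = qr/[a-z]+/i;

# Return every occurrence of $string in $line with up to two words of
# context on each side. The context words are optional, so a hit with no
# preceding (or following) word is still found.
# (context_matches is a hypothetical helper name.)
sub context_matches {
    my ($line) = @_;
    my @hits;
    while ( $line =~ /((?:$word\s+){0,2}\b$string\b(?:\s+$word){0,2})/gi ) {
        push @hits, $1;
    }
    return @hits;
}

print "match: $_\n"
    for context_matches('Tryna fill to gully and TRYNA upside of big Pine');
```

Because the words consumed as trailing context of one hit are not available as leading context of the next, two occurrences that sit close together share less context than two that are far apart.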
