PerlMonks  

grep question using multiple lines

by bradcathey (Prior)
on Dec 28, 2008 at 00:15 UTC (#732848=perlquestion)
bradcathey has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monasterians,

I'm trying to isolate some code in BBEdit using the grep functionality offered (Perl friendly). Here are 6 lines of text containing 2 email addresses (this is just an example; I could be looking for anything, actually):

f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
hfedls74d8oHFx constant=barney@gmail.com alUF9dejuH()NF
UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf
f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
hfedls74d8oHFx constant=wilma@aol.com alUF9dejuH()NFui0
UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf

and I want to end up with:

barney@gmail.com
wilma@aol.com

so far I have:

(?-m).+constant=(\w+@\w+\.com)(?-m).+

but keep ending up with:

f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
barney@gmail.com
UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf
f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
wilma@aol.com
UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf

Obviously, I need to match across multiple lines, but I'm not sure how to do that in BBEdit's grep as opposed to a Perl regexp.

—Brad
"The important work of moving the world forward does not wait to be done by perfect men." George Eliot

Re: grep question using multiple lines
by borisz (Canon) on Dec 28, 2008 at 00:34 UTC
    Use a proper email parser. My example parses all email addresses out of the text and then prints only those that start with constant=
    use Email::Address;

    # Parse every address-like token out of the sample text.
    my @add = Email::Address->parse(<<'__TXT__');
    f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
    hfedls74d8oHFx constant=barney@gmail.com alUF9dejuH()NF
    UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf
    f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
    hfedls74d8oHFx constant=wilma@aol.com alUF9dejuH()NFui0
    UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf
    __TXT__

    for my $add (@add) {
        local $_ = $add->address;
        # Keep only addresses prefixed with "constant=", stripping the prefix.
        next unless s/^constant=//;
        print $_, $/;
    }
    output:
    barney@gmail.com
    wilma@aol.com
    Boris

      Thanks, but like I so carefully pointed out, I'm not just looking for email addresses. I'm aware of all the parsing modules out there. This is just an exercise in grepping and looking for an academic answer.

      —Brad
      "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
        $Email::Address::addr_spec This regular expression defined what an email address is allowed to look like.
        (?-xism:(?-xism:(?-xism:(?-xism:(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|))*\s*\)\s*)))*\s*\)\s*)|\s+)*(?-xism:[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*)(?-xism:(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|))*\s*\)\s*)))*\s*\)\s*)|\s+)*)|(?-xism:(?-xism:(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|))*\s*\)\s*)))*\s*\)\s*)|\s+)*"(?-xism:(?-xism:[^\\"])|(?-xism:\\(?-xism:[^\x0A\x0D])))+"(?-xism:(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|))*\s*\)\s*)))*\s*\)\s*)|\s+)*))\@(?-xism:(?-xism:(?-xism:(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|))*\s*\)\s*)))*\s*\)\s*)|\s+)*(?-xism:[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*)(?-xism:(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|))*\s*\)\s*)))*\s*\)\s*)|\s+)*)|(?-xism:(?-xism:(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|))*\s*\)\s*)))*\s*\)\s*)|\s+)*\[(?:\s*(?-xism:(?-xism:[^\[\]\\])|(?-xism:\\(?-xism:[^\x0A\x0D]))))*\s*\](?-xism:(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|(?-xism:\s*\((?:\s*(?-xism:(?-xism:(?>[^()\\]+))|(?-xism:\\(?-xism:[^\x0A\x0D]))|))*\s*\)\s*)))*\s*\)\s*)|\s+)*)))
Re: grep question using multiple lines
by backstab (Novice) on Dec 28, 2008 at 02:01 UTC
    my $txt = <<'EOF';
    f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
    hfedls74d8oHFx constant=barney@gmail.com alUF9dejuH()NF
    UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf
    f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
    hfedls74d8oHFx constant=wilma@aol.com alUF9dejuH()NFui0
    UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf
    EOF

    # With /g in scalar context, each pass resumes after the previous match.
    while ($txt =~ /constant=(\w+@\w+\.\w+)/g) {
        print "$1\n";
    }
    This prints what you want for me. To understand what it does, note the /g flag on the match: in scalar context it makes each match resume where the previous one ended instead of restarting at the beginning of the string, which is what lets the while loop step through every match.
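As a side note for readers working in plain Perl rather than BBEdit: the same /g flag can also be used in list context, where the match returns every capture at once instead of stepping through a loop. A minimal sketch, assuming a shortened version of the sample data from the question and a hypothetical helper name extract_all:

```perl
use strict;
use warnings;

# Shortened sample text modelled on the question (two markers, separate lines).
my $txt = <<'EOF';
f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
hfedls74d8oHFx constant=barney@gmail.com alUF9dejuH()NF
UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf
hfedls74d8oHFx constant=wilma@aol.com alUF9dejuH()NFui0
EOF

# extract_all is a hypothetical helper name, not part of any module.
# In list context, a /g match returns the text of every capture group hit.
sub extract_all {
    my ($text) = @_;
    my @found = $text =~ /constant=(\w+@\w+\.\w+)/g;
    return @found;
}

print "$_\n" for extract_all($txt);
```

Whether the loop or the list form reads better is a matter of taste; the loop is handy when each match needs extra processing as it is found.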
Re: grep question using multiple lines
by eye (Chaplain) on Dec 28, 2008 at 03:05 UTC
    This seems to work, though it assumes that there is no more than one address per line:
    ^.*constant=(\w+@\w+\.\w+).*|^.*\r
    replaced by "\1". You can use "Replace All" and get the result you want. If you are willing to accept a multi-step solution, you can make this more robust and easily eliminate the assumption of no more than one address per line.

    In my usage, I'd be inclined to use a regex in "Process Lines Containing..." to eliminate lines without an email address. I'd then extract the email addresses from the remaining lines with a regex. I believe all of this could be automated in a BBEdit Text Factory.
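Outside BBEdit, the filter-then-extract idea described above might look like the following Perl sketch. The helper name two_step and the address fred@example.org are made up for illustration; this is not a BBEdit Text Factory, only an approximation of the two steps.

```perl
use strict;
use warnings;

# two_step is a hypothetical helper illustrating the filter-then-extract idea.
sub two_step {
    my (@lines) = @_;

    # Step 1: keep only the lines that contain the marker.
    my @keep = grep { /constant=/ } @lines;

    # Step 2: pull every address out of the remaining lines.
    # The /g match in list context copes with several markers per line.
    my @found;
    for my $line (@keep) {
        push @found, $line =~ /constant=(\w+@\w+\.\w+)/g;
    }
    return @found;
}

# Made-up sample lines; the last one carries two markers on purpose.
my @lines = (
    'hfedls74d8oHFx constant=barney@gmail.com alUF9dejuH()NF',
    'UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf',
    'x constant=wilma@aol.com y constant=fred@example.org z',
);

print "$_\n" for two_step(@lines);
```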

      My solution with the while loop works with many emails on the same line. In fact, it treats the text as a whole and ignores newlines entirely.

      The idea of the /g flag within a while loop is that each match starts where the previous one stopped, and the loop ends when there is no further successful match.

      The special array @- holds the start offsets of the last match: $-[0] is where the whole match starts and $-[1] is where the first capture group starts (the corresponding end offsets are in @+). Printing $-[0] might help to see what the loop does:

      while ($txt =~ /constant=(\w+@\w+\.\w+)/g) {
          print "==> match starts at $-[0]!!!\n";
          print "$1\n";
      }
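A small self-contained sketch of the offsets Perl records after a successful match: @- holds start offsets and @+ holds end offsets, both indexed by capture group, with index 0 standing for the whole match. The sample string and the helper name match_offsets are assumptions for illustration.

```perl
use strict;
use warnings;

my $txt = 'xx constant=a@b.com yy';   # made-up sample string

# match_offsets is a hypothetical helper returning the recorded offsets.
sub match_offsets {
    my ($text) = @_;
    return unless $text =~ /constant=(\w+@\w+\.\w+)/;
    my %o = (
        match_start => $-[0],   # where the whole match begins
        match_end   => $+[0],   # offset just past the whole match
        group_start => $-[1],   # where capture group 1 begins
        group_end   => $+[1],   # offset just past capture group 1
    );
    return \%o;
}

my $o = match_offsets($txt);
printf "match %d..%d, group 1 %d..%d\n",
    @{$o}{qw(match_start match_end group_start group_end)};
```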
        bradcathey (the OP) wrote:
        I'm trying to isolate some code in BBEdit using the grep functionality offered (Perl friendly).
        By my reading, the OP wants to know how to use the PCRE capability of BBEdit (or TextWrangler) to accomplish this task. While BBEdit has a mechanism for invoking scripts, I do not think that was what the OP was asking about. There are many merits to your answer, but it is not something that can be implemented directly in BBEdit.

      This is what I was looking for, basically. Your original pattern still failed to clear the in-between lines, but when I added one more \r before the alternation operator it worked to perfection:

      ^.*constant=(\w+@\w+\.\w+).*\r|^.*\r

      returned:

      barney@gmail.comwilma@aol.com

      Perfecto! Thanks eye.
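For comparison, the find-and-replace above can be approximated in Perl with a single substitution. This is only a sketch: it assumes lines end in \n where the BBEdit pattern uses \r, and bbedit_like_replace is a hypothetical name.

```perl
use strict;
use warnings;

# bbedit_like_replace is a hypothetical name; it mimics "Replace All" with \1.
# Assumption: lines end in "\n" here, where the BBEdit pattern used \r.
sub bbedit_like_replace {
    my ($text) = @_;
    # /m lets ^ match at the start of every line. /e evaluates the
    # replacement, so lines matched by the second alternative (which has
    # no capture) are replaced by the empty string.
    $text =~ s/^.*constant=(\w+@\w+\.\w+).*\n|^.*\n/defined $1 ? $1 : ''/gme;
    return $text;
}

my $sample = <<'EOF';
f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
hfedls74d8oHFx constant=barney@gmail.com alUF9dejuH()NF
UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf
f834bkg94halUF9deju hHFDUO()NFRS432 DSFadsfg94hHFDUO()N
hfedls74d8oHFx constant=wilma@aol.com alUF9dejuH()NFui0
UO()NFRS432 DSFadsf4halUF9deju fedls74d8oH sfg94hHFDUOf
EOF

print bbedit_like_replace($sample), "\n";
```

Because both alternatives consume the line ending, the addresses come out run together on one line; dropping the \n from the first alternative would leave one address per line instead.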

      —Brad
      "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
Re: grep question using multiple lines
by n3toy (Friar) on Dec 28, 2008 at 03:30 UTC
    You might be able to do it in one line.

    I am not sure exactly what the criteria for finding the data are. The example shows an email address, but you say the search text could be anything. Assuming, per the example, that you are looking for the text following "constant=" up to the first space, and that there is only one instance per line, this worked for me:

    perl -nle 'while(m/constant=(.*)\s/g){print "$1"}' /home/jamie/example.txt
    I tend to oversimplify things, so it may not be what you are looking for. But it is one line and it returns the data you were looking for.

    Jamie

      Using the statement-modifier form might be good style as well:

      perl -nle 'print $1 while /constant=(.*)\s/g' /home/jamie/txt

      But note that combining -l and \s, rather than using a more explicit regexp, does not behave well when there are several matches on the same line!

      Try it, for example, with a text file like this:

      xxxxxxxxx constant=foo@bar.com xxxxxxxxxxx
      xxxxxx constant=baz@huux.org xxxxxxx contant=hello@world.bye xxxxxxxxx
      xxxxxxxxxxxxxxxxxxx

      It will print:

      foo@bar.com
      baz@huux.org xxxxxxx contant=hello@world.bye

      I think the problem comes from (.*), which is greedy and will match even spaces as long as at least one space remains to satisfy \s. But I tried (.*?) and it does not seem to work well with the /g flag?

        Indeed your match is greedy. Still we can change it as follows:

        $ perl -nle 'print $1 while /constant=([^ ]+)\s/g' < test.txt
        barney@gmail.com
        wilma@aol.com
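To compare the three variants discussed in this sub-thread side by side, here is a minimal sketch on a made-up line with two markers. In list context each /g match returns all of its captures, so the arrays show what every pattern picks up.

```perl
use strict;
use warnings;

# Made-up sample line with two markers on the same line.
my $line = 'aaa constant=foo@bar.com bbb constant=baz@huux.org ccc';

my @greedy  = $line =~ /constant=(.*)\s/g;      # .* runs on to the last space
my @lazy    = $line =~ /constant=(.*?)\s/g;     # .*? stops at the first space
my @negated = $line =~ /constant=([^ ]+)\s/g;   # [^ ]+ cannot cross a space

print "greedy:  $_\n" for @greedy;
print "lazy:    $_\n" for @lazy;
print "negated: $_\n" for @negated;
```

On this sample the lazy (.*?) form and the negated class both return the two addresses, while the greedy form returns a single over-long capture that swallows the second marker.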
        
        Steve