PerlMonks  

Quick regex question

by eversuhoshin (Sexton)
on Mar 31, 2011 at 19:13 UTC (#896704=perlquestion)
eversuhoshin has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks~

I have a quick regex question. I am looking for executive deaths in SEC filings, and I want to exclude filings that mention only death benefits rather than the actual passing of a CEO.

I want to exclude those that mention benefits. Does this regex work?

/(sudden|unexpected)? death | (deceased| passed away | died | dying) (?!death benefits)

Basically, I don't want Perl to match a file that contains death benefits, and it would be great if you could help me write something like: if /death benefit/, then exclude this file.

Thank you for your time and consideration

Re: Quick regex question
by JavaFan (Canon) on Mar 31, 2011 at 19:23 UTC
    /\b(death(?!\s+benefits)|deceased|passed\s+away|died|dying)\b/
Re: Quick regex question
by kennethk (Monsignor) on Mar 31, 2011 at 19:24 UTC
    I personally have no idea how SEC filings are formatted, let alone how they usually discuss deaths and death benefits. In general, this sort of question is best presented with samples of what you do want to match and what you don't. See How do I post a question effectively? And please wrap code in <code> tags.

    Based upon what you've written, I'm going to assume you require a regular expression that matches the phrases deceased, passed away, died and dying as well as death but not death benefits. You appear to have had the right concept by wanting to use a negative look-ahead. Meeting this spec might look like:

    /\b(?:deceased|passed\s+away|died|dying|death(?!\s+benefits))\b/i

    which YAPE::Regex::Explain describes as

    The regular expression:

    (?i-msx:\b(?:deceased|passed\s+away|died|dying|death(?!\s+benefits))\b)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?i-msx:                 group, but do not capture (case-insensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    ----------------------------------------------------------------------
      (?:                      group, but do not capture:
    ----------------------------------------------------------------------
        deceased                 'deceased'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        passed                   'passed'
    ----------------------------------------------------------------------
        \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                                 or more times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        away                     'away'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        died                     'died'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        dying                    'dying'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        death                    'death'
    ----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
    ----------------------------------------------------------------------
          \s+                      whitespace (\n, \r, \t, \f, and " ")
                                   (1 or more times (matching the most
                                   amount possible))
    ----------------------------------------------------------------------
          benefits                 'benefits'
    ----------------------------------------------------------------------
        )                        end of look-ahead
    ----------------------------------------------------------------------
      )                        end of grouping
    ----------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

    See Looking ahead and looking behind in perlretut.
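
    To see how the negative look-ahead behaves, here is a minimal sketch that runs the pattern above against a few sentences. The sample sentences and the helper name mentions_death are made up for illustration; real SEC filings will be worded differently.

```perl
use strict;
use warnings;

# The pattern from this reply, compiled once.
my $death_re = qr/\b(?:deceased|passed\s+away|died|dying|death(?!\s+benefits))\b/i;

# Hypothetical sample sentences, not taken from any real filing.
my @samples = (
    'The Company announced the sudden death of its CEO.',
    'The plan provides death benefits to eligible employees.',
    'Our founder passed away on March 3.',
    'Death  Benefits are described in Item 11.',
);

# Hypothetical helper: returns 1 if the text mentions a death, 0 otherwise.
sub mentions_death {
    my ($text) = @_;
    return $text =~ $death_re ? 1 : 0;
}

printf "%-5s %s\n", (mentions_death($_) ? 'match' : 'skip'), $_ for @samples;
```

    Note that the look-ahead only rejects the plural "benefits". A sentence containing the singular "death benefit" would still match on "death"; writing the look-ahead as (?!\s+benefits?) is one way to cover both, if that is what you want.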

Re: Quick regex question
by atcroft (Monsignor) on Mar 31, 2011 at 19:26 UTC

    Create a flag. Loop through the file. Set the flag before testing a line, bail out if the regex matches, unset it at the end of the loop. Once you exit the loop, see if the flag is set, and skip processing if it is set.

    An (untested) example:
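
    The code that originally followed is not preserved in this copy of the thread. What follows is a minimal sketch of the flag approach described above, not atcroft's own code. The sub name has_death_benefits, the pattern /death\s+benefits?/i, and taking file names from @ARGV are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if any line read from $fh mentions
# "death benefit" or "death benefits", 0 otherwise.
sub has_death_benefits {
    my ($fh) = @_;
    my $flag = 0;
    while (my $line = <$fh>) {
        $flag = 1;                               # set the flag before testing the line
        last if $line =~ /death\s+benefits?/i;   # bail out with the flag still set
        $flag = 0;                               # no hit on this line, so unset it
    }
    return $flag;
}

# Assumed usage: file names come from the command line.
for my $path (@ARGV) {
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $skip = has_death_benefits($fh);
    close $fh;
    next if $skip;                               # skip files that mention death benefits
    print "$path: no death benefits found, continue processing\n";
}
```

    Because this tests one line at a time, it will miss a phrase that is split across a line break ("death" at the end of one line, "benefits" at the start of the next).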

    Hope that helps.

    Update: 2011-03-31

    Updated text slightly for clarity. Added comment in code.

Re: Quick regex question
by wind (Priest) on Mar 31, 2011 at 19:29 UTC
    Sounds like you want the following:
    /(?:sudden|unexpected)? death(?! benefits)|deceased|passed away|died|dying/
    If you want to force the file to not have death benefits at all, then you could use the following:
    /^(?!.*death benefits).*(?:(?:sudden|unexpected)? death(?! benefits)|deceased|passed away|died|dying)/
    But it would probably be cleaner to just separate the regexes:
    !/death benefits/ && /(?:sudden|unexpected)? death(?! benefits)|deceased|passed away|died|dying/
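
    As a rough sketch of the two-regex idea applied to a whole file at once: the sub name keep_filing is made up, and the code assumes the entire filing has already been read into one string. The patterns are a simplified variation on the ones above, not a drop-in copy.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether one filing should be kept.
# Assumes $text holds the full contents of the file.
sub keep_filing {
    my ($text) = @_;

    # First test: exclude the file outright if it mentions death benefits.
    return 0 if $text =~ /death\s+benefits?/i;

    # Second test: keep the file only if it mentions a death at all.
    # No look-ahead is needed here, because the first test already ruled
    # out every occurrence of "death benefit(s)".
    return $text =~ /\b(?:death|deceased|passed\s+away|died|dying)\b/i ? 1 : 0;
}

# Assumed usage: slurp each file named on the command line.
for my $path (@ARGV) {
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $text = do { local $/; <$fh> };   # read the whole file into one string
    close $fh;
    $text = '' unless defined $text;     # an empty file may yield undef
    print "$path\n" if keep_filing($text);
}
```

    If you prefer the single-pattern version on a whole-file string, keep in mind that . does not match a newline by default, so ^(?!.*death benefits) would only look at the first line unless you add the /s modifier.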
