PerlMonks  

Quick Question about checking what the regex does

by eversuhoshin (Sexton)
on Aug 20, 2012 at 07:43 UTC
eversuhoshin has asked for the wisdom of the Perl Monks concerning the following question:

Hello, can someone help me write a code that shows what words the regex actually matches? Below is the code

    $fcount = () = $data =~ m/
        outlook\s+for\s+any\s+rating
        |(?:rating|if\s+on\s+negative|Microsoft|suggesting\s+an
            |may\s+contain\s+statements\s+about\s+future\s+events\,
            |business\s+conditions\s+and\s+the)\s+outlook
        |guidance\s+(?:to\s+approve|facility)
        |(?:authoritative|revenue\s+recognition|invaluable\s+practical|valuable
            |regulatory|technical|under\s+the|staff\'s|judicial|SEC|FDA
            |Treasury(?:\s+Department)?|specific|implementation|their|government
            |any\s+ruling|college|absent|\s+his|interim|intrepretive|transition
            |administrative|procedural|related|applicable|accounting|definitive
            |superceding|IRS|Internal\s+Revenue\s+Service|valued
            |EITF\s+accounting)\s+guidance
        |guidance\s+(?:and\s+rules|promulgated(?:\s+thereunder)?|in\s+SFAS)
        |(?:provided|issued)\s+by\s+(?:the\s+)?(?:SEC
            |Securities\s+and\s+Exchange\s+Commission|Internal\s+Revenue\s+Service
            |Secretary|United\s+States|Financial\s+Accounting)
        |(?:other|applicable)\s+guidance\s+issued
        |according\s+to\s+the\s+guidance\s+contained
        |provide\s+guidance\s+to\s+directors
        |receiving\s+guidance
        |(?:current|other)\s+guidance\s+(?:under|from)
        |assumes\s+guidance\s+of\s+(?:the|a)\s+(?:company|board|talented\s+team|compensation)
        |guidance\s+(?:system|software|technology)
    /xig;

I just want to make sure that I am not matching some weird stuff due to some regex mistake. Thank you so much!

Re: Quick Question about checking what the regex does
by Ratazong (Prior) on Aug 20, 2012 at 07:48 UTC
    There are various ways to translate a regex into human language - I like this webpage.
Re: Quick Question about checking what the regex does
by Anonymous Monk on Aug 20, 2012 at 08:36 UTC

      Unfortunately one of the countermeasures against abuse is that the regex tester limits the size of the regular expression and of the test data. I think the limit is set somewhere around 1k each. The OP's regular expression (and probably his data set) will exceed that limitation.

      There is a github repo where one could fetch the code and make a simple modification to SafeMatchStats.pm (lines 25 and 26) to set an arbitrarily large limit. Then ensure that all dependencies listed in Makefile.PL are installed (Skip Plack, as it's only used by the cloud service and not required to run locally). Finally, run as ./retester daemon. The repo is at https://github.com/daoswald/retester.

      If you have the data that you intend to run against this regular expression, you can iterate over the matches using ${^MATCH} to tell you what matched each time.
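A minimal sketch of that loop, assuming Perl 5.10 or later. `${^MATCH}` is only guaranteed to be set when the match uses the `/p` modifier (from 5.20 on it is set regardless). The sample text, the shortened pattern, and the helper name `list_matches` are placeholders for illustration, not the OP's data or code.

```perl
use strict;
use warnings;

# Placeholder input and a shortened stand-in for the OP's pattern.
my $data  = "The SEC  guidance and rules changed; see guidance software.";
my $regex = qr/(?:SEC|FDA)\s+guidance|guidance\s+(?:system|software|technology)/ip;

# Walk the string one match at a time and collect the exact text matched.
# /p makes ${^MATCH} available on perls older than 5.20.
sub list_matches {
    my ($text, $re) = @_;
    my @found;
    while ($text =~ /$re/gp) {
        push @found, ${^MATCH};
    }
    return @found;
}

# Brackets make runs of whitespace inside a match visible.
print "matched: [$_]\n" for list_matches($data, $regex);
```

Printing each match between brackets makes it easy to see when `\s+` swallowed more whitespace than expected.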

      However, I was curious, and in the absence of the original data, I took a shot at unwinding the OP's regex by backing out the possible alternation paths:

      use strict;
      use warnings;

      my $fcount;
      my $data;
      $data = do { local $/ = undef; <DATA>; };

      $fcount = () = $data =~ m/
          outlook\s+for\s+any\s+rating
          |
          (?:
              rating
              | if\s+on\s+negative
              | Microsoft
              | suggesting\s+an
              | may\s+contain\s+statements\s+about\s+future\s+events\,
              | business\s+conditions\s+and\s+the
          )
          \s+outlook|guidance\s+
          (?:to\s+approve|facility)
          |
          (?:
              authoritative
              | revenue\s+recognition
              | invaluable\s+practical
              | valuable
              | regulatory
              | technical
              | under\s+the
              | staff\'s
              | judicial
              | SEC
              | FDA
              | Treasury (?:\s+Department)?
              | specific
              | implementation
              | their
              | government
              | any\s+ruling
              | college
              | absent
              | \s+his
              | interim
              | intrepretive
              | transition
              | administrative
              | procedural
              | related
              | applicable
              | accounting
              | definitive
              | superceding
              | IRS
              | Internal\s+Revenue\s+Service
              | valued
              | EITF\s+accounting
          )
          \s+guidance
          |
          guidance\s+
          (?:
              and\s+rules
              | promulgated(?:\s+thereunder)?
              |in\s+SFAS
          )
          |
          (?:provided|issued)
          \s+by\s+
          (?:the\s+)?
          (?:
              SEC
              | Securities\s+and\s+Exchange\s+Commission
              | Internal\s+Revenue\s+Service
              | Secretary
              | United\s+States
              | Financial\s+Accounting
          )
          |
          (?:other|applicable)
          \s+guidance\s+issued
          |
          according\s+to\s+the\s+guidance\s+contained
          |
          provide\s+guidance\s+to\s+directors
          |
          receiving\s+guidance
          |
          (?:current|other)\s+guidance\s+(?:under|from)
          |
          assumes\s+guidance\s+of\s+
          (?:the|a)\s+
          (?:
              company
              | board
              | talented\s+team
              | compensation
          )
          |
          guidance\s+(?:system|software|technology)
      /xig;

      print $fcount, "\n";

      __DATA__
      outlook for any rating
      rating
      if on negative
      Microsoft
      suggesting an
      may contain statements about future events,
      business conditions and the
      rating outlook to approve
      if on negative outlook to approve
      Microsoft outlook to approve
      suggesting an outlook to approve
      may contain statements about future events, outlook to approve
      business conditions and the outlook to approve
      rating guidance to approve
      if on negative guidance to approve
      Microsoft guidance to approve
      suggesting an guidance to approve
      may contain statements about future events, guidance to approve
      business conditions and the guidance to approve
      rating outlook facility
      if on negative outlook facility
      Microsoft outlook facility
      suggesting an outlook facility
      may contain statements about future events, outlook facility
      business conditions and the outlook facility
      rating guidance facility
      if on negative guidance facility
      Microsoft guidance facility
      suggesting an guidance facility
      may contain statements about future events, guidance facility
      business conditions and the guidance facility
      authoritative guidance
      revenue recognition guidance
      invaluable practical guidance
      valuable guidance
      regulatory guidance
      technical guidance
      under the guidance
      staff's guidance
      judicial guidance
      SEC guidance
      FDA guidance
      Treasury Department guidance
      Treasury guidance
      specific guidance
      implementation guidance
      their guidance
      government guidance
      any ruling guidance
      college guidance
      absent guidance
      his guidance
      interim guidance
      intrepretive guidance
      transition guidance
      administrative guidance
      procedural guidance
      related guidance
      applicable guidance
      accounting guidance
      definitive guidance
      superceding guidance
      IRS guidance
      Internal Revenue Service guidance
      valued guidance
      EITF accounting guidance
      guidance and rules
      guidance promulgated thereunder
      guidance promulgated
      guidance in SFAS
      provided by SEC
      provided by the SEC
      issued by SEC
      issued by the SEC
      provided by Securities and Exchange Commission
      provided by the Securities and Exchange Commission
      issued by Securities and Exchange Commission
      issued by the Securities and Exchange Commission
      provided by Internal Revenue Service
      provided by the Internal Revenue Service
      issued by Internal Revenue Service
      issued by the Internal Revenue Service
      provided by Secretary
      provided by the Secretary
      issued by Secretary
      issued by the Secretary
      provided by United States
      provided by the United States
      issued by United States
      issued by the United States
      provided by Financial Accounting
      provided by the Financial Accounting
      issued by Financial Accounting
      issued by the Financial Accounting
      other guidance issued
      applicable guidance issued
      according to the guidance contained
      provide guidance to directors
      receiving\s+guidance
      current guidance under
      current guidance from
      other guidance under
      other guidance from
      assumes the guidance of the company
      assumes the guidance of a company
      assumes the guidance of the board
      assumes the guidance of a board
      assumes the guidance of the talented team
      assumes the guidance of a talented team
      assumes the guidance of the compensation
      assumes the guidance of a compensation
      guidance system
      guidance software
      guidance technology

      I must have gotten a few of the branches slightly wrong, because it's only matching 99 of the 113 strings that I listed here. Oh, and it would actually match much more than that: the regex uses "\s+" (meaning one or more whitespace characters), and in my input strings I replaced "one or more" with just "one".


      Dave

Re: Quick Question about checking what the regex does
by CountZero (Bishop) on Aug 20, 2012 at 09:35 UTC
    Rather than do:
    $fcount=()=$data=~m/YOUR REGEX HERE/xig
    which will count the number of matches, do:
    @matches=$data=~m/YOUR REGEX HERE/xig; $fcount = @matches;
    and you will find all the matches in the array @matches.
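To spot unexpected hits quickly, that array can be tallied so each distinct matched phrase is listed once with its count. This is a minimal sketch with placeholder text and a shortened pattern; `tally_matches` is a hypothetical helper name. Note that the list-context match returns whole matches only because the pattern contains no capturing groups, just `(?:...)`, which is also true of the OP's regex.

```perl
use strict;
use warnings;

# Count how often each distinct phrase was matched.
# Keys are lower-cased with runs of whitespace collapsed to one space.
sub tally_matches {
    my ($text, $re) = @_;
    my %seen;
    for my $hit ($text =~ /$re/g) {
        (my $key = lc $hit) =~ s/\s+/ /g;
        $seen{$key}++;
    }
    return \%seen;
}

# Placeholder input and a shortened stand-in for the OP's pattern.
my $data  = "Interim guidance was issued. The interim\nguidance and SEC guidance apply.";
my $regex = qr/(?:interim|SEC|FDA)\s+guidance/i;

my $tally = tally_matches($data, $regex);
printf "%3d  %s\n", $tally->{$_}, $_ for sort keys %$tally;
```

Scanning a short list of distinct phrases is usually easier than reading every individual match in a large document.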

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Quick Question about checking what the regex does
by Anonymous Monk on Aug 20, 2012 at 15:15 UTC

    Higher-Order Perl by Mark Jason Dominus, chapters 6 and (I think) 8. Chapter six contains a program for outputting all the strings a regexp-like structure matches; you'll have to translate the regexp manually to the program's format. Chapter eight has a continuation on that, on parsing a real regexp. (I haven't read that chapter yet...)
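The idea of listing every string a pattern can produce can be sketched without the book's machinery. The following is not the program from Higher-Order Perl; it is a simplified stand-in that assumes one branch of the regex has been translated by hand into a list of alternative lists, with each `\s+` written as a single space. The `@branch` data and the `expand` helper are hypothetical.

```perl
use strict;
use warnings;

# Each element is a list of alternatives; an empty string marks an optional part.
# Hand translation (an assumption) of part of one branch of the OP's regex:
#   (?:provided|issued)\s+by\s+(?:the\s+)?(?:SEC|Secretary)
my @branch = (
    [ 'provided', 'issued' ],
    [ ' by ' ],
    [ '', 'the ' ],
    [ 'SEC', 'Secretary' ],
);

# Build the cartesian product of all alternatives, left to right.
sub expand {
    my @parts   = @_;
    my @strings = ('');
    for my $alts (@parts) {
        @strings = map { my $head = $_; map { $head . $_ } @$alts } @strings;
    }
    return @strings;
}

print "$_\n" for expand(@branch);
```

Feeding the generated strings back through the real regex is one way to check that each branch matches what was intended.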

Re: Quick Question about checking what the regex does
by Kenosis (Priest) on Aug 20, 2012 at 17:02 UTC

    You may find YAPE::Regex::Explain useful for 'explaining' your regex:

    use Modern::Perl;
    use YAPE::Regex::Explain;

    my $regex = 'outlook\s+for\s+any\s+rating|(?:rating|if\s+on\s+negative|Microsoft|suggesting\s+an|may\s+contain\s+statements\s+about\s+future\s+events\,|business\s+conditions\s+and\s+the)\s+outlook|guidance\s+(?:to\s+approve|facility)
    |(?:authoritative|revenue\s+recognition|invaluable\s+practical|valuable|regulatory|technical|under\s+the|staff\'s|judicial|SEC|FDA|Treasury(?:\s+Department)?|specific|implementation|their|government|any\s+ruling|college|absent|\s+his|interim|intrepretive|transition|administrative|procedural|related|applicable|accounting|definitive|superceding|IRS|Internal\s+Revenue\s+Service|valued|EITF\s+accounting)\s+guidance
    |guidance\s+(?:and\s+rules|promulgated(?:\s+thereunder)?|in\s+SFAS)|(?:provided|issued)\s+by\s+(?:the\s+)?(?:SEC|Securities\s+and\s+Exchange\s+Commission|Internal\s+Revenue\s+Service|Secretary|United\s+States|Financial\s+Accounting)
    |(?:other|applicable)\s+guidance\s+issued|according\s+to\s+the\s+guidance\s+contained|provide\s+guidance\s+to\s+directors|receiving\s+guidance
    |(?:current|other)\s+guidance\s+(?:under|from)|assumes\s+guidance\s+of\s+(?:the|a)\s+(?:company|board|talented\s+team|compensation)|guidance\s+(?:system|software|technology)';

    say YAPE::Regex::Explain->new($regex)->explain;

    Partial output:

    ...
    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      outlook                  'outlook'
    ----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      for                      'for'
    ----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      any                      'any'
    ----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      rating                   'rating'
    ----------------------------------------------------------------------
     |                        OR
    ----------------------------------------------------------------------
      (?:                      group, but do not capture:
    ----------------------------------------------------------------------
        rating                   'rating'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        if                       'if'
    ----------------------------------------------------------------------
        \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                                 or more times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
    ...
Re: Quick Question about checking what the regex does
by kcott (Abbot) on Aug 20, 2012 at 21:52 UTC
    "Hello, can someone help me write a code that shows what words the regex actually matches?"

    You can add the following line to your existing code to achieve this:

    use Regexp::Debugger;

    See Regexp::Debugger.

    -- Ken
