Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Regex: Example works, plugging it into code doesn't.

by EclecticScion (Novice)
on Jun 14, 2013 at 15:38 UTC ( [id://1038979]=perlquestion: print w/replies, xml ) Need Help??

EclecticScion has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, O monks;

I am fairly new to Perl, and this is my first submitted question. I seek wisdom on a problem that looks extremely simple, but has been causing me a lot of grief.

I wish to parse a string to find all-caps words of at least 3 characters. My example code looks like this:

my $finalline = "3.2 COMPLIANCE WITH LAWS AND REGULATIONS. While the + Product is in its possession or under its, or its sub-contractor's c +ontrol, THEDUDES shall comply, or ensure that its subcontractors comp +ly, with all applicable federal, state and local statutory and regula +tory requirements regarding the manufacture, if applicable, packaging +, handling transportation and storage of the Product."; @words = ($finalline =~ /([A-Z][A-Z][A-Z]+)/g); print ("@words \n");

This returns:

COMPLIANCE WITH LAWS AND REGULATIONS THEDUDES

...as desired.

However, when I try to plug this into my larger program, the code behaves oddly. Specifically, $finalline has the desired content, but @words usually ends up empty. The especially odd thing is that if I remove the "/g" above, @words will retrieve the first capitalized word, but if I put it back in, it will retrieve zero capitalized words(!) I have also tried adding "/gc" instead, without success. The code does more or less what it's supposed to apart from this bug (and a couple of others).

Here is my code. As far as I can tell, the relevant bits (lines 150-153, marked as "#PROBLEM CODE!!!" below) are exactly the same as in my example, but do not work correctly. Any help would be greatly appreciated.

(Addendum: I have since gone back and turned on "use strict", and declared all my variables ahead of time. This has not solved the problem.)

#!/usr/bin/perl -w use diagnostics; use Spreadsheet::WriteExcel::Big; use HTML::Restrict; use File::Slurp; #use Win32::Word::Writer; use RTF::Writer; opendir(DIR2, "contract/new"); my @files2 = readdir(DIR2); closedir(DIR2); foreach $file2 (@files2) { print "$file2\n"; open (FH2, "contract/new/$file2"); $newfile2=$file2; $newfile2 =~ s/\.html//g; my @filelines2 = <FH2>; chomp @filelines2; my $masterplan = join(' ', @filelines2); $hr2 = HTML::Restrict->new(); $masterplan = $hr2->process($masterplan); my @masterarray = split('DUMMY',$masterplan); my $caps = 0; my $traps = 0; my $escape = 0; my @words = "";
foreach $finalline (@masterarray){ $escape = 0; $finalline =~ s/\&.{3,5}\;/ /g; $finalline =~ s/\s+/ /g; while ($finalline =~ /(\w\w+\b)/gc){ $linelength++ } if (($finalline =~ /^\s*\d|^\s*Section|^\s*Article|^\s*[A-Z]\. +|witness whereof/i) && ($holder == $rtf)){ while ($escape == 0){ if (($finalline =~ /[A-Z][A-Z][A-Z]/) && ($caps == 0) && ( +$traps == 0)){ print(">>>$finalline<<</n"); @words = ($finalline =~ /([A-Z][A-Z][A-Z]+)/g); #PROBLEM CODE!!! print("@words"); #PROBLEM CODE!!! $lineholder = $finalline; #PROBLEM CODE!!! $finalline = join(' ', @words); #PROBLEM CODE!!! $caps++; } #etc.

Update: Removed extraneous code; added new line (print (">>>$finalline<<</n");)

Update 2: smls's solution works. Thanks smls!

Update 3: Restored problem code (see below).

Replies are listed 'Best First'.
Re: Regex: Example works, plugging it into code doesn't.
by smls (Friar) on Jun 14, 2013 at 16:34 UTC

    Inside your big foreach loop, near the beginning you do this:

    while ($finalline =~ /(\w\w+[\s\.])/gc){ $linelength++; }

    The /c flag prevents pos (the regex matching start position) for the string $finalline from being reset afterwards (see perlreref), so it will remain pointing to the end of the string.

    Then further down, you do:

    if (($finalline =~ /[A-Z][A-Z][A-Z]/) && ($caps == 0) && ($traps == 0) +){ @words = ($finalline =~ /([A-Z][A-Z][A-Z]+)/g); ... }

    ...but this regex will try to start matching at the end of $finalline - due to pos($finalline) still pointing there - which of course won't find any matches.

    I haven't made any attempt to understand your code as a whole, so I have no idea whether you actually need the /c flag in the first regex. If you do, you can manually set pos($finalline) = 0 to reset the matching start position before the second regex.

    EDIT:
    Your regex /[A-Z][A-Z][A-Z]+/ can also be more succinctly written as /[A-Z]{3,}/

    EDIT2:
    Also, in the future please try to reduce your code to a minimal, working (except for the problem of course), self-contained testcase when asking on PerlMonks, it is really difficult to work with a to huge chunk of code that I did not write and that I cannot test (because it depends on file input, and dependencies that I don't have installed).
    Also, if you follow this rule, then in many cases you will actually find the bug yourself in the process of creating this minimal testcase! :)

      That was indeed the problem. Thanks very much!

Re: Regex: Example works, plugging it into code doesn't.
by toolic (Bishop) on Jun 14, 2013 at 16:04 UTC
    Tip #2 from the Basic debugging checklist: print $finalline and show us the contents:
    while ($escape == 0){ if (($finalline =~ /[A-Z][A-Z][A-Z]/) && ($caps == 0) && ( +$traps == 0)){ print ">>>$finalline<<<\n"; @words = ($finalline =~ /([A-Z][A-Z][A-Z]+)/g);

    Also, try to reduce your code further.

      Thanks! I have removed the extra code that follows the problem code.

      print (">>>$finalline<<<");

      returns full sentences from the source files, like the one in the example above (in particular, the program succesfully outputs a new RTF file with all the lines intact). E.g.:

      >>> 3.2 COMPLIANCE WITH LAWS AND REGULATIONS. While the Product is in its possession or under its, or its sub contractor's control, AKORN shall comply, or ensure that its subcontractors comply, with all applicable federal, state and local statutory and regulatory requirements regarding the manufacture, if applicable, packaging, handling transportation and storage of the Product. <<<
        I have removed the extra code

        But in doing so, you have also removed the part that actually caused the problem. If I hadn't seen your question before your edit, there wouldn't have been any way for me to identify the problem... :)

        Just blindly cutting away code that you think doesn't contribute to the problem, is not what "try to reduce your code further" means...
        What it means, is make a temporary copy of your script, and cut away code one chunk at a time, and each time test the remaining code to see if it still reproduces your bug (but is otherwise functional). Then when you've reached the minimal amount of code needed to reproduce the bug, remove file read's and module calls etc. as much as possible, and replace them with in-place definitions of dummy data. Test again, to see if it still reproduces the bug. If it does, that's what you should upload to Perl Monks.

        Just for future reference... :)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1038979]
Approved by muba
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (2)
As of 2024-04-25 07:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found