Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re^3: extracting strings from text files

by oralaks (Acolyte)
on Jul 08, 2012 at 05:22 UTC ( [id://980545]=note: print w/replies, xml ) Need Help??


in reply to Re^2: extracting strings from text files
in thread extracting strings from text files

1. Do you see any error messages or access to OS level log messages when you run/reach the stage.

2.Monitoring and changing the memory/OS user limits depend on your OS port/platform. On most UNIX like/Linux you can monitor virtual memory with OS commands like vmstat,free.On linux you can even use even any of decent free tools like collectl etc to monitor system resource/memory usage. (ulimit can tell your current settings). Task manager in Windows is a best place to start.

3.Above all it might be better to get the issue correctly identified/verified before attempting to increase limits some of which may need root access.

  • Comment on Re^3: extracting strings from text files

Replies are listed 'Best First'.
Re^4: extracting strings from text files
by perlyr (Novice) on Jul 08, 2012 at 14:34 UTC
    Thanks for the help guys. I have solved the problem. It's a metacharacter in one of the files that stops the program. Dave, here is the whole message:
    Unmatched ) in regex; marked by <-- HERE in m/LLP) <-- HERE \s*(\w+\s* +(\d\d|\d),\s*\d{4})\s*$/ at D:\files\test2.pl line 88, <FILE> chunk 1 +.
Re^4: extracting strings from text files
by perlyr (Novice) on Jul 08, 2012 at 07:55 UTC
    Here is the error message I see when it stops
    Unmatched ) in regex; marked by <-- HERE in m/LLP) <-- HERE \s*(\w+\s* +(\d\d|\d),\s*\d{4})\s*($|except|\()/
    any thoughts?
      My thought: that your error report just muddies the water. We can't even tell what the whole line is; much less, its context in your code -- of which you appear to have posted only a fragment and then deleted that.

      It's really hard to offer much help when the helpee obstructs the process.

      FWIW, I captured the following (less the shebang and node number) from the a mirror of the node you deleted ... but it doesn't compile, and the failure reported (as might be expected for a fragment) is for syntax errors other than the missing close_paren in line 69 (as displayed).

      #!/usr/bin/perl use 5.014; #980557 foreach (@infiles) { my $data; my $address = $_; unless(open (FILE,"<$_")){ print STDERR "could not open $_: $!\n"; next;} local $/; $data=<FILE>; my($html) = $data =~ /(<p>|<\/P>|&nbsp;|<html>)/im; if (defined($html)){ $data =~ s/<[^>]+>/\n/g; $data =~ s/&nbsp;//ig; $data =~ s/&lt;/</ig; $data =~ s/&gt;/>/ig; $data =~ s/&quot;/"/ig; $data =~ s/&amp;/&/ig; $data =~ s/&#\d+;/&/ig; $data =~ s/&&/ &/ig; $data =~ s/(\w+)&(\d+)/$1 $2/ig; } my($block, $lines, $auditor, $auditorcity, $auditorstate, $date_audited, $keyword1, $keyword2, $keyword3, $keyword4, +$block2,); ($keyword1) = $data =~ /(have(\s*|\n)audited [^a])/i; ($keyword2) = $data =~ /(our(\s+|\n)audits|consent to)/i; if (!defined($keyword1) && !defined($keyword2)) {($block) ="" +;} elsif (!defined($keyword1) && defined($keyword2)) {($block) = + $data =~ /($keyword2(?:[^\n]*\n){1,350})/im;} elsif(defined($keyword1)) {($block) = $data =~ /($keyword1( +?:[^\n]*\n){1,350})/im;} ($keyword3) = $block =~ /(((REPORT(\s+|\n)OF|CONSENT OF)|)\s +*INDEPENDENT\s*(CERTIFIED|registered|registered CERTIFIED|)\s*PUBLIC\ +s*(ACCOUNTANTS|accounting(\s+|\n)firm)|REPORT OF INDEPENDENT ACCOUNTA +NTS|REPORT OF INDEPENDENT AUDITORS|)/im; if (defined($keyword3)) {($lines) = substr ($block,1,index($b +lock,$keyword3)-1);} elsif (!defined($keyword3)){($lines) = $block;} ($auditor) = $lines =~ /^\s*((?:[A-Z]\w+|\/\w+|&\s*\w+).+?\s* +(LLP|L\.L\.P\.|LLC|LTD|P.A.|P\.C\.|PC)(\.|))$/m; if(defined ($auditor)){ ($auditorcity) = $lines =~ /$auditor\s*(?:(?:.+?\s*(?:LLP|L\ +.L\.P\.|LLC|LTD|P.A.|P\.C\.|PC)(?:\.|))|\(.+\)|)\s*(.+?)(?<![\d]),\s* +(?:.+?)$/m; ($auditorstate) = $lines =~ /$auditor\s*(?:(?:.+?\s*(?:LLP|L\ +.L\.P\.|LLC|LTD|P.A.|P\.C\.|PC)(?:\.|)|\(.+\)|)\s*.+?(?<![\d]),\s*)(. ++?)$/m; ($date_audited) = $lines =~ /$auditorstate\s*(\w+\s*(\d\d|\d) +,\s*\d{4})\s*($|except|\()/m; } if(!defined($auditorcity) && !defined($auditorstate)){ ($auditorcity) = $lines =~ /^\s*(\w+)(?<![\d]),\s*(?:\w+)(? +<!LLP)$/m; ($auditorstate) = $lines =~ /^\s*(?:\w+(?<![\d]),\s*)(\w+)(? +<!LLP)$/m; ($date_audited) = $lines =~ /^\s*(\w+\s*(\d\d|\d),\s*\d{4})\ +s*($|except|\()/m; } if(defined ($auditor) && !defined($auditorcity) && !defined($ +auditorstate)){ ($auditorcity) = $lines =~ /^\s*(\w+|\w+ \w+|\w+ \w+ \w+)(?: +(?<![\d]),\s*(?:\w+|\w+ \w+|\w+ [^a] \w+))(?<![\d])$/m; ($auditorstate) = $lines =~ /^\s*(?:(?:\w+|\w+ \w+|\w+ \w+ \w ++)[^\d],\s*)(\w+|\w+ \w+|\w+ [^a] \w+)(?<![\d])$/m;} if (!defined($auditorcity) && !defined($auditorstate) && defi +ned($date_audited)){ ($auditorcity) = $lines =~ /^\s*(\w+|\w+ \w+|\w+ \w+ \w+)(?: +(?<![\d]),\s*(?:\w+|\w+ \w+|\w+ [^a] \w+))\s*$date_audited/m; ($auditorstate) = $lines =~ /^\s*(?:(?:\w+|\w+ \w+|\w+ \w+ \w ++)[^\d],\s*)(\w+|\w+ \w+|\w+ [^a] \w+)\s*$date_audited/m; } if (!defined($auditor) && defined($auditorcity)){ ($auditor) = $lines =~ /\n{2,}\s*(.+?)?\s+$auditorcity/m; } if (!defined($auditor) && defined($auditorcity)){ ($auditor) = $lines =~ /(?:\/s\/)\s*([^-]+?)\s*$auditorcit +y/m;} if (!defined($auditor)){ ($auditor) = $lines =~ /(?:\/s\/)\s*([^-]+?)$/m;} if (!defined($auditor)){ ($lines) = $data =~ /(consent to(?:[^\n]*\n){1,90})/im; ($auditor) = $lines =~ /^\s*(.+?\s*(LLP|L\.L\.P\.|LLC|L +TD|P.A.|P\.C\.|PC))$/m; if(defined ($auditor) && !defined($auditorcity) && !defined($ +auditorstate)){ ($auditorcity) = $lines =~ /$auditor\s*(.+?),\s*(?:.+?)$/m; ($auditorstate) = $lines =~ /$auditor\s*(?:.+?,\s*)(.+?)$/m;} } if (!defined($auditor)){ ($auditor) = $data =~ /((PWC|KPMG|ERNST & YOUNG|DELOITTE & + TOUCHE|PRICEWATERHOUSECOOPERS|Young LLP)\s*(LLP|))/i;} if(defined ($auditor) && !defined($auditorcity) && !defined($au +ditorstate)){ ($auditorcity) = $lines =~ /$auditor\s*(.+?)(?<![\d]),\s*(?:.+ +?)$/m; ($auditorstate) = $lines =~ /$auditor\s*(?:.+?(?<![\d]),\s*)(.+ +?)$/m;} if(defined ($auditor) && !defined($auditorcity) && !defined($au +ditorstate)){ ($auditorcity) = $lines =~ /$auditor\s*(?:\w+ (?:\d|\d\d)),\s* +\d{2,4}\s*(.+?),(?:.+?)$/m; ($auditorstate) = $lines =~ /$auditor\s*(?:\w+ (?:\d|\d\d)),\s* +\d{2,4}\s*(?:.+?),(.+?)$/m;} if(!defined($date_audited)){ ($date_audited) = $lines =~ /^\s*(\w+\s*(\d\d|\d),\s*\d{4})(,| +$)/m;} print OUTFILE "$auditor\t"; print OUTFILE "$auditorcity\t"; print OUTFILE "$auditorstate\t"; print OUTFILE "$date_audited\n";

      At this point, /me notes that *whatever* the full script looks like, it should never have compiled, and thus, never have processed nearly 7k files, with the regex error.

      Show the full error message, which should include a file and line number. Then show us the lines in you code surrounding the line at which the error occurs at; making clear which is the actual line in error. If the rexpexp is a run-time regexp, i.e. the pattern comes partially from a variable, then explain where the variable gets its contents.

      It may just be that you need to wrap that variable (if any) with \Q$variable\E.

      Dave.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://980545]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (6)
As of 2024-04-23 20:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found