Re: extracting strings from text files

by xyzzy (Pilgrim)
on Jul 08, 2012 at 04:31 UTC (#980541)

in reply to extracting strings from text files

Yeah, this is probably a memory issue. Do you have any tools to monitor virtual memory usage? You might want to use them and see how much your script eats up and how it grows over time.

$,=qq.\n.;print q.\/\/____\/.,q./\ \ / / \\.,q.    /_/__.,q..
Happy, sober, smart: pick two.

Replies are listed 'Best First'.
Re^2: extracting strings from text files
by perlyr (Novice) on Jul 08, 2012 at 04:59 UTC
    Maybe you are right. Do you know how to adjust the memory allocation, then? Thanks.

      1. Do you see any error messages when the script reaches that stage, or do you have access to OS-level log messages?

      2. Monitoring and changing memory or OS user limits depends on your platform. On most Unix-like systems, including Linux, you can monitor virtual memory with commands such as vmstat and free, and ulimit shows your current limits. On Linux you can also use any decent free tool, such as collectl, to monitor system resource and memory usage. On Windows, Task Manager is a good place to start.

      3. Above all, it is better to identify and verify the issue correctly before trying to raise limits, some of which may need root access.
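One way to get the issue identified, as point 3 suggests, is to trap fatal errors per file so the failing input and its message are reported instead of the run simply stopping. The following is only a sketch: `process_file` and the file names are hypothetical stand-ins for the real extraction code and inputs.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the per-file extraction logic; it dies on one
# input to imitate a fatal runtime error such as a malformed regex.
sub process_file {
    my ($file) = @_;
    die "simulated failure\n" if $file eq 'bad.txt';
    return "processed $file";
}

my @infiles = ('a.txt', 'bad.txt', 'c.txt');   # illustrative names only
my (@done, @failed);

for my $file (@infiles) {
    # eval traps the die, so one bad file is reported and the loop continues.
    my $result = eval { process_file($file) };
    if (!defined $result) {
        warn "FAILED on $file: $@";
        push @failed, $file;
        next;
    }
    push @done, $file;
}

print "done: @done\nfailed: @failed\n";
```

With this pattern the message in `$@` names the real cause, which is usually more informative than watching memory counters.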

        Thanks for the help guys. I have solved the problem. It's a metacharacter in one of the files that stops the program. Dave, here is the whole message:
        Unmatched ) in regex; marked by <-- HERE in m/LLP) <-- HERE \s*(\w+\s*(\d\d|\d),\s*\d{4})\s*$/ at D:\files\ line 88, <FILE> chunk 1.
        Here is the error message I see when it stops:
        Unmatched ) in regex; marked by <-- HERE in m/LLP) <-- HERE \s*(\w+\s*(\d\d|\d),\s*\d{4})\s*($|except|\()/
        Any thoughts?
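The message suggests that text captured from a file (here something ending in "LLP)") is being interpolated into a later pattern, where the ")" is read as regex syntax. A common remedy is to wrap the interpolated variable in \Q...\E (the quotemeta escape) so it matches literally. The sketch below assumes the stray ")" arrives through $auditorstate; the sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample values standing in for text captured from a filing.
my $auditorstate = 'LLP)';
my $lines        = "Smith & Jones LLP) March 5, 2012\n";

# Unescaped interpolation: the ")" becomes regex syntax and compilation dies.
my $ok = eval { my @m = $lines =~ /$auditorstate\s*(\w+\s*(?:\d\d|\d),\s*\d{4})/; 1 };
print "unescaped: ", ($ok ? "compiled\n" : "failed: $@");

# \Q...\E quotes metacharacters, so the captured text is matched literally.
my ($date_audited) =
    $lines =~ /\Q$auditorstate\E\s*(\w+\s*(?:\d\d|\d),\s*\d{4})\s*(?:$|except|\()/m;
print "date: ", (defined $date_audited ? $date_audited : 'not found'), "\n";
```

The same treatment would apply to every other place where a captured value such as $auditor, $auditorcity, or $date_audited is reused inside a pattern.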
Re^2: extracting strings from text files
by perlyr (Novice) on Jul 08, 2012 at 04:47 UTC
    No, I don't know how to monitor that. Not sure this is a memory issue. I just bought this desktop which has 16GB memory. Below is the major part of the code.
    foreach (@infiles) {
        my $data;
        my $address = $_;
        unless (open (FILE,"<$_")) {
            print STDERR "could not open $_: $!\n";
            next;
        }
        local $/;
        $data=<FILE>;
        my($html) = $data =~ /(<p>|<\/P>|&nbsp;|<html>)/im;
        if (defined($html)) {
            $data =~ s/<[^>]+>/\n/g;
            $data =~ s/&nbsp;//ig;
            $data =~ s/&lt;/</ig;
            $data =~ s/&gt;/>/ig;
            $data =~ s/&quot;/"/ig;
            $data =~ s/&amp;/&/ig;
            $data =~ s/&#\d+;/&/ig;
            $data =~ s/&&/ &/ig;
            $data =~ s/(\w+)&(\d+)/$1 $2/ig;
        }
        my($block, $lines, $auditor, $auditorcity, $auditorstate, $date_audited,
           $keyword1, $keyword2, $keyword3, $keyword4, $block2,);
        ($keyword1) = $data =~ /(have(\s*|\n)audited [^a])/i;
        ($keyword2) = $data =~ /(our(\s+|\n)audits|consent to)/i;
        if (!defined($keyword1) && !defined($keyword2)) {
            ($block) ="";
        }
        elsif (!defined($keyword1) && defined($keyword2)) {
            ($block) = $data =~ /($keyword2(?:[^\n]*\n){1,350})/im;
        }
        elsif (defined($keyword1)) {
            ($block) = $data =~ /($keyword1(?:[^\n]*\n){1,350})/im;
        }
        ($keyword3) = $block =~ /(((REPORT(\s+|\n)OF|CONSENT OF)|)\s*INDEPENDENT\s*(CERTIFIED|registered|registered CERTIFIED|)\s*PUBLIC\s*(ACCOUNTANTS|accounting(\s+|\n)firm)|REPORT OF INDEPENDENT ACCOUNTANTS|REPORT OF INDEPENDENT AUDITORS|)/im;
        if (defined($keyword3)) {
            ($lines) = substr ($block,1,index($block,$keyword3)-1);
        }
        elsif (!defined($keyword3)) {
            ($lines) = $block;
        }
        ($auditor) = $lines =~ /^\s*((?:[A-Z]\w+|\/\w+|&\s*\w+).+?\s*(LLP|L\.L\.P\.|LLC|LTD|P.A.|P\.C\.|PC)(\.|))$/m;
        if (defined ($auditor)) {
            ($auditorcity) = $lines =~ /$auditor\s*(?:(?:.+?\s*(?:LLP|L\.L\.P\.|LLC|LTD|P.A.|P\.C\.|PC)(?:\.|))|\(.+\)|)\s*(.+?)(?<![\d]),\s*(?:.+?)$/m;
            ($auditorstate) = $lines =~ /$auditor\s*(?:(?:.+?\s*(?:LLP|L\.L\.P\.|LLC|LTD|P.A.|P\.C\.|PC)(?:\.|)|\(.+\)|)\s*.+?(?<![\d]),\s*)(.+?)$/m;
            ($date_audited) = $lines =~ /$auditorstate\s*(\w+\s*(\d\d|\d),\s*\d{4})\s*($|except|\()/m;
        }
        if (!defined($auditorcity) && !defined($auditorstate)) {
            ($auditorcity) = $lines =~ /^\s*(\w+)(?<![\d]),\s*(?:\w+)(?<!LLP)$/m;
            ($auditorstate) = $lines =~ /^\s*(?:\w+(?<![\d]),\s*)(\w+)(?<!LLP)$/m;
            ($date_audited) = $lines =~ /^\s*(\w+\s*(\d\d|\d),\s*\d{4})\s*($|except|\()/m;
        }
        if (defined ($auditor) && !defined($auditorcity) && !defined($auditorstate)) {
            ($auditorcity) = $lines =~ /^\s*(\w+|\w+ \w+|\w+ \w+ \w+)(?:(?<![\d]),\s*(?:\w+|\w+ \w+|\w+ [^a] \w+))(?<![\d])$/m;
            ($auditorstate) = $lines =~ /^\s*(?:(?:\w+|\w+ \w+|\w+ \w+ \w+)[^\d],\s*)(\w+|\w+ \w+|\w+ [^a] \w+)(?<![\d])$/m;
        }
        if (!defined($auditorcity) && !defined($auditorstate) && defined($date_audited)) {
            ($auditorcity) = $lines =~ /^\s*(\w+|\w+ \w+|\w+ \w+ \w+)(?:(?<![\d]),\s*(?:\w+|\w+ \w+|\w+ [^a] \w+))\s*$date_audited/m;
            ($auditorstate) = $lines =~ /^\s*(?:(?:\w+|\w+ \w+|\w+ \w+ \w+)[^\d],\s*)(\w+|\w+ \w+|\w+ [^a] \w+)\s*$date_audited/m;
        }
        if (!defined($auditor) && defined($auditorcity)) {
            ($auditor) = $lines =~ /\n{2,}\s*(.+?)?\s+$auditorcity/m;
        }
        if (!defined($auditor) && defined($auditorcity)) {
            ($auditor) = $lines =~ /(?:\/s\/)\s*([^-]+?)\s*$auditorcity/m;
        }
        if (!defined($auditor)) {
            ($auditor) = $lines =~ /(?:\/s\/)\s*([^-]+?)$/m;
        }
        if (!defined($auditor)) {
            ($lines) = $data =~ /(consent to(?:[^\n]*\n){1,90})/im;
            ($auditor) = $lines =~ /^\s*(.+?\s*(LLP|L\.L\.P\.|LLC|LTD|P.A.|P\.C\.|PC))$/m;
            if (defined ($auditor) && !defined($auditorcity) && !defined($auditorstate)) {
                ($auditorcity) = $lines =~ /$auditor\s*(.+?),\s*(?:.+?)$/m;
                ($auditorstate) = $lines =~ /$auditor\s*(?:.+?,\s*)(.+?)$/m;
            }
        }
        if (!defined($auditor)) {
            ($auditor) = $data =~ /((PWC|KPMG|ERNST & YOUNG|DELOITTE & TOUCHE|PRICEWATERHOUSECOOPERS|Young LLP)\s*(LLP|))/i;
        }
        if (defined ($auditor) && !defined($auditorcity) && !defined($auditorstate)) {
            ($auditorcity) = $lines =~ /$auditor\s*(.+?)(?<![\d]),\s*(?:.+?)$/m;
            ($auditorstate) = $lines =~ /$auditor\s*(?:.+?(?<![\d]),\s*)(.+?)$/m;
        }
        if (defined ($auditor) && !defined($auditorcity) && !defined($auditorstate)) {
            ($auditorcity) = $lines =~ /$auditor\s*(?:\w+ (?:\d|\d\d)),\s*\d{2,4}\s*(.+?),(?:.+?)$/m;
            ($auditorstate) = $lines =~ /$auditor\s*(?:\w+ (?:\d|\d\d)),\s*\d{2,4}\s*(?:.+?),(.+?)$/m;
        }
        if (!defined($date_audited)) {
            ($date_audited) = $lines =~ /^\s*(\w+\s*(\d\d|\d),\s*\d{4})(,|$)/m;
        }
        print OUTFILE "$auditor\t";
        print OUTFILE "$auditorcity\t";
        print OUTFILE "$auditorstate\t";
        print OUTFILE "$date_audited\n";

      There is nothing obviously wrong with the code -- at least in terms of causing the script to silently terminate -- which eliminates many of the possibilities.

      The next thing I would check is whether the script accumulates (leaks) memory as it runs. On Windows, look at the Task Manager; on Unix, try top (see man top). Does the memory usage just keep growing? At what rate per file? And what is the total memory usage when it quits?

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I just looked at the Task Manager. While the script runs, CPU usage is about 12% and physical memory is about 13%. Memory usage is about 2.2GB and doesn't accumulate. After it stops, the memory usage is still 2.2GB.
      I can't see where your code closes the filehandle.

      Ronald Fischer
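On the filehandle point: reopening the bareword FILE on each pass implicitly closes the previous file, so the posted loop should not pile up open handles, but an explicit close on a lexical handle makes that obvious. A minimal sketch follows, in which `slurp_file` is a hypothetical helper and not part of the posted script.

```perl
use strict;
use warnings;

# Read a whole file into a string through a lexical filehandle.
# Returns undef (after a warning) if the file cannot be opened.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or do {
        warn "could not open $path: $!\n";
        return;
    };
    local $/;                 # slurp mode, limited to this sub
    my $data = <$fh>;
    close($fh) or warn "could not close $path: $!\n";
    return $data;
}
```

Inside the loop this could be used as `my $data = slurp_file($_); next unless defined $data;`, which also keeps the `local $/` from affecting any other reads.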
