http://www.perlmonks.org?node_id=699237

romandas has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks,

I come to you with a quandary I do not understand, as tautological as that sounds.

I am trying to parse an .mht file (archived web page) using Perl, specifically the text portion of the top of the file. I do not have a choice as to the format.

I successfully broke down the text into an array with the judicious use of $/. Then I used regexes to pull data from each record in the array. However, the first record is a bit different from the rest because it contains the .mht metadata at the beginning and then the first record's information.

I figured that a regex using '|' should work fine, with the first half of the pattern pulling the information needed from the first record, while the second half pulls information from the rest of the records via the short-circuit behavior of '|'.

At least, that's my expected behavior. Obviously, it's not what's occurring.

Instead, the second half of the pattern is being used to match the first record as well as all the rest, resulting in the first line of metadata returned instead of the information I want. I do not understand why, and thus my quandary. Why?

Here are my code snippets. I tried it two different ways (I'm not sure if the match variables get renumbered after the '|'):

    foreach my $vulnerabilityText (@records) {
        $vulnerabilityText =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/;
        my $vulnerabilityName;
        if (defined($1)) {
            $vulnerabilityName = $1;
        }
        else {
            $vulnerabilityName = $2;
        }
        print($vulnerabilityName, "\n");
    }

or

    foreach my $vulnerabilityText (@records) {
        $vulnerabilityText =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/;
        my $vulnerabilityName = $1;
        print($vulnerabilityName, "\n");
    }

I know each half of the pattern works; I tried them separately and each gives the appropriate result. The problem arises when I join them with '|'.
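On the numbering question raised above: capture groups are numbered by their opening parenthesis across the whole pattern, so they are not renumbered after '|'. The following is a minimal sketch of that behavior; the record text is a shortened, hypothetical stand-in for one of the later records, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a later record (one without the .mht metadata).
my $record = "Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x\n"
           . "Audit ID: 6242\n";

my ($first, $second);
if ($record =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/) {
    # Groups are counted left to right over the entire pattern, so the
    # second alternative fills $2 and leaves $1 undefined.
    ($first, $second) = ($1, $2);
}

print defined $first  ? "\$1 = $first\n"  : "\$1 is undefined\n";
print defined $second ? "\$2 = $second\n" : "\$2 is undefined\n";
```

With this input the second alternative matches, so only $2 is set. That is why the first snippet needs the defined($1) check, while the second snippet can print an undefined value for records that match only the second half.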

System information: ActiveState Perl 5.8.8 on Windows XP

Update 1: I corrected the pattern; I initially posted a simplified one I was using for troubleshooting, not the correct one. Sorry about that, pjotrik.

Update 2 (example data): Here is some data (edited to remove the non-essential parts):

    thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
    MIME-Version: 1.0
    Content-Type: multipart/related; boundary="----=_NextPart_000_0001_01C8E81A.6FD2AA50"
    Content-Location: file:///C:/RemediationReport.mht
    Content-Class: urn:content-classes:message
    Importance: normal
    Priority: normal
    X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4133
    This is a multi-part message in MIME format.
    ------=_NextPart_000_0001_01C8E81A.6FD2AA50
    Content-Type: multipart/alternative; boundary="----=_NextPart_001_0002_01C8E81A.6FD2AA50"
    ------=_NextPart_001_0002_01C8E81A.6FD2AA50
    Content-Type: text/plain; charset="iso-8859-1"
    Content-Transfer-Encoding: 7bit
    _____
    Remediation Report

    Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
    Audit ID: 6504
    <snip>
    Affected Items:

    Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
    Audit ID: 6505
    <snip>
    Affected Items:

    Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x
    Audit ID: 6242
    <snip>

I assigned $/ to "Affected Items:\n\n", so the records are broken down into an array like this:

First element of @records:

    'thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
    MIME-Version: 1.0
    Content-Type: multipart/related; boundary="----=_NextPart_000_0001_01C8E81A.6FD2AA50"
    Content-Location: file:///C:/Program%20Files/eEye%20Digital%20Security/Retina%205/Reports/Temp/Remediation/RemediationReport.html
    Content-Class: urn:content-classes:message
    Importance: normal
    Priority: normal
    X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4133
    This is a multi-part message in MIME format.
    ------=_NextPart_000_0001_01C8E81A.6FD2AA50
    Content-Type: multipart/alternative; boundary="----=_NextPart_001_0002_01C8E81A.6FD2AA50"
    ------=_NextPart_001_0002_01C8E81A.6FD2AA50
    Content-Type: text/plain; charset="iso-8859-1"
    Content-Transfer-Encoding: 7bit
    _____
    Retina - Network Security Scanner
    Network Vulnerability Assessment & Remediation Management
    7/17/2008 - Report created by Retina version 5.9.4.1929
    Remediation Report

    Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
    Audit ID: 6504'

Second element of @records:

    'Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
    Audit ID: 6505'

Third element of @records:

    'Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x
    Audit ID: 6242'
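For reference, the record splitting described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the input is a shortened, hypothetical stand-in read from an in-memory filehandle rather than the real .mht file, and the chomp call (which strips the "Affected Items:\n\n" separator from the end of each record) is assumed because the elements shown do not contain the separator.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the text part of the .mht file.
my $text = join '',
    "thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==\n",
    "Remediation Report\n\n",
    "Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE\n",
    "Audit ID: 6504\n",
    "Affected Items:\n\n",
    "Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x\n",
    "Audit ID: 6242\n";

my @records;
{
    # Localize the input record separator so the change stays in this block.
    local $/ = "Affected Items:\n\n";

    # Read from a string through an in-memory filehandle.
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    while (my $record = <$fh>) {
        chomp $record;    # removes the current value of $/, if present
        push @records, $record;
    }
    close $fh;
}

print scalar(@records), " records\n";
```

Localizing $/ inside a block keeps the custom separator from affecting reads elsewhere in the program.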

Does this help?

The output I expect is:

    Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
    Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
    Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x

The output I get is:

    thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
    Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
    Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x

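Solution: My expectations of how the regex engine works were incorrect. The engine tries every alternative at a given starting position before it advances to the next position, so the leftmost match wins regardless of the order of the alternatives. See moritz's and JavaFan's explanations below. This is my revised code: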

    foreach my $vulnerabilityText (@records) {
        (($vulnerabilityText =~ m/Remediation Report\n\n(.+?)\s*\n/)
            or ($vulnerabilityText =~ m/^(.+?)\s*\n/));
        my $vulnerabilityName = $1;
    }
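To make the difference concrete, here is a small sketch that runs both approaches against the same input. The record is a shortened, hypothetical stand-in for the first element of @records, and $combined and $separate are illustrative names, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the first record.
my $record = "thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==\n"
           . "Remediation Report\n\n"
           . "Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE\n"
           . "Audit ID: 6504\n";

# Single pattern with '|': at offset 0 the first alternative fails, but the
# second one succeeds there, so the engine never reaches "Remediation Report".
my $combined;
if ($record =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/) {
    $combined = defined $1 ? $1 : $2;
}

# Two separate matches joined with 'or': the fallback pattern runs only if
# the specific pattern fails everywhere in the string.
my $separate;
if ($record =~ m/Remediation Report\n\n(.+?)\s*\n/
    or $record =~ m/^(.+?)\s*\n/) {
    $separate = $1;
}

print "combined: $combined\n";
print "separate: $separate\n";
```

The combined pattern yields the thread-index line, while the two-step version yields the vulnerability name, because 'or' short-circuits between whole match attempts rather than between alternatives at a single position.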