Greetings fellow monks,
I come to you with a quandary I do not understand, as tautological as that sounds.
I am trying to parse an .mht file (archived web page) using Perl, specifically the text portion of the top of the file. I do not have a choice as to the format.
I successfully broke down the text into an array with the judicious use of $/. Then I used regexes to pull data from each record in the array. However, the first record is a bit different from the rest because it contains the .mht metadata at the beginning and then the first record's information.
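For reference, a minimal sketch of that splitting step. It assumes the separator is "Affected Items:\n\n" and reads from an in-memory filehandle instead of the real .mht file; the helper name read_records and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a block of text into records using $/.
sub read_records {
    my ($text, $separator) = @_;
    # An in-memory filehandle stands in for the real file here.
    open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!";
    local $/ = $separator;   # input record separator, restored on scope exit
    my @records = <$fh>;     # one element per separator-terminated chunk
    chomp(@records);         # chomp strips the current $/ from each record
    close($fh);
    return @records;
}

my $sample = "header line\nFirst item\nAffected Items:\n\nSecond item\n";
my @records = read_records($sample, "Affected Items:\n\n");
print scalar(@records), " records\n";
```

Using local on $/ keeps the changed separator from leaking into other reads in the program.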
I figured that a regex using '|' should work fine, with the first half of the pattern pulling the information needed from the first record, while the second half pulls information from the rest of the records via the short-circuit behavior of '|'.
At least, that's my expected behavior. Obviously, it's not what's occurring.
Instead, the second half of the pattern matches the first record as well as all the rest, so the first line of metadata is returned instead of the information I want. I do not understand why, and thus my quandary. Why?
Here are my code snippets. I tried it two different ways (I'm not sure if the match variables get renumbered after the '|'):
foreach my $vulnerabilityText (@records)
{
    $vulnerabilityText =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/;
    my $vulnerabilityName;
    if (defined($1))
    {
        $vulnerabilityName = $1;
    }
    else
    {
        $vulnerabilityName = $2;
    }
    print($vulnerabilityName, "\n");
}
or
foreach my $vulnerabilityText (@records)
{
    $vulnerabilityText =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/;
    my $vulnerabilityName = $1;
    print($vulnerabilityName, "\n");
}
I know each half of the patterns work; I tried them separately and they give the appropriate result. The problem arises when I bind them together with '|'.
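On the numbering question: capture groups are numbered by their opening parenthesis across the whole pattern, so they are not renumbered after '|'. A group in an alternative that did not take part in the match is left undefined. The sketch below shows this with a toy pattern; the helper name captures and the left=/right= strings are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return ($1, $2) for a two-alternative pattern.
sub captures {
    my ($text) = @_;
    # In list context a successful match returns the capture groups in order.
    return $text =~ m/left=(\w+)|right=(\w+)/;
}

foreach my $text ("left=foo", "right=bar") {
    my ($first, $second) = captures($text);
    # Only the group inside the alternative that matched is defined.
    print "$text: \$1 is ", (defined $first  ? "'$first'"  : "undef"),
          ", \$2 is ",      (defined $second ? "'$second'" : "undef"), "\n";
}
```

This is why the first snippet needs the defined($1) check, and why reading only $1 (the second snippet) cannot see anything matched by the second alternative.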
System information: ActiveState Perl 5.8.8 on Windows XP
Update 1: I corrected the pattern; I initially posted a simplified one I was using for troubleshooting, not the correct one. Sorry about that, pjotrik.
Update 2 (example data): Here is some data (edited to remove non-essential parts):
thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
MIME-Version: 1.0
Content-Type: multipart/related;
boundary="----=_NextPart_000_0001_01C8E81A.6FD2AA50"
Content-Location: file:///C:/RemediationReport.mht
Content-Class: urn:content-classes:message
Importance: normal
Priority: normal
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4133
This is a multi-part message in MIME format.
------=_NextPart_000_0001_01C8E81A.6FD2AA50
Content-Type: multipart/alternative;
boundary="----=_NextPart_001_0002_01C8E81A.6FD2AA50"
------=_NextPart_001_0002_01C8E81A.6FD2AA50
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
_____
Remediation Report
Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
Audit ID: 6504
<snip>
Affected Items:
Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Audit ID: 6505
<snip>
Affected Items:
Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x
Audit ID: 6242
<snip>
I assigned $/ to "Affected Items:\n\n", so the records are broken down into an array like this:
First element of @records:
'thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
MIME-Version: 1.0
Content-Type: multipart/related;
boundary="----=_NextPart_000_0001_01C8E81A.6FD2AA50"
Content-Location: file:///C:/Program%20Files/eEye%20Digital%20Security/Retina%205/Reports/Temp/Remediation/RemediationReport.html
Content-Class: urn:content-classes:message
Importance: normal
Priority: normal
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4133
This is a multi-part message in MIME format.
------=_NextPart_000_0001_01C8E81A.6FD2AA50
Content-Type: multipart/alternative;
boundary="----=_NextPart_001_0002_01C8E81A.6FD2AA50"
------=_NextPart_001_0002_01C8E81A.6FD2AA50
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
_____
Retina - Network Security Scanner
Network Vulnerability Assessment & Remediation Management
7/17/2008 - Report created by Retina version 5.9.4.1929
Remediation Report
Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
Audit ID: 6504'
Second element of @records:
'Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Audit ID: 6505'
Third element of @records:
'Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x
Audit ID: 6242'
Does this help?
The output I expect is:
Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x
The output I get is:
thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x
Solution: My expectations of how the regex engine works are incorrect. See moritz's and JavaFan's explanations below. This is my revised code.
foreach my $vulnerabilityText (@records)
{
    (($vulnerabilityText =~ m/Remediation Report\n\n(.+?)\s*\n/) or ($vulnerabilityText =~ m/^(.+?)\s*\n/));
    my $vulnerabilityName = $1;
}
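To spell out why the combined pattern misbehaved: the engine tries every alternative at the earliest starting position before it moves on to the next position, so the leftmost match wins regardless of the order of the alternatives. At offset 0 of the first record the first alternative fails, the second succeeds, and the search stops there. The sketch below shows that, followed by one possible single-pattern variant that anchors at \A so there is only one starting position. The helper name vulnerability_name and the shortened sample record are made up for illustration; this is an alternative sketch, not the code I actually ran.

```perl
use strict;
use warnings;

my $first = "thread-index: abc\nRemediation Report\n\nFlash Player - IE\nAudit ID: 6504\n";

# The original combined pattern: the second alternative matches at offset 0,
# so the engine never reaches the position where the first one would match.
if ($first =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/) {
    print "group 1: ", (defined $1 ? $1 : "undef"), "\n";
    print "group 2: ", (defined $2 ? $2 : "undef"), "\n";
}

# Hypothetical helper: one anchored pattern with an optional prefix.
sub vulnerability_name {
    my ($record) = @_;
    # \A pins the match to the start of the record. The optional group skips
    # ahead past the report heading when it is present; otherwise it matches
    # nothing and the capture takes the first line of the record.
    if ($record =~ m/\A(?:.*?Remediation Report\n\n)?([^\n]+?)\s*\n/s) {
        return $1;
    }
    return;   # no usable line found
}

print vulnerability_name($first), "\n";
print vulnerability_name("Adobe Reader Update\nAudit ID: 6242\n"), "\n";
```

Returning from inside the if also avoids reading a stale $1 left over from an earlier successful match when a record does not match at all.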