PerlMonks  


Greetings fellow monks,

I come to you with a quandary I do not understand, as tautological as that sounds.

I am trying to parse an .mht file (archived web page) using Perl, specifically the text portion of the top of the file. I do not have a choice as to the format.

I successfully broke down the text into an array with the judicious use of $/. Then I used regexes to pull data from each record in the array. However, the first record is a bit different from the rest because it contains the .mht metadata at the beginning and then the first record's information.

I figured that a regex using '|' should work fine, with the first half of the pattern pulling the information needed from the first record, while the second half pulls information from the rest of the records via the short-circuit behavior of '|'.

At least, that's my expected behavior. Obviously, it's not what's occurring.

Instead, the second half of the pattern matches the first record as well as all the rest, so the first line of metadata is returned instead of the information I want. I do not understand why, and thus my quandary. Why?

Here are my code snippets. I tried it two different ways (I'm not sure if the match variables get renumbered after the '|'):

foreach my $vulnerabilityText (@records) {
    $vulnerabilityText =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/;
    my $vulnerabilityName;
    if (defined($1)) {
        $vulnerabilityName = $1;
    }
    else {
        $vulnerabilityName = $2;
    }
    print($vulnerabilityName, "\n");
}

or

foreach my $vulnerabilityText (@records) {
    $vulnerabilityText =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/;
    my $vulnerabilityName = $1;
    print($vulnerabilityName, "\n");
}

I know each half of the patterns work; I tried them separately and they give the appropriate result. The problem arises when I bind them together with '|'.
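On the side question about numbering: capture groups are numbered by the position of their opening parenthesis across the whole pattern, so they are not renumbered after '|'. The following is only a minimal sketch with made-up sample strings (not taken from the report); the names @samples and @names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only to exercise each alternative.
my @samples = ("key=alpha\n", "beta\n");
my @names;

foreach my $text (@samples) {
    # Group 1 belongs to the left alternative, group 2 to the right one.
    if ($text =~ m/key=(.+?)\n|^(.+?)\n/) {
        # Only the alternative that matched sets its group; the other is undef.
        my $name = defined($1) ? $1 : $2;
        push @names, $name;
        print "$name\n";
    }
}
```

So the first snippet's defined($1) check is the right shape for a two-group alternation. Perl 5.10 and later also offer the branch reset construct (?|...), which numbers the groups in each alternative from the same starting point, but that is not available on 5.8.8.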

System information: ActiveState Perl 5.8.8 on Windows XP

Update 1: I corrected the pattern; I initially posted a simplified one I was using for troubleshooting, not the correct one. Sorry about that, pjotrik.

Update 2 (example data): Here is some data (edited to remove non-essential parts):

thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_000_0001_01C8E81A.6FD2AA50"
Content-Location: file:///C:/RemediationReport.mht
Content-Class: urn:content-classes:message
Importance: normal
Priority: normal
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4133

This is a multi-part message in MIME format.

------=_NextPart_000_0001_01C8E81A.6FD2AA50
Content-Type: multipart/alternative; boundary="----=_NextPart_001_0002_01C8E81A.6FD2AA50"

------=_NextPart_001_0002_01C8E81A.6FD2AA50
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

_____

Remediation Report

Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
Audit ID: 6504
<snip>
Affected Items:

Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Audit ID: 6505
<snip>
Affected Items:

Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x
Audit ID: 6242
<snip>

I assigned $/ to "Affected Items:\n\n", so the records are broken down into an array like this:
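For readers unfamiliar with the technique, here is a rough sketch of splitting input into records by setting $/. It reads from an in-memory string rather than the real .mht file; the contents of $data and the use of chomp are assumptions made for illustration, not the code actually used for the report.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the report text; the real input is the .mht file.
my $data = "Header line\n\nRemediation Report\n\nFirst Vuln\nAudit ID: 1\nAffected Items:\n\n"
         . "Second Vuln\nAudit ID: 2\nAffected Items:\n\n"
         . "Third Vuln\nAudit ID: 3\n";

my @records;
{
    # Localize $/ so the change does not leak into other reads.
    local $/ = "Affected Items:\n\n";
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    while (my $record = <$fh>) {
        chomp $record;    # chomp removes a trailing copy of the current $/
        push @records, $record;
    }
    close $fh;
}

print scalar(@records), " records\n";
```

Each read returns everything up to and including the next "Affected Items:\n\n", so the first record carries all of the leading metadata along with the first vulnerability.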

First element of @records:

'thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_000_0001_01C8E81A.6FD2AA50"
Content-Location: file:///C:/Program%20Files/eEye%20Digital%20Security/Retina%205/Reports/Temp/Remediation/RemediationReport.html
Content-Class: urn:content-classes:message
Importance: normal
Priority: normal
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4133

This is a multi-part message in MIME format.

------=_NextPart_000_0001_01C8E81A.6FD2AA50
Content-Type: multipart/alternative; boundary="----=_NextPart_001_0002_01C8E81A.6FD2AA50"

------=_NextPart_001_0002_01C8E81A.6FD2AA50
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

_____

Retina - Network Security Scanner
Network Vulnerability Assessment & Remediation Management
7/17/2008 - Report created by Retina version 5.9.4.1929

Remediation Report

Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
Audit ID: 6504'

Second element of @records:

'Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Audit ID: 6505'

Third element of @records:

'Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x
Audit ID: 6242'

Does this help?

The output I expect is:

Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x

The output I get is:

thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x

Solution: My expectations of how the regex engine works were incorrect. See moritz's and JavaFan's explanations below. This is my revised code:

foreach my $vulnerabilityText (@records) {
    (($vulnerabilityText =~ m/Remediation Report\n\n(.+?)\s*\n/)
        or ($vulnerabilityText =~ m/^(.+?)\s*\n/));
    my $vulnerabilityName = $1;
}
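In brief, the reason is that the regex engine tries every alternative at the current start position before it advances, so alternation prefers the earliest match in the string, not the leftmost alternative in the pattern. At offset 0 of the first record, "Remediation Report" cannot match but ^(.+?)\n can, so the metadata line wins. The sketch below contrasts the two approaches on shortened, hypothetical records (the record contents and the @combined and @separate arrays are invented for illustration).

```perl
use strict;
use warnings;

# Shortened, hypothetical records shaped like the ones in the post.
my @records = (
    "thread-index: abc==\nMIME-Version: 1.0\n\nRemediation Report\n\nVuln One\nAudit ID: 1\n",
    "Vuln Two\nAudit ID: 2\n",
);

# Combined pattern: at offset 0 the right-hand alternative already matches,
# so the engine never reaches "Remediation Report" further along the string.
my @combined;
foreach my $text (@records) {
    if ($text =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/) {
        push @combined, defined($1) ? $1 : $2;
    }
}

# Two separate matches joined with 'or': the second pattern is only tried
# when the first finds nothing anywhere in the string.
my @separate;
foreach my $text (@records) {
    if ($text =~ m/Remediation Report\n\n(.+?)\s*\n/
        or $text =~ m/^(.+?)\s*\n/) {
        push @separate, $1;
    }
}

print "combined: $_\n" for @combined;
print "separate: $_\n" for @separate;
```

Wrapping the two matches in an if, as here, also avoids reading a stale $1 when neither pattern matches, since a failed match leaves the capture variables from an earlier successful match untouched.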

In reply to Regex problems using '|' by romandas
