PerlMonks  

Regex problems using '|'

by romandas (Pilgrim)
on Jul 22, 2008 at 08:50 UTC ([id://699237]=perlquestion)

romandas has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks,

I come to you with a quandary I do not understand, as tautological as that sounds.

I am trying to parse an .mht file (archived web page) using Perl, specifically the text portion of the top of the file. I do not have a choice as to the format.

I successfully broke down the text into an array with the judicious use of $/. Then I used regexes to pull data from each record in the array. However, the first record is a bit different from the rest because it contains the .mht metadata at the beginning and then the first record's information.

I figured that a regex using '|' should work fine, with the first half of the pattern pulling the information needed from the first record, while the second half pulls information from the rest of the records via the short-circuit behavior of '|'.

At least, that's my expected behavior. Obviously, it's not what's occurring.

Instead, the second half of the pattern is being used to match the first record as well as all the rest, resulting in the first line of metadata returned instead of the information I want. I do not understand why, and thus my quandary. Why?

Here are my code snippets. I tried it two different ways (I'm not sure if the match variables get renumbered after the '|'):

foreach my $vulnerabilityText (@records) {
    $vulnerabilityText =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/;
    my $vulnerabilityName;
    if (defined($1)) {
        $vulnerabilityName = $1;
    } else {
        $vulnerabilityName = $2;
    }
    print($vulnerabilityName, "\n");
}

or

foreach my $vulnerabilityText (@records) {
    $vulnerabilityText =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/;
    my $vulnerabilityName = $1;
    print($vulnerabilityName, "\n");
}

I know each half of the patterns work; I tried them separately and they give the appropriate result. The problem arises when I bind them together with '|'.
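On the question of whether the match variables get renumbered after the '|': as far as I know they do not. Here is a minimal sketch with made-up strings (`A=`, `B=` and the variable names are only for illustration). Capture groups are numbered across the whole pattern, so a capture in the second alternative lands in $2. The branch-reset form `(?|...)` does renumber per alternative, but it was added in Perl 5.10, so it would not be available on 5.8.8.

```perl
use strict;
use warnings;

# Groups are numbered left to right by opening parenthesis across the whole
# pattern, so the group in the second alternative is $2, never $1.
my ($first, $second);
if ("B=two" =~ m/A=(\w+)|B=(\w+)/) {
    ($first, $second) = ($1, $2);
}
print "plain: \$1=", (defined $first ? $first : 'undef'),
      " \$2=", (defined $second ? $second : 'undef'), "\n";

# Branch reset (?|...) restarts the numbering in each alternative.
# It needs Perl 5.10 or later, so it is not an option on 5.8.8.
my $reset;
if ("B=two" =~ m/(?|A=(\w+)|B=(\w+))/) {
    $reset = $1;
}
print "reset: \$1=$reset\n";
```

This is why the first snippet's `defined($1)` check is the right shape for a plain alternation, even though it does not cure the problem described in this post.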

System information: ActiveState Perl 5.8.8 on Windows XP

Update 1: I corrected the pattern; I initially posted a simplified one I was using for troubleshooting, not the correct one. Sorry about that, pjotrik.

Update 2 (example data): Here is some data, edited to remove the non-essential parts:

thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_000_0001_01C8E81A.6FD2AA50"
Content-Location: file:///C:/RemediationReport.mht
Content-Class: urn:content-classes:message
Importance: normal
Priority: normal
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4133
This is a multi-part message in MIME format.
------=_NextPart_000_0001_01C8E81A.6FD2AA50
Content-Type: multipart/alternative; boundary="----=_NextPart_001_0002_01C8E81A.6FD2AA50"
------=_NextPart_001_0002_01C8E81A.6FD2AA50
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
_____
Remediation Report

Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
Audit ID: 6504
<snip>
Affected Items:

Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Audit ID: 6505
<snip>
Affected Items:

Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x
Audit ID: 6242
<snip>

I assigned $/ to "Affected Items:\n\n", so the records are broken down into an array like this:

First element of @records:

'thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_000_0001_01C8E81A.6FD2AA50"
Content-Location: file:///C:/Program%20Files/eEye%20Digital%20Security/Retina%20 5/Reports/Temp/Remediation/RemediationReport.html
Content-Class: urn:content-classes:message
Importance: normal
Priority: normal
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4133
This is a multi-part message in MIME format.
------=_NextPart_000_0001_01C8E81A.6FD2AA50
Content-Type: multipart/alternative; boundary="----=_NextPart_001_0002_01C8E81A.6FD2AA50"
------=_NextPart_001_0002_01C8E81A.6FD2AA50
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
_____
Retina - Network Security Scanner
Network Vulnerability Assessment & Remediation Management
7/17/2008 - Report created by Retina version 5.9.4.1929
Remediation Report

Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
Audit ID: 6504'

Second element of @records:

'Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Audit ID: 6505'

Third element of @records:

'Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x
Audit ID: 6242'

Does this help?
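For anyone following along, the record splitting described above can be sketched roughly like this. The `$data` string is a small hypothetical stand-in for the .mht text, and reading it through an in-memory filehandle is just a convenience for the example; the real script presumably reads the file itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the report text; the real .mht content differs.
my $data = "header line\n\nRemediation Report\n\nFirst Vuln\nAudit ID: 1\n"
         . "Affected Items:\n\n"
         . "Second Vuln\nAudit ID: 2\n"
         . "Affected Items:\n\n"
         . "Third Vuln\nAudit ID: 3\n";

my @records;
{
    local $/ = "Affected Items:\n\n";   # record separator, as described above
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    while (my $record = <$fh>) {
        chomp $record;                  # chomp strips the current value of $/
        push @records, $record;
    }
    close $fh;
}
print scalar(@records), " records\n";
```

Using `local` keeps the changed separator from leaking into the rest of the program, and `chomp` is what removes the trailing "Affected Items:" text from each record.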

The output I expect is:

Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x

The output I get is:

thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x

Solution: My expectations of how the regex engine works are incorrect. See moritz's and JavaFan's explanations below. This is my revised code.

foreach my $vulnerabilityText (@records) {
    (($vulnerabilityText =~ m/Remediation Report\n\n(.+?)\s*\n/)
        or ($vulnerabilityText =~ m/^(.+?)\s*\n/));
    my $vulnerabilityName = $1;
}
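A self-contained way to try the same idea is sketched below. The `vulnerability_name` sub is a hypothetical wrapper, and the records are trimmed-down versions of the example data, so treat it as an illustration and not the exact script.

```perl
use strict;
use warnings;

# Try the specific pattern first; fall back to "first line" only if the
# specific pattern fails everywhere in the string.
sub vulnerability_name {
    my ($text) = @_;
    return $1 if $text =~ m/Remediation Report\n\n(.+?)\s*\n/;
    return $1 if $text =~ m/^(.+?)\s*\n/;
    return;    # neither pattern matched
}

# Shortened records modelled on the example data in this post.
my @records = (
    "thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==\nMIME-Version: 1.0\n"
      . "Remediation Report\n\n"
      . "Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE\nAudit ID: 6504\n",
    "Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera\nAudit ID: 6505\n",
    "Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x\nAudit ID: 6242\n",
);

my @names = map { vulnerability_name($_) } @records;
print "$_\n" for @names;
```

Because the two matches run one after the other, the second pattern only gets a chance when the first has failed at every position, which is the behavior the combined pattern was expected to have.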

Replies are listed 'Best First'.
Re: Regex problems using '|'
by pjotrik (Friar) on Jul 22, 2008 at 09:05 UTC
    How about
    foreach my $vulnerabilityText (@records) {
        $vulnerabilityText =~ m/(Remediation Report|.+)/;
        my $vulnerabilityName = $1;
        print($vulnerabilityName, "\n");
    }
      Whoops. I posted the wrong pattern. The one I needed to work wouldn't work correctly with the grouping behavior of (), since I had to match on unique text in record 1 to locate the same information that's at the top in the other records.
        OK, that makes it something like:
        foreach my $vulnerabilityText (@records) {
            $vulnerabilityText =~ m/(Remediation report\n\n)?(.+)/;
            my $vulnerabilityName = $2;
            print($vulnerabilityName, "\n");
        }
        I'm not sure about the newlines, but you'll work that out.
Re: Regex problems using '|'
by moritz (Cardinal) on Jul 22, 2008 at 09:45 UTC
    Show us some example data, and what you want to extract from it. Please also include example data that should not match (if there is data like that in your application).
      Okay, I posted some. What I don't understand is if I just use /Remediation Report\n\n(.+?)\n/, I get the line I'm looking for (Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE), but if I use /Remediation Report\n\n(.+?)\n|^(.+?)\n/, I get the top of the metadata (thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==).
        The regex engine starts at the start of the string, and tries to match the first alternative, here Remediation Report\n\n(.+?)\n. It doesn't match, so it tries the second alternative, ^(.+?)\n. That one matches, so it captures the first line in $1.

        Without the alternation, the regex engine moves its starting position until it finds the substring Remediation Report.

        (Actually it's much smarter than that; it searches for the constant substring with the same techniques that index uses, but from a user's point of view that only matters when it comes to speed, not in terms of functionality).
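The difference in starting position can be observed directly with `$-[0]`, which holds the offset where the last successful match began. The following is a small sketch with a made-up string (the "header" lines and "Wanted Name" are placeholders, not the real data).

```perl
use strict;
use warnings;

my $text = "header one\nheader two\n\nRemediation Report\n\nWanted Name\nAudit ID: 1\n";

# Combined pattern: at offset 0 the first alternative fails, the second
# succeeds, and the engine stops there without ever advancing.
$text =~ m/Remediation Report\n\n(.+?)\n|^(.+?)\n/ or die "no match";
my $combined_start = $-[0];
my $combined_name  = defined $1 ? $1 : $2;

# Single pattern: nothing can match at offset 0, so the engine keeps
# advancing the start position until the literal text is found.
$text =~ m/Remediation Report\n\n(.+?)\n/ or die "no match";
my $single_start = $-[0];
my $single_name  = $1;

print "combined: offset $combined_start, '$combined_name'\n";
print "single:   offset $single_start, '$single_name'\n";
```

The combined pattern reports a match starting at offset 0 with the first header line, while the single pattern reports a later offset and the wanted name. The leftmost position that matches any alternative wins, regardless of the order in which the alternatives are written.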

Re: Regex problems using '|'
by Anonymous Monk on Jul 22, 2008 at 09:24 UTC
    my $foo = $1 if m/^(.+?)\n/;
    $foo =~ s/Remediation Report(?:\n+)?//;
      Thank you for your submission. I'm not sure it does what I'm looking for, however. The first record's information is buried further down, so the initial pattern won't capture the information I'm looking for. It's why there is a ^ at the beginning of the second portion of the pattern, whereas the first portion does not contain it. Effectively "Remediation Report\n\n" is the equivalent of ^, as it is the starting point for the subsequent match.
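If a single pattern is still preferred, one possible approach (a sketch, not something proposed in this thread) is to anchor the whole match at the start of the string with `\A` and make the skip-ahead to "Remediation Report\n\n" an optional prefix. The sub name and the sample strings below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: one pattern, anchored at the very start with \A.
# The optional prefix skips ahead to just past "Remediation Report\n\n"
# when that text exists; otherwise it matches nothing and the capture
# starts at the first line.
sub name_from_record {
    my ($text) = @_;
    return $text =~ m/\A(?:(?s:.*?)Remediation Report\n\n)?(.+?)\s*\n/ ? $1 : undef;
}

my $first = "thread-index: abc==\nMIME-Version: 1.0\n\nRemediation Report\n\nFirst Name\nAudit ID: 1\n";
my $other = "Second Name\nAudit ID: 2\n";

print name_from_record($first), "\n";
print name_from_record($other), "\n";
```

Since both cases now begin at offset 0, the leftmost-match rule no longer favors the first line: the engine tries the optional prefix first and only drops it when "Remediation Report" is nowhere in the record.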
