PerlMonks  

Re: Reg Exp to handle variations in the matched pattern

by moritz (Cardinal)
on Feb 22, 2012 at 13:12 UTC ( [id://955520]=note )


in reply to Reg Exp to handle variations in the matched pattern

I don't understand your question. It would be nice if you provided several pieces of text that are supposed to match, and several that are supposed not to match, and what problem you encounter.

One thing that looks suspicious is your use of character classes. For example, [^:\r(\d|\w] matches any single character except a colon, \r, an opening paren, a vertical bar, digits and word characters; inside a character class, ( and | are literal characters, not grouping or alternation. That's not what you want, is it?
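To make the point about the character class concrete, here is a minimal sketch that tests a handful of single characters against it. The sample characters are chosen for illustration only; they are not taken from the original question.

```perl
use strict;
use warnings;

# The class from the original post: inside [...], '(' and '|' are literal
# characters, not grouping or alternation.
my $class = qr/[^:\r(\d|\w]/;

# Test each sample character against the negated class.
# A character matches only if it is NOT listed in the class.
my %matches;
for my $char (':', "\r", '(', '|', '7', 'a', '_', ' ', '-', ')') {
    $matches{$char} = $char =~ $class ? 1 : 0;
}

for my $char (sort keys %matches) {
    my $shown = $char eq "\r" ? '\r' : $char;
    printf "%-3s %s\n", $shown, $matches{$char} ? 'matches' : 'excluded';
}
```

The colon, \r, (, |, the digit, the letter and the underscore are all excluded, while the space, the dash and the closing paren match.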

Also, your last regex has an unbalanced ).

What I'm looking for is to perhaps modify my reg exp in such a way that pattern 2 only matches match2a

The regex /^this is text:\r$/ would do that trick. Is that what you want?
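A quick way to check what that anchored regex accepts is to run it against a few strings. The samples below are hypothetical, since the original match2a and match2b texts are not shown here.

```perl
use strict;
use warnings;

my $re = qr/^this is text:\r$/;

# Hypothetical sample strings (the original match2a/match2b text is not shown).
my @samples = (
    "this is text:\r",       # colon, CR, end of string
    "this is text:\r\n",     # colon, CR, then a final newline
    "this is text:\r1",      # CR followed by a digit
    "this is text: -\r",     # extra " -" before the CR
);

my @results = map { $_ =~ $re ? 1 : 0 } @samples;

for my $i (0 .. $#samples) {
    (my $shown = $samples[$i]) =~ s/\r/\\r/g;
    $shown =~ s/\n/\\n/g;
    printf "%-22s %s\n", $shown, $results[$i] ? 'match' : 'no match';
}
```

Note that $ also matches just before a newline at the very end of the string, so the second sample matches as well as the first.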

Re^2: Reg Exp to handle variations in the matched pattern
by markjrouse (Initiate) on Feb 22, 2012 at 13:23 UTC

    Essentially, I want to match any text where there is:

      a space, followed by a dash, followed by a carriage return OR a colon, followed by a carriage return BUT NOT a colon, followed by carriage return, followed by a digit, or a letter.

    One of the text files is actually located here: http://www.treasury.gov/resource-center/sanctions/SDN-List/Documents/sdnew02.txt

    I'm not interested in the text before the colon, as I want to search and replace, but I'm having problems getting the regexp just right.

      a space, followed by a dash, followed by a carriage return OR a colon, followed by a carriage return

      So far that's simple: / -[:\r]\r/

      BUT NOT a colon, followed by carriage return

      If you're looking for two carriage returns in a row, then you'll never find something where the first carriage return is followed by a colon (because then it's not two carriage returns in a row, d'oh), so I don't see why you emphasize it like that.

      followed by carriage return, followed by a digit, or a letter.
      \r\w
      One of the text files is actually located here: http://www.treasury.gov/resource-center/sanctions/SDN-List/Documents/sdnew02.txt

      The pattern you describe matches nowhere in that file; in fact I can't find a single occurrence of a carriage return in that file.
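One way to verify whether a file contains carriage returns at all is to count them directly. This is only a sketch: the helper names count_line_endings and slurp_raw are made up for illustration, and the file name in the usage comment assumes a local copy of the download.

```perl
use strict;
use warnings;

# Count CR (0x0D) and LF (0x0A) bytes in a chunk of raw data.
# tr/// with an empty replacement list only counts; it changes nothing.
sub count_line_endings {
    my ($data) = @_;
    my $cr = ($data =~ tr/\r//);
    my $lf = ($data =~ tr/\n//);
    return ($cr, $lf);
}

# Read a file in binary mode so no I/O layer translates CRLF to LF.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;               # slurp mode: read the whole file at once
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Usage (assumed local file name): perl count_cr.pl sdnew02.txt
if (@ARGV) {
    my ($cr, $lf) = count_line_endings(slurp_raw($ARGV[0]));
    print "CR: $cr, LF: $lf\n";
}
```

If the CR count is zero, any pattern containing \r cannot match, whatever the rest of the regex looks like.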

      If you describe what information you want to extract from that file, we might be able to help you. But right now it seems that you don't have a clear mental image yourself, so it's pretty hard to help you.

        Ultimately, I'm looking to find out how Perl could parse this file to turn it into a structured format. My first thought is to use Perl to update the file, breaking out the elements that would each constitute a line, and then import these lines into a database to extract the various fields. That is, unless Perl is able to do this better and more efficiently on its own. I'm still learning Perl, so I'm sure there is a better way of doing it in Perl.

        If you look at that file, from line 27:

        Licensing at 202/622-2480. The following changes have occurred with respect to the Office of Foreign Assets Control Listing of Specially Designated Nationals and Blocked Persons since January 1, 2002: 01/09/02: The following have been named as "Specially Designated Global Terrorists" [SDGTs] -

        There are two distinct patterns that I'm trying to match here, hence my original regexp (\s-\r)|(:\r). After the "January 1, 2002:" text there is a carriage return and line feed, twice: hex values 0D 0A 0D 0A. I'm looking to insert a string between the ":" and the carriage return. So the first pattern is /(:)\r\n\r\n/ and my substitution code is this:

        s/(:)\r\n\r\n/\1\$\$\n/g but of course this insertion is not working

        It may be my hex/text editor, but it tells me there are lots of carriage returns in this data.

        The second pattern is after the "01/09/02: The following have been named as "Specially Designated Global Terrorists" [SDGTs] -" text, where the dash at the end is preceded by a space and followed by a carriage return and line feed, twice, so my match regexp is /(\s-)\r\n\r\n/ and my substitution code is s/(\s-)\r\n\r\n/\1\$\$\n/g but of course this insertion is not working.

        The subsequent result would be:

        Licensing at 202/622-2480. The following changes have occurred with respect to the Office of Foreign Assets Control Listing of Specially Designated Nationals and Blocked Persons since January 1, 2002:$$ 01/09/02: The following have been named as "Specially Designated Global Terrorists" [SDGTs] -$$
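For illustration, the two substitutions described above could be combined into one, as in the sketch below. It rests on some assumptions: the whole text is held in a single string (a line-by-line read would never see two line endings together), \r?\n is used so that both CRLF and bare LF input are handled, and the sample text is a shortened stand-in for the real file. The helper name mark_breaks is made up. Perl also prefers $1 over \1 on the replacement side; \1 works there but triggers a warning under use warnings.

```perl
use strict;
use warnings;

# Append a "$$" marker after a ":" or " -" that is followed by a blank
# line, and collapse the blank line into a single newline.
# \r?\n accepts both CRLF and bare LF line endings.
sub mark_breaks {
    my ($text) = @_;
    $text =~ s/(:|\s-)\r?\n\r?\n/$1\$\$\n/g;
    return $text;
}

# Shortened, hypothetical stand-in for the text around line 27 of the file.
my $sample = "since January 1, 2002:\r\n\r\n"
           . "01/09/02: named as [SDGTs] -\r\n\r\n"
           . "next entry";

print mark_breaks($sample), "\n";
```

A colon in the middle of a line, such as the one after "01/09/02", is left alone because it is not followed by a blank line.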

        Sorry for not being much clearer. It's a bit difficult to explain.

        Hi Moritz, yes, you're right. I've just re-downloaded the file and there are no carriage returns. I'll try this again.
