PerlMonks  

Re^2: How to match and extract string after exact 3 digits [RESOLVED]

by thanos1983 (Parson)
on Feb 10, 2017 at 11:50 UTC ( [id://1181646]=note )


in reply to Re: How to match and extract string after exact 3 digits [RESOLVED]
in thread How to match and extract string after exact 3 digits [RESOLVED]

Hello Marshall,

This is very interesting, but while I was testing my code last night I updated my data to something a bit more complex.

Sample of the new, more complicated data:

my $str = "902 M 903 Textmessage 904 PO 905 S 906 VAS 907 10 908 3629 909 85290200429/TYPE=thanos\@test.com 910 NA 911 NA 912 NA 913 NA 914 NA 917 0 918 NA 919 Wed,_01_Feb_2017_19:56:23_GMT 922 NA 923 PO 924 NA 925 NA 926 07594d85 927 100 928 20170202035623000+08 929 20170202035623000+08 930 NA 931 85260531042/TYPE=thanos2\@test.2.com 932 1 934 258;3259 920 NA 921 NA 935 NA 936 NA 938 NA 939 NA 940 thanos-local 942 NA 944 NA 945 4880 946 NA 948 NA 950 454000000927816 953 NA 954 13 955 5.3.0 956 NA 957 07594d85 958 NA 961 13 981 NA 982 0 983 85290200429/TYPE=thanos3\@test.3.com 984 Wed,_01_Feb_2017_19:56:23_GMT 985 RegularThanos 986 TEST 987 NA 988 NA 991 NA 992 NA 993 NA 994 123456789 995 NA 996 NA 997 NA 998 NA 603 0E552E92 602 0 617 NA 618 NA 621 NA This is a test line that I want to Capture2 635 NA 636 NA 637 NA 638 NA 639 This is a test line that I want to Capture";

The regex that you provided works perfectly without the new string. I am really bad with regexes, so I cannot really tell why it fails with the new one. I think it is because it matches the first string that it finds, and at the end of the string it stops. In my case the values can also contain whitespace characters. So I need the regex to capture the string between two integers of 3 digits each, but that will also be a problem for the last value, because no integer follows it at the end of the string.

My temporary solution is:

@pairs = split(/(?:^|\s+)(\d{3})\s+/, $str);

After that I need to remove the empty elements from the array and strip leading and trailing whitespace from the rest. Apart from that, it seems to be working fine.
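As a minimal sketch, the split-based approach and its cleanup step could look like the following. The shortened sample string and the names `@fields` and `%value_for` are assumptions made for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Shortened sample in the same "key value" layout (hypothetical data).
my $str = '902 M 903 Textmessage 927 100 621 NA This is a test line 639 last value';

# Split on every 3-digit key that is surrounded by whitespace (or sits at the
# start of the string). Because the key is in capturing parentheses, split
# returns the key as a field as well.
my @fields = split /(?:^|\s+)(\d{3})\s+/, $str;

# The match at position 0 leaves one empty leading field; drop it.
shift @fields if @fields && $fields[0] eq '';

# Strip any leading or trailing whitespace from each field.
s/^\s+|\s+$//g for @fields;

# What remains alternates key, value, key, value, ...
my %value_for = @fields;

print "$_ => '$value_for{$_}'\n" for sort { $a <=> $b } keys %value_for;
```

Because the whitespace after a key is consumed by the match, a 3-digit value such as `100` directly after `927` is not mistaken for a key here.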

Nevertheless, your solution is a really good idea, but for the moment it does not work with my current data.

Thank you for your time and effort though.

Seeking for Perl wisdom...on the process of learning...not there...yet!

Replies are listed 'Best First'.
Re^3: How to match and extract string after exact 3 digits [RESOLVED]
by Marshall (Canon) on Feb 10, 2017 at 19:15 UTC
    I see that you've arrived at an approach that is working! Great! Sometimes with these things, just getting it done some way is a major hurdle!

    For future info, I went ahead and adapted my match global approach to your new data set. Here's the code and then some explanation of the regex follows. I added some single quotes around the values so you could see that there aren't any leading or trailing spaces to clean up.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;

        my $str2 = "902 M 903 Textmessage 904 PO 905 S 906 VAS 907 10 908 3629 909 85290200429/TYPE=thanos\@test.com 910 NA 911 NA 912 NA 913 NA 914 NA 917 0 918 NA 919 Wed,_01_Feb_2017_19:56:23_GMT 922 NA 923 PO 924 NA 925 NA 926 07594d85 927 100 928 20170202035623000+08 929 20170202035623000+08 930 NA 931 85260531042/TYPE=thanos2\@test.2.com 932 1 934 258;3259 920 NA 921 NA 935 NA 936 NA 938 NA 939 NA 940 thanos-local 942 NA 944 NA 945 4880 946 NA 948 NA 950 454000000927816 953 NA 954 13 955 5.3.0 956 NA 957 07594d85 958 NA 961 13 981 NA 982 0 983 85290200429/TYPE=thanos3\@test.3.com 984 Wed,_01_Feb_2017_19:56:23_GMT 985 RegularThanos 986 TEST 987 NA 988 NA 991 NA 992 NA 993 NA 994 123456789 995 NA 996 NA 997 NA 998 NA 603 0E552E92 602 0 617 NA 618 NA 621 NA This is a test line that I want to Capture2 635 NA 636 NA 637 NA 638 NA 639 This is a test line that I want to Capture";

        my (%hash) = $str2 =~ /(\d{3})\s+(.+?)\s*(?=\d{3}|$)/g;

        foreach my $key ( sort { $a <=> $b } keys %hash ) {
            print "$key => \'$hash{$key}\'\n";
        }
        __END__
        259 => '920 NA'
        602 => '0'
        603 => '0E'
        617 => 'NA'
        618 => 'NA'
        621 => 'NA This is a test line that I want to Capture2'
        629 => '909'
        635 => 'NA'
        636 => 'NA'
        637 => 'NA'
        638 => 'NA'
        639 => 'This is a test line that I want to Capture'
        789 => '995 NA'
        816 => '953 NA'
        880 => '946 NA'
        902 => 'M'
        903 => 'Textmessage'
        904 => 'PO'
        905 => 'S'
        906 => 'VAS'
        907 => '10'
        908 => '3'
        910 => 'NA'
        911 => 'NA'
        912 => 'NA'
        913 => 'NA'
        914 => 'NA'
        917 => '0'
        918 => 'NA'
        919 => 'Wed,_01_Feb_'
        921 => 'NA'
        922 => 'NA'
        923 => 'PO'
        924 => 'NA'
        925 => 'NA'
        926 => '0'
        927 => '100'
        928 => '2'
        929 => '2'
        930 => 'NA'
        931 => '8'
        932 => '1'
        934 => '258;'
        935 => 'NA'
        936 => 'NA'
        938 => 'NA'
        939 => 'NA'
        940 => 'thanos-local'
        942 => 'NA'
        944 => 'NA'
        945 => '4'
        948 => 'NA'
        950 => '4'
        954 => '13'
        955 => '5.3.0'
        956 => 'NA'
        957 => '0'
        958 => 'NA'
        961 => '13'
        981 => 'NA'
        982 => '0'
        983 => '8'
        984 => 'Wed,_01_Feb_'
        985 => 'RegularThanos'
        986 => 'TEST'
        987 => 'NA'
        988 => 'NA'
        991 => 'NA'
        992 => 'NA'
        993 => 'NA'
        994 => '1'
        996 => 'NA'
        997 => 'NA'
        998 => 'NA'

    The line my (%hash) = $str2 =~ /(\d{3})\s+(.+?)\s*(?=\d{3}|$)/g; is of course the key one!

    First we capture a sequence of 3 digits. Then we throw away any sequence of whitespace after those digits. Then we capture a sequence of any characters. The ? in the (.+?) makes this match "non-greedy". Without that, it would gobble up the entire rest of the line! Now comes the tricky part: how do we tell the (.+?) to stop grabbing stuff? There might or might not be an unwanted space after the value (at the end of the line, there is no extra space), and the \s* takes care of that. The "real work" is done by (?=\d{3}|$). The ?= means that this is a "look ahead" assertion. We stop grabbing stuff when we see that either a sequence of 3 digits or the end of the string is coming up next. Although this expression is in parens (), it neither "captures" nor consumes anything: once the engine is satisfied that the condition is true, it leaves the match position in front of those digits. It's like these trailing 3 digits were never looked at. When the /g (global) modifier kicks in, those 3 digits that caused us to stop wind up getting matched by the first capture group at the beginning of the regex (the 3 consecutive digits). Note that \d{3} by itself does not check what surrounds the digits, so it is also satisfied by 3 digits inside a longer number.
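To illustrate why the lookahead matters under /g, here is a small sketch that compares it with an ordinary non-capturing group. The sample string and the variable names `@with_lookahead` and `@consuming` are hypothetical and chosen only for this illustration.

```perl
use strict;
use warnings;

my $s = '111 a 222 b 333 c';    # hypothetical sample data

# With a lookahead, the next key is only inspected, never consumed,
# so the following /g iteration can match it as a key.
my @with_lookahead = $s =~ /(\d{3})\s+(.+?)\s*(?=\d{3}|$)/g;

# With a plain non-capturing group, the next key becomes part of the
# current match and is no longer available to the next iteration.
my @consuming = $s =~ /(\d{3})\s+(.+?)\s*(?:\d{3}|$)/g;

print "lookahead: @with_lookahead\n";
print "consuming: @consuming\n";
```

The first list keeps all three pairs, while the second one loses the pair that starts with 222, because 222 was swallowed as the terminator of the first match.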

    Anyway, it is possible to "look ahead" to see what would happen and use that as a basis to stop capturing the previous "match almost anything" match.
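One limitation is visible in the posted listing: entries such as 908 => '3' and 629 => '909' appear because \d{3} in the lookahead is also satisfied by three digits inside a longer value. A possible stricter variant, offered only as a sketch under stated assumptions and not as the author's method, requires whitespace (or the string boundary) around every key. The shortened sample string and the names `%loose` and `%tight` are hypothetical.

```perl
use strict;
use warnings;

# Shortened sample in the same layout as the thread's data (hypothetical).
my $str = '907 10 908 3629 909 85290200429/TYPE=a@b.com '
        . '919 Wed,_01_Feb_2017_19:56:23_GMT 927 100 '
        . '621 NA This is a test line 639 last line to capture';

# Regex from the post: stops at ANY three digits, even inside a value.
my %loose = $str =~ /(\d{3})\s+(.+?)\s*(?=\d{3}|$)/g;

# Stricter sketch: a key must not be preceded by a non-space character,
# and a value ends only before whitespace + 3 digits + whitespace,
# or at the end of the string.
my %tight = $str =~ /(?<!\S)(\d{3})\s+(.+?)(?=\s+\d{3}\s|\s*$)/g;

for my $key ( sort { $a <=> $b } keys %tight ) {
    print "$key => '$tight{$key}'\n";
}
```

This variant still cannot tell a key from a standalone 3-digit number in the middle of a multi-word value; that ambiguity is in the data format itself.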
