PerlMonks  

Re: Extract data from CSV field.

by ww (Archbishop)
on Dec 08, 2015 at 22:32 UTC ( [id://1149731]=note )


in reply to Extract data from CSV field.

I'm not sure this is responsive, but I'm ignoring every field except the one you gave examples for, on the surmise that it's your problem area:

#!/usr/bin/perl -w
use 5.018;
use strict;

#1149716

=head

I would like to extract a piece of data from one field that has multiple
fields in it. The original field is a long description that usually
contains a #F123456, #123456, #123-F123456, #123-123456, or #12AB-123456
in it. This data floats around from left to right and there should be
whitespace before the #. Also, the end of the data is either whitespace,
or the end of the field.

=cut

my @data = ("TRAY HINGED PLSTC 20 CAV #F32473",
            "BOX HSC,35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX",
            "PAD, FOAM, 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL, ORDER IN FULL ROLLS",
            "PKG LIST,ASST ARM,RAD,300 #F37784",
            "PAD, TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X 4-3/4",
            "foo bar #379460 best F11",
            "F1234 SIMULATION",
           );

for my $data (@data) {
    # say "\t|$data|\n\n";
    chomp $data;
    if ( $data =~ /\n/ ) {
        $data =~ s/\n//g;
    }
    if ( $data =~ /(^.* #[A-Z]*\d+.*$)/m ) {
        say "\n\$data matches regex\n";
        $data =~ s/ +/ /g;    # clean up excess spaces
        say "$data \n";
    } else {
        say "\n\t The data, $data, does NOT MATCH\n";
    }
}

The regular expression may be obscure; here's an explanation:

C:\>perl -MYAPE::Regex::Explain -e " print YAPE::Regex::Explain->new(qr/(^.* #[A-Z]*\d+.*$)/)->explain();"

The regular expression:

(?-imsx:(^.* #[A-Z]*\d+.*$))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
                           # NB: I did NOT need the parens as there's no use of the capture
                           # My bad, but harmless except for shoving bits & bytes around
                           # when they didn't need to be disturbed.
----------------------------------------------------------------------
    ^                        the beginning of the string
----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
     #                       ' #'
----------------------------------------------------------------------
    [A-Z]*                   any character of: 'A' to 'Z' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
    $                        before an optional \n, and the end of the
                             string
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

And the output is thus:

C:\>1149716.pl

$data matches regex

TRAY HINGED PLSTC 20 CAV #F32473

         The data, BOX HSC,35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX, does NOT MATCH

$data matches regex

PAD, FOAM, 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL, ORDER IN FULL ROLLS

$data matches regex

PKG LIST,ASST ARM,RAD,300 #F37784

$data matches regex

PAD, TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X 4-3/4

$data matches regex

foo bar #379460 best F11

         The data, F1234 SIMULATION, does NOT MATCH
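
The test above only reports whether a field matches; it doesn't pull the number out. If extraction is the goal, a capture around just the token may do it. What follows is a minimal sketch, not tested against real CSV: the sub name extract_part_number is made up for illustration, and the pattern assumes the only shapes are the five listed in the question (#F123456, #123456, #123-F123456, #123-123456, #12AB-123456), with whitespace or the field boundary on either side. Returning the token without its leading '#' is also an assumption.

```perl
#!/usr/bin/perl
use 5.018;
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the token that
# follows '#', without the '#', or undef when the description has none.
sub extract_part_number {
    my ($description) = @_;
    return $description =~ m{
        (?<!\S) \#                # '#' at start of field or after whitespace
        (                         # capture the token itself
            (?: [0-9A-Z]+ - )?    # optional prefix such as 123- or 12AB-
            [A-Z]? \d+            # optional single letter, then digits
        )
        (?!\S)                    # followed by whitespace or end of field
    }x ? $1 : undef;
}

for my $description (
    "TRAY HINGED PLSTC 20 CAV #F32473",
    "PAD, FOAM, 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL",
    "F1234 SIMULATION",
) {
    my $part = extract_part_number($description);
    say defined $part ? "found $part" : "no part number in: $description";
}
```

The lookarounds (?<!\S) and (?!\S) stand in for "whitespace or the edge of the field", so a '#' glued to other text is skipped. If your real data has other shapes, the middle of the pattern is the part to adjust.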


HTH. You'll sometimes get better answers if you trim your code to the few (<20) lines that demonstrate only the problem you want to address. I see you also want advice on the code you supplied, but I don't have time to create the jumbled CSV I'd need to assess its efficiency and/or clarity.
