PerlMonks  

How to split line with varying number of tokens?

by zBernie (Novice)
on Apr 28, 2013 at 04:04 UTC ( #1031028=perlquestion )
zBernie has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to split a space-separated log file, and found that occasionally the third column (FROM) consists of multiple tokens, e.g., Marco's Pizza. Sometimes the FROM column has 2 or 3 tokens, so the split command below does not always work. Is there a way to handle this split with the varying number of tokens in the FROM column?

my ($reqid, $dest, $from, $date, $time, $pages, $rcv) = split(/\s+/, $_);

REQID DEST FROM DATE TIME nPages RCV
138454 mail_room Marco's Pizza 12/26 21:52 1 rcv
138446 custsvc 973 618 0577 12/26 18:44 1 rcv
138445 county2 spam 12/26 18:41 3 rcv
138444 custsvc spam 12/26 18:30 1 rcv
138439 county2 7182737253 12/26 17:54 2 rcv
138438 county2 Acme Products, Inc. 12/26 17:52 1 rcv

Re: How to split line with varying number of tokens?
by Athanasius (Monsignor) on Apr 28, 2013 at 04:38 UTC

    If you can be sure that the FROM field never contains a string resembling the following DATE field, you can take this approach:

    #! perl
    use strict;
    use warnings;

    <DATA>;    # Discard header

    while (<DATA>) {
        chomp;
        my @tokens = split /\s+/;
        my @fields;

        for (my $i = 0; $i < @tokens; ++$i) {
            if ($i == 2) {
                # Accumulate FROM tokens until the next token looks like a date
                my $from = $tokens[$i];
                until ($tokens[++$i] =~ m! ^ \d{1,2} / \d{1,2} $ !x) {
                    $from .= ' ' . $tokens[$i];
                }
                push @fields, $from;
            }
            push @fields, $tokens[$i];
        }

        print join('|', @fields), "\n";
    }

    __DATA__
    REQID DEST FROM DATE TIME nPages RCV
    138454 mail_room Marco's Pizza 12/26 21:52 1 rcv
    138446 custsvc 973 618 0577 12/26 18:44 1 rcv
    138445 county2 spam 12/26 18:41 3 rcv
    138444 custsvc spam 12/26 18:30 1 rcv
    138439 county2 7182737253 12/26 17:54 2 rcv
    138438 county2 Acme Products, Inc. 12/26 17:52 1 rcv

    Output:

    14:30 >perl 614_SoPW.pl
    138454|mail_room|Marco's Pizza|12/26|21:52|1|rcv
    138446|custsvc|973 618 0577|12/26|18:44|1|rcv
    138445|county2|spam|12/26|18:41|3|rcv
    138444|custsvc|spam|12/26|18:30|1|rcv
    138439|county2|7182737253|12/26|17:54|2|rcv
    138438|county2|Acme Products, Inc.|12/26|17:52|1|rcv

    14:30 >

    Update: If you know that only the third field can contain spaces, a better approach may be as follows:

    1. shift @fields twice to get the first two fields
    2. pop   @fields four times to get the last four fields
    3. join(' ', @fields) to get the third, remaining field.
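
    The three steps above might be sketched roughly as follows. This is only an illustration, not code from the original post: the helper name split_record is hypothetical, and it assumes that only the FROM field can contain whitespace and that every valid record has at least seven tokens.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split one record into seven
# fields, assuming only FROM may contain whitespace.
sub split_record {
    my ($line) = @_;
    my @tokens = split ' ', $line;   # awk-style split: skips leading whitespace
    return if @tokens < 7;           # too few tokens: treat as malformed

    # Step 1: the first two fields come off the front.
    my $reqid = shift @tokens;
    my $dest  = shift @tokens;

    # Step 2: the last four fields come off the end, in reverse order.
    my $rcv   = pop @tokens;
    my $pages = pop @tokens;
    my $time  = pop @tokens;
    my $date  = pop @tokens;

    # Step 3: whatever remains is the FROM field.
    my $from = join ' ', @tokens;

    return ($reqid, $dest, $from, $date, $time, $pages, $rcv);
}

print join('|', split_record("138446 custsvc 973 618 0577 12/26 18:44 1 rcv")), "\n";
# prints: 138446|custsvc|973 618 0577|12/26|18:44|1|rcv
```

    Note that re-joining with a single space normalises any runs of whitespace inside FROM, and that a line with fewer than seven tokens is skipped rather than guessed at.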

    Update 29th April: Tidied the code.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: How to split line with varying number of tokens?
by jwkrahn (Monsignor) on Apr 28, 2013 at 06:04 UTC
    $ echo "REQID DEST FROM DATE TIME nPages RCV
    138454 mail_room Marco's Pizza 12/26 21:52 1 rcv
    138446 custsvc 973 618 0577 12/26 18:44 1 rcv
    138445 county2 spam 12/26 18:41 3 rcv
    138444 custsvc spam 12/26 18:30 1 rcv
    138439 county2 7182737253 12/26 17:54 2 rcv
    138438 county2 Acme Products, Inc. 12/26 17:52 1 rcv" | perl -e'
    while ( <> ) {
        # Reverse the line so the four trailing fields can be split off first
        my $line = reverse;
        my ( $rcv, $pages, $time, $date, $rest ) = map scalar reverse, split " ", $line, 5;
        my ( $reqid, $dest, $from ) = split " ", $rest, 3;
        print join( " ", map qq/"$_"/, $reqid, $dest, $from, $date, $time, $pages, $rcv ), "\n";
    }
    '
    "REQID" "DEST" "FROM" "DATE" "TIME" "nPages" "RCV"
    "138454" "mail_room" "Marco's Pizza" "12/26" "21:52" "1" "rcv"
    "138446" "custsvc" "973 618 0577" "12/26" "18:44" "1" "rcv"
    "138445" "county2" "spam" "12/26" "18:41" "3" "rcv"
    "138444" "custsvc" "spam" "12/26" "18:30" "1" "rcv"
    "138439" "county2" "7182737253" "12/26" "17:54" "2" "rcv"
    "138438" "county2" "Acme Products, Inc." "12/26" "17:52" "1" "rcv"
Re: How to split line with varying number of tokens?
by hdb (Prior) on Apr 28, 2013 at 06:11 UTC

    As you know what the first 2 fields are and what the last 4 fields are, everything in between must be the name. So you could re-join the fields in the middle, possibly distorting the white space.

    use strict;
    use warnings;

    <DATA>;
    while(<DATA>){
        chomp;
        my @line = split /\s+/;
        # Everything between the first 2 and the last 4 tokens is FROM
        my $from = join( " ", splice( @line, 2, $#line-5) );
        my ($reqid, $dest, $date, $time, $pages, $rcv) = @line;
        print join "|", ($reqid, $dest, $from, $date, $time, $pages, $rcv);
        print "\n";
    }

    __DATA__
    REQID DEST FROM DATE TIME nPages RCV
    138454 mail_room Marco's Pizza 12/26 21:52 1 rcv
    138446 custsvc 973 618 0577 12/26 18:44 1 rcv
    138445 county2 spam 12/26 18:41 3 rcv
    138444 custsvc spam 12/26 18:30 1 rcv
    138439 county2 7182737253 12/26 17:54 2 rcv
    138438 county2 Acme Products, Inc. 12/26 17:52 1 rcv

      ... re-join the fields in the middle, possibly distorting the white space.

      I, too, wondered whether embedded whitespace in the FROM field is significant and whether the data is fixed-width; zBernie says nothing about either in the OP or, so far, elsewhere in this thread. If embedded whitespace in the FROM field matters, it is simple enough to preserve it with split: capture the separator substrings as well and re-assemble everything, a minor modification to your existing split approach. (Even so, I think I prefer a regex-based extraction approach like that of davido, which lends itself better to data validation.)

      >perl -wMstrict -le
      "my @data = (
         'REQID DEST FROM DATE TIME nPages RCV',
         '138454 mail_room Marco`s Pizza 12/26 21:52 1 rcv',
         '138446 custsvc 973 618 0577 12/26 18:44 1 rcv',
         '138445 county2 spam 12/26 18:41 3 rcv',
         '138444 custsvc spam 12/26 18:30 1 rcv',
         '138439 county2 7182737253 12/26 17:54 2 rcv',
         '138438 county2 Acme Products, Inc. 12/26 17:52 1 rcv',
         );
       ;;
       for my $record (@data) {
         my @fields = split /(\s+)/, $record;
         my $from = join '', splice @fields, 4, $#fields - 11;
         my ($reqid, $dest, $date, $time, $pages, $rcv) =
             @fields[ 0, 2, map { $#fields - $_ } 6, 4, 2, 0 ];
         printf qq{'%s' \n}, join '|',
             $reqid, $dest, $from, $date, $time, $pages, $rcv;
         }
      "
      'REQID|DEST|FROM|DATE|TIME|nPages|RCV'
      '138454|mail_room|Marco`s Pizza|12/26|21:52|1|rcv'
      '138446|custsvc|973 618 0577|12/26|18:44|1|rcv'
      '138445|county2|spam|12/26|18:41|3|rcv'
      '138444|custsvc|spam|12/26|18:30|1|rcv'
      '138439|county2|7182737253|12/26|17:54|2|rcv'
      '138438|county2|Acme Products, Inc.|12/26|17:52|1|rcv'

Re: How to split line with varying number of tokens?
by davido (Archbishop) on Apr 28, 2013 at 06:18 UTC

    If the FROM field is the only real wild-card then be specific about what you do know, and relaxed about what you don't. By anchoring with specifics to the left and the right of the FROM field, you can relax your specification of that one field and still build a relatively robust regular expression:

    while( my $line = <DATA> ) {
        print $line;
        chomp $line;
        my( $reqid, $dest, $from, $date, $time, $npages, $rcv ) = $line =~ m[
            ^                       # Beginning of input line.
            (\d+)\s+                # REQID
            (\w+)\s+                # DEST
            (\S.*?\S)\s+            # FROM (Accept non-space, anything [non-
                                    #   greedily], non-space)
            (\d{1,2}/\d{1,2})\s+    # DATE
            (\d{1,2}:\d{1,2})\s+    # TIME
            (\d+)\s+                # nPages
            (\w+)\s*                # RCV
            $                       # End of input line.
        ]x;
        print "REQID: [$reqid]\tDEST: [$dest]\tFROM: [$from]\n";
        print "DATE: [$date]\tTIME: [$time]\n";
        print "nPages: [$npages]\tRCV: [$rcv]\n\n";
    }

    (I'm assuming that the fact your columns are not vertically aligned is not a typo; i.e., that the fields aren't fixed length. If they are fixed length, this solution would be silly.)


    Dave

Re: How to split line with varying number of tokens?
by hdb (Prior) on Apr 28, 2013 at 07:07 UTC

    Another alternative relies on the fact that your data is nicely vertically aligned, even if not perfectly. So you could specify which character columns belong to which field. This is something that Excel also offers when importing such data.

    use strict;
    use warnings;

    my %format = (
        #field      from  to
        reqid  => [  0,  7],
        dest   => [  8, 19],
        from   => [ 20, 41],
        date   => [ 42, 48],
        time   => [ 49, 55],
        npages => [ 56, 59],
        rcv    => [ 60, 70],
    );

    <DATA>;
    while(<DATA>){
        chomp;
        my %line;
        for my $item (keys %format) {
            $line{$item} = substr $_, $format{$item}->[0], $format{$item}->[1]-$format{$item}->[0]+1;
            $line{$item} =~ s/^\s*//; # remove leading spaces
            $line{$item} =~ s/\s*$//; # remove trailing spaces
            print "$item=$line{$item}, ";
        }
        print "\n";
    }

    __DATA__
    REQID   DEST          FROM                  DATE  TIME     nPages RCV
    138454  mail_room     Marco's Pizza         12/26 21:52    1  rcv
    138446  custsvc       973 618 0577          12/26 18:44    1  rcv
    138445  county2       spam                  12/26 18:41    3  rcv
    138444  custsvc       spam                  12/26 18:30    1  rcv
    138439  county2       7182737253            12/26 17:54    2  rcv
    138438  county2       Acme Products, Inc.   12/26 17:52    1  rcv

Re: How to split line with varying number of tokens?
by kcott (Abbot) on Apr 28, 2013 at 07:22 UTC

    G'day zBernie,

    Is the original data in a fixed format? If so, you can use unpack:

    #!/usr/bin/env perl

    use 5.010;
    use strict;
    use warnings;

    while (<DATA>) {
        # Field widths: 8, 14, 22, 6, 9, 3, then the rest of the line
        say '[', join(']~[' => map { s/\s*$//; $_ } unpack 'A8A14A22A6A9A3A*'), ']';
    }

    __DATA__
    138454  mail_room     Marco's Pizza         12/26 21:52    1  rcv
    138446  custsvc       973 618 0577          12/26 18:44    1  rcv
    138445  county2       spam                  12/26 18:41    3  rcv
    138444  custsvc       spam                  12/26 18:30    1  rcv
    138439  county2       7182737253            12/26 17:54    2  rcv
    138438  county2       Acme Products, Inc.   12/26 17:52    1  rcv

    Output:

    $ pm_split_space_sep_log.pl
    [138454]~[mail_room]~[Marco's Pizza]~[12/26]~[21:52]~[1]~[rcv]
    [138446]~[custsvc]~[973 618 0577]~[12/26]~[18:44]~[1]~[rcv]
    [138445]~[county2]~[spam]~[12/26]~[18:41]~[3]~[rcv]
    [138444]~[custsvc]~[spam]~[12/26]~[18:30]~[1]~[rcv]
    [138439]~[county2]~[7182737253]~[12/26]~[17:54]~[2]~[rcv]
    [138438]~[county2]~[Acme Products, Inc.]~[12/26]~[17:52]~[1]~[rcv]

    -- Ken

Re: How to split line with varying number of tokens?
by igelkott (Curate) on Apr 28, 2013 at 17:31 UTC

    Considering that you may have altered your data a bit (redacted) for this post, it looks like you may really have tab-separated values. If so, change to split(/\t/, $_);

      I wish it were tab separated!
Re: How to split line with varying number of tokens?
by jakeease (Friar) on Apr 29, 2013 at 07:36 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    <DATA>;    # Discard header
    while (<DATA>) {
        chomp;
        my ($reqid, $dest, $from, $datetime, $pages, $rcv) = split(/\s\s+/, $_);
        my ($date, $time) = split(/\s+/, $datetime);
        print join('|', ($reqid, $dest, $from, $date, $time, $pages, $rcv)), "\n";
    }

    __DATA__
    REQID   DEST          FROM                  DATE  TIME     nPages RCV
    138454  mail_room     Marco's Pizza         12/26 21:52    1  rcv
    138446  custsvc       973 618 0577          12/26 18:44    1  rcv
    138445  county2       spam                  12/26 18:41    3  rcv
    138444  custsvc       spam                  12/26 18:30    1  rcv
    138439  county2       7182737253            12/26 17:54    2  rcv
    138438  county2       Acme Products, Inc.   12/26 17:52    1  rcv

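    I.e., split on two or more whitespace characters instead of one or more; the date and time, which are separated by a single space, come back as one field and are split apart afterwards. Output: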

    138454|mail_room|Marco's Pizza|12/26|21:52|1|rcv
    138446|custsvc|973 618 0577|12/26|18:44|1|rcv
    138445|county2|spam|12/26|18:41|3|rcv
    138444|custsvc|spam|12/26|18:30|1|rcv
    138439|county2|7182737253|12/26|17:54|2|rcv
    138438|county2|Acme Products, Inc.|12/26|17:52|1|rcv
