http://www.perlmonks.org?node_id=632864

princepawn has asked for the wisdom of the Perl Monks concerning the following question:

Hello parsing fans, let's start with some sample data:
Mon Oct 1 17:09:23 2001 0 127.0.0.1 2611 1774034 a _ o r tmbranno ftp 0 * c
Mon Oct 1 17:09:27 2001 0 127.0.0.1 22 1774034 a _ o r tmbranno ftp 0 * c
Mon Oct 1 17:09:27 2001 0 127.0.0.1 22 file with spaces in it.zip a _ o r tmbranno ftp 0 * c
Mon Oct 1 17:09:31 2001 0 127.0.0.1 7276 p1774034_11i_zhs.zip a _ o r tmbranno ftp 0 * c
Now, if it were not for the 3rd line, I could simply split on whitespace to get each field:
our @field = qw(day_name month day current_time year transfer_time
                remote_host file_size filename transfer_type
                special_action_flag direction access_mode username
                service_name authentication_method
                authenticated_user_id completion_status);
my %field;
@field{@field} = split /\s+/, $line;
Then we have our data in a hash and can access fields by name instead of position. This is how my module Net::FTPServer::XferLog has worked fine for years, but I just learned of a poor guy getting filenames with spaces in them. So, my approach to this problem is to split like normal, but shift and pop off data with care from either side of the filename field, and then join whatever is left with a single space to make the filename field:
sub parse_line {
    my $self = shift;
    my $line = shift or die "must supply xferlog line";

    my @field = qw(day_name month day current_time year transfer_time
                   remote_host file_size filename transfer_type
                   special_action_flag direction access_mode username
                   service_name authentication_method
                   authenticated_user_id completion_status);
    my %field;

    my @tmp = split /\s+/, $line;

    if (scalar @tmp == scalar @field) {
        # no spaces in the filename: one token per field
        @field{@field} = @tmp;
    } else {
        # take the leading fields from the front, up to the filename
        for (@field) {
            last if $_ eq 'filename';
            $field{$_} = shift @tmp;
        }

        # take the trailing fields from the back, down to the filename
        @field = reverse @field;
        @tmp   = reverse @tmp;
        for (@field) {
            last if $_ eq 'filename';
            $field{$_} = shift @tmp;
        }

        # whatever is left is the filename
        @tmp = reverse @tmp;
        $field{filename} = "@tmp";
    }

    # map { print "$_ => $field{$_} \n" } @field;
    # print "-------------------";

    \%field;
}
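
The same shift-and-pop idea can probably be expressed with a single splice. This is only a minimal sketch, assuming (as in the field list above) that exactly 8 fields precede the filename and 9 follow it; the helper name parse_xferlog_line is hypothetical and not part of Net::FTPServer::XferLog.

```perl
use strict;
use warnings;

# Field names from the post; 'filename' is the only field that may contain spaces.
my @names = qw(day_name month day current_time year transfer_time
               remote_host file_size filename transfer_type
               special_action_flag direction access_mode username
               service_name authentication_method
               authenticated_user_id completion_status);

# Hypothetical helper, for illustration only.
sub parse_xferlog_line {
    my ($line) = @_;
    my @tok = split ' ', $line;
    return if @tok < @names;    # too few tokens to be a complete line

    # Every token between the 8 leading and 9 trailing fields belongs
    # to the filename, so cut them out and put back one joined token.
    my $extra    = @tok - @names;
    my $filename = join ' ', splice @tok, 8, $extra + 1;
    splice @tok, 8, 0, $filename;

    my %field;
    @field{@names} = @tok;
    return \%field;
}

my $rec = parse_xferlog_line(
    'Mon Oct 1 17:09:27 2001 0 127.0.0.1 22 file with spaces in it.zip a _ o r tmbranno ftp 0 * c'
);
print "$rec->{filename}\n";    # file with spaces in it.zip
```

Like the original, this rebuilds the filename with single spaces, so a run of several spaces inside a filename would not survive.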

But that is not very 'phisticated and I just KNOW some 1337 h4x0R out there is dying to flex his text parsing skIllZ and make the crowd go ooh and ahhh, so show me whatcha got!


Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality

Replies are listed 'Best First'.
Re: parsing a space-separated filename in a line with fields separated by spaces
by BrowserUk (Patriarch) on Aug 15, 2007 at 22:07 UTC

    7HiS sEeem5 2 dO 743 tRicK.

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    while( <DATA> ) {
        my %fields;
        my @bits = m[
            ^
            (\S+)\s+ (\S+)\s+ (\S+)\s+ (\S+)\s+
            (\S+)\s+ (\S+)\s+ (\S+)\s+ (\S+)\s+
            ( .+ ) \s+
            (\S+)\s+ (\S+)\s+ (\S+)\s+ (\S+)\s+
            (\S+)\s+ (\S+)\s+ (\S+)\s+ (\S+)\s+
            (\S+)
            $
        ]x;
        @fields{ qw(
            day_name month day current_time year transfer_time
            remote_host file_size filename transfer_type
            special_action_flag direction access_mode username
            service_name authentication_method
            authenticated_user_id completion_status
        ) } = @bits;
        print pp \%fields;
    }
    __DATA__
    Mon Oct 1 17:09:23 2001 0 127.0.0.1 2611 1774034 a _ o r tmbranno ftp 0 * c
    Mon Oct 1 17:09:27 2001 0 127.0.0.1 22 1774034 a _ o r tmbranno ftp 0 * c
    Mon Oct 1 17:09:27 2001 0 127.0.0.1 22 file with spaces in it.zip a _ o r tmbranno ftp 0 * c
    Mon Oct 1 17:09:31 2001 0 127.0.0.1 7276 p1774034_11i_zhs.zip a _ o r tmbranno ftp 0 * c

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: parsing a space-separated filename in a line with fields separated by spaces
by johngg (Canon) on Aug 15, 2007 at 22:32 UTC
    You can work your way in from both ends using the three-argument form of split, like this. I have just populated an AoA here, but once you have done that you can do what you like with it.

    use strict;
    use warnings;

    use Data::Dumper;

    my @linesData;
    while ( <DATA> )
    {
        chomp;
        my @flds = split m{\s+}, $_, 9;
        my $rest = pop @flds;
        push @flds,
            reverse
            map { $_ = reverse }
            split m{\s+}, reverse($rest), 10;
        push @linesData, \@flds;
    }

    print Data::Dumper->Dumpxs([\@linesData], [qw{*linesData}]);

    __END__
    Mon Oct 1 17:09:23 2001 0 127.0.0.1 2611 1774034 a _ o r tmbranno ftp 0 * c
    Mon Oct 1 17:09:27 2001 0 127.0.0.1 22 1774034 a _ o r tmbranno ftp 0 * c
    Mon Oct 1 17:09:27 2001 0 127.0.0.1 22 file with spaces in it.zip a _ o r tmbranno ftp 0 * c
    Mon Oct 1 17:09:31 2001 0 127.0.0.1 7276 p1774034_11i_zhs.zip a _ o r tmbranno ftp 0 * c

    This produces

    I hope this is of use.

    Cheers,

    JohnGG
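
To make the both-ends split easier to follow, here is a small sketch that applies the same steps to just the line with spaces in the filename. It assumes the layout used throughout the thread (8 fields before the filename, 9 after) and uses `scalar reverse $_` instead of assigning back to `$_`, which should behave the same here.

```perl
use strict;
use warnings;

my $line = 'Mon Oct 1 17:09:27 2001 0 127.0.0.1 22 file with spaces in it.zip a _ o r tmbranno ftp 0 * c';

# Peel 8 fields off the front; the 9th element holds everything else.
my @flds = split m{\s+}, $line, 9;
my $rest = pop @flds;

# Reverse the remainder so the 9 trailing fields come first, split with
# a limit of 10, then undo the reversal of each string and of the list.
push @flds,
    reverse
    map { scalar reverse $_ }
    split m{\s+}, reverse($rest), 10;

print scalar(@flds), " fields, filename: '$flds[8]'\n";
# 18 fields, filename: 'file with spaces in it.zip'
```

Because the limited split leaves the last chunk untouched, spacing inside the filename is kept exactly as it appears in the log line.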

Re: parsing a space-separated filename in a line with fields separated by spaces
by FunkyMonk (Chancellor) on Aug 15, 2007 at 22:11 UTC
    My take...

    while ( <DATA> ) {
        chomp;
        my @fields1 = split ' ', $_, 9;
        my @fields2 = split / /, pop @fields1;
        if ( @fields2 > 10 ) {
            my @filename = splice @fields2, 0, @fields2 - 9;
            unshift @fields2, join ' ', @filename;
        }
        push @fields1, @fields2;
        printf "%3s %3s %2d %8s %4s %s %-14s %4d %-26s %s %s %s %s %-10s %3s %s %s %s\n",
            @fields1;
    }

    __DATA__
    Mon Oct 1 17:09:27 2001 0 127.0.0.1 22 file with spaces in it.zip a _ o r tmbranno ftp 0 * c
    Mon Oct 1 17:09:23 2001 0 127.0.0.1 2611 1774034 a _ o r tmbranno ftp 0 * c
    Mon Oct 1 17:09:27 2001 0 127.0.0.1 22 1774034 a _ o r tmbranno ftp 0 * c
    Mon Oct 1 17:09:31 2001 0 127.0.0.1 7276 p1774034_11i_zhs.zip a _ o r tmbranno ftp 0 * c

    Output:

    Mon Oct  1 17:09:27 2001 0 127.0.0.1        22 file with spaces in it.zip a _ o r tmbranno   ftp 0 * c
    Mon Oct  1 17:09:23 2001 0 127.0.0.1      2611 1774034                    a _ o r tmbranno   ftp 0 * c
    Mon Oct  1 17:09:27 2001 0 127.0.0.1        22 1774034                    a _ o r tmbranno   ftp 0 * c
    Mon Oct  1 17:09:31 2001 0 127.0.0.1      7276 p1774034_11i_zhs.zip       a _ o r tmbranno   ftp 0 * c

Re: parsing a space-separated filename in a line with fields separated by spaces
by mamawe (Sexton) on Aug 15, 2007 at 22:28 UTC
    Would you mind using a regex?
    while (<>) {
        if (/^(\w{3} \w{3} [ :\d]{16}) (\d+) ([.\d]+) (\d+) (.+) ([a]) ([_]) ([o]) ([r]) (\w+) (\w+) (\d) (\S) ([c])$/) {
            print "$1 $2 $3 $4 '$5' $6 $7 $8 $9 $10 $11 $12 $13 $14\n";
        }
    }
    This puts nice single quotes around the filename, and you can access every field as well. You might even assign the captures to a list of variables.
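
As a rough illustration of assigning the captures to a list of variables, the sketch below runs essentially the pattern from this reply in list context. The variable names are made up for the example, and the sample line assumes the day of month is space-padded to two characters, which is what the `{16}` in the pattern implies.

```perl
use strict;
use warnings;

# Essentially the pattern from the reply; capture 5 is the filename.
my $re = qr/^(\w{3} \w{3} [ :\d]{16}) (\d+) ([.\d]+) (\d+) (.+) ([a]) ([_]) ([o]) ([r]) (\w+) (\w+) (\d) (\S) ([c])$/;

# Assumed sample line, with the day padded ("Oct  1") as in a ctime-style date.
my $line = 'Mon Oct  1 17:09:27 2001 0 127.0.0.1 22 file with spaces in it.zip a _ o r tmbranno ftp 0 * c';

# In list context a successful match returns all captures, so they can
# be assigned straight to named variables; an empty list means no match.
my ( $when, $xfer_time, $host, $size, $filename, @rest ) = $line =~ $re
    or die "line did not match\n";

print "$host sent '$filename' ($size bytes)\n";
```

The hard-coded `[a]`, `[o]`, `[r]` and `[c]` classes mean that lines with other flag values will not match, so the `or die` branch is worth keeping in some form.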
Re: parsing a space-separated filename in a line with fields separated by spaces
by jwkrahn (Abbot) on Aug 15, 2007 at 22:30 UTC
    Assuming that file names can also have leading and/or trailing spaces in them, for example ' file name ' then you may want something like this:
    while ( <$in> ) {
        chomp;
        my %field;

        # remove and capture leading fields
        s/^ *(\S+) (\S+) +(\d+) ([\d:]+) (\d+) (\d+) ([\d.]+) (\d+) //
            and @field{ qw/ day_name month day current_time year transfer_time remote_host file_size / }
                = ( $1, $2, $3, $4, $5, $6, $7, $8 );

        # remove and capture trailing fields
        s/ (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)$//
            and @field{ qw/ transfer_type special_action_flag direction access_mode username service_name authentication_method authenticated_user_id completion_status / }
                = ( $1, $2, $3, $4, $5, $6, $7, $8, $9 );

        # only thing left is file name
        $field{ filename } = $_;

        print "$_ = '$field{$_}'\n" for keys %field;
        print "\n";
    }