http://www.perlmonks.org?node_id=498985

Motomo94 has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract Date, Time, Recipient-Address, Sender-Address and the Subject. So if I search for Auser I want any line (record with his email) to take all the information listed above into a new file. These would be in 3 files (which are generated from MS Exchange 2003) they are Tab delimited. I am new to Perl and need to get this information quickly. Any help would be greatly appreated. Below is a sample of the data in the file that I need searched. These files are Huge in production I chopped them DOWN. Again Thanks Leo
# Exchange System Attendant Version 6.5.7226.0 # Date Time client-ip Client-hostname Partner-Name Serv +er-hostname server-IP Recipient-Address Event-ID MSGID + Priority Recipient-Report-Status total-bytes Number-Recipie +nts Origination-Time Encryption service-Version Linked-MS +GID Message-Subject Sender-Address 2005-9-10 0:0:16 GMT - - - storming - Someoneg@ao +l.com 1027 2433A69xxxxxxxxxxxxxxxx795006DB02DADF78@storming.Dom +ain.name1 0 0 11927 1 2005-9-10 0:0:16 GMT 0 - + c=US;a= ;p=AMSCAN;l=storming-050910000016Z-212788 Fw: Hey Ugly l +ine expansion and re-offer EX:/O=org/OU=Site/CN=RECIPIENTS/CN=Ause +r - 2005-9-10 0:0:16 GMT - - - storming - c1r3ai4g@ao +l.com 1019 2433A690xxxxxxxxxxxxxxxx5006DB02DADF78@storming.Doma +in.name1 0 0 11927 1 2005-9-10 0:0:16 GMT 0 - + - Fw: Hey Ugly line expansion and re-offer - - 2005-9-10 0:0:16 GMT - - - storming - c1r3ai4g@ao +l.com 1025 2433A6xxxxxxxxxxxxxxxx95006DB02DADF78@storming.Domai +n.name1 0 0 11927 1 2005-9-10 0:0:16 GMT 0 - +- Fw: Hey Ugly line expansion and re-offer - - 2005-9-10 0:0:16 GMT - - - storming - c1r3ai4g@ao +l.com 1024 2433A690Fxxxxxxxxxxxxxxxx6795006DB02DADF78@storming. +Domain.name1 0 0 11927 1 2005-9-10 0:0:16 GMT 0 +- - Fw: Hey Ugly line expansion and re-offer - - 2005-9-10 0:0:17 GMT - - - storming - c1r3ai4g@ao +l.com 1033 2433Axxxxxxxxxxxxxxxx428E5EE4C6795006DB02DADF78@stor +ming.Domain.name1 0 0 11927 1 2005-9-10 0:0:16 GMT +0 - - Fw: Hey Ugly line expansion and re-offer Auser@Doma +in.name - 2005-9-10 0:0:17 GMT - - - storming - c1r3ai4g@ao +l.com 1020 2433A69xxxxxxxxxxxxxxxx95006DB02DADF78@storming.Doma +in.name1 0 0 11927 1 2005-9-10 0:0:16 GMT 0 - + - Fw: Hey Ugly line expansion and re-offer Auser@Domain.name + -

Replies are listed 'Best First'.
Re: Extracting data from each line that matches a email address from a Log file (Tab delimited)
by GrandFather (Saint) on Oct 11, 2005 at 00:38 UTC

    There are a number of ways you could go about this. The first, and least recommended is to use a regex something like this:

    use strict; use warnings; #throw away first two lines <DATA>; <DATA>; while (<DATA>) { chomp; next if ! defined $_ or ! length $_; my ($date, $time, $recip, $subject, $send) = /\s*([^\t]+) #date \s+([^\t]+) #time (?:\t[^\t]+){5}\s+ #Skip 5 fields (\S+) #Recipient (?:\t[^\t]+){10}\s+ #Skip 10 fields ([^\t]*)\t #subject ([^\t]*) #Sender /x; print "$date, $time, $recip, $subject, $send\n"; }

    which generates

    2005-9-10, 0:0:16 GMT, Someoneg@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, EX:/O=org/OU=Site/CN=RECIPIENTS/CN=Auser 2005-9-10, 0:0:16 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, - 2005-9-10, 0:0:16 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, - 2005-9-10, 0:0:16 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, - 2005-9-10, 0:0:17 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, Auser@Domain.name 2005-9-10, 0:0:17 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, Auser@Domain.name

    However using a regex for parsing csv data such as this can be very difficult due to managing quoted data. It is generally better to use something like Text::CSV::Simple.

    Yet another option, if you are comfortable with databases, is to use DBI to wrap your file and manage it as a database.


    Perl is Huffman encoded by design.
Re: Extracting data from each line that matches a email address from a Log file (Tab delimited)
by GrandFather (Saint) on Oct 11, 2005 at 08:15 UTC

    Text::CSV::Simple is a little too simple, without making it complicated, to do the job for large files. Below is version using Text::CSV_XS.

    use strict; use warnings 'all'; use Text::CSV_XS; my $parser = Text::CSV_XS->new ({sep_char => "\t"}); while (<DATA>) { next if ! length $_; if (! $parser->parse ($_)) { warn "Error parsing: $_"; next; } my @columns = $parser->fields(); next if ! defined $columns[19]; print "$columns[0], $columns[1], $columns[7], $columns[18], $columns +[19]\n"; }

    Prints:

    # Date, Time, Recipient-Address, Message-Subject, Sender-Address 2005-9-10, 0:0:16 GMT, Someoneg@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, EX:/O=org/OU=Site/CN=RECIPIENTS/CN=Auser 2005-9-10, 0:0:16 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, - 2005-9-10, 0:0:16 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, - 2005-9-10, 0:0:16 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, - 2005-9-10, 0:0:17 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, Auser@Domain.name 2005-9-10, 0:0:17 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, Auser@Domain.name
    Update: s/CVS/CSV/g

    Perl is Huffman encoded by design.
      Here is how I modified the script to do what I need.
      use strict; use warnings 'all'; use Text::CSV_XS; open STDIN,"c:\\scripts\\20051030.log" or die $!; my $columns; open STDOUT, ">Answers.out" or die "can't redirect stdout"; my $parser = Text::CSV_XS->new ({sep_char => "\t"}); while (<STDIN>) { next if ! length $_; if (! $parser->parse ($_)) { warn "Error parsing: $_"; next; } my @columns = $parser->fields(); next if ! defined $columns[20]; print "$columns[0], $columns[1], $columns[7], $columns[18], $columns [19]\n"; }
      I get this in return... Use of uninitialized value in concatenation (.) or string at C:\scripts\Xsv.pl l ine 26, <STDIN> line 392. I still not know where I put in what I am searching for. and I need to search two fields "Recipient-Address, Sender-Address" If my search word "Auser" (case should not matter)is in either field the give me the 5 columns

        Are you setting the correct seperator character in Text::CSV_XS->new ({sep_char => "\t"});?


        Perl is Huffman encoded by design.

        Add use Data::Dumper; and update your next if ! defined $columns[20]; line to (print Dumper (\@columns)), next if ! defined $columns[20]; to see what is actually in @columns.


        Perl is Huffman encoded by design.

        What does the debugger tell you is undefined?

        Alex / talexb / Toronto

        "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

        I think you want something like this:
        #!/usr/bin/perl -w use strict; use Text::CSV_XS; use IO::File; my $filename = 'hdi.csv'; my $column_to_search = 1; my $wanted_value = 'Sweden'; my $csv = Text::CSV_XS->new({binary=>1}); my $fh = IO::File->new($filename) or die $!; while (my $cols = $csv->getline($fh)) { last unless @$cols; next unless defined $cols->[$column_to_search] and $cols->[$column_to_search] eq $wanted_value; for (0,1,3) { $cols->[$_] = '' unless defined $cols->[$_]; } print join(' ',$cols->[0],$cols->[1],$cols->[3]),"\n"; }
Re: Extracting data from each line that matches a email address from a Log file (Tab delimited)
by GrandFather (Saint) on Oct 11, 2005 at 09:01 UTC

    and to complete the suite - a version using DBI. Note that there is a "cheat" here in that the first line of the sample data has been omitted and, due to the # at the start of the header line the data field's name is '# Date'.

    use strict; use warnings 'all'; use DBI; # Create the sample file open outFile, '>', 'sample.csv'; print outFile
    '; close outFile; # The real sample code my $dbh = DBI->connect(qq{DBI:CSV:csv_sep_char=\\\t}) or die "Cannot connect: " . $DBI::errstr; my $sth = $dbh->{'csv_tables'}->{'info'} = { 'file' => 'sample.csv'}; $sth = $dbh->prepare("SELECT * FROM sample.csv") or die "Cannot prepare: " . $dbh->errstr(); $sth->execute() or die "Cannot execute: " . $sth->errstr(); while (my $row = $sth->fetchrow_hashref) { print("$row->{'Date'}, $row->{'Time'}, $row->{'Recipient-Address'}, +$row->{'Message-Subject'}, $row->{'Sender-Address'}, \n"); } $sth->finish(); $dbh->disconnect();

    Prints:

    2005-9-10, 0:0:16 GMT, Someoneg@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, EX:/O=org/OU=Site/CN=RECIPIENTS/CN=Auser, 2005-9-10, 0:0:16 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, -, 2005-9-10, 0:0:16 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, -, 2005-9-10, 0:0:16 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, -, 2005-9-10, 0:0:17 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, Auser@Domain.name, 2005-9-10, 0:0:17 GMT, c1r3ai4g@aol.com, Fw: Hey Ugly line expansion a +nd re-offer, Auser@Domain.name,

    Perl is Huffman encoded by design.
      my $sth = $dbh->{'csv_tables'}->{'info'} = { 'file' => 'sample.csv'};
      $sth = $dbh->prepare("SELECT * FROM sample.csv")
      The "my $sth =" on the first line is wrong. And the SQL in the second line should be either q{SELECT * FROM "sample.csv"} or q{SELECT * FROM info}. The first uses quotes to delimit the table name (thus allowing the forbidden period) and the second skirts the issue by using the alias to the filename you created with csv_tables.
Re: Extracting data from each line that matches a email address from a Log file (Tab delimited)
by runrig (Abbot) on Oct 11, 2005 at 00:18 UTC
    awk -F"\t" '/Auser/ { print $1, $2, $8, $20 }' file.txt
    see 'perldoc a2p' to perlify this (update: echo "/Auser/ { print $1, $2, $8, $20 }" | a2p -F"<tab-character-here>" to generate a perl script). Or UnxUtils if you just want to use (g)awk on Windows (and use all double quotes above if Windows).

      Wouldn't it have been better to just show the op the Perl version?

      I believe it comes out something like this :-

      while (<>) { my @Fld = split('\t', $_); if (/Auser/) { print $Fld[0], $Fld[1], $Fld[7], $Fld[19]; } }
Perl, rather than AWK
by Anonymous Monk on Oct 11, 2005 at 00:26 UTC
    Along the exact same lines as runrig: perl -lane 'print "@F0, 1, 7, -1, -2" if /Auser/'