PerlMonks  

divide one file into multiple arrays

by tevus_oriley (Novice)
on Jun 25, 2013 at 17:09 UTC (#1040631)

tevus_oriley has asked for the wisdom of the Perl Monks concerning the following question:

I have a log file that I need to divide into 'good' and 'bad' arrays based on whether or not the line contains a website listed in another array. In the snippet below, I am just trying to get the lines that match one of the sites to go in the 'bad' array, but it's returning everything. The $sites file just contains a list of websites, one per line (www.yahoo.com, etc.), and the $log file contains all kinds of info per line, including a website.

    open ($sites_fh, '<', $sites) or die "Can't open '$sites': $!";
    chomp(@sites = <$sites_fh>);
    open ($log_fh, '<', $log) or die "Can't open '$log': $!";
    while (<$log_fh>) {
        foreach my $site (@sites) {
            push @bad_log, $_ if ($_ =~ /$site/);
        }
    }
    print "@bad_log\n"; #currently returns entire $log

Any help is appreciated.

$sites sample:

    www.yahoo.com
    www.google.com
    www.comcast.com

$log sample:

    X456 TV-yes DB-no 123.12.23.45 dealio3 www.google.com-------- FX-yes d53 Y-03
    X123 TV-yes DB-yes 34.154.43.21 dealio1 www.ask.com-------- FX-no d01 Y-03
    X412 TV-no DB-no 192.365.25.23 rayovac2 www.microsoft.com--- FX-yes d13 Y-07

With the samples above, only the first line of $log should end up in @bad_log, and the others would go to @good_log, once that's created.

Replies are listed 'Best First'.
Re: divide one file into multiple arrays
by LanX (Cardinal) on Jun 25, 2013 at 17:33 UTC
    > print "@bad_log\n"; #currently returns entire $log

    The code is fine¹; I suppose your @sites contains an empty string or something similar, which always matches!

    Try pushing the pattern too, to see what matched in the dump of @bad_log.

    push @bad_log, [$_,$site] if ($_ =~ /$site/);

    By the way, you'll get multiple entries per line if different sites match!

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    ¹) in the sense of "it works", not in the sense of "well coded" =)

      That was it! The last line in the list of sites was a blank line. It's always something silly. Thanks, Rolf.
        Here are some suggestions to improve your code:

        use strict;
        use warnings;
        use Data::Dumper qw(Dumper);

        my @sites = qw/
            www.yahoo.com
            www.google.com
            www.comcast.com
        /;

        my @bad_log;

        LINE:
        while (my $line = <DATA>) {
            my $www = (split / /,$line)[5];
            foreach my $site (@sites) {
                next if $site eq "";
                if ( $www =~ /\Q$site/ ) {
                    push @bad_log, $line;
                    next LINE;
                }
            }
        }

        print Dumper \@bad_log;

        __DATA__
        X456 TV-yes DB-no 123.12.23.45 dealio3 www.google.com-------- FX-yes d53 Y-03
        X123 TV-yes DB-yes 34.154.43.21 dealio1 www.ask.com-------- FX-no d01 Y-03
        X412 TV-no DB-no 192.365.25.23 rayovac2 www.microsoft.com--- FX-yes d13 Y-07

        Cheers Rolf

        ( addicted to the Perl Programming Language)
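
The blank-line problem diagnosed in this sub-thread can also be avoided while loading the site list. The following is only a minimal sketch, not code from the thread: clean_sites is a hypothetical helper name, and the @raw list stands in for the lines read from the $sites file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): trim each entry and drop
# empty ones, so that no empty pattern ends up in the site list.
sub clean_sites {
    my @clean;
    for my $entry (@_) {
        my $site = $entry;      # copy, so the caller's data is untouched
        $site =~ s/^\s+//;      # strip leading whitespace
        $site =~ s/\s+$//;      # strip trailing whitespace, including the newline
        push @clean, $site if length $site;
    }
    return @clean;
}

# Stand-in for the contents of the sites file, with a trailing blank line.
my @raw   = ("www.yahoo.com\n", "www.google.com\n", "www.comcast.com\n", "\n");
my @sites = clean_sites(@raw);

print scalar(@sites), " usable sites\n";
```

With the sample data the blank entry is dropped and three sites remain, so the inner loop never sees an empty pattern.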

Re: divide one file into multiple arrays
by ramlight (Friar) on Jun 25, 2013 at 19:57 UTC
    If you have a large number of sites on your bad list (or even lots of sites in your log), you would be better served with a hash. You could populate the hash with the contents of the original array and then use 'exists' for your comparison. So the above code could be written as:

    use strict;
    use warnings;

    my @sites = qw/
        www.yahoo.com
        www.google.com
        www.comcast.com
    /;

    my @bad_log;
    my @good_log;
    my %bad_hash = ();
    foreach my $bad_site (@sites) {
        $bad_hash{$bad_site} = 1;
    }

    while (my $line = <DATA>) {
        my $www = (split / /,$line)[5];
        $www =~ s/---*//;
        if (exists $bad_hash{$www}) {
            push(@bad_log, $line);
        } else {
            push(@good_log, $line);
        }
    }

    print "\nBad lines are:\n";
    foreach (@bad_log) {
        print;
    }
    print "\nGood lines are:\n";
    foreach (@good_log) {
        print;
    }

    __DATA__
    X456 TV-yes DB-no 123.12.23.45 dealio3 www.google.com-------- FX-yes d53 Y-03
    X123 TV-yes DB-yes 34.154.43.21 dealio1 www.ask.com-------- FX-no d01 Y-03
    X412 TV-no DB-no 192.365.25.23 rayovac2 www.microsoft.com--- FX-yes d13 Y-07

    which returns

    Bad lines are:
    X456 TV-yes DB-no 123.12.23.45 dealio3 www.google.com-------- FX-yes d53 Y-03

    Good lines are:
    X123 TV-yes DB-yes 34.154.43.21 dealio1 www.ask.com-------- FX-no d01 Y-03
    X412 TV-no DB-no 192.365.25.23 rayovac2 www.microsoft.com--- FX-yes d13 Y-07

      I'll have to remember to upvote that one when I get a new load of votes :).

      You can make the bad_hash straightaway with map:

      my %isBad= map { $_ => 1 } qw/ www.yahoo.com www.google.com www.comcast.com /;
      Which would be my %isBad = map { $_ => 1 } @sites; for tevus_oriley. And with the hash values being 1, you can just write if ( $isBad{$www} ) instead of if (exists $isBad{$www}).

      Edit: Whoops, posted too fast; chomp returns the number of characters removed, not the chomped list.

      Golf, anyone?

      If you’re OK reading the whole input file into memory, the part function from List::MoreUtils can be used to populate both arrays at once. (This also incorporates Eily’s use of map to populate %bad_hash.)

      #! perl
      use strict;
      use warnings;
      use List::MoreUtils qw( part );

      my %bad_hash = map { $_ => 1 }
                     qw( www.yahoo.com www.google.com www.comcast.com );

      my ($good_log, $bad_log) = part { exists $bad_hash{ (split)[5] =~ s{---*}{}r } } <DATA>;

      print "\nBad lines are:\n";
      print for @$bad_log;
      print "\nGood lines are:\n";
      print for @$good_log;

      __DATA__
      X456 TV-yes DB-no 123.12.23.45 dealio3 www.google.com-------- FX-yes d53 Y-03
      X123 TV-yes DB-yes 34.154.43.21 dealio1 www.ask.com-------- FX-no d01 Y-03
      X412 TV-no DB-no 192.365.25.23 rayovac2 www.microsoft.com--- FX-yes d13 Y-07

      Hope that’s useful,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      I fully agree a hash would be far better, even with a relatively small number of sites. It is not only faster, but it is also easier to code. A simple lookup in a hash is more straightforward than a grep (or whatever other implementation) over an array.
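
To make the comparison in this sub-thread concrete, here is a small side-by-side sketch. It is not taken from any reply: the helper names is_bad_grep and is_bad_hash are invented for illustration, and both assume the site field has already been stripped of its trailing dashes so that an exact comparison is possible.

```perl
use strict;
use warnings;

my @sites = qw( www.yahoo.com www.google.com www.comcast.com );

# Array version (hypothetical helper): grep compares every element of
# @sites on each call and returns the number of equal entries.
sub is_bad_grep {
    my ($www) = @_;
    return scalar( grep { $_ eq $www } @sites );
}

# Hash version (hypothetical helper): build the lookup table once,
# then each check is a single key lookup.
my %is_bad = map { $_ => 1 } @sites;

sub is_bad_hash {
    my ($www) = @_;
    return exists $is_bad{$www} ? 1 : 0;
}

for my $www (qw( www.google.com www.ask.com )) {
    printf "%-16s grep=%d hash=%d\n",
        $www, is_bad_grep($www), is_bad_hash($www);
}
```

Both helpers give the same answer for the sample sites; the difference is that the grep version repeats the full scan for every log line, while the hash version pays the setup cost once.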

Re: divide one file into multiple arrays
by Eily (Monsignor) on Jun 25, 2013 at 18:04 UTC

    You could also use quotemeta on your @sites strings, so that the dots match literal dots rather than any character.

      Or use \Q...\E
      as in
      push @bad_log, $_ if /\Q$site\E/;
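
A short demonstration of why the escaping suggested here matters. This is only an illustrative sketch; the string in $fake is made up and does not come from the poster's log.

```perl
use strict;
use warnings;

my $site = 'www.yahoo.com';
my $fake = 'wwwXyahooYcom';            # made-up string containing no dots
my $real = 'www.yahoo.com--------';    # site field as it appears in the log

# Unescaped: each "." in $site is a regex metacharacter matching any character.
my $plain  = $fake =~ /$site/     ? 1 : 0;

# Escaped with \Q...\E: each "." must match a literal dot.
my $quoted = $fake =~ /\Q$site\E/ ? 1 : 0;

# The genuine entry still matches when the pattern is escaped.
my $genuine = $real =~ /\Q$site\E/ ? 1 : 0;

print "plain=$plain quoted=$quoted genuine=$genuine\n";
```

The unescaped pattern accepts the dot-free string, while the escaped one rejects it and still matches the genuine entry. Calling quotemeta($site) once per site, outside the log loop, should have the same effect as \Q...\E.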

Re: divide one file into multiple arrays
by toolic (Bishop) on Jun 25, 2013 at 17:28 UTC
    Post 10 lines from each of your input files.
Re: divide one file into multiple arrays
by Preceptor (Deacon) on Jun 25, 2013 at 17:31 UTC

    I'm not entirely sure what's going on, but I note when you do your pattern match, you don't include an operator (usually - you need 'm' for 'match' 's' for substitute, or tr for transliteration). I don't know if that should work or not. How does this work:

    if ( m/$site/ ) {
        print "$_ matched $site\n";
        push ( @bad_log, $_ );
    }
      "usually - you need 'm' for 'match' ..."

      You don't need it for /pattern/ or 'pattern'. You also don't need it for ?pattern?; however, that construct is deprecated.

      #!/usr/bin/env perl
      use strict;
      use warnings;
      my $x = 'abc';
      if ($x =~ /b/) { print "y\n" } else { print "n\n" }
      if ($x =~ 'b') { print "y\n" } else { print "n\n" }
      if ($x =~ ?b?) { print "y\n" } else { print "n\n" }

      Output:

      $ junk
      Use of ?PATTERN? without explicit operator is deprecated at ./junk line 7.
      y
      y
      y

      See perlop - Regexp Quote-Like Operators for details.

      -- Ken
