http://www.perlmonks.org?node_id=497006


in reply to Filter script with pattern and an array

thx ppl for ure help with the all in one pattern it works grade ;D i just filterd 1000000 lines and it filter everything correct ;D and yes i use no struct ,D etc because others need to change the script easyly and they have totaly no clue of perl,this is why i use the variables for the filenames as well ;D i striped some filter vars of because 60000 are to much ;D i also include an else print to sort the non maching out of the file ;D heres the code to share my xperiance with all ;D
#!/usr/bin/perl $infile = @ARGV[0]; $outfile = "OK.txt"; $outfile2 = "ERROR.txt"; open(INFILE, "< $infile") or die "\nDatei $infile konnte nicht geoeffnet werden\n"; open(OUTFILE, "> $outfile") or die "\nDatei $outfile konnte nicht geoeffnet werden\n"; open(OUTFILE2, "> $outfile2") or die "\nDatei $outfile2 konnte nicht geoeffnet werden\n"; while (<INFILE>) { if($_ =~ /01005;01000|01005;01001|01005;01002|01005;01003|01005;01004| +01005;01005|01005;01006|01005;01007|01005;01008|01005;01009|01005;010 +10|01005;01011|01005;01012|01005;01013|01005;01014|01005;01015|01005; +01109|01005;01170|01005;01177|01005;01198|01005;01199|01005;01200|010 +05;01400|01005;01401|01005;01402|01005;01403|01005;01404|01005;01405| +01005;01406|01005;01407|01005;01408|01005;01409|01005;01410|01005;014 +11|01005;01412|01005;01443|01005;01576|01005;01577|01005;01597|01005; +01600|01005;01601|01005;01602|01005;01603|01005;01604|01005;01605|010 +05;01606|01005;01607|01005;01608|01005;01610|01005;01611|01005;01612| +01005;01613|01005;01614|01005;01615|01005;01616|01005;01617|01005;016 +18|01005;01619|01005;01620|01005;01621|01005;01622|01005;01623|01005; +01624|01005;01625|01005;01626|01005;01627|01005;01628|01005;01629|010 +05;01634|01005;01635|01005;01660|01005;01773|01005;01798|01005;01800| +01005;01801|01005;02000|01005;02001|01005;02002|01005;02003|01005;020 +04|01005;02005|01005;02006|01005;02045|01005;02046|01005;02196|01005; +02200|01005;02201|01005;02202|01005;02203|01005;02204|01005;02398|010 +05;02399|01005;02400|01005;02402|01005;02403|01005;02404|01005;02405| +01005;02406|01005;02407|01005;02408|01005;02409|01005;02598|01005;025 +99|01005;02600|01005;02601|01005;02603|01005;02800|01005;02801|01005; +02802|01005;02803|01005;02804|01005;02805|01005;02807|01005;02809|010 +05;02810|01005;02811|01005;02812|01005;02970|01005;02971|01005;02996| +01005;02997|01005;02998|01005;03000|01005;03001|01005;03002|01005;032 +00|01005;03201|01005;03202|01005;03203|01005;03204|01005;03205|01005; +03206|01005;03207|01005;03208|01005;03209|01005;03210|01005;03211|010 +05;03212|01005;03213|01005;03214|01005;03215|01005;03216|01005;03217| +01005;03218|01005;03219|01005;03220|01005;03221|01005;03222|01005;032 +23|01005;03224|01005;03225|01005;03226|01005;03227|01005;03228|01005; +03229|01005;03230|01005;03231|01005;03232|01005;03233|01005;03234|010 +05;03235|01005;03236|01005;03237|01005;03238|01005;03239|01005;03240| +01005;03241|01005;03242|01005;03243|01005;03244|01005;03245|01005;032 +46|01005;03247|01005;03248|01005;03249|01005;03250|01005;03252|01005; +03253|01005;03254|01005;03255|01005;03256|01005;03257|01005;03258|010 +05;03259|01005;03260|01005;03261|01005;03262|01005;03264|01005;03265| +01005;03266|01005;03267|01005;03268|01005;03269|01005;03270|01005;032 +71|01005;03272|01005;03273|01005;03274|01005;03275|01005;03276|01005; +03277|01005;03278|01005;03279|01005;03280|01005;03281|01005;03294|010 +05;03298|01005;03305|01005;03309|01005;03320|01005;03332|01005;03333| +01005;03337|01005;03338|01005;03339|01005;03340|01005;03341|01005;033 +42|01005;03343|01005;03344|01005;03345|01005;03347|01005;03350|01005; +03366|01005;03367|01005;03369|01005;03370|01005;03371|01005;03372|010 +05;03373|01005;03374|01005;03375|01005;03376|01005;03379|01005;03381| +01005;03382|01005;03383|01005;03384|01005;03385|01005;03386|01005;033 +87|01005;03388|01005;03389|01005;03392|01005;03393|01005;03394|01005; +03395|01005;03396|01005;03397|01005;03398|01005;03399|01005;03400|010 +05;03401|01005;03402|01005;03403|01005;03404|01005;03405|01005;03406| +01005;03407|01005;03408|01005;03409|01005;03410|01005;03411|01005;034 +12|01005;03413|01005;03414|01005;03415|01005;03416| 01005;03417|01005;03419|01005;03420|01005;03433|01005;03531|01005;0360 +0|01005;03601|01005;03602|01005;03800|01005;04000|01005;04001|01005;0 +4002|01005;04003|01005;04004|01005;04005|01005;04006|01005;04007|0100 +5;04008|01005;04009|01005;04010|01005;04011|01005;04012|01005;04070|0 +1005;04199|01005;04200|01005;04400|01005;04401|01005;04402|01005;0440 +3|01005;04404|01005;04405|01005;04600|01005;04601|01005;04602|01005;0 +4603|01005;04604|01005;04605|01005;04606|01005;04607|01005;04675|0100 +5;04676|01005;04773|01005;04800|01005;04801|01005;04802|01005;04803|0 +1005;04804|01005;04998|01005;04999|01005;10000|01005;10100|01005;1010 +1|01005;10200|01005;10201|01005;10300|01005;10400|01005;10500|01005;1 +0600|01005;10601|01005;10700|01005;10800|01005;10900|01005;10943|0100 +5;11000|01005;11100|01005;11101|01005;11200|01005;11201|01005;11202|0 +1005;11300|01005;11301|01005;11400/ ) { print OUTFILE; } else { print OUTFILE2; } } close(INFILE); close(OUTFILE); close(OUTFILE2);

Replies are listed 'Best First'.
Re^2: Filter script with pattern and an array
by graff (Chancellor) on Oct 04, 2005 at 00:45 UTC
    I'm guessing that this is still in "test" stages... it does not look like you have 60000 elements in your regex yet. ;)

    It's looks like your patterns are supposed to match whole fields -- for example, "01005;11200" should match a line like this:

    012345;23456;01005;11200;000111222;111222333
    but it should not match a line like this:
    012345;23456;02006;22300;000001005;112004444
    The code you posted will match both lines, because the regex does not include ";" before and after the long conjunction of field values.

    Since you seem to be dealing with flat-table data, and your regex patterns involve matching certain combinations of third and fourth column values on each table row, you should consider treating handling things in a more table-like manner: read the target patterns into a hash, then read each row of the flat table file, pull out the 3rd and 4th fields, and see if they consistute an existing hash key.

    In any case, you do want to make sure your script will load your target patterns from a list file, rather than putting all the values in the perl code like you've done here. For example:

    use strict; use warnings; ( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] ) or die "Usage: $0 input.table target.list\n"; my ( $infile, $targfile ) = @ARGV; my $outfile = "OK.txt"; my $errfile = "ERROR.txt"; open( IN, $infile ) or die "$infile: $!"; open( OUT, $outfile ) or die "$outfile: $!"; open( ERR, $errfile ) or die "$errfile: $!"; open( TARG, $targfile ) or die "$targfile: $!"; my %target; while (<TARG>) { chomp; # target.list has lines like "01005;11400" $target{$_} = undef; } close TARG; while (<IN>) { my @fields = split /;/; # assuming no quoted ";" within fields my $check = join ';', @fields[2,3]; # line-initial value is $field +s[0] # so 3rd and 4th are @fields[ +2,3] if ( exists( $target{$check} )) { print OUT; } else { print ERR; } } close OUT; close ERR;

    As for this comment of yours:

    i use no struct strict ... etc because others need to change the script easyly and they have totaly no clue of perl
    If others, with less knowledge of perl than you have, are going to be altering this script, then that's the most important reason to include  use strict; use warnings; -- that way, when they screw something up, there's a much better chance that the problem will be caught (and explained) before things get worse.

    (If these other people are just making adjustments to the list of target patterns, that is another very good reason for keeping that list in a separate file, so it can be updated without having to touch the perl script.)

    One last point: if your target patterns are not always being sought in the same columns of the table -- e.g. sometimes your target string is expected to match columns 3 and 4, and other times it is expected to match columns 5 and 6 -- then you might need to revert back to the regex approach. In that case, you should assign the conjunction of strings to a scalar, and form the regex like this:

    my @targ_strings = <TARG>; chomp @targ_strings; my $targ_regex = join "|", @targs; while (<IN>) { if ( /;(?:$targ_regex);/ ) { print OUT; } else { print ERR; } }
      hi thx for this suggestion but there are only txt lines and some of them contain the paaterns i use ;D no others ;D and its running fine ;D in the above code i dont include all 60000 becouse of the readable ;D so far the script filters over 200 files and get out lots of wrong files so i can speed up the process near 80 % faster and this is greate ;D
Re^2: Filter script with pattern and an array
by blazar (Canon) on Oct 04, 2005 at 07:23 UTC
    and yes i use no struct ,D etc because others need to change the script easyly and they have totaly no clue of perl
    (my $strict='struct') =~ s/u/i/; and yes, this is one more reason to use it, not just the opposite!