PerlMonks
Seeking Algorithm

by WhiteBird (Hermit)
on May 28, 2004 at 03:15 UTC

WhiteBird has asked for the wisdom of the Perl Monks concerning the following question:

I have a medium sized file of about 6,000 records. They are diagnosis codes for emergency room patient visits, but that's beside the point. Consider them a string of numbers arranged like:

396.04,305.1,894.2,V321,908.3,,,,,
299.5,785,432.1,305.1,,,,,,
112.3,312,685,422.1,V566,433.2,987.5,,,, etc.

There can be as few as 2 numbers in each record or as many as 10. I need to filter the records down to a certain subset: each row I keep must contain at least one number between 296.1 and 314.0. However, if a row contains "305.1", it should be excluded when "305.1" is the only number between 296.1 and 314.0 in that row. In the sample rows above, row 1 would be discarded and rows 2 and 3 kept.

The person requesting this data suggested that I dump it into an excel spreadsheet, sort the data and manually remove the records that I don't need. That seems way too labor intensive to me, and I would think that Perl would have some quick and easy way to sort this out. I can't quite get my mind around the best way to do it, however.

I was thinking that I'd read each row and use a regular expression to find rows with 305.1 and then check for the existence of another qualifying number. Based on that I could then either delete the rows I don't need or save the ones I want to a new file. I'm a little rusty with Perl right now, and I don't even know how to start. I thought if I organized it into a node and tossed it out here that some discussion might help get me going. I'd appreciate any suggestions. Thanks.
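[A minimal sketch of that plan, assuming the comma-separated layout of the sample rows above: split each row, collect the codes that fall in the range, and drop the row when 305.1 is the only one.]

```perl
use strict;
use warnings;

# Return true if a row should be kept: at least one code in
# [296.1, 314.0] that is not 305.1 (so 305.1-only rows are dropped).
sub keep_row {
    my ($row) = @_;
    my @in_range = grep {
        /^\d+(?:\.\d+)?$/            # skip blanks and V-prefixed codes
          && $_ >= 296.1 && $_ <= 314.0
    } split /,/, $row;
    return 0 unless @in_range;
    return 0 if @in_range == 1 && $in_range[0] eq '305.1';
    return 1;
}

# The three sample rows: the first is dropped, the other two kept.
print "$_\n" for grep { keep_row($_) }
    '396.04,305.1,894.2,V321,908.3,,,,,',
    '299.5,785,432.1,305.1,,,,,,',
    '112.3,312,685,422.1,V566,433.2,987.5,,,,';
```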

Replies are listed 'Best First'.
Re: Seeking Algorithm
by Zaxo (Archbishop) on May 28, 2004 at 03:27 UTC

    My first reaction was to split and grep, but since the numbers are so tightly defined, I think a regex should do,

    my @keepers;
    open my $fh, '<', '/path/to/file.dat' or die $!;
    while (<$fh>) {
        while ( /(\d+\.\d+)/g ) {
            push @keepers, $_ and last
                if $1 >= 296.1 && $1 <= 314.0;
        }
    }
    close $fh or die $!;
    The while (/(foo)/g) construction loops over all the matches, putting the matching text in $1.
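    [For example, against one of the sample rows from the question:]

```perl
use strict;
use warnings;

my $row = '299.5,785,432.1,305.1,,,,,,';
my @matches;
while ( $row =~ /(\d+\.\d+)/g ) {
    push @matches, $1;       # each decimal-looking token in turn
}
print "@matches\n";          # 299.5 432.1 305.1 (785 has no decimal point)
```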

    Update: Missed the exclusion for 305.1, grep it is:

    my @keepers;
    open my $fh, '<', '/path/to/file.dat' or die $!;
    while (<$fh>) {
        my @data = grep { /^\d+\.\d+$/ && $_ >= 296.1 && $_ <= 314.0 }
                   split ',';
        if (@data) {
            push @keepers, $_ unless @data == 1 and $data[0] eq '305.1';
        }
    }
    close $fh or die $!;

    Update2: CountZero++ is correct, code amended.

    After Compline,
    Zaxo

      Are you sure your updated version works? Some of the data have a 'V' prefixed to the figures and you do not seem to filter this out as you did in your first example.

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Seeking Algorithm (jump early)
by tye (Sage) on May 28, 2004 at 04:13 UTC

    I'd probably do it like this to avoid splitting and comparing items after a match is already found for a line:

    while( <> ) {
        while( /(\d+(?:\.\d+)?)/g ) {
            if( $1 ne '305.1' && 296.1 <= $1 && $1 <= 314.0 ) {
                print;
                last;
            }
        }
    }

    (updated slightly)

    - tye        

Re: Seeking Algorithm
by atcroft (Abbot) on May 28, 2004 at 03:35 UTC

    Just looking briefly at it, the following comes to mind. I'm going to make a bit of an assumption in that you just want to keep a copy of the lines, for processing later.

    # $filename is already defined
    my (@keptrecords);
    open(INFILE, $filename)
        or die("Can't open $filename for input: $!\n");
    while (my $line = <INFILE>) {
        my $tokeep         = 0;
        my $checkexclusion = 0;
        my @parts = split(/,/, $line);
        foreach (@parts) {
            $checkexclusion++ if ($_ eq '305.1');
            $tokeep++ if (( 296.1 <= $_ ) and ( $_ <= 314.0 ));
        }
        next if (( $checkexclusion ) and ($tokeep <= 1));
        push(@keptrecords, $line);
    }

    Just a thought, based on the idea you presented. Hope that helps....

Re: Seeking Algorithm
by dragonchild (Archbishop) on May 28, 2004 at 13:13 UTC
    Let me guess - you're doing a report on all patients that had a diagnosis of type yadayadayada and they want to exclude the diagnosis for blahblahblah. You were also told that this was a one-off report and they'd never want to do the same thing, but with the code for something else excluded.

    Do yourself a favor and put this into a database. I'm going to assume that each line is a patient. So, create a patient table and a diagnosis table. Then, you can do something like:

    SELECT patient.id
      FROM patient, diagnosis
     WHERE diagnosis.patient = patient.id
       AND diagnosis.code BETWEEN 296.1 AND 314.0
       AND NOT EXISTS (
           SELECT d_int.patient
             FROM diagnosis d_int
            WHERE d_int.patient = patient.id
              AND d_int.code = 305.1
       )
    Or something like that. The benefit here is that you have the data in a database for the next stupid question you get asked.
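    [A rough sketch of that idea, using an in-memory SQLite database via DBI. The patient/diagnosis schema is hypothetical, and the query below is a simplified illustration rather than dragonchild's exact SQL: it keeps any patient with an in-range code other than 305.1.]

```perl
use strict;
use warnings;
use DBI;

# Hypothetical schema: one patient row per record, one diagnosis
# row per code.  Table and column names are illustrative only.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE patient   (id INTEGER PRIMARY KEY)');
$dbh->do('CREATE TABLE diagnosis (patient INTEGER, code TEXT)');

my %records = (
    1 => [qw(396.04 305.1 894.2 908.3)],   # only 305.1 in range: drop
    2 => [qw(299.5 785 432.1 305.1)],      # 299.5 also in range: keep
);
for my $id (sort keys %records) {
    $dbh->do('INSERT INTO patient VALUES (?)', undef, $id);
    $dbh->do('INSERT INTO diagnosis VALUES (?, ?)', undef, $id, $_)
        for @{ $records{$id} };
}

# Keep patients with at least one in-range code other than 305.1.
my $ids = $dbh->selectcol_arrayref(q{
    SELECT DISTINCT patient FROM diagnosis
     WHERE CAST(code AS REAL) BETWEEN 296.1 AND 314.0
       AND code <> '305.1'
});
print "keep patient $_\n" for @$ids;
```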

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      <grin>Well, you're right on with your description of the requested task. (Have you spent some time in healthcare?) I had started out with a database, but the potential of 10 diagnosis columns and thinking about splitting the data into tables and writing the query seemed a bit laborious. I've not used the NOT EXISTS syntax before--I'll have to look into that for future reference. Thanks.
        You don't have to work in healthcare to make that kind of observation. It's a very common goose-chase scenario that comes up when people would really be better off using a database.
Re: Seeking Algorithm
by tkil (Monk) on May 28, 2004 at 07:18 UTC

    Depending on how precise you want to be about equality, you might need to do a fancy floating-point compare:

    my $EPSILON = 0.01;
    sub approx_equal ( $ $ ) {
        my ( $f1, $f2 ) = @_;
        return abs( $f1 - $f2 ) < $EPSILON;
    }

    The most straightforward filter-style code I could come up with:

    while ( <> ) {
        my @in_range = grep { /^[\d.+-]+$/ && 296.1 <= $_ && $_ <= 314.0 }
                       split /,\s*/;
        # skip records with only 305.1 (room temp kelvin?) in range
        next if @in_range == 1 && approx_equal( $in_range[0], 305.1 );
        print;
    }

    If this is too slow, it might be faster to bail out as soon as we know we can:
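    [A minimal sketch of that early-exit approach, assuming the same comma-separated layout: each line is decided as soon as one in-range code other than 305.1 turns up, rather than collecting every in-range code first.]

```perl
use strict;
use warnings;

# Early-exit variant: return as soon as the line's fate is known.
sub keep {
    my ($line) = @_;
    for ( split /,\s*/, $line ) {
        next unless /^[\d.]+$/;               # skip blanks and V-codes
        return 1 if $_ ne '305.1' && 296.1 <= $_ && $_ <= 314.0;
    }
    return 0;
}

my @rows = (
    '396.04,305.1,894.2,V321,908.3,,,,,',     # only 305.1 in range: drop
    '299.5,785,432.1,305.1,,,,,,',            # 299.5 also in range: keep
);
print "$_\n" for grep { keep($_) } @rows;
```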

      There's no need for an "approx_equal" to do fancy floating point compare. You get the numbers from a split - so you have strings. Just use string compare and avoid the uncertainty problems you have with floating point compare.

      Abigail

        There's no need for an "approx_equal" to do fancy floating point compare. You get the numbers from a split - so you have strings. Just use string compare and avoid the uncertainty problems you have with floating point compare.

        At one point in the evolution of my response, I was relying on strings to avoid this problem. Three reasons I switched:

        1. The original post didn't make it clear that all numbers would have exactly one decimal point. If the row contained 305.10, did that count? How about 305.099?
        2. I can never remember if scalars keep both their string and numeric natures at the same time. After using a numeric comparison against a given scalar, is it now just a number, or are both kept around? (I could find out by research or experimentation, but the fact that I had to think about it, even after 10+ years of using Perl, makes me think that I should avoid this subtlety.)
        3. Finally, it was a way to throw in an educational tidbit "for free". The original author didn't specify the example very precisely; by including this in my response, it would hopefully help them think about it more clearly. (And/or I was showing off. You decide.)
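        [Regarding point 2: a quick check suggests the scalar keeps both natures. A numeric comparison caches a numeric value on the scalar but does not discard its string form, so a later eq still sees the original text, trailing zero and all.]

```perl
use strict;
use warnings;

my $code = '305.10';
# Numeric comparison numifies the scalar...
my $in_range = ( $code >= 296.1 && $code <= 314.0 );
# ...but the original string is still intact for eq:
print "numeric ok\n" if $in_range;
print "string kept\n" if $code eq '305.10';   # trailing zero preserved
```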
Re: Seeking Algorithm
by WhiteBird (Hermit) on May 28, 2004 at 15:40 UTC
    Thanks all for your input and rapid response. This is just what I was looking for--even more than I anticipated. I find that Tye's solution returns what I need when run against my test data. More testing is in order and I'll probably work a little with the other suggestions just to broaden my horizons and explore TIMTOWTDI. Here's the code I'm working from now:
    use strict;
    my $outfile = "\\NewCodes.txt";
    my $infile  = "C:\\TestTBL.txt"; # $filename is already defined
    open (LOGFILE, ">$outfile") or die "Can't create logfile: $!";
    open NAMES, $infile or die "Can't open input datafile: $!\n";
    while( <NAMES> ) {
        while( /(\d+(?:\.\d+)?)/g ) {
            if( $1 ne '305.1' && 290.0 <= $1 && $1 <= 314.0 ) {
                print LOGFILE;
                last;
            }
        }
    }
    __DATA__
    4106693","01/01/00","305.1","473.9","465.8",,,,,,,
    "SP","000154106774","01/01/00","511.0","305.1",,,,,,,,
    "SP","000154107231","01/01/00","305.1","490","786.50",,,,,,,
    "SP","000154107398","01/01/00","558.9","305.1",,,,,,,,
    "SP","000154108114","01/02/00","305.1","789.00","792.1",,,,,,,
    "MCD","000154108661","01/03/00","305.1","922.31","E960.0",,,,,,,
    "SP","000154110313","01/03/00","305.1","784.0","780.4",,,,,,,
    "SP","000154111573","01/03/00","493.91","305.1",,,,,,,,
    "MCD","000154111581","01/03/00","599.0","305.1","V14.0",,,,,,,
    "MCD","000154115056","01/04/00","789.00","305.1",,,,,,,,
    "SP","000154122133","01/07/00","786.50","305.1",,,,,,,,
    "MCD","000154125507","01/08/00","305.1","789.00","412",,,,,,,
    "SP","000154125736","01/08/00","305.1","E849.5","847.0","847.9","924.8","E812.0",,,,
    "SP","000154126007","01/08/00","490","305.1",,,,,,,,
    "MCD","000154126147","01/08/00","305.1","729.1","346.90",,,,,,,
    "MCD","000154126775","01/09/00","305.1","525.8","V45.89",,,,,,,
    "MCD","000154127127","01/09/00","305.1","648.93","346.90",,,,,,,
    "SP","000154128875","01/10/00","873.63","305.1",,,,,,,,
    "SP","000154129243","01/10/00","789.01","305.1",,,,,,,,
    "SP","000154130535","01/10/00","729.5","305.1",,,,,,,,
    "MCD","000154107771","01/02/00","301.9","780.39",,,,,,,,
    "MCD","000154115773","01/05/00","802.0","910.0","E960.0","E849.9","250.01","311","758.7",,,
    "MCD","000154118241","01/05/00","787.03","295.70","737.30",,,,,,,
    "MCD","000154125621","01/08/00","295.70","714.0",,,,,,,,
    "MCD","000154125957","01/08/00","812.20","E885","E849.9","309.81",,,,,,
    "MCD","000154127101","01/09/00","310.1","305.1","E869.4","E849.0",,,,,,
    "MCD","000154127160","01/09/00","295.90",,,,,,,,,
    "MCD","000154130519","01/10/00","723.4","847.0","784.0","E884.3","E849.9","907.2","344.1","E929.8","305.1","V44.3"
    "MCD","000154131701","01/11/00","300.4",,,,,,,,,
    "MCD","000154134085","01/11/00","298.9","305.1",,,,,,,,
    "MCD","000154141162","01/13/00","558.9","493.90","296.7",,,,,,,
    "MCD","000154141308","01/13/00","784.0","300.01",,,,,,,,
    "MCD","000154146431","01/16/00","312.9","536.9",,,,,,,,
    "MCD","000154152644","01/18/00","313.81",,,,,,,,,
    "MCD","000154154507","01/18/00","298.8","305.1",,,,,,,,
    "MCD","000154158766","01/20/00","842.00","E826.1","E849.9","314.00",,,,,,
    "MCD","000154159631","01/20/00","303.90","305.1",,,,,,,,
    "MCD","000154161007","01/20/00","313.81","311",,,,,,,,
    "MCD","000154161589","01/20/00","786.50","300.00",,,,,,,,
    "MCD","000154165410","01/22/00","305.00",,,,,,,,,
Re: Seeking Algorithm
by tkil (Monk) on May 28, 2004 at 16:19 UTC

    Potentially offensive to some, but in the spirit of TMTOWTDI:

    The person requesting this data suggested that I dump it into an excel spreadsheet, sort the data and manually remove the records that I don't need. That seems way too labor intensive to me

    There are pretty easy ways to do this in Excel. Take a look at the worksheet function COUNTIF; combined with a bit of multi-stage arithmetic and then filtering, you should be able to narrow down your set of data pretty easily.

    Hm, after a bit more research, this might not be so easy after all, since COUNTIF only takes one criterion. This can be worked around at least two ways: first, count entries above the range, add the number of entries below the range, then subtract that from the total number of entries. Another solution is to use SUMPRODUCT with boolean tests. Either way, it looks like it can be done with three additional columns (# in range, # equal to 305.1, include yes/no).

    Update

    Works out reasonably well. Doing a text import of the CSV gives me 13 columns (A-M). The first three are identifiers (initials, id, and date), then the ten values (columns D-M). Starting at row 2 (after adding a row of headers), I have:

    N2 = COUNTIF(D2:M2,">=296.1")-COUNTIF(D2:M2,">314.0")
    O2 = COUNTIF(D2:M2,305.1)
    P2 = NOT(AND(N2=1,O2=1))

    Workbook here (Excel 2000, 27KB).

    Just out of curiosity, how is one supposed to interpret the values with letters in front?

Node Type: perlquestion [id://357123]
Approved by talexb
Front-paged by tye