content extraction

zzgulu has asked for the wisdom of the Perl Monks concerning the following question:

hello all,
I am a beginner trying to extract the content of certain sections (such as medical history) from text files, assuming each section header is all upper case and the content starts after the colon ":". Extraction continues until it reaches a new line that begins, at the very first position, with an upper-case letter. The script works, but I would like your advice on making it more stable and efficient, or on revising my approach to the problem. The only consistent pattern in my text files is that headers or section titles are all upper case, start at the beginning of a line, and end in a colon. Thank you for your feedback.

#!/usr/bin/perl

open IN, "input.txt";
open OUT, "output.out";
while (my $a=<IN>) {
   $content.=$a;
}   

while($content=~m/(^MEDICAL HISTORY:)(.*?)(\n^[A-Z])/sgm )
     
{
print "$2\n";
}
exit;

Replies are listed 'Best First'.
Re: content extraction by GrandFather (Saint) on Feb 23, 2009 at 21:31 UTC
There are a number of ways to improve your script. First, always use strictures (`use strict; use warnings;`). They pick up typos and give early warning of trouble ahead. Always use the three-parameter version of open, which gives better security and makes the intent clearer. Check the result of opens and closes; again, you get an early heads-up about trouble. Use lexical file handles to better manage the lifetime of the opened file handle. Never use `$a` or `$b` as general-purpose variables: they are special and should only be used with sort. In fact, avoid single-letter variable names in general, because it is too easy to confuse them with each other. Be consistent in your use of indentation and white space. PerlTidy is a really good tool; use it. Avoid `.*?` in regular expressions. Instead use a negated character class: `[^\n]`. Only capture stuff you actually want to use; it is more efficient and less confusing.

True laziness is hard work
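A minimal sketch of how several of these points might look together: strictures, three-argument open with a lexical handle, checked open and close, descriptive variable names, a negated character class in place of `.*?`, and a single capture group. The helper names `slurp_file` and `extract_sections` are hypothetical, as is the choice to take the file name from the command line. This is one possible arrangement, not the code GrandFather had in mind.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
sub slurp_file {
    my ($path) = @_;

    # Three-argument open, lexical handle, checked result.
    open my $in_fh, '<', $path or die "Cannot open '$path': $!";
    local $/;    # undefined record separator: read everything at once
    my $content = <$in_fh>;
    close $in_fh or die "Cannot close '$path': $!";
    return $content;
}

# Hypothetical helper: return the body of every "$header:" section.
sub extract_sections {
    my ($content, $header) = @_;
    my @bodies;

    # One capture group only. The body is the rest of the header line plus
    # every following line that does not start with an upper-case letter.
    while ($content =~ /^\Q$header\E:([^\n]*(?:\n(?![A-Z])[^\n]*)*)/mg) {
        push @bodies, $1;
    }
    return @bodies;
}

# Assumption of this sketch: the input file name is the first argument.
if (@ARGV) {
    print "$_\n" for extract_sections(slurp_file($ARGV[0]), 'MEDICAL HISTORY');
}
```

Because the negative lookahead `(?![A-Z])` only inspects the first character of the next line, the following header is never consumed, so consecutive sections are each still found.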
Re^2: content extraction by zzgulu (Novice) on Feb 24, 2009 at 13:25 UTC
Thank you Beth, GrandFather, and Lawliet for your great points and prompt replies. I will keep them in mind. Lawliet, I used the other two capture groups in case I want to test the results (for example by adding `$1\t` to the print) to make sure I am capturing the right content. Regarding "exit", I guess you are right and it is unnecessary. Thank you all again for your input.
Re: content extraction by ELISHEVA (Prior) on Feb 23, 2009 at 20:45 UTC
Your script would be significantly more efficient if you detected the start and end of each extraction region while you are reading in the file, something like this (assuming MEDICAL HISTORY: begins a line):

    #!/usr/bin/perl
    use warnings;
    use strict;

    open IN, "input.txt" or die "Cannot open input.txt: $!";
    open OUT, ">output.out" or die "Cannot open output.out: $!";

    my $sHistory   = '';
    my $bInHistory = 0;
    while (my $line = <IN>) {
        if ($line =~ /^MEDICAL HISTORY:(.*)$/) {
            # Section header: keep whatever follows the colon on this line.
            $bInHistory = 1;
            $sHistory   = "$1\n";
        }
        elsif ($line =~ /^[A-Z]/) {
            # A line starting with an upper-case letter ends the section.
            print OUT $sHistory if $bInHistory;
            $bInHistory = 0;
            $sHistory   = '';
        }
        elsif ($bInHistory) {
            $sHistory .= $line;
        }
    }
    # The section may run to the end of the file.
    print OUT $sHistory if $bInHistory;

Also, it is a very good idea to start your script with the two lines:

    use strict;
    use warnings;

as I did above. You will save yourself a world of debugging pain by doing so.

Another point: the variables `$a` and `$b` have special meaning in Perl (they are used by sort), so it is best to stay away from those names and call your variables something else.

And another point: always check for errors when you open file handles. Sometimes they don't open as you expect. If you don't check, you'll get strange results without any proper warning.

Best, beth

Update: fixed some bugs (including one pointed out in private msg by almut).
Re: content extraction by Lawliet (Curate) on Feb 23, 2009 at 20:44 UTC
Hmm, why do you capture patterns that you never use? `while ($content =~ /^MEDICAL HISTORY:(.+?)\n^[A-Z]/sgm) { print "$1\n"; }` Also, why do you exit at the end of your script?

And you didn't even know bears could type.
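One thing to watch with the single-capture pattern: it requires a following line that starts with an upper-case letter, so a MEDICAL HISTORY section at the very end of the file would not match, and the first letter of the next header is consumed by the match. A possible variant, sketched here with made-up sample text, uses a lookahead so the next header (or the end of the text) is checked but not consumed. Unlike the pattern above, the captured text keeps its trailing newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text invented for illustration only.
my $content = <<'END_TEXT';
NAME: J. Doe
MEDICAL HISTORY: asthma since childhood,
  seasonal allergies
MEDICATIONS: none
MEDICAL HISTORY: appendectomy
END_TEXT

my @histories;

# (?=...) is a lookahead: it tests for the next header line (or the end of
# the text, \z) without making it part of the match.
while ($content =~ /^MEDICAL HISTORY:(.+?)(?=^[A-Z]|\z)/sgm) {
    push @histories, $1;
}

print "[$_]\n" for @histories;
```

Both sections are collected here, including the last one, which has no header after it.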

