PerlMonks  

content extraction

by zzgulu (Novice)
on Feb 23, 2009 at 19:51 UTC

zzgulu has asked for the wisdom of the Perl Monks concerning the following question:

hello all,
I am a beginner trying to extract the content of certain sections (such as medical history) from text files, assuming the section header is all in upper case and the content starts after a colon ":". Extraction continues until it reaches a new line that begins, at the very first position of the line, with an upper case letter. The script works, but I would like your advice on making it more stable and efficient, or on revising my approach to the problem. The only consistent pattern in my text files is that headers or section titles are all in upper case, start at the beginning of the line, and end in a colon. Thank you for your feedback.

#!/usr/bin/perl
open IN, "input.txt";
open OUT, "output.out";
while (my $a=<IN>) {
$content.=$a;
}
while($content=~m/(^MEDICAL HISTORY:)(.*?)(\n^[A-Z])/sgm ) {
print "$2\n";
}
exit;

Replies are listed 'Best First'.
Re: content extraction
by GrandFather (Saint) on Feb 23, 2009 at 21:31 UTC

    There are a number of ways to improve your script.

    First, always use strictures (use strict; use warnings;). They pick up typos and give early warning of trouble ahead.

    Always use the three parameter version of open which gives better security and makes the intent clearer. Check the result of opens and closes - again you get an early heads up about trouble. Use lexical file handles to better manage the life time of the opened file handle.
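
    A minimal sketch of the open advice above. The helper name copy_lines and the error messages are made up for illustration; only open, close, print and $! are standard Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy every line of one file to another and
# return the number of lines copied.
sub copy_lines {
    my ($in_name, $out_name) = @_;

    # Three-argument open: the mode is kept separate from the file name,
    # and the handles are lexical, so they go away with this scope.
    open my $in,  '<', $in_name  or die "Can't read '$in_name': $!";
    open my $out, '>', $out_name or die "Can't write '$out_name': $!";

    my $count = 0;
    while (my $line = <$in>) {
        print {$out} $line;
        $count++;
    }

    # Closing a write handle can fail (for example on a full disk),
    # so the result of close is checked as well.
    close $in  or die "Can't close '$in_name': $!";
    close $out or die "Can't close '$out_name': $!";

    return $count;
}
```

    With the two-argument form, a file name that happens to start with ">" or "|" would change what open does, which is the security concern mentioned above.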

    Never use $a or $b as general purpose variables - they are special and should only be used with sort. In fact, avoid single letter variable names in general - it's too easy to get them confused with each other.
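
    To illustrate why $a and $b are special, here is a small sketch (sort_numeric_desc is a made-up name). sort sets the package variables $a and $b for every comparison, which is why they need no declaration even under strict. A "my $a" elsewhere in the same scope would hide the variable that sort sets.

```perl
use strict;
use warnings;

# Hypothetical helper: sort numbers from largest to smallest.
# $a and $b are filled in by sort itself for each comparison.
sub sort_numeric_desc {
    my @numbers = @_;
    return sort { $b <=> $a } @numbers;
}

my @sorted = sort_numeric_desc(10, 9, 100, 1);
print "@sorted\n";    # prints "100 10 9 1"
```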

    Be consistent with your use of indentation and white space. PerlTidy is a really good tool - use it.

    Avoid .*? in regular expressions. Instead use a negated character class: [^\n]*. Only capture stuff you actually want to use - it's more efficient and less confusing.
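
    One way to read the negated character class advice, as a sketch only: when the file is processed one line at a time, a header line can be split without any non-greedy match. The helper name parse_header_line and the sample text are assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: split a header line such as
# "MEDICAL HISTORY: asthma" into its title and the text after the colon.
# Returns an empty list when the line is not a header.
sub parse_header_line {
    my ($line) = @_;

    # [A-Z ]* cannot run past the colon and [^\n]* cannot run past the
    # end of the line, so no .*? is needed. Only the two parts that are
    # actually used are captured.
    if ($line =~ /^([A-Z][A-Z ]*):[ \t]*([^\n]*)/) {
        return ($1, $2);
    }
    return;
}

my ($title, $text) = parse_header_line("MEDICAL HISTORY: asthma since 1998\n");
print "$title => $text\n";    # prints "MEDICAL HISTORY => asthma since 1998"
```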


    True laziness is hard work
      Thank you Beth, GrandFather, and Lawliet for your great points and prompt reply. I will keep them in mind. Lawliet, I used the other two patterns in case I want to test the results(like adding $1\t to print) to make sure I am capturing the right content. Regarding "exit", I guess you are right and it's extra. Thank you all again for your input
Re: content extraction
by ELISHEVA (Prior) on Feb 23, 2009 at 20:45 UTC
    Your script would be significantly more efficient if you detected the start and end of each extraction region while you are reading in the file, something like this (assuming MEDICAL HISTORY: begins a line):
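    #!/usr/bin/perl
    use warnings;
    use strict;

    open IN, "input.txt" or die;
    open OUT, ">output.out" or die;

    my $sHistory   = '';
    my $bInHistory = 0;

    while (my $line=<IN>) {
        if ($line =~ /^MEDICAL HISTORY:(.*)$/) {
            # flush a previous section that ran straight into this header
            print OUT $sHistory if $bInHistory;
            $bInHistory = 1;
            $sHistory   = "$1\n";    # (.*) stops before the newline, so add it back
        } elsif ($line =~ /^[A-Z]/) {
            # an upper case letter in column one ends the section;
            # print it once, then reset so it is not printed again
            print OUT $sHistory if $bInHistory;
            $bInHistory = 0;
            $sHistory   = '';
        } elsif ($bInHistory) {
            $sHistory .= $line;
        }
    }
    print OUT $sHistory if $bInHistory;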
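
    The same line-by-line idea can be wrapped in a reusable routine. This is a sketch under stated assumptions: extract_section is a made-up name, the sample text is invented, and the routine returns only the first matching section, whereas the original regex loop would find every occurrence.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the first section named
# $header, read line by line from an already opened file handle.
sub extract_section {
    my ($fh, $header) = @_;
    my $text       = '';
    my $in_section = 0;

    while (my $line = <$fh>) {
        if ($line =~ /^\Q$header\E:[ \t]*([^\n]*)/) {
            # Header line: keep whatever follows the colon, if anything.
            $in_section = 1;
            $text .= "$1\n" if length $1;
        }
        elsif ($in_section && $line =~ /^[A-Z]/) {
            last;    # an upper case letter in column one ends the section
        }
        elsif ($in_section) {
            $text .= $line;
        }
    }
    return $text;
}

# Demonstration on an in-memory file handle with made-up sample text.
my $sample = "MEDICAL HISTORY: asthma\n  since 1998\nALLERGIES: none\n";
open my $fh, '<', \$sample or die "Can't open in-memory handle: $!";
print extract_section($fh, 'MEDICAL HISTORY');
close $fh;
```

    Passing the header as a parameter means other sections can be pulled out with the same routine; \Q...\E keeps any punctuation in the header from being treated as regex syntax.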

    Also it is a very good idea to start your script with the two lines:

    use strict;
    use warnings;

    as I did above. You will save yourself a world of debugging pain by doing so.

    Another point: the variables $a and $b have special meaning in Perl (sort uses them to hold the two values being compared), so it is best to stay away from those names and call your variables something else.

    And another point: always check for errors when you open file handles. Sometimes they don't open like you expect. If you don't check, you'll get strange results without any proper warning.

    Best, beth

    Update: fixed some bugs (including one pointed out in private msg by almut.)

Re: content extraction
by Lawliet (Curate) on Feb 23, 2009 at 20:44 UTC

    Hmm, why do you capture patterns that you never use?

    while ($content =~ /^MEDICAL HISTORY:(.+?)\n^[A-Z]/sgm) { print "$1\n"; }

    Also, why do you exit at the end of your script?

    And you didn't even know bears could type.
