Re: find acronyms in a text
by rmflow (Beadle) on Jul 23, 2009 at 10:29 UTC
|
You open the file with $FILE as the filehandle, but then read from FILE, which is not the same handle.
Try while (<$FILE>) instead.
| [reply] [d/l] |
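A minimal sketch of reading through the same lexical filehandle that was opened. The helper name read_lines is made up for illustration, and an in-memory filehandle (open on a scalar reference) is assumed so the snippet runs without a file on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines read from a lexical filehandle.
sub read_lines {
    my ($source) = @_;
    # Opening on a scalar reference gives an in-memory filehandle.
    open my $fh, '<', \$source or die "cannot open in-memory file: $!\n";
    my @lines;
    while (my $line = <$fh>) {    # read from the handle that was opened
        chomp $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

print "$_\n" for read_lines("first line\nsecond line\n");
```

With a real file, the only change would be passing the file name to open instead of the scalar reference.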
Re: find acronyms in a text
by davorg (Chancellor) on Jul 23, 2009 at 10:20 UTC
|
You say you're dealing with a Word document. Word docs are binary files, but you are treating this as a plain text file. That could potentially cause all sorts of problems.
| [reply] |
|
Use antiword to convert the .doc file to plain text first.
| [reply] |
Re: find acronyms in a text
by arun_kom (Monk) on Jul 23, 2009 at 11:29 UTC
|
This should work for text files.
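#!/usr/bin/perl -w
use strict;
my %acronyms_captured;    # keys are the acronyms seen so far
my $file = 'test.txt';
open FH, "<", $file or die "cannot open file '$file': $!\n";
while(<FH>){
# in list context, m//g returns every run of capitals on the line
my @words = $_ =~ m/\b[A-Z]+\b/g;
foreach(@words) { $acronyms_captured{$_} = undef; }
}
close(FH);
foreach(keys(%acronyms_captured)){ print "$_\n"; }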
| [reply] [d/l] |
|
my @words = m/\b([A-Z]+)\b/g;
foreach(@words) { $acronyms_captured{$_} = undef; }
Personally, I'd write that as:
@acronyms_captured{@words} = ();
I like hash slices a lot :-)
| [reply] [d/l] [select] |
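For readers new to hash slices, a minimal sketch of the idiom above. The sub name unique_words and the sample word list are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the distinct words, sorted.
sub unique_words {
    my @words = @_;
    my %seen;
    @seen{@words} = ();    # hash slice: each word becomes a key, values are undef
    return sort keys %seen;
}

print join(', ', unique_words(qw(GTF TFIIA GTF TFIIB))), "\n";
# prints "GTF, TFIIA, TFIIB"
```

The slice assignment does the same work as the foreach loop, in one statement.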
|
Well, I thought it didn't matter either way in this case, since whatever is matched here is what I want, but I guess adding capturing parentheses to make it explicit is better practice.
The same goes for using ( ) instead of undef. I am learning ... thanks.
| [reply] |
|
my @words = $_ =~ m/\b[A-Z]+\b/g;
You don't use split ?
| [reply] [d/l] |
|
Thanks a lot, but I don't understand the line my @words = $_ =~ m/\b[A-Z]+\b/g;
I used a regular expression here to capture every run of consecutive upper-case letters delimited by word boundaries (\b). In list context with the /g modifier, the match returns all such runs. Please check the
Perl regular expressions documentation (perlre).
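A minimal sketch of what that line does, assuming a made-up sample sentence. The sub name find_acronyms is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every run of capitals that stands as a whole word.
sub find_acronyms {
    my ($text) = @_;
    # List context plus /g: the match returns all matched substrings.
    my @words = $text =~ m/\b[A-Z]+\b/g;
    return @words;
}

my @found = find_acronyms("The GTF family includes TFIIA and TFIIB.");
print "@found\n";    # prints "GTF TFIIA TFIIB"
```

"The" is not matched because there is no word boundary between "T" and "h". Note that a single capital standing alone, such as "A" or "I", would match this pattern too.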
You don't use split ?
You could use split if you like, but I think it is better to split on \W+ (runs of non-word characters) rather than \s+. This keeps the pattern matching in the next step simple. For the sample text below, splitting on \s+ instead of \W+ would find no acronyms, because punctuation stays attached to the words (for example "(GTF)" and "TFIIA,"), unless we use a more complicated pattern later.
my %acronyms;
my $text = "An important class of transcription factors called general "
         . "transcription factors (GTF) are necessary for transcription to occur. "
         . "The Most common GTF's are TFIIA, TFIIB, and TFIIE.And a few more not "
         . "mentioned here.";
# split on runs of non-word characters so punctuation is dropped
my @words = split(/\W+/, $text);
foreach(@words) {
    # keep only the words made up entirely of capitals
    if($_ =~ m/^[A-Z]+$/){ $acronyms{$_}++; }
}
foreach( keys(%acronyms) ){ print "$_ seen $acronyms{$_} times\n"; }
| [reply] [d/l] [select] |
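To make the \s+ versus \W+ difference concrete, here is a minimal sketch that runs the same counting step with both patterns. The sub name count_acronyms and the shortened sample text are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: counts all-capital words after splitting on $pattern.
sub count_acronyms {
    my ($pattern, $text) = @_;
    my %count;
    for my $word (split $pattern, $text) {
        $count{$word}++ if $word =~ m/^[A-Z]+$/;
    }
    return %count;
}

my $text = "general transcription factors (GTF) are common. GTF's are TFIIA, TFIIB.";
my %by_space   = count_acronyms(qr/\s+/, $text);
my %by_nonword = count_acronyms(qr/\W+/, $text);

printf "split on \\s+ finds %d acronyms\n", scalar keys %by_space;      # 0
printf "split on \\W+ finds %d acronyms\n", scalar keys %by_nonword;    # 3
```

Splitting on \s+ leaves tokens such as "(GTF)" and "TFIIA," intact, so the anchored pattern ^[A-Z]+$ rejects all of them.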
Re: find acronyms in a text
by mzedeler (Pilgrim) on Jul 23, 2009 at 12:13 UTC
|
When using split, write split /\s+/ instead of split /\s/. Otherwise you'll get empty strings (not undef) every time there are two consecutive whitespace characters.
print join(', ', split /\s/, "just  three  words"), "\n";
print join(', ', split /\s+/, "just  three  words"), "\n";
| [reply] [d/l] [select] |
|
Better yet, split on a literal space " " (or give no pattern at all, which defaults to the same thing): that splits on whitespace of any length AND skips leading whitespace.
Your patterns can return an initial empty field when the line starts with whitespace.
-Greg
| [reply] |
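A minimal sketch comparing the three split forms discussed above on a line with leading and doubled whitespace. The sub name split_three_ways and the sample line are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: splits one line three ways and returns array refs.
sub split_three_ways {
    my ($line) = @_;
    return (
        [ split /\s/,  $line ],    # one field boundary per whitespace character
        [ split /\s+/, $line ],    # runs collapse, leading empty field remains
        [ split ' ',   $line ],    # runs collapse and leading whitespace is skipped
    );
}

my ($single, $run, $space) = split_three_ways("  just  three words");
print scalar(@$single), ": ", join('|', @$single), "\n";    # 6: ||just||three|words
print scalar(@$run),    ": ", join('|', @$run),    "\n";    # 4: |just|three|words
print scalar(@$space),  ": ", join('|', @$space),  "\n";    # 3: just|three|words
```

The literal ' ' is written directly in the split call rather than passed in a variable, because older perls only apply the special case to a literal pattern.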
Re: find acronyms in a text
by apl (Monsignor) on Jul 23, 2009 at 12:50 UTC
|
Please consider replacing
die "cannot open the file"; with
die "cannot open file '$file': $!\n";
This will enable you to see what file the user asked for, and why it couldn't be opened.
Obviously this isn't directly associated with your problem, or you would have seen your die. It is, however, a good habit to develop. (Just like use strict; use warnings;)
| [reply] [d/l] [select] |
|
Good advice ... If you also want to see which line number your script died on, leave out the \n after $! in the die statement.
| [reply] |
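A minimal sketch of the difference the trailing newline makes. The sub name failure_message and the file name are made up, and eval is used only so the script can print both messages instead of exiting.

```perl
use strict;
use warnings;

# Hypothetical helper: dies with $message inside eval and returns what die produced.
sub failure_message {
    my ($message) = @_;
    eval { die $message };
    return $@;
}

my $file = 'no_such_file.txt';    # hypothetical file name

# With a trailing newline the message is printed exactly as written.
print failure_message("cannot open file '$file'\n");

# Without it, perl appends " at SCRIPT line N." to the message.
print failure_message("cannot open file '$file'");
```

The second form is handy while debugging; the first gives a cleaner message for end users.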
Re: find acronyms in a text
by rovf (Priest) on Jul 23, 2009 at 10:16 UTC
|
$_ =~ s/\s+//;
my @Elements = split /\s/;
You first remove white space from the input (only the first run of it, since there is no /g modifier), and then try to split on white space? This doesn't make sense to me.
--
Ronald Fischer <ynnor@mm.st>
| [reply] [d/l] [select] |
|
Sorry, you are right, and I have updated my program.
I wanted to remove the newline character at the end of each line.
| [reply] |
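The thread does not show the updated program. As a sketch under that assumption, chomp is the usual way to drop the trailing newline, whereas s/\s+// removes the first run of whitespace wherever it occurs. Both sub names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: removes only the trailing newline.
sub strip_newline {
    my ($line) = @_;
    chomp $line;
    return $line;
}

# Hypothetical helper: what s/\s+// without /g actually does.
sub strip_first_whitespace {
    my ($line) = @_;
    $line =~ s/\s+//;    # deletes the first whitespace run only
    return $line;
}

print "[", strip_newline("TFIIA and TFIIB\n"), "]\n";    # prints "[TFIIA and TFIIB]"

# Here the space after "TFIIA" is removed and the newline is still there.
print "[", strip_first_whitespace("TFIIA and TFIIB\n"), "]\n";
```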