find acronyms in a text

by steph_bow (Pilgrim)
on Jul 23, 2009 at 10:09 UTC ( [id://782612]=perlquestion )

steph_bow has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Dear Monks, I would like to find all the acronyms (like HEDZ, AYTUN or YD) in my Word document, but my program does not seem to work and there is no error report in my STDERR file. What should I do?

use strict;
open(STDERR, ">>stderr.log" ) or die "cannot redirect stderr output in log file : $!\n";
my $file = $ARGV[0];
open my $FILE, q{<}, $file or die "cannot open the file";
my @acronyms_captured;
my %seen;
while(<FILE>){
    $_ =~ s/\s+$//;
    my @Elements = split /\s/;
    print STDOUT "@Elements\n";
    foreach my $el (@Elements) {
        if ($el =~ /[A-Z]+/){
            if ($seen{$el} == 0){
                push @acronyms_captured, $el;
                $seen{$el} ++ ;
            }
        }
    }
}
foreach my $acronym(@acronyms_captured){
    print STDOUT "$acronym\n";
}
update: $_ =~ s/\s+$//;

Replies are listed 'Best First'.
Re: find acronyms in a text
by rmflow (Beadle) on Jul 23, 2009 at 10:29 UTC
    You open the file with the lexical filehandle $FILE, but you read from the bareword handle FILE, which is not the same handle (and was never opened).

    Try while (<$FILE>) instead.
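A minimal sketch of the corrected read loop, for illustration only. The sub name read_lines and the in-memory sample text are assumptions, not part of the original program; the in-memory handle just stands in for the real input file. As far as I know, running the original with use warnings enabled would also have produced a "readline() on unopened filehandle FILE" warning, which is how this kind of slip usually gets noticed.

```perl
use strict;
use warnings;

# Read every line from an already-opened lexical filehandle.
sub read_lines {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {    # <$fh>, not <FH>: the handle lives in the scalar
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

# Hypothetical sample data; an in-memory filehandle replaces the real file.
my $text = "HEDZ is one\nAYTUN is another\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!\n";
print "$_\n" for read_lines($fh);
close $fh;
```

The same loop works unchanged with a handle opened on a real file name.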
Re: find acronyms in a text
by davorg (Chancellor) on Jul 23, 2009 at 10:20 UTC

    You say you're dealing with a Word document. Word docs are binary files, but you are treating this as a plain text file. That could potentially cause all sorts of problems.

    --

    See the Copyright notice on my home node.

    Perl training courses

      Thanks a lot

      I have taken your remark into account and copied the contents of the Word document into a text file, but it still does not work.

      update: it works now thanks to the remark of rmflow
        Use antiword to convert doc to text
Re: find acronyms in a text
by arun_kom (Monk) on Jul 23, 2009 at 11:29 UTC
    This should work for text files.
    #!/usr/bin/perl -w
    use strict;
    my %acronyms_captured;
    my $file = 'test.txt';
    open FH, "<", $file or die;
    while(<FH>){
        my @words = $_ =~ m/\b[A-Z]+\b/g;
        foreach(@words) {
            $acronyms_captured{$_} = undef;
        }
    }
    close(FH);
    foreach(keys(%acronyms_captured)){
        print "$_\n";
    }
      my @words = $_ =~ m/\b[A-Z]+\b/g;

      Aren't you missing capturing brackets there?

      my @words = m/\b([A-Z]+)\b/g;

      foreach(@words) { $acronyms_captured{$_} = undef; }

      Personally, I'd write that as:

      @acronyms_captured{@words} = ();

      I like hash slices a lot :-)
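For anyone comparing the two styles, here is a small sketch that combines both points. The sub name unique_acronyms and the sample sentence are made up for illustration. It relies on documented behaviour of m//g: in list context, a pattern without capture groups returns each whole match, so the version without brackets and the version with a single pair of brackets around the whole pattern give the same list.

```perl
use strict;
use warnings;

# Collect the distinct all-caps words in a string.
sub unique_acronyms {
    my ($text) = @_;
    # No capture groups: in list context, m//g returns each whole match.
    my @words = $text =~ m/\b[A-Z]+\b/g;
    my %seen;
    @seen{@words} = ();    # hash slice: one key per word, every value undef
    return sort keys %seen;
}

# Hypothetical sample sentence.
print join(', ', unique_acronyms('HEDZ met YD, then HEDZ again')), "\n";
# prints "HEDZ, YD"
```

The hash slice replaces the explicit foreach loop; duplicates collapse because hash keys are unique.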


        Well, I thought it didn't matter either way in this case, as whatever is matched here is what I want ... but I guess adding capturing brackets to make it explicit is better practice.
        ... and the same goes for using ( ) instead of undef.
        Am learning ... thanks

      Thanks a lot, but I don't understand the line

      my @words = $_ =~ m/\b[A-Z]+\b/g;

      You don't use split ?

        Thanks a lot, but I don't understand the line  my @words = $_ =~ m/\b[A-Z]+\b/g;

        I used a regular expression here to capture every run of consecutive upper-case letters delimited by word boundaries (\b). Please check the Perl regular expressions documentation.

        You don't use split ?

        You could use split if you like, but I think it is better to split on \W+ (non-word characters) rather than \s+. This keeps the pattern matching simple in the next step. For the sample text below, splitting on \s+ instead of \W+ would find no acronyms at all unless we did more complicated pattern matching afterwards.

        my %acronyms;
        my $text = "An important class of transcription factors called general "
                 . "transcription factors (GTF) are necessary for transcription to occur. "
                 . "The Most common GTF's are TFIIA, TFIIB, and TFIIE.And a few more not "
                 . "mentioned here.";
        my @words = split('\W+', $text);
        foreach(@words) {
            if($_ =~ m/^[A-Z]+$/){
                $acronyms{$_}++;
            }
        }
        foreach( keys(%acronyms) ){
            print "$_ seen $acronyms{$_} times\n";
        }
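To make the \s+ versus \W+ point concrete, here is a rough side-by-side sketch. The helper name all_caps_tokens and the shortened sample sentence are assumptions for illustration. Splitting on whitespace leaves punctuation glued to the tokens ("(GTF)", "TFIIA,"), so the anchored test ^[A-Z]+$ rejects them.

```perl
use strict;
use warnings;

# Split $text on $separator and keep only the all-upper-case tokens.
sub all_caps_tokens {
    my ($separator, $text) = @_;
    return grep { /^[A-Z]+$/ } split $separator, $text;
}

# Hypothetical, shortened version of the sample text.
my $text = "general transcription factors (GTF) such as TFIIA, TFIIB, and TFIIE.";

my @by_space   = all_caps_tokens(qr/\s+/, $text);
my @by_nonword = all_caps_tokens(qr/\W+/, $text);

print "split on \\s+ found ", scalar(@by_space), " acronyms\n";   # 0
print "split on \\W+ found: @by_nonword\n";   # GTF TFIIA TFIIB TFIIE
```

One side effect of \W+ worth knowing: an apostrophe is also a non-word character, so "GTF's" yields the tokens "GTF" and "s".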
Re: find acronyms in a text
by mzedeler (Pilgrim) on Jul 23, 2009 at 12:13 UTC

    When using split, write split /\s+/ instead of split /\s/. Otherwise you'll get an empty-string field every time there are two consecutive whitespace characters.

    print join(', ', split /\s/, "just three words"), "\n";
    print join(', ', split /\s+/, "just three words"), "\n";
      Better yet, specify a literal space " " (or nothing at all and it will default to the same thing) to split on any length whitespace AND skip leading whitespace. Your patterns have the potential to return initial null fields. -Greg
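A quick sketch of the three variants side by side, using a made-up input line with two leading spaces and one double space so the differences are visible. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical input: two leading spaces, and a double space after "just".
my $line = "  just  three words";

my @single = split /\s/,  $line;   # every whitespace character is a separator
my @plus   = split /\s+/, $line;   # runs collapse, but one leading empty field remains
my @awk    = split ' ',   $line;   # runs collapse and leading whitespace is skipped

print scalar(@single), ": ", join('|', @single), "\n";   # 6: ||just||three|words
print scalar(@plus),   ": ", join('|', @plus),   "\n";   # 4: |just|three|words
print scalar(@awk),    ": ", join('|', @awk),    "\n";   # 3: just|three|words
```

Note that the extra fields are empty strings, and that only the literal single-space string ' ' gets the special treatment; the pattern / / does not.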
Re: find acronyms in a text
by apl (Monsignor) on Jul 23, 2009 at 12:50 UTC
    Please consider replacing
         die "cannot open the file";
    with
         die "cannot open file '$file': $!\n";

    This will enable you to see what file the user asked for, and why it couldn't be opened.

    Obviously this isn't directly associated with your problem, or you would have seen your die. It is, however, a good habit to develop. (Just like use strict; use warnings;)

      good advice ... if you also want to see which line number your script died on, avoid placing \n after $! in the die statement.
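To see the difference the trailing newline makes, here is a small sketch that traps both messages with eval instead of actually killing the script. The helper death_message and the file name are hypothetical.

```perl
use strict;
use warnings;

# Run a code reference and return the message it died with ('' if it survived).
sub death_message {
    my ($code) = @_;
    eval { $code->(); 1 } and return '';
    return $@;
}

my $file = 'no-such-file.txt';    # hypothetical file name

my $with_newline    = death_message(sub { die "cannot open file '$file'\n" });
my $without_newline = death_message(sub { die "cannot open file '$file'" });

print $with_newline;       # the message exactly as written
print $without_newline;    # Perl appends " at <script> line <N>." and a newline
```

So the choice is a trade-off: end the message with \n for a clean user-facing error, leave it off while debugging to get the location for free.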
Re: find acronyms in a text
by rovf (Priest) on Jul 23, 2009 at 10:16 UTC
    $_ =~ s/\s+//; my @Elements = split /\s/;

    You are first removing all white space from the input, and then try to split on white space???? This doesn't make sense to me.

    -- 
    Ronald Fischer <ynnor@mm.st>
      You are first removing all white space from the input

      I thought that at first too. But he's only removing the first run of whitespace from each record he reads in.


        he's only removing the first run of whitespace

        Of course! No g modifier! I did not pay close attention...
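For the record, a short sketch of what each form of the substitution does to one line. The sample line and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical input line with leading, inner and trailing whitespace.
my $line = "  HEDZ and  YD \n";

(my $first_run = $line) =~ s/\s+//;     # removes only the first run (the leading spaces)
(my $every_run = $line) =~ s/\s+//g;    # /g removes every run of whitespace
(my $trailing  = $line) =~ s/\s+$//;    # anchored: removes only trailing whitespace

print "[$first_run]\n";    # "HEDZ and  YD \n" inside the brackets
print "[$every_run]\n";    # [HEDZandYD]
print "[$trailing]\n";     # [  HEDZ and  YD]
```

Only the /g form would have broken the later split on whitespace; the anchored form is the one that strips the line ending.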


      Sorry, you are right, and I have updated my program.

      I wanted to remove the new line character at the end of each line

        I wanted to remove the new line character at the end of each line

        See chomp.

        But does a Word document contain anything that Perl would recognise as a newline character? I don't know anything about that file format.
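On the chomp suggestion, a brief sketch of how it differs from the s/\s+$// approach, assuming the default input record separator ($/ is "\n"). The helper names chomped and trimmed are made up. The difference matters for text saved with Windows line endings, which is plausible for text copied out of Word.

```perl
use strict;
use warnings;

# Return a copy of the string with one trailing record separator removed.
sub chomped {
    my ($s) = @_;
    chomp $s;
    return $s;
}

# Return a copy of the string with all trailing whitespace removed.
sub trimmed {
    my ($s) = @_;
    $s =~ s/\s+$//;
    return $s;
}

my $dos_line = "AYTUN\r\n";    # hypothetical line with a CRLF ending

print length(chomped($dos_line)), "\n";    # 6: the "\r" is still attached
print length(trimmed($dos_line)), "\n";    # 5: "\r\n" removed together
```

chomp is the safer default when trailing spaces are meaningful; the substitution is more forgiving when the origin of the file is unknown.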

