G'day cinnamond,
Welcome to the Monastery.
[Aside:
I see that it's already been pointed out that you've missed providing us with some details;
the guidelines in "How do I post a question effectively?" have some more information about that.
Otherwise, a well-presented first post: thank you.
Also, although it's not often directly related to the problem at hand,
telling us your O/S and Perl version can result in a better answer from us
(e.g. we might suggest a better, more recent Perl feature if we know you have a version that supports it).]
Your code for getting the input file is fine. You might consider adding a second question to get an output filename.
I put together what I considered a fairly challenging input file (lots of edge/corner cases) and hard-coded the filename.
$ cat test_input.txt
Hello, world!
I said, "Hello, world!".
Did he say "Hello, world!"?
We're not sure.
1tab: 2tabs: 3tabs: END-TABS
# multi-spacing here - blank line next
The cat sat on the mat.
Old pronouns: thou; thee; thy; thine.
New pronouns: "u", 'ur'.
Forecastle = "fo'c'sle" or "fo'c's'le"
Forecastle = 'fo'c'sle' or 'fo'c's'le'
Don't hide the Very pistol; it could be very important.
Why exclude different but include same?
I hadn't used Lingua::StopWords previously so I read the documentation.
To be honest, I found it lacking in a number of respects: you can't add new stopwords;
you can't remove existing stopwords that you don't want;
you either specify a UTF-8 encoding or take whatever they give you.
Take a look at the various language plugins in the Lingua-StopWords distribution
if you haven't already done so.
I added a _mod_stops() routine (in the code below) to address some of those issues;
you can modify/extend that if you have other requirements.
Working with CSV files has many gotchas: how to handle a field containing the separator character;
how to quote a field containing a quote character; and so on.
Writing your own code for this, unless as an academic exercise, is ill-advised.
The Text::CSV module is robust, thoroughly tested, and addresses these issues:
I strongly recommend you use it.
It runs faster if you also have Text::CSV_XS installed, but that's optional.
If you make a mistake like trying to use a string, instead of a single character, as a separator
(as you did in your posted code) it will tell you about it.
In the script below, I've included code to use Text::CSV: as you can see, it's very straightforward.
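To make the gotchas mentioned above a little more concrete, here's a minimal standalone sketch (separate from the script below; the sample fields are made up for illustration). It uses the default comma separator and shows Text::CSV quoting a field that contains the separator and doubling an embedded quote character, then parsing the line back into the original fields.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

# Default settings: comma separator, double-quote as quote and escape character.
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot create Text::CSV object: " . Text::CSV->error_diag;

# Hypothetical sample data: one plain field, one containing the separator,
# and one containing the quote character.
my @fields = ('plain', 'a,b', 'say "hi"');

# combine() builds one CSV line; string() returns it.
$csv->combine(@fields) or die "combine failed: " . $csv->error_diag;
my $line = $csv->string;
print "$line\n";    # plain,"a,b","say ""hi"""

# parse() and fields() reverse the process.
$csv->parse($line) or die "parse failed: " . $csv->error_diag;
my @round_trip = $csv->fields;
print join('|', @round_trip), "\n";    # plain|a,b|say "hi"
```

Getting that quoting and unquoting right by hand, for every edge case, is exactly the work the module saves you.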
Your example code shows using 'en' (English); I don't know if you have requirements for other languages.
I hard-coded $lang but created a lookup table for language regexes, %word_re_for.
That shows you some options; adapt according to your needs.
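If it helps to see what the 'en' regex does in isolation, here's a small sketch that applies it to a few whitespace-separated tokens of the kind found in my test input. The token list and the '(skipped)' marker are just for illustration; the regex itself is the one used in the script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same pattern as the 'en' entry in %word_re_for.
my $word_re = qr{^.*?\b([\p{Alnum}']*[\p{Alnum}]+).*$};

# Illustrative tokens: leading/trailing punctuation, embedded apostrophes,
# a hyphenated token, and a token with no word characters at all.
my @tokens = ('"Hello,', q{'fo'c'sle'}, 'END-TABS', '#');

my @results;
for my $token (@tokens) {
    # $1 holds the first run of alphanumerics (apostrophes allowed inside).
    my $result = $token =~ $word_re ? lc($1) : '(skipped)';
    push @results, $result;
    print "$token => $result\n";
}
```

Surrounding punctuation is stripped, internal apostrophes are kept, only the part before a hyphen is captured, and a token with no alphanumerics doesn't match at all.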
I split the I/O parts of the code into two anonymous blocks.
This means that filehandles are only open for the time they're needed.
Perl automatically closes them at the end of those blocks: no need for close() statements.
Perl also does the I/O exception handling for you via the autodie pragma:
no need for '... or die "Can't whatever: $!";' all over the place.
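Here's a rough standalone sketch of both points, using a throwaway filename (scope_demo.txt) and a deliberately nonexistent path that I made up for the demonstration. The eval is only there so the sketch can show the exception instead of dying; in the real script you'd normally just let it die.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;

my $file = 'scope_demo.txt';    # hypothetical scratch file

{
    open my $fh, '>', $file;    # autodie throws if this fails
    print {$fh} "hello\n";
}   # $fh goes out of scope here: the file is flushed and closed

my $read_back;
{
    open my $fh, '<', $file;
    $read_back = <$fh>;
}   # closed again, no explicit close() needed

print $read_back;

# A failing open raises an exception rather than returning false.
my $ok  = eval { open my $fh, '<', 'no_such_dir/no_such_file.txt'; 1 };
my $err = $@;
print "open failed with an exception of class ", ref($err), "\n" unless $ok;

unlink $file;    # tidy up the scratch file
```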
I'll also just mention that fc() is preferred over uc() and lc()
when canonicalising strings for comparison.
It requires Perl v5.16 or later; not knowing your Perl version, I didn't use it.
(Refer back to the "Aside" at the top.)
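For what it's worth, here's a small sketch of the difference, assuming Perl v5.16 or later. The two strings are just ones I picked because lowercasing alone doesn't make them equal; they have nothing to do with your data.

```perl
#!/usr/bin/env perl
use v5.16;       # enables strict, say and the fc feature
use warnings;
use utf8;        # the source contains a non-ASCII string literal

# Illustrative pair: German sharp s versus its uppercase "SS" spelling.
my ($x, $y) = ('Straße', 'STRASSE');

# lc() leaves the sharp s alone, so the strings still differ.
my $lc_equal = lc($x) eq lc($y) ? 1 : 0;

# fc() applies full Unicode case folding, mapping the sharp s to "ss".
my $fc_equal = fc($x) eq fc($y) ? 1 : 0;

say "lc equal: $lc_equal";    # lc equal: 0
say "fc equal: $fc_equal";    # fc equal: 1
```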
Here's the code.
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use Lingua::StopWords 'getStopWords';
use Text::CSV;

my ($lang, $encoding) = qw{en UTF-8};

my %word_re_for = (
    en => qr{^.*?\b([\p{Alnum}']*[\p{Alnum}]+).*$},
);

my ($in_file, $out_file) = qw{test_input.txt test_output.csv};

my $is_stop = _mod_stops(getStopWords($lang, $encoding));

my %count_for;

{
    open my $fh, '<:encoding(UTF-8)', $in_file;

    while (<$fh>) {
        TOKEN: for my $token (split) {
            next TOKEN unless $token =~ $word_re_for{$lang};
            my $word = lc $1;
            next TOKEN if $is_stop->{$word};
            ++$count_for{$word};
        }
    }
}

{
    open my $fh, '>:encoding(UTF-8)', $out_file;
    my $csv = Text::CSV::->new({sep_char => "\t", binary => 1});
    $csv->say($fh, [$_, $count_for{$_}]) for sort keys %count_for;
}

sub _mod_stops {
    my ($stops) = @_;

    my @adds = qw{thou thee thy thine u ur};
    my @dels = qw{very same};

    $stops->{$_} = 1 for @adds;
    delete @$stops{@dels};

    return $stops;
}
Here's the output.
$ cat test_output.csv
1tab 1
2tabs 1
3tabs 1
blank 1
cat 1
different 1
end 1
exclude 1
fo'c's'le 2
fo'c'sle 2
forecastle 2
hello 3
hide 1
important 1
include 1
line 1
mat 1
multi 1
new 1
next 1
old 1
pistol 1
pronouns 2
said 1
same 1
sat 1
say 1
sure 1
very 2
world 3