Hello LexPl and welcome to the Monastery and to Perl in general.
One simple approach to your task would be to construct an array of your regex patterns, read your data file into a scalar as a string and then loop over the array and count the matches of each one in the string. To count matches in a string you can use this form:
my $count =()= $string =~ /regex/gs;
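A minimal sketch of how that idiom works, using an invented sample string: the empty list forces the match into list context, and assigning that list to a scalar yields the number of elements, that is, the number of matches. The helper name count_matches is hypothetical and only there for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count non-overlapping matches of $regex in $string.
sub count_matches {
    my ($string, $regex) = @_;
    # The empty list forces list context on the match; assigning the
    # resulting list to a scalar gives the number of matches found.
    my $count = () = $string =~ /$regex/g;
    return $count;
}

# Sample data invented for illustration.
my $sample = "Art. 5 refers to Art. IX\nand to Art. 12.";
print count_matches($sample, qr/Art\.\s*[0-9IVX]/), "\n";   # prints 3
```

Note that if the pattern contains capture groups, the list holds the captured substrings rather than the whole matches, so the count would be inflated when there is more than one group.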
These patterns often also contain ISO entities.
It isn't entirely clear to me precisely what you mean by this. Could you elaborate? It may or may not have any bearing on the task.
(Updated: typo fix - thanks, choroba.)
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use experimental qw( signatures );
use utf8;
use XML::LibXML;
use Encode qw{ encode };
sub create_xml($xml) {
    open my $out, '>:encoding(UTF-8)', $xml or die $!;
    print {$out} <<~'__XML__';
        <?xml version="1.0"?>
        <!DOCTYPE root [
        <!ENTITY sect "§">
        ]>
        <root link="Art.VV">
        A § 1 A
        B Art.XVI B
        C § 9 C
        D § 7 D
        E § 6 E
        <!-- Should comments be included in statistics? Art.XXX -->
        <?print "Should processing instructions be included?" Art.2 ?>
        </root>
        __XML__
}
sub validate_xml($xml) {
    my $dom = 'XML::LibXML'->load_xml(location => $xml);
    print $dom;
}
sub generate_statistics($xml) {
    my @regexes = (qr/§\s*[0-9]/, qr/Art\.\s*[0-9IVX]/);
    open my $in, '<:encoding(UTF-8)', $xml or die $!;
    my $string = do { local $/; <$in> };
    my @tally;
    for my $i (0 .. $#regexes) {
        my $regex = $regexes[$i];
        ++$tally[$i] while $string =~ /$regex/g;
    }
    for my $i (0 .. $#regexes) {
        say encode('UTF-8', "$regexes[$i]:\t$tally[$i]");
    }
}
my $xml = '1.xml';
create_xml($xml);
validate_xml($xml);
generate_statistics($xml);
unlink $xml;
The output:
<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY sect "§">
]>
<root link="Art.VV">
A § 1 A
B Art.XVI B
C § 9 C
D § 7 D
E § 6 E
<!-- Should comments be included in statistics? Art.XXX -->
<?print "Should processing instructions be included?" Art.2 ?>
</root>
(?^u:§\s*[0-9]): 1
(?^:Art\.\s*[0-9IVX]): 4
You see? Only one section mark was counted, while the attribute, comment, and processing instruction all were. Probably not what you want.
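One possible way around this, sketched here under assumptions rather than offered as the definitive fix, is to let XML::LibXML parse the document and run the regexes against the text content only. The helper name count_in_text and the sample XML are invented for illustration; the sketch assumes that attribute values, comments, and processing instructions should be excluded from the statistics.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: count matches of $regex in the text content only.
sub count_in_text {
    my ($xml_string, $regex) = @_;
    my $dom = XML::LibXML->load_xml(
        string          => $xml_string,
        expand_entities => 1,
    );
    # textContent concatenates the text below the root element. Entity and
    # character references arrive already decoded, while comments,
    # processing instructions and attribute values are not part of it.
    my $text  = $dom->documentElement->textContent;
    my $count = () = $text =~ /$regex/g;
    return $count;
}

# Invented sample: three spellings of the section mark, plus "Art." in an
# attribute, in running text, and in a comment.
my $sample_xml = <<'__XML__';
<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY sect "&#xA7;">
]>
<root link="Art.VV">
A &sect; 1 A
B Art.XVI B
D &#xA7; 7 D
E &#167; 6 E
<!-- Art.XXX -->
</root>
__XML__

print count_in_text($sample_xml, qr/\x{A7}\s*[0-9]/), "\n";
print count_in_text($sample_xml, qr/Art\.\s*[0-9IVX]/), "\n";
```

Whether comments and attributes belong in the statistics depends on the task, so treat the choice of textContent as one option among several.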
Update: Included §.
Update 2: Print the XML to show how some representations of the section mark are equivalent.
Update 3: Added an attribute.
Hi,
Thanks for your kind welcome!
Let me try to circumscribe what you told me so that I understand it correctly. I would put my regexes in an array, e.g.
my @regexes = (§\s*[0-9], Art\.\s*[0-9IVX], ...)
Is this what you meant?
Then how do I read "a data file into a scalar as a string"? Is it just my $file = 'fname.xml'?
Normally I use a file handle like this
my $infile = $ARGV[0];
open(IN, '<' . $infile) or die $!;
Which kind of loop construct do you think of?
With regard to the ISO entities, &sect;, which stands for the "§" symbol, is an example of what I meant.
my @regexes = (qr/§\s*[0-9]/, qr/Art\.\s*[0-9IVX]/, ...)
Then how do I read "a data file into a scalar as a string"?
Mostly the way you say you normally do it, but making sure either to concatenate the lines or to read them all at once. Modules such as Path::Tiny and File::Slurper can help with this. See lots more about this in the Illumination "How do I read an entire file into a string?"
my $infile = $ARGV[0];
open my $inh, '<', $infile or die "Cannot open $infile for reading: $!";
my $xml;
{
    local $/ = undef;
    $xml = <$inh>;
}
close $inh;
Which kind of loop construct do you think of?
I was thinking of a for loop, as that is the trivial way to iterate over an array unless there is a good reason to use something else (which does not appear to be the case here).
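As a rough sketch of such a loop, assuming the file contents are already in a string: the helper name tally_patterns, the second pattern, and the sample text are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return one match count per pattern, in order.
sub tally_patterns {
    my ($text, @regexes) = @_;
    my @tally;
    for my $regex (@regexes) {
        # Count the matches of this pattern in list context.
        my $count = () = $text =~ /$regex/g;
        push @tally, $count;
    }
    return @tally;
}

# Invented patterns and sample data.
my @regexes = (qr/Art\.\s*[0-9IVX]/, qr/Sec\.\s*[0-9]/);
my $text    = "Art. 1, Art. IV and Sec. 7";
my @tally   = tally_patterns($text, @regexes);
print "$regexes[$_]:\t$tally[$_]\n" for 0 .. $#regexes;
```

Keeping the counts in an array parallel to the pattern array preserves the order of the patterns; a hash keyed on a label per pattern would work just as well.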
Thanks for clarifying about the entities. Those should be fine as they are just data. You may need to escape any characters which have special meaning to the regular expression engine but otherwise they should not cause any problems. Try it and see how you get along.
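If a search string has to be matched literally, one way to do the escaping is \Q...\E (or the equivalent quotemeta function). The sketch below is only an illustration; the helper name count_literal and the sample strings are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: count literal occurrences of $literal in $text.
sub count_literal {
    my ($text, $literal) = @_;
    # \Q...\E escapes every regex metacharacter in $literal, so the dot
    # and the parentheses below are matched as plain characters.
    my $count = () = $text =~ /\Q$literal\E/g;
    return $count;
}

# Invented sample: "Art.(1)" contains a dot and parentheses.
my $text = 'See Art.(1) and Art.(1), but not ArtX1.';
print count_literal($text, 'Art.(1)'), "\n";   # prints 2
```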