Re^2: Entity statistics

by LexPl (Sexton)
on Nov 08, 2024 at 14:13 UTC (id://11162604)


in reply to Re: Entity statistics
in thread Entity statistics

Hi,

Thanks for your kind welcome!

Let me try to paraphrase what you told me so that I understand it correctly. I would put my regexes in an array, e.g.

my @regexes = (§\s*[0-9], Art\.\s*[0-9IVX, ...)

Is this what you meant?

Then how do I read "a data file into a scalar as a string"? Is it just my $file = 'fname.xml'?

Normally I use a file handle like this

my $infile = $ARGV[0];
open(IN, '<' . $infile) or die $!;

Which kind of loop construct do you think of?

With regard to the ISO entities: &sect;, which stands for the "§" symbol, is an example of what I meant.

Re^3: Entity statistics
by hippo (Archbishop) on Nov 08, 2024 at 14:52 UTC
    > my @regexes = (&sect;\s*[0-9], Art\.\s*[0-9IVX, ...)

    Like that, except that each regex needs to be delimited in some way; otherwise Perl will try to parse it as ordinary code. You can either enclose them in quotes or mark them as regexes by using the qr// operator like this:

    my @regexes = (qr/&sect;\s*[0-9]/, qr/Art\.\s*[0-9IVX]/, ...)
    > Then how do I read "a data file into a scalar as a string"?

    Mostly the way you said you normally do it, but be sure either to concatenate each line or to read them all at once. There are modules which can help with this, such as Path::Tiny, File::Slurper and so on. See lots more about this in the Illumination "How do I read an entire file into a string?"

    my $infile = $ARGV[0];
    open my $inh, '<', $infile or die "Cannot open $infile for reading: $!";
    my $xml;
    {
        local $/ = undef;
        $xml = <$inh>;
    }
    close $inh;
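To make the two options mentioned above concrete (concatenating line by line versus reading everything at once), here is a minimal sketch using only core modules. The temporary file, its contents, and the sub names slurp_by_lines and slurp_at_once are invented for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway file so the example is self-contained.
# The content is made up for illustration.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "<p>&sect; 1</p>\n<p>Art. 2</p>\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Variant 1: read line by line and concatenate (hypothetical helper).
sub slurp_by_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path for reading: $!";
    my $content = '';
    while (my $line = <$fh>) {
        $content .= $line;
    }
    close $fh;
    return $content;
}

# Variant 2: undefine the input record separator so that a single
# read returns the whole file (hypothetical helper).
sub slurp_at_once {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path for reading: $!";
    local $/ = undef;
    my $content = <$fh>;
    close $fh;
    return $content;
}

my $by_lines = slurp_by_lines($tmp_name);
my $at_once  = slurp_at_once($tmp_name);
print "Both variants agree\n" if $by_lines eq $at_once;
```

Either way the whole document ends up in one scalar, which is what lets a regex match span line breaks.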
    > Which kind of loop construct do you think of?

    I was thinking of a for loop, as that is the simplest way to iterate over an array unless there is a good reason to use something else (which does not appear to be the case here).

    Thanks for clarifying about the entities. Those should be fine as they are just data. You may need to escape any characters which have special meaning to the regular expression engine but otherwise they should not cause any problems. Try it and see how you get along.
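As a rough illustration of the escaping point, the sketch below counts matches of a literal string with and without escaping. The sample text and variable names are made up; quotemeta and \Q...\E are standard Perl.

```perl
use strict;
use warnings;

# Made-up sample text: two real "Art.5" plus one look-alike "ArtX5".
my $text    = 'Art.5 ArtX5 Art.5';
my $literal = 'Art.5';

# Unescaped: the dot is a metacharacter and matches any character.
my $loose = () = $text =~ /$literal/g;

# Escaped with \Q...\E: the dot only matches a literal dot.
my $exact = () = $text =~ /\Q$literal\E/g;

# quotemeta returns the escaped string itself.
my $escaped = quotemeta $literal;

print "unescaped: $loose, escaped: $exact, quotemeta: $escaped\n";
```

An entity such as &sect; contains no regex metacharacters, so it can be used as-is; the dot in "Art." is the kind of character that needs the backslash.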


    🦛

      First of all, many thanks for the helpful assistance and good advice from @choroba and @hippo!

      I have taken up your input and built the following script:

      #!/usr/bin/perl
      use warnings;
      use strict;
      use diagnostics;

      my $infile = $ARGV[0];

      my @regexes = (qr/&sect;\s*[0-9]/, qr/Art\.\s*[0-9IVX]/, qr/Artikel\s*[0-9IVX]/, qr/Artikels\s*[0-9IVX]/, qr/Artikeln\s*[0-9IVX]/);

      open my $in, '<', $infile or die "Cannot open $infile for reading: $!";

      my $xml;
      {
          local $/ = undef;
          $xml = <$in>;
      }

      my $tally;

      for my $i (0 .. $#regexes) {
          my $regex = $regexes[$i];
          ++$tally[$i] while $xml =~ /$regex/g;
      }

      for my $i (0 .. $#regexes) {
          print "$regexes[$i]:\t$tally[$i]\n";
      }

      close $in;

      With use strict; I get the following error message:

      Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 24.
      Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 28.
      Execution of monk2.pl aborted due to compilation errors (#1)
          (F) You've said "use strict" or "use strict vars", which indicates
          that all variables must either be lexically scoped (using "my" or "state"),
          declared beforehand using "our", or explicitly qualified to say
          which package the global variable is in (using "::").

      Uncaught exception from user code:
          Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 24.
          Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 28.
          Execution of monk2.pl aborted due to compilation errors.

      As the variable $tally is defined beforehand and preceded by the keyword "my", I don't understand what is wrong. How could I fix this?

      If I run the same script without use strict;, the output looks like this:

      (?^:&sect;\s*[0-9]): 3
      (?^:Art\.\s*[0-9IVX]): 2
      (?^:Artikel\s*[0-9IVX]): 2
      (?^:Artikels\s*[0-9IVX]): 2
      (?^:Artikeln\s*[0-9IVX]): 2

      How could I get rid of "(?^:" and ")"? Would it be possible to save this output to a file?

      Have a nice afternoon!

        > As the variable $tally is defined beforehand and preceded by the keyword "my", I don't understand what is wrong. How could I fix this?

        The scalar variable $tally is a different variable from the array @tally. Individual elements of an array are written with a dollar sigil and an index in square brackets (e.g. $tally[$i]), but they are still elements of the array @tally. So, you need to declare the array:

        my @tally;
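For reference, here is a minimal sketch of the counting loop with the array declared, which should compile under use strict. The sample string stands in for the slurped file and is made up. Pre-filling the counters with zeros is an addition of this sketch, not part of the original script; it avoids "uninitialized value" warnings for patterns that never match.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the slurped XML file.
my $xml = '<p>&sect; 12 and &sect;3, see Art. 5 and Artikel IV.</p>';

my @regexes = (
    qr/&sect;\s*[0-9]/,
    qr/Art\.\s*[0-9IVX]/,
    qr/Artikel\s*[0-9IVX]/,
);

# Declare the array (not a scalar) and start every counter at zero.
my @tally = (0) x @regexes;

for my $i (0 .. $#regexes) {
    my $regex = $regexes[$i];
    # Each successful /g match in scalar context advances through $xml.
    ++$tally[$i] while $xml =~ /$regex/g;
}

for my $i (0 .. $#regexes) {
    print "$regexes[$i]:\t$tally[$i]\n";
}
```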

        > How could I get rid of "(?^:" and ")"?

        One possibility is to use a regex:

        for my $i (0 .. $#regexes) {
            my $regex = $regexes[$i];
            $regex =~ s/^\(\?\^://;
            $regex =~ s/\)$//;
            print "$regex:\t$tally[$i]\n";
        }

        > Would it be possible to save this output to a file?

        The easiest way is to use redirection in your shell; it should work even on MSWin.

        perl script.pl > output.txt

        If you want to write to a file from within Perl, open a file for writing and print to it:

        open my $out, '>', 'output.txt' or die $!;
        for my $i (0 .. $#regexes) {
            my $regex = $regexes[$i];
            $regex =~ s/^\(\?\^://;
            $regex =~ s/\)$//;
            print {$out} "$regex:\t$tally[$i]\n";
        }

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        > As the variable $tally is defined beforehand and preceded by the keyword "my", I don't understand what is wrong. How could I fix this?

        You have declared $tally, which is a scalar, but the errors are telling you about @tally, which is an array. Since your loops refer to the array and not the scalar, the array is what you need to declare instead. See the tutorial "the basic datatypes, three" for more about the basic data types in Perl and how the sigils relate to them.
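A small sketch of the distinction, with made-up values: $tally and @tally can coexist as two unrelated variables, and $tally[1] belongs to the array even though it is written with a dollar sign.

```perl
use strict;
use warnings;

my $tally = 'I am a scalar';   # the scalar variable $tally
my @tally = (10, 20, 30);      # a separate array variable @tally

# $tally[1] uses the $ sigil because a single element is a scalar value,
# but the square brackets mean it comes from the array @tally.
my $second = $tally[1];

# An array in scalar context yields its number of elements.
my $count = scalar @tally;

print "scalar: $tally\n";
print "element: $second\n";
print "count: $count\n";
```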

        > How could I get rid of "(?^:" and ")"?

        You could process the string which you actually output to achieve this, but in this particular case you can avoid that by using quotes to delimit each regex in the first place instead of using the qr// operator. You can use single quotes 'foo' or q/foo/ for non-interpolated strings, i.e.:

        my @regexes = (q/&sect;\s*[0-9]/, q/Art\.\s*[0-9IVX]/, q/Artikel\s*[0-9IVX]/, q/Artikels\s*[0-9IVX]/, q/Artikeln\s*[0-9IVX]/);

        Bear in mind that these are now just simple strings, so you need to take care to use them explicitly in a regex context. But as that is what the rest of your code does anyway, no further change is required here.
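To show the difference side by side, the sketch below uses the same pattern once as a qr// object and once as a q// string. The sample text and variable names are made up. Both count the same matches when interpolated into a match; only the printed label differs, since the qr// object stringifies with the "(?^:" wrapper seen in the output earlier in this thread (on reasonably recent perls).

```perl
use strict;
use warnings;

my $text = 'Art. 1, Art. 2';   # made-up sample text

my $compiled = qr/Art\.\s*[0-9IVX]/;   # compiled regex object
my $plain    = q/Art\.\s*[0-9IVX]/;    # plain, non-interpolated string

# Both can be interpolated into a match; count the matches of each.
my $n_compiled = () = $text =~ /$compiled/g;
my $n_plain    = () = $text =~ /$plain/g;

# The qr// object prints with its wrapper, the string prints as typed.
print "$compiled:\t$n_compiled\n";
print "$plain:\t$n_plain\n";
```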

        > Would it be possible to save this output to a file?

        Of course. See e.g. Re: How do I write to a file?

        Do have a browse through the Tutorials section here and the Getting Started with Perl section in particular. These should help you achieve some of these simple tasks while you become more familiar with the language.


        🦛
