PerlMonks  

Entity statistics

by LexPl (Sexton)
on Nov 08, 2024 at 13:07 UTC ( [id://11162602]=perlquestion )

LexPl has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm a Perl newbie. My aim is to generate statistics listing the number of occurrences of each member of a large group of regex patterns in an XML file. These patterns often also contain ISO entities. I would be very obliged if you could give me some advice on how to start on this endeavour.

Examples of such patterns would be "§\s*[0-9]" or "Art\.\s*[0-9IVX]".

Many thanks in advance!

Replies are listed 'Best First'.
Re: Entity statistics
by hippo (Archbishop) on Nov 08, 2024 at 13:56 UTC

    Hello LexPl and welcome to the Monastery and to Perl in general.

    One simple approach to your task would be to construct an array of your regex patterns, read your data file into a scalar as a string and then loop over the array and count the matches of each one in the string. To count matches in a string you can use this form:

    my $count =()= $string =~ /regex/gs;
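
A minimal sketch of how the counting idiom above behaves, assuming a small sample string. The helper name `count_matches` and the sample text are hypothetical and only serve as illustration. Assigning the match to an empty list forces list context, so `/g` returns every match, and assigning that list assignment to a scalar yields the number of elements.

```perl
use strict;
use warnings;

# Count how many times a pattern matches in a string.
sub count_matches {
    my ($string, $regex) = @_;
    # List context makes /g return all matches; the scalar assignment of
    # the list assignment yields the number of elements on the right.
    my $count = () = $string =~ /$regex/g;
    return $count;
}

my $sample = "Art. 1, Art. IV and Art.12";   # hypothetical sample text
print count_matches($sample, qr/Art\.\s*[0-9IVX]/), "\n";   # prints 3
```

One caveat: if a pattern contains capturing groups, `/g` in list context returns the captured substrings rather than the whole matches, so the count is multiplied by the number of groups. Non-capturing groups `(?:...)` or a `while` loop with a counter avoid that.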

    These patterns often also contain ISO entities.

    It isn't entirely clear to me precisely what you mean by this. Could you elaborate? It may or may not have any bearing on the task.

    (Updated: typo fix - thanks, choroba.)


    🦛

      It might be harder than it seems.
      1. XML can contain (among other things) attributes, comments, and processing instructions. Are you sure you want to include their contents in the statistics?
      2. ISO entities are not part of XML itself. There is probably a DTD that defines them, but because they are not standard in XML, it might be hard to process them properly (and the DTD might define them in a non-standard way). Moreover, the section mark can also be included in XML as &#167; (or &#xA7;, or a literal §), and Art can be represented as &#65;rt, for example.

      See this example (using PRE instead of CODE to include the section mark):


      #!/usr/bin/perl
      use warnings;
      use strict;
      use feature qw{ say };
      use experimental qw( signatures );
      use utf8;
      
      use XML::LibXML;
      use Encode qw{ encode };
      
      sub create_xml($xml) {
          open my $out, '>:encoding(UTF-8)', $xml or die $!;
          print {$out} <<~'__XML__';
          <?xml version="1.0"?>
          <!DOCTYPE root [
              <!ENTITY sect "§">
          ]>
          <root link="Art.VV">
              A &sect; 1 A
              B Art.XVI B
              C §  9 C
              D &#xa7; 7 D
              E &#167; 6 E
              <!-- Should comments be included in statistics? Art.XXX -->
              <?print "Should processing instructions be included?" Art.2 ?>
          </root>
          __XML__
      }
      
      sub validate_xml($xml) {
          my $dom = 'XML::LibXML'->load_xml(location => $xml);
          print $dom;
      }
      
      sub generate_statistics($xml) {
          my @regexes = (qr/§\s*[0-9]/, qr/Art\.\s*[0-9IVX]/);
      
          open my $in, '<:encoding(UTF-8)', $xml or die $!;
          my $string = do { local $/; <$in> };
          my @tally = (0) x @regexes;    # start at zero so a pattern without matches prints 0
          for my $i (0 .. $#regexes) {
              my $regex = $regexes[$i];
              ++$tally[$i] while $string =~ /$regex/g;
          }
          for my $i (0 .. $#regexes) {
              say encode('UTF-8', "$regexes[$i]:\t$tally[$i]");
          }
      }
      
      my $xml = '1.xml';
      create_xml($xml);
      validate_xml($xml);
      generate_statistics($xml);
      unlink $xml;
      

      The output:
      <?xml version="1.0"?>
      <!DOCTYPE root [
      <!ENTITY sect "§">
      ]>
      <root link="Art.VV">
          A &#xA7; 1 A
          B Art.XVI B
          C &#xA7;  9 C
          D &#xA7; 7 D
          E &#xA7; 6 E
          <!-- Should comments be included in statistics? Art.XXX -->
          <?print "Should processing instructions be included?" Art.2 ?>
      </root>
      (?^u:§\s*[0-9]):	1
      (?^:Art\.\s*[0-9IVX]):	4
      

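      You see? Only the literal section mark was counted; the entity and the two character references were missed, while the attribute, the comment, and the processing instruction were all counted as articles. Probably not what you want.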
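
One possible way around this, sketched under assumptions rather than offered as the definitive fix: let XML::LibXML parse the document and match against the text content only. The helper name `count_in_text` and the sample document are hypothetical. The sample is pure ASCII and the pattern uses `\x{A7}` for the section mark, so no source-encoding questions arise.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: the section mark appears only as an entity or as a
# character reference; attribute, comment and PI contain "Art." strings.
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE root [
    <!ENTITY sect "&#xA7;">
]>
<root link="Art.VV">
    A &sect; 1 A
    B Art.XVI B
    D &#xa7; 7 D
    E &#167; 6 E
    <!-- Art.XXX -->
    <?print Art.2 ?>
</root>
XML

# Count matches of each regex in the text content of the document only.
sub count_in_text {
    my ($xml_string, @regexes) = @_;
    my $dom = XML::LibXML->load_xml(string => $xml_string);
    # textContent concatenates the descendant text with entities and
    # character references resolved; attributes, comments and processing
    # instructions are not part of it.
    my $text = $dom->documentElement->textContent;
    return map { my $n = () = $text =~ /$_/g; $n } @regexes;
}

my @regexes = (qr/\x{A7}\s*[0-9]/, qr/Art\.\s*[0-9IVX]/);
my @counts  = count_in_text($xml, @regexes);
print "section marks: $counts[0], articles: $counts[1]\n";
```

For this sample the three section marks and the single article in the element text are counted, and nothing else. Note that `textContent` joins text across element boundaries, so a pattern could match a span that crosses a tag; if that matters, iterating over the individual text nodes would be the more careful choice.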

      Update: Included &#xa7;.

      Update 2: Print the XML to show how some representations of the section mark are equivalent.

      Update 3: Added an attribute.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      Hi,

      Thanks for your kind welcome!

      Let me try to restate what you told me so that I understand it correctly. I would put my regexes in an array, e.g.

      my @regexes = (&sect;\s*[0-9], Art\.\s*[0-9IVX], ...)

      Is this what you meant?

      Then how do I read "a data file into a scalar as a string"? Is it just my $file = 'fname.xml'?

      Normally I use a file handle like this

      my $infile = $ARGV[0]; open(IN, '<' . $infile) or die $!;

      Which kind of loop construct do you think of?

      With regard to the ISO entities, &sect;, which stands for the "§" symbol, is an example of what I meant.

        my @regexes = (&sect;\s*[0-9], Art\.\s*[0-9IVX], ...)

        Like that, except that each regex needs to be delimited in some way, otherwise Perl will try to parse it as Perl code. You can either enclose them in quotes or mark them as regexes by using the qr// operator like this:

        my @regexes = (qr/&sect;\s*[0-9]/, qr/Art\.\s*[0-9IVX]/, ...)

        Then how do I read "a data file into a scalar as a string"?

        Mostly the way you say you normally do it, but make sure either to concatenate the lines or to read them all at once. Modules such as Path::Tiny and File::Slurper can help with this. There is much more about this in the Illumination "How do I read an entire file into a string?"

        my $infile = $ARGV[0];
        open my $inh, '<', $infile or die "Cannot open $infile for reading: $!";
        my $xml;
        {
            local $/ = undef;
            $xml = <$inh>;
        }
        close $inh;

        Which kind of loop construct do you think of?

        I was thinking of a for loop, as that is the trivial way to iterate over an array unless there is a good reason to use something else (which does not appear to be the case here).
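
Putting the pieces of this reply together (slurp the file, then a for loop over the patterns) could look roughly like the sketch below. The helper names `slurp` and `tally`, the labels, and the sample file contents are hypothetical. A temporary file stands in for the real XML file so that the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file into one string (slurp mode).
sub slurp {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path for reading: $!";
    local $/;                      # undefined record separator: read it all
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Return a hash of label => number of matches in $string.
sub tally {
    my ($string, @patterns) = @_;
    my %count;
    for my $pattern (@patterns) {
        my ($label, $regex) = @$pattern;
        $count{$label} = () = $string =~ /$regex/g;
    }
    return %count;
}

# Each entry pairs a hypothetical label with a compiled regex.
my @patterns = (
    [ section => qr/&sect;\s*[0-9]/ ],
    [ article => qr/Art\.\s*[0-9IVX]/ ],
);

# Stand-in for the real input file.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "<p>&sect; 1 and &sect;12, see Art. IV</p>\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

my %count = tally(slurp($tmp_name), @patterns);
printf "%-8s %d\n", $_->[0], $count{ $_->[0] } for @patterns;
```

Keeping a label next to each regex is just one design choice; it gives readable output without printing the stringified form of the compiled pattern.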

        Thanks for clarifying about the entities. Those should be fine as they are just data. You may need to escape any characters which have special meaning to the regular expression engine but otherwise they should not cause any problems. Try it and see how you get along.
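
As a small illustration of the escaping point, here is a sketch assuming you start from literal search strings instead of hand-written regexes. The helper name `literal_regex` and the sample string are hypothetical. `\Q...\E` applies `quotemeta` to the interpolated text, so a "." only matches a real full stop.

```perl
use strict;
use warnings;

# Turn a literal search string into a regex that matches it verbatim.
sub literal_regex {
    my ($text) = @_;
    return qr/\Q$text\E/;    # \Q...\E escapes regex metacharacters
}

my $string = 'Art. 5, Art: 5, Art. 7';        # hypothetical sample text

# Unescaped: "." matches any character, so "Art:" is counted as well.
my $unescaped = () = $string =~ /Art./g;

# Escaped: only the literal text "Art." is counted.
my $re      = literal_regex('Art.');
my $escaped = () = $string =~ /$re/g;

print "unescaped: $unescaped, escaped: $escaped\n";   # unescaped: 3, escaped: 2
```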


        🦛
