PerlMonks  

Counting matches and removing duplicates

by LexPl (Sexton)
on Nov 15, 2024 at 12:30 UTC ( [id://11162712] )

LexPl has asked for the wisdom of the Perl Monks concerning the following question:

I have written a script that extracts all XML entities from a file using a regex (&[^;]+;).

The output has a header (just plain text) and several thousand matching strings/XML entities with one occurrence per line:

List of entities in 'input.xml':
========================================================================
&uuml;
&uuml;
&auml;
&Auml;
&uuml;
&auml;
&uuml;
&ndash;
...

I would like to modify the output so that

  • each type of XML entity is listed only once (in other words, duplicates are excluded from the output),
  • the XML entities are sorted alphabetically by name, e.g. auml > ouml > uuml, and
  • the number of occurrences of each entity is given.

So a sample output could look like this:
&auml; 357
&ouml; 231
...

Replies are listed 'Best First'.
Re: Counting matches and removing duplicates
by hippo (Archbishop) on Nov 15, 2024 at 14:04 UTC

      First of all, many thanks for your valuable input and kind assistance!

      I have managed to generate statistics for the XML entities in a file in two steps:

      1. generate a list of XML entities via entityList.pl
      2. generate statistics describing each entity and its frequency via entityStat.pl

      Would it be possible to combine the functionality of the two Perl scripts into one comprehensive script? If yes, how could I achieve this?

      entityList.pl

      #!/usr/bin/perl
      use warnings;
      use strict;

      my $infile = $ARGV[0];
      print "List of entities in ", $infile, "\n";

      # define the regex used as search target
      my $regex = qr/(&[^;]+;)/;

      open my $in, '<', $infile or die "Cannot open $infile for reading: $!";

      # read input file into variable $xml
      my $xml;
      {
          local $/ = undef;
          $xml = <$in>;
      }

      # define output file
      open my $out, '>', 'ent-list.txt' or die $!;

      # output list of entities
      print {$out} "$1\n" while $xml =~ /$regex/g;

      close $in;
      close $out;

      entityStat.pl

      #!/usr/bin/env perl
      use strict;
      use warnings;

      my $infile  = $ARGV[0];
      my $outfile = $ARGV[1];

      open(IN,  '<', $infile)  or die "Cannot open $infile for reading: $!";
      open(OUT, '>', $outfile) or die $!;

      chomp(my @matches = <IN>);

      my %freq;
      $freq{$_}++ for @matches;

      print OUT "Statistics of entities in '", $infile, "'\n",
                '=' x 49, "\n";
      for my $entity (sort keys %freq) {
          printf OUT "%-20s %10s\n", $entity, $freq{$entity};
      }

        Yes, of course. In your combined script, instead of printing each line of matches, simply push them onto @matches. There is no need to write them out and then read them in again.
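        Along those lines, a combined sketch (the output file name 'ent-stat.txt' and the 49-character rule are assumptions; adjust to taste):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $infile = $ARGV[0];

# slurp the whole input file at once
open my $in, '<', $infile or die "Cannot open $infile for reading: $!";
my $xml = do { local $/; <$in> };
close $in;

# count every entity as it is matched -- no intermediate file needed
my %freq;
$freq{$1}++ while $xml =~ /(&[^;]+;)/g;

open my $out, '>', 'ent-stat.txt' or die $!;
print {$out} "Statistics of entities in '$infile'\n", '=' x 49, "\n";
printf {$out} "%-20s %10d\n", $_, $freq{$_} for sort keys %freq;
close $out;
```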


        🦛

Re: Counting matches and removing duplicates
by 1nickt (Canon) on Nov 15, 2024 at 13:04 UTC

    Hi,
    You could put the results in a hash with the entities as keys and the counts as values, incrementing as you go along.
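    A minimal sketch of that idea, with made-up sample data standing in for the regex matches:

```perl
use strict;
use warnings;

# made-up sample matches; in the real script these come from the regex loop
my @matches = ('&uuml;', '&auml;', '&uuml;');

my %count;
$count{$_}++ for @matches;   # key = entity, value = running count

print "$_ $count{$_}\n" for sort keys %count;
# &auml; 1
# &uuml; 2
```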

    Hope this helps!


    The way forward always starts with a minimal test.
Re: Counting matches and removing duplicates
by bliako (Abbot) on Nov 15, 2024 at 13:38 UTC

    You generally do that by storing all your data into a hashtable keyed on the properties you want to be unique. A hashtable stores tuples of (key,value). In this particular case the value is of no concern, only the key is important. That's why below the value is always 1. And let the hashtable do the hard work knowing the principle that hashtable's keys are unique. Here is an example:

    use strict;
    use warnings;

    my %UH;
    $UH{'k1'} = 1; # hash contains 1 item
    $UH{'k2'} = 1; # hash contains 2 items
    $UH{'k1'} = 1; # at this point the hash still contains 2 items
    for my $k (sort keys %UH) {
        print "Hash contains key '$k'\n";
    }

    If your data is stored in an array then you make items unique by creating a hashtable from the array (%{ {map { $_ => 1 } @items} }) and then taking the keys of that:

    use strict;
    use warnings;

    my @items = ('&uuml;', '&uuml;', '&auml;', '&Auml;');
    my @unique_items = sort keys %{ { map { $_ => 1 } @items } };
    print "@unique_items\n";

    Note that sorting is optional of course.

    Alternatively, you can use a perl module to do what you want, see e.g. here: https://perlmaven.com/unique-values-in-an-array-in-perl
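    For instance, the core module List::Util provides a uniq function (available in List::Util 1.45 and later) that does the deduplication directly -- a small sketch:

```perl
use strict;
use warnings;
use List::Util qw(uniq);

my @items = ('&uuml;', '&uuml;', '&auml;', '&Auml;');
my @unique_items = sort( uniq(@items) );
print "@unique_items\n";   # &Auml; &auml; &uuml;
```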

Re: Counting matches and removing duplicates
by ikegami (Patriarch) on Nov 15, 2024 at 13:09 UTC
    my %counts;
    while ( my $entity = ... ) {
        ++$counts{ $entity };
    }
    for my $entity ( sort keys( %counts ) ) {
        say "$entity $counts{ $entity }";
    }

      Many thanks for your input!

      I understand the two routines as such, but I don't understand how the different matching strings - all are XML entities - will be fed into the while loop.

      And to be honest, I didn't understand the suggestion about the hash solution.

        It represents the existing loop you are using to discover the entities.
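        Concretely, a runnable sketch with the entity-matching loop filled in (the sample string is made up):

```perl
use strict;
use warnings;

my $xml = 'a&uuml;b&auml;c&uuml;';   # stand-in for the slurped input file

my %counts;
while ( $xml =~ /(&[^;]+;)/g ) {     # the same loop entityList.pl uses to print matches
    ++$counts{$1};                   # $1 is the entity just matched
}
print "$_ $counts{$_}\n" for sort keys %counts;
# &auml; 1
# &uuml; 2
```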

Re: Counting matches and removing duplicates
by harangzsolt33 (Deacon) on Nov 15, 2024 at 18:42 UTC

      You have advocated using grep in a void context. Please explain why.


      🦛

        He overlooked that the snippet was the inside of a subroutine, where the grep result serves as the return value.
        But at least he gave a credible citation (although he should've cited the perlfaq4 section directly), so I ++'d anyway.

      Ever heard of fc?
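      For those who haven't: fc is Perl's Unicode casefolding builtin (available since Perl 5.16), the robust choice for case-insensitive comparison and sorting -- a quick sketch:

```perl
use strict;
use warnings;
use feature 'fc';   # fc needs Perl 5.16+

# fc() casefolds for comparison; unlike lc(), it applies full Unicode folding
print "same letter\n" if fc('Auml') eq fc('auml');
print join(' ', sort { fc($a) cmp fc($b) } qw(uuml Auml ouml)), "\n";
# Auml ouml uuml
```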

      Greetings,
      🐻

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

        news to me!!!!! thanks for this.

Node Type: perlquestion [id://11162712]
Approved by ikegami