http://www.perlmonks.org?node_id=11117958

Galdor has asked for the wisdom of the Perl Monks concerning the following question:

Hello - I have started to learn Portuguese and I am keeping text dictionary files:
    # Drink
    a cerveja|beer
    a laranja|orange
    beber|to drink
    o copo de vinho|glass of wine
    o copo|glass or cup
    o sumo|juice

How do I do a Perl sort that disregards the definite articles (o and a) for nouns and sorts all words into dictionary order? Any modules useful for general languages learning and grammar? Thanks!

Replies are listed 'Best First'.
Re: Perl custom sort for Portuguese Language
by haukex (Archbishop) on Jun 12, 2020 at 08:14 UTC
    use warnings;
    use 5.016;
    use utf8;
    use open qw/:std :utf8/;
    use Text::CSV qw/csv/;  # also install Text::CSV_XS for speed
    use Unicode::Collate;

    my $Collator = Unicode::Collate->new(
        preprocess => sub { $_[0][0] =~ s/^(?:o|a)\s+//ir } );

    my $rows = csv( in=>*DATA, sep=>"|", esc=>"\\", auto_diag=>2 );
    my @sorted = $Collator->sort(@$rows);
    csv( in=>\@sorted, out=>*STDOUT, sep=>"|", esc=>"\\", quote_space=>0 );

    __DATA__
    a cerveja|beer
    o ano|year
    a laranja|orange
    beber|to drink
    água|water
    o copo de vinho|glass of wine
    o copo|glass or cup
    o sumo|juice

    Output:

    água|water
    o ano|year
    beber|to drink
    a cerveja|beer
    o copo|glass or cup
    o copo de vinho|glass of wine
    a laranja|orange
    o sumo|juice

    Update: Realized I could use $Collator->sort instead of $Collator->cmp.

      See also Unicode::Collate::Locale in the same distro. Using locale => "pt" would use Portuguese-specific rules (if any exist) rather than a generic algorithm.
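      A minimal sketch of that suggestion, applied to plain strings rather than CSV rows. The word list is made up for illustration. One assumption to check locally: if the installed Unicode::Collate::Locale ships no Portuguese tailoring, the constructor should quietly fall back to the default collation, and getlocale reports which one is actually in effect.

```perl
use 5.016;
use strict;
use warnings;
use utf8;
use open qw/:std :utf8/;
use Unicode::Collate::Locale;

# Request Portuguese rules; falls back to the default collation
# if no "pt" tailoring is available in the installed module.
my $collator = Unicode::Collate::Locale->new(
    locale     => 'pt',
    preprocess => sub { $_[0] =~ s/^[oa]\s+//ir },  # drop a leading article
);

# Hypothetical sample words, not taken from a real dictionary file
my @words  = ('o sumo', 'água', 'a cerveja', 'beber', 'o ano');
my @sorted = $collator->sort(@words);

say for @sorted;
say "locale in effect: ", $collator->getlocale;
```

      The preprocess hook only affects the sort key, so the articles are still present in the sorted output.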

      Excellent! Thank you for that, this is really useful. One more thing - how can I make it ignore all lines starting with a hash - e.g. '/^#/' - is that possible with this solution? Thanks for pointers!
        how can I make it ignore all lines starting with a hash - e.g. '/^#/' - is that possible with this solution?

        This depends on what you mean by "ignore" - I have a suspicion that perhaps your input file is in sections separated by comments, and you want the sections to be sorted individually? Could you show some short but representative sample input and the expected output for that input?
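        Until that is clarified, here is a rough sketch of one possible reading: comment lines stay where they are as section headers, and the entries under each header are sorted on their own. The sample lines and the "# Food" section are invented for illustration, and plain string splitting stands in for Text::CSV to keep it short.

```perl
use 5.016;
use strict;
use warnings;
use utf8;
use open qw/:std :utf8/;
use Unicode::Collate;

# Sort key: the first column only, with a leading article removed
my $collator = Unicode::Collate->new(
    preprocess => sub {
        my ($word) = split /\|/, $_[0];
        $word =~ s/^[oa]\s+//ir;
    },
);

# Hypothetical input: comment lines act as section headers
my @lines = (
    '# Drink', 'o sumo|juice', 'a cerveja|beer', 'água|water',
    '# Food',  'o pão|bread',  'a laranja|orange',
);

my (@out, @section);
for my $line (@lines, undef) {          # trailing undef flushes the last section
    if (!defined $line or $line =~ /^#/) {
        push @out, $collator->sort(@section);
        @section = ();
        push @out, $line if defined $line;
    }
    else {
        push @section, $line;
    }
}
say for @out;
```

        If "ignore" instead means dropping the comment lines entirely, filtering the input with grep { !/^#/ } before sorting would be enough.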

Re: Perl custom sort for Portuguese Language
by hippo (Bishop) on Jun 12, 2020 at 08:19 UTC

    TIMTOWTDI but here is a Schwartzian Transform approach.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my @in = <DATA>;
    my @sorted =
        map  { "$_->[1]$_->[0]" }
        sort { $a->[0] cmp $b->[0] }
        map  { s/^([oa] )//; [$_, $1 // ''] } @in;
    print @sorted;

    __DATA__
    # Drink
    a cerveja|beer
    a laranja|orange
    beber|to drink
    o copo de vinho|glass of wine
    o copo|glass or cup
    o sumo|juice

    Any modules useful for general languages learning and grammar?

    Grammar is tough. Have you looked at the modules in the Lingua::PT space? Perhaps Lingua::PT::Conjugate might be one place to start? Good luck.

    Update: Just spotted the copo de vinho was out of order so here is an improved ST:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my @in = <DATA>;
    my @sorted =
        map  { $_->[1] }
        sort { $a->[0] cmp $b->[0] }
        map  { /^(?:[oa] )?([^|]+)/; [$1, $_] } @in;
    print @sorted;

    __DATA__
    # Drink
    a cerveja|beer
    a laranja|orange
    beber|to drink
    o copo de vinho|glass of wine
    o copo|glass or cup
    o sumo|juice

      Thanks for the Lingua::PT::Conjugate steer... I never thought about those at all. Nice!!
      fails on água
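      The reason is that cmp compares strings code point by code point, so "á" (U+00E1) lands after every unaccented ASCII letter and "água" ends up last. A possible fix, sketched here as an assumption and not as hippo's code, keeps the Schwartzian Transform but stores a Unicode::Collate sort key in place of the raw word. The input list is inlined for brevity.

```perl
use strict;
use warnings;
use utf8;
use open qw/:std :utf8/;
use Unicode::Collate;

my $collator = Unicode::Collate->new;

# Sample lines, newline-terminated as if read from a file
my @in = map { "$_\n" } (
    'a cerveja|beer', 'o ano|year', 'beber|to drink', 'água|water',
    'o copo de vinho|glass of wine', 'o copo|glass or cup',
);

my @sorted =
    map  { $_->[1] }                       # 3. unwrap the original line
    sort { $a->[0] cmp $b->[0] }           # 2. compare precomputed sort keys
    map  {                                 # 1. key = word without its article
        /^(?:[oa] )?([^|]+)/;
        [ $collator->getSortKey($1), $_ ]
    } @in;

print @sorted;
```

      getSortKey returns a binary string built so that an ordinary cmp on two keys follows the collation order, which is why the sort block itself does not change.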