PerlMonks  

Re: Perl custom sort for Portuguese Lanaguage

by haukex (Bishop)
on Jun 12, 2020 at 08:14 UTC (#11117959)


in reply to Perl custom sort for Portuguese Lanaguage

use warnings;
use 5.016;
use utf8;
use open qw/:std :utf8/;
use Text::CSV qw/csv/;  # also install Text::CSV_XS for speed
use Unicode::Collate;

my $Collator = Unicode::Collate->new(
    preprocess => sub { $_[0][0] =~ s/^(?:o|a)\s+//ir } );

my $rows = csv( in=>*DATA, sep=>"|", esc=>"\\", auto_diag=>2 );
my @sorted = $Collator->sort(@$rows);
csv( in=>\@sorted, out=>*STDOUT, sep=>"|", esc=>"\\",
    quote_space=>0 );

__DATA__
a cerveja|beer
o ano|year
a laranja|orange
beber|to drink
água|water
o copo de vinho|glass of wine
o copo|glass or cup
o sumo|juice

Output:

água|water
o ano|year
beber|to drink
a cerveja|beer
o copo|glass or cup
o copo de vinho|glass of wine
a laranja|orange
o sumo|juice

Update: Realized I could use $Collator->sort instead of $Collator->cmp.
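A minimal sketch of the difference mentioned in the update, assuming a default Unicode::Collate object and a made-up word list (neither the list nor the variable names come from the code above). Both calls should give the same order; as far as I understand the module, $Collator->sort builds each string's sort key once, while a sort block with $Collator->cmp consults the collator on every comparison.

```perl
use warnings;
use 5.016;
use utf8;
use open qw/:std :utf8/;
use Unicode::Collate;

my $Collator = Unicode::Collate->new;

# Illustrative word list, not taken from the thread's data.
my @words = ( 'sumo', 'água', 'copo', 'ano' );

# Comparison-based: the collator is called from inside Perl's sort block.
my @via_cmp = sort { $Collator->cmp( $a, $b ) } @words;

# Method-based: the collator sorts the whole list itself.
my @via_sort = $Collator->sort(@words);

say "@via_cmp";
say "@via_sort";
```

For short lists the two are interchangeable; the method form is simply less to type.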

Re^2: Perl custom sort for Portuguese Lanaguage
by ikegami (Patriarch) on Jun 14, 2020 at 00:03 UTC

    See also Unicode::Collate::Locale in the same distro. Using locale => "pt" would use Portuguese-specific rules (if any exist) rather than the generic algorithm.
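A hedged sketch of what that suggestion might look like. It assumes Unicode::Collate::Locale accepts the same preprocess option as Unicode::Collate, and it uses plain strings instead of CSV rows to stay short. The getlocale method reports which tailoring was actually selected, and the module's documentation says it returns "default" when no tailoring exists for the requested locale, so it is a quick way to check whether "pt" changes anything.

```perl
use warnings;
use 5.016;
use utf8;
use open qw/:std :utf8/;
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(
    locale     => 'pt',
    # Same idea as in the parent node: ignore a leading article.
    preprocess => sub { $_[0] =~ s/^(?:o|a)\s+//ir },
);

# Shows which tailoring the module picked for the requested locale.
say "tailoring in use: ", $Collator->getlocale;

# Illustrative terms only; plain strings rather than CSV rows.
my @sorted = $Collator->sort( 'o copo', 'água', 'a cerveja', 'beber' );
say for @sorted;
```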

Re^2: Perl custom sort for Portuguese Lanaguage
by Galdor (Sexton) on Jul 08, 2020 at 06:03 UTC
    Excellent! Thank you for that, this is really useful. One more thing: how can I make it ignore all lines starting with a hash, e.g. '/^#/'? Is that possible with this solution? Thanks for any pointers!
      how can I make it ignore all lines starting with a hash - e.g. '/^#/' - is that possible with this solution?

      This depends on what you mean by "ignore" - I have a suspicion that perhaps your input file is in sections separated by comments, and you want the sections to be sorted individually? Could you show some short but representative sample input and the expected output for that input?

        Sure. No, there is no need to sort sections individually. I merely want to strip all blank lines and all lines starting with a hash. Here are a few small samples (each is a separate *.dict file):
        # Drink
        a cerveja|beer
        a laranja|orange
        a água|water
        beber|to drink
        o copo de vinho|glass of wine
        o copo|glass or cup
        o sumo|juice
        and ...
        # numbers
        zero|zero
        um|one
        dois|two
        três|three
        quatro|four
        cinco|five
        seis|six
        sete|seven
        oito|eight
        nove|nine
        dez|ten
        ## 11 - 19
        onze|eleven
        doze|twelve
        treze|thirteen
        catorze|fourteen
        quinze|fifteen
        dezasseis|sixteen
        dezasssete|seventeen
        dezoito|eighteen
        dezanove|nineteen
        and even ...
        # time
        # DO NOT SORT!
        # the time
        o segundo|second
        o minuto|minute
        a ora|hour
        # the day
        o dia|day
        a noite|night
        a madrugada|early morning
        a manhã|morning
        a tarde|afternoon
        a noite|night
        o meio dia|midday
        a meia noite|midnight
        # days of week
        a semana|week
        o fim-de-semana (os fims-de-samana)|week-end
        a Segunda-feira|Monday
        a Terça-feira|Tuesday
        a Quarta-feira|Wednesaday
        a Quita-feira|Thursday
        a Sexta-feira|Friday
        o Sábado|Saturday
        o Domingo|Sunday
        # Months of the year
        o mês (os meses)|month
        o ano|year
        Janeiro|January
        Fevereiro|February
        Março|March
        Abril|April
        Maio|May
        So in the end they will all be "filtered" into one big dictionary output file, one entry per line, in "pt dictionary" alphabetical order. They are kinda in markdown format, so I could print each one out individually without doing any sorting, but I would also enjoy the benefit of having a "big database" of words for word tests, a personal dictionary, and cool stuff like that. If I want "sub-section" sorting as you suggest, I will just break them out into separate files (I guess). Thanks!
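One possible way to do that filtering, as a rough sketch only. To keep it core-only it works on whole lines instead of going through Text::CSV as the parent node does, it takes its input from an inline list rather than the *.dict files, and dict_lines is a hypothetical helper name. The preprocess callback cuts the line down to the Portuguese term and drops a leading article before the sort key is built, so the translation never influences the order.

```perl
use warnings;
use 5.016;
use utf8;
use open qw/:std :utf8/;
use Unicode::Collate;

# Hypothetical helper: drop blank lines and lines starting with "#".
sub dict_lines {
    return grep { !/^\s*(?:#|$)/ } @_;
}

my $Collator = Unicode::Collate->new(
    # Sort key: text before the first "|", minus a leading "o " or "a ".
    preprocess => sub { $_[0] =~ s/\|.*//sr =~ s/^(?:o|a)\s+//ir },
);

# Stand-in for the concatenated contents of several *.dict files.
my @input = (
    '# Drink',
    'a cerveja|beer',
    '',
    'a água|water',
    'beber|to drink',
    '## 11 - 19',
    'o copo de vinho|glass of wine',
    'o copo|glass or cup',
);

my @sorted = $Collator->sort( dict_lines(@input) );
say for @sorted;
```

With real files, the inline list would be replaced by the chomped lines read from each file; the filter and the collator stay the same.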
