PerlMonks  

Re^4: Perl custom sort for Portuguese Language

by Galdor (Sexton)
on Jul 08, 2020 at 06:38 UTC (#11119024)


in reply to Re^3: Perl custom sort for Portuguese Language
in thread Perl custom sort for Portuguese Language

Sure. No, there is no need to sort sections individually; merely strip all blank lines and all lines that start with a hash. Here are a few small samples (each is a separate *.dict file):
# Drink
a cerveja|beer
a laranja|orange
a água|water
beber|to drink
o copo de vinho|glass of wine
o copo|glass or cup
o sumo|juice
and ...
# numbers
zero|zero
um|one
dois|two
três|three
quatro|four
cinco|five
seis|six
sete|seven
oito|eight
nove|nine
dez|ten
## 11 - 19
onze|eleven
doze|twelve
treze|thirteen
catorze|fourteen
quinze|fifteen
dezasseis|sixteen
dezasssete|seventeen
dezoito|eighteen
dezanove|nineteen
and even ...
# time
# DO NOT SORT!
# the time
o segundo|second
o minuto|minute
a ora|hour
# the day
o dia|day
a noite|night
a madrugada|early morning
a manhã|morning
a tarde|afternoon
a noite|night
o meio dia|midday
a meia noite|midnight
# days of week
a semana|week
o fim-de-semana (os fims-de-samana)|week-end
a Segunda-feira|Monday
a Terça-feira|Tuesday
a Quarta-feira|Wednesaday
a Quita-feira|Thursday
a Sexta-feira|Friday
o Sábado|Saturday
o Domingo|Sunday
# Months of the year
o mês (os meses)|month
o ano|year
Janeiro|January
Fevereiro|February
Março|March
Abril|April
Maio|May
So in the end they will all be "filtered" into one big dictionary output file, one entry per line, in "pt dictionary" alphabetical order. They are kinda in markdown format, so I could print each one out individually without doing any sorting, but I also get the benefit of having a "big database" of words for word tests, a personal dictionary, and cool stuff like that... If I want "sub-section" sorting as you suggest, I will just break them out into separate files (I guess) ... Thanks!

Replies are listed 'Best First'.
Re^5: Perl custom sort for Portuguese Language (updated x2)
by haukex (Bishop) on Jul 08, 2020 at 18:03 UTC

    In that case it's fairly easy. I used Text::CSV to read the data file, but AFAIK it doesn't support ignoring comment lines. If you are certain your files are always going to be as simple as you showed, only two columns separated by | and no |s anywhere else, no quoted fields, etc., then it's also possible to parse the file manually with a regex, for example:

    open my $fh, '<:encoding(UTF-8)', $filename or die "$filename: $!";
    my @rows = map { /^([^|]+)\|([^|]+?)$/ or die $_; [$1,$2] }
        grep { /\S/ && !/^\s*#/ } <$fh>;
    close $fh;

    And then you can use @rows instead of @$rows in my example above.
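A minimal, self-contained sketch of the regex approach described above, run against an in-memory copy of part of one sample file instead of a real file. The `parse_dict` helper name, the shortened sample data, and the use of the core `Unicode::Collate` module with its default ordering are assumptions for illustration; they are not taken from the earlier example in the thread.

```perl
use strict;
use warnings;
use utf8;                 # the sample data below contains UTF-8 literals
use Unicode::Collate;     # core module; assumption: its default order is acceptable

# Hypothetical helper: turn the lines of one *.dict file into
# [term, translation] pairs, skipping blank lines and comment lines.
sub parse_dict {
    my @lines = @_;
    return map  { /^([^|]+)\|([^|]+?)$/ or die "bad line: $_"; [$1, $2] }
           grep { /\S/ && !/^\s*#/ } @lines;
}

# In-memory stand-in for the lines of one sample file (deliberately unsorted)
my @sample = map { "$_\n" } (
    '# Drink',
    '',
    'a cerveja|beer',
    'a água|water',
    'beber|to drink',
    'a laranja|orange',
);

my @rows = parse_dict(@sample);

# Sort on the first (Portuguese) column
my $collator = Unicode::Collate->new;
my @sorted   = sort { $collator->cmp($a->[0], $b->[0]) } @rows;

binmode STDOUT, ':encoding(UTF-8)';
print join('|', @$_), "\n" for @sorted;
```

With real files, the `@sample` list would be replaced by the lines read from a filehandle opened with `'<:encoding(UTF-8)'`, as in the snippet above.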

    Update: Minor simplification to code.

    Update 2: And soonix makes a good point that continuing to use Text::CSV is also most likely fine, since it's probably safe to assume that you don't have any actual data that starts with #.

      I used Text::CSV to read the data file, but AFAIK it doesn't support ignoring comment lines.

      This works for me:

      csv (in => 'quux.csv', filter => {1 => sub { !/^#/ }});
        This works for me: csv (in => 'quux.csv', filter => {1 => sub { !/^#/ }});

        Unfortunately that also filters lines whose first field is "#foo" (with the quotes). I remember Tux recently saying filtering before parsing wasn't supported, though I'm having trouble finding the reference at the moment (it could have been in the chatterbox too*). It may be a bit tricky because this is valid CSV too:

        abc,"d
        #e
        f",ghi

        (That's one row, ["abc", "d\n#e\nf", "ghi"].)
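To illustrate why filtering comment lines before parsing is tricky, here is a small sketch using only core Perl. It applies a naive line-based comment filter to the one-row record above and shows that the middle line of the quoted field is lost. The variable names are illustrative only.

```perl
use strict;
use warnings;

# The one-row CSV record from above; the quoted field contains newlines
my $raw = qq{abc,"d\n#e\nf",ghi\n};

# Naive pre-filter: drop every physical line that starts with '#'
my $filtered = join '', grep { !/^\s*#/ } split /^/, $raw;

# The "#e" line was part of the data, not a comment, but it is gone now
print $filtered;
```

A CSV parser reading `$filtered` would still see one row, but its second field would no longer match the original data.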

        * Update: I looked again and I think it must have been in the chatterbox; I do distinctly remember someone having a similar question recently...

        If you only want the leading lines starting with # to be filtered, that is indeed what filter is for:

        use Data::Peek;
        use Text::CSV_XS qw( csv );

        my $r = 0;   # stays 0 (false) until the first non-comment row is seen
        my $aoa = csv (in => *DATA, filter => sub {
            $_[1][0] =~ m/^\s*#/ ? $r : ++$r;
            });
        DDumper $aoa;
        __END__
        # This is comment
        # and so is this
        # and this
        a,b,c
        #but,not,this
        1,2,3

        -->

        [ [ 'a', 'b', 'c' ], [ '#but', 'not', 'this' ], [ '1', '2', '3' ] ]

        Enjoy, Have FUN! H.Merijn
