in reply to Re: Perl custom sort for Portuguese Lanaguage
in thread Perl custom sort for Portuguese Lanaguage

Excellent! Thank you for that, this is really useful. One more thing - how can I make it ignore all lines starting with a hash - e.g. '/^#/' - is that possible with this solution? Thanks for any pointers!

Re^3: Perl custom sort for Portuguese Lanaguage
by haukex (Bishop) on Jul 08, 2020 at 06:07 UTC
    how can I make it ignore all lines starting with a hash - e.g. '/^#/' - is that possible with this solution?

    This depends on what you mean by "ignore" - I have a suspicion that perhaps your input file is in sections separated by comments, and you want the sections to be sorted individually? Could you show some short but representative sample input and the expected output for that input?

      Sure. No, there is no need to sort sections individually - merely strip all blank lines and all lines starting with a hash. Here are a few small samples (each is a separate *.dict file):
      # Drink
      a cerveja|beer
      a laranja|orange
      a água|water
      beber|to drink
      o copo de vinho|glass of wine
      o copo|glass or cup
      o sumo|juice
      and ...
      # numbers
      zero|zero
      um|one
      dois|two
      três|three
      quatro|four
      cinco|five
      seis|six
      sete|seven
      oito|eight
      nove|nine
      dez|ten
      ## 11 - 19
      onze|eleven
      doze|twelve
      treze|thirteen
      catorze|fourteen
      quinze|fifteen
      dezasseis|sixteen
      dezasssete|seventeen
      dezoito|eighteen
      dezanove|nineteen
      and even ...
      # time
      # DO NOT SORT!
      # the time
      o segundo|second
      o minuto|minute
      a ora|hour
      # the day
      o dia|day
      a noite|night
      a madrugada|early morning
      a manhã|morning
      a tarde|afternoon
      a noite|night
      o meio dia|midday
      a meia noite|midnight
      # days of week
      a semana|week
      o fim-de-semana (os fims-de-samana)|week-end
      a Segunda-feira|Monday
      a Terça-feira|Tuesday
      a Quarta-feira|Wednesaday
      a Quita-feira|Thursday
      a Sexta-feira|Friday
      o Sábado|Saturday
      o Domingo|Sunday
      # Months of the year
      o mês (os meses)|month
      o ano|year
      Janeiro|January
      Fevereiro|February
      Março|March
      Abril|April
      Maio|May
      So in the end they will all be "filtered" into one big dictionary output file, one entry per line, in "pt dictionary" alphabetical order. They are kind of in markdown format, so I could print each one out individually without doing any sorting, but also enjoy the benefit of having a "big database" of words for word tests, a personal dictionary, and cool stuff like that... If I want "sub-section" sorting as you suggest, I will just break them out into separate files (I guess)... Thanks!

        In that case it's fairly easy. I used Text::CSV to read the data file, but AFAIK it doesn't support ignoring comment lines. If you are certain your files are always going to be as simple as you showed, only two columns separated by | and no |s anywhere else, no quoted fields, etc., then it's also possible to parse the file manually with a regex, for example:

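        open my $fh, '<:encoding(UTF-8)', $filename or die "$filename: $!";
        my @rows =
            # split "pt|en" into two fields; $ matches before the trailing newline
            map  { /^([^|]+)\|([^|]+?)$/ or die $_; [$1,$2] }
            # skip blank lines and lines whose first non-blank character is #
            grep { /\S/ && !/^\s*#/ }
            <$fh>;
        close $fh;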

        And then you can use @rows instead of @$rows in my example above.
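For illustration, here is a minimal self-contained sketch of the same filter and regex applied to a few in-memory lines instead of a file. The sample lines and the variable names (@lines, $collator, @sorted) are made up for this sketch, and the sort step uses plain Unicode::Collate with its default collation as an assumption; it is not necessarily how the earlier example in this thread sorts.

```perl
use strict;
use warnings;
use utf8;                 # the sample data below contains literal UTF-8
use Unicode::Collate;     # core module; default collation (an assumption here)

# Hypothetical sample data standing in for the lines of one *.dict file.
my @lines = (
    "# Drink\n",
    "\n",
    "a laranja|orange\n",
    "a água|water\n",
    "  # indented comment\n",
    "a cerveja|beer\n",
);

# Same filter and regex as in the snippet above, applied to @lines
# instead of <$fh>: drop blank and comment lines, then split on the |.
my @rows =
    map  { /^([^|]+)\|([^|]+?)$/ or die "bad line: $_"; [ $1, $2 ] }
    grep { /\S/ && !/^\s*#/ }
    @lines;

# Sort by the first (Portuguese) column using Unicode collation.
my $collator = Unicode::Collate->new;
my @sorted   = sort { $collator->cmp( $a->[0], $b->[0] ) } @rows;

binmode STDOUT, ':encoding(UTF-8)';
print join( '|', @$_ ), "\n" for @sorted;
```

Each element of @rows is an array reference holding the two columns, so it can be sorted or printed the same way as the rows read from a real file.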

        Update: Minor simplification to code.

        Update 2: And soonix makes a good point that continuing to use Text::CSV is also most likely fine, since it's probably safe to assume that you don't have any actual data that starts with #.