

by AppleFritter (Priest)
on Apr 27, 2014 at 20:42 UTC (#1084031=user)

Howdy, partner! Name's Apple Fritter, pleasure to meet y'all! I use Perl, but I don't know that much about it (yet). I'm trying to change that, so I frequent the Monastery, reading others' answers and code to learn, and providing my own answers and code to hone my skills.

If I come across useful advice, tips, modules, code snippets, articles etc., I usually add it to my home node (which you are reading right now) for future reference. Maybe you'll find it useful, too!

Not affiliated with Tom Owad's

Note: I'm not active on Perlmonks anymore. I will, however, keep on updating my home node when I come across items worth adding.

For new users:

N.B. when crossposting to several sites, it is considered polite to inform readers of this and provide links to avoid unnecessary/duplicated effort.

For established users:

The Monastery:

General infrastructure:

Perl culture:

Misc. (unordered, unsorted):

Due to the 64 KiB node size limit, this section now resides in AppleFritter's scratchpad.

Monk quotes:

Do not fear death, you will re-awaken to a world built with Perfect Perl 7 and no Python.
-- boftx, Re^3: Using die() in methods

the moment you try to separate the physical construction of code -- kloc, function points, abstracts test quantities -- from the intellectual processes of gathering requirements; understanding work-patterns and flows; and imagining suitable, appropriate, workable algorithms to meet them; you do not have sufficient understanding of the process involved in code development to be making decisions about it.
-- BrowserUk, Re: Nobody Expects the Agile Imposition (Part VII): Metrics

You were unlucky in the sense that your program seems to have remained valid Perl even with all variables removed.
-- Corion, Re: [OneLiner] What am I doing wrong in my regex?

I insist on being paid to use Windows products, sir!
-- Your Mother, Re^3: PerlWizard - A free wizard for automatic Perl software code generation using simple forms

No further rational discussion is possible here because I find your preferred style utterly abhorrent :)
-- BrowserUk, Re^3: Porting (old) code to something else

AppleFritter elsewhere:

Two monks sat together for lunch. The first monk said, "What do you see when you see me?"
The second replied, "I see a reflection of the Buddha."
The first, feeling nasty, said, "When I look at you, I see a pile of shit."
The second just smiled. The first turned angry. "Why are you smiling?"
The second replied, "What comes out of a man is a reflection of what's inside a man. I am filled with the Buddha nature, so everywhere I look, I see a reflection of the Buddha."

Posts by AppleFritter
Size-limited, fitness-based lists in Cool Uses for Perl
3 direct replies
by AppleFritter
on Aug 08, 2015 at 19:05

    Monks and monkettes! I recently found myself wondering, what are the longest words in the dictionary (/usr/share/dict, anyway)?

    This is easily found out, but it's natural to be interested not just in the longest word but (say) the top ten. And when your dictionary contains (say) eight words of length fifteen and six words of length fourteen, it's also natural to not want to arbitrarily select two of the latter, but list them all.

    I quickly decided I needed a type of list that would have a concept of the fitness of an item (not necessarily the length of a word), and try not to exceed a maximum size if possible (while retaining some flexibility). My CPAN search-fu is non-existent, but since it sounded like fun, I just rolled my own. Here's the first stab at what is right now called List::LimitedSize::Fitness (if anyone's got a better idea for a name, please let me know):

    This features both "flexible" and "strict" policies. With the former, fitness classes are guaranteed to never lose items, but the list as a whole might grow beyond the specified maximum size. With the latter, the list is guaranteed to never grow beyond the specified maximum size, but fitness classes might lose items. (Obviously you cannot have it both ways, not in general.)
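The "flexible" policy described above can be illustrated with a minimal sketch. This is not the List::LimitedSize::Fitness code; the function name `fittest` and its calling convention are hypothetical, and the "strict" policy is not covered. The sketch groups items into fitness classes and drops the weakest class only while the remaining classes still hold at least the requested number of items.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's API): keep roughly $max of the
# fittest items, never splitting a fitness class ("flexible" policy).
sub fittest {
    my ($max, $fitness, @items) = @_;
    my %class;        # fitness value => array ref of items with that fitness
    my $total = 0;    # number of items currently kept

    for my $item (@items) {
        push @{ $class{ $fitness->($item) } }, $item;
        $total++;

        # Drop the weakest class as long as what remains still fills the list.
        while (%class) {
            my ($lowest) = sort { $a <=> $b } keys %class;
            my $size = @{ $class{$lowest} };
            last if $total - $size < $max;
            delete $class{$lowest};
            $total -= $size;
        }
    }
    return \%class;
}

# Usage: ask for 3 words by length; the class of length 3 is kept whole.
my $kept = fittest(3, sub { length $_[0] }, qw(a bb cc ddd eee fff gggg));
for my $len (sort { $a <=> $b } keys %$kept) {
    print "length $len: @{ $kept->{$len} }\n";
}
# prints "length 3: ddd eee fff" and then "length 4: gggg"
```

Re-sorting the class keys on every insertion is wasteful for large inputs; it is kept here only to make the policy easy to read.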

    Here's an example of the whole thing in action:

    This might output (depending on your dictionary):

    $ perl wordsEn.txt
    ..........
    length 21
        antienvironmentalists
        antiinstitutionalists
        counterclassification
        electroencephalograms
        electroencephalograph
        electrotheraputically
        gastroenterologically
        internationalizations
        mechanotheraputically
        microminiaturizations
        microradiographically
    length 22
        counterclassifications
        counterrevolutionaries
        electroencephalographs
        electroencephalography
    length 23
        disestablismentarianism
        electroencephalographic
    length 25
        antidisestablishmentarian
    length 28
        antidisestablishmentarianism
    19 words total (10 requested).
    $

    If you've got any thoughts, tips, comments, rotten tomatoes etc., send them my way! (...actually, forget about the rotten tomatoes.)

    Also, does anyone think this module would be useful to have on CPAN, in principle if not in its current state?

Resetting a flip-flop operator in Seekers of Perl Wisdom
1 direct reply
by AppleFritter
on Aug 06, 2015 at 06:52

    Greetings, esteemed monks! Allow this humble pony to drink the sweet nectar of knowledge from the font of your collective wisdom. (Or alternatively, how 'bout some hard cider?)

    I need to read a number of files. In each file, each line holds a piece of data, or a marker indicating the beginning or end of a section; I'm interested only in data in a specific section. Normally, I'd do something like this:

    foreach my $HANDLE (@HANDLES) {
        while(<$HANDLE>) {
            chomp;
            next unless /^PP_START$/ .. /^PP_END$/;
            # process line
        }
    }

    However, it turns out that in these log files, the section end marker may be omitted if there is no following section: the end of the file itself indicates the end of the section then.

    This wreaks havoc with the above logic, as the flip-flop operator, not having seen the marker, still evaluates to true when the outer loop moves on to the next file, and wrongly causes lines before the start marker in that file to be processed.

    Of course it would be trivial to add a flag indicating whether I'm in the right section, and reset that for each file. But doing that would essentially manually emulate the flip-flop operator, which strikes me as less than elegant. So I'm wondering -- is there a way to "reset" the flip-flop operator, as it were, so that it starts returning false again at the beginning of each new file?
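One way to get this effect without a separate flag is to let the end of each file close the range as well, by adding `eof($fh)` to the right-hand operand of the flip-flop. The sketch below is an illustration under assumptions: the helper name `section_lines` and the in-memory test files are hypothetical, and skipping the marker lines themselves is a choice made here, not something the question requires.

```perl
use strict;
use warnings;

# Hypothetical helper: return the data lines found between PP_START and
# PP_END (or the end of the file) in each of the given filehandles.
sub section_lines {
    my @handles = @_;
    my @data;
    foreach my $fh (@handles) {
        while (my $line = <$fh>) {
            chomp $line;
            # eof($fh) is true once the last line of this file has been read,
            # so the range also ends, and resets, at the end of every file.
            next unless $line =~ /^PP_START$/
                     .. ($line =~ /^PP_END$/ || eof($fh));
            next if $line =~ /^PP_(?:START|END)$/;   # skip the markers
            push @data, $line;
        }
    }
    return @data;
}

# Usage with two in-memory "files"; the first one omits its end marker.
my $file1 = "junk\nPP_START\na\nb\n";
my $file2 = "stray\nPP_START\nc\nPP_END\nafter\n";
open my $fh1, '<', \$file1 or die $!;
open my $fh2, '<', \$file2 or die $!;
print join(",", section_lines($fh1, $fh2)), "\n";   # prints a,b,c
```

Note that `eof($fh)` with an explicit handle is used on purpose; `eof()` with empty parentheses refers to the files read via `<>` and would not do the same thing here.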

"Unrecognized character" while use utf8 is in effect in Seekers of Perl Wisdom
2 direct replies
by AppleFritter
on Apr 17, 2015 at 06:03

    Oh monks most tawny and tangy, whose wisdom and knowledge of all things Perl is unalienable and indefeasible, help me out, for I'm very much missing the obvious.

    As you will well know, Perl allows Unicode characters in variable names, so long as use utf8; is in effect. So the following snippet works as expected (apologies for the unresolved HTML entities, Perlmonks itself does not handle Unicode properly):

    my $&#x4EBA; = "World";
    say "Hello, $&#x4EBA;";

    However, the following does not:

    my $&#x1F310; = "World";
    say "Hello, $&#x1F310;";

    Perl 5.20.0 complains about this, saying:

    Unrecognized character \x{1f310}; marked by <-- HERE after my $<-- HERE near column 5 at line 9.

    This is even though the character is in Unicode 6.3.0, which Perl 5.20.0 supports.

    So why isn't it working? Help me out, fellow monks.
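A likely explanation, going by the identifier rules in perldata, is that being in a supported Unicode version is not enough: under use utf8, an identifier has to start with an underscore or with a character that is both a word character and has the XID_Start property. U+4EBA is a letter and qualifies; U+1F310 is a symbol and does not. The sketch below checks this; the helper name `can_start_identifier` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: can this code point begin an identifier under
# "use utf8"? Assumes the rule from perldata: underscore, or a character
# matching both \w and \p{XID_Start}.
sub can_start_identifier {
    my ($codepoint) = @_;
    my $char = chr $codepoint;
    return 1 if $char eq '_';
    return ($char =~ /\p{XID_Start}/ && $char =~ /\w/) ? 1 : 0;
}

for my $codepoint (0x4EBA, 0x1F310) {
    printf "U+%04X: %s\n", $codepoint,
        can_start_identifier($codepoint) ? "allowed" : "not allowed";
}
```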

perl 5.21.10 released in Perl News
1 direct reply
by AppleFritter
on Mar 20, 2015 at 17:21

    Perl 5.21.10, another development release, came out on March 20th (that's today!). Get it on CPAN or on metaCPAN while it's hot!

    And here's the perldelta as well:

    (This is my first time posting a piece of Perl news. If I broke anything, e.g. a link, please /msg me and I'll fix it.)

Identifying scripts (writing systems) in Cool Uses for Perl
2 direct replies
by AppleFritter
on Sep 16, 2014 at 17:32

    Dear monks and nuns, priests and scribes, popes and antipopes, saints and stowaways lurking in the monastery, lend me your ears. (I promise I'll return them.) I'm still hardly an experienced Perl (user|programmer|hacker), but allow me to regale you with a story of how Perl has been helping me Get Things Done™; a Cool Use for Perl, or so I think.

    I was recently faced with the problem of producing, given a number of lines each written in a specific script (i.e. writing system; Latin, Katakana, Cyrillic etc.), a breakdown of scripts used and how often they appeared. Exactly the sort of problem Perl was made for - and thanks to regular expressions and Unicode character classes, a breeze, right?

    I started by hardcoding a number of scripts to match my snippets of text against:

    my %scripts;
    foreach (@lines) {
        my $script = m/^\p{Script=Latin}*$/    ? "Latin"
                   : m/^\p{Script=Cyrillic}*$/ ? "Cyrillic"
                   : m/^\p{Script=Han}*$/      ? "Han"
                   # ...
                   : "(unknown)";
        $scripts{$script}++;
    }

    Obviously there's a lot of repetition going on there, and though I had a list of scripts for my sample data, I wasn't sure new and uncontemplated scripts wouldn't show up in the future. So why not make a list of all possible scripts, and replace the hard-coded list with a loop?

    my %scripts;
    LINE: foreach my $line (@lines) {
        foreach my $script (@known_scripts) {
            next unless $line =~ m/^\p{Script=$script}*$/;
            $scripts{$script}++;
            next LINE;
        }
        $scripts{'(unknown)'}++;
    }

    So far, so good, but now I needed a list of the scripts that Perl knew about. Not a problem, I thought, I'll just check perluniprops; the list of properties Perl knows about was staggering, but I eventually decided that any property of the form "\p{Script: ...}" would qualify, so long as it had short forms listed (which I took as an indication that that particular property was the "canonical" form for the script in question). After some reading and typing and double-checking, I ended up with a fairly long list:

    my @known_scripts = (
        "Arabic", "Armenian", "Avestan", "Balinese", "Bamum", "Batak",
        "Bengali", "Bopomofo", "Brahmi", "Braille", "Buginese", "Buhid",
        "Canadian_Aboriginal", "Carian", "Chakma", "Cham", "Cherokee",
        "Coptic", "Cuneiform", "Cypriot", "Cyrillic",
        # ...
    );

    Unfortunately, when I ran the resulting script, Perl complained:

    Can't find Unicode property definition "Script=Chakma" at (...) line (...)

    What had gone wrong? Versions, that's what: I'd looked at the perluniprops page on, documenting Perl 5.20.0, but this particular Perl was 5.14.2 and didn't know all the scripts that the newer version did, thanks to being built against an older Unicode version. Now, I could've just looked at the locally-installed version of the same perldoc page, but - wouldn't it be nice if the script automatically adapted itself to the Perl version it ran on? I sure reckoned it'd be.

    What scripts DID the various Perl versions recognize, anyway? What I ended up doing (perhaps there's an easier way) was to look at lib/unicore/Scripts.txt for versions 5.8, 5.10, ..., 5.20 in the Perl git repo (I skipped 5.6 and earlier, because a) the relevant file didn't exist in the tree yet back then, and b) those versions are ancient, anyway). And by "look at", I mean download (as scripts-58.txt etc.), and then process:

    $ for i in 8 10 12 14 16 18 20; do perl scripts-5$i.txt >5$i.lst; done
    $ for i in 8 10 12 14 16 18; do diff --unchanged-line-format= --new-line-format=%L 5$i.lst 5$((i+2)).lst >5$((i+2)).new; done
    $

    The script run in the first loop was a little helper script to extract script information (apologies for the confusing terminology, BTW):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw/say/;

    my %scripts;
    while(<>) {
        next unless m/; ([A-Za-z_]*) #/;
        $scripts{$1}++;
    }

    $, = "\n";
    say sort { $a cmp $b }
        map { $_ = ucfirst lc; $_ =~ s/(?<=_)(.)/uc $1/ge; qq/"$_"/ }
        keys %scripts;

    I admit, I got lazy at this point and manually combined those files (58.lst, as well as, etc.) into a hash holding all the information, instead of having a script output it. Nonetheless, once this was done, I could easily load all the right scripts for a given Perl version:

    # New Unicode scripts added in Perl 5.xx
    my %uniscripts = (
        '8' => [
            "Arabic", "Armenian", "Bengali", "Bopomofo", "Buhid",
            "Canadian_Aboriginal", "Cherokee", "Cyrillic", "Deseret",
            "Devanagari", "Ethiopic", "Georgian", "Gothic", "Greek", "Gujarati",
            "Gurmukhi", "Han", "Hangul", "Hanunoo", "Hebrew", "Hiragana",
            "Inherited", "Kannada", "Katakana", "Khmer", "Lao", "Latin",
            "Malayalam", "Mongolian", "Myanmar", "Ogham", "Old_Italic", "Oriya",
            "Runic", "Sinhala", "Syriac", "Tagalog", "Tagbanwa", "Tamil",
            "Telugu", "Thaana", "Thai", "Tibetan", "Yi"
        ],
        '10' => [
            "Balinese", "Braille", "Buginese", "Common", "Coptic", "Cuneiform",
            "Cypriot", "Glagolitic", "Kharoshthi", "Limbu", "Linear_B",
            "New_Tai_Lue", "Nko", "Old_Persian", "Osmanya", "Phags_Pa",
            "Phoenician", "Shavian", "Syloti_Nagri", "Tai_Le", "Tifinagh",
            "Ugaritic"
        ],
        '12' => [
            "Avestan", "Bamum", "Carian", "Cham", "Egyptian_Hieroglyphs",
            "Imperial_Aramaic", "Inscriptional_Pahlavi",
            "Inscriptional_Parthian", "Javanese", "Kaithi", "Kayah_Li",
            "Lepcha", "Lisu", "Lycian", "Lydian", "Meetei_Mayek", "Ol_Chiki",
            "Old_South_Arabian", "Old_Turkic", "Rejang", "Samaritan",
            "Saurashtra", "Sundanese", "Tai_Tham", "Tai_Viet", "Vai"
        ],
        '14' => [
            "Batak", "Brahmi", "Mandaic"
        ],
        '16' => [
            "Chakma", "Meroitic_Cursive", "Meroitic_Hieroglyphs", "Miao",
            "Sharada", "Sora_Sompeng", "Takri"
        ],
        '18' => [ ],
        '20' => [ ],
    );

    (my $ver = $^V) =~ s/^v5\.(\d+)\.\d+$/$1/;

    my @known_scripts;
    foreach (keys %uniscripts) {
        next if $ver < $_;
        push @known_scripts, @{ $uniscripts{$_} };
    }

    print STDERR "Running on Perl $^V, ", scalar @known_scripts, " scripts known.\n";

    The number of scripts Perl supports this way WILL increase again soon, BTW. Perl 5.21.1 bumped the supported Unicode version to 7.0.0, adding another bunch of new scripts as a result:

    # tentative!
    '22' => [
        "Bassa_Vah", "Caucasian_Albanian", "Duployan", "Elbasan", "Grantha",
        "Khojki", "Khudawadi", "Linear_A", "Mahajani", "Manichaean",
        "Mende_Kikakui", "Modi", "Mro", "Nabataean", "Old_North_Arabian",
        "Old_Permic", "Pahawh_Hmong", "Palmyrene", "Pau_Cin_Hau",
        "Psalter_Pahlavi", "Siddham", "Tirhuta", "Warang_Citi"
    ],

    But that's still in the future. For now I just tested this on 5.14.2 and 5.20.0 (the two Perls I regularly use); it worked like a charm. All that was left to do was outputting those statistics:

    print "Found " . scalar keys(%scripts) . " scripts:\n";
    print "\t$_: " , $scripts{$_}, " line(s)\n"
        foreach(sort { $a cmp $b } keys %scripts);

    (You'll note that in the above two snippets, I'm using print rather than say, BTW. That's intentional: say is only available from Perl 5.10 on, and this script is supposed to be able to run on 5.8 and above.)

    Fed some sample data that I'm sure Perlmonks would mangle badly if I tried to post it, this produced the following output:

    Running on Perl v5.14.2, 95 scripts known.
    Found 18 scripts:
        Arabic: 21 line(s)
        Bengali: 2 line(s)
        Cyrillic: 12 line(s)
        Devanagari: 3 line(s)
        Georgian: 1 line(s)
        Greek: 1 line(s)
        Gujarati: 1 line(s)
        Gurmukhi: 1 line(s)
        Han: 29 line(s)
        Hangul: 3 line(s)
        Hebrew: 1 line(s)
        Hiragana: 1 line(s)
        Katakana: 1 line(s)
        Latin: 647 line(s)
        Sinhala: 1 line(s)
        Tamil: 4 line(s)
        Telugu: 1 line(s)
        Thai: 1 line(s)

    Problem solved! And not only that, it's futureproof now as well, adapting to additional scripts in my input data, and easily extended when new Perl versions support more scripts, while maintaining backward compatibility.

    What could still be done? Several things. First, I should perhaps find out if there's an easy way to get this information from Perl, without actually doing all the above.
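One candidate for getting the list from Perl itself is the core module Unicode::UCD, whose `charscripts()` function returns a hash reference keyed by script name for the Unicode tables that particular perl was built with. The sketch below is only an illustration: the helper name `known_scripts` is hypothetical, and it assumes the names returned are spelled in a way `\p{Script=...}` accepts, which may not hold on every older perl.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscripts);

# Hypothetical helper: names of all scripts defined in this perl's
# Unicode tables, sorted alphabetically.
sub known_scripts {
    return sort keys %{ charscripts() };
}

my @known_scripts = known_scripts();
print STDERR "Running on Perl $^V, ", scalar @known_scripts,
    " scripts known.\n";
```

The count printed this way need not match the hand-built table, since `charscripts()` may include entries the table leaves out.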

    Second, while Perl 5.6 and earlier aren't supported right now, they could be. Conveniently, the 3rd edition of Programming Perl documents Perl 5.6; the \p{Script=...} syntax for character classes doesn't exist yet, I think, but one could write \p{In...} instead, e.g. \p{InArabic}, \p{InTamil} and so on. Would this be worth it? Not for me, but the possibility is there if someone else ever had the need to run this on an ancient Perl. (Even more ancient Perls may not have the required level of Unicode support for this, though I wouldn't know for sure.)

    Lastly, since the point of this whole exercise was to identify writing systems used for snippets of text, there's room for optimization. Perhaps it would be faster to precompile a regular expression for each script, especially if @lines is very large. Most of the text I'm dealing with is in the Latin script; as such, I should perhaps test for that before anything else, and generally try to prioritize so that lesser-used scripts are pushed further down the list. Since I'm already keeping a running total of how often each script has been seen, this could even be done adaptively, though whether doing so would be worth the overhead in practice is another question, one that could only be answered by measuring.
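The precompilation idea could be sketched as follows, using `qr//` so that each pattern is built once instead of once per line and script. The short script list, the helper name `count_scripts` and the sample strings are assumptions for illustration; whether this is actually faster would have to be measured.

```perl
use strict;
use warnings;

# Assumed, shortened list; the most common script comes first so that it
# is tested before the others.
my @known_scripts = qw(Latin Cyrillic Han);

# Compile one anchored pattern per script up front.
my @matchers = map {
    my $name = $_;
    [ $name, qr/^\p{Script=$name}*$/ ]
} @known_scripts;

# Hypothetical helper: count lines per script, "(unknown)" if none fits.
sub count_scripts {
    my @lines = @_;
    my %count;
    LINE: foreach my $line (@lines) {
        foreach my $matcher (@matchers) {
            my ($name, $pattern) = @$matcher;
            next unless $line =~ $pattern;
            $count{$name}++;
            next LINE;
        }
        $count{'(unknown)'}++;
    }
    return \%count;
}

# Usage: one Latin line, one Cyrillic line, one Han line.
my $count = count_scripts("abc", "\x{43f}\x{440}\x{438}", "\x{4EBA}");
print "$_: $count->{$_} line(s)\n" foreach sort keys %$count;
```

As in the original loop, a line containing a space counts as "(unknown)" here, because the space character belongs to the Common script rather than to Latin.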

    But neither speed nor support for ancient Perls is crucial to me, so I'm done. This was a fun little problem to work on, and I hope you enjoyed reading about it.
