User since: Apr 27, 2014 at 20:42 UTC (10 years ago)
Last here: Aug 10, 2021 at 08:25 UTC (3 years ago)
Experience: 5780
Level: Vicar (15)
Writeups: 533
Location: Equestria
Species: Pony
Motto: How do you like them apples?
Curacy: 671st Monk in the Book, 2014-09-04
Unitarian Jihad name: Mother Superior Mutual Assured Destruction of Appreciative Joy
Hymn: Pie Iesu domine, dona eis requiem *bonk*
Howdy, partner! Name's Apple Fritter, pleasure to meet y'all! I use Perl, but I don't know that much about it (yet). I'm trying to change that, so I frequent the Monastery, reading others' answers and code to learn, and providing my own answers and code to hone my skills.
If I come across useful advice, tips, modules, code snippets, articles etc., I usually add it to my home node (which you are reading right now) for future reference. Maybe you'll find it useful, too!
Not affiliated with Tom Owad's applefritter.com.
Note: I'm not active on Perlmonks anymore. I may still update my home node when I come across items worth adding.
Vere papa mortuus est!
For new users:
Introductions to the Monastery:
- PerlMonks for the Absolute Beginner
- The Perl Monks Guide to the Monastery
- PerlMonks FAQ
- New Monks
- Spirit of the Monastery:
Perlmonks
relies upon a spirit of fellowship which places the responsibility for
brotherly conduct on the individual Monk.
Highly opinionated and informed contributions are encouraged, but never at the expense of
mutual respect among Monks, adherence to the agreed upon rules of the
Monastery, or the basic joy of perl.
Protection of these vital elements serves to eliminate adverse conduct
from the Monastery.
Such actions as taunting of other Monks, aggression in posts, belligerent
intimidation, intentional disrespect, or other self-aggrandizing
inconsiderate behavior are contrary to the Spirit of the Monastery
and must be avoided by all Monks.
On civility/kindness:
- Larry Wall, keynote address at YAPC::Europe 2015: Get Ready to Party:
You don't always have to agree with your companions on the road, but it certainly helps to be friendly if you disagree.
- Don't bite the newbies
- perlpolicy:
Always be civil. [...] Civility is simple: stick to the facts while avoiding demeaning remarks and sarcasm. It is not enough to be factual. You must also be civil. Responding in kind to incivility is not acceptable.
While civility is required, kindness is encouraged; if you have any doubt about whether you are being civil, simply ask yourself, "Am I being kind?" and aspire to that.
- "The first rule of ethics is "don't be a dick", from which all other rules logically follow."
- Re^7: RAM: It isn't free . . . (the aggrieved troll their troll)
- Re^4: When do we change our replies? (approving):
My main advice to everybody related to this is for one to only respond to questions where one has something helpful to offer in response and for which one is particularly suited to answer. [...]
If a question annoys you, then minimize your annoyance by immediately moving on to something more enjoyable for you. Please try to refrain from sharing your annoyance so that we all get to suffer from it. Most of you are probably even clever enough to figure out a lot of the questions that are likely to end up annoying you so you can avoid even clicking through to them in the first place.
If a question annoys everybody, then everybody will ignore it. The history of the internet says that's one of the best ways to end something. If the question doesn't annoy everybody, then we have a case of somebody asking a question and some others willingly answering the question via a web site. That sounds a lot like "success".
Further reading: Useful homenodes
Introductions to Perl and resources for learning Perl:
Introductions, first steps and general information:
Tutorials:
Best practices and other information:
FAQs:
Books:
For non-IT folks, e.g. biologists:
There are also many books dedicated to specific topics such as Perl/Tk, DBI, Perl and ☞XML, ☞CGI programming with Perl, and much much more; see Perl Reference Materials: Books for an (outdated) list.
Other lists and resources:
Reviews, opinions etc.:
Asking questions (on Perlmonks and elsewhere):
How to ask questions (based on ww, Re: Replace key pair value from one to other file):
- Above all, welcome to the Monastery!
- Read the instructions ("Asking questions effectively", "Formatting your write-up", below).
- Read the documentation.
- Show effort. Write some code; at the very least, try. Help is free; doing your job for you is not.
- Describe what you want to accomplish. Be precise.
- Show us your code.
- Describe failures, expected results, and actual results.
- If applicable, show us verbatim (!) error messages/warnings.
- If applicable, give us some sample data.
- Give us the larger picture: tell us what you want to achieve, not just how you decided to go about achieving it. There may be better ways of doing it that you haven't contemplated.
- Remember, we're here to help, but we need your help to help you.
Asking questions effectively:
Formatting your write-up:
Other places to get help:
Other places to learn about Perl:
N.B. when crossposting to several sites, it is considered polite to inform readers of this and provide links to avoid unnecessary/duplicated effort.
For established users:
Combinatorics:
Daemons:
Databases:
Also see ☞Unicode flags for database drivers further down.
Data munging:
Data structures:
Date/time manipulation:
Articles:
Parsing:
Time zone conversion:
Debugging:
Design patterns:
Distros, packages etc. (e.g. for Windows users):
Email:
Errors / Warnings:
- Making die print stack traces:
eval / Exceptions:
External commands:
File input/output:
Input:
Output:
- CPAN: IO::Tee - write to many files/handles at once.
File names:
- CORE: File::Basename - split filenames into path and (actual) filename.
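A minimal sketch of what File::Basename does with a path. The path below is made up for illustration, and Unix-style separators are assumed:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse basename dirname);

# Hypothetical path, used only for illustration.
my $path = '/usr/share/dict/words.txt';

# fileparse returns (filename, directory, suffix). The suffix is only split
# off when it matches one of the patterns passed after the path.
my ($name, $dir, $suffix) = fileparse($path, qr/\.[^.]*/);
print "dir=$dir name=$name suffix=$suffix\n";

# basename keeps the suffix; dirname drops the trailing slash.
my $base   = basename($path);
my $parent = dirname($path);
print "$base lives in $parent\n";
```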
Graphs (the mathematical kind):
Graphs, charts and plots:
- There is no great plotting module for Perl. You may want to consider shelling out to Python and using matplotlib.
Modules:
- CPAN: GD::Graph - decent for bar/line charts, pie charts suck, PNG/GIF output
- CPAN: SVG::TT::Graph - decent pie charts, limited bar charts, SVG output
HTML:
Parsing:
General tips:
- Don't use regular expressions. You will get it wrong; use an HTML parsing module.
- You may be able to use ☞XML parsing if you're dealing with XHTML.
Articles:
Modules:
List processing:
- CORE: List::Util - reduce, any/all, first, sum/product, min/max, pairgrep, pairmap etc.
- CPAN: List::MoreUtils - uniq, zip, etc.
- CPAN: List::AllUtils (the previous two in one convenient module)
- Missing from List::Util / List::MoreUtils / List::AllUtils: pairwise_distinct (workaround: uniq(@list) == @list).
- CPAN: List::Compare - union, intersection, differences, symmetric difference etc.
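The pairwise_distinct workaround mentioned above can be wrapped in a small helper. This is only a sketch: the function name is taken from the note above, not from any module, and uniq needs List::Util 1.45 or later. Note that uniq compares elements as strings.

```perl
use strict;
use warnings;
use List::Util qw(uniq);   # uniq requires List::Util 1.45 or later

# True if no two elements of the list are equal (compared as strings).
sub pairwise_distinct {
    my @list = @_;
    # In numeric comparison both sides are evaluated in scalar context:
    # uniq returns the number of distinct elements, @list its length.
    return uniq(@list) == @list;
}

print pairwise_distinct(1, 2, 3) ? "distinct\n" : "duplicates\n";
print pairwise_distinct(1, 2, 1) ? "distinct\n" : "duplicates\n";
```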
Logic:
Logging:
Math:
Basic arithmetic:
Large numbers:
Marshalling/serialization:
MediaWiki:
OOP (object-oriented programming):
Operators:
- <> is shorthand for <ARGV>, which is just as magic. Corollary: *ARGV is magic as well.
- .. and ... (range/flip flop):
- Perl's secret operators: perlsecret
- More secret operators: new "!"-based secret operators
Optimization:
- Premature optimization is the root of all evil.
- Athanasius, Recamán's sequence and memory usage:
[O]ptimising an algorithm may actually consist in optimising its underlying data structures. Obvious? Yes, but still worth a reminder now and then.
- raven667, Re: Firefox 50.0 (lwn.net):
Efficiency gains should be targeted based on real world profiling and not based on review of code that "looks slow" as you will waste a ton of time lost in the details, chasing down non-existent performance problems, sometimes making things worse if you fight the compiler, and missing the big issues which are usually more fundamental to the design and data structure usage or locking in the hottest paths of the application.
- CPAN: Devel::NYTProf - powerful, fast, feature-rich Perl source code profiler
- Caching:
- The two hardest problems in computer science are cache invalidation and cache invalidation.
- CPAN: Memoize - transparently cache function results
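As a small illustration of the Memoize entry above, here is a sketch using the classic naive Fibonacci function. The function and the call counter are made up for the example; only memoize() itself comes from the module.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;   # counts how often the real function body runs

# Deliberately naive: without caching this recurses exponentially.
sub fib {
    my ($n) = @_;
    $calls++;
    return $n < 2 ? $n : fib($n - 1) + fib($n - 2);
}

# Replace fib with a caching wrapper. The recursive calls inside fib
# go through the wrapper too, so each argument is computed only once.
memoize('fib');

my $result = fib(30);
print "fib(30) = $result after $calls real calls\n";
```

Memoizing only makes sense for functions whose result depends solely on their arguments; anything that reads the clock, a file, or a database will hand back stale answers.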
Option processing:
References:
Regular expressions, parsing and grammars:
Security:
Signals:
Sorting:
Statistics (the mathematical kind):
Temporary files:
Text input/output:
Threads:
UIs:
- CPAN: AnyEvent (generic event loop)
- CPAN: Curses (note that there's a reason it's called that)
Unicode/UTF8:
HOWTOs, BCPs, tips and tricks:
- Keep in mind the difference between bytes, codepoints, and characters ("extended grapheme clusters"). Variable-length encodings (UTF8) complicate things. So do combining diacritics.
- Make STDOUT use UTF-8: binmode STDOUT, ':utf8'; (from perldiag).
- "Magic incantation" for defaulting to UTF8 when opening files, and also for STD*: use open IO => ':utf8', ':std';. Actually, :encoding(UTF-8) may be better than :utf8, see Re: A UTF8 round trip with MySQL.
- hippo, in Re: Matching/replacing a unicode character only works after decode():
The correct order of operations for working with encoded data (whether utf8 or any other encoding) is:
- Input
- Decode
- Operate
- Encode
- Output
If you don't decode your input you'll be comparing apples and elephants which is why your regex fails to match. However, if you do no operations on the data at all, then you can skip the middle three steps because your perl script in that case is just essentially a pipe between your input (eg. database) and your output (eg. web page).
- UTF-8 text files with Byte Order Mark
- Check whether Perl thinks your data is UTF8: $flag = utf8::is_utf8($string);
- Unicode flags for database drivers:
- MySQL and UTF-8:
[...] MySQL offers a "charset" named UTF8. Guess what, it's not UTF8. It's actually a synonym for UTF8MB3, which is MySQL's bizarre internal "UTF8 except we only allow 3 bytes per character" rule. If you actually need UTF8 you must upgrade to a very new version and explicitly ask MySQL for "UTF8MB4".
Anybody who has used MySQL before can guess what happens if you try to insert actual Unicode data (say, an HTML-ised comment your PHP blogging framework wants to store) into one of these UTF8 columns. Afraid to incur your wrath with an error you probably haven't handled correctly, MySQL will quietly truncate the string, removing everything from the offending codepoint onwards. [...]
- Catching "Unicode non-character" warnings - with a good and exhaustive reply by Tom Christiansen (might be outdated)
- perlunicook -- cookbookish examples of handling Unicode in Perl (also here, by Tom Christiansen)
- JSON::XS has "a few notes on Unicode and Perl":
Since this often leads to confusion, here are a few very clear words on how Unicode works in Perl, modulo bugs.
- Perl strings can store characters with ordinal values > 255.
This enables you to store Unicode characters as single characters in a Perl string - very natural.
- Perl does not associate an encoding with your strings.
... until you force it to, e.g. when matching it against a regex, or printing the scalar to a file, in which case Perl either interprets your string as locale-encoded text, octets/binary, or as Unicode, depending on various settings. In no case is an encoding stored together with your data, it is use that decides encoding, not any magical meta data.
- The internal utf-8 flag has no meaning with regards to the encoding of your string.
Just ignore that flag unless you debug a Perl bug, a module written in XS or want to dive into the internals of perl. Otherwise it will only confuse you, as, despite the name, it says nothing about how your string is encoded. You can have Unicode strings with that flag set, with that flag clear, and you can have binary data with that flag set and that flag clear. Other possibilities exist, too.
If you didn't know about that flag, just the better, pretend it doesn't exist.
- A "Unicode String" is simply a string where each character can be validly interpreted as a Unicode code point.
If you have UTF-8 encoded data, it is no longer a Unicode string, but a Unicode string encoded in UTF-8, giving you a binary string.
- A string containing "high" (> 255) character values is not a UTF-8 string.
It's a fact. Learn to live with it.
- Markus Kuhn: UTF-8 and Unicode FAQ for Unix/Linux
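The input, decode, operate, encode, output order quoted from hippo above can be sketched with the core Encode module. The sample string is made up; the point is the difference between octets and characters at each step.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input: raw octets, as they would arrive from a file or socket.
# These are the UTF-8 octets for "naive" with a diaeresis on the i.
my $bytes = "na\xC3\xAFve";

# Decode: turn the octets into a string of characters.
my $text = decode('UTF-8', $bytes);

# Operate: string functions now see characters, not octets.
my $upper = uc $text;

# Encode: turn the characters back into octets for output.
my $out = encode('UTF-8', $upper);

printf "%d octets in, %d characters, %d octets out\n",
    length $bytes, length $text, length $out;
```

Without the decode step, uc would see two separate octets instead of one accented character and would leave them untouched.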
Win32-specific issues:
Useful CPAN modules:
Fonts:
Scripts and tools:
Talks, articles, references, presentations and meditations:
Interesting questions and discussions:
Using utf8 in your script proper:
- use utf8;
- Identifiers can contain Unicode, but not arbitrary Unicode characters. See perldata:
If working under the effect of the use utf8; pragma, the following rules apply:
/ (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
(?[ ( \p{Word} & \p{XID_Continue} ) ]) * /x
That is, a "start" character followed by any number of "continue" characters. Perl requires every character in an identifier to also match \w (this prevents some problematic cases); and Perl additionally accepts identifier names beginning with an underscore.
Variables:
WWW:
XML:
Misc.:
The Monastery:
General infrastructure:
Modules / *PAN:
News etc.:
Perl culture:
Misc. (unordered, unsorted):
Due to the 64 KiB node size limit, this section now resides in AppleFritter's scratchpad.
Monk quotes:
Do not fear death, you will re-awaken to a world built with Perfect Perl 7 and no Python.
-- boftx, Re^3: Using die() in methods
the moment you try to separate the physical construction of code -- kloc, function points, abstracts test quantities -- from the intellectual processes of gathering requirements; understanding work-patterns and flows; and imagining suitable, appropriate, workable algorithms to meet them; you do not have sufficient understanding of the process involved in code development to be making decisions about it.
-- BrowserUk, Re: Nobody Expects the Agile Imposition (Part VII): Metrics
You were unlucky in the sense that your program seems to have remained valid Perl even with all variables removed.
-- Corion, Re: [OneLiner] What am I doing wrong in my regex?
I insist on being paid to use Windows products, sir!
-- Your Mother, Re^3: PerlWizard - A free wizard for automatic Perl software code generation using simple forms
No further rational discussion is possible here because I find your preferred style utterly abhorrent :)
-- BrowserUk, Re^3: Porting (old) code to something else
AppleFritter elsewhere:
Two monks sat together for lunch. The first monk said, "What do you see when you see me?"
The second replied, "I see a reflection of the Buddha."
The first, feeling nasty, said, "When I look at you, I see a pile of shit."
The second just smiled. The first turned angry. "Why are you smiling?"
The second replied, "What comes out of a man is a reflection of what's inside a man. I am filled with the Buddha nature, so everywhere I look, I see a reflection of the Buddha."
Posts by AppleFritter
Safely capturing the output of an external program
in Seekers of Perl Wisdom (4 direct replies)
by AppleFritter on Mar 08, 2020 at 19:52
Esteemed monks,
I'm sure this has been asked (and answered) before, but I can't seem to find said question. I'd like to call from within Perl an external program, passing it some arguments, and capture its output. Usually I'd reach for backticks or the qx// operator, but the arguments that need to be passed come from user-supplied data, and while the program being called itself should be safe to invoke, there's the issue of the shell and its shenanigans.
To give a bit more context, I'm working with a TeX installation and need to call kpsewhich (a wrapper around the kpathsea library, which will help you locate various files that TeX will make use of). So I'd want to get the output of, say, kpsewhich cmr10.tfm; but the name of the file I'm looking up comes from a user-supplied file I have no control over, and I'd rather not feed kpsewhich cmr10.tfm ; evil_things_go_here to the shell. (You get the idea.)
As far as I'm aware system and exec have "safe" invocations that will avoid the shell (even on braindead OSes, like Windows). Does qx//? Or for that matter, is there another (different, possibly better) way to locate TeX's files? A Perl wrapper for the kpathsea library, perhaps? (This manpage hints that such a thing exists, but it's not on CPAN AFAICT.)
Thanks.
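One possible approach, sketched here rather than taken from the replies to this post, is the list form of a piped open, which hands the arguments to the program directly instead of going through a shell. The helper name capture_output is made up, and $^X (the running perl) stands in for kpsewhich so the example runs anywhere. As far as I know, list-form pipe opens on Windows need a reasonably recent Perl, so treat portability as an assumption to verify.

```perl
use strict;
use warnings;

# Run a command without involving a shell and return its standard output.
# capture_output is a hypothetical helper name, not a core function.
sub capture_output {
    my @cmd = @_;
    # List form of pipe open: each element is passed as one argument.
    open(my $pipe, '-|', @cmd) or die "Cannot run $cmd[0]: $!\n";
    my $output = do { local $/; <$pipe> };   # slurp everything
    close($pipe) or die "$cmd[0] failed (status $?)\n";
    return $output;
}

# The "filename" contains shell metacharacters, but no shell ever sees it.
my $untrusted = 'cmr10.tfm ; evil_things_go_here';
my $out = capture_output($^X, '-e', 'print scalar(@ARGV), ":", $ARGV[0]',
                         $untrusted);
print "$out\n";
```

The child program receives the whole string as a single argument, semicolon included, which is exactly what is wanted for a lookup like kpsewhich.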
Accessing SQLite databases within ZIP files
in Seekers of Perl Wisdom (7 direct replies)
by AppleFritter on Oct 01, 2017 at 07:22
Dearest life forms lurking in the Monastery!
I'm trying to process resource files produced by a third-party application. These resource files are actually ZIP files containing, among other things:
- an SQLite database;
- a bunch of binary blobs (stored as file entries in the ZIP archive, rather than as BLOBs in the SQLite DB); and
- a JSON file mapping resource identifiers used in the DB to the binaries' filenames.
I'd like to access all this data. I'd also like to do this in the easiest, DWIMiest, most natural manner possible.
The most straightforward way is of course to extract the ZIP file, and then use DBI, JSON::XS and whatever modules are appropriate to handle the binaries (images, sounds, videos etc). But I'd like to avoid this, if possible; I want to be able to point my script at the ZIP file without having to worry about disk space, clean-up, and all that.
There's a variety of modules on CPAN for transparently handling ZIP archives (in fact, IO::Uncompress::Unzip is in core). What I have not found is a way of accessing a database without extracting it to disk first. More precisely, what I'd like to do is either:
- have DBD::SQLite read the DB directly from the ZIP file, using some kind of transparent intermediary layer; or
- extract the DB into memory (i.e. a Perl scalar), and then have DBD::SQLite read that.
I only need to read the DB, BTW, not modify it, so any complications to do with putting modifications back into the ZIP can safely be ignored.
So, my question is: is this possible, using only existing CPAN modules? A cursory search didn't reveal anything useful.
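The "extract into memory" half of the second option can be sketched with core modules alone. This does not answer whether DBD::SQLite can then open a database held in a scalar; it only shows pulling a single archive member into a Perl scalar. The member name resources.db and the payload are made up, and the archive is built in memory so the example is self-contained.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(zip   $ZipError);
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Build a tiny archive in memory. 'resources.db' is a hypothetical member
# name; in a real resource file it would hold the SQLite database bytes.
my $payload = "stand-in for the SQLite database bytes";
my $archive;
zip(\$payload => \$archive, Name => 'resources.db')
    or die "zip failed: $ZipError\n";

# Extract one named member straight into a scalar, without touching disk.
my $db_bytes;
unzip(\$archive => \$db_bytes, Name => 'resources.db')
    or die "unzip failed: $UnzipError\n";

printf "extracted %d bytes\n", length $db_bytes;
```

For a file on disk, the first argument to unzip would be the filename instead of a scalar reference.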
Faster alternative to Math::Combinatorics
in Seekers of Perl Wisdom (6 direct replies)
by AppleFritter on Sep 01, 2017 at 09:20
Oh monks of the round table, who dance whene'er they're able, who dine well here in Camelot and eat ham and jam and spam a lot!
Can someone recommend a faster alternative to Math::Combinatorics, or maybe suggest a better way of doing the following?
I'm trying to generate all multisets (bags) of a specific total "weight" (let's call it w), where each element comes from a given list (of numbers, in this case), and each list element may have multiplicity 0..w in each multiset. In other words, what I'm trying to generate is a list of w-tuples of elements of the given list — but unordered tuples rather than ordered ones.
An example may be instructive. Let's say w is 4, and the list is (0, 2, 3). Then I'd like to get the following multisets:
0,0,0,0
0,0,0,2
0,0,0,3
0,0,2,2
0,0,2,3
0,0,3,3
0,2,2,2
0,2,2,3
0,2,3,3
0,3,3,3
2,2,2,2
2,2,2,3
2,2,3,3
2,3,3,3
3,3,3,3
(The order in which the multisets themselves are generated isn't important to me either, BTW. I've only listed them in order for the sake of readability.)
Not wanting to implement this myself, I turned to CPAN and found Math::Combinatorics. This works, but it's fairly slow. Here's a (slightly simplified) excerpt from my code:
#!/usr/bin/perl
use Modern::Perl '2015';
use Math::Combinatorics;
my $states = 4;
foreach my $count (1, 2, 3, 4, 7, 8) {
say "count=$count";
my $iter = Math::Combinatorics->new(
count => $count,
data => [ grep { $_ != 1 } (0 .. ($states - 1)) ],
frequency => [($count) x ($states - 1)]
);
while(my @states = $iter->next_multiset) {
say join(",", @states);
}
}
This produces the desired output, but it takes almost 90 seconds to run for $states = 4, and much longer for 5 and up.
90 seconds wouldn't be so bad, since this is part of a larger script to generate datafiles that only really needs to be run once (to generate the file). But I'd rather not spend days waiting for it to finish for higher values of $states.
Any suggestions? Like I said, I'd prefer to stick to CPAN, but I'll take what I can get.
Thanks!
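For comparison, here is a minimal hand-rolled sketch of the enumeration described above (combinations with repetition). It is not from CPAN and not from the replies; the function name multisets is made up. It builds the whole result list in memory, so an iterator would be a better fit for large inputs, and no claim is made here about how its speed compares with Math::Combinatorics.

```perl
use strict;
use warnings;

# Return all multisets of size $w drawn from @items, each as an array ref.
# Elements within a multiset keep the order of @items, so each unordered
# combination is produced exactly once.
sub multisets {
    my ($w, @items) = @_;
    return ([]) if $w == 0;          # one way to pick nothing
    my @result;
    for my $i (0 .. $#items) {
        # Pick $items[$i] first, then fill the rest using only items
        # from position $i onwards (repetition allowed).
        push @result,
            map { [ $items[$i], @$_ ] }
            multisets($w - 1, @items[$i .. $#items]);
    }
    return @result;
}

my @bags = multisets(4, 0, 2, 3);
print join(",", @$_), "\n" for @bags;
```

For w = 4 and the list (0, 2, 3) this yields the 15 multisets listed in the post.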
Size-limited, fitness-based lists
in Cool Uses for Perl (3 direct replies)
by AppleFritter on Aug 08, 2015 at 19:05
Monks and monkettes! I recently found myself wondering, what are the longest words in the dictionary (/usr/share/dict, anyway)?
This is easily found out, but it's natural to be interested not just in the longest word but (say) the top ten. And when your dictionary contains (say) eight words of length fifteen and six words of length fourteen, it's also natural to not want to arbitrarily select two of the latter, but list them all.
I quickly decided I needed a type of list that would have a concept of the fitness of an item (not necessarily the length of a word), and try not to exceed a maximum size if possible (while retaining some flexibility). My CPAN search-fu is non-existent, but since it sounded like fun, I just rolled my own. Here's the first stab at what is right now called List::LimitedSize::Fitness (if anyone's got a better idea for a name, please let me know):
This features both "flexible" and "strict" policies. With the former, fitness classes are guaranteed to never lose items, but the list as a whole might grow beyond the specified maximum size. With the latter, the list is guaranteed to never grow beyond the specified maximum size, but fitness classes might lose items. (Obviously you cannot have it both ways, not in general.)
Here's an example of the whole thing in action:
This might output (depending on your dictionary):
$ perl longestwords.pl wordsEn.txt
..........
length 21
antienvironmentalists
antiinstitutionalists
counterclassification
electroencephalograms
electroencephalograph
electrotheraputically
gastroenterologically
internationalizations
mechanotheraputically
microminiaturizations
microradiographically
length 22
counterclassifications
counterrevolutionaries
electroencephalographs
electroencephalography
length 23
disestablismentarianism
electroencephalographic
length 25
antidisestablishmentarian
length 28
antidisestablishmentarianism
19 words total (10 requested).
$
If you've got any thoughts, tips, comments, rotten tomatoes etc., send them my way! (...actually, forget about the rotten tomatoes.)
Also, does anyone think this module would be useful to have on CPAN, in principle if not in its current state?
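The module source is not reproduced on this page. As a rough illustration of the "flexible" policy described above, and explicitly not the author's implementation, here is a sketch that stores items per fitness class and drops the lowest class only when the remaining items still reach the requested size. The function name add_flexible and the data layout are assumptions.

```perl
use strict;
use warnings;

# %$list maps a fitness value to an array ref holding that fitness class.
# add_flexible is a hypothetical name, not part of List::LimitedSize::Fitness.
sub add_flexible {
    my ($list, $max, $fitness, $item) = @_;
    push @{ $list->{$fitness} }, $item;

    my $total = 0;
    $total += @$_ for values %$list;

    # Drop whole classes, lowest fitness first, but only while the list
    # would still hold at least $max items afterwards.
    for my $lowest (sort { $a <=> $b } keys %$list) {
        my $size = @{ $list->{$lowest} };
        last if $total - $size < $max;
        delete $list->{$lowest};
        $total -= $size;
    }
    return;
}

my %longest;
add_flexible(\%longest, 3, length($_), $_) for qw(a bb cc ddd eee ffff);
print "length $_: @{ $longest{$_} }\n" for sort { $a <=> $b } keys %longest;
```

With a requested size of 10, the same rule reproduces the shape of the sample output above: dropping the eleven words of length 21 would leave only 8, so the class is kept and the list ends up with 19 entries.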
Resetting a flip-flop operator
in Seekers of Perl Wisdom (1 direct reply)
by AppleFritter on Aug 06, 2015 at 06:52
Greetings, esteemed monks! Allow this humble pony to drink the sweet nectar of knowledge from the font of your collective wisdom. (Or alternatively, how 'bout some hard cider?)
I need to read a number of files. In each file, each line holds a piece of data, or a marker indicating the beginning or end of a section; I'm interested only in data in a specific section. Normally, I'd do something like this:
foreach my $HANDLE (@HANDLES) {
while(<$HANDLE>) {
chomp;
next unless /^PP_START$/ .. /^PP_END$/;
# process line
}
}
However, it turns out that in these log files, the section end marker may be omitted if there is no following section: the end of the file itself indicates the end of the section then.
This wreaks havoc with the above logic, as the flip-flop operator, not having seen the marker, still evaluates to true when the outer loop moves on to the next file, and wrongly causes lines before the start marker in that file to be processed.
Of course it would be trivial to add a flag indicating whether I'm in the right section, and reset that for each file. But doing that would essentially manually emulate the flip-flop operator, which strikes me as less than elegant. So I'm wondering -- is there a way to "reset" the flip-flop operator, as it were, so that it starts returning false again at the beginning of each new file?
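One possible way around this, offered as a sketch rather than as the answer given in the thread, is not to reset the operator at all but to let end-of-file count as an end marker, so the flip-flop is always back in its "off" state when the next file starts. The function name and the in-memory sample files are made up for the example.

```perl
use strict;
use warnings;

# Collect the lines of the PP_START .. PP_END section from each handle.
# eof($fh) acts as an implicit end marker when PP_END is missing.
sub section_lines {
    my @handles = @_;
    my @lines;
    for my $fh (@handles) {
        while (my $line = <$fh>) {
            chomp $line;
            next unless $line =~ /^PP_START$/
                     .. ($line =~ /^PP_END$/ || eof($fh));
            push @lines, $line;
        }
    }
    return @lines;
}

# Two in-memory "files": the first one lacks its end marker.
my $first  = "junk\nPP_START\na\nb\n";
my $second = "skip me\nPP_START\nc\nPP_END\nafter\n";
open(my $fh1, '<', \$first)  or die "Cannot open in-memory file: $!\n";
open(my $fh2, '<', \$second) or die "Cannot open in-memory file: $!\n";

my @got = section_lines($fh1, $fh2);
print "$_\n" for @got;
```

The marker lines themselves are included in the result, just as with the original loop.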
"Unrecognized character" while use utf8 is in effect
in Seekers of Perl Wisdom (2 direct replies)
by AppleFritter on Apr 17, 2015 at 06:03
Oh monks most tawny and tangy, whose wisdom and knowledge of all things Perl is unalienable and indefeasible, help me out, for I'm very much missing the obvious.
As you will well know, Perl allows Unicode characters in variable names, so long as use utf8; is in effect. So the following snippet works as expected (apologies for the unresolved HTML entities, Perlmonks itself does not handle Unicode properly):
my $人 = "World";
say "Hello, $人";
However, the following does not:
my $🌐 = "World";
say "Hello, $🌐";
Perl 5.20.0 complains about this, saying:
Unrecognized character \x{1f310}; marked by <-- HERE after my $<-- HERE near column 5 at 1123740.pl line 9.
This is even though the character is in Unicode 6.3.0, which Perl 5.20.0 supports.
So why isn't it working? Help me out, fellow monks.
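The identifier rule from perldata quoted earlier on this home node suggests a way to check this: a character may start an identifier only if it matches \w and \p{XID_Start} (or is an underscore), and being part of a supported Unicode version is not enough. U+1F310 is a symbol, not a word character. A sketch of that check, with a made-up function name and a simplified regex instead of the (?[ ... ]) form from the documentation:

```perl
use strict;
use warnings;

# Roughly the "start" half of the perldata rule: the character must match
# \w and be either XID_Start or an underscore.
# can_start_identifier is a hypothetical helper name.
sub can_start_identifier {
    my ($char) = @_;
    return $char =~ /\A(?=\w)[\p{XID_Start}_]\z/ ? 1 : 0;
}

# U+4EBA is the Han character used in the working snippet,
# U+1F310 the character Perl complains about.
printf "U+%04X: %s\n", $_,
    can_start_identifier(chr $_) ? 'allowed' : 'rejected'
    for 0x4EBA, 0x1F310;
```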
perl 5.21.10 released
in Perl News (1 direct reply)
by AppleFritter on Mar 20, 2015 at 17:21
Perl 5.21.10, another development release, came out on March 20th (that's today!). Get it on CPAN or on metaCPAN while it's hot!
And here's the perldelta as well:
(This is my first time posting a piece of Perl news. If I broke anything, e.g. a link, please /msg me and I'll fix it.)
Identifying scripts (writing systems)
in Cool Uses for Perl (2 direct replies)
by AppleFritter on Sep 16, 2014 at 17:32
Dear monks and nuns, priests and scribes, popes and antipopes, saints and stowaways lurking in the monastery, lend me your ears. (I promise I'll return them.) I'm still hardly an experienced Perl (user|programmer|hacker), but allow me to regale you with a story of how Perl has been helping me Get Things Done™; a Cool Use for Perl, or so I think.
I was recently faced with the problem of producing, given a number of lines each written in a specific script (i.e. writing system; Latin, Katakana, Cyrillic etc.), a breakdown of scripts used and how often they appeared. Exactly the sort of problem Perl was made for - and thanks to regular expressions and Unicode character classes, a breeze, right?
I started by hardcoding a number of scripts to match my snippets of text against:
my %scripts;
foreach (@lines) {
my $script =
m/^\p{Script=Latin}*$/ ? "Latin" :
m/^\p{Script=Cyrillic}*$/ ? "Cyrillic" :
m/^\p{Script=Han}*$/ ? "Han" :
# ...
"(unknown)";
$scripts{$script}++;
}
Obviously there's a lot of repetition going on there, and though I had a list of scripts for my sample data, I wasn't sure new and uncontemplated scripts wouldn't show up in the future. So why not make a list of all possible scripts, and replace the hard-coded list with a loop?
my %scripts;
LINE: foreach my $line (@lines) {
foreach my $script (@known_scripts) {
next unless $line =~ m/^\p{Script=$script}*$/;
$scripts{$script}++;
next LINE;
}
$scripts{'(unknown)'}++;
}
So far, so good, but now I needed a list of the scripts that Perl knew about. Not a problem, I thought, I'll just check perluniprops; the list of properties Perl knows about was staggering, but I eventually decided that any property of the form "\p{Script: ...}" would qualify, so long as it had short forms listed (which I took as an indication that that particular property was the "canonical" form for the script in question). After some reading and typing and double-checking, I ended up with a fairly long list:
my @known_scripts = (
"Arabic", "Armenian", "Avestan",
"Balinese", "Bamum", "Batak", "Bengali", "Bopomofo", "Brahmi", "Braille",
"Buginese", "Buhid",
"Canadian_Aboriginal", "Carian", "Chakma", "Cham", "Cherokee",
"Coptic", "Cuneiform", "Cypriot", "Cyrillic",
# ...
);
Unfortunately, when I ran the resulting script, Perl complained:
Can't find Unicode property definition "Script=Chakma" at (...) line (...)
What had gone wrong? Versions, that's what: I'd looked at the perluniprops page on perl.org, documenting Perl 5.20.0, but this particular Perl was 5.14.2 and didn't know all the scripts that the newer version did, thanks to being built against an older Unicode version. Now, I could've just looked at the locally-installed version of the same perldoc page, but - wouldn't it be nice if the script automatically adapted itself to the Perl version it ran on? I sure reckoned it'd be.
What scripts DID the various Perl versions recognize, anyway? What I ended up doing (perhaps there's an easier way) was to look at lib/unicore/Scripts.txt for versions 5.8, 5.10, ..., 5.20 in the Perl git repo (I skipped 5.6 and earlier, because a) the relevant file didn't exist in the tree yet back then, and b) those versions are ancient, anyway). And by "look at", I mean download (as scripts-58.txt etc.), and then process:
$ for i in 8 10 12 14 16 18 20; do perl scripts.pl scripts-5$i.txt >5$i.lst; done
$ for i in 8 10 12 14 16 18; do diff --unchanged-line-format= --new-line-format=%L 5$i.lst 5$((i+2)).lst >5$((i+2)).new; done
$
scripts.pl was a little helper script to extract script information (apologies for the confusing terminology, BTW):
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
my %scripts;
while(<>) {
next unless m/; ([A-Za-z_]*) #/;
$scripts{$1}++;
}
$, = "\n";
say sort { $a cmp $b } map { $_ = ucfirst lc; $_ =~ s/(?<=_)(.)/uc $1/ge; qq/"$_"/ } keys %scripts;
I admit, I got lazy at this point and manually combined those files (58.lst, as well as 510.new, 512.new etc.) into a hash holding all the information, instead of having a script output it. Nonetheless, once this was done, I could easily load all the right scripts for a given Perl version:
# New Unicode scripts added in Perl 5.xx
my %uniscripts = (
'8' => [
"Arabic", "Armenian", "Bengali", "Bopomofo", "Buhid",
"Canadian_Aboriginal", "Cherokee", "Cyrillic", "Deseret",
"Devanagari", "Ethiopic", "Georgian", "Gothic", "Greek", "Gujarati",
"Gurmukhi", "Han", "Hangul", "Hanunoo", "Hebrew", "Hiragana",
"Inherited", "Kannada", "Katakana", "Khmer", "Lao", "Latin",
"Malayalam", "Mongolian", "Myanmar", "Ogham", "Old_Italic", "Oriya",
"Runic", "Sinhala", "Syriac", "Tagalog", "Tagbanwa", "Tamil",
"Telugu", "Thaana", "Thai", "Tibetan", "Yi"
],
'10' => [
"Balinese", "Braille", "Buginese", "Common", "Coptic", "Cuneiform",
"Cypriot", "Glagolitic", "Kharoshthi", "Limbu", "Linear_B",
"New_Tai_Lue", "Nko", "Old_Persian", "Osmanya", "Phags_Pa",
"Phoenician", "Shavian", "Syloti_Nagri", "Tai_Le", "Tifinagh",
"Ugaritic"
],
'12' => [
"Avestan", "Bamum", "Carian", "Cham", "Egyptian_Hieroglyphs",
"Imperial_Aramaic", "Inscriptional_Pahlavi",
"Inscriptional_Parthian", "Javanese", "Kaithi", "Kayah_Li",
"Lepcha", "Lisu", "Lycian", "Lydian", "Meetei_Mayek", "Ol_Chiki",
"Old_South_Arabian", "Old_Turkic", "Rejang", "Samaritan",
"Saurashtra", "Sundanese", "Tai_Tham", "Tai_Viet", "Vai"
],
'14' => [
"Batak", "Brahmi", "Mandaic"
],
'16' => [
"Chakma", "Meroitic_Cursive", "Meroitic_Hieroglyphs", "Miao",
"Sharada", "Sora_Sompeng", "Takri"
],
'18' => [
],
'20' => [
],
);
(my $ver = $^V) =~ s/^v5\.(\d+)\.\d+$/$1/;
my @known_scripts;
foreach (keys %uniscripts) {
next if $ver < $_;
push @known_scripts, @{ $uniscripts{$_} };
}
print STDERR "Running on Perl $^V, ", scalar @known_scripts, " scripts known.\n";
The number of scripts Perl supports this way WILL increase again soon, BTW. Perl 5.21.1 bumped the supported Unicode version to 7.0.0, adding another bunch of new scripts as a result:
# tentative!
'22' => [
"Bassa_Vah", "Caucasian_Albanian", "Duployan", "Elbasan", "Grantha",
"Khojki", "Khudawadi", "Linear_A", "Mahajani", "Manichaean",
"Mende_Kikakui", "Modi", "Mro", "Nabataean", "Old_North_Arabian",
"Old_Permic", "Pahawh_Hmong", "Palmyrene", "Pau_Cin_Hau",
"Psalter_Pahlavi", "Siddham", "Tirhuta", "Warang_Citi"
],
But that's still in the future. For now I just tested this on 5.14.2 and 5.20.0 (the two Perls I regularly use); it worked like a charm. All that was left to do was outputting those statistics:
print "Found " . scalar keys(%scripts) . " scripts:\n";
print "\t$_: " , $scripts{$_}, " line(s)\n" foreach(sort { $a cmp $b } keys %scripts);
(You'll note that in the above two snippets, I'm using print rather than say, BTW. That's intentional: say is only available from Perl 5.10 on, and this script is supposed to be able to run on 5.8 and above.)
Fed some sample data that I'm sure Perlmonks would mangle badly if I tried to post it, this produced the following output:
Running on Perl v5.14.2, 95 scripts known.
Found 18 scripts:
Arabic: 21 line(s)
Bengali: 2 line(s)
Cyrillic: 12 line(s)
Devanagari: 3 line(s)
Georgian: 1 line(s)
Greek: 1 line(s)
Gujarati: 1 line(s)
Gurmukhi: 1 line(s)
Han: 29 line(s)
Hangul: 3 line(s)
Hebrew: 1 line(s)
Hiragana: 1 line(s)
Katakana: 1 line(s)
Latin: 647 line(s)
Sinhala: 1 line(s)
Tamil: 4 line(s)
Telugu: 1 line(s)
Thai: 1 line(s)
Problem solved! And not only that, it's futureproof now as well, adapting to additional scripts in my input data, and easily extended when new Perl versions support more scripts, while maintaining backward compatibility.
What could still be done? Several things. First, I should perhaps find out if there's an easy way to get this information from Perl, without actually doing all the above.
Second, while Perl 5.6 and earlier aren't supported right now, they could be. Conveniently, the 3rd edition of Programming Perl documents Perl 5.6; the \p{Script=...} syntax for character classes doesn't exist yet, I think, but one could write \p{In...} instead, e.g. \p{InArabic}, \p{InTamil} and so on. Would this be worth it? Not for me, but the possibility is there if someone else ever had the need to run this on an ancient Perl. (Even more ancient Perls may not have the required level of Unicode support for this, though I wouldn't know for sure.)
Lastly, since the point of this whole exercise was to identify writing systems used for snippets of text, there's room for optimization. Perhaps it would be faster to precompile a regular expression for each script, especially if @lines is very large. Most of the text I'm dealing with is in the Latin script; as such, I should perhaps test for that before anything else, and generally try to prioritize so that lesser-used scripts are pushed further down the list. Since I'm already keeping a running total of how often each script has been seen, this could even be done adaptively, though whether doing so would be worth the overhead in practice is another question, one that could only be answered by measuring.
But neither speed nor support for ancient Perls is crucial to me, so I'm done. This was a fun little problem to work on, and I hope you enjoyed reading about it.
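On the first open question above (getting the list of scripts from Perl itself), the core module Unicode::UCD looks like a candidate. The following is a sketch under that assumption, not what the script described in this post does: charscripts() returns the script names the running Perl knows about, and charscript() gives the script of a single code point. The helper line_script is made up, and unlike the regex version it treats an empty line as unknown.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charscripts);

# Script names known to the running Perl, taken from its own Unicode tables.
my @known_scripts = sort keys %{ charscripts() };
print scalar(@known_scripts), " scripts known.\n";

# Return the script shared by every character of $line, or '(unknown)'
# if the line is empty or mixes scripts. line_script is a hypothetical name.
sub line_script {
    my ($line) = @_;
    my %seen;
    for my $char (split //, $line) {
        my $script = charscript(ord $char);
        $seen{ defined $script ? $script : '(unknown)' }++;
    }
    my @found = keys %seen;
    return @found == 1 ? $found[0] : '(unknown)';
}

print line_script("Hello"), "\n";
print line_script("\x{43f}\x{440}\x{438}"), "\n";   # three Cyrillic letters
```

As in the regex version, a line containing spaces or punctuation mixes in the Common script and therefore comes out as unknown.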