Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Cool Uses for Perl

( #1044=superdoc: print w/ replies, xml ) Need Help??

This section is the place to post your general code offerings.

CUFP's
Color highlighted perl grep
1 direct reply — Read more / Contribute
by FreeBeerReekingMonk
on Feb 23, 2015 at 12:49

    To highlight a grep, we use grep --color, however, sometimes either grep does not have highlighting (AIX), or, we want to highlight multiple expressions with different colors.

    usage: ls -l | grep -e foo -e bar |./highlight.pl this that foo bar

    #!/usr/bin/perl use Term::ANSIColor; while(<STDIN>){ for $i (1..15){ next unless(defined($ARGV[$i-1])); s/($ARGV[$i-1])/&colored($1,"ansi15 on_ansi$i")/gexi; } print; }

    Of course, just use an alias or put it in the path and call it just "hl" for sanity. It supports 15 different arguments, but can be extended to the 255 different combinations Linux has. On some terminals however, you want your own selected colours, in that case, you will need to specify them by hand, for example:

    !/usr/bin/perl use Term::ANSIColor; $VERSION = '1.1'; @C = ('black on_yellow','black on_green','black on_cyan','black on_red +', 'red on_white', 'black on_magenta', 'white on_red', 'white on_blue +', 'blue on_white', 'yellow on_cyan' ); while(<STDIN>){ for $i (0..$#C){ s/($ARGV[$i])/&colored($1,$C[$i])/gexi if($ARGV[$i]); } print; }

    In the latter example, you can specify many more attributes, like bold and underscore, even blink... peruse ANSIColor.pm for examples.

    Caveats: Do not grep on digits 0..15 (except as first argument), as the ansi codes also contain numbers...

random file script
1 direct reply — Read more / Contribute
by etherald
on Jan 30, 2015 at 23:50
    here's my random file script to recursively choose a number of random files matching a regex from a directory or list of directories
    #!/usr/bin/env perl # recursively choose a number of files optionally matchinng a regex # from a directory or list of directories. defaults to cwd. # outputs absolute paths by default, relative with -r # use like so: # randomfile -n 23 -p \(mp3\|ogg\|flac\)$ ~/music ~/music2 use strict; use warnings; use Getopt::Long; use Cwd; use File::Random qw/random_file/; use File::Spec; my $pat = '^.+$'; my $num = 1; my $relative = 0; GetOptions( 'p|pattern=s' => \$pat, 'n|number=i' => \$num, 'r|relative' => \$relative, ); while ($num) { my $random_file; my $dir; do { $dir = $ARGV[rand @ARGV] or cwd(); $random_file = random_file( -dir => $dir, -check => qr/$pat/, -recursive => 1); } until $random_file; $random_file = "$dir/$random_file"; $random_file = File::Spec->abs2rel($random_file) if $relative; print "$random_file\n"; --$num; }
Alphabetize in Esperanto
1 direct reply — Read more / Contribute
by aplonis
on Jan 30, 2015 at 10:21

    A hobby of mine is translating Jack Vance into Esperanto. And yes, I have permission for this! These I distribute for free in EPub format. In each ebook I like to embed a mini linked-in dictionary to help out beginners.

    I was wanting to re-organize some standalone EPubs into one omnibus EPub. I wanted one end-of-book dictionary instead of six end-of-chapter ones. That meant re-alphabetizing hundreds of anchor links. No big deal to do it in Perl...except that it's Unicode...and Esperanto. Here's how I did it.

yellow pages
2 direct replies — Read more / Contribute
by japh2963
on Jan 29, 2015 at 01:32

    was looking for freelance work and came across a request for a script that would scrape yellow pages website. Hopefully the poster will google more before they spend $300 bucks for this...

     

    #!/usr/bin/perl # program: yp.pl # purpose: search for phone number on www.yellowpages.com # date: 2015.01.28 use strict; use warnings; use LWP::Simple; my $usage = "\n\tusage: perl $0 [PHO-NEN-UMBR]\n"; my $baseUri = 'http://www.yellowpages.com/search?search_terms=+'; my $phoneNumber = $ARGV[0]; die $usage unless($phoneNumber =~ m/^\d{3}-?\d{3}-?\d{4}$/); my $content = get("$baseUri$phoneNumber"); $content =~ s/\s+/ /g; $content =~ s/\&nbsp;/ /g; $content =~ s/>\s*</\n/g; my $hits = 0; foreach my $e(split /\n/, $content){ chomp $e; if($e =~ m/^.*itemprop=['|"](\S+)['|"].*>(.*),?<\/\S+.*$/i){ my $itemProp = $1; my $itemValue = $2; $itemValue =~ s/,\s*$//; print "$itemProp: $itemValue\n"; $hits++; } } print "no results found for number '$phoneNumber'\n" if(!$hits); exit;
MD5 checksums for Windows
1 direct reply — Read more / Contribute
by golux
on Jan 13, 2015 at 11:58
    This program "sum.pl" (for Windows) generates checksums matching those produced by the "md5sum" program in Linux. I wrote it because I often need to validate whether 2 files on different computers are the same.

    Enter "sum.pl" without arguments for a syntax message. Both files and/or directories (ie. "folders") are accepted as arguments. With the switch -R subdirectories are searched recursively. The switches -s <key> and -r control how the output is sorted. The -d switch gives a final report of any duplicate checksums found.

    Hope this might be of general use to others as well!

    Update:   At Anonymous Monk's suggestion, I've added a "-c" switch which produces a checksum format compatible with "md5sum". It does this by skipping the filesize, and prefixing the path with '*' to signify that the checksum was done in binary mode.

    say  substr+lc crypt(qw $i3 SI$),4,5
How to perldebug a Term::ReadLine application
1 direct reply — Read more / Contribute
by LanX
on Dec 01, 2014 at 09:42

    the problem

    I recently heard monks complaining that applications using Term::ReadLine can't be debugged within the perldebugger b/c it's interface relies on Term::Readline.

    the trick

    here one solution (at least for linux) I wanted to have documented (before I forget it again ;)

    call the script you want to debug (here calc_TRL.pl ) from a shell with

    PERLDB_OPTS="TTY=`tty` ReadLine=0" xterm -e perl -d ./calc_TRL.pl

    and a second xterm will be opened running the program.

    how it works

    a second xterm is started running the debugger, but b/c of the TTY setting in PERLDB_OPTS all debugger communication goes via the parent xterm , while the calc app normally displays in the child xterm .

    ReadLine=0 tells the debugger not to rely on a working Term::ReadLine.

    NB: It's important that calling the second xterm blocks the execution in the first xterm till it's finished. Like this keystrokes aren interpreted by two applications in the first xterm. Just put an & at the end to see how things get messed up otherwise if the shell tries to step in.

    how it looks like

    first xterm

    becomes the front end for the debugger

    as you see I get the lines from the app in the second xterm listed can set a breakpoint at the end of the loop and tell twice to continue till next breakpoint.

    second xterm

    runs the application, I'm asked to enter a calculation which is evaluated, interupted twice by a breakpoint at line 9.

    Enter code: 1+2 3 Enter code: 4+4 8

    the test script

    > cat ./calc_TRL.pl use Term::ReadLine; my $term = Term::ReadLine->new('Simple Perl calc'); my $prompt = "Enter code: "; my $OUT = $term->OUT || \*STDOUT; while ( $_ = $term->readline($prompt) ) { my $res = eval($_); warn $@ if $@; print $OUT $res, "\n" unless $@; $term->addhistory($_) if /\S/; }

    tested with Term::ReadLine::Gnu installed.

    generalisation

    you can use this approach whenever you want the debugger communication separated into a separate term. e.g. Curses::UI comes to mind

    discussion

    the solution is not "perfect", of course you need to arrange the windows and switch with Alt-Tab between them. (maybe screen could solve this or an emacs inegration)

    Furthermore you won't have a history with arrow navigation within the debugger, cause TRL was disabled.

    another approach is to communicate via sockets with a debugger run within emacs, since emacs has it's own TRL-emulation this shouldn't interfere.

    see also Re: Testing terminal programs within emacs (SOLVED) for an approach to handle all this automatically, by restarting a script with altered environment and different terminal.

    TIMTOWTDI

    see perldebguts , perldebtut and perldeb,

    Also "Pro Perl Debugging" book and various TK tools on CPAN.

    Cheers Rolf

    (addicted to the Perl Programming Language and ☆☆☆☆ :)

The hills are alive...
4 direct replies — Read more / Contribute
by Lady_Aleena
on Nov 26, 2014 at 22:25

    ..with The Sound of Music. ;)

    I had @SoM_notes and $SoM sitting around doing nothing, so this evening, I made them do something. In make_SoM_song and get_SoM_def, you enter a string of alphabetical notes (c, d, e, f, g, a, b). The notes can be separated by a comma, semicolon, or a space. The functions will return the note name given by Maria von Trapp in The Sound of Music.

    I wrote random_SoM_note and random_SoM_song because I couldnt' help myself. Most of you know how much I love to randomly generate things. :)

    make_SoM_song, get_SoM_def, and random_SoM_song all return array references.

    Enjoy the code!

    package SoundofMusicSong; use strict; use warnings; use Exporter qw(import); our @EXPORT_OK = qw(make_SoM_song get_SoM_def random_SoM_note random_S +oM_song); my @base_notes = qw(c d e f g a b); my @SoM_notes = qw(do re me fa so la te); my %notes; @notes{@base_notes} = @SoM_notes; my $SoM = { 'do' => 'a deer a female deer', 're' => 'a drop of golden sun', 'me' => 'a name I call myself', 'fa' => 'a long long way to run', 'so' => 'a needle pulling thread', 'la' => 'a note to follow so', 'te' => 'a drink with jam and bread', }; sub make_SoM_song { my ($user_song) = @_; my @song_notes = split(/[ ,;]/, $user_song); my @new_song = map { $_ = $_ =~ /^[a-g]$/ ? $notes{$_} : 'not a note +'; $_ } @song_notes; return \@new_song; } sub get_SoM_def { my ($user_song) = @_; my $notes = make_SoM_song($user_song); my @new_song = map { $_ = $$SoM{$_} ? $_.' '.$$SoM{$_} : 'not a note +'; $_ } @$notes; return \@new_song; } sub random_SoM_note { my $note = $SoM_notes[rand @SoM_notes]; return $note; } sub random_SoM_song { my ($number_of_notes) = @_; my $notes = $number_of_notes ? $number_of_notes : int(rand(100)) + 1 +; my @new_song; push @new_song, random_SoM_note for (1..$notes); return \@new_song; } 1;
    No matter how hysterical I get, my problems are not time sensitive. So, relax, have a cookie, and a very nice day!
    Lady Aleena
CPAN Namespace Navigator
1 direct reply — Read more / Contribute
by Discipulus
on Nov 25, 2014 at 06:23
    CPAN Namespace Navigator is an interactive program that let you to navigate all namespaces on CPAN.
    The idea born when i read that before upload something to CPAN is better to explore existing modules, but when i asked here in the chat how to browse it i discovered that ther is not a real exploration program to do it.

    So the challenge was to hack directly the fomous file 02packages.details.txt that we receive (gzipped) when we search some module with some CPAN client. I used Term::ReadLine not without some headache.

    I decided (unwisely) to eval directly the data received to build up a big HoH with the whole hierarchy of CPAN modules and reletad infos. As suggested (wisely) by ambrus and yitzchak i looked at tye's Data::Diver and on my own at an ancient and unmaintained Data::Walker one.

    I was not able to bind Data::Diver at my will to add to the structure others infos like parent namespace or version, so i reinvented that wheel evaluating everything by myself.

    Surprisingly it worked.

    This is the usage and the navigation commands available during the navigation:
    USAGE: cpannn.pl [02packages.details.txt] NAVIGATION: . simple list of contained namespaces .. move one level up + detailed list of contained namespaces * read the readme file of current namespace ** download the current namespace's package ? print this help TAB completion enabled on all sub namespaces cpannn.pl by Discipulus as found at perlmonks.org
    And here you have the code, finally crafted after 37 steps of development.


    HtH
    L*

    update: take a look also at Re: Autocomplete in perl console application
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Archive by month and extension
No replies — Read more | Post response
by GotToBTru
on Nov 24, 2014 at 16:01

    Created the following to archive data from our applications. We archive by month, and by file extension, so those are built in assumptions in this program.

    Source Code:

    POD:

    1 Peter 4:10
img - display a small graphic file at the command line
1 direct reply — Read more / Contribute
by sflitman
on Nov 22, 2014 at 21:21

    I do a lot of work in Putty and need to look at icon files sometimes. I thought it would be cool to get Putty to display them in bash directly, rather than using X11 forwarding. This is not meant to be any kind of substitute for real graphics, but is a quick way to see whether a particular image file (like an icon or a web button) is what I think it is. Note that it requires 256 color to be turned on in Putty, that your Terminal setting is putty-256color, and it only handles image formats handled by GD (png, jpg, gif).

    #!/usr/bin/perl # Steve Flitman - released to Public Domain - display a small image on + the console using 256-color mode Putty/Screen # Color output to terminal derived from Todd Larason <jtl@molehill.org +> use strict; use warnings; use GD; unless (@ARGV) { die "img file ...\nDisplay files at command line using ANSI 256 col +or mode\n"; } # set colors 16-231 to a 6x6x6 color cube for (my $red=0; $red<6; $red++) { for (my $green=0; $green<6; $green++) { for (my $blue=0; $blue<6; $blue++) { printf("\x1b]4;%d;rgb:%2.2x/%2.2x/%2.2x\x1b\\", 16 + ($red * 36) + ($green * 6) + $blue, ($red ? ($red * 40 + 55) : 0), ($green ? ($green * 40 + 55) : 0), ($blue ? ($blue * 40 + 55) : 0)); } } } # colors 232-255 are a grayscale ramp, intentionally leaving out black + and white for (my $gray=0; $gray<24; $gray++) { my $level=($gray * 10) + 8; printf("\x1b]4;%d;rgb:%2.2x/%2.2x/%2.2x\x1b\\", 232 + $gray, $level, $level, $level); } my ($file,$x,$y,$r,$g,$b,$color,$index,$image,$width,$height); for $file (@ARGV) { die "Cannot read $file: $!\n" unless -r $file; my $image=GD::Image->new($file); die "Not a recognized image format: $file\n" unless defined $image; my ($width,$height)=$image->getBounds(); for (my $y=0; $y<$height; $y++) { for (my $x=0; $x<$width; $x++) { my $index=$image->getPixel($x,$y); my ($r,$g,$b)=$image->rgb($index); if ($r+$g+$b==0) { # black $color=0; } elsif ($r==255 && $g==255 && $b==255) { # white $color=15; } elsif ($r==$g && $g==$b) { # grayscale $color=232+($r>>3); } else { $color=16+(int($r/42.6)*36)+(int($g/42.6)*6)+int($b/42.6); + # smush 256 color range to 6 levels } print "\x1b[48;5;${color}m "; } print "\x1b[0m\n"; # reset } print "\x1b[0m\n"; # reset } exit;

    Dedicated to the memory of John Todd Larason, http://molehill.org

    Enjoy!

    SSF

NSA's FoxAcid
2 direct replies — Read more / Contribute
by morgon
on Nov 20, 2014 at 09:49
    According to various top-secret documents provided by Snowden, FoxAcid is the NSA codename for what the NSA calls an "exploit orchestrator," an internet-enabled system capable of attacking target computers in a variety of different ways. It is a Windows 2003 computer configured with custom software and a series of Perl scripts.

    (Source: http://www.theguardian.com/world/2013/oct/04/tor-attacks-nsa-users-online-anonymity)

    It may stretch the definition of "cool" and may be old news but maybe a few monks will find this amusing...

read file a paragraph at a time and return paragraph OR get specific data from paragraph
3 direct replies — Read more / Contribute
by james28909
on Nov 02, 2014 at 16:10
    i have figured out how to read a file a paragraph at a time :D lol. in this example i will be reading a file a paragraph at a time and then getting only certain data back.

    this will return everything AFTER the pattern up until a newline char.works very quickly for me but prob could be done faster by a seasoned coder. you can modify it to suit your needs hopefully.
    tested with active perl and win7 ultimate x64
    $/ = ""; #set to paragraph mode while(<$file>){ if ($_ =~ /$first_match/ && $_ =~ /$second_match/ && $_ =~ /$third +_match/){ print "$_\n"; my ($needed_data_0) = (/^data_here(.+)/, $_); my ($needed_data_1) = (/^more_data(.+)/, $_); my ($needed_data_2) = (/^other_data(.+)/, $_); print "data0 is: $needed_data_0\n"; print "data1 is: $needed_data_1\n"; print "data2 is: $needed_data_2"; } }
    any input or pointers are welcome.
    :)
Random video player
2 direct replies — Read more / Contribute
by james28909
on Oct 29, 2014 at 14:02
    this little script will use file random to get a video file and will use your player of choice to launch it. mediainfo.exe was used to get sleep count.

    if you see anyway i can improve it please make a comment. i am trying to figure out how to push the filename to an array and do a check if it has played it in the last 100 videos/loops(that would help keep it from repeating if it does i think). i would also like to specify how many times to loop before exiting, that way it doesnt just loop forever and ever and the user can specify how many shows they want to watch.

    i wasnt please with the way i had to get duration, but it worked and i was astounded, so if you have of a better was to get duration time, please let me know. the problem i was dealing with while getting duration with any tool was spaces in the path/filenames. Here is the code:
    use strict; use warnings; use diagnostics; use File::Random qw(random_file); my $dir = $ARGV[0]; if ( not defined $dir ) { print "\nUsage: random.pl [folder]; exit(0); }else{ random($dir); } sub random{ my ($dir) = @_; while (1){ my $mpc = "C:/Program Files (x86)/K-Lite Codec Pack/Media +Player Classic/mpc-hc.exe"; my $rndm_file = random_file( -dir => $dir, #-check => qr/./, -recursive => 1 ); if ($rndm_file =~ /\.(ini|nfo|db)$/i){ print "$rndm_file\n"; random($dir); } print $rndm_file; #get duration my $t = ("MediaInfo.exe --Output=Video;%Duration% \"F:/TV/ +$rndm_file\""); system(1, $mpc, "F:/TV/$rndm_file"); my $time = qx($t); my $sleep_time = $time/1000; #in seconds because m +ediainfo.exe outputs milliseconds i think. print "\nDuration in seconds: $sleep_time\n"; sleep($sleep_time); random($dir); } }
    this works great if you have a ton of media in a folder and want to randomly watch any given inside in that directory, and you do not have to worry about spaces in path/filenames. and this is also for windows but i think it could be used on any other platform as well with small changes
Using Data::Compare recursively to better identity differences between hashes
2 direct replies — Read more / Contribute
by Lady_Aleena
on Oct 18, 2014 at 01:49

    Yesterday I wanted to compare two hashes to see if they were the same. I looked around a little bit and found Data::Compare. It was good at telling me the two hashes were different, however, it did not tell me where. So, I wrote a small little subroutine to recursively check my hash of hashes (of hashes). It was able to identity where I had to look to make corrections, almost to the exact spot. (I am unsure how to compare arrays of hashes just yet which is why the following little subroutine will almost take you to the right spot.)

    There are still holes in the script, but it worked for me today.

    #!/usr/bin/perl use strict; use warnings FATAL => qw( all ); use Data::Compare; use Data::Dumper; # You can take out all instances of the subroutine 'line' to print wha +t you want in those places. sub deep_data_compare { my ($tab, $old_data, $new_data, $data) = @_; my $old = $old_data; my $new = $new_data; my $compare = new Data::Compare($old, $new); if ($compare->Cmp == 0) { line($tab,$data) if $data; if (ref($old) eq 'HASH' && ref($new) eq 'HASH') { line($tab+1,'old to new'); for (keys %$old) { deep_data_compare($tab+2,$_,$$old{$_},$$new{$_}); } } # I have not figured out this part yet. # elsif (ref($old) eq 'ARRAY' && ref($new) eq 'ARRAY') { # } else { print Dumper($new); print Dumper($old); } } } sub rline { my ($tab,$line) = @_; return qq(\t) x $tab.qq($line\n); } sub line { print rline(@_); } deep_data_compare(0, \%old_hash, \%new_hash, 'widgets');
    No matter how hysterical I get, my problems are not time sensitive. So, relax, have a cookie, and a very nice day!
    Lady Aleena
Identifying scripts (writing systems)
2 direct replies — Read more / Contribute
by AppleFritter
on Sep 16, 2014 at 17:32

    Dear monks and nuns, priests and scribes, popes and antipopes, saints and stowaways lurking in the monastery, lend me your ears. (I promise I'll return them.) I'm still hardly an experienced Perl (user|programmer|hacker), but allow me to regale you with a story of how Perl has been helping me Get Things Done™; a Cool Use for Perl, or so I think.

    I was recently faced with the problem of producing, given a number of lines each written in a specific script (i.e. writing system; Latin, Katakana, Cyrillic etc.), a breakdown of scripts used and how often they appeared. Exactly the sort of problem Perl was made for - and thanks to regular expressions and Unicode character classes, a breeze, right?

    I started by hardcoding a number of scripts to match my snippets of text against:

    my %scripts; foreach (@lines) { my $script = m/^\p{Script=Latin}*$/ ? "Latin" : m/^\p{Script=Cyrillic}*$/ ? "Cyrillic" : m/^\p{Script=Han}*$/ ? "Han" : # ... "(unknown)"; $scripts{$script}++; }

    Obviously there's a lot of repetition going on there, and though I had a list of scripts for my sample data, I wasn't sure new and uncontemplated scripts wouldn't show up in the future. So why not make a list of all possible scripts, and replace the hard-coded list with a loop?

    my %scripts; LINE: foreach my $line (@lines) { foreach my $script (@known_scripts) { next unless $line =~ m/^\p{Script=$script}*$/; $scripts{$script}++; next LINE; } $scripts{'(unknown)'}++; }

    So far, so good, but now I needed a list of the scripts that Perl knew about. Not a problem, I thought, I'll just check perluniprops; the list of properties Perl knows about was staggering, but I eventually decided that any property of the form "\p{Script: ...}" would qualify, so long as it had short forms listed (which I took as an indication that that particular property was the "canonical" form for the script in question). After some reading and typing and double-checking, I ended up with a fairly long list:

    my @known_scripts = ( "Arabic", "Armenian", "Avestan", "Balinese", "Bamum", "Batak", "Bengali", "Bopomofo", "Brahmi", "Br +aille", "Buginese", "Buhid", "Canadian_Aboriginal", "Carian", "Chakma", "Cham", "Cherokee", "Coptic", "Cuneiform", "Cypriot", "Cyrillic", # ... );

    Unfortunately, when I ran the resulting script, Perl complained:

    Can't find Unicode property definition "Script=Chakma" at (...) line ( +...)

    What had gone wrong? Versions, that's what: I'd looked at the perluniprops page on perl.org, documenting Perl 5.20.0, but this particular Perl was 5.14.2 and didn't know all the scripts that the newer version did, thanks to being built against an older Unicode version. Now, I could've just looked at the locally-installed version of the same perldoc page, but - wouldn't it be nice if the script automatically adapted itself to the Perl version it ran on? I sure reckoned it'd be.

    What scripts DID the various Perl versions recognize, anyway? What I ended up doing (perhaps there's an easier way) was to look at lib/unicore/Scripts.txt for versions 5.8, 5.10, ..., 5.20 in the Perl git repo (I skipped 5.6 and earlier, because a) the relevant file didn't exist in the tree yet back then, and b) those versions are ancient, anyway). And by "look at", I mean download (as scripts-58.txt etc.), and then process:

    $ for i in 8 10 12 14 16 18 20; do perl scripts.pl scripts-5$i.txt >5$ +i.lst; done $ for i in 8 10 12 14 16 18; do diff --unchanged-line-format= --new-li +ne-format=%L 5$i.lst 5$((i+2)).lst >5$((i+2)).new; done $

    scripts.pl was a little helper script to extract script information (apologies for the confusing terminology, BTW):

    #!/usr/bin/perl use strict; use warnings; use feature qw/say/; my %scripts; while(<>) { next unless m/; ([A-Za-z_]*) #/; $scripts{$1}++; } $, = "\n"; say sort { $a cmp $b } map { $_ = ucfirst lc; $_ =~ s/(?<=_)(.)/uc $1/ +ge; qq/"$_"/ } keys %scripts;

    I admit, I got lazy at this point and manually combined those files (58.lst, as well as 510.new, 512.new etc.) into a hash holding all the information, instead of having a script output it. Nonetheless, once this was done, I could easily load all the right scripts for a given Perl version:

    # New Unicode scripts added in Perl 5.xx my %uniscripts = ( '8' => [ "Arabic", "Armenian", "Bengali", "Bopomofo", "Buhid", "Canadian_Aboriginal", "Cherokee", "Cyrillic", "Deseret", "Devanagari", "Ethiopic", "Georgian", "Gothic", "Greek", "Guja +rati", "Gurmukhi", "Han", "Hangul", "Hanunoo", "Hebrew", "Hiragana", "Inherited", "Kannada", "Katakana", "Khmer", "Lao", "Latin", "Malayalam", "Mongolian", "Myanmar", "Ogham", "Old_Italic", "O +riya", "Runic", "Sinhala", "Syriac", "Tagalog", "Tagbanwa", "Tamil", "Telugu", "Thaana", "Thai", "Tibetan", "Yi" ], '10' => [ "Balinese", "Braille", "Buginese", "Common", "Coptic", "Cuneif +orm", "Cypriot", "Glagolitic", "Kharoshthi", "Limbu", "Linear_B", "New_Tai_Lue", "Nko", "Old_Persian", "Osmanya", "Phags_Pa", "Phoenician", "Shavian", "Syloti_Nagri", "Tai_Le", "Tifinagh", "Ugaritic" ], '12' => [ "Avestan", "Bamum", "Carian", "Cham", "Egyptian_Hieroglyphs", "Imperial_Aramaic", "Inscriptional_Pahlavi", "Inscriptional_Parthian", "Javanese", "Kaithi", "Kayah_Li", "Lepcha", "Lisu", "Lycian", "Lydian", "Meetei_Mayek", "Ol_Chik +i", "Old_South_Arabian", "Old_Turkic", "Rejang", "Samaritan", "Saurashtra", "Sundanese", "Tai_Tham", "Tai_Viet", "Vai" ], '14' => [ "Batak", "Brahmi", "Mandaic" ], '16' => [ "Chakma", "Meroitic_Cursive", "Meroitic_Hieroglyphs", "Miao", "Sharada", "Sora_Sompeng", "Takri" ], '18' => [ ], '20' => [ ], ); (my $ver = $^V) =~ s/^v5\.(\d+)\.\d+$/$1/; my @known_scripts; foreach (keys %uniscripts) { next if $ver < $_; push @known_scripts, @{ $uniscripts{$_} }; } print STDERR "Running on Perl $^V, ", scalar @known_scripts, " scripts + known.\n";

    The number of scripts Perl supports this way WILL increase again soon, BTW. Perl 5.21.1 bumped the supported Unicode version to 7.0.0, adding another bunch of new scripts as a result:

    # tentative! '22' => [ "Bassa_Vah", "Caucasian_Albanian", "Duployan", "Elbasan", "Gra +ntha", "Khojki", "Khudawadi", "Linear_A", "Mahajani", "Manichaean", "Mende_Kikakui", "Modi", "Mro", "Nabataean", "Old_North_Arabia +n", "Old_Permic", "Pahawh_Hmong", "Palmyrene", "Pau_Cin_Hau", "Psalter_Pahlavi", "Siddham", "Tirhuta", "Warang_Citi" ],

    But that's still in the future. For now I just tested this on 5.14.2 and 5.20.0 (the two Perls I regularly use); it worked like a charm. All that was left to do was outputting those statistics:

    print "Found " . scalar keys(%scripts) . " scripts:\n"; print "\t$_: " , $scripts{$_}, " line(s)\n" foreach(sort { $a cmp $b } + keys %scripts);

    (You'll note that in the above two snippets, I'm using print rather than say, BTW. That's intentional: say is only available from Perl 5.10 on, and this script is supposed to be able to run on 5.8 and above.)

    Fed some sample data that I'm sure Perlmonks would mangle badly if I tried to post it, this produced the following output:

    Running on Perl v5.14.2, 95 scripts known. Found 18 scripts: Arabic: 21 line(s) Bengali: 2 line(s) Cyrillic: 12 line(s) Devanagari: 3 line(s) Georgian: 1 line(s) Greek: 1 line(s) Gujarati: 1 line(s) Gurmukhi: 1 line(s) Han: 29 line(s) Hangul: 3 line(s) Hebrew: 1 line(s) Hiragana: 1 line(s) Katakana: 1 line(s) Latin: 647 line(s) Sinhala: 1 line(s) Tamil: 4 line(s) Telugu: 1 line(s) Thai: 1 line(s)

    Problem solved! And not only that, it's futureproof now as well, adapting to additional scripts in my input data, and easily extended when new Perl versions support more scripts, while maintaining backward compatibility.

    What could still be done? Several things. First, I should perhaps find out if there's an easy way to get this information from Perl, without actually doing all the above.

    Second, while Perl 5.6 and earlier aren't supported right now, they could be. Conveniently, the 3rd edition of Programming Perl documents Perl 5.6; the \p{Script=...} syntax for character classes doesn't exist yet, I think, but one could write \p{In...} instead, e.g. \p{InArabic}, \p{InTamil} and so on. Would this be worth it? Not for me, but the possibility is there if someone else ever had the need to run this on an ancient Perl. (Even more ancient Perls may not have the required level of Unicode support for this, though I wouldn't know for sure.)

    Lastly, since the point of this whole exercise was to identify writing systems used for snippets of text, there's room for optimization. Perhaps it would be faster to precompile a regular expression for each script, especially if @lines is very large. Most of the text I'm dealing with is in the Latin script; as such, I should perhaps test for that before anything else, and generally try to prioritize so that lesser-used scripts are pushed further down the list. Since I'm already keeping a running total of how often each script has been seen, this could even be done adaptively, though whether doing so would be worth the overhead in practice is another question, one that could only be answered by measuring.

    But neither speed nor support for ancient Perls is crucial to me, so I'm done. This was a fun little problem to work on, and I hope you enjoyed reading about it.


Add your CUFP
Title:
CUFP:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":


  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chanting in the Monastery: (4)
    As of 2015-03-02 03:41 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      When putting a smiley right before a closing parenthesis, do you:









      Results (24 votes), past polls