
The Monastery Gates


New Questions
Lingua::EN::Tagger adapting to other language
No replies — Read more | Post response
by Anonymous Monk
on Dec 13, 2017 at 10:15


    I am attempting a task that is (for me) quite difficult: adapting the module Lingua::EN::Tagger for use with another language. To do so I need to train the probability values on a corpus in my language. The probability values are saved in several YAML files. Unfortunately there is no documentation, as far as I can tell, describing how to do this. In fact, I have trouble understanding how the probabilities are stored. I have some background in linguistics and in corpus linguistics, but without documentation it is a hard task for me. I have seen that there is also a German version (Lingua::EN::Tagger) derived from the EN one, so the task, given a corpus and some manual tagging to train the model, should be doable. I've written to the authors to ask how to proceed, but have had no response. Has anybody already tried something like this? If so, did you find any documentation online on how to train the model? Any suggestions would be very much appreciated. Best.

C headers not found when installing Sys::Virt on CentOS
3 direct replies — Read more / Contribute
by chrestomanci
on Dec 12, 2017 at 12:05

    Greetings wise brothers, I am trying to install Sys::Virt on a CentOS box, but perl cannot find the necessary header files.

    I have installed the libvirt-devel package using yum; it installed a number of header files in sensible places, as reported by "rpm -q -l libvirt-devel".

    I have attempted to install the Sys::Virt package both using cpanminus and by hand (unpack the tarball, run Makefile.PL, run the generated Makefile). Either way it reports that libvirt is not installed.

    The check inside Makefile.PL is done using pkg-config. I have run that by hand and it also reports that libvirt is missing.

    I tried editing Makefile.PL to remove the check, and running it anyway. It then spits out a Makefile, which when I run creates lots of errors about missing macros and symbols that are defined in the libvirt headers.

    So in summary I have a situation where the CentOS core packaging system thinks the package has been installed, and has written the headers to the correct places. Meanwhile both pkg-config and the perl build system cannot find the package or the required header files, and so the build fails.

    I know that the build system for compiling XS modules in C is correctly setup, because I have successfully installed other XS based modules including XML::LibXML and JSON::XS

    At this point, I am stuck, so I am asking for help. Please bear in mind that I am not that familiar with CentOS (My preferred Distro is Debian), so I might have missed something obvious in the way that RedHat derived distros lay out files or configure libraries.

Converting fasta (with multiple sequences) into tabular using perl
5 direct replies — Read more / Contribute
by rarenas
on Dec 12, 2017 at 11:57

    Hello. I started learning perl two weeks ago and I have a practice assignment to convert fasta files into tabular format using perl. I managed to write a program that converted a fasta into tbl but only when there is one sequence in the fasta file. If I have multiple sequences in a fasta file, I cannot manage to convert them to tbl properly. This is the code I wrote that works with one sequence in the fasta:

    #!/usr/bin/perl
    #
    use strict;
    use warnings;

    die "Please specify suitable file\n" if (@ARGV != 1);
    my ($fasta) = @ARGV;
    my $outfile = "$fasta.tbl";

    open(my $in, "<", "$fasta") or die "error reading $fasta. $!";
    open(my $out, ">", "$outfile") or die "error creating $outfile. $!";

    my $identifier = "";
    my $union = "";

    while (<$in>) {
        chomp;
        next unless m/\w/;
        if ($_ =~ m/>/) {
            # Identifier!
            $_ =~ s/>//;
            ($identifier) = split /\:|\s|\||,|;/, $_;
            print "$identifier\n";
        } else {
            # We have a line with sequence
            $union = $union . $_;
        }
    }

    print $out "$identifier\t$union\n";

    close($in);
    close($out);

    I realized that it would be a lot better to use hashes instead of arrays to separate the different sequences. I want to have the sequence title/name be the key and the sequence be the value. I also thought it would be good to use the local command so that I can separate based on ">" symbol instead of by line because all fasta file titles start with that symbol. I am stuck on actually implementing those realizations and then using a loop to edit the formatting for each sequence. Any suggestions? Thank you in advance!

    I am using the simple fasta file below for practice but do note that many fasta files contain extra information in the title and may have a space between the title and the sequence. We only want the name in tbl and not the extra information. The code above takes care of those extras only for one sequence.
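    The idea described above (a hash keyed by sequence name, with the input record separator set to ">" via local) might be sketched roughly as follows. This is only an illustration, not a definitive solution: the sub name read_fasta, the in-memory sample data, and the choice to keep names in file order are all assumptions made for the example, and it assumes that ">" never appears inside a title line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# read_fasta() is a hypothetical helper name used for this sketch.
# It returns the sequence names in file order plus a hash of
# name => concatenated sequence.
sub read_fasta {
    my ($fh) = @_;
    local $/ = ">";                      # read one ">"-delimited record at a time
    my (@order, %seq);
    while (my $record = <$fh>) {
        chomp $record;                   # strips the trailing ">" (the current $/)
        next unless $record =~ /\S/;     # skip the empty chunk before the first ">"
        my ($header, @lines) = split /\n/, $record;
        my ($name) = split /[:\s|,;]/, $header;   # keep only the name, as in the original code
        my $sequence = join '', @lines;
        $sequence =~ s/\s+//g;           # drop stray whitespace and blank lines
        push @order, $name unless exists $seq{$name};
        $seq{$name} .= $sequence;
    }
    return (\@order, \%seq);
}

# Made-up sample data, read through an in-memory filehandle.
my $demo = ">seq1 some description\nACGT\nTTGA\n\n>seq2|extra\nGGCC\n";
open(my $in, '<', \$demo) or die "Cannot open in-memory file: $!";
my ($order, $seq) = read_fasta($in);
close($in);

# One tab-separated line per sequence.
print "$_\t$seq->{$_}\n" for @$order;
```

    To use it on a real file, open the fasta file instead of the in-memory string and print to the output filehandle rather than STDOUT.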

    Fasta format


    Tabular format

DBD::Sqlite queries slow - and gives wrong results
6 direct replies — Read more / Contribute
by astroboy
on Dec 11, 2017 at 14:42

    Ok, this is pretty weird IMHO. I have some Perl code written to create and populate a SQLite database. It's very simple - a list of users and their AD groups in two different tables. It's been running ok for at least 18 months. I have some other Perl code that queries the database. This querying code has started running slowly, and in some cases returns no results where I know there are records - I can test it by pasting the SQL into the SQLiteStudio v2.1.5 editor, which returns the results instantly.

    Here's an example of some Perl query code:

    #!/usr/bin/perl -w
    use strict;
    use DBI;

    my $dbfile = 'C:/db/employee.db';
    my $dbh = DBI->connect(
        "dbi:SQLite:dbname=$dbfile",
        "",
        "",
        {
            RaiseError => 1,
            AutoCommit => 0,
        }
    );

    my $sql = q{
        select e.*
        from employees e, groups g
        where e.sam_account_name = g.sam_account_name
        and g.group_name = 'Group Name'
        order by last_name, first_name
    };

    foreach my $emp (@{$dbh->selectall_arrayref($sql, {Slice => {}})}) {
        printf(
            "\t\t%-15s %-15s: (%s)\n",
            $emp->{first_name},
            $emp->{last_name},
            $emp->{sam_account_name},
        );
    }

    $dbh->rollback;

    The query in $sql is copied and pasted into the SQLiteStudio editor. Regardless of the group name I choose, the editor returns rows in approximately 0.001 seconds. Perl takes several seconds and may return no rows, even where there are matching candidates. If I change the group name, it may return the same rows as the editor, but it can take 10+ seconds. The result set is always small: 2 to 20 rows depending on the group. The database is 520MB.

    Note the code above was written to simplify my problem. The actual code has the group name as a placeholder, and I simply fetch each row rather than returning everything into an array as I do above. Regardless, the results are the same.

    This is running on Windows 7. I was using DBD::SQLite 1.54 - As a test this morning I upgraded to 1.55_04 to see if there were any fixes in the developer version. I recreated the database, but the results are still the same

unique sequences
8 direct replies — Read more / Contribute
by Anonymous Monk
on Dec 10, 2017 at 18:31

    I am really new to perl and am taking a course on it. I wrote the following program for an assignment and am getting incorrect output. I'm getting over a million lines while the expected output is closer to 250,000. The last 12 nts need to be unique to the genome. I have a feeling it's due to my regex. Any advice would be greatly appreciated. Thank you.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %windowSeqScore = ();
    my $input_file = '/scratch/Drosophila/dmel-all-chromosome-r6.02.fasta';
    my $sequenceRef = loadSequence($input_file);
    my $output_file = 'unique12KmersEndingGG.fasta';
    open (KMERS,">", $output_file) or die $!;
    my $windowSize = 21;
    my $stepSize = 1;

    for ( my $windowStart = 0 ; $windowStart <= ( length ( $$sequenceRef ) - $windowSize ); $windowStart += $stepSize ) {
        my $windowSeq = substr ( $$sequenceRef, $windowStart, $windowSize);
        if ($windowSeq =~ /([ATCG]{10}GG$)/) {
            $windowSeqScore{$windowSeq}++;
        }
    }

    my $count = 0;
    for (keys %windowSeqScore){
        $count ++;
        if ($windowSeqScore{$_} == 1 ) {
            print KMERS ">crispr_$count", "\n", $_, "\n";
        }
    }

    sub loadSequence {
        my ($sequenceFile) = @_;
        my $sequence = "";
        unless ( open( FASTA, "<", $sequenceFile ) ) {
            die $!;
        }
        while (<FASTA>){
            my $line = $_;
            chomp ($line);
            if ($line !~ /^>/ ) {
                $sequence .= $line;
            }
        }
        return \$sequence;
    }

    This is some of the output I'm getting


    this is some of the expected output I should be getting

Rename files in gzip tarball: No such file in archive: '/path/to/file1.txt'
2 direct replies — Read more / Contribute
by Bowlslaw
on Dec 08, 2017 at 19:03

    This program reads a list of files from the directory specified on the command line and creates an array of hashes, where each file has the keys path, size, and sha256sum.

    I am trying to create a gzipped tarball of the files, where each file is named with its checksum followed by the file's original extension. I create the gzipped tarball successfully. However, when I try to use Archive::Tar's rename method, I am met with this error: No such file in archive: '/path/to/file1.txt' at ./ line 62. This error repeats for each file in the archive.

    Is it because the archive is just a flat list of files? If so, how does one use the rename method?

    use strict;
    use warnings;
    use Data::Dumper qw(Dumper);
    use File::Spec qw(catfile rel2abs);
    use Digest::SHA qw(sha256_hex);
    use Archive::Tar;
    use Archive::Tar::File;

    my $dir = $ARGV[0];
    my $url = $ARGV[1];
    my @AoH;
    my @checksumfiles;
    my $tar = Archive::Tar->new;
    my $archive = "archive.tar.gz";

    opendir DIR, $dir or die "cannot open dir $dir: $!\n";
    chdir $dir or die "cannot navigate to dir $dir: $!\n";

    while(my $file = readdir DIR) {
        next unless(-f File::Spec->catfile($dir, $file));
        next if($file =~ m/^\./);
        my $fullpath = File::Spec->rel2abs($file);
        my $fullsize = -s File::Spec->catfile($dir, $file);
        my $fullid = sha256_hex($fullpath);
        my %hash = (
            path => $fullpath,
            size => $fullsize,
            id => $fullid,
        );
        push(@AoH, \%hash);
    }

    my @array;
    for my $i(0..$#AoH) {
        no warnings 'uninitialized';
        my ($ext) = $AoH[$i]{path} =~ (/(\.[^.]+)$/);
        my $idext = $AoH[$i]{id} . $ext;
        push(@checksumfiles, $idext);
        push(@array, $AoH[$i]{path});
    }

    Archive::Tar->create_archive($archive, COMPRESS_GZIP, @array);
    #print Archive::Tar->list_archive($archive, COMPRESS_GZIP), "\n";

    for my $i(0..$#array) {
        $tar->rename($array[$i], $checksumfiles[$i]);
    }

    print Dumper sort \@array;
    print Dumper sort \@checksumfiles;
    #print Dumper sort \@AoH;
Tidying and simplifying a regular expression
6 direct replies — Read more / Contribute
by Dallaylaen
on Dec 08, 2017 at 12:00

    Hello monks and nuns,

    I'm just wondering if there is a module or recipe to strip a regular expression of meaningless grouping. Consider the following code:

    bash$ perl -wle 'my $rex = qr/./; $rex = qr/$rex./ for 1..10; print $rex;'
    (?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:(?^:.).).).).).).).).).).)

    It's relatively easy to spot that it's just a (?:..........); however, the expression is not stringified exactly like that. Is there a way to tidy it up automatically?

    Inspired by this node, but I think it would be nice to have a simplifier anyway...

Guidelines for listing functions/methods in POD?
7 direct replies — Read more / Contribute
by Dallaylaen
on Dec 07, 2017 at 15:34

    Hello dear esteemed monks,

    Having published some modules, I finally started wondering about formatting function names.

    For some reason (I cannot even recall why), I began documenting my modules' functions using a header with a usage example:

    =head2 frobnicate( $foo, $bar )

    Now it looks rather cumbersome to me, so I'm leaning towards

    =head2 frobnicate

    =over

    =item frobnicate( $foo, $bar )

    =item frobnicate( \%baz )

    =back

    But I see that many CPAN authors go even further and remove functions/methods from index altogether, leaving only

    =item frobnicate()

    I for one prefer more structured documentation. Where can I find guidelines for doing it properly? What are the reasons for and against each practice? At least Test::Pod::Coverage permits all three...

    Oh, it looks like I'm sold on the second variant: after re-reading perldoc perlpod it turns out that sections are linkable via L<Foo::Bar/frobnicate>. Still posting this, there sure is something to add to my thoughts!

Single sign on with AD
2 direct replies — Read more / Contribute
by newbie200
on Dec 07, 2017 at 10:45

    Hello, I am new to Perl. I am trying to implement SSO on a Perl web app but can't seem to get my head round it. Here are the technical details.

    On Apache I downloaded, installed and configured the module. This allowed me to detect a user logged on to a computer, and I was able to tell whether the user was in a local domain or a global domain. Now comes the tricky part: I have to program an SSO in my web app that sees the person logged on via Apache. I also have LDAP configured. It all just seems so confusing to me.

    I would be glad if someone could explain more about this. Do I need an SSO server? How do I connect my Perl web app to Apache to read the information required?
New Meditations
The problem of "the" default shell
4 direct replies — Read more / Contribute
by afoken
on Dec 09, 2017 at 08:17

    I got a little tired of searching for my "avoid the default shell" postings over and over again, so I wrote this meditation to sum it up.

    What is wrong with the default shell?

    In an ideal world, nothing. The default shell /bin/sh would have a consistent, well-defined behaviour across all platforms, including quoting and escaping rules. It would be quite easy and unproblematic to use.

    But this is the real world. Different platforms have different default shells, and they change the default shell over time. Also, shell behaviour changed over time. Remember that the Unix family of operating systems has evolved since the 1970s, and of course, this includes the shells. Have a look at "Various system shells" to get a first impression. Don't even assume that operating systems keep using the same shell as default shell.

    And yes, there is more than just the huge Unix family. MS-DOS copied concepts from CP/M and also a very little bit of Unix. OS/2 and the Windows NT family (including 2000, XP, Vista, 7, 10) copied from MS-DOS. Windows 1-3, 9x, ME still ran on top of DOS. From this tree of operating systems, we got command.com and cmd.exe.

    By the way: Modern MacOS variants (since MacOS X) are part of the Unix family, and so is Android (after all, it's just a heavily customized Linux).

    Some ugly details:

    And when it comes to Windows (and DOS, OS/2), legacy becomes really ugly.

    So, to sum it up, there is no such thing as "the" default shell. There are a lot of default shells, all with more or less different behaviour. You can't even hope that the default shell resembles a well-known family of shells, like Bourne. So there is much potential for nasty surprises.

    Why and how does that affect Perl?

    Perl has several ways to execute external commands, some more obvious, some less. In the very basic form, you pass a string to perl that roughly resembles what you would type into your favorite shell:

    • system('echo hello');
    • exec('echo hello');
    • open my $pipe,'echo hello |' or die "Can't open pipe: $!";
      my $hello=do { local $/; <$pipe> };
      close $pipe;
    • my $hello=qx(echo hello);
    • my $hello=`echo hello`;

    Looks pretty innocent, doesn't it? And it is, until you want to start doing real-world things, like passing arguments containing quotes, dollar signs, or backslashes to an external program. You need to know the quoting rules of whatever shell happens to be the default shell.

    For those cases, perl is expected to pass the string to /bin/sh for execution. Except that in this innocent case, and several other cases, perl does not invoke the default shell at all. Buried deep in the perl sources, there are some heuristics at work. If perl thinks that it can start the executable on its own, because the command does not contain what is documented as "shell metacharacters", perl splits the command on its own and can avoid invoking the default shell.

    Why? Because perl can easily figure out what the shell would do, and do it by itself instead. This avoids a lot of overhead and so is faster and does not use as much memory as invoking the shell would.

    Unfortunately, the documentation is a little bit short on details. See "Perl guessing" in Re^2: Improve pipe open? (redirect hook): From the code of Perl_do_exec3() in doio.c (perl 5.24.1), it seems that the word "exec" inside the command string triggers a different handling, and some of the logic also depends on how perl was compiled (preprocessor symbol CSH).

    If you don't need support from the default shell, you can help perl by passing system, exec, and open a list of arguments instead of a string. This "multi-argument" or "list form" of the commands always avoids the shell, and it completely avoids any need to quote.

    (Well, at least on Unix. Windows is a completely different beast. See Re^3: Perl Rename and Re^3: Having to manually escape quote character in args to "system"?. It should be safe to pretend that you are on Unix even if you are on Windows. Perl should do the right thing with the "list form".)

    So our examples now look like this:

    • system('echo','hello','here','is','a','dollar:','$');
    • exec('echo','hello','here','is','a','dollar:','$');
    • open my $pipe,'-|','echo','hello','here','is','a','dollar:','$' or die "Can't open pipe: $!";
      my $hello=do { local $/; <$pipe> };
      close $pipe;

    Did you notice that qx() and its shorter alias `` don't support a list form? That sucks, but we can work around that by using open instead. Writing a small function that wraps open is quite easy. See "Safe pipe opens" in perlipc.
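    Such a wrapper might look roughly like the sketch below. The name capture is made up for this example (it is not a perl builtin), and the error handling is just one possible choice. The example uses $^X, the path of the running perl, as a stand-in for an external program so that it does not depend on any particular utility being installed.

```perl
use strict;
use warnings;

# capture() is a hypothetical helper name, not a Perl builtin.
# It runs a command given as a list (no shell involved) and returns
# everything the command wrote to STDOUT as one string.
sub capture {
    my @command = @_;
    # With fewer than two elements perl may fall back to its
    # single-string heuristics, so this sketch refuses that case.
    die "capture() needs a program plus at least one argument\n"
        if @command < 2;

    open(my $pipe, '-|', @command)
        or die "Can't run $command[0]: $!";
    my $output = do { local $/; <$pipe> };   # slurp all output
    close($pipe)
        or die "$command[0] failed: ", ($! ? $! : "exit status " . ($? >> 8)), "\n";
    return $output;
}

# The child perl simply prints its arguments; the dollar sign and the
# spaces reach it unchanged because no shell ever parses them.
my $hello = capture($^X, '-e', 'print "@ARGV"', 'hello', 'here is a dollar:', '$');
print "$hello\n";
```

    Because the arguments travel as a list, nothing needs to be quoted or escaped on the way to the child process.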

    Edge cases

    OK, let's assume I've convinced you to use the list forms of system, exec, and open. You want to start a program named "foo bar", and it needs an argument "baz". Yes, the program has a space in its name. This is unusual but legal in the Unix family, and quite common on Windows.

    • system('foo bar','baz');
    • exec('foo bar','baz');
    • open my $pipe,'-|','foo bar','baz' or die ...

    or even:

    my @command=('foo bar','baz'); and one of:

    • system @command;
    • exec @command;
    • open my $pipe,'-|',@command or die ...

    All is well. Perl does what you expect, no default shell is ever involved.

    Now, "foo bar" gets an update, and you no longer have to pass the "baz" argument. In fact, you must not pass the "baz" argument at all. Should be easy, right?

    • system 'foo bar';
    • exec 'foo bar';
    • open my $pipe,'-|','foo bar' or die ...


    or even:

    my @command=('foo bar'); and one of:

    • system @command;
    • exec @command;
    • open my $pipe,'-|',@command or die ...

    Wrong! system, exec, and even open in the three-argument form now see a single scalar value as the command, and start once again guessing what you want. And they will wrongly guess that you want to start "foo" with an argument of "bar".

    The solution for system and exec is hidden in the documentation of exec: Pass the executable name using indirect object syntax to system or exec, and perl will treat the single-argument list as a list, and not as a single command string.

    • system { 'foo bar' } 'foo bar';
    • exec { 'foo bar' } 'foo bar';


    or even:

    my @command=('foo bar'); and one of:

    • system { $command[0] } @command;
    • exec { $command[0] } @command;

    If the command list is not guaranteed to contain at least two arguments (e.g. because arguments come from the user or the network), you should always use the indirect object notation to avoid this trap.

    Did you notice that we lost another way of invoking external commands here? There is (currently) no way in perl to use pipe open with a single-element command list without triggering the default shell heuristics. That's why I wrote Improve pipe open?. Yes, you can work around by using the code shown in "Safe pipe opens" in perlipc and using exec with indirect object notation in the child process. But that takes 10 to 20 lines of code just because perl tries to be smart instead of being secure.
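    For illustration, that workaround might be condensed into something like the following sketch. The name capture_exact is hypothetical, the error handling is minimal, and the forking form of open used here is assumed to be available (it is on Unix-like systems, but not everywhere).

```perl
use strict;
use warnings;

# capture_exact() is a hypothetical helper name. It follows the
# "Safe pipe opens" pattern: fork via open, then exec in the child
# with indirect object notation, so no shell heuristics can kick in
# even when @command has only one element.
sub capture_exact {
    my @command = @_;
    die "capture_exact() needs a program name\n" unless @command;

    my $pid = open(my $pipe, '-|');     # forks; the child's STDOUT feeds $pipe
    die "Can't fork: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: replace this process with the program, shell-free.
        exec { $command[0] } @command
            or die "Can't exec $command[0]: $!";
    }

    # Parent: read everything the child printed.
    my $output = do { local $/; <$pipe> };
    close($pipe);    # also waits for the child; its exit status lands in $?
    return $output;
}

# $^X (the running perl) stands in for an external program here.
print capture_exact($^X, '-e', 'print "hello\n"');
```

    With this helper, capture_exact('foo bar') would try to run a program literally named "foo bar" instead of running "foo" with the argument "bar".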

    Avoiding external programs

    Why do you want to run external programs? Perl can easily replace most of the basic Unix utilities, by using internal functions or existing modules. And as a bonus, you don't depend on the external programs. This makes your code more portable. For example, Windows does not have ls, grep, awk, sed, test, cat, head, or tail out of the box, and find is not find, but a poor excuse for grep. If you use perl functions and modules, that does not matter at all. Likewise, not all members of the Unix family have the GNU variant of those utilities. Again, if you use perl functions and modules, it does not matter.

    Tool               Perl replacement
    echo               print, say
    rm -r              File::Path
    mkdir -p           File::Path
    grep               grep (note: you need to open and read files manually)
    ls, find           File::Find, glob, stat, lstat, opendir, readdir, closedir
    test, [, [[        stat, lstat, -X, File::stat
    cat, head, tail    open, readline, print, say, close, seek, tell
    ln                 link, symlink
    curl, wget, ftp    LWP::UserAgent and friends
    ssh                Net::SSH2, Net::OpenSSH

    Note: The table above is far from being complete.
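    As a rough illustration of a few rows of that table, the sketch below does the work of mkdir -p, echo, cat, head, grep, ls, and rm -r using only core perl functions and modules. The directory layout and file name are invented for the example, and everything happens inside a temporary directory.

```perl
use strict;
use warnings;
use File::Path qw(make_path remove_tree);
use File::Spec;
use File::Temp qw(tempdir);

my $base = tempdir(CLEANUP => 1);          # scratch directory, removed at exit
my $dir  = File::Spec->catdir($base, 'a', 'b', 'c');
make_path($dir);                           # mkdir -p a/b/c

my $file = File::Spec->catfile($dir, 'notes.txt');
open(my $out, '>', $file) or die "Can't write $file: $!";
print {$out} "line $_\n" for 1 .. 5;       # echo "line N" >> notes.txt
close($out) or die "Can't close $file: $!";

open(my $in, '<', $file) or die "Can't read $file: $!";
my @lines = <$in>;                         # cat notes.txt
close($in);
my @head    = @lines[0 .. 1];              # head -n 2 notes.txt
my @matches = grep { /[35]/ } @lines;      # grep '[35]' notes.txt

opendir(my $dh, $dir) or die "Can't open $dir: $!";
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;   # ls
closedir($dh);

print "head:\n", @head;
print "grep:\n", @matches;
print "ls: @entries\n";

my $top = File::Spec->catdir($base, 'a');
remove_tree($top);                         # rm -r a
print "removed $top\n" unless -e $top;
```

    None of these steps starts an external process, so there is no shell and no quoting to worry about.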


    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)