Meditations


If you've discovered something amazing about Perl that you just need to share with everyone, this is the right place.

This section is also used for non-question discussions about Perl, and for any discussions that are not specifically programming related. For example, if you want to share or discuss opinions on hacker culture, the job market, or Perl 6 development, this is the place. (Note, however, that discussions about the PerlMonks web site belong in PerlMonks Discussion.)

Meditations is sometimes used as a sounding-board — a place to post initial drafts of perl tutorials, code modules, book reviews, articles, quizzes, etc. — so that the author can benefit from the collective insight of the monks before publishing the finished item to its proper place (be it Tutorials, Cool Uses for Perl, Reviews, or whatever). If you do this, it is generally considered appropriate to prefix your node title with "RFC:" (for "request for comments").

User Meditations
How to make a progress counter for parsing HTML with HTML::TreeBuilder
No replies — Read more | Post response
by ambrus
on Oct 30, 2014 at 12:33

    This is the true story of a trivial bug I made in a perl program yesterday.

    This program parses a 3-megabyte HTML file using the HTML::TreeBuilder module. The program takes less than 30 seconds to run, but that's still boring to wait through, and I'd like to see whether it hangs, so I decided to add a progress counter. Now, as I haven't written all of the program yet, much of the time is currently spent just parsing the HTML file and building a tree representation of it in memory. Thus, I needed a progress counter in the HTML parsing itself (as well as one in the rest of the program).

    Before I added the progress counter, all of the HTML parsing happened in a single call to the HTML::TreeBuilder->parse_file method. If I kept that, it would be difficult to add a progress counter. Thus, I changed the code to read the HTML file in 64-kilobyte chunks instead, feed each chunk to the parser with the HTML::TreeBuilder->parse method, and print progress after each one according to how much of the file has been read.

    I thus wrote this.

    use HTML::TreeBuilder;

    my $filename = ...;
    my $tree = HTML::TreeBuilder->new;
    {
        open my $fileh, "<", $filename
            or die qq(error opening input html file "$filename": $!);
        binmode $fileh;
        my $filesize = -s $fileh;
        while (read $fileh, my $buf, (1<<16)) {
            $tree->parse($buf);
            printf(STDERR "Parsing html, %2d%%;\r", int(100*tell($fileh)/($filesize+1)));
        }
        $tree->eof;
        print STDERR "Parsing html complete. \n";
    }

    This worked fine. I got a comforting progress counter with percentages rolling quickly on the screen.

    Later, however, I wanted to work around a bug in the HTML, namely some missing open tags. The fix could be applied mechanically, because this is a generated HTML file, but it was easier to modify the text of the HTML before parsing it into the tree; otherwise the tree would have the wrong shape, which would be difficult to fix afterwards.

    Thus, I chose to do some substitutions on the text of the HTML before parsing it. This was easiest done by slurping the whole HTML file and running the substitutions on the whole thing. So I changed the code to slurp the file contents and substitute on them, but I still wanted to feed the result to HTML::TreeBuilder in chunks to get a nice progress counter. No big deal, I wrote this.

    use HTML::TreeBuilder;

    my $filename = ...;
    my $tree = HTML::TreeBuilder->new;
    {
        printf STDERR "Reading html file.\n";
        open my $fileh, "<", $filename
            or die qq(error opening input html file "$filename": $!);
        binmode $fileh;
        local $/;
        my $filec = <$fileh>;
        eof($fileh) or die qq(error reading input html file);
        printf STDERR "Substing html file.\n";
        $filec =~ ...;
        my $filesize = length $filec;
        printf STDERR "Substed html has length %d\n", $filesize;
        my $filetell = 0;
        while (my $buf = substr $filec, 0, (1<<16), "") {
            $filetell += length $filec;
            $tree->parse($buf);
            printf STDERR "Parsing html: %2d%%;\r", int(100*$filetell/($filesize+1));
        }
        $tree->eof;
        print STDERR "Parsing html complete. \n";
    }

    This didn't work. The progress counter started showing very high numbers, going up to tens of thousands of percent. I stopped the program because I was worried it had got into an infinite loop, repeatedly parsing the same part of the file over and over and building an infinite tree.

    After a while, I found the problem. It turns out that the HTML was parsed correctly; only the progress was displayed wrongly.

    Can you spot the bug? I'll reveal the solution under the fold.

RFC: MooX::Restore
1 direct reply — Read more / Contribute
by boftx
on Oct 28, 2014 at 23:01

    I saw this module come across recently: MooseX::Role::UnsafeConstructable

    I immediately thought of a few use-cases where I want to instantiate an object from, say, a database row, but having init_arg => undef in my Moo code would prevent that.

    As it turns out, it is fairly simple to create a Moo::Role that provides a new method, possibly named restore, which ignores the init_arg directive and allows one to instantiate a Moo object from a hash or hashref that would otherwise be blocked. A side benefit is that such a method could still call builders and the like, if needed, for attributes that were not stored in the database row. A rough sketch of the idea follows.
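    Something along these lines might work (a minimal sketch only; the package name MooX::Role::Restore, the no-argument call to new, and the direct hash-slot assignment are illustrative assumptions, and real code would have to deal with required attributes, isa checks, coercions, and init_arg renaming):

    package MooX::Role::Restore;    # hypothetical name, see question b) below

    use Moo::Role;

    # Build the object normally so defaults and builders still run,
    # then force the stored values straight into the hash-based
    # instance, bypassing any init_arg restrictions.
    sub restore {
        my ( $class, @args ) = @_;
        my %args = @args == 1 ? %{ $args[0] } : @args;

        my $self = $class->new();    # assumes no attributes required at new()

        $self->{$_} = $args{$_} for keys %args;

        return $self;
    }

    1;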

    My questions are these: a) does anyone else have a similar use-case where it would be handy to do something like my $obj = MyClass->restore( $db_rowref );, bypassing init_arg restrictions, and b) what would be the correct name for such a Role? (I really think "UnsafeConstructable" is a bad choice.)

    I realize there are a few (or more) warts on this, especially where init_arg is used to rename an attribute. I would love to hear thoughts on what one would expect to happen in those cases.

    You must always remember that the primary goal is to drain the swamp even when you are hip-deep in alligators.
RFC: QA Uploads
1 direct reply — Read more / Contribute
by mgv
on Oct 27, 2014 at 17:35

    Debian has a process called "QA uploads": if a package is orphaned¹, any Debian Developer can upload a new version of the package without adopting it.

    When adopting a package/module, the adopter feels compelled to fix all bugs, add more tests, clean everything up, etc. (otherwise they wouldn't be doing their job as maintainers). The amount of work discourages people from adopting modules.

    With QA uploads, an interested user can fix that particularly annoying bug without the burden of having to maintain the module.

    Thus, I believe that adding QA uploads to PAUSE would increase the average quality of modules. I haven't thought about implementation details, but I think the PAUSE indexer could simply index any upload of an orphaned module.

    ¹ Debian / CPAN equivalence:

    Debian                        CPAN
    O:   / Orphaned               ADOPTME has f/m/c
    RFA: / Request for Adoption   HANDOFF has c
    RFH: / Request for Help       NEEDHELP has c
    QA Uploads are only possible for orphaned packages.
Reanimating regular issue: Indirect Object Notation
2 direct replies — Read more / Contribute
by McA
on Oct 27, 2014 at 06:22

    Hi all,

    As a regular reader of the Perlweekly newsletter, I stumbled on this entry in edition #170: Stop using indirect object notation.

    In the same moment I thought: Didn't I ask something related some time ago? Yes, I did. And I found it: Reference needed.

    So, I bring this to awareness once again.

    The reactions on Twitter are interesting. IMHO the very first action that could be taken is to change all (changeable) documentation where new Class is used, because most people don't care: they copy and paste the examples and synopses of CPAN modules, and you can still find this indirect notation on CPAN.
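    For anyone who hasn't run into the issue, a minimal example of the difference (the class name My::Widget is made up for this illustration):

    use strict;
    use warnings;

    package My::Widget;
    sub new {
        my ( $class, %args ) = @_;
        return bless { %args }, $class;
    }

    package main;

    # Indirect object notation: ambiguous; Perl has to guess whether
    # "new" is a method on My::Widget or an ordinary subroutine call,
    # and it sometimes guesses wrong.
    my $risky = new My::Widget( colour => 'red' );

    # Arrow notation: an unambiguous method call. This is the form
    # documentation and synopses should show.
    my $clear = My::Widget->new( colour => 'red' );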

    Regards
    McA

zentara is going bye-bye
8 direct replies — Read more / Contribute
by Anonymous Monk
on Oct 25, 2014 at 14:50
    Hello esteemed monks and nerds out there. I post this anonymously because my computer blew out yesterday, and instead of fixing it, or wasting bucks on another, I decided to let the computer go. I take this cosmic ray hit on my computer as a sign from God that wasting time on the illusion of programming is just part of Maya, the Great Illusion. It was great fun and all, and taught me a lot, but I'm not staying on this planet, and if the karma associated with computers is bad, then getting rid of my computer is good.

    So, I'm not dead, I'm not fading or slowly iterating away, I just don't see value in wasting time on an illusion.

    So, as final advice, zentara says seek the Vaikunthas, and remember, I'm not ignoring posts, but if God takes your computer away, what can you do? :-)

Refactoring Perl5 with Lua
1 direct reply — Read more / Contribute
by rje
on Oct 21, 2014 at 14:31

    WARNING: It may be that I'm simply thinking about Parrot in a different way...

    If you've read my previous post on microperl, then you're sufficiently prepared to take this post with a grain of salt. As a brief summary, I'll re-quote something Chromatic wrote to start me thinking about this problem in general:

    "If I were to implement a language now, I'd write a very minimal core suitable for bootstrapping. ... Think of a handful of ops. Think very low level. (Think something a little higher than the universal Turing machine and the lambda calculus and maybe a little bit more VMmy than a good Forth implementation, and you have it.) If you've come up with something that can replace XS, stop. You're there. Do not continue. That's what you need." (Chromatic, January 2013)

    Warning: I've never written a VM or a bytecode interpreter. I have written interpreters and worked with bytecodes before (okay, a 6502 emulator, but that's basically a bytecode interpreter, right?) Just remember that I'm not posting from a position of strength.

    So I found the Lua opcode set, and it seems a good starting point for talking about a small, though perhaps not minimal, Turing machine that seems to do much of what Chromatic was thinking about... except for XS, which I still haven't wrapped my head around.

    Lua has a register-based, 35-opcode VM with flat closures, threads, coroutines, and incremental garbage collection... and manages to shoehorn in a tail call, a "for" loop, and a CLOSURE, for goodness' sake. And some of those opcodes could be "macros" built on top of other opcodes rather than atomic opcodes (only if speed were unimportant): SUB, MUL, DIV, POW, LE.
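    To make "register-based with a handful of ops" concrete, here is a toy dispatch loop in Perl; the three opcodes and the instruction format are invented for this sketch and are not Lua's actual opcode set:

    use strict;
    use warnings;

    my @reg = (0) x 8;                      # eight virtual registers

    my @code = (
        [ LOADK => 0, 40   ],               # R0 = 40
        [ LOADK => 1, 2    ],               # R1 = 2
        [ ADD   => 2, 0, 1 ],               # R2 = R0 + R1
        [ PRINT => 2       ],               # prints 42
    );

    my %ops = (
        LOADK => sub { $reg[ $_[0] ] = $_[1] },
        ADD   => sub { $reg[ $_[0] ] = $reg[ $_[1] ] + $reg[ $_[2] ] },
        PRINT => sub { print "$reg[ $_[0] ]\n" },
    );

    for my $insn (@code) {
        my ( $op, @args ) = @$insn;
        $ops{$op}->(@args);
    }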

    Again, a disclaimer: I haven't been in a compiler construction class for 25 years, and my career has typically been enterprise coding, data analysis, and tool scripting. Regardless, a small opcode set seems to me to be important for portability. And... 35 codes... well, that's dinky.

    I don't assume that Lua's codes are sufficient for Perl... things are likely missing or just not quite right for Perl. But I have to start somewhere, right? And I figure some of you have the right Domain Knowledge to shed some light on the subject. Right?

    There are lots of neat notes in the aforementioned Lua design doc, written in a clear and concise manner. And now for a brief glance at Lua's opcodes:

On optimizing nested loops
3 direct replies — Read more / Contribute
by FloydATC
on Oct 19, 2014 at 06:05

    While working on a complex script doing lookups and searches on a dozen arrays of hashes (each array representing a relational database table) I stumbled across an extremely simple improvement that instantly gave almost twice the performance.

    The original loop looked like this:

    sub filter {
        my $where = shift;
        my @in    = @_;

        # This class method is used to filter an array of hashrefs against
        # a set of criteria defined in $where.
        # Example:
        #   @matching_hosts = filter( { site => 56, type => 4 }, @all_hosts );
        # In this example, @matching_hosts will only contain those hashrefs
        # that would return TRUE for the following code:
        #   ($_->{'site'} eq '56' && $_->{'type'} eq '4')
        # Note that the "eq" and "&&" are implied; no other operators are
        # supported. The order of the array is not affected.

        my @out = ();
        foreach my $record (@in) {
            my $keep = 1;
            foreach my $field (keys %{$where}) {
                unless ($record->{$field} eq $where->{$field}) {
                    $keep = 0;
                    last;
                }
                push @out, $record if $keep;
            }
        }
        return @out;
    }

    The rewritten loop looks like this:

    sub filter {
        my $where = shift;
        my @in    = @_;

        # This class method is used to filter an array of hashrefs against
        # a set of criteria defined in $where.
        # Example:
        #   @matching_hosts = filter( { site => 56, type => 4 }, @all_hosts );
        # In this example, @matching_hosts will only contain those hashrefs
        # that would return TRUE for the following code:
        #   ($_->{'site'} eq '56' && $_->{'type'} eq '4')
        # Note that the "eq" and "&&" are implied; no other operators are
        # supported. The order of the array is not affected.

        my @out = ();
        # Make one pass per match term
        foreach my $field (keys %{$where}) {
            my $value = $where->{$field};
            @out = grep { $_->{$field} eq $value } @in;
            @in  = @out;    # Prepare for next pass (if any)
        }
        return @out;
    }

    The running times of actual reports dropped from over 4 seconds to less than 2 seconds. Some of that improvement obviously came from using the built-in grep{} function instead of manually checking each value and push()'ing hashrefs to the @out array, but I didn't expect that much of an improvement.

    There had to be a different explanation, and that got me thinking about the cost of setting up and executing a foreach() loop:

    $ cat foreach_inner
    #!/usr/bin/perl
    use strict;
    use warnings;

    foreach my $foo (1 .. 3) {
        foreach my $bar (1 .. 10000000) {
            my $pointless = "$foo.$bar";
        }
    }

    $ time ./foreach_inner

    real    0m8.975s
    user    0m8.954s
    sys     0m0.013s

    $ cat foreach_outer
    #!/usr/bin/perl
    use strict;
    use warnings;

    foreach my $foo (1 .. 10000000) {
        foreach my $bar (1 .. 3) {
            my $pointless = "$foo.$bar";
        }
    }

    $ time ./foreach_outer

    real    0m14.106s
    user    0m14.092s
    sys     0m0.003s

    Both test scripts do the exact same amount of (pointless) work; the difference between them is that 'foreach_outer' has to set up its inner foreach() loop 9999997 more times than 'foreach_inner' does.

    Sometimes, even a seemingly pointless improvement can make a significant difference if made in the right place.

    Now, the way filters are specified in $where is pretty much nailed down because that hashref is built and used in a lot of different contexts. I am still looking for a way to express the whole thing as a single grep{} block to eliminate the looping altogether. Maybe tomorrow.
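    One possible single-pass formulation (an untested sketch; it assumes List::Util 1.33 or newer for all(), and I haven't benchmarked it against the pass-per-term version above):

    use List::Util qw(all);

    sub filter {
        my $where = shift;

        # Keep a record only if every field in $where matches it.
        return grep {
            my $record = $_;
            all { $record->{$_} eq $where->{$_} } keys %{$where};
        } @_;
    }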

    -- FloydATC

    Time flies when you don't know what you're doing

RFC: Bi-directional multi-client non-blocking TCP server/client
No replies — Read more | Post response
by glenn
on Oct 17, 2014 at 11:45

    I created these two libraries to handle multiple clients connecting to multiple servers. It is designed so that a client sends data to a specific server while the server sends updates to all clients. In my case the client is the Tk UI for our testing program, which is running on the server and managing test systems. This allows not only remote control of the server but also keeps all interested people up to date. The data is passed as XML, as it gives nice control structures, and the IPs of the sender and receiver can be added from the socket info.

    Perhaps someone can enlighten me: in my original design I used two threads, one for RX and the other for TX, and blocked until action needed to be taken. To accomplish this I had to deconstruct the IO::Select lib so that the INET socket could be shared between the two threads; however, I was never able to successfully store and share the socket. This would further reduce CPU usage by allowing the TX queue and RX socket to block until there was data. I would appreciate any insight.

    If this can be accomplished without threads...
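    For reference, a single-threaded shape that avoids the shared-socket problem entirely (a minimal sketch only; the port number and the broadcast-everything behaviour are stand-ins, not the actual library's interface):

    use strict;
    use warnings;
    use IO::Socket::INET;
    use IO::Select;

    my $listener = IO::Socket::INET->new(
        LocalPort => 7777,
        Listen    => 5,
        Reuse     => 1,
    ) or die "listen failed: $!";

    my $sel = IO::Select->new($listener);

    while (1) {
        # can_read() blocks until at least one socket is ready,
        # so the loop burns no CPU while idle.
        for my $fh ($sel->can_read) {
            if ($fh == $listener) {
                my $client = $listener->accept or next;
                $sel->add($client);
            }
            else {
                my $n = sysread $fh, my $buf, 65536;
                if (!$n) {                  # EOF or error: drop the client
                    $sel->remove($fh);
                    close $fh;
                    next;
                }
                # Broadcast the update to every connected client.
                for my $peer ($sel->handles) {
                    next if $peer == $listener;
                    syswrite $peer, $buf;
                }
            }
        }
    }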

    TESTED: 1 server, 3 clients

    USAGE: SERVER: CLIENT:
Default Dropdown Value
3 direct replies — Read more / Contribute
by choroba
on Oct 17, 2014 at 03:35
    Recently, I was refactoring a CGI script at work. It contained a subroutine used to determine the default value for a dropdown list:
    sub DefaultHashValue {
        my %h = @_;
        my %r = reverse %h;
        my @k = sort values %h;
        return $r{ $k[0] }
    }

    Neat and short, I thought. But wait, what exactly does it do? We pick up the asciibetically first value and find the corresponding key. It took me some time to understand it (yes, I'm tough). Could this code be written in a way that speaks for itself?

    I'd probably write it differently:

    sub sort_keys {
        my %h = @_;
        my @s = sort { $h{$a} cmp $h{$b} } keys %h;
        return $s[0]
    }

    Our dropdowns vary in size from 2 elements to several hundred. Out of pure curiosity (there were no speed problems), I benchmarked the solutions (see below). Interestingly, for lists over 50 elements, the original solution was faster.

    It wasn't so hard to come up with a winner. It's still readable, too:

    sub min {
        my %h = @_;
        my $min = (keys %h)[0];
        $h{$_} lt $h{$min} and $min = $_ for keys %h;
        return $min
    }

    Which solution would you use and why? Or, would you use something else? Why? (I stayed with the original).

    For the interested, the full testing and benchmarking code:

    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
How to Contribute to Perl+Science
4 direct replies — Read more / Contribute
by PerlSufi
on Oct 14, 2014 at 11:37
    Hello Monks,
    After only some minor experience solving bioinformatics problems using Perl, I was wondering how I could contribute to bioinformatics, or to science in general, with Perl.
    Aside from giving a talk about Perl and bioinformatics at my local Perl Mongers group, I am still eager to contribute. I have written small modules that export subs to do basic things like translate RNA strings to protein.
    However, I have not released these to CPAN because CPAN has BioPerl, which may do these things already. From the view of a newcomer, BioPerl is a little difficult to work with. I do thoroughly enjoy solving bioinformatics problems with Perl, and I also have an interest in astronomy.
    Any insight is greatly appreciated :)
RFC: An on-disk NFS-safe key-value store database (NFSdb)
3 direct replies — Read more / Contribute
by RecursionBane
on Oct 12, 2014 at 13:29
    Greetings, Monks!

    It has been too long since I have solicited your opinion.

    After looking at the dozens upon dozens of database mechanisms available, I see that there are two major types:

    1. On-disk, serverless, "low"-concurrency databases as file(s); examples include:
    2. Remote (even if via localhost), server/client, "high"-concurrency databases; examples include:
    I had a specific requirement for a database that was:
    • Multi-process safe
    • Multi-host safe
    • Network File System (NFS) safe
    • Multi-master enabled (potentially to thousands of master processes concurrently)
    • Easy to back up on a regular schedule
    • Lacking a single point of failure, assuming IT-managed storage filers

    None of the local databases I have found claim to be both multi-process safe and NFS-safe:

    • Some of them are averse to NFS (see: SQLite, BerkeleyDB, LMDB),
    • Others do not allow multiple processes accessing the database at the same time (see: TokyoCabinet, LevelDB), and,
    • Still others perform coarse-locking for multi-process access (see: MLDBM::Sync).

    Remote databases require one or more server hosts, or else the program will have to open and maintain one (and only one!) local server-process and have all other processes connect to it via localhost. Additionally, having managed to choke a MySQL server with unoptimized long-running queries early on while developing a complex project, I tend to shy away from remote databases.

    Despite the risk of link rot, it is hoped that the extensive collection of links above helps users find a database binding in Perl that works for their needs. A description of NFSdb begins below.

    Let's start with how NFSdb benchmarks against SQLite with multiple writers and readers across a network file system.

    # Benchmarks with 100000 sequential keys with random record values
    # across four concurrent readers/writers
    #
    # NFSdb settings:
    #
    #   atomic_read:        0
    #   atomic_write:       1
    #   db_root:            ./nfsdb
    #   debug:              0
    #   depth:              0
    #   lock_read:          0
    #   lock_write:         0
    #   nonblocking_write:  1
    #   profile:            0
    #
    # Benchmark            :   Avg (us)    Max (us)    Min (us)
    # =========                ========    ========    ========
    # SQLite fresh writes  :   12921.69   978057.00     1379.00
    # NFSdb  fresh writes  :    3337.90   117746.00     1893.00
    # SQLite repeat writes :   11329.72   880585.00     3419.00
    # NFSdb  repeat writes :    3952.88   159310.00     2121.00
    # SQLite fresh reads   :    2379.53   509153.00     1536.00
    # NFSdb  fresh reads   :    1139.35    12749.00      533.00
    # SQLite repeat reads  :    2471.39    40543.00     1518.00
    # NFSdb  repeat reads  :    1101.33    13373.00      311.00

    Note that the average times for writes are up to 4x better, and max times are up to 8x better; this is because of table-level locking in SQLite.
    Of course, this isn't an entirely fair comparison because SQLite provides a relational layer, whereas NFSdb is simply a key-value store. There are many situations, however, where a key-value store would suffice, but programmers code up a solution around SQLite anyway. There is a better way!

    Now, let's talk about the implementation.

    While perusing CPAN, I found File::SharedNFSLock to make locking across NFS feasible by exploiting hardlinks (Update: A kind Anonymous Monk points out that this module warns of potential race conditions if hardlinking is not a viable locking solution on NFS). Inspired by CHI::Driver::File's automatic hashing and deep-directory creation, I then proceeded to naively whip up a simple key-value store that I call NFSdb, with the following features:

    • Low-overhead (no server/client, but it does have a few non-core dependencies)
    • Object-oriented (my first OO module!)
    • NFS-safe locking available
    • Atomic (lockless) write supported
    • Indexless (so searching is not possible; the exact key is required for retrieval)
    • Benchmarks favorably compared to SQLite

    Since every "record" is a file on-disk, even with locking enabled, individual "cells" can be locked, leading to high concurrency when compared to SQLite's table-locking mechanism. With lockless writes, it is possible to achieve even higher performance with the tradeoff that your read_key() may not see the absolute newest data (I suppose this could be labeled "eventual consistency").

Installing wxPerl 0.9923 with wxWidgets 3.0.1 on Ubuntu 14.04 LTS 64bit
1 direct reply — Read more / Contribute
by jmlynesjr
on Oct 11, 2014 at 21:19

    I'm in the process of replacing my old 32bit Thinkpad with a new 64bit HP 15. As these things go, MPIDE, Fritzing, Eagle, and wxPerl all required libraries that weren't included in 14.04. After a lot of searching, all have been successfully installed. Below is the script I used for the wxWidgets/wxPerl installation. Hope it can be of some use to someone. Also cross posted to the wxPerl Wiki.

    Update1:

    Based on comments here at the Monastery and discussions with the original author, listed below is an updated version of the script.

    James

    There's never enough time to do it right, but always enough time to do it over...

Perl Success Stories
6 direct replies — Read more / Contribute
by aartist
on Oct 08, 2014 at 14:16
    I was visiting Success Stories and found them very old. The latest story is dated September 2001. Is another version being written by somebody? Do any blogs or websites reflect the current status?
A port of "Dukedom" to Perl
1 direct reply — Read more / Contribute
by boftx
on Oct 07, 2014 at 02:36

    I was bored the other day, so I decided to port the game "Dukedom" from C to Perl. Here is the result. I'd greatly appreciate comments/feedback on how to make it display-agnostic so it can be used for websites, Tk, etc. besides command line scripts. I have code refs now that can be changed out, but I'm pretty sure I need to do more. I am toying with using exception objects to signal the need for display/input and to provide callbacks to re-enter the state machine at the proper point.
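    As a rough illustration of the "swappable code refs" direction (the class name Game::Engine and the callback names here are made up for this sketch and are not the module's real interface):

    use strict;
    use warnings;

    package Game::Engine;

    sub new {
        my ( $class, %args ) = @_;
        return bless {
            output_cb => $args{output_cb} || sub { print @_ },
            input_cb  => $args{input_cb},
        }, $class;
    }

    # The engine only ever talks to the callbacks, never to the
    # terminal directly, so any front end can supply its own I/O.
    sub play_turn {
        my ($self) = @_;
        my $answer = $self->{input_cb}->('Your command? ');
        $self->{output_cb}->("You chose: $answer\n");
    }

    package main;

    # Command-line front end: callbacks wrap STDIN/STDOUT.
    my $game = Game::Engine->new(
        input_cb => sub {
            my ($prompt) = @_;
            print $prompt;
            chomp( my $answer = <STDIN> );
            return $answer;
        },
    );
    $game->play_turn;

    # A Tk or web front end would pass different callbacks instead.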

    Please keep in mind that this is only the first draft and no docs or tests have been written yet. However, the command line script will work and allow you to play the game.

    https://github.com/boftx/Games-Dukedom

    You can find the original code that I ported from here: https://github.com/caryo/Dukedom/blob/master/imports/dukedom.c

    You must always remember that the primary goal is to drain the swamp even when you are hip-deep in alligators.
The importance of avoiding the shell
5 direct replies — Read more / Contribute
by jhourcle
on Sep 25, 2014 at 07:34

    For those who haven't heard, there was a Bash exploit announced yesterday. Although a patch did come out (4.3.25), there are reports that it does not fully fix the problem.

    Using variations of the test string that was posted to slashdot, it looks as if perl makes your system invulnerable:

    sh-3.2$ env x='() { :;}; echo vulnerable' sh -c "echo this is a test"
    vulnerable
    this is a test
    sh-3.2$ env x='() { :;}; echo vulnerable & echo' perl -e 'system "echo test"'
    test
    sh-3.2$ env x='() { :;}; echo vulnerable' perl -e 'print `echo test`'
    test

    ... but unfortunately, perl only protects you when you pass system a list, or when the command string contains no shell metacharacters. If perl sees a shell metacharacter in your string, it hands the command to the shell and you're still vulnerable:

    sh-3.2$ env x='() { :;}; echo vulnerable' perl -e 'print `echo test;`'
    vulnerable
    test
    sh-3.2$ env x='() { :;}; echo vulnerable' perl -e 'system "echo test;"'
    vulnerable
    test
    sh-3.2$ env x='() { :;}; echo vulnerable' perl -e 'system qw(echo test;)'
    test;

    Your main attack vector is CGIs -- anyone can set their user-agent, or pass in a query string, and the webserver will set environment variables automatically. Should your scripts shell out, they're exploitable.

    So, the moral of the story: always use the list form of system, and avoid backticks if you can. If you have to do strange things w/ redirecting output, look at IPC::Open2 and IPC::Open3 which can also take list inputs.
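    For example, a couple of shell-free patterns (a sketch; the filename is only there to show that funny characters are harmless in list form):

    use strict;
    use warnings;

    my $file = 'some file; with $funny chars.txt';

    # List form of system(): the arguments go straight to execvp(),
    # no shell ever parses the string, so the exploit never triggers.
    system('ls', '-l', '--', $file) == 0
        or warn "ls exited with status $?";

    # Capturing output without backticks: the list form of a piped
    # open also bypasses the shell.
    open my $fh, '-|', 'ls', '-l', '--', $file
        or die "can't run ls: $!";
    my @lines = <$fh>;
    close $fh;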

