
RFC: 101 Perl PDL Exercises for Data Analysis

by thechartist (Monk)
on May 06, 2019 at 23:36 UTC

101 Perl PDL Exercises for Data Analysis (May 2019 with PDL 2.019)

Long before "data science" became a fashionable topic in computing, Perl hackers were cleaning and analyzing data, and they have done so since Perl was written. This tutorial gives Perl solutions to the problems posed in: 101 NumPy Exercises for Data Analysis

My purpose is to demonstrate that Perl has the necessary tools to complete common data analysis tasks with minimal effort.

The philosophy of Perl has always been "There is more than one way to do it." Data analysis is no exception. While PDL has excellent functionality "out of the box", you might find it more effective to use individual CPAN modules to solve particular problems.

This document assumes you know some basic programming -- loops, conditionals, variables, etc. Perl syntax is similar to that of any C-derived language. Perl has a few fundamental data types:

  • 1. Scalars: single values, such as a string of characters or a number -- Prefixed by '$'
  • 2. Lists/Arrays: an ordered collection of items -- Prefixed by '@'. Arrays are variables; the values an array holds are lists.
  • 3. Hashes: a collection of key/value pairs -- Prefixed by '%'. Hashes can be converted to lists, and vice versa.
  • Examples:

    $foo = 99;                                  # Assigns the integer 99 to $foo. Scalar context.
    @foo = ('Jack', 5, 'Jill', 4, 'John', 7);   # A list assigned to @foo, values separated by commas.
    %foo = ('Jack', 5, 'Jill', 4, 'John', 7);   # A list as key, value pairs. Better ways to write this exist.

    There are others (typeglobs and references), but they will not be needed for the exercises that follow.
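    The three types above can be sketched in core Perl as follows. This is only an illustration: the variable names %from_list, @flat and $count are made up for this example, and the "fat comma" (=>) is one of the better ways to write a hash that the comment above alludes to.

```perl
use strict;
use warnings;

my $foo = 99;                                   # scalar: one value
my @foo = ('Jack', 5, 'Jill', 4, 'John', 7);    # array: ordered list of six items
my %foo = (Jack => 5, Jill => 4, John => 7);    # hash written with the fat comma

my %from_list = @foo;       # list -> hash: items are taken pairwise as key, value
my @flat      = %foo;       # hash -> list: key/value pairs, in no guaranteed order
my $count     = scalar @foo; # an array in scalar context gives its length

print "scalar: $foo\n";
print "Jill's value: $from_list{Jill}\n";
print "items in \@foo: $count\n";
```

    Note that the sigil follows the value being accessed: a single hash element is a scalar, so it is written $from_list{Jill}, not %from_list{Jill}.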

    As always in Perl, there is more than one way to do anything. For PDL, one can simply invoke the Perl interpreter at the command line (as for any other Perl script), or use a REPL (Read, Evaluate, Print, Loop) interface for interactive analysis. These exercises show PDL one-liners entered at the command line, but the code between the quotation marks should also work at the REPL.
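    For readers who prefer a script file to a one-liner, here is a minimal sketch of the same idea. It assumes PDL is installed; the file name (say, first.pl) is arbitrary. Because nothing passes through a shell, the quoting differences between Windows and Unix shells do not arise.

```perl
use strict;
use warnings;
use PDL;                     # the script equivalent of -MPDL on the command line

# Same kind of code that would go between the quotation marks of a one-liner.
my $arr = sequence(10);      # the values 0 through 9

print "PDL version: $PDL::VERSION\n";
print $arr, "\n";
```

    Run it with: perl first.pl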

    1. Import PDL and print the version.


    $ perl -MPDL -e "print $PDL::VERSION;"

    2. Create a 1D array of numbers from 0 to 9


    $ perl -MPDL -e "$arr = sequence(10); print $arr;"

    3. Create a 3×3 array of all true values (represented in PDL as ones).


    $ perl -MPDL -e "$arr = ones(3,3); print $arr;"

    4. Extract all odd numbers from arr = [0,1,2,3,4,5,6,7,8,9].


    $ perl -MPDL -e "$arr = sequence(10); $odd = where($arr, ($arr%2) == 1); print $odd;"

    5. Replace all odd numbers in arr (from question 4) with -1.

    Input: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    Output: [ 0, -1, 2, -1, 4, -1, 6, -1, 8, -1]

    Answer: $ perl -MPDL -e "$arr = sequence(10); $odd = $arr->where($arr % 2 == 1); $odd .= -1 ; print $arr;"

    6. Replace all odd numbers in arr with -1 without changing arr.

    Input: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    Output: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [ 0, -1, 2, -1, 4, -1, 6, -1, 8, -1]

    Answer: $ perl -MPDL -e "$arr = sequence(10); $out = $arr->copy; $odd = $out->where($out % 2 == 1); $odd .= -1; print $arr, $out;"

    Note: in PDL, the '.=' operator is overloaded to assign values into an existing piddle in place. Because where() returns a view that stays connected to its source, assigning to $odd also changes the piddle it was taken from; working on a copy leaves $arr untouched. In ordinary Perl, '.=' is string concatenation.
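    The difference between writing through a view and writing into a copy can be sketched as a short script. It assumes PDL is installed and uses the copy method to make an independent piddle; the variable names $view and $changed are made up for this illustration.

```perl
use strict;
use warnings;
use PDL;

my $arr     = sequence(10);        # original data, 0 through 9
my $changed = $arr->copy;          # independent copy: edits here never reach $arr

# where() returns a view onto the selected elements of $changed.
my $view = $changed->where($changed % 2 == 1);

# '.=' assigns in place, so the -1 values flow back into $changed.
$view .= -1;

print "original: ", $arr, "\n";    # still 0 through 9
print "changed:  ", $changed, "\n"; # odd values replaced by -1
```

    Had the view been taken from $arr directly, as in exercise 5, the assignment would have modified $arr itself.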

    7. Convert a 1D array to a 2D array with 2 rows.

    Input: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    Output: [ [0, 1, 2, 3, 4], [5, 6, 7, 8, 9] ]


    $ perl -MPDL -e "$seq = sequence(10); $seq_1 = $seq->reshape(5,2); print $seq_1;"

    8. Stack arrays a and b vertically.

    Input: a = [0,1,2,3,4,5,6,7,8,9] b = [1,1,1,1,1,1,1,1,1,1]
    Output: [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]


    $ perl -MPDL -e "$arr_a = sequence(10); $arr_b = ones(10); $out = pdl( $arr_a, $arr_b )->reshape( 5,4 ) ; print $out; "

    9. Stack the arrays a and b horizontally.

    Output: [[0, 1, 2, 3, 4, 1, 1, 1, 1, 1], [5, 6, 7, 8, 9, 1, 1, 1, 1, 1]]


    $ perl -MPDL -e "$arr_a = sequence(10)->reshape(5,2); $arr_b = ones(10)->reshape(5,2); print append( $arr_a, $arr_b );"
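    As an alternative to the answers above (not the approach used in the one-liners), both kinds of stacking can be sketched with PDL's glue method, which joins piddles along a chosen dimension. This assumes both inputs have already been reshaped to 2 rows of 5; the names $x, $y, $vertical and $horizontal are made up for this illustration.

```perl
use strict;
use warnings;
use PDL;

my $x = sequence(10)->reshape(5, 2);   # rows: 0..4 and 5..9
my $y = ones(10)->reshape(5, 2);       # two rows of ones

# PDL lists the fastest-varying dimension first: dim 0 runs along a row,
# dim 1 runs down the rows.
my $vertical   = $x->glue(1, $y);      # add rows:    dims become (5,4)
my $horizontal = $x->glue(0, $y);      # widen rows:  dims become (10,2)

print $vertical, "\n";
print $horizontal, "\n";
```

    Gluing along dimension 0 gives the same result as append, so which one to use for horizontal stacking is a matter of taste.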

    Replies are listed 'Best First'.
    Re: RFC: 101 Perl PDL Exercises for Data Analysis
    by choroba (Cardinal) on May 07, 2019 at 09:50 UTC
      Very nice younger sibling to RFC: 100 PDL Exercises (ported from numpy)!

      Just one nitpick: Why do you use the command line? Using scripts or the PDL interactive shell would be friendlier to people not using MSWin, whose quoting conventions your examples employ.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

        I use the command line because later on, I am going to port exercises from the O'Reilly book Data Science at the Command Line to Perl. This nonsensical idea that Perl is unsuited for "data science" tasks (which is really data cleaning/munging) needs to be refuted.

        Data Science at the Command Line link

        I think this ability to do quick things at the command line is an advantage that Perl has over Python.

        You do bring up a good point about the quoting. How should I make this more general, to make clear these one-liners work (with some modification) on all systems that have Perl and PDL installed?

          You might be interested in using the Perl debugger as a REPL

          perl -MPDL -de0

          You'd have less typing and others could easily copy.

          And you can also dump edit history when needed.

          Cheers Rolf
          (addicted to the Perl Programming Language :)
          Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

          I would heavily recommend either perldl or pdl2 as REPLs over the Perl debugger, because they are set up specifically for PDL work: both load PDL and PDL::NiceSlice, have a help system and a demo system, and facilitate multi-line commands, including copy-paste support.
    Re: RFC: 101 Perl PDL Exercises for Data Analysis
    by reisinge (Hermit) on May 07, 2019 at 13:30 UTC

      I think you should be using single quotes so shell stuff doesn't get expanded. And maybe -l to add a newline:

      $ perl -MPDL -le '$arr = sequence(10); print $arr'
      [0 1 2 3 4 5 6 7 8 9]
      In general, they do what you want, unless you want consistency. -- perlfunc

        These examples were done on Strawberry Perl on Windows, that is why quotes are done that way. I may take LanX up on his recommendation to use the -de switch.

          In my experience, it's hard enough quoting things on the command line in *nix. 10x points for doing the same in Windows (with the quotes reversed and everything). Thank Larry (?) for q//, qq//, qr//, and qw//!

          Quantum Mechanics: The dreams stuff is made of

    Re: RFC: 101 Perl PDL Exercises for Data Analysis
    by etj (Chaplain) on Apr 30, 2022 at 14:06 UTC
      A tool intended to help data cleaning of CSV files, suitable for further data-science-ing, is data-prepare.
    Re: RFC: 101 Perl PDL Exercises for Data Analysis
    by etj (Chaplain) on Apr 30, 2022 at 14:10 UTC

    Node Type: perlmeditation [id://1233413]
    Approved by LanX
    Front-paged by stevieb