PerlMonks  

Re: Bottom-Up Data Mining with Perl

by dragonchild (Archbishop)
on Mar 05, 2003 at 18:42 UTC ( [id://240653] )


in reply to Bottom-Up Data Mining with Perl

Right off the top of my head, some things to include would be:
  • The flip-flop operator
  • The special variables $\, $", and others.
  • chomp vs. chop and where each is good
  • Templates. They're not just for HTML! (Useful for reading as well as writing.)
  • Data-driven parsing.
  • Functional parsing (different from data-driven). tilly wrote something very cool on this topic regarding HTML-like parsing with functional programming.
  • When to use a regex vs split vs unpack.
  • How to use unpack! (I still don't get how to use it ...)
  • The Burrito principle. (Very cool!)
Post your paper on PM when it's done. I would love to read it!
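
A few of the items in the list are easier to see in code than to describe. What follows is only a minimal sketch, not something from the original thread: the sample lines and the BEGIN/END markers are made up for illustration. It shows the flip-flop operator picking out a block of lines, chomp next to chop, and $" changing how an array interpolates.

```perl
use strict;
use warnings;

# Hypothetical input: only the lines from BEGIN through END are interesting.
my @lines = ("header\n", "BEGIN\n", "alpha\n", "beta\n", "END\n", "footer\n");

my @block;
for my $line (@lines) {
    # Flip-flop: turns true when the left test matches and stays true
    # up to and including the line on which the right test matches.
    if ($line =~ /^BEGIN$/ .. $line =~ /^END$/) {
        chomp(my $copy = $line);    # chomp removes only a trailing newline
        push @block, $copy;
    }
}

# chop removes the last character unconditionally, newline or not.
my $chopped = "beta";
chop $chopped;

# $" is the separator used when an array is interpolated into a string.
my $joined = do { local $" = '|'; "@block" };

print "$joined\n";     # BEGIN|alpha|beta|END
print "$chopped\n";    # bet
```

The practical difference between the last two: chomp is safe to call on a line that may or may not end in a newline, while chop will happily eat a character of real data.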

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Re: Bottom-Up Data Mining with Perl
by jjohhn (Scribe) on Mar 10, 2003 at 01:58 UTC
    Could you expand on how split, pack, unpack and regexes are related? I feel there's something to what you say, but I can't at all pin it down.
      split, unpack, and regexes are all ways to parse a given line of data. Each is useful in different circumstances. For example:
      • split is more useful with delimited lines, such as tab-delimited or comma-delimited. (However, using a module like Text::CSV is better for delimited text. This is because of lines like "abcd,'Smith, John', blah" - the comma in the quotes is part of the item, not a delimiter.) Now, one could use a regex here, but the regex is harder to understand, and even harder to get right.
        my @items = split $delim, $line;
        #### vs. (and I know this will make mistakes)
        my @items = $line =~ /([^$delim]*)(?:${delim}|$)/g;
      • unpack (if you understand how to use it!) is really good with fixed-width data, where the first field takes up so many columns, the second so many, and so on. This is often data from a mainframe.

        Again, you can use a regex here, but you have to build it programmatically for it to be maintainable. (I'd put an unpack example here, if I were comfortable knowing how to work it.)

        my @columns = (20, 10, 25, 5, 2, 2, 20);
        # join is needed here: map in scalar context returns a count, not a string
        my $regex = join '', map "(.{$_})", @columns;
        $regex = qr/^${regex}$/;
        my @items = $line =~ /$regex/;
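
For comparison, here is roughly what the unpack version of the fixed-width case might look like. Treat it as a sketch under assumptions: the column widths are the ones from the regex example, but the field contents are invented, and the A format is chosen on the guess that the fields are space-padded text.

```perl
use strict;
use warnings;

# Column widths from the regex example; the field values are made up.
my @columns = (20, 10, 25, 5, 2, 2, 20);
my @fields  = ('Smith', 'John', '12 Example Street', '00042', 'NY', 'US', 'widgets');

# Build a sample fixed-width record by left-justifying each field.
my $format = join '', map "%-${_}s", @columns;
my $line   = sprintf $format, @fields;

# "A20 A10 ...": each A<n> takes n characters and strips trailing spaces.
my $template = join ' ', map "A$_", @columns;
my @items    = unpack $template, $line;

print join('|', @items), "\n";
```

Unlike the regex, the template does not fail outright when the line is shorter than expected; unpack simply returns what it can, so a length check on the input may still be worth adding.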
      For every example I give on different parsing needs, there is a module on CPAN that does it better, faster, and safer. I personally would never hand-parse data in production. Heck, you can use CGI to parse query strings and form data without even having an HTTP server!
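
To make the quoted-comma problem concrete, here is a small sketch using the sample line from the split bullet. It uses Text::ParseWords rather than Text::CSV, purely because Text::ParseWords ships with core Perl; that is a swapped-in choice for the sake of a self-contained example, and Text::CSV remains the better fit for real CSV files.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);    # core module

my $line = q{abcd,'Smith, John', blah};

# A naive split treats the comma inside the quotes as a delimiter.
my @naive = split /,/, $line;

# quotewords honours the quotes; the 0 asks it to drop the quote characters.
my @parsed = quotewords(',', 0, $line);

printf "split:      %d fields\n", scalar @naive;     # 4
printf "quotewords: %d fields\n", scalar @parsed;    # 3
print  "second field: $parsed[1]\n";                 # Smith, John
```

Note that neither approach trims the space in front of "blah"; whitespace handling is one more thing a dedicated CSV module lets you configure.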

