
RFC: Text Processing for Chemists Tutorial

by davidrw (Prior)
on Jun 02, 2006 at 12:57 UTC ( #553278=perlmeditation )

This is a talk I'm giving later today to a group of computational chemistry students to introduce them to what you can do with text files on the command line. Originally it was going to be just Perl, but then I decided to start with the standard utilities, so the Perl isn't until the second half -- but it's there in force ;).

The goal isn't to have them walk out of the one-hour seminar knowing how to use all the tools, but hopefully knowing that the tools are out there, along with several starting/jumping-off points for using them.

Any & all comments appreciated, especially with the goal of making this (either as-is or modified) into a piece suitable for the Tutorials section.

=pod

=head1 Introduction to Text Utilities

UGA Chemistry Summer Lecture Series, June 2, 2006

=head1 Intended Audience

Chemistry summer research students (upperclassmen & graduate) doing computational tasks.

=head1 Abstract

This guide is an introduction, demonstration, and reference point to expose Chemistry summer research students to the realm of text processing via the command-line interface and all of the power and efficiency it offers. Using the standard text utilities will lead up to more advanced scripting with Perl.

=head1 Environment

Linux running the bash shell

=head1 Author & Presenter

David Westbrook, E<lt>dwestbrook@gmail.comE<gt>

=head1 Warning Label

This presentation is a crash course. The primary goal is exposure to the freely available tools and methods so that they can be learned and utilized in the future as the need arises.

=head1 The Toolbox

First we will review all of the basic tools we have available by default for working with and manipulating text files.

=over 4

=item man

THIS IS ONE OF THE MOST IMPORTANT COMMANDS. "man" stands for "Manual" and provides documentation for all of the commands in this document.

  man man
  man cd
  man ls

=back

=head2 Seeing files

Before we can work with a file, we have to know where it is.

=over 2

=item cd

This is used to Change Directories.

  cd /tmp
  cd ~
  cd -
  cd ..
  cd ../../foo/bar

=item pwd

This Prints the Working Directory -- i.e. tells you what directory you are in.

=item ls

LiSts files. Shows the contents of the current directory. See the man page for the many options.

  ls
  ls /tmp
  ls -lart
  ls -lart ../foo

=item locate

Searches all the filenames on the system for a given search string.
The availability of this command is system-dependent, and there are several caveats: 1) it works off a database that is generated nightly, so files created today won't be found; 2) it respects permissions, so files that you can't read won't be found; 3) it only works on the local filesystem, so files in mounted directories won't be found.

  locate host
  locate pass
  locate etc/pa

=item find

Recursively lists all of the files in the given directory (defaults to the current one). Many, many options in the man page.

  find
  find /tmp
  find /tmp -type f
  find /tmp -type f -maxdepth 1 -mtime +6 -exec echo {} \;

=back

=head2 File information

We need to be able to obtain basic information about a file to know what we're working with.

=over 2

=item ls -l

Lists the details of a file. This includes the permissions, owner, group, size, and last modified date.

=item wc

WordCount. Displays the number of lines, words, and bytes in a file.

=item file

Attempts to determine the file's contents -- e.g. html or text or binary or excel, etc.

=item identify

Similar to L<file> but for graphics files. Will include size and color information. This is provided by the ImageMagick toolset.

=back

=head2 File contents

Now we can begin to work with the file's actual contents. Note that "print" means "output to the screen" in this context.

=over 2

=item cat

Just prints out the contents of each file it's given. (Same as I<type> in DOS.)

  cat file1
  cat file1 file2
  cat -n file1

=item less

Shows a file one screen at a time (known as a 'pager'). (There is also a command 'more', but it has fewer features than L<less>.)

=item head

Prints out the first N lines of a file.

  head file1
  head -2 file1
  head -3 file1 file2

=item tail

Prints out the last N lines of a file.

  tail file1
  tail -2 file1
  tail -3 file1 file2
  tail +5 file1

=item grep

Searches files for a given string and prints the matching lines. See the man page for many, many options.
  grep foo file1
  grep -i foo file1
  grep -l foo *
  grep -n foo file1
  grep -A3 foo file1

=item strings

Prints out all the words found in a file. Especially useful on binary files for finding the pieces of text buried in their compiled contents.

  strings a.out
  strings foo.exe
  strings /bin/ls

=item sort

Orders (i.e. sorts!!) the lines of a file. See the man page for details.

  sort file1

=item uniq

Displays just the unique lines of a file. The file must be sorted.

=item cut

Prints just the specified columns of a file. See the man page for details.

  cut -f1,3 file1
  cut -f1,5,6 -d: file1

=item split

Splits a file into chunks. See the man page.

=item join

Combines two files based on a common column. See the man page.

=back

=head2 File Management

These are listed for quick reference -- refer to the man pages for further details.

=over 2

=item cp

=item mv

=item rm

=item mkdir

=item rmdir

=back

=head2 Editors

=over 2

=item vi

vi (or vim) does have a bit of a learning curve, but it is well worth it -- it is very powerful and is available on pretty much every *nix machine (there is gvim for Windows, too). It is best to find a reference (book or online tutorial) for the commands. Some essentials:

  :q          quits
  :q!         quits w/o saving
  :w          save
  :w!         force save
  :wq         saves and quits
  i           enter editing (insert) mode
  ESC         return to command mode
  /foo        search for foo
  :s/foo/bar  replace foo with bar

Others that you'll want to know (in no particular order):

  yy p dd dw w :$ :1 :55 :s/foo/bar/g :%s/foo/bar :%s/foo/bar/g
  :5,10s/foo/bar n N :n :N :wn :wN ctrl-g s x

=item view

Same as L<vi> but starts it in read-only mode. It's a very good habit to use L<view> when you know you're only looking at a file, so you don't accidentally change it.

=back

=head2 Miscellaneous

=over 2

=item clear

Clears the screen -- same as cls in DOS.

=item echo

Just displays its arguments to the screen (same as DOS).

  echo blah
  echo path=$PATH
  echo -n foo
  echo -e "foo\tbar\nstuff"

=item touch

Updates the last modified timestamp on a file.
If the file doesn't exist, creates a 0-byte file.

=item seq

Prints out sequences of numbers. See the options. Also see L<Loops> for example usage.

  seq 1 10
  seq 1 2 10

=item cal

Prints out a nicely formatted calendar.

  cal
  cal 7 2006

=item look

Prints out words from a dictionary file that start with the given string.

  look foo
  look princ

=item date

Prints out the date. See the options in the man page for various formats.

  date
  date -u

=item sleep

Pauses for N seconds.

  sleep 2

=item alias

Define your own commands.

  alias cls=clear

=item wget

Gets files from the web (or ftp). Extremely useful and powerful -- it can mirror entire sites. See the man page for lots of options.

=item curl

Another tool to get remote files (in case wget isn't available).

=item lynx

A text-based web browser! Useful for simple pages, testing connections, sucking down source code, converting html to text, or downloading files from HTTP or FTP sites.

=back

=head1 Combining Tools

=head2 Pipes

'|' is the "pipe" character. It is used to take the output from the left-hand side (LHS) and give/shove ("pipe") it as input to the right-hand side (RHS). Here are several example tasks that consecutively use two or more of the tools we have discussed.

=head3 Find a word that starts with "c" and has a "mel" in it.

  look c | grep mel

=head3 See if the word FOO is in the first 3 lines of a file.

  head -3 file1 | grep FOO

=head3 Take the lines that have FOO, look at just the first column, and show the unique values

  grep FOO file1 | cut -f1 | sort -u

=head3 Determine the location of a file with FOO in its name.

  find | grep FOO
  locate FOO

=head2 Redirection

The output of a command can be saved to another file.
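Redirection and pipes compose naturally. Here is a minimal end-to-end sketch; the file names /tmp/demo_in.txt and /tmp/foo_lines are made up for this demo:

```shell
# Build a small sample file, then capture grep's output with redirection.
printf 'FOO one\nbar two\nFOO three\n' > /tmp/demo_in.txt

grep FOO /tmp/demo_in.txt >  /tmp/foo_lines   # '>'  overwrites the target file
grep bar /tmp/demo_in.txt >> /tmp/foo_lines   # '>>' appends to it

wc -l < /tmp/foo_lines                        # prints 3 (2 FOO lines + 1 bar line)
```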
=head3 Output

  grep FOO file1 > foo_lines

=head3 Append Output

  grep FOO file1 >> foo_lines
  grep FOO file2 >> foo_lines

=head3 Input

  grep FOO file1
  cat file1 | grep FOO
  grep FOO < file1
  a.out < input.dat

=head3 Backticks

  echo `date`
  ls -lart `find | grep FOO`

=head1 Bash

A commonly used shell (although there are many) is bash. Besides just running regular commands, it also supports setting/retrieving of variables, loops, and conditionals. The man page is extensive.

=head2 Variables

  foo=Bar
  echo "my foo var = $foo"

We won't discuss it here, but bash supports variable mangling, e.g.

  echo $foo
  echo ${foo%%.*}
  echo ${foo##*.}
  echo ${foo#*.}

=head2 Loops

  for s in foo bar stff ; do echo s=$s ; done

  for s in foo bar stff
  do
    echo s=$s
  done

  for n in `seq 1 5` ; do touch /tmp/f$n.txt ; done

=head1 Text Processing

There are three powerful interpreters that can be used to filter text. The man pages for each contain a wealth of information.

=head2 sed

Useful & efficient for substitutions.

  sed s/1/AAAA/g /etc/hosts

=head2 awk

Useful for working with columns.

  awk '{print $2,$1}' /etc/hosts

=head2 perl

Useful for everything :) We'll come back to it in a moment, but here are examples that serve as replacements for many of the above commands.

  perl -pe '' $f                          # echo
  perl -pe 's/1/AAAA/g' $f                # sed s///
  perl -ane 'print $F[1], " ", $F[0]' $f  # cut/awk
  perl -ne 'print if /foo/' $f            # grep
  perl -ne 'print if $. <= 10' $f         # head

=head1 Regular Expressions

What is a regular expression (regex)? It is just a pattern of something you want to match in a string. And that pattern can be anything, simple or very complex.

What uses them? grep/egrep, sed, vi, and perl (and other languages). Note that there are several different "flavors" of regex depending on what's using it, but they are all more-or-less the same. We will focus on perl regex.
  man perlretut
  man perlre

Regular expressions can be scary at first, so we will try to look at them from a general overview:

=head2 Matching

  /a/

The I</>'s simply mark our pattern (note that perl can use anything for the delimiters with the I<m//> operator, e.g. I<m#a#>, I<m!a!>) and the I<a> is what we're matching, which is just the lower-case letter 'a'.

  /a*b/

This is 0 or more 'a' followed by a 'b'.

  /a+b/

This is one or more 'a' followed by a 'b'.

  /a\+b/

This is literally "a+b" -- the backslash is used to escape otherwise special characters.

  /Number:\d+ Some word: \w+/

This is a string that includes a number and a word, e.g. "Blah Number: 1234 Some word: foo1bar Blah"

=head2 Substitution

Expressions can be replaced with new values using the I<s///> substitution operator:

  s/a/b/            Replaces an 'a' with 'b'
  s/a/b/g           Replaces all 'a' with 'b'
  s/a/b/ig          Replaces all 'a' or 'A' with 'b'
  s/n=(\d+)/N($1)/  Changes "n=1234" to "N(1234)".

When there are parentheses in the pattern, they are used for grouping and for capturing -- the first set of parens becomes $1, the second $2, and so on.

=head2 More Regex

This has barely scratched the surface, but we will see example usage of more regex components below.

=head1 Perl

The first place to start with command-line perl is the perlrun manpage, and looking at & copying/using one-liner examples.

  perl -e 'print "hello world\n"'

Using I<-p> to loop through a file and print each line:

  f=/tmp/datafile.txt
  perl -pe '' $f
  perl -pe 's/a/BBBBB/' $f
  perl -pe 's/a/BBBBB/g' $f

Using I<-n> to loop through a file and look at each line:

  perl -ne '' $f
  perl -ne 'print' $f
  perl -ne 'print $_' $f
  perl -ne 'print if /a/' $f
  perl -ne 'print "$.)" . $_' $f
  perl -ne 'print "$.)" . $_ if $. % 2 == 0' $f

Some things seen so far:

=over 2

=item $_

This is one of many special variables (see man perlvar) that perl has.
It is perhaps the most special because it is the "default" -- whenever you don't supply a command with something, it assumes you want to use $_.

=item if(){}

Basic IF clause in perl -- similar to other languages. I<if( ... ){ ... }elsif( ... ){ ... }else{ ... }>

=item ... if ... ;

Perl lets you shorthand simple if statements by reversing the order, which is also nice because it takes fewer lines (and no curlies) and can be more natural to read. Perl also provides I<unless>, which is simply a shortcut for I<if(!( ... ))>.

  print "ok" if $ok;
  print "bad" if ! $ok;
  print "bad" unless $ok;
  while( ... ){ next unless ... ; last if ... ; }

=item $.

This is another special variable (see man perlvar) that is the current line number when reading in a file.

=back

So now we can take a closer look at this:

  perl -ne 'print if /a/' $f

And write it more explicitly in several ways to demonstrate the syntax:

  perl -ne 'print $_ if /a/' $f
  perl -ne 'if( /a/ ){ print $_ }' $f
  perl -ne 'print $_ if $_ =~ /a/' $f
  perl -ne 'print $_ unless $_ !~ /a/' $f

Here is a good time to note that the unofficial Perl motto is B<TMTOWTDI> (There's More Than One Way To Do It).

Another powerful command-line option is I<-a> to Auto-split each line (into @F), much like cut & awk do; it is combined with I<-F> to set the split pattern, e.g. I<-aF:>.

=head1 Examples

=head2 A geometry file needs to become many files

  split --lines=30 /tmp/g___
  for f in /tmp/g___* ; do
    d=`head -1 $f | sed s/^**//`
    mkdir -p blah/$d
    tail +2 $f > blah/$d/geom
  done

=head2 Rename a bunch of .tpl files, dropping the extension

  for n in `seq 1 3` ; do touch f$n.tpl ; done
  for f in *.tpl ; do mv $f ${f%.tpl} ; done
  ls *.tpl | perl -ne 'chomp;$f0=$_;s/\.tpl$//;print "mv $f0 $_\n"'
  ls *.tpl | perl -pe 's/^(.+)(\..*)/mv $1$2 $1/'

=head2 Get the first & fourth numbers from certain lines of a file

If you look at the second line, it starts with BOMD, and then numbers. I want to pick the first (-264.05765232) and the fourth number (0.00000000000) and write them to a new file.
Then I want to repeat this for every data entry in the file (as you can see, one entry takes 13 lines).

  grep '^ BOMD' deMon.mol | awk '{print $3, $6}' > deMon.mol.filtered
  grep '^ BOMD' deMon.mol | perl -alne 'print "$F[2] $F[5]"' > deMon.mol.filtered
  perl -alne 'print "$F[2] $F[5]" if $F[0] eq "BOMD"' deMon.mol > deMon.mol.filtered

=head2 Get the number of days between two dates

  perl -MDate::Calc=Delta_Days -le 'print Delta_Days(2005,9,16, 2006,2,28)'

=head2 Display a web page's source

  perl -MLWP::Simple -e "print get(shift)"
  wget -O -
  lynx --source

=head2 Get lines N -> M of a file

These examples show how to display lines 5-8, inclusive, from the /etc/passwd file:

  head -8 /etc/passwd | tail -4
  tail +5 /etc/passwd | head -4
  perl -ne 'print if 5<=$. && $.<=8' /etc/passwd   # man perlvar for explanation of $.
  cat -n /etc/passwd | perl -ne 'print if s/^\s*[5678]\s+//'

Now, to get the lines from /etc/passwd starting at a line with "news" in it, and stopping at a line with "ftp" in it, these all work (all the same except ordering, which determines whether or not the start and/or end lines are included):

=over 1

=item [start, end]

  perl -ne '$ok||=/news/; print if $ok; $ok=0 if /ftp/' /etc/passwd

=item [start,end)

  perl -ne '$ok||=/news/; $ok=0 if /ftp/; print if $ok' /etc/passwd

=item (start,end]

  perl -ne 'print if $ok; $ok||=/news/; $ok=0 if /ftp/' /etc/passwd

=item (start,end)

  perl -ne '$ok=0 if /ftp/; print if $ok; $ok||=/news/' /etc/passwd

=back

The basic approach is to take advantage of -n (man perlrun) and flip a flag on/off at the boundaries. Note that the /news/ is a regex, and can take complex patterns (man perlre).

=head2 Lazy math

  perl -le 'print( 3+5 )'   # need the parens here

There is also the 'bc' command.

=head2 Make & use a program to sum numbers

  alias add="perl -lne '\$x+=\$_; END{print \$x}'"
  cut -f1 file1 | add

=head2 Perl One-liners

=over 2

=item Favourite One-liners?

L<>

A web server!
  perl -MIO::All -e 'io(":8080")->fork->accept->(sub { $_[0] < io(-x $1 ? "./$1 |" : $1) if /^GET \/(.*) / })'

dos2unix:

  perl -pi -e 's/\r//' filename

=item What one-liners do people actually use?

L<>

=item One Liners

L<>

=back

=head1 Reference Material

=over 2

=item man

Also note that the command I<apropos> searches man pages.

=item man perl

This is basically a table of contents for the many perl man pages. Ones of particular interest are these manpages: perl perlrun perlsyn perlfunc perlre perlretut

=item perldoc

Displays documentation for everything perl.

  perldoc -f sleep
  perldoc perlfunc
  perldoc -q how
  perldoc File::Find

=item CPAN

L<http://www.cpan.org> is one of Perl's great strengths -- it is a huge repository of modules (libraries) to do pretty much anything and everything with perl.

=item Perl Monks

L<http://www.perlmonks.org> is a great Perl community site. The knowledge base of the forums, tutorials, and FAQ's is very extensive, and the members are very open & willing to help with any level of question (complete beginner through guru). This talk is posted as node #553278.

=item ME!

I love to help with this stuff -- it's my vocation & hobby. I'm reachable at E<lt>dwestbrook@gmail.comE<gt> or as davidryan0 on AIM.

=back

=cut

Replies are listed 'Best First'.
Re: RFC: Text Processing for Chemists Tutorial
by kvale (Monsignor) on Jun 02, 2006 at 15:31 UTC
    This is a lot of information to absorb in just one hour. You are progressing from the simplest Unix commands to a web server. Chemistry students are bright, but I don't think anyone starting from complete ignorance of Unix, and perhaps programming, is going to be able to understand all of this in real time.

    I taught a Perl for Bioinformatics class a few years ago that was also an intro to the Linux CLI. We had 10 hour-long lectures along with 10 two-hour labs. At the end of the course, most students could write simple programs, but maybe 10% managed to clue into Perl's real power. Especially at the very beginning, programming is hard.

    So I'd recommend at the very least a handout of your talk (not in raw POD) so that they can play with your examples on their own. And for the first-time regexer, I'd recommend perlrequick rather than perlretut. It is written more simply and is less overwhelming.


      Yeah, it was definitely a lot... but the goal was exposure -- "there exists this set of commands to do your work for you" -- not actually being able to talk the talk and go script stuff.

      Luckily the whole audience (~dozen) had at least basic linux/programming experience, so it wasn't starting from "this is a prompt" or anything.
      They seemed to be following the two real-life chemistry data file examples, at least.

      I used (my own hacked up version since it doesn't have much in terms of config options) Pod::Pdf to create a nice PDF version to print up for them.

      Cool -- I actually didn't know about perlrequick -- I knew I'd learn something from the replies to this post! :)
Re: RFC: Text Processing for Chemists Tutorial
by planetscape (Chancellor) on Jun 03, 2006 at 13:41 UTC

    The online book, Data-Intensive Linguistics, by Chris Brew and Marc Moens, has a nice introduction to using UNIX tools for linguistic processing. You may wish to have a look at it and adapt some of what's there, or just link to it in a "Resources" or "Further Reading" section.

    As a member of pedagogues, I also want to thank you for thinking of contributing to our Tutorials section. Once you've hammered out a final version based on comments in this thread, you should, IMHO, feel free to post it as a Tutorial. (Maybe after running it through pod2html or something, though...) :-)


    Update: This link, on the page noted above, may also be useful...


Front-paged by tye