http://www.perlmonks.org?node_id=1006220

aseee has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a file in which

ID AaaI
ET R2
AC RB00001;
OS Acetobacter aceti ss aceti
PT XmaIII
RS CGGCCG, 1;
CR .
RN [1]
RA Tagami H., Tayama K., Tohyama T., Fukaya M., Okumura H., Kawamura Y.,
RA Horinouchi S., Beppu T.;
RL FEMS Microbiol. Lett. 56:161-166(1988).
//

this pattern repeats itself hundreds of times. What I want is to store AaaI in the ID column, R2 in the ET column, RB00001 in the AC column, Acetobacter aceti ss aceti in the OS column, XmaIII in the PT column and CGGCCG in the RS column of a database table. I know it can be done with a regular expression, but I am unable to grok regular expressions. Please also give some links to basic and advanced regular expression tutorials.

Replies are listed 'Best First'.
Re: reading file
by tobyink (Canon) on Nov 29, 2012 at 10:37 UTC

    This is SwissProt format, right? There are a number of SwissProt tools on CPAN. Have you investigated any of them? If they are not sufficient for your needs, then you could try peeking at their source code to see how they handle parsing.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: reading file
by marto (Cardinal) on Nov 29, 2012 at 10:39 UTC
Re: reading file
by tobyink (Canon) on Nov 29, 2012 at 11:15 UTC

    Here's a fun parsing example though...

    use MooX::Struct
        Record => [qw( $id $et @ac $os $pt @rs @ra )],
        Person => [qw( $surname $initials )],
    ;
    use Data::Dumper;

    my %IS_PERSON = ( ra => 1, );
    my %IS_LIST = ( ac => 1, rs => 1, ra => 1, );

    my %record;
    my @records;
    while (<DATA>) {
        chomp;
        my ($field, $value) = /^(..)\s*(.+)$/;
        $field = lc $field;

        if ($field eq 'id' and keys %record) {
            push @records, Record->new(%record);
            %record = (); # start new record
        }

        if ($IS_LIST{$field}) {
            push @{$record{$field}},
                map { $IS_PERSON{$field} ? Person[split] : $_ }
                split m{,\s*}, $value;
        }
        else {
            $record{$field} = $IS_PERSON{$field} ? Person[split / /, $value] : $value;
        }
    }

    # EOF, push last record
    push @records, Record->new(%record);

    print $records[1]->ra->[0]->surname;

    __DATA__
    ID AaaI
    ET R2
    AC RB00001;
    OS Acetobacter aceti ss aceti
    PT XmaIII
    RS CGGCCG, 1;
    RA Tagami H., Tayama K., Tohyama T., Fukaya M., Okumura H., Kawamura Y.,
    RA Horinouchi S., Beppu T.;
    ID AaaII
    ET R2
    AC RB00001;
    OS Acetobacter aceti ss aceti
    PT XmaIII
    RS CGGCCG, 1;
    RA Horinouchi S., Beppu T.;
    ID AaaIII
    ET R2
    AC RB00001;
    OS Acetobacter aceti ss aceti
    PT XmaIII
    RA Horinouchi S., Beppu T.;
    RS CGGCCG, 1;

Re: reading file
by bart (Canon) on Nov 29, 2012 at 12:31 UTC
    I don't see why you need a regular expression for this. Unless your problem is more complex than what you described here... Here is basically what I'd do:
    my %row;
    while(<>) {
        chomp;
        my($key, $value) = split ' ', $_, 2 or next;
        $row{$key} = $value;
    }
    To test, store your data in a text file and use the file name as the argument for the test script.

    Now all data are in a hash. You can see what's in there:

    use Data::Dumper;
    print Dumper \%row;
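    Note that a single %row holds only one record, and a repeated key such as RA overwrites the earlier value. The sketch below is one possible way to extend the same split-based idea to a file with many records. It assumes that `//` on a line of its own ends a record (as in the sample in the question), that trailing semicolons are not wanted, and that only the sequence part of RS should be kept. The in-memory sample text and the names @rows and $text are made up for illustration.

```perl
use strict;
use warnings;

# Sample input in the layout from the question (two records).
my $text = <<'END';
ID AaaI
ET R2
AC RB00001;
OS Acetobacter aceti ss aceti
PT XmaIII
RS CGGCCG, 1;
RA Horinouchi S., Beppu T.;
//
ID AaaII
ET R2
AC RB00001;
OS Acetobacter aceti ss aceti
PT XmaIII
RS CGGCCG, 1;
//
END

my @rows;   # one hash reference per record
my %row;    # the record currently being read

open my $in, '<', \$text or die "Cannot open in-memory file: $!";
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ m{^//}) {                 # assumed end-of-record marker
        push @rows, {%row} if %row;
        %row = ();
        next;
    }
    my ($key, $value) = split ' ', $line, 2;
    next unless defined $value;            # skip blank or value-less lines
    $value =~ s/;\s*$//;                   # drop the trailing semicolon
    $value =~ s/,.*$// if $key eq 'RS';    # assumption: keep only the sequence
    # Repeated keys (e.g. RA) are joined instead of overwritten.
    $row{$key} = exists $row{$key} ? "$row{$key} $value" : $value;
}
push @rows, {%row} if %row;                # in case the last record has no //

printf "%s => %s\n", $_->{ID}, $_->{RS} for @rows;
```

    Each element of @rows is then a hash with the same shape as %row above, so it can be handed to the database code one record at a time.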
    To put it in an SQL database, I prefer to use DBIx::Simple with support of SQL::Abstract, for which the code could simply be:
    # $db is the DBIx::Simple database connection handle object
    $db->insert($table, \%row);
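    To make it clearer what an insert from a hash amounts to, here is a rough, simplified stand-in that builds a parameterised INSERT statement and its bind values by hand. It is not the code DBIx::Simple or SQL::Abstract actually runs. The helper name insert_sql and the table name enzymes are hypothetical, and the sketch assumes the hash keys are safe to use as column names without quoting.

```perl
use strict;
use warnings;

# Build "INSERT INTO t (cols) VALUES (?, ?, ...)" plus the matching values.
sub insert_sql {
    my ($table, $row) = @_;
    my @cols = sort keys %$row;    # fixed order so columns and values line up
    my $sql  = sprintf 'INSERT INTO %s (%s) VALUES (%s)',
        $table, join(', ', @cols), join(', ', ('?') x @cols);
    return ($sql, @{$row}{@cols});
}

my %row = (ID => 'AaaI', ET => 'R2', AC => 'RB00001');
my ($sql, @bind) = insert_sql('enzymes', \%row);   # 'enzymes' is a made-up table

print "$sql\n";
print join(', ', @bind), "\n";

# With a connected plain DBI handle $dbh, the statement could be run as:
#   $dbh->do($sql, undef, @bind);
```

    Using placeholders keeps the values out of the SQL text, so characters such as quotes or commas in a field do not need escaping.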

    p.s. The article that got me on my way in regular expressions, is Tom Christiansen's newsgroup post "Irregular Expressions" which has been republished on the net and even on CPAN under the name "FMTEYEWTK (= Far More Than Everything You Ever Wanted To Know) about regexes". You can find a copy here.

    It's ancient (duh) and contains some obsolete remarks, but it's still excellent.

Re: reading file
by Anonymous Monk on Nov 29, 2012 at 10:37 UTC

    I know it can be done with a regular expression, but I am unable to grok regular expressions. Please also give some links to basic and advanced regular expression tutorials.

    Tutorials, perlintro, perlrequick
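    As a small taste of what those tutorials cover, here is a sketch of how the fields from the question might be pulled out with a regular expression. It assumes every field sits on its own line, starts with a two-letter uppercase tag, and may end in a semicolon. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $record = <<'END';
ID AaaI
ET R2
AC RB00001;
OS Acetobacter aceti ss aceti
PT XmaIII
RS CGGCCG, 1;
END

# ^ and $ with /m match at the start and end of every line.
# ([A-Z]{2}) captures the two-letter tag.
# (.*?) captures the value lazily, so an optional trailing ';'
# and trailing blanks stay outside the capture.
my %field;
while ($record =~ /^([A-Z]{2})[ \t]+(.*?);?[ \t]*$/mg) {
    $field{$1} = $2;
}

# Keep only the leading run of letters from RS (the sequence itself).
my ($site) = $field{RS} =~ /^([A-Z]+)/;

print "$_: $field{$_}\n" for sort keys %field;
print "site: $site\n";
```

    The /g modifier makes the while loop step through one match per line, and $1 and $2 hold the two captured groups for each match.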