arrays of arrays

monkantar has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am a beginner in Perl and new to this website. For linguistic research, I need to convert textcorpora with data such as:

the/article book/noun
he/pronoun is/verb ill/adjective

into:

the book article noun
he is ill pronoun verb adjective

I thought writing a script for this function would be a piece of cake, but that wasn't the case...

I wrote a simple toy program with words and numbers to start.
The input is:
book 1 pencil 2 desk 3

The output is:
1 2 3 book pencil desk

input.txt consists of the following line of text:

book 1 pencil 2 desk 3

The Perl script:

open(MYFILE, ">output.txt") or die ("can not write to file\n");
open(INFILE, "input.txt") or die ("can not open file\n");
$w = "[a-z]";
while($line = <INFILE>) {
    while($line =~ /(\d+)/g) {
        push(@nums, $1);
    }
    while($line =~ /($w+)/g) {
        push(@words, $1);
    }
}
push(@nums, @words);
foreach $token(@nums) {
    print(MYFILE "$token ");
}

This program works only for one line of text. At the moment I am reading about arrays of arrays, as I want the program to work for each line in a text and push the selected items to the end of each line, but I don't know if this approach will work. Does anyone know, or has anyone written similar programs and can give me some advice for writing such a Perl script?

Thanks a lot!

Best regards

Monkantar


Replies are listed 'Best First'.
Re: arrays of arrays by dwm042 (Priest) on Sep 07, 2007 at 15:28 UTC
I've been a little confused about all this too, so I googled textcorpora and came up with this Wikipedia link: http://en.wikipedia.org/wiki/Text_corpus in which case you can see that the data:

    the/article book/noun
    he/pronoun is/verb ill/adjective

are words tagged with the parts of speech they represent. So what he's wanting to do is deconvolute the word/part-of-speech pairs back into sentences followed by the equivalent parts of speech in the same order.

    the book article noun
    he is ill pronoun verb adjective

So what he's wanting is a program that would see `(\w+)\/(\w+)` pairs, split them, push each into an array and once the parse is complete, 'emit' the data in sequential order, first the array of words and second the array of parts of speech. This word-space-number example is just a step on the way to get his textcorpora stuff working.

That's the explanation; I hope it helps. This is my code example:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my @words;
    my @parts_of_speech;
    while(my $sentence = <DATA>) {
        @words = ();
        @parts_of_speech = ();
        while ($sentence =~ /(\w+)\/(\w+)/g ) {
            push(@words, $1) if $1;
            push(@parts_of_speech, $2) if $2;
        }
        print $_, " " for @words;
        print $_, " " for @parts_of_speech;
        print "\n";
    }
    __DATA__
    the/article book/noun
    he/pronoun is/verb ill/adjective

The output is:

    C:\Code>perl linguistic.pl
    the book article noun
    he is ill pronoun verb adjective

Update: cleanup
Re^2: arrays of arrays by jwkrahn (Abbot) on Sep 07, 2007 at 17:20 UTC
    my @words;
    my @parts_of_speech;
    while(my $sentence = <DATA>) {
        @words = ();
        @parts_of_speech = ();
        while ($sentence =~ /(\w+)\/(\w+)/g ) {
            push(@words, $1) if $1;
            push(@parts_of_speech, $2) if $2;
        }
        print $_, " " for @words;
        print $_, " " for @parts_of_speech;
        print "\n";
    }

The arrays `@words` and `@parts_of_speech` don't need to be in file scope; you should declare them inside the loop.

The pattern `\w+` will always match at least one character, so the only way a capture can be false is if that one character is `'0'`. The tests for `$1` and `$2` are therefore superfluous.

Your print statements are overly complicated; they could be simplified to:

    print "@words @parts_of_speech\n";
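
Taken together, the three suggestions might look something like the sketch below. This is only one reading of the advice: the helper name `retag_line` is hypothetical (it does not appear anywhere in the thread), and the sample lines are copied from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes one line of word/tag
# pairs and returns the words followed by the tags, space-separated.
sub retag_line {
    my ($sentence) = @_;

    # Declared inside the sub, so both arrays start empty for every line.
    my ( @words, @parts_of_speech );

    while ( $sentence =~ /(\w+)\/(\w+)/g ) {
        # No "if $1" guard: \w+ always captures at least one character.
        push @words,           $1;
        push @parts_of_speech, $2;
    }

    # Interpolating an array joins its elements with single spaces.
    return "@words @parts_of_speech";
}

# Sample lines from the original question.
for my $line ( 'the/article book/noun', 'he/pronoun is/verb ill/adjective' ) {
    print retag_line($line), "\n";
}
```

Returning the string instead of printing it inside the loop is a design choice made here so the same helper can be reused when writing to a file.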
Re^2: arrays of arrays by monkantar (Initiate) on Sep 07, 2007 at 15:54 UTC
Dear grep, toolic, Gangabass and dwm042, thanks a lot for your advice!! I realize I still have a lot to learn, and this first visit to this website was very interesting. dwm042, your script does exactly what I want, thx!

monkantar
Re: arrays of arrays by toolic (Bishop) on Sep 07, 2007 at 13:52 UTC
Your question is a little ambiguous regarding the exact format of your input file and your desired output, but I refactored your code as shown below:

    > cat input.txt
    book 1 pencil 2 desk 3
    foo 5 bar 6 baz 7
    >
    > cat 637646.pl
    #!/usr/bin/env perl
    use warnings;
    use strict;

    my (@n, @w);
    open INFILE, '<', 'input.txt' or die "can not open file $!\n";
    while (<INFILE>) {
        my (@numbers) = /(\d+)/g;
        my (@words) = /([a-z]+)/g;
        push @n, @numbers;
        push @w, @words;
    }
    close INFILE;
    open MYFILE, '>', 'output.txt' or die "can not write to file $!\n";
    print MYFILE "$_ " for @n;
    print MYFILE "\n";
    print MYFILE "$_ " for @w;
    print MYFILE "\n";
    close MYFILE;
    >
    > 637646.pl
    >
    > cat output.txt
    1 2 3 5 6 7
    book pencil desk foo bar baz
    >

The input file has a couple of lines, each with a few word-number pairs. The output is formatted with all numbers on one line and all words on the next line. The code does not employ arrays of arrays because I do not think that is necessary. Hope this helps.
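
For comparison, if the numbers and words are meant to stay grouped by input line (which is one possible reading of the original question), an array of arrays is one way to keep each line's tokens together. The sketch below rests on that assumption and is not toolic's code; `reorder_lines` is a hypothetical name, and the two sample strings stand in for the contents of input.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one array reference per input line, each
# holding that line's numbers first and its words second.
sub reorder_lines {
    my @lines = @_;
    my @rows;    # array of arrays: one inner array per line

    for my $line (@lines) {
        my @numbers = $line =~ /(\d+)/g;
        my @words   = $line =~ /([a-z]+)/g;

        # [ ... ] builds an anonymous array and yields a reference to it.
        push @rows, [ @numbers, @words ];
    }
    return @rows;
}

my @rows = reorder_lines( 'book 1 pencil 2 desk 3', 'foo 5 bar 6 baz 7' );

# Each $row is a reference to an inner array; @$row dereferences it.
for my $row (@rows) {
    print "@$row\n";
}
```

With the sample data this prints the numbers and then the words of each line on a line of its own, rather than pooling all numbers across the file.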
Re: arrays of arrays by grep (Monsignor) on Sep 07, 2007 at 13:48 UTC
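You want to split the string - so use split.

    ##UNTESTED
    use strict;
    use warnings;

    my $string = 'the/article book/noun he/pronoun is/verb ill/adjective';
    # Split on whitespace first, then split each token on '/'.
    # map keeps the per-token split in list context; nesting one split
    # directly inside another would pass it a field count instead.
    my @array = map { split( '/', $_ ) } split( /\s+/, $string );

grep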
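
Building on the split idea, each line can be broken into [word, tag] pairs and then emitted as words followed by tags. This is a minimal sketch, assuming tokens are separated by whitespace and each contains exactly one slash; `split_tagged` is a hypothetical name that does not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: splits one line into a list of [word, tag] pairs.
sub split_tagged {
    my ($line) = @_;

    # The outer split (pattern ' ') breaks the line into tokens on runs
    # of whitespace; the inner split turns each token into a two-element
    # anonymous array.
    return map { [ split( m{/}, $_ ) ] } split( ' ', $line );
}

my @pairs = split_tagged('he/pronoun is/verb ill/adjective');

# Pull the first and second element out of every pair.
my @words = map { $_->[0] } @pairs;
my @tags  = map { $_->[1] } @pairs;

print "@words @tags\n";
```

Keeping the pairs together until the last step means the word and its tag cannot drift out of sync, which may matter once the data gets messier than the sample.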
Re: arrays of arrays by throop (Chaplain) on Sep 07, 2007 at 17:00 UTC
You've gotten several good answers on the problem you think you have. But you have a larger problem. If this Perl code is really meant for linguistic research (as opposed to being a toy program for a computational-linguistics 101), don't roll your own parser / tokenizer. Already your approach embodies design decisions that will bite you.

It assigns each token to a single linguistic category, ignoring polysemy. What will it do with 'back', which can be a noun, verb, adverb, adjective or preposition?

Sometimes you'll want to tokenize across whitespace. E.g., 'break up' is better classed as a verb than as a verb+preposition token.

I encourage you to check out IBM's (free, open-standard) UIMA.

throop
Re: arrays of arrays by Gangabass (Vicar) on Sep 07, 2007 at 13:58 UTC
I don't fully understand what you need, but you can try to push data like so:

    # $. is the current line number in the file
    push @{ $data{$.}{nums} }, $1;

and

    push @{ $data{$.}{words} }, $1;

After that you will have a hash keyed by line number, holding the numbers and words for each line.
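
A fuller sketch of this idea follows, under the assumption that the numbers and words are collected with the same regular expressions as in the original script. The in-memory filehandle and the helper name `read_by_line` are illustrative only; a real script would open input.txt instead.

```perl
use strict;
use warnings;

# Hypothetical helper: reads from a filehandle and returns a hash keyed by
# line number, each value holding that line's numbers and words.
sub read_by_line {
    my ($fh) = @_;
    my %data;

    while ( my $line = <$fh> ) {
        # $. is the current line number of the last-read filehandle.
        push @{ $data{$.}{nums} },  $1 while $line =~ /(\d+)/g;
        push @{ $data{$.}{words} }, $1 while $line =~ /([a-z]+)/g;
    }
    return %data;
}

# Sample text standing in for input.txt (an assumption for this sketch).
my $text = "book 1 pencil 2 desk 3\nfoo 5 bar 6 baz 7\n";
open my $in, '<', \$text or die "cannot open in-memory file: $!";
my %data = read_by_line($in);
close $in;

# Hash keys are unordered, so sort the line numbers numerically.
for my $n ( sort { $a <=> $b } keys %data ) {
    # "|| []" guards against lines that had numbers but no words, or
    # the other way round.
    my @nums  = @{ $data{$n}{nums}  || [] };
    my @words = @{ $data{$n}{words} || [] };
    print "@nums @words\n";
}
```

Compared with an array of arrays, the hash keeps the original line number attached to each record, which can help when blank lines are skipped.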

