Removing Duplicates from a multiline entry

by diamondsandperls (Beadle)
on Feb 27, 2013 at 13:39 UTC

diamondsandperls has asked for the wisdom of the Perl Monks concerning the following question:

I have the data below, where each entry spans multiple lines, and I need to remove duplicate product entries. The product name ("Product 1", etc.) varies, but it always appears on the line just before the dashed rule, so perhaps a look-behind on the dashed line could grab the product name, and then some kind of address comparison could be done, since the address is where each entry ends.

Product 1
------------------------------------------------------------------
storeId = 1001
phoneNumber = (111) 111-1111
availbilityCode = 1
stockStatus = Limited stock
distance = 9.12
city = some city
fullStreet = some address

Product 2
------------------------------------------------------------------
storeId = 2117
phoneNumber = (111) 111-1111
availbilityCode = 2
stockStatus = In stock
distance = 7.49
city = some city
fullStreet = some address

Product 3
------------------------------------------------------------------
storeId = 2123
phoneNumber = (111) 111-1111
availbilityCode = 1
stockStatus = Limited stock
distance = 8.83
city = some city
fullStreet = some address

Product 1
------------------------------------------------------------------
storeId = 1001
phoneNumber = (111) 111-1111
availbilityCode = 1
stockStatus = Limited stock
distance = 8.56
city = some city
fullStreet = some address

Replies are listed 'Best First'.
Re: Removing Duplicates from a multiline entry
by blue_cowdawg (Monsignor) on Feb 27, 2013 at 13:51 UTC

    So... what have you tried? The algorithm is fairly simple:

    pseudo code:

    master_hash <- empty hash
    while more lines
        read line
        line contains /^Product\s+(\d+)/?
            temp_hash <- empty_hash
            index_key <- $1   # see capture above
            while in product record do
                read line
                throw away if contains /^[\-]+$/
                if contains /=/
                    split line on '='
                    key   <- field 0
                    value <- field 1
                    temp_hash[key] <- value
                end if
            done if line blank
        next if master_hash has index_key
        master_hash[index_key] <- temp_hash
    done if last line
    Now, I've given you enough to chew on and come up with your own code without writing it for you. I await your try/fail attempts. :-)
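    (For later readers, a rough Perl sketch of how the pseudocode above might translate into code; this is not blue_cowdawg's code, and it simply keeps the first record seen for each product number.)

    use strict;
    use warnings;

    my %master;            # product number => hash of field/value pairs
    my %temp;              # fields of the record currently being read
    my $key;               # product number of the current record

    while (my $line = <>) {
        chomp $line;
        if ($line =~ /^Product\s+(\d+)/) {      # start of a new product record
            $key  = $1;
            %temp = ();
        }
        elsif ($line =~ /^-+$/) {               # dashed rule: throw away
            next;
        }
        elsif ($line =~ /=/) {                  # "key = value" line
            my ($k, $v) = split /\s*=\s*/, $line, 2;
            $temp{$k} = $v;
        }
        elsif ($line !~ /\S/ && defined $key) { # blank line ends the record
            $master{$key} = { %temp } unless exists $master{$key};
            $key = undef;
        }
    }
    $master{$key} = { %temp }                   # last record may not end in a blank line
        if defined $key && !exists $master{$key};

    printf "Kept %d unique product record(s)\n", scalar keys %master;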


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: Removing Duplicates from a multiline entry
by trizen (Hermit) on Feb 27, 2013 at 14:44 UTC
    I would suggest setting the input record separator ($/) to paragraph mode (an empty string) and getting the product id from the beginning of every paragraph.

    See the code below:
    use strict; use warnings; local $/ = ""; my %product; while (<DATA>) { if (/^Product\h+(\d)/) { my $id = $1; my ($address) = /^fullStreet\h*=\h*(.+)/m; if (exists $product{$id}) { print "ID <$id> already exists. Address is <$product{$id}{ +address}>.\n"; # do some other stuff } else { print; } $product{$id} = {address => $address}; } else { warn "Invalid paragraph: <$_>\n"; } } __END__ Product 1 ------------------------------------------------------------------ storeId = 1001 phoneNumber = (111) 111-1111 availbilityCode = 1 stockStatus = Limited stock distance = 9.12 city = some city fullStreet = some address Product 2 ------------------------------------------------------------------ storeId = 2117 phoneNumber = (111) 111-1111 availbilityCode = 2 stockStatus = In stock distance = 7.49 city = some city fullStreet = some address Product 3 ------------------------------------------------------------------ storeId = 2123 phoneNumber = (111) 111-1111 availbilityCode = 1 stockStatus = Limited stock distance = 8.83 city = some city fullStreet = some address Product 1 ------------------------------------------------------------------ storeId = 1001 phoneNumber = (111) 111-1111 availbilityCode = 1 stockStatus = Limited stock distance = 8.56 city = some city fullStreet = some address
Re: Removing Duplicates from a multiline entry
by Kenosis (Priest) on Feb 27, 2013 at 19:21 UTC

    ...I need to remove duplicate product entries...

    Perhaps your mentioning the address comparison was only a solution proposal. If I'm understanding you correctly--that you only want to "remove duplicate product entries"--then consider the following:

    use strict;
    use warnings;

    local $/ = '';

    my ( %products, %records );

    while (<>) {
        if (/(Product.+)/) {
            $products{$1}++;
            $records{$1} = $_;
        }
    }

    print $records{$_} for grep $products{$_} == 1, keys %records;

    Usage: perl script.pl dataFile [>outFile]

    Output on your data set:

    Product 3
    ------------------------------------------------------------------
    storeId = 2123
    phoneNumber = (111) 111-1111
    availbilityCode = 1
    stockStatus = Limited stock
    distance = 8.83
    city = some city
    fullStreet = some address

    Product 2
    ------------------------------------------------------------------
    storeId = 2117
    phoneNumber = (111) 111-1111
    availbilityCode = 2
    stockStatus = In stock
    distance = 7.49
    city = some city
    fullStreet = some address

    The script builds two hashes: one to track the number of times a product number occurs (%products) and one for the records (%records) keyed on the product number. A record is printed only if the product number was seen only once.

    Hope this helps!

Re: Removing Duplicates from a multiline entry
by 7stud (Deacon) on Feb 27, 2013 at 18:55 UTC

    I would suggest setting the input record separator ($/) to paragraph mode (an empty string) and getting the product id from the beginning of every paragraph.

    Some explanation (if needed). A text file is really just one long string of characters, e.g.:

    line 1\nline 2\nline 3\n

    By default, Perl reads a file line by line, where the definition of a line is all the characters up to and including a newline (\n). However, a paragraph is denoted by two newlines (\n\n):

    line1\nline2\n\nline1\nline2\n
    

    The double newline is what creates the blank line. Try it: type some text and at the end of the line hit RETURN, then hit RETURN again--you'll get a paragraph. Each time you hit RETURN when you are typing some text, a newline is entered in your text.

    Conveniently, Perl allows you to change the definition of what a line is. You can tell Perl that you want a line to consist of all the characters up to and including two consecutive newlines. That is known as paragraph mode, and you set paragraph mode by setting $/ to an empty string (yes, it would make more sense to set it to "\n\n", but that's Perl).
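    For example, a minimal sketch of paragraph mode (the file name here is only a placeholder):

    use strict;
    use warnings;

    local $/ = "";    # paragraph mode: a "line" now runs up to the next blank line

    open my $fh, '<', 'products.txt' or die "Can't open products.txt: $!";
    while (my $paragraph = <$fh>) {
        print "--- one record ---\n";
        print $paragraph;
    }
    close $fh;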

    The neat thing about being able to set the definition of a line is that you can also read chunks of files that look like this:

    aaaaa
    bbbb
    ccccc
    ..
    ddddd
    eeeee
    fffffff
    ggggg
    ..
    

    For instance:

    use strict;
    use warnings;
    use 5.012;

    $/ = "..\n";

    while (my $line = <DATA>) {
        say '-' x 20;
        print $line;
        say '=' x 20;
    }

    __DATA__
    aaaaa
    bbbb
    ccccc
    ..
    ddddd
    eeeee
    fffffff
    ggggg
    ..

    --output:--
    --------------------
    aaaaa
    bbbb
    ccccc
    ..
    ====================
    --------------------
    ddddd
    eeeee
    fffffff
    ggggg
    ..
    ====================

    The other common mode besides paragraph mode is slurp mode. If you set $/ to undef, then Perl will read the whole file into a single string.
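    And a minimal sketch of slurp mode (again, the file name is only a placeholder):

    use strict;
    use warnings;

    my $contents = do {
        local $/;     # undef: slurp mode, read the whole file at once
        open my $fh, '<', 'products.txt' or die "Can't open products.txt: $!";
        <$fh>;
    };
    printf "Read %d characters in one go\n", length $contents;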

Re: Removing Duplicates from a multiline entry
by sundialsvc4 (Abbot) on Feb 27, 2013 at 21:05 UTC

    Problems such as this one are naturally solved by tools such as awk, which is one of the inspirations of Perl. Therefore, the same general solution strategy may apply. Looking at this text file, we see that we can describe it as consisting of four general types of lines:

    1. Product n
    2. A line of one-or-more dashes.
    3. keyword = value
    4. Entirely blank line (or end-of-file).

    A general solution to this problem might be described as, “first, read lines, accumulating information from each of them, until you reach a line that signals you that it’s time to disgorge some output.” When you encounter a line #1, for example, you might capture the product number and forget any cached information. Line #2 is not interesting. Line #3 provides a keyword and a value to be added to the cache. Line #4 (or end-of-file) is your signal to generate a new output record.
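    A rough Perl sketch of that strategy (not sundialsvc4's code; the output format is made up):

    use strict;
    use warnings;

    my %record;

    sub emit {            # print whatever has been cached for the current record
        return unless %record;
        print join( ', ', map { "$_=$record{$_}" } sort keys %record ), "\n";
    }

    while ( my $line = <> ) {
        chomp $line;
        if ( $line =~ /^Product\s+(\d+)/ ) {          # type 1: new product, forget the cache
            %record = ( product => $1 );
        }
        elsif ( $line =~ /^-+\s*$/ ) {                # type 2: dashed rule, not interesting
            next;
        }
        elsif ( $line =~ /^(\w+)\s*=\s*(.*\S)/ ) {    # type 3: keyword = value, add to cache
            $record{$1} = $2;
        }
        elsif ( $line !~ /\S/ ) {                     # type 4: blank line, emit the record
            emit();
            %record = ();
        }
    }
    emit();                                           # end-of-file also ends a record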

    I would think offhand that you probably first want to deal with the task of parsing the file successfully, then, perhaps after stuffing the data into some kind of database, go back and deal with the duplicates (whatever you decide a “duplicate” ought to be). I make this two-part suggestion partly because, in my experience, “it might not be so easy.” You might have to make some decision, even a human decision or a case-by-case one, about what record to discard and what record to keep. Therefore, the “parsing” problem and the subsequent “de-duping and output” problem might need to be separated from one another.

Re: Removing Duplicates from a multiline entry
by karlgoethebier (Abbot) on Feb 28, 2013 at 09:07 UTC

    Assuming that the first product entry found is valid (is it?), I would do it like this:

    Update: I don't know how often I posted this wrong/useless M$dog shebop/shebang... #!/c:/perl/bin/perl.exe

    #!c:/perl/bin/perl.exe

    use strict;
    use warnings;

    $/ = "";

    my $file = shift;
    open my $fh, "<", $file or die $!;
    my @records = <$fh>;
    close $fh;

    my %records;

    for (@records) {
        $_ =~ m/(Product \d+)(.+)/s;
        my $product_id = $1;
        next if exists $records{$product_id};
        $records{"$product_id"} = $2;
    }

    for (sort keys %records) {
        print qq($_ $records{$_});
    }

    __END__
    Product 1
    ------------------------------------------------------------------
    storeId = 1001
    phoneNumber = (111) 111-1111
    availbilityCode = 1
    stockStatus = Limited stock
    distance = 9.12
    city = some city
    fullStreet = some address

    Product 2
    ------------------------------------------------------------------
    storeId = 2117
    phoneNumber = (111) 111-1111
    availbilityCode = 2
    stockStatus = In stock
    distance = 7.49
    city = some city
    fullStreet = some address

    Product 3
    ------------------------------------------------------------------
    storeId = 2123
    phoneNumber = (111) 111-1111
    availbilityCode = 1
    stockStatus = Limited stock
    distance = 8.83
    city = some city
    fullStreet = some address

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

      This is partly my fault. I have tried all the examples and none are working as I need. I do need to keep the first entry, but every entry after that is a duplicate and should not be printed. So the last example has what I am looking for: there could be ten or more duplicate entries, but I just need one of each to print. If there is only one entry, then great, print that entry and go on to the next one; in other words, detect whether an entry has already been processed, I suppose. Thanks to everyone who has helped thus far.
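      For reference, a minimal sketch of exactly that behaviour: print the first entry seen for each product and skip every later duplicate, reading one record per paragraph from a file named on the command line.

      use strict;
      use warnings;

      local $/ = "";                         # paragraph mode: one record per read
      my %seen;

      while (my $record = <>) {
          my ($id) = $record =~ /^Product\s+(\d+)/ or next;
          print $record unless $seen{$id}++; # only the first occurrence of each product
      }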

        In other words: it's working now?

        Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»
