Removing Duplicates from a multiline entry

by diamondsandperls (Beadle)
on Feb 27, 2013 at 13:39 UTC

diamondsandperls has asked for the wisdom of the Perl Monks concerning the following question:

I have the data below, where each entry spans multiple lines, and I need to remove duplicate product entries. The product name ("Product 1", etc.) varies, but it always appears on the line just before the dashed rule, so perhaps a look-behind on the dashed line could grab the product name, and then some kind of address comparison could be done, since the address is where each entry ends.

Product 1
------------------------------------------------------------------
storeId = 1001
phoneNumber = (111) 111-1111
availbilityCode = 1
stockStatus = Limited stock
distance = 9.12
city = some city
fullStreet = some address

Product 2
------------------------------------------------------------------
storeId = 2117
phoneNumber = (111) 111-1111
availbilityCode = 2
stockStatus = In stock
distance = 7.49
city = some city
fullStreet = some address

Product 3
------------------------------------------------------------------
storeId = 2123
phoneNumber = (111) 111-1111
availbilityCode = 1
stockStatus = Limited stock
distance = 8.83
city = some city
fullStreet = some address

Product 1
------------------------------------------------------------------
storeId = 1001
phoneNumber = (111) 111-1111
availbilityCode = 1
stockStatus = Limited stock
distance = 8.56
city = some city
fullStreet = some address

Replies are listed 'Best First'.
Re: Removing Duplicates from a multiline entry
by blue_cowdawg (Monsignor) on Feb 27, 2013 at 13:51 UTC

    So... what have you tried? The algorithm is fairly simple:

    pseudo code:

    master_hash <- empty hash
    while more lines
        read line
        line contains /^Product\s+(\d+)/?
            temp_hash <- empty_hash
            index_key <- $1   # see capture above
            while in product record do
                read line
                throw away if contains /^[\-]+$/
                if contains /=/
                    split line on '='
                    key   <- field 0
                    value <- field 1
                    temp_hash[key] <- value
                end if
            done if line blank
        next if master_hash has index_key
        master_hash[index_key] <- temp_hash
    done if last line
    Now, I've given you enough to chew on and come up with your own code without writing it for you. I await your try/fail attempts. :-)
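    (For later readers, a rough Perl sketch of how the pseudocode above might translate into code; this is not blue_cowdawg's code, and it simply keeps the first record seen for each product number.)

    use strict;
    use warnings;

    my %master;            # product number => hash of field/value pairs
    my %temp;              # fields of the record currently being read
    my $key;               # product number of the current record

    while (my $line = <>) {
        chomp $line;
        if ($line =~ /^Product\s+(\d+)/) {      # start of a new product record
            $key  = $1;
            %temp = ();
        }
        elsif ($line =~ /^-+$/) {               # dashed rule: throw away
            next;
        }
        elsif ($line =~ /=/) {                  # "key = value" line
            my ($k, $v) = split /\s*=\s*/, $line, 2;
            $temp{$k} = $v;
        }
        elsif ($line !~ /\S/ && defined $key) { # blank line ends the record
            $master{$key} = { %temp } unless exists $master{$key};
            $key = undef;
        }
    }
    $master{$key} = { %temp }                   # last record may not end in a blank line
        if defined $key && !exists $master{$key};

    printf "Kept %d unique product record(s)\n", scalar keys %master;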


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: Removing Duplicates from a multiline entry
by trizen (Hermit) on Feb 27, 2013 at 14:44 UTC
    I would suggest setting the input record separator ($/) to paragraph mode (an empty string) and getting the product id from the beginning of every paragraph.

    See the code below:
    use strict; use warnings; local $/ = ""; my %product; while (<DATA>) { if (/^Product\h+(\d)/) { my $id = $1; my ($address) = /^fullStreet\h*=\h*(.+)/m; if (exists $product{$id}) { print "ID <$id> already exists. Address is <$product{$id}{ +address}>.\n"; # do some other stuff } else { print; } $product{$id} = {address => $address}; } else { warn "Invalid paragraph: <$_>\n"; } } __END__ Product 1 ------------------------------------------------------------------ storeId = 1001 phoneNumber = (111) 111-1111 availbilityCode = 1 stockStatus = Limited stock distance = 9.12 city = some city fullStreet = some address Product 2 ------------------------------------------------------------------ storeId = 2117 phoneNumber = (111) 111-1111 availbilityCode = 2 stockStatus = In stock distance = 7.49 city = some city fullStreet = some address Product 3 ------------------------------------------------------------------ storeId = 2123 phoneNumber = (111) 111-1111 availbilityCode = 1 stockStatus = Limited stock distance = 8.83 city = some city fullStreet = some address Product 1 ------------------------------------------------------------------ storeId = 1001 phoneNumber = (111) 111-1111 availbilityCode = 1 stockStatus = Limited stock distance = 8.56 city = some city fullStreet = some address
Re: Removing Duplicates from a multiline entry
by Kenosis (Priest) on Feb 27, 2013 at 19:21 UTC

    ...I need to remove duplicate product entries...

    Perhaps your mentioning the address comparison was only a solution proposal. If I'm understanding you correctly--that you only want to "remove duplicate product entries"--then consider the following:

    use strict;
    use warnings;

    local $/ = '';

    my ( %products, %records );

    while (<>) {
        if (/(Product.+)/) {
            $products{$1}++;
            $records{$1} = $_;
        }
    }

    print $records{$_} for grep $products{$_} == 1, keys %records;

    Usage: perl script.pl dataFile [>outFile]

    Output on your data set:

    Product 3
    ------------------------------------------------------------------
    storeId = 2123
    phoneNumber = (111) 111-1111
    availbilityCode = 1
    stockStatus = Limited stock
    distance = 8.83
    city = some city
    fullStreet = some address

    Product 2
    ------------------------------------------------------------------
    storeId = 2117
    phoneNumber = (111) 111-1111
    availbilityCode = 2
    stockStatus = In stock
    distance = 7.49
    city = some city
    fullStreet = some address

    The script builds two hashes: one to track the number of times a product number occurs (%products) and one for the records (%records) keyed on the product number. A record is printed only if the product number was seen only once.

    Hope this helps!

Re: Removing Duplicates from a multiline entry
by 7stud (Deacon) on Feb 27, 2013 at 18:55 UTC

    I would suggest setting the input record separator ($/) to paragraph mode (an empty string) and getting the product id from the beginning of every paragraph.

    Some explanation (if needed). A text file is really just one long string of characters, e.g.:

    line 1\nline 2\nline 3\n

    By default, Perl reads a file line by line, where the definition of a line is all the characters up to and including a newline (\n). However, a paragraph is denoted by two newlines (\n\n):

    line1\nline2\n\nline1\nline2\n
    

    The double newline is what creates the blank line. Try it: type some text and at the end of the line hit RETURN, then hit RETURN again--you'll get a paragraph. Each time you hit RETURN when you are typing some text, a newline is entered in your text.

    Conveniently, Perl allows you to change the definition of what a line is. You can tell Perl that you want a line to consist of all the characters up to and including two consecutive newlines. That is known as paragraph mode, and you set paragraph mode by setting $/ to an empty string (yes, it would make more sense to set it to "\n\n", but that's Perl).
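    For example, a minimal sketch of paragraph mode (the file name here is only a placeholder):

    use strict;
    use warnings;

    local $/ = "";    # paragraph mode: a "line" now runs up to the next blank line

    open my $fh, '<', 'products.txt' or die "Can't open products.txt: $!";
    while (my $paragraph = <$fh>) {
        print "--- one record ---\n";
        print $paragraph;
    }
    close $fh;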

    The neat thing about being able to set the definition of a line is that you can also read chunks of files that look like this:

    aaaaa
    bbbb
    ccccc
    ..
    ddddd
    eeeee
    fffffff
    ggggg
    ..
    

    For instance:

    use strict;
    use warnings;
    use 5.012;

    $/ = "..\n";

    while (my $line = <DATA>) {
        say '-' x 20;
        print $line;
        say '=' x 20;
    }

    __DATA__
    aaaaa
    bbbb
    ccccc
    ..
    ddddd
    eeeee
    fffffff
    ggggg
    ..

    --output:--
    --------------------
    aaaaa
    bbbb
    ccccc
    ..
    ====================
    --------------------
    ddddd
    eeeee
    fffffff
    ggggg
    ..
    ====================

    The other common mode besides paragraph mode is slurp mode. If you set $/ to undef, then Perl will read the whole file into a single string.
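    And a minimal sketch of slurp mode (again, the file name is only a placeholder):

    use strict;
    use warnings;

    my $contents = do {
        local $/;     # undef: slurp mode, read the whole file at once
        open my $fh, '<', 'products.txt' or die "Can't open products.txt: $!";
        <$fh>;
    };
    printf "Read %d characters in one go\n", length $contents;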

Re: Removing Duplicates from a multiline entry
by sundialsvc4 (Abbot) on Feb 27, 2013 at 21:05 UTC

    Problems such as this one are naturally solved by tools such as awk, which is one of the inspirations of Perl. Therefore, the same general solution strategy may apply. Looking at this text file, we see that we can describe it as consisting of four general types of lines:

    1. Product n
    2. A line of one-or-more dashes.
    3. keyword = value
    4. Entirely blank line (or end-of-file).

    A general solution to this problem might be described as, “first, read lines, accumulating information from each of them, until you reach a line that signals you that it’s time to disgorge some output.” When you encounter a line #1, for example, you might capture the product number and forget any cached information. Line #2 is not interesting. Line #3 provides a keyword and a value to be added to the cache. Line #4 (or end-of-file) is your signal to generate a new output record.
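    A rough Perl sketch of that strategy (not sundialsvc4's code; the output format is made up):

    use strict;
    use warnings;

    my %record;

    sub emit {            # print whatever has been cached for the current record
        return unless %record;
        print join( ', ', map { "$_=$record{$_}" } sort keys %record ), "\n";
    }

    while ( my $line = <> ) {
        chomp $line;
        if ( $line =~ /^Product\s+(\d+)/ ) {          # type 1: new product, forget the cache
            %record = ( product => $1 );
        }
        elsif ( $line =~ /^-+\s*$/ ) {                # type 2: dashed rule, not interesting
            next;
        }
        elsif ( $line =~ /^(\w+)\s*=\s*(.*\S)/ ) {    # type 3: keyword = value, add to cache
            $record{$1} = $2;
        }
        elsif ( $line !~ /\S/ ) {                     # type 4: blank line, emit the record
            emit();
            %record = ();
        }
    }
    emit();                                           # end-of-file also ends a record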

    I would think offhand that you probably first want to deal with the task of parsing the file successfully, then, perhaps after stuffing the data into some kind of database, go back and deal with the duplicates (whatever you decide a “duplicate” ought to be). I make this two-part suggestion partly because, in my experience, “it might not be so easy.” You might have to make some decision, even a human decision or a case-by-case one, about what record to discard and what record to keep. Therefore, the “parsing” problem and the subsequent “de-duping and output” problem might need to be separated from one another.

Re: Removing Duplicates from a multiline entry
by karlgoethebier (Abbot) on Feb 28, 2013 at 09:07 UTC

    Assuming that the first product entry found is valid (is it?), I would do it like this:

    Update: I don't know how often I posted this wrong/useless M$dog shebop/shebang... #!/c:/perl/bin/perl.exe

    #!c:/perl/bin/perl.exe

    use strict;
    use warnings;

    $/ = "";

    my $file = shift;
    open my $fh, "<", $file or die $!;
    my @records = <$fh>;
    close $fh;

    my %records;

    for (@records) {
        $_ =~ m/(Product \d+)(.+)/s;
        my $product_id = $1;
        next if exists $records{$product_id};
        $records{"$product_id"} = $2;
    }

    for (sort keys %records) {
        print qq($_ $records{$_});
    }

    __END__
    Product 1
    ------------------------------------------------------------------
    storeId = 1001
    phoneNumber = (111) 111-1111
    availbilityCode = 1
    stockStatus = Limited stock
    distance = 9.12
    city = some city
    fullStreet = some address

    Product 2
    ------------------------------------------------------------------
    storeId = 2117
    phoneNumber = (111) 111-1111
    availbilityCode = 2
    stockStatus = In stock
    distance = 7.49
    city = some city
    fullStreet = some address

    Product 3
    ------------------------------------------------------------------
    storeId = 2123
    phoneNumber = (111) 111-1111
    availbilityCode = 1
    stockStatus = Limited stock
    distance = 8.83
    city = some city
    fullStreet = some address

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

      This is partly my fault. I have tried all the examples and none are working as I need. I do need to keep the first entry, but every entry after that is a duplicate and should not be printed. So the last example has what I am looking for: there could be ten or more duplicate entries, but I just need one of each to print. If there is only one entry, then great, print that entry and go on to the next one; in other words, detect whether an entry has already been processed, I suppose. Thanks to everyone who has helped thus far.
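      For reference, a minimal sketch of exactly that behaviour: print the first entry seen for each product and skip every later duplicate, reading one record per paragraph from a file named on the command line.

      use strict;
      use warnings;

      local $/ = "";                         # paragraph mode: one record per read
      my %seen;

      while (my $record = <>) {
          my ($id) = $record =~ /^Product\s+(\d+)/ or next;
          print $record unless $seen{$id}++; # only the first occurrence of each product
      }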

        In other words: it's working now?

        Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»
