http://www.perlmonks.org?node_id=1020964


in reply to Removing Duplicates from a multiline entry

Problems such as this one are naturally solved by tools such as awk, which was one of the inspirations for Perl. The same general solution strategy therefore applies here. Looking at this text file, we can describe it as consisting of four general types of lines:

  1. Product n
  2. A line of one or more dashes.
  3. keyword = value
  4. An entirely blank line (or end-of-file).

A general solution to this problem might be described as: “read lines, accumulating information from each of them, until you reach a line that signals that it’s time to disgorge some output.” When you encounter a type 1 line, for example, you might capture the product number and forget any cached information. A type 2 line is not interesting. A type 3 line provides a keyword and a value to be added to the cache. A type 4 line (or end-of-file) is your signal to generate a new output record.
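The accumulate-then-flush loop described above could be sketched roughly as follows. This is only a minimal sketch, not a tested solution for your actual file: the sub name parse_products, the record layout, and the regexes for each line type are my assumptions about the input.

```perl
use strict;
use warnings;

# Parse the four line types into a list of records.
# Each record is a hash ref: { product => $n, fields => { keyword => value } }.
# ASSUMPTION: the regexes below guess at the real line formats.
sub parse_products {
    my ($fh) = @_;
    my (@records, $current);

    # Emit the cached record, if any, then forget it.
    my $flush = sub {
        push @records, $current if $current;
        undef $current;
    };

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\s*Product\s+(\S+)\s*$/) {            # type 1: start of a record
            $flush->();   # keep a pending record if no blank line came before
            $current = { product => $1, fields => {} };
        }
        elsif ($line =~ /^\s*-+\s*$/) {                       # type 2: dashes, ignored
            next;
        }
        elsif ($line =~ /^\s*([^=]+?)\s*=\s*(.*?)\s*$/) {     # type 3: keyword = value
            $current->{fields}{$1} = $2 if $current;
        }
        elsif ($line =~ /^\s*$/) {                            # type 4: blank line ends the record
            $flush->();
        }
    }
    $flush->();                                               # end-of-file also ends the record
    return \@records;
}
```

Note that if the same keyword appears twice within one product, this sketch silently keeps the last value; whether that is right depends on what the duplicates in your data actually look like.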

Offhand, I would first deal with the task of parsing the file successfully and then, perhaps after stuffing the data into some kind of database, go back and deal with the duplicates. (Whatever you decide a “duplicate” ought to be.) I make this two-part suggestion partly because, in my experience, “it might not be so easy.” You might have to make some decision, even a human decision or a case-by-case one, about which record to discard and which to keep. Therefore, the “parsing” problem and the subsequent “de-duping and output” problem might need to be separated from one another.
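As a rough illustration of keeping the de-duping step separate, here is a sketch that only flags candidates rather than deleting anything. It assumes, purely for the sake of example, that each parsed record is a hash ref with a fields hash, and that a “duplicate” means an identical set of keyword/value pairs; your real definition may well differ. The names record_key and split_duplicates are hypothetical.

```perl
use strict;
use warnings;

# Build a canonical string from a record's keyword/value pairs.
# ASSUMPTION: "duplicate" means identical pairs, ignoring the product number.
sub record_key {
    my ($rec) = @_;
    my $f = $rec->{fields};
    # Sort the keys so that field order in the file does not matter.
    return join "\x1F", map { "$_\x1E$f->{$_}" } sort keys %$f;
}

# Split records into those kept (first occurrence) and those flagged for review.
sub split_duplicates {
    my ($records) = @_;
    my (%seen, @keep, @flagged);
    for my $rec (@$records) {
        if ($seen{ record_key($rec) }++) {
            push @flagged, $rec;    # seen before: leave the decision to a human
        }
        else {
            push @keep, $rec;
        }
    }
    return (\@keep, \@flagged);
}
```

Returning the flagged records instead of discarding them leaves room for the case-by-case decision mentioned above.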