PerlMonks  

Re: Removing Duplicates from a multiline entry

by sundialsvc4 (Monsignor)
on Feb 27, 2013 at 21:05 UTC (#1020964)


in reply to Removing Duplicates from a multiline entry

Problems such as this one are naturally solved by tools such as awk, which is one of the inspirations for Perl. The same general solution strategy therefore applies here. Looking at this text file, we can describe it as consisting of four general types of lines:

  1. Product n
  2. A line of one-or-more dashes.
  3. keyword = value
  4. Entirely blank line (or end-of-file).

A general solution to this problem might be described as: “read lines, accumulating information from each of them, until you reach a line that signals that it’s time to disgorge some output.” When you encounter a type 1 line, for example, you might capture the product number and forget any cached information. A type 2 line is not interesting. A type 3 line provides a keyword and a value to be added to the cache. A type 4 line (or end-of-file) is your signal to generate a new output record.
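The accumulate-then-flush loop described above could be sketched roughly as follows. This is only an illustration, not a tested solution for the original poster's file: the exact line formats (the regular expressions), the sub name `parse_products`, and the record layout (`product` plus a `fields` hash) are all assumptions, since the real input is not shown here.

```perl
use strict;
use warnings;

# Hypothetical sketch: parse the four line types into a list of records.
# Each record is { product => $n, fields => { keyword => value, ... } }.
sub parse_products {
    my ($fh) = @_;
    my (@records, $current);

    # Emit the cached record, if there is one, and clear the cache.
    my $flush = sub {
        push @records, $current if $current;
        undef $current;
    };

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\s*Product\s+(\S+)\s*$/) {            # type 1
            # Start a fresh cache; flush first so that a record is not
            # lost if no blank line came before this header.
            $flush->();
            $current = { product => $1, fields => {} };
        }
        elsif ($line =~ /^\s*-+\s*$/) {                      # type 2
            next;                                            # not interesting
        }
        elsif ($line =~ /^\s*([^=]+?)\s*=\s*(.*?)\s*$/) {    # type 3
            $current->{fields}{$1} = $2 if $current;
        }
        elsif ($line =~ /^\s*$/) {                           # type 4
            $flush->();
        }
    }
    $flush->();                                              # end-of-file
    return \@records;
}

# Small demonstration on made-up data, read through an in-memory filehandle.
{
    my $sample = "Product 1\n-----\nname = Widget\ncolor = red\n\n"
               . "Product 2\n-----\nname = Gadget\n";
    open my $in, '<', \$sample or die "cannot open in-memory file: $!";
    for my $rec (@{ parse_products($in) }) {
        my $f = $rec->{fields};
        print "Product $rec->{product}: ",
              join(', ', map { "$_=$f->{$_}" } sort keys %$f), "\n";
    }
}
```

Calling the flush at end-of-file as well as on blank lines means the last record is not dropped when the file does not end with a blank line.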

Offhand, I would think that you first want to deal with the task of parsing the file successfully and then, perhaps after stuffing the data into some kind of database, go back and deal with the duplicates (whatever you decide a “duplicate” ought to be). I make this two-part suggestion partly because, in my experience, “it might not be so easy.” You might have to make some decision, even a human or a case-by-case one, about which record to discard and which to keep. Therefore, the “parsing” problem and the subsequent “de-duping and output” problem might need to be separated from one another.
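As one possible shape for the second, separate pass, here is a minimal sketch that only groups candidate duplicates and leaves the keep-or-discard decision to the caller. It assumes each parsed record is a hash reference with a `product` number and a `fields` hash (a hypothetical layout), it uses a plain in-memory list as a stand-in for the database, and it assumes that “duplicate” means “identical keyword/value pairs”, which is only one of many definitions you might settle on. The names `record_key` and `group_duplicates` are made up for illustration.

```perl
use strict;
use warnings;

# Build a canonical string from a record's fields, so that records with
# the same keyword/value pairs compare equal regardless of field order.
# The control characters are separators assumed not to occur in the data.
sub record_key {
    my ($rec) = @_;
    my $f = $rec->{fields};
    return join "\x1E", map { "$_\x1F$f->{$_}" } sort keys %$f;
}

# Group records by canonical key, preserving first-seen order.
# Any group with more than one member is a set of candidate duplicates.
sub group_duplicates {
    my ($records) = @_;
    my (%groups, @order);
    for my $rec (@$records) {
        my $key = record_key($rec);
        push @order, $key unless exists $groups{$key};
        push @{ $groups{$key} }, $rec;
    }
    return [ map { $groups{$_} } @order ];
}

# Made-up records: products 1 and 3 carry the same fields.
my $records = [
    { product => 1, fields => { name => 'Widget', color => 'red' } },
    { product => 2, fields => { name => 'Gadget' } },
    { product => 3, fields => { color => 'red', name => 'Widget' } },
];

# One simple policy: keep the first record of each group.
my @kept = map { $_->[0] } @{ group_duplicates($records) };
print "kept product $_->{product}\n" for @kept;
```

Because the grouping step returns whole groups rather than silently dropping records, a different policy (keep the last one, ask a human, merge fields) can be swapped in without touching the parser.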

