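There is a mathematical technique called cross-correlation which measures the similarity of two signals as a function of the time offset between them. It is closely related to convolution (it amounts to convolution with one signal time-reversed), and in some contexts it's called the correlation function. It's fairly easy to calculate with Perl tools.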
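PDL (the Perl Data Language) is ideal for this. Pad both sets of audio data with zeros to a common length, since the FFT method treats the data as periodic. Take the FFT (Fast Fourier Transform) of each set. Multiply one transformed dataset by the complex conjugate of the other, term by term; without the conjugate you get the convolution rather than the correlation. Take the inverse FFT of the product dataset and you have the correlation function.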
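The FFT recipe can be sketched as follows. This is a minimal illustration in plain Python rather than PDL, so that no PDL call signatures are assumed. The function names (`fft`, `cross_correlate`, `best_lag`) are made up for this sketch, and the small recursive FFT stands in for whatever FFT routine you actually use. One transform is conjugated before multiplying, so the result is a correlation rather than a convolution.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def cross_correlate(a, b):
    """Return c with c[k] = sum over n of a[n + k] * b[n].

    Lags are taken modulo the padded length, so negative lags
    appear at the end of the returned list.
    """
    # Zero-pad to a power of two large enough to avoid wrap-around.
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    # Multiply term by term, conjugating the second transform.
    product = [p * q.conjugate() for p, q in zip(fa, fb)]
    # Inverse transform; divide by size because fft() is unnormalized.
    return [v.real / size for v in fft(product, inverse=True)]

def best_lag(a, b):
    """Lag (in samples) at which a best lines up with b."""
    c = cross_correlate(a, b)
    k = max(range(len(c)), key=lambda i: c[i])
    # Map indices in the upper half back to negative lags.
    return k if k <= len(c) // 2 else k - len(c)
```

For example, `best_lag(delayed, original)` would give the number of samples by which `delayed` lags `original`, and a negative value if it leads instead.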
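If the two signals are pretty much identical, the correlation function should roughly approximate a delta function, having a strong spike at zero lag and staying comparatively small elsewhere. If one signal is a delayed copy of the other, the spike moves to the lag that matches the delay. For real-world data, some scaling, filtering, and normalization might be desirable.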
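One common form of normalization, shown here only as a sketch in plain Python, is to subtract each signal's mean and divide by the square root of the product of their energies. The peak then lies between -1 and 1 regardless of recording level. The function name `normalized_correlation` and its `lag` parameter are assumptions made for this illustration, and the direct sum is used for clarity rather than speed.

```python
import math

def normalized_correlation(a, b, lag=0):
    """Correlation coefficient of a and b at the given lag, in [-1, 1]."""
    # Remove the DC offset from each signal.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    a0 = [v - mean_a for v in a]
    b0 = [v - mean_b for v in b]
    # Scale factor: square root of the product of the two energies.
    norm = math.sqrt(sum(v * v for v in a0) * sum(v * v for v in b0))
    if norm == 0:
        return 0.0  # at least one signal is constant
    # Sum a0[n + lag] * b0[n] over the samples where both exist.
    total = sum(a0[n + lag] * b0[n]
                for n in range(len(b0))
                if 0 <= n + lag < len(a0))
    return total / norm
```

A value near 1 at the best lag suggests the two recordings match closely; a value near 0 suggests they are unrelated.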
For this to work, the audio data should be raw signed sample values, i.e. uncompressed PCM. You will likely need to decode to that from most popular audio formats. PDL::Audio may be of help.
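To illustrate what "raw signed numbers" means in practice, here is a small sketch in plain Python that turns a buffer of PCM bytes into a list of numbers. It assumes 16-bit little-endian signed samples with interleaved channels, which is a common layout but not the only one. The function name `pcm16_to_floats` is made up for this example.

```python
import struct

def pcm16_to_floats(raw, channels=1):
    """Decode little-endian signed 16-bit PCM bytes to floats in [-1, 1).

    Only the first channel of interleaved data is kept.
    """
    count = len(raw) // 2  # two bytes per sample
    samples = struct.unpack("<%dh" % count, raw[:count * 2])
    # Take every `channels`-th sample (channel 0) and scale by 2**15.
    return [s / 32768.0 for s in samples[::channels]]
```

Compressed formats would need to be decoded to PCM first, and other sample widths would need a different `struct` format code.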
