http://www.perlmonks.org?node_id=1201776


in reply to Re^6: Parsing HTML/XML with Regular Expressions (regex)
in thread Parsing HTML/XML with Regular Expressions

I ran your version of my code and got the same output you did.

Since I had already discovered the embedded newlines in the elements list, I added tr/\n//d; at the top of the for loop:

for (@elements) { tr/\n//d;

After doing that, the id for Saturday was picked up correctly. Also, out of curiosity, I removed the s/\W+//g; you added. The result was:

Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Foo= Sundaybbbdddeeeggg

So, Saturday is cleaned up.
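For anyone following along, here is a minimal sketch of what the tr/\n//d step does inside the loop. The contents of @elements below are made up for illustration and are not the actual list from this thread; the point is only that $_ aliases each array element, so the deletion happens in place.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real @elements comes from the earlier parsing code.
my @elements = ("id=\"Six\">Satur\nday", "id=\"One\">Monday\n");

for (@elements) {
    tr/\n//d;    # $_ is an alias for the element, so @elements itself is modified
}

print join(", ", @elements), "\n";
```

Because the loop variable is an alias, no assignment back into the array is needed.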

I know why the id for Sunday is Foo, but I'm still not sure why the "bbbdddeeeggg" is picked up. I will have to step through the code to see what's happening.

As for the &nbsp; (non-breaking space), that's encoding dependent. I'm not sure why it would get excluded other than by explicitly filtering out non-ASCII characters.

The entity-encoded y is the y in Sunday. It just requires entity decoding.
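A rough sketch of that decoding step, using only core Perl. The helper name decode_basic_entities and the sample input are my own inventions for illustration; it handles numeric entities and &nbsp; only. In real code, decode_entities from the HTML::Entities module (CPAN) is the usual tool and covers the full set of named entities.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes numeric entities and &nbsp; only.
sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;    # hexadecimal form, e.g. &#x79;
    $text =~ s/&#(\d+);/chr($1)/ge;                   # decimal form, e.g. &#121;
    $text =~ s/&nbsp;/\x{A0}/g;                       # named no-break space
    return $text;
}

my $raw     = '&nbsp;Sunda&#121;';                    # made-up input, not from the thread
my $decoded = decode_basic_entities($raw);

# Optionally drop the no-break space once it is a real character.
(my $ascii = $decoded) =~ tr/\x{A0}//d;

print "$ascii\n";
```

Decoding first and filtering afterwards keeps the two concerns separate: the y comes back as an ordinary letter, and the no-break space can then be kept or removed deliberately.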