in reply to Re^6: Parsing HTML/XML with Regular Expressions (regex)
in thread Parsing HTML/XML with Regular Expressions
I ran your version of my code and got the same output you did.
Since I already discovered the embedded newlines in the elements list, I added tr/\n//d; at the top of the for loop:
for (@elements) { tr/\n//d;
After doing that, the id for Saturday picked up correctly. Also, out of curiosity, I removed the s/\W+//g; you added. The result was:
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Foo= Sundaybbbdddeeeggg
So, Saturday is cleaned up.
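For anyone following along, here is a minimal sketch of why that one-line change works. The sample strings are made up for illustration (the real @elements comes from the earlier parsing step). The point is only that $_ aliases each array element inside the for loop, so tr/\n//d edits the array in place:

```perl
use strict;
use warnings;

# Hypothetical sample data, not the actual list from the thread.
my @elements = ("Satur\nday", "Monday", "Tues\nday\n");

for (@elements) {
    tr/\n//d;    # $_ is an alias to the element, so the array itself changes
}

print join(", ", @elements), "\n";
```

Because tr/// works on $_ by default, no explicit assignment back into the array is needed.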
I know why the id for Sunday is Foo, but I'm still not sure why the "bbbdddeeeggg" is picked up. I will have to step through the code to see what's happening.
As for the &nbsp;, that's encoding dependent. Not sure why it would get excluded other than by explicitly filtering out non-ASCII characters.
The numeric character entity is the y in Sunday. It just requires entity decoding.
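As a rough sketch of what entity decoding means here, the following uses only core Perl. The sub name decode_basic_entities, the tiny named-entity table, and the sample input 'Sunda&#121;' are all assumptions for illustration, not taken from the code in this thread. In practice the CPAN module HTML::Entities (its decode_entities function) is the usual tool and covers the full entity list.

```perl
use strict;
use warnings;

# Tiny named-entity table for illustration only; a real decoder needs the full list.
my %named = (nbsp => "\x{A0}", amp => '&', lt => '<', gt => '>', quot => '"');

# Hypothetical helper: decodes hex, decimal, and a few named entities in a
# single pass, so text produced by one replacement is never decoded twice.
sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s{&(?:#x([0-9a-fA-F]+)|#(\d+)|(\w+));}{
          defined $1        ? chr(hex $1)    # hex reference, e.g. &#x79;
        : defined $2        ? chr($2)        # decimal reference, e.g. &#121;
        : exists $named{$3} ? $named{$3}     # known named entity
        :                     "&$3;"         # unknown name: leave untouched
    }gex;
    return $text;
}

my $decoded = decode_basic_entities('Sunda&#121;');
print "$decoded\n";    # prints "Sunday"
```

Note that &nbsp; decodes to U+00A0, which is not matched by a plain ASCII space, so a later filter on non-ASCII characters would drop it.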