I apologise for the delay, and I hope you're still willing to help me.
I understand you need more details about my script and resources. Here they are:
- 8 GB of RAM (this is a problem for me right now: I tend to write rather greedy code, and I'm getting out-of-memory errors)
- a 70 KB XML file, almost 2M lines long, each line corresponding to one word / node, like this:
<DocumentSet>
<document>
<w lemma="appeler" type="VER:pres">appelle</w>
<w lemma="quand" type="KON">quand</w>
<w lemma="gronder" type="VER:infi">gronder</w>
</document>
</DocumentSet>
- a 10 KB tab-separated text file, 150k lines long, looking like this:
tunisiennes tynizjEn tunisien ADJ f p 0,3 3,51 0 0,2 0,2 undef
remplît R@pli remplir VER undef undef 61,21 81,42 0 0,2 0,2 "sub:imp:3s;"
remuons R°my§ remuer VER undef undef 24,42 62,84 0,2 0 0,2 "imp:pre:1p;ind:pre:1p;"
remuât R°m8a remuer VER undef undef 24,42 62,84 0 0,2 0,2 "sub:imp:3s;"
renaudant R°nod@ renauder VER undef undef 0 2,64 0 0,2 0,2 "par:pre;"
ébouriffées ebuRife ébouriffé ADJ f p 0,22 3,45 0 0,2 0,2 undef
rendissent R@dis rendre VER undef undef 508,81 468,11 0 0,2 0,2 "sub:imp:3p;"
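In case it helps to see the column layout spelled out, here is a minimal sketch (not my actual script) of how one line of this file could be split and indexed by its first column. The column positions, the UTF-8 encoding, the names parse_lexicon_line and load_lexicon, and the choice to keep only the first entry when a word appears more than once are all assumptions read off the sample above.

```perl
use strict;
use warnings;

# Assumed 0-based column positions, read off the sample lines:
# word, gender, number, and the conjugation field in the last column.
my ($COL_WORD, $COL_GENRE, $COL_NOMBRE, $COL_CONJ) = (0, 4, 5, 11);

# Split one tab-separated line into (word, { genre, nombre, conjugaison }).
# Returns an empty list for lines that have too few columns.
sub parse_lexicon_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;
    return if @fields <= $COL_CONJ;

    my %att;
    @att{qw(genre nombre conjugaison)} = map {
        my $value = $_;
        $value =~ s/^"|"$//g;               # strip the surrounding quotes
        $value eq 'undef' ? '' : $value;    # "undef" becomes an empty string
    } @fields[$COL_GENRE, $COL_NOMBRE, $COL_CONJ];

    return ($fields[$COL_WORD], \%att);
}

# Read the whole file once into a hash keyed on the word form.
# Assumption: the file is UTF-8 encoded.
sub load_lexicon {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my %lexicon;
    while (my $line = <$fh>) {
        my ($word, $att) = parse_lexicon_line($line) or next;
        $lexicon{$word} //= $att;   # keep the first entry for a repeated form
    }
    close $fh;
    return \%lexicon;
}

# Hypothetical usage: my $lexicon = load_lexicon('lexicon.txt');
```

With 150k lines, a hash like this should fit comfortably in memory, and each lookup is a single hash access instead of a scan of the file.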
I'm using the XML::Twig module to walk the XML tree and modify nodes. A foreach loop over $w visits each <w> node and checks whether its content matches a word in the first column of the tab-separated file. If it does, I want to add some attributes taken from the other columns to the XML node, giving a result like this:
<w conjugaison="imp:pre:2s;ind:pre:1s;ind:pre:3s;sub:pre:1s;sub:pre:3s;" genre="" lemma="appeler" nombre="" type="VER:pres">appelle</w>
<w genre="" lemma="quand" nombre="" type="KON">quand</w>
<w conjugaison="inf;" genre="" lemma="gronder" nombre="" type="VER:infi">gronder</w>
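To make the intended processing concrete, here is a minimal sketch of it, written as a streaming variant rather than a copy of my actual script. It swaps the foreach loop for an XML::Twig handler on <w> that flushes each node once it has been handled, so the whole tree never has to sit in memory. The name annotate_file, the small in-line %lexicon standing in for the tab-separated file, and the rule that conjugaison is only written when it is non-empty are assumptions made for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Annotate every <w> whose text is a key of %$lexicon and write the
# result to $out_fh, one flushed chunk at a time.
sub annotate_file {
    my ($xml_path, $out_fh, $lexicon) = @_;

    my $twig = XML::Twig->new(
        twig_handlers => {
            w => sub {
                my ($t, $w) = @_;
                if (my $entry = $lexicon->{ $w->text }) {
                    $w->set_att(
                        genre  => $entry->{genre},
                        nombre => $entry->{nombre},
                    );
                    # Only words with conjugation data get this attribute.
                    $w->set_att(conjugaison => $entry->{conjugaison})
                        if length $entry->{conjugaison};
                }
                # Print what has been parsed so far and release its memory.
                $t->flush($out_fh);
            },
        },
    );

    $twig->parsefile($xml_path);
    $twig->flush($out_fh);    # print the closing tags that remain
    return;
}

# Stand-in for the real lookup table (hypothetical sample entries).
my %lexicon = (
    quand   => { genre => '', nombre => '', conjugaison => '' },
    gronder => { genre => '', nombre => '', conjugaison => 'inf;' },
);

# Runs only when an input path is given on the command line.
if (@ARGV) {
    annotate_file($ARGV[0], \*STDOUT, \%lexicon);
}
```

Flushing after every <w> is the simplest choice for keeping memory flat; flushing once per <document> would also work as long as a single document stays small.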
Ask me for more info if needed.