-
Notifications
You must be signed in to change notification settings - Fork 12
Open
Labels
Description
Processing of huge amounts of pre-tokenised FoLiA documents (for Nederlab) goes unexpectedly slow, despite disabling various modules (--skip=mcpa). In about 24 hours, about 90 documents have been processed.
Frog is called on a directory as follows (to eliminate initialisation overhead):
frog --skip=mcpa -override tokenizer.rulesFile=tokconfig-nld-historical --xmldir "." --threads 40 --testdir input/ -x
Log excerpt of a single document (rarity:/scratch/proycon/morr001cryp01_01.tok.folia.xml) in a long-running batch (days if not weeks):
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took: 0 seconds, 1 milliseconds and 510 microseconds
frog-:CGN tagging took: 42 seconds, 593 milliseconds and 430 microseconds
frog-:NER took: 30 seconds, 188 milliseconds and 829 microseconds
frog-:Mblem took: 0 seconds, 803 milliseconds and 711 microseconds
frog-:Frogging in total took: 89 seconds, 930 milliseconds and 350 microseconds
frog-:resulting FoLiA doc saved in ./mole002refe01_01_0005.tok.folia.xml
frog-:Frogging input/moll013albu01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took: 0 seconds, 16 milliseconds and 169 microseconds
frog-:CGN tagging took: 263 seconds, 846 milliseconds and 415 microseconds
frog-:NER took: 184 seconds, 152 milliseconds and 171 microseconds
frog-:Mblem took: 5 seconds, 328 milliseconds and 128 microseconds
frog-:Frogging in total took: 559 seconds, 893 milliseconds and 805 microseconds
frog-:resulting FoLiA doc saved in ./moll013albu01_01.tok.folia.xml
frog-:Frogging input/mont003zome01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took: 0 seconds, 47 milliseconds and 859 microseconds
frog-:CGN tagging took: 57 seconds, 762 milliseconds and 530 microseconds
frog-:NER took: 37 seconds, 773 milliseconds and 605 microseconds
frog-:Mblem took: 1 seconds, 304 milliseconds and 201 microseconds
frog-:Frogging in total took: 120 seconds, 269 milliseconds and 869 microseconds
frog-:resulting FoLiA doc saved in ./mont003zome01_01.tok.folia.xml
----------------------------- (my emphasis) ------------------------
frog-:Frogging input/morr001cryp01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took: 0 seconds, 31 milliseconds and 59 microseconds
frog-:CGN tagging took: 322 seconds, 971 milliseconds and 28 microseconds
frog-:NER took: 728 seconds, 839 milliseconds and 11 microseconds
frog-:Mblem took: 20 seconds, 219 milliseconds and 307 microseconds
frog-:Frogging in total took: 1494 seconds, 657 milliseconds and 222 microseconds
-----------------------------------------------------
frog-:resulting FoLiA doc saved in ./morr001cryp01_01.tok.folia.xml
frog-:Frogging input/moul004vowe01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took: 0 seconds, 7 milliseconds and 995 microseconds
frog-:CGN tagging took: 336 seconds, 695 milliseconds and 921 microseconds
frog-:NER took: 111 seconds, 161 milliseconds and 469 microseconds
frog-:Mblem took: 3 seconds, 0 milliseconds and 52 microseconds
frog-:Frogging in total took: 508 seconds, 408 milliseconds and 533 microseconds
frog-:resulting FoLiA doc saved in ./moul004vowe01_01.tok.folia.xml
frog-:Frogging input/mouw001brah01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took: 0 seconds, 37 milliseconds and 979 microseconds
frog-:CGN tagging took: 30 seconds, 709 milliseconds and 748 microseconds
frog-:NER took: 28 seconds, 536 milliseconds and 162 microseconds
frog-:Mblem took: 0 seconds, 966 milliseconds and 700 microseconds
frog-:Frogging in total took: 75 seconds, 622 milliseconds and 890 microseconds
frog-:resulting FoLiA doc saved in ./mouw001brah01_01.tok.folia.xml
frog-:Frogging input/muld014janf01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
(full log in rarity:/scratch/proycon/frog.log)
Comparison; a standalone run on only the highlighted document (without -nostdout) :
frog-:Frogging morr001cryp01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took: 0 seconds, 17 milliseconds and 448 microseconds
frog-:CGN tagging took: 17 seconds, 26 milliseconds and 152 microseconds
frog-:NER took: 65 seconds, 981 milliseconds and 710 microseconds
frog-:Mblem took: 28 seconds, 174 milliseconds and 786 microseconds
frog-:Frogging in total took: 487 seconds, 310 milliseconds and 584 microseconds
I have some minor suggestions for better debugging:
- Let frog output a date/timestamp before and after processing (makes it easier to inspect a long logs running for days)
- Is some normalisation of timings possible (in addition to the absolute numbers), e.g. divided by the total amount of tokens?
Other possibility for testdir:
- Parallellise document processing instead of modules? (if not too complicated)