Authors:
(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;
(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;
(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;
(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.
Editor’s note: This is Part 7 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing. Read the rest below.
Table of Links
- Abstract and 1. Introduction and related works
- 2. NLPre benchmarking
  - 2.1. Research concept
  - 2.2. Online benchmarking system
  - 2.3. Configuration
- 3. NLPre-PL benchmark
  - 3.1. Datasets
  - 3.2. Tasks
- 4. Evaluation
  - 4.1. Evaluation methodology
  - 4.2. Evaluated systems
  - 4.3. Results
- Conclusions
- Appendices
- Acknowledgements
- Bibliographical References
- Language Resource References
4. Evaluation
4.1. Evaluation methodology
To maintain the de facto standard for NLPre evaluation, we apply the evaluation measures defined for the CoNLL 2018 shared task and implemented in the official evaluation script.[11] In particular, we focus on F1 and AlignedAccuracy, which is similar to F1 but does not take possible misalignments of tokens, words, or sentences into account.
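For reference, the official script can either be invoked from the command line with the gold and system CoNLL-U files as arguments, or imported as a module. The snippet below is a minimal sketch of the latter, assuming the script has been downloaded alongside the code; the file names are placeholders, and the metric names follow the publicly available conll18_ud_eval.py.

```python
# Minimal sketch: computing CoNLL 2018 scores with the official script.
# Assumes conll18_ud_eval.py has been downloaded into the working directory;
# "gold.conllu" and "system.conllu" are placeholder file names.
import conll18_ud_eval as ud_eval

# Load the gold-standard and system-produced CoNLL-U files.
gold = ud_eval.load_conllu_file("gold.conllu")
system = ud_eval.load_conllu_file("system.conllu")

# evaluate() returns a dict of per-metric Score objects
# (e.g. Tokens, Sentences, Words, UPOS, XPOS, Lemmas, UAS, LAS).
scores = ud_eval.evaluate(gold, system)

for metric in ("Tokens", "Sentences", "Words", "UPOS", "XPOS", "Lemmas", "UAS", "LAS"):
    score = scores[metric]
    # AlignedAccuracy is undefined for segmentation metrics, hence the None check.
    aligned = "-" if score.aligned_accuracy is None else f"{100 * score.aligned_accuracy:.2f}"
    print(f"{metric}: F1 = {100 * score.f1:.2f}, AlignedAccuracy = {aligned}")
```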
In our evaluation process, we follow the default training procedures suggested by the authors of the evaluated systems, i.e. we do not conduct any hyperparameter search and instead leave the recommended model configurations as-is. We also do not further fine-tune the selected models.
[11] https://universaldependencies.org/conll18/conll18_ud_eval.py