Table of Links
- Abstract and Introduction
- Domain and Task
  2.1. Data sources and complexity
  2.2. Task definition
- Related Work
  3.1. Text mining and NLP research overview
  3.2. Text mining and NLP in industry use
  3.3. Text mining and NLP for procurement
  3.4. Conclusion from literature review
- Proposed Methodology
  4.1. Domain knowledge
  4.2. Content extraction
  4.3. Lot zoning
  4.4. Lot item detection
  4.5. Lot parsing
  4.6. XML parsing, data joining, and risk indices development
- Experiment and Demonstration
  5.1. Component evaluation
  5.2. System demonstration
- Discussion
  6.1. The ‘industry’ focus of the project
  6.2. Data heterogeneity, multilingual and multi-task nature
  6.3. The dilemma of algorithmic choices
  6.4. The cost of training data
- Conclusion, Acknowledgements, and References
6.3. The dilemma of algorithmic choices
Our classification method is based on ‘classic’ machine learning algorithms and does not utilise the more recent deep neural networks trained on very large datasets as ‘language models’, such as the well-known BERT (Devlin et al., 2018). These models have been shown to set new benchmarks for a wide range of NLP tasks. However, we did not use BERT or similar language-model-based methods, for a number of practical reasons.
First, such models work somewhat differently from classic algorithms in that they take over the feature extraction process through a ‘text reading’ process analogous to how humans read. The idea is that, by being trained on enormous amounts of data, these models acquire a certain level of ‘understanding’ of human language and can therefore learn to extract useful features from input texts automatically. This makes explicit feature engineering largely incompatible with such models, although one can still use such a model as a ‘feature extractor’ to generate a feature representation vector and combine it with other predefined features.
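To make this concrete, the sketch below (our illustration, not the authors' pipeline) uses a generic pretrained BERT model via the Hugging Face transformers library as a feature extractor and concatenates its [CLS] vector with a vector of predefined features; the model name, the 256-token limit, and the helper names are assumptions for illustration only.

```python
# A minimal sketch, assuming a generic pretrained BERT and hypothetical
# handcrafted features; not the system described in this paper.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def lm_features(text: str) -> np.ndarray:
    """Return the 768-dimensional [CLS] embedding for a piece of text."""
    inputs = tokenizer(text, truncation=True, max_length=256,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # The first token of the last hidden state is the [CLS] representation.
    return outputs.last_hidden_state[0, 0].numpy()

def combined_features(text: str, predefined: np.ndarray) -> np.ndarray:
    """Concatenate the language-model vector with predefined features;
    the result can be fed to a classic classifier such as a random forest."""
    return np.concatenate([lm_features(text), predefined])
```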
Second, such models are usually very resource intensive due to their complex architectures and the amount of data used to train them. As a result, they can typically only accept input of limited length; popular options are 256 or 512 tokens, with the latter requiring significantly more computing resources. This means that some of our tasks, such as page or table classification, may not fit well when the content exceeds this limit, particularly under the smaller and more affordable 256-token option.
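The following small illustration (assumed, not from the paper) shows the practical effect of the length constraint: any content beyond the maximum token length is simply never seen by the model.

```python
# A hedged example of the input-length constraint; the text is synthetic.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_page = "word " * 1000  # e.g. a long tender page or a large table
encoded = tokenizer(long_page, truncation=True, max_length=256)
print(len(encoded["input_ids"]))  # 256: the rest of the page is discarded
```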
Third, we experimented with a generic BERT model (English, uncased, 256 tokens) on the lot item detection task using English training data only, and compared its performance against our best-performing model, a random forest. However, we did not observe a significant gain in F1 from BERT (under 2 percentage points). A potential reason is that the generic BERT model is trained on general-purpose corpora that may not be closely related to healthcare. Studies (Alsentzer et al., 2019; Beltagy et al., 2019) have shown that language models usually need to be further trained on domain-specific corpora before being applied in domain-specific contexts. This again requires significant data and computing resources, especially when multiple languages must be handled.
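For reference, a comparison baseline of this kind can be set up as in the sketch below; this is a generic illustration with assumed TF-IDF features and hyperparameters, not the actual feature set or evaluation protocol used in our experiments.

```python
# A minimal sketch of a classic random forest baseline evaluated with F1,
# assuming TF-IDF features in place of the paper's engineered features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def evaluate_random_forest(train_texts, train_labels, test_texts, test_labels):
    clf = make_pipeline(TfidfVectorizer(),
                        RandomForestClassifier(n_estimators=200))
    clf.fit(train_texts, train_labels)
    # Macro-averaged F1 is one common choice for imbalanced label sets.
    return f1_score(test_labels, clf.predict(test_texts), average="macro")
```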
The next lesson we learned, therefore, is that for industry projects, practicality is an important factor in choosing methods. Resource constraints often outweigh algorithmic superiority, especially when the performance gain is small relative to the resource investment required.
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);
(2) Tomas Jasaitis, Vamstar Ltd., London ([email protected]);
(3) Richard Freeman, Vamstar Ltd., London ([email protected]);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]).