Natural language processing (NLP) is a field of study that aims to program computers to process and analyze large amounts of natural language data. To use datasets accurately and effectively, NLP systems need labeled data. In some cases, such as pathology reports, the sub-parts of the report are not programmatically labeled. To address this unlabeled-data problem, LLNL researchers have developed AL-NLP, software that implements an active learning framework for NLP systems. It is intended for scenarios where only a limited amount of labeled data is available to train a machine learning-based NLP classification system but a large set of unlabeled documents exists, as is the case with pathology reports. AL-NLP identifies which unlabeled document should be labeled next so that the overall performance of the classifier improves. This leads to better NLP systems and, ultimately, an understanding of the dataset at large. For pathology reports, it helps label the data within each report so that NLP systems can assist a doctor in analyzing all of a patient's information, leading to a more informed treatment decision.
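
The idea of choosing which unlabeled document to label next can be illustrated with uncertainty sampling, a common active learning strategy. The summary above does not say which selection criterion AL-NLP uses, so this is only a minimal sketch under that assumption: the function names and the probability values are hypothetical and are not taken from AL-NLP. It assumes a classifier has already produced a predicted class distribution for each unlabeled document, and it picks the document whose prediction has the highest entropy.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_next_document(predicted_probs):
    """Return the index of the unlabeled document the classifier is least sure about.

    predicted_probs holds one list of class probabilities per unlabeled document.
    This is plain uncertainty sampling, assumed here for illustration only.
    """
    if not predicted_probs:
        raise ValueError("no unlabeled documents to choose from")
    return max(
        range(len(predicted_probs)),
        key=lambda i: prediction_entropy(predicted_probs[i]),
    )


# Hypothetical classifier outputs for three unlabeled report sections.
unlabeled_probs = [
    [0.95, 0.03, 0.02],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain prediction
    [0.70, 0.20, 0.10],  # moderately confident prediction
]

next_index = select_next_document(unlabeled_probs)
print(f"Ask an annotator to label document {next_index} next")
```

In a full active learning loop, the selected document would be labeled by a human, added to the training set, and the classifier retrained before the next selection round.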

AL-NLP is open-source software licensed under the MIT license (LLNL internal case # CP02231).