Machine Learning Methods Technologies

Natural language processing (NLP) is a field of study that aims to program computers to process and analyze large amounts of natural language data. To use datasets accurately and effectively, NLP systems need labeled data. In cases like pathology reports, the sub-parts of the report are not programmatically labeled. To solve the unlabeled dataset problem, LLNL researchers have developed software called AL-NLP that implements an active learning framework for NLP systems. It is intended for scenarios where only a limited amount of labeled data is available to train a machine learning-based NLP classification system but a large set of unlabeled documents exists, as is the case with pathology reports.
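The source does not say which query strategy AL-NLP uses, so the following is only a minimal sketch of one common active learning strategy, uncertainty sampling, and not AL-NLP's implementation. It assumes a toy naive Bayes text classifier and a hypothetical `oracle` callable standing in for the human annotator; all function names here are illustrative.

```python
import math
from collections import Counter

def train(labeled):
    """Fit a tiny multinomial naive Bayes model on (text, label) pairs."""
    word_counts = {}        # label -> Counter of word occurrences
    doc_counts = Counter()  # label -> number of documents
    vocab = set()
    for text, label in labeled:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def predict_proba(model, text):
    """Return {label: probability} using Laplace-smoothed naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            # The extra +1 in the denominator reserves mass for unseen words.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab) + 1)
            )
        log_scores[label] = score
    top = max(log_scores.values())
    exps = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exps.values())
    return {label: value / norm for label, value in exps.items()}

def most_uncertain(model, unlabeled):
    """Pick the unlabeled text whose most likely label has the lowest probability."""
    return min(unlabeled, key=lambda t: max(predict_proba(model, t).values()))

def active_learning_loop(labeled, unlabeled, oracle, rounds):
    """Repeatedly retrain, query the oracle on the least certain text, and grow the labeled set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(min(rounds, len(unlabeled))):
        model = train(labeled)
        query = most_uncertain(model, unlabeled)
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))  # oracle is a hypothetical annotator
    return train(labeled), labeled, unlabeled
```

The point of the loop is that annotation effort goes to the documents the current model is least sure about, rather than to a random sample of the unlabeled pool.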

Conventional dimension reduction methods aim to maximally preserve the global and/or local geometric structure of a dataset. In practice, however, one is often more interested in determining how one or more user-selected response functions can be explained by the data. To intuitively connect the responses to the data, LLNL scientists developed function preserving projections (FPP), a scalable linear projection technique for discovering interpretable relationships in high-dimensional data. FPP constructs 2D linear embeddings optimized to reveal interpretable yet potentially non-linear patterns of the response functions.

Clinical images hold a wealth of data that is currently untapped by physicians and machine learning (ML) methods alike. Most ML methods require more data than is available to train them sufficiently. To obtain all of the data contained in a clinical image, it is imperative to be able to use multimodal data (that is, data of various types, such as tags or identifications), especially where spatial relationships are key to a clinical diagnosis. To this end, LLNL scientists have developed a method for embedding representations into an image for more efficient processing. Elements within an image are identified, and their spatial arrangement is encoded in a graph. Any machine learning technique can then be applied to the resulting multimodal graph, which serves as a representation of the image.
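One way to picture the step of encoding spatial arrangement in a graph is the sketch below. It is an illustration only, not LLNL's method: it assumes each identified element is given as a dictionary with hypothetical keys `tag`, `x`, and `y` (a label and a centroid position), and it links each element to its `k` nearest neighbors, which is just one of many possible ways to define edges.

```python
import math

def build_spatial_graph(elements, k=2):
    """Connect each element to its k nearest neighbors by centroid distance.

    elements: list of dicts with hypothetical keys "tag", "x", "y".
    Returns (nodes, edges), where edges maps an (i, j) index pair with i < j
    to the Euclidean distance between the two elements.
    """
    nodes = [{"id": i, "tag": e["tag"]} for i, e in enumerate(elements)]
    edges = {}
    for i, a in enumerate(elements):
        # Distances from element i to every other element, nearest first.
        dists = sorted(
            (math.hypot(a["x"] - b["x"], a["y"] - b["y"]), j)
            for j, b in enumerate(elements)
            if j != i
        )
        for dist, j in dists[:k]:
            edges[(min(i, j), max(i, j))] = dist  # undirected edge
    return nodes, edges

# Illustrative input: three made-up elements with made-up coordinates.
elements = [
    {"tag": "nodule", "x": 0, "y": 0},
    {"tag": "vessel", "x": 3, "y": 4},
    {"tag": "airway", "x": 100, "y": 100},
]
nodes, edges = build_spatial_graph(elements, k=1)
```

The node tags carry the non-image modality and the edge weights carry the spatial relationships, so a downstream model sees both in a single structure.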

Computed tomography (CT) scans are being used for triage in some COVID-19 diagnoses. CT scans produce immediate results with high sensitivity. The digital images produced by a CT scan require physicians to identify objects within the image to determine the presence of disease. Object identification can be done using machine learning (ML) techniques such as deep learning (DL) to improve the speed and accuracy of disease identification in CT images. Current techniques require images to be the same size and resolution in order to properly train DL algorithms. LLNL scientists have developed a technique that automatically samples across various views and backgrounds to pre-select possible objects of interest.

Drug discovery could be significantly sped up by the integration of in silico methods. To this end, LLNL scientists along with other ATOM Consortium members created the ATOM Modeling PipeLine (AMPL). AMPL is an open-source, modular, extensible, end-to-end software pipeline for building and sharing models. It extends the functionality of DeepChem and supports an array of machine learning and molecular featurization tools. AMPL has been benchmarked on a large collection of pharmaceutical datasets covering a wide range of parameters and has been shown to generate machine learning models that can predict key safety and pharmacokinetic-relevant parameters.

When analyzing a dataset, one must understand not only the relationships between the data points but also the underlying structure of the set. The underlying structure of a dataset is generally estimated from the data on hand, leading to assumptions and less accurate predictions. To improve structure learning, LLNL scientists have developed an open-source software suite called MTL. Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously. This software suite can handle any type of data and consists of multi-task learning methods and a framework for easy experimentation with machine learning methods, leading to more accurate assumptions and predictions.

Image: MimicGAN data set example

MimicGAN represents a new generation of methods that can “self-correct” for unseen corruptions in data encountered in the field. This is particularly useful for systems that need to be deployed autonomously without constant intervention, such as Automated Driver Assistance Systems. MimicGAN achieves this by treating every test sample as “corrupt” by default. The goal is to determine (a) the clean image and (b) the corruption, both of which are unknown to the system at test time. MimicGAN solves this by alternating between guessing what the clean sample should look like and guessing what corruption would turn it into the observed corrupted sample. If there is no corruption at all, MimicGAN simply learns the corruption to be an identity transform, that is, no corruption.
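The alternating guesses can be illustrated with a deliberately simplified toy, which is not MimicGAN itself. The sketch assumes the set of possible clean samples is a short list of non-zero `prototypes` (a stand-in for what a trained generator could produce) and that the corruption is a single unknown scalar `gain`; both are assumptions made for illustration, as are all the names.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def alternate_estimates(observed, prototypes, steps=10):
    """Alternate between guessing the clean sample and guessing the corruption.

    observed:   the (possibly corrupted) test sample, a list of floats.
    prototypes: hypothetical candidate clean samples, each non-zero.
    Returns (clean_estimate, gain_estimate).
    """
    gain = 1.0  # start from the identity corruption
    clean = prototypes[0]
    for _ in range(steps):
        # Guess the clean sample: the prototype that best explains the
        # observation under the current corruption estimate.
        clean = min(
            prototypes,
            key=lambda p: sum((o - gain * v) ** 2 for o, v in zip(observed, p)),
        )
        # Guess the corruption: the least-squares gain that maps the
        # current clean estimate onto the observation.
        gain = dot(observed, clean) / dot(clean, clean)
    return clean, gain

prototypes = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
dimmed = [0.0, 0.5, 0.5]     # second prototype at half brightness
untouched = [1.0, 0.0, 1.0]  # first prototype, no corruption
clean_a, gain_a = alternate_estimates(dimmed, prototypes)
clean_b, gain_b = alternate_estimates(untouched, prototypes)
```

In this toy, an uncorrupted input leads to a gain estimate of 1.0, which mirrors the identity-transform case described above.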

LLNL has developed a new system, called the Segmentation Ensembles System, that provides a simple and general way to fuse high-level and low-level information and leads to a substantial increase in the overall performance of digital image analysis. LLNL researchers have demonstrated the effectiveness of the approach on applications ranging from automatic threat detection for airport security to natural images and cancer detection in medical CT images. Furthermore, LLNL’s approach leads naturally to a big-data approach for unsupervised problems, one able to exploit massive amounts of unlabeled data in lieu of ground truth data, which is often difficult and expensive to acquire. LLNL has filed a patent application on the new system and is interested in continuing development focused on…