Analyzing the performance and efficiency of complex facilities built from modern instrumented components, and of regional networks of such facilities, is a daunting task. Increasingly, facilities collect data from manual-entry systems as well as from diverse Internet of Things sensors and monitoring tools for specialty equipment, storage systems, computing networks, and power/cooling infrastructure. Analyzing such disparate data can be intractable. Like data from complex hospital facilities, data from LLNL’s high performance computing center spans different formats, granularities, and semantics, and handwritten data processing scripts no longer suffice to transform it into a digestible form. To address this problem, LLNL developed ScrubJay, an open source, intuitive, scalable framework for automatic analysis of disparate data. Users describe the available datasets (files, formats, database tables), then describe the integrated dataset(s) they want, and ScrubJay derives them in a consistent and reproducible way. ScrubJay may be useful for COVID-19 recovery efforts as an infrastructure analysis tool, for example for performance and availability analysis of large-scale medical facilities.
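
As a rough illustration of the kind of integration described above, the following sketch hand-derives one integrated dataset from three small, disparate ones. It does not use ScrubJay or its API; all dataset names, field names, and values are hypothetical and chosen only for illustration. It shows the sort of handwritten join logic (matching jobs to racks, then to temperature samples within each job's time span) that a framework of this kind is meant to derive automatically.

```python
from statistics import mean

# Hypothetical inputs with different granularities and semantics.
# None of these names or values come from ScrubJay itself.
jobs = [  # one record per job, with a time span and a list of nodes
    {"job": "A", "start": 0, "end": 120, "nodes": ["n1", "n2"]},
    {"job": "B", "start": 60, "end": 180, "nodes": ["n3"]},
]
node_rack = {"n1": "r1", "n2": "r1", "n3": "r2"}  # facility layout
rack_temps = [  # one record per rack per sample time
    {"time": 0, "rack": "r1", "temp_c": 20.0},
    {"time": 60, "rack": "r1", "temp_c": 22.0},
    {"time": 60, "rack": "r2", "temp_c": 25.0},
    {"time": 120, "rack": "r2", "temp_c": 27.0},
    {"time": 240, "rack": "r2", "temp_c": 30.0},
]


def job_rack_temperatures(jobs, node_rack, rack_temps):
    """Derive the mean rack temperature observed during each job."""
    rows = []
    for job in jobs:
        # Translate the job's nodes into the racks that house them.
        racks = {node_rack[node] for node in job["nodes"]}
        for rack in sorted(racks):
            # Keep only samples for this rack taken while the job ran.
            samples = [
                r["temp_c"]
                for r in rack_temps
                if r["rack"] == rack and job["start"] <= r["time"] <= job["end"]
            ]
            if samples:
                rows.append(
                    {"job": job["job"], "rack": rack, "mean_temp_c": mean(samples)}
                )
    return rows


integrated = job_rack_temperatures(jobs, node_rack, rack_temps)
for row in integrated:
    print(row)
```

Even in this toy case the join logic is specific to the three input layouts, so any change in format or granularity would require rewriting the script by hand.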

ScrubJay is distributed under the terms of both the MIT license and the Apache License (Version 2.0). Users may choose either license, at their option. The source code is available at (LLNL-CODE-759938)