A. C. Allori1,6, E. Le7, B. A. Goldstein5,6, J. Hurst6, J. Vissoci10,11, J. Routh8, H. Bosworth9, A. Drake3, J. Wood2, L. David4, A. C. Allori1,6 1Duke University Medical Center, Plastic Surgery, Durham, NC, USA 2University of North Carolina at Chapel Hill, Plastic Surgery, Chapel Hill, NC, USA 3University of North Carolina at Chapel Hill, Otolaryngology, Chapel Hill, NC, USA 4Wake Forest University School of Medicine, Plastic Surgery, Winston-Salem, NC, USA 5Duke University Medical Center, Biostatistics, Durham, NC, USA 6Duke University Medical Center, Children’s Health & Discovery Initiative, Durham, NC, USA 7Duke University Medical Center, School of Medicine, Durham, NC, USA 8Duke University Medical Center, Urology, Durham, NC, USA 9Duke University Medical Center, Population Health Science, Durham, NC, USA 10Duke University Medical Center, Global Health, Durham, NC, USA 11Duke University Medical Center, Neurosurgery, Durham, NC, USA
Introduction:
Population-health research (whether observational studies, pragmatic clinical trials, or dissemination and implementation science) requires a great deal of data infrastructure and management. Many different data sources must be aggregated and linked in order to answer the research questions at hand. Often, the data from one source are vastly different from, and possibly incompatible with, those from other sources. One promising method of dealing with such data interoperability challenges is to define project-specific data schemas as extensions of a "common data model" (CDM). In this project, we describe the specific use case of linking disparate data sources related to cleft lip/palate in order to facilitate epidemiologic and health-services research queries.
Methods:
Raw data were obtained from the North Carolina Birth Defects Monitoring Program (NCBDMP), the Patient-Centered Outcomes Research Network (PCORnet), hospital electronic health records (EHR), and a condition-specific outcomes registry (CleftKit). A project-specific schema was defined following the PCORnet CDM, with further inspiration from other common systems (e.g., PEDSnet, the OMOP CDM, and HL7 FHIR interoperability standards). Structured data were extracted from each raw source, transformed, and loaded (ETL) into a relational database (PostgreSQL/PostGIS) according to this schema, and unstructured data were stored in a parallel document database (MongoDB). Data linkage and deduplication were performed using retained PHI and/or statistical matching. A custom API was programmed in Python to facilitate clinical queries and exploratory analysis. Best practices of software development and reproducible research were followed, including use of version control for code stability, virtual environments and containers for reproducibility, Jupyter notebooks for exploration and communication, and use of highly tested open-source software.
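The ETL and linkage steps described above can be illustrated with a minimal sketch. This is not the project's code. It assumes exact (deterministic) matching on a few identifiers rather than statistical matching, and it uses Python's built-in sqlite3 in place of PostgreSQL so that it runs standalone. The `demographic` table is loosely modelled on the PCORnet CDM DEMOGRAPHIC table. The `last_name` column, the `source_link` table, the helper names, and the records themselves are hypothetical and synthetic.

```python
import sqlite3

# Hypothetical, simplified stand-in for a CDM-style schema. The real PCORnet
# CDM DEMOGRAPHIC table has more fields; source_link is an assumed extension.
SCHEMA = """
CREATE TABLE demographic (
    patid      INTEGER PRIMARY KEY,
    birth_date TEXT NOT NULL,
    sex        TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
CREATE TABLE source_link (
    source    TEXT NOT NULL,
    source_id TEXT NOT NULL,
    patid     INTEGER NOT NULL REFERENCES demographic(patid),
    PRIMARY KEY (source, source_id)
);
"""

SEX_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalize(record):
    """Transform one raw record into the shared vocabulary of the schema."""
    return (
        record["dob"].strip(),
        SEX_MAP.get(record["sex"].strip().lower(), "UN"),
        record["last_name"].strip().upper(),
    )


def load(conn, source, source_id, record):
    """Load a record, reusing an existing patient when identifiers match."""
    key = normalize(record)
    found = conn.execute(
        "SELECT patid FROM demographic "
        "WHERE birth_date = ? AND sex = ? AND last_name = ?",
        key,
    ).fetchone()
    if found:
        patid = found[0]  # same person already loaded from another source
    else:
        cur = conn.execute(
            "INSERT INTO demographic (birth_date, sex, last_name) "
            "VALUES (?, ?, ?)",
            key,
        )
        patid = cur.lastrowid
    # Record which source row maps to which patient; ignore exact repeats.
    conn.execute(
        "INSERT OR IGNORE INTO source_link VALUES (?, ?, ?)",
        (source, source_id, patid),
    )
    return patid


def build_demo():
    """Load three synthetic records from two made-up sources."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    registry = {"dob": "2015-03-02", "sex": "female", "last_name": "Doe "}
    ehr_same = {"dob": "2015-03-02", "sex": "F", "last_name": "doe"}
    ehr_other = {"dob": "2016-07-19", "sex": "m", "last_name": "Roe"}
    ids = [
        load(conn, "registry", "R-1", registry),
        load(conn, "ehr", "E-9", ehr_same),
        load(conn, "ehr", "E-10", ehr_other),
    ]
    return conn, ids
```

In this sketch, the first two records differ only in formatting, so after normalization they resolve to the same `patid`, while `source_link` keeps the provenance of each original row. A production pipeline would likely need probabilistic matching to tolerate misspellings and missing fields, which exact matching cannot handle.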
Results:
The PCORnet CDM provides a robust yet extensible backbone upon which to build project-specific data architectures. By following the PCORnet CDM design, it was possible to link data from several different sources. The CDM-based project schema greatly facilitated a variety of queries and analyses, ranging from epidemiologic description to network analysis to geostatistical analysis.
Conclusion:
This proof-of-concept investigation demonstrates that it is possible to use a common data model (CDM) as the foundation for condition-specific pragmatic research, effectiveness research, implementation science, and collaborative quality-improvement programs. The next step is to build API-driven pipelines that automatically ingest data from original sources in real time, thus performing the ETL process "in the cloud" and "bringing the analysis to the data, rather than the data to the analysis." This will be important for scaling analysis from typical large databases, as used here, to truly Big Data.