G. J. Eckenrode¹, H. Yeo¹; ¹Weill Cornell Medical College, Surgery, New York, NY, USA
Introduction:
Clinical documentation written in free-text, natural-language form is the richest source of information about a patient’s medical care. These documents are routinely used in small-scale research studies through manual chart review, but their utility in large-scale applications has been limited by the difficulty of synthesizing them into tabular data. Fortunately, many clinical documents with free-text components are entered using flexible yet widely standardized structural formats. This mix of structure and free text yields a semi-structured document format that would enable automated synthesis of these documents if the structure were known. Unfortunately, no widely available method currently exists to automatically extract structural elements from the documents themselves.
Methods:
We used a database of text-based clinical documents extracted from the electronic medical record of a single institution to obtain semi-structured documents by document type. Documents of the type “Brief Op Note” were selected for the initial trial. Using Natural Language Processing (NLP) techniques, these notes were divided into their lexical components and compared with each other to find sequences that were repeated across documents. Commonly repeated sequences were then compared and grouped into clusters based on similarity. A human reviewer then assigned clinical meaning to each cluster and verified cluster accuracy. The reviewed concept clusters were then imported, and NLP techniques were used to scan individual documents for the text associated with each clinical concept. Each document was synthesized into a human-readable report of its clinical content for human review and verification.
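The pipeline described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the tokenizer, the sequence-matching method, or the similarity measure. The sketch assumes, purely for illustration, that structural labels precede a colon on each line, that "repeated" means appearing in at least two notes, and that similarity is token-set Jaccard overlap. The toy notes and all function names (`normalize`, `candidate_labels`, `repeated_labels`, `jaccard`, `cluster`, `extract`) are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy notes; the abstract does not show real note text or layout.
NOTES = [
    "Procedure Date: 01/02/2015\nProcedure Name: Appendectomy",
    "Procedure Date: 03/04/2015\nProcedure Name: Cholecystectomy",
    "Date of Procedure: 05/06/2015\nProcedure Name: Hernia repair",
    "Date of Procedure: 07/08/2015\nProcedure Name: Colectomy\nFindings: none",
]

def normalize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return " ".join(re.findall(r"[a-z]+", text.lower()))

def candidate_labels(note):
    """Yield the normalized text before the first colon on each line (assumed layout)."""
    for line in note.splitlines():
        head, sep, _ = line.partition(":")
        label = normalize(head)
        if sep and label:
            yield label

def repeated_labels(notes, min_docs=2):
    """Keep labels that occur in at least min_docs different notes."""
    doc_freq = Counter()
    for note in notes:
        doc_freq.update(set(candidate_labels(note)))
    return sorted(label for label, n in doc_freq.items() if n >= min_docs)

def jaccard(a, b):
    """Token-set overlap between two labels (an assumed similarity measure)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def cluster(labels, threshold=0.6):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for label in labels:
        for group in clusters:
            if jaccard(label, group[0]) >= threshold:
                group.append(label)
                break
        else:
            clusters.append([label])
    return clusters

def extract(note, concept_of):
    """Map each recognized label in a note to its reviewer-assigned concept."""
    report = {}
    for line in note.splitlines():
        head, sep, value = line.partition(":")
        concept = concept_of.get(normalize(head))
        if sep and concept:
            report[concept] = value.strip()
    return report

clusters = cluster(repeated_labels(NOTES))
# Stand-in for the human review step: a reviewer names each cluster.
names = ["Procedure Date", "Procedure Name"]
concept_of = {label: name for name, group in zip(names, clusters) for label in group}
for note in NOTES:
    print(extract(note, concept_of))
```

In this toy setup the label variants "Procedure Date" and "Date of Procedure" fall into one cluster, while "Findings" appears in only one note and is dropped before clustering. A real system would likely need a more general notion of repeated sequences than colon-delimited labels.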
Results:
We evaluated 5000 randomly selected notes from the database. The algorithm found 16 clinical concept clusters, which were reviewed, confirmed, and assigned meaning by a human reviewer. These included expected concepts, such as “Procedure Date” and “Procedure Name”, as well as document control concepts, such as “Electronic Signatures” and “Last Updated”. Of these notes, 1000 were randomly selected and decomposed by the system for evaluation by a human reviewer. Each assignment of selected text to a concept was judged for appropriateness in order to calculate the sensitivity and specificity of topic matching. Preliminary analysis suggests a high degree of accuracy in matching document content to appropriate clinical concepts. Additional evaluations are currently being carried out for validation.
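For reference, sensitivity and specificity for a single concept can be computed from paired judgments as in the sketch below. The function name `sensitivity_specificity` and the example judgments are hypothetical and are not results from this study. The sketch assumes each text selection carries two booleans: whether the system assigned it to the concept, and whether the reviewer judged that it belongs to the concept.

```python
def sensitivity_specificity(predicted, truth):
    """Return (sensitivity, specificity) from paired boolean judgments.

    Assumes at least one true positive case and one true negative case
    exist in `truth`, so neither denominator is zero.
    """
    pairs = list(zip(predicted, truth))
    tp = sum(p and t for p, t in pairs)          # assigned, and belongs
    fn = sum(not p and t for p, t in pairs)      # missed, but belongs
    tn = sum(not p and not t for p, t in pairs)  # not assigned, does not belong
    fp = sum(p and not t for p, t in pairs)      # assigned, but does not belong
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical judgments for one concept, for illustration only.
system_assigned = [True, True, False, False, True]
reviewer_agrees = [True, False, False, True, True]
sens, spec = sensitivity_specificity(system_assigned, reviewer_agrees)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```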
Conclusion:
Overall, this is a novel algorithm for content analysis that enables wide-scale synthesis of important clinical information from previously inaccessible portions of the medical record. These data can now be incorporated into any project requiring chart review, regardless of scale. The algorithm also has broad applicability to any task involving semi-structured documentation, including procedure notes, medical billing, quality assessment, and literature review.