Storage Considerations for Healthcare Text Data

Get a consultant’s take on a typical big data storage use case conundrum

Owner, Forest Rim Technology

A representative from a small group of hospitals recently asked my advice about acquiring a data warehouse for her organization. She explained that the team in charge of procurement was confused about the issue and that cost, of course, was a big concern. On the one hand, the team believed a data warehouse was needed but didn’t think the organization could afford one. In fact, the rep admitted that her team wasn’t really sure what a data warehouse is, only that it sounded like something they needed to handle the organization’s swelling data intake. On the other hand, her team had recently met with a consultant who told them that by implementing a big data solution, their organization wouldn’t need a data warehouse. Hence the team’s confusion, and the reason it was reaching out for a second opinion.

The not-so-simple answer for this group is that it all depends on what they want. If an organization such as this healthcare management company wants merely to store data easily and cost-effectively, then a big data solution is the way to go. However, if they ever want to use the data they have stored, they need to think things through more carefully and consider a different infrastructure. However cost-effective the storage is, a team that actually wants to use the information it stores has several obstacles to overcome.

Consider the data itself

Assume the team in charge of this hospital group’s data management ended up keeping its healthcare records in big data storage and did nothing else with the data. Storing records, particularly over long periods of time, is a good thing to do, and big data technologies are well suited for that kind of storage. Big data lets the organization address the cost challenges of data storage effectively. But the records need to be stored carefully. As the data is placed into big data storage, it needs at the very least to have context added to the text if the text is going to be used meaningfully in analysis. After a few years of such storage, records of interest may be hard to find if they were not edited and contextualized upon entry.
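As a rough illustration of what adding context on entry could look like, here is a minimal Python sketch. It is not anything this hospital group built. The field names (patient_id, department, note_type, recorded_on) and the one-JSON-record-per-line layout are assumptions chosen for the example. The point is only that a note stored alongside its context can later be found by that context, without anyone having to read the narrative.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class ContextualizedNote:
    """A narrative note plus the context needed to find it later.

    All field names are hypothetical, chosen for illustration.
    """
    patient_id: str
    department: str
    note_type: str
    recorded_on: str  # ISO 8601 date, e.g. "2013-04-02"
    text: str


def to_storage_line(note: ContextualizedNote) -> str:
    """Serialize a note and its context as one JSON line for bulk storage."""
    return json.dumps(asdict(note), sort_keys=True)


def find_notes(lines: Iterable[str], **criteria: str) -> Iterator[dict]:
    """Yield stored records whose context fields match every criterion."""
    for line in lines:
        record = json.loads(line)
        if all(record.get(key) == value for key, value in criteria.items()):
            yield record


stored = [
    to_storage_line(ContextualizedNote(
        "P001", "cardiology", "discharge_summary", "2013-04-02",
        "Patient stable at discharge.")),
    to_storage_line(ContextualizedNote(
        "P002", "oncology", "progress_note", "2013-04-03",
        "Tolerating treatment well.")),
]
oncology_notes = list(find_notes(stored, department="oncology"))
```

Without the context fields, the only way to locate the oncology note years later would be to search the free text itself.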

Moreover, because the stored records are mainly in the form of text, the data would be almost impossible to access for analysis. Having healthcare data in textual form is advantageous for physicians who need to read the notes that pertain to their patients. However, because those notes are in narrative form, computerized analytical software cannot be used directly to analyze this data.

For instance, comparing one patient to another patient, looking at hundreds of patients collectively, and easily relating numerous healthcare records to each other would all be extremely challenging, if not impossible. As a result, while big data allows the organization to solve its cost problem, doing so introduces a new issue: the organization may not be able to do anything really useful with all that cost-effectively stored data.

And even if access to the textual data were possible, the data would not be integrated. For example, one department may identify a drug by one name, while another department uses another name for the same drug. Or a procedure may be referenced in different ways by multiple departments. Varied naming conventions aren’t really a problem when a human encounters them while reading. A human can usually tell from the context that the same thing is being referenced in different ways. A computer processing the same information is in a quite different position. It doesn’t know that the same drug, instrument, or procedure can be referenced by different names. When tasked with analyzing text data that contains multiple references to the same thing, the computer treats each name as a separate item unless it is told otherwise.
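One common way to tell the computer that several names mean the same thing is a synonym table that maps every variant to a single canonical name. The Python sketch below is a minimal illustration under that assumption. The table is hypothetical and tiny. A real one would have to be curated for the organization’s own vocabulary.

```python
import re

# Hypothetical synonym table: each key is a name a department might use,
# each value is the single canonical name chosen for analysis.
CANONICAL_DRUG_NAMES = {
    "tylenol": "acetaminophen",
    "paracetamol": "acetaminophen",
    "acetaminophen": "acetaminophen",
    "advil": "ibuprofen",
    "motrin": "ibuprofen",
    "ibuprofen": "ibuprofen",
}

# One case-insensitive pattern that matches any known name as a whole word.
_DRUG_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CANONICAL_DRUG_NAMES)) + r")\b",
    re.IGNORECASE,
)


def normalize_drug_names(note: str) -> str:
    """Replace each known drug name in the note with its canonical name."""
    return _DRUG_PATTERN.sub(
        lambda match: CANONICAL_DRUG_NAMES[match.group(0).lower()], note
    )


def drugs_mentioned(note: str) -> set:
    """Return the canonical names of all known drugs mentioned in the note."""
    return {CANONICAL_DRUG_NAMES[name.lower()]
            for name in _DRUG_PATTERN.findall(note)}


pharmacy_note = "Patient given Tylenol 500 mg"
ward_note = "Paracetamol administered at 08:00"
same_drug = drugs_mentioned(pharmacy_note) == drugs_mentioned(ward_note)
```

After normalization, the two notes can be recognized as referring to the same drug even though the departments wrote it differently. Names missing from the table pass through unchanged, so the table's coverage limits how well the data is integrated.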

A similar challenge arises when it comes to processing dates. For example, three different departments within the organization may use three different formats for dates. One department may use July 20, 1945 for a patient’s birth date, another department may use 07/20/1945, and the other department may use 45/20/07. Humans generally don’t have a problem reading these different date formats. However, accounting for them in computer-based analysis can present challenges.
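To make the date problem concrete, here is a minimal Python sketch that converts the three formats above into one date value. The department names are hypothetical, and the sketch assumes each department’s format is known in advance. It also assumes the values are birth dates, which is what lets it resolve the two-digit year: a birth date cannot lie in the future.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical mapping from source department to the date format it uses.
# "%B" (full month name) depends on the locale; English is assumed here.
DEPARTMENT_DATE_FORMATS = {
    "admissions": "%B %d, %Y",  # July 20, 1945
    "radiology": "%m/%d/%Y",    # 07/20/1945
    "pharmacy": "%y/%d/%m",     # 45/20/07
}


def parse_birth_date(raw: str, department: str,
                     today: Optional[date] = None) -> date:
    """Parse a birth date written in the given department's format."""
    fmt = DEPARTMENT_DATE_FORMATS[department]
    parsed = datetime.strptime(raw.strip(), fmt).date()
    today = today or date.today()
    # A two-digit year is ambiguous: strptime reads "45" as 2045.
    # A birth date cannot be in the future, so shift back one century.
    if parsed > today:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


as_of = date(2020, 1, 1)
birth_dates = {
    parse_birth_date("July 20, 1945", "admissions", as_of),
    parse_birth_date("07/20/1945", "radiology", as_of),
    parse_birth_date("45/20/07", "pharmacy", as_of),
}
```

All three inputs collapse to the same date value. Note how much had to be known up front: which department wrote the value, which format that department uses, and what kind of date it is. Free text alone carries none of that.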

Another area of concern for medical records is security, especially compliance with the Health Insurance Portability and Accountability Act (HIPAA), which sometimes requires data to be handled in an unorthodox fashion. Any identifying information must be carefully protected, and any protection measures taken may exacerbate data access concerns.

Understand the data’s use

For this particular small healthcare organization, the issues touched on here are really just the tip of the iceberg. For any organization whose primary objective is to store its data cost-effectively, big data technologies offer an appropriate solution. However, organizations that need to access and use the data they are storing should consider an entirely different kind of infrastructure, one that can meet both their data storage and their analytics needs.

