I’m sure you’ve heard about our Wicked Winter in Boston. The snowfall this past weekend pushed the Boston total to 108.6 inches, or over 9 feet of snow. Nudging the winter of 1995-96 from first place, 2014-15 became the snowiest Boston winter on record. As a benchmark, the average snowfall at Logan Airport in Boston is 43.5 inches.
While my yard is still covered in a foot of snow, I have talked with a number of folks in the past week who are enjoying warm spring weather, budding trees, and blooming flowers in Arizona, California, and Florida. Last week in Boston our thermometer zipped up to 40 degrees Fahrenheit, melting some of the mountains of snow piled in parking lots and along roadsides. The melting snow feeds streams and reservoirs like the Cambridge Reservoir, which is good, but it also floods low-lying areas, which is bad.
What to do?
Water and reservoirs lead me to the big data topic of the week: Data Lakes. As big data streams in with each new spring day, complex issues arise. How can we store it all? If we open the floodgates to let all of the data in, how can it be organized in a useful way? Considering that properly stored massive data is very valuable, who will manage access to it? How do we secure it? Seamless aggregation, easy movement, and pooling of data are all attributes of Data Lakes and Liquid Analytics. And since these Big Data Ecosystems are likely to be cloud-based, expect your mechanical room server to go the way of the desktop computer in the next few years.
Letting the data in
Big data streaming in from claims, clinical systems, and personal health devices can overwhelm the pre-processing and preparation required for traditional data storage in well-organized relational databases. In recent years Data Lakes, massive data storage reservoirs, have been created to accept vast data input. While it seems simple to just open the door and let the data in, there are still major hurdles: securing the data, tracking its sources, and organizing it enough to support queries and analysis. Disorganized or improperly managed Data Lakes result in unusable data repositories often referred to as Data Swamps.
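What makes a Data Lake different from a relational database is often summarized as "schema-on-read": raw records from many sources are stored as-is, and structure is imposed only when someone queries them. A minimal sketch of that idea, with entirely hypothetical record shapes and field names:

```python
import json

# Schema-on-read sketch (illustration only): raw events from different
# sources land in the "lake" untouched, with no upfront shared schema.
raw_lake = [
    '{"source": "claims", "patient": "A", "amount": 120.5}',
    '{"source": "device", "patient": "A", "steps": 8000}',
]

def read_with_schema(lake, wanted_fields):
    """Impose a schema at read time, tolerating fields a record lacks."""
    for line in lake:
        record = json.loads(line)
        # Missing fields become None instead of rejecting the record,
        # which is the trade-off versus a strict relational load.
        yield {field: record.get(field) for field in wanted_fields}

# An analyst interested only in activity data applies their own schema:
rows = list(read_with_schema(raw_lake, ["patient", "steps"]))
```

The flexibility cuts both ways: nothing is rejected at load time, which is exactly why an unmanaged lake drifts toward the Data Swamp the paragraph above warns about.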
To provide structure, Data Lakes often use an open source software framework called Hadoop, named after the toy elephant belonging to the son of one of the Hadoop developers. (Dr. Seuss fans will appreciate that Hortonworks is a leading distribution and support company for Hadoop and associated software.) Hadoop provides a framework for distributed storage and processing, so that huge amounts of data may be quickly stored, yet remain organized enough to be useful. Many Hadoop-related tools support extracting, organizing, and using data stored in Hadoop clusters.
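Hadoop's processing model, MapReduce, splits a job into a map phase that tags each piece of data with a key and a reduce phase that aggregates by key, so the work can be spread across many machines. The sketch below runs that pattern locally in plain Python; the record layout and diagnosis codes are hypothetical, and a real cluster would execute the phases in parallel across nodes:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, 1) pair for each diagnosis code in a record."""
    for record in records:
        for code in record.get("diagnosis_codes", []):
            yield (code, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per key, as a reducer does after the shuffle."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Hypothetical raw records as they might sit in a Data Lake:
records = [
    {"patient": "A", "diagnosis_codes": ["N18.6", "I10"]},
    {"patient": "B", "diagnosis_codes": ["N18.6"]},
]
counts = reduce_phase(map_phase(records))
```

Because the map output is just key/value pairs, the same logic scales from two records on a laptop to billions spread across a cluster, which is what lets a lake stay queryable as it fills.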
Aggregating the data
Previous blog posts have highlighted the value of aggregated big data. Examples include digital epidemiology, which uses social media and dispersed data to identify disease outbreaks and clusters before patients appear in clinics and emergency rooms. As noted in the June 19 blog last year, health care institutions share data as part of the High Value Healthcare Collaborative to create a “learning network” for disease management. With the March 15 announcement of the Pittsburgh Health Data Alliance, the Pittsburgh health care and university community joined the fray with a “big bet on big data.”
An alliance for health care analytics
The University of Pittsburgh Medical Center (UPMC), the University of Pittsburgh, and Carnegie Mellon University established the Pittsburgh Health Data Alliance not only to share data, but also to create a dynamic alliance for health care analytics. Like Disney, these partners plan to “re-imagine” health care by pooling EHR, prescription, diagnostic image, genome, claims, and personal health device data. The alliance will leverage their collective expertise in health science research, computer science and machine learning, and consumer commercialization. Carnegie Mellon University President Subra Suresh says, “Through this collaboration, we will move more rapidly to immediate prevention and remediation, further accelerate the development of evidence-based medicine, and augment disease-centered models with patient-centered models of care.” This big promise cannot be fulfilled with traditional clinical research alone.
To support data-driven medicine and innovation, the Pittsburgh Health Data Alliance has launched the Center for Machine Learning and Health (CMLH). This group will work at the intersection of big data analytics and clinical care, developing personalized medicine and disease models while also determining how to manage the privacy, security, and compliance issues of sharing big health-related data sets.
This is truly an exciting time, when advances in health care can be made on all fronts, not just at the traditional lab benches of medical science. Continuous, customized, preventive care will be a leap forward from the episodic, disease-centered, reactive health care common today. As you labor to enter data in your EHR, consider how valuable that data will be once it hits the Data Lake. I’m pretty sure that if a personified Health Care were standing next to a humanized Big Data, she would give him a hug and say, “I love you, man!” Big Data analytics may be the life water for a new world of preventive health care and human wellness.
Dugan Maddux, MD, FACP, is the Vice President for CKD Initiatives for FMC-NA. Before her foray into the business side of medicine, Dr. Maddux spent 18 years practicing nephrology in Danville, Virginia. During this time, she and her husband, Dr. Frank Maddux, developed a nephrology-focused Electronic Health Record. She and Frank also developed Voice Expeditions, which features the Nephrology Oral History project, a collection of interviews of the early dialysis pioneers.