Jeffrey Kim, Arnie Lund, Caroline Dombrowski
Since at least 1995, researchers have worried about how to analyze ever-growing, overwhelming, gargantuan data. For much of big data research and analysis, computational feasibility is no longer the challenge. Peter Huber's taxonomy of dataset sizes classified 10^12 bytes as "ridiculous," with storage on "robotic magnetic tape." Times have changed.
Complex computations still demand far more processing power, but now the problem is more often that little is being done with big data, or only rudimentary, surface-level analysis, which is a shame, because larger, more complex datasets are also, according to Edward Wegman, "lumpier and richer in structure." Since Wegman's article, the sheer amount of data has continued to increase, and still "traditional statistical methods ... are doomed to failure." Interrelated petabytes of data are now being analyzed by many software applications, ranging from Hadoop to Dremel to Drill to SAS. These applications are diverse and sometimes incomparable; it's the Wild West for data analysis. This large-scale data has offered enormous insight into productivity, social norms, the economy, public opinion, and many other areas. Yet conclusions are still sparse.
When researchers consider ocean acidification, the changing pH of the ocean caused by human activity, relevant data extends back almost as far as the earth itself. Ocean pH data is inferred for some samples, directly measured in others, and differs depending on geography, time, and local effects. There is no single pH of the ocean, and scientists do not measure every part of the ocean. However, despite its incompleteness, an overwhelming quantity of data still exists. In many ways, ocean data epitomizes the challenges of big data for all researchers: Data is incomplete and evolving, quality varies, and metadata is often non-standard. Yet this continually growing stream of data can inform some of the biggest policy, business, and scientific decisions and discoveries.
Current big data is just the tip of the iceberg. New technologies and increased accessibility are creating yet another wave of big data input: user-generated and sensor-created transient data. Transient data provides bursts of new data, lots and lots of bursts of data. Rather than growing exponentially within a single data stream (as blogs do), it provides temporally localized, short-term, specific data. Examples include personal health data from applications like Fitbit or cardiac defibrillators, as well as new potential data sources from dissolving, temporary data-collection units [2] and online game players' data. Sensors in carpets that detect gait, CCTV footage, constant heart-rate monitors, localized temperature measurements, and other continuous data-collection methods are becoming common.
Making a huge set of data meaningful and powerful usually requires telling stories. Ocean acidification has many effects that, for instance, propagate throughout the food web and affect coastal-city economies. A small creature called the sea butterfly (a kind of water snail) is embedded in the Arctic food chain, which supplies much of Alaska and Canada's fishing revenue. Ocean acidification was recently discovered to pit and harm the shells of sea butterflies. That discovery has led to a wave of data analysis regarding coastal economies, pollutants, storm-water runoff, biological communities, and localized pH levels. For instance, Washington State just launched a statewide initiative to study and reduce ocean acidification because of the cost of losing fishing-based income. Big ocean data is analyzed by many scientific groups, NOAA, and a variety of crowdsourcing applications, including marineexplore.com (a public platform to statistically analyze ocean, buoy, and weather data) and seafloorexplorer.org (where volunteers classify ocean-floor images to improve visual-detection algorithms and identify sea species).
For some privileged individuals, genomic analysis has provided personalized health recommendations. The insurer Aetna takes population-level diagnostic data, analyzes genomic information, and provides personalized health plans intended to reduce insurance claims, ideally improving the health of participants and reducing cost to the system. The U.S. Department of Veterans Affairs is creating an electronic health record database to investigate and understand genomic and diagnostic data for personalized health prediction. This can produce emotionally relevant, personalized care stories that affect behavior. The depth of information available for individuals, contextualized with predictive and aggregated data, also provides opportunities, and those benefits include the ability to tell more and better stories.
In another realm, journalists, governments, and nonprofits have already seized upon the potential of interactive crowdsourcing to research and write stories. Now, public datasets allow novice and distributed individuals and teams to analyze data and produce results. Crowdsourced data interpretation has also been found to be reliable and useful. Relative novices can access, map, correlate, and produce graphs and figures that demonstrate various stories using fairly simple online tools. This allows for powerful, interactive, customized storytelling using big data.
The incredibly deep data available on individuals, groups, communities, nations, and peoples makes it possible to tell more stories and tell them more authentically, more completely, and with more evidence. A few more examples:
- The Global Storytelling Survey collects stories about nonprofits changing people's lives: as of this writing, 46,366 stories. They include all kinds of metadata, such as the size of the problem the story is about, how typical the story is, and how often stories like it occur. Their goal is turning "anecdotes into usable data."
- The Human Face of Big Data project, spearheaded by Rick Smolan, uses smartphones to track small pieces of data about individuals: how old they were when they got their first pet, how far they travel every day, and so on. That aggregated data is used to identify norms and outliers.
- Researchers at the University of Washington are designing uses for thousands of pressure observations from ordinary smartphones.
- The U.N.'s Global Pulse project does sentiment analysis to identify geographical areas for targeted aid or social services. It also offers a toolkit of free and open source software applications for data mining.
The traditional news model, in which stories are created centrally and then disseminated, is now being seriously challenged. Companies that try to hold onto previous methods of news storytelling, such as The Post, CBS News, and other journalistic enterprises, struggle when they don't adapt to new technological and economic realities. Hoarding and protecting data now reduces its value, and authorship of stories has proliferated to create new consumer and producer relationships. The question becomes, how are these stories created from high-volume, heterogeneous, and fast-moving data?
Immersion is the key in big data storytelling. More data is not enough. Data keeps changing in a distributed and decentralized fashion. A constant stream of input enriches understanding, although it challenges organizations used to owning data silos. Instead, conscious immersion is needed, as with the company Numenta, whose product works on the fly, constantly analyzing a stream of information, suggesting that much conventional data storage is becoming obsolete. Another key consideration in big-data analysis is that much data is public, by intention or by accidents of design. Amateur analysis is on the rise, as is semi-professional and professional data analysis. Vast silos of proprietary data exist, and some of it is even being analyzed, but emergent techniques are likely to come from the public sector, where analysis is transparent. For that reason, the interface between analysts and data is crucial.
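Analyzing a stream "on the fly" rather than storing it is a well-established idea; a minimal sketch uses Welford's online algorithm to keep running statistics and flag anomalies without retaining the data itself. (Numenta's actual technology is proprietary and far more sophisticated; this only illustrates the general principle.)

```python
class StreamStats:
    """Running mean/variance over a data stream, no storage required
    (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        """Fold one new reading into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def is_anomaly(self, x, threshold=3.0):
        """Flag readings more than `threshold` std devs from the mean."""
        if self.n < 2:
            return False
        return abs(x - self.mean) > threshold * self.variance() ** 0.5
```

Each reading updates the summary in constant time and is then discarded, which is what lets such a system keep pace with a stream instead of warehousing it.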
Storytellers are focused not on the file structure, but rather on the data descriptions, quality, and meaning. They are deeply invested in understanding and using data to build their stories. Therefore, information must be present in the metadata: not just in intelligible file names, but in layers of metadata that provide opportunities for informed analysis. For instance, "alder pollen count per hour" might be a file that chronicles alder pollen count. But through the addition of tags like "air quality" and "pollinators" and "wind," new kinds of questions emerge. Being able to interact with the data requires conceptual frameworks for interaction. U.S. Census data has long been a leader in creating access and interfaces for interacting with its data.
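The pollen example above can be sketched in a few lines: tag the same file along several conceptual axes, and it becomes discoverable from questions its file name alone would never suggest. (The file names and tags here are hypothetical.)

```python
# Hypothetical catalog: each dataset carries layered metadata tags
# beyond its file name.
datasets = [
    {"file": "alder_pollen_per_hour.csv",
     "tags": {"air quality", "pollinators", "wind", "alder"}},
    {"file": "roadside_pm25_daily.csv",
     "tags": {"air quality", "traffic"}},
    {"file": "bee_colony_counts.csv",
     "tags": {"pollinators", "agriculture"}},
]

def find(tag):
    """Return every dataset whose metadata carries the given tag."""
    return [d["file"] for d in datasets if tag in d["tags"]]

# An "air quality" question now surfaces the pollen file, even though
# nothing in its name mentions air quality.
print(find("air quality"))   # both the pollen and PM2.5 files
print(find("pollinators"))   # the pollen and bee files
```

The point is not the code but the linkage: one extra layer of description connects records that no file-naming convention would ever bring together.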
The stories that emerge out of this data are not coherent wholes, but rather are constructed from partial and disconnected information. People can create many stories, even from the same data. Moreover, all stories focus on particular information, but stories told from big data must ignore huge chunks of data. There is simply too much data, and at too many units of analysis.
Visualization can allow humans to interface with and make sense out of a large amount of ever-changing big data. After the 2011 tsunami, many Americans became worried about radiation from Japan's Fukushima nuclear reactor disaster. Data was available, but non-experts struggled to interpret it. U.S. graduate students imported the scientific data and built a visualization map that displayed the intensity of atmospheric radiation relative to normal U.S. levels. By presenting data visually and with appropriate metadata, the nuclear data became visible, public, real-time, and immensely accessible.
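The normalization step behind such a map is simple but is exactly what makes the data legible: express each reading as a multiple of a normal background level, so a lay reader sees "2.5x normal" rather than a raw unit. The station names, readings, and baseline below are invented for illustration.

```python
# Assumed typical U.S. background dose rate, in microsieverts per hour
# (illustrative value, not an authoritative figure).
US_BACKGROUND_USV_PER_H = 0.10

# Hypothetical station readings, in microsieverts per hour.
readings = {
    "Station A": 0.11,
    "Station B": 0.09,
    "Station C": 0.25,
}

def relative_intensity(reading, baseline=US_BACKGROUND_USV_PER_H):
    """Express a reading as a multiple of normal background."""
    return reading / baseline

for station, value in readings.items():
    print(f"{station}: {relative_intensity(value):.1f}x normal background")
```

A map colored by this ratio answers the question non-experts actually have ("is this unusual here?") instead of the one the raw units answer.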
That interface, where people are able to examine and analyze data, is the point where the human urge for storytelling and creativity allows the construction of stories. As ocean-acidification research shows, big data permits the contextualization of discrete scientific findings (such as sea butterfly shell erosion) with macro-level analysis (such as the effects of coastal fishing on the U.S. economy and food supply).
Creating meaningful information and stories out of big data requires retrieval and analysis of relevant data, and relevance is determined by understanding what the data is about. Metadata, or data about data, provides that conceptual handle. Metadata can inform about the type, quality, and content of data. Context matters.
The key to human interaction with big data will be metadata. The terms and information that frame big data are vital to reducing wasted time, increasing usefulness of results, and encouraging experts and citizens alike to make meaning out of data. New presentations of data will be needed. At McGill University, genomic researchers have developed a color-coded game (Phylo) that assists in solving complex gene analysis. By crowdsourcing complex analysis with simple, familiar color as the metadata, they have increased the speed and depth of genomic research. The researchers say humans have innovative and unique solutions that can assist in understanding DNA patterns and DNA activation.
Whether a standardized taxonomy or a free-for-all folksonomy, metadata provides the framework around which analysis can happen. Quality of data matters, and data scrubbing is a widely recognized need. But metadata provides an entry point to understand and work with data. Without metadata, big data is useless. Good metadata allows a user to segment data so as to analyze only what is relevant, and provides linkages across conceptually related records. As the data (and metadata) continue to expand, metadata becomes more and more important for access and use of data to tell better and more powerful stories.
Healthcare has recognized this need and met it head-on, with a variety of taxonomies, codes, and connective, translational metadata. Yet healthcare still struggles with big data, and is in the process of standardizing and integrating yet more data and metadata. This speaks volumes as to the future of big data research.
There are concerns about how big data will be used, and how people's privacy and rights will be respected (or not). However, for critical audiences, other storytelling concerns arise. For instance, having more data can lead to more spurious or unwarranted stories. Even correctly applied analyses with well-tested tools can yield nonsensical or useless results; witness the dead salmon put into an fMRI scanner, whose data seemed to reveal that it responded to photographs [6]!
This brings up an important point. More stories can be a wonderful thing, bringing to light new and amazing features of individuals, communities, nations, and the world. However, the context of that data and the process used to reach that story will drastically influence the stories told. Creating useful interfaces that connect people to data and that track quality, source, and other information is crucial. Interfaces will also allow more and better stories to be told with innovative and unanticipated appropriations of technology and data.
Being able to connect data points and tell stories will be one of the biggest adventures with big data. Funding and constructing those interfaces will lay the groundwork for life-changing and dramatic storytelling, for audiences, participants, and those telling the stories.
Our top three recommendations for industry are:
- Immerse in real-time data, assuming constant change. Ditch the silos.
- Make your data public with good metadata. Transparency is valuable, and you can add value to your data with rich metadata and strong visualizations.
- Focus on the user. Encourage creative, individual interaction with data and technology.
2. Gallagher, J. 'Melt in the body' electronics devised. BBC News Health. Sept. 27, 2012; http://www.bbc.co.uk/news/health-19737125
6. Bennett, C., Baird, A., Miller, M., and Wolford, G. Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Poster at the Human Brain Mapping Conference 2009.
Jeffrey Kim (Ph.D.) studies technology innovations in engineering and digital games. He has been at the University of Washington for 13 years. His research has focused on the role of storytelling in innovation and collective problem solving, as demonstrated through digital games and other social media interactions.
Arnie Lund is the GE Global Research Digital Industries Technology Lead and UX Innovation Lab manager. He is a member of the ACM SIGCHI Academy and is an HFES Fellow. His current work is focused on natural user experiences, decision support and analytics, and social computing as they enable the industrial Internet.
Caroline Dombrowski (MLIS, M. Phil.) predicts that digital game designers will be gurus for interactive online experience in the next few years. She is a research project manager at the University of Washington. Her background is in nonprofits, healthcare, and taxonomies.
©2013 ACM 1072-5220/13/05 $15.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2013 ACM, Inc.