Hello and welcome to my new column! I am an assistant professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. I love to study how people interpret and manipulate visualizations to make sense of large, complex datasets. In this column, I will explore alternative ways of thinking about performance in data analytics spanning multiple areas of computer science, including visualization, human-computer interaction, and data management. I am particularly interested in exploring how our understanding of human behavior can inform the design of new tools to empower a more diverse range of people to achieve their data analysis goals.
Many researchers and developers are interested in democratizing data science by creating tools that anyone can use to analyze, model, and generally make sense of complex data [1,2]. The motivation behind this objective is to increase the general public's access to data by giving them the ability to extract insights from data by themselves. From addressing climate change to preserving Indigenous knowledge to holding police officers accountable, if more of us could enrich our stories with supporting data, we could further emphasize the urgency of our causes and illustrate the viability of potential solutions.
To broaden access to data science, we first have to understand the challenges people run into with current data science tools. With a deeper understanding of the people we want to help, we can develop new technology that lowers current barriers to entry in data science. My focus is on broadening the user base for visualization languages, including D3, ggplot2, and Plotly, which enable people to write programs that create visualizations such as bar charts, scatterplots, and geographic maps. My research centers on D3. Originally published in 2011, D3 revolutionized the way visualization languages are designed: by enabling users to implement visualizations directly within the Document Object Model (DOM) of a webpage, D3 could integrate with the front-end development stack in a way that older visualization languages could not. At 10 years old, D3 is also one of the most mature languages produced by the visualization research community that is still actively used.
I also find it interesting that much of what researchers know about D3 is collected in an informal, haphazard way. For example, despite D3's steep learning curve, researchers have not run rigorous experiments to measure these difficulties. Many researchers consider it common knowledge that D3 users program new D3 visualizations by copying and adapting existing examples, but no academic studies have been published to document this implementation process. In other words, we researchers have a lot of hypotheses about how people use D3 that haven't turned into rigorous research yet.
My work fills this gap in our formal knowledge of how people use D3 and provides an empirical answer to the question: What is hard about using visualization languages like D3, and what specific actions can researchers and language developers take to address these challenges? To answer this question, we can analyze the wealth of information that D3 users have already generated on the Web over the past 10 years. To this end, I downloaded 37,815 posts from Stack Overflow to form a corpus for my analysis. I then followed a four-step process to analyze this corpus (Figure 1).
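The corpus-building step could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline from the study: it assumes posts arrive in the format of the public Stack Exchange data dump's Posts.xml, where each row element carries its tags as a single attribute string.

```python
# Illustrative sketch: filtering Stack Overflow posts tagged d3.js from a
# Stack Exchange data-dump-style Posts.xml. The schema here (row elements
# with Id, Tags, and Body attributes) mirrors the public data dump, but it
# is an assumption for illustration, not the study's actual pipeline.
import xml.etree.ElementTree as ET

SAMPLE_POSTS_XML = """<posts>
  <row Id="1" Tags="&lt;d3.js&gt;&lt;javascript&gt;" Body="How do I draw a bar chart?"/>
  <row Id="2" Tags="&lt;python&gt;&lt;pandas&gt;" Body="How do I merge two frames?"/>
  <row Id="3" Tags="&lt;d3.js&gt;&lt;svg&gt;" Body="Why is my axis not rendering?"/>
</posts>"""

def build_corpus(xml_text, tag="d3.js"):
    """Return the posts whose Tags attribute contains the given tag."""
    root = ET.fromstring(xml_text)
    corpus = []
    for row in root.iter("row"):
        # After XML unescaping, Tags looks like "<d3.js><javascript>".
        if f"<{tag}>" in row.get("Tags", ""):
            corpus.append({"id": row.get("Id"), "body": row.get("Body")})
    return corpus

corpus = build_corpus(SAMPLE_POSTS_XML)
print(len(corpus))  # 2 of the 3 sample posts are tagged d3.js
```

On the real dump, the same filter would run over millions of rows; storing only the matching posts keeps the working corpus at a manageable size.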
Figure 1. Four major phases for analyzing D3 usage through Stack Overflow posts.
Figure 2. Examples of D3 visualizations discussed on Stack Overflow.
My 817-post sample, however, is just a fraction of the full 37,815-post dataset, so how do I know that D3 users actually follow these implementation practices? To answer this question in the filter phase, I used a mix of natural language processing and data mining techniques to turn my manual analyses into scripts that could be executed on every post in the full corpus. For example, to see whether existing examples really are central to the D3 implementation process, I searched for Observable, Bl.ocks.org, and D3 documentation links within the HTML content of each post.

In the count phase, I found that 14 percent of posts in the full corpus include references to Bl.ocks.org, Observable, or the D3 gallery visual index, which is consistent with my manually coded dataset.

To further test this hypothesis in the compare phase, I analyzed the types of visualizations that posters mentioned on Stack Overflow and compared them with the visualizations in the D3 design gallery referenced in the official D3 documentation. I found that the top visualizations discussed on Stack Overflow were the same as those in the D3 gallery, the top five being network graphs, trees, geographic maps, bar charts, and line charts (Figure 2). I observed relatively few Stack Overflow users experimenting beyond these five visualization types, even though the visualization community has created more than 30 kinds of visualizations that could be of interest to these users. Admittedly, correlation does not imply causation, but nonetheless there is a clear relationship between what D3 users discuss on Stack Overflow and the types of visualizations they regularly see in the D3 documentation.
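The filter-phase heuristic described above can be sketched like this. It is a simplified illustration that assumes each post is available as an HTML string; the domain list and the link-matching regex are my assumptions for demonstration, not the exact rules used in the study.

```python
# Illustrative sketch of the filter and count phases: flagging posts whose
# HTML links to known D3 example galleries or documentation. The domains
# and the regex below are assumptions for illustration only.
import re

EXAMPLE_DOMAINS = ("observablehq.com", "bl.ocks.org", "d3js.org")

def references_examples(post_html):
    """True if the post links to any known D3 example gallery or docs site."""
    links = re.findall(r'href="([^"]+)"', post_html)
    return any(domain in link for link in links for domain in EXAMPLE_DOMAINS)

# Tiny sample standing in for the full 37,815-post corpus.
posts = [
    '<p>I adapted <a href="https://bl.ocks.org/mbostock/3887118">this scatterplot</a>.</p>',
    '<p>My transition stutters, see the snippet below.</p>',
    '<p>The docs at <a href="https://d3js.org/">d3js.org</a> were unclear on scales.</p>',
]

flagged = [references_examples(p) for p in posts]
share = sum(flagged) / len(posts)
print(f"{share:.0%} of sample posts reference example galleries")
```

Running a script like this over every post in the corpus yields the kind of aggregate statistic reported above, which can then be checked against the manually coded sample for consistency.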
There is an important lesson for the visualization community to learn from these results: We (probably) influence the breadth and complexity of visualizations that users create through our visualization example galleries. If we want to see more users take advantage of the full expressiveness of D3, then we may want to start by broadening the range of visualizations represented in existing galleries, since these examples are often the source material for creating new visualizations. This takeaway is somewhat surprising from a visualization languages perspective, because it suggests that enhancing a user's experience with a language may not require modifying the user's toolset or the language itself. My hope is that this research will lead researchers and developers in new directions for broadening access to visualization languages, as part of the larger effort to democratize data science.
1. Shang, Z., Zgraggen, E., Buratti, B., Kossmann, F., Eichmann, P., Chung, Y., Binnig, C., Upfal, E., and Kraska, T. Democratizing data science through interactive curation of ML pipelines. Proc. of the 2019 International Conference on Management of Data. ACM, New York, 2019, 1171–1188.
5. Satyanarayan, A., Lee, B., Ren, D., Heer, J., Stasko, J., Thompson, J., Brehmer, M., and Liu, Z. Critical reflections on visualization authoring systems. IEEE Trans. on Visualization and Computer Graphics 26, 1 (2019), 461–471.
6. Battle, L., Feng, D., and Webber, K. Exploring visualization implementation challenges faced by D3 users online. arXiv preprint arXiv:2108.02299 (2021); https://arxiv.org/abs/2108.02299
Leilani Battle is an assistant professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. Her research focus is on developing interactive data-intensive systems that aid analysts in performing complex data exploration and analysis. She holds a Ph.D. and an M.S. in computer science from MIT and a B.S. in computer engineering from UW. [email protected]
Copyright held by author
The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.