Authors:
Laura Koesten, Jude Yew, Kathleen Gregory
We live in data-centric times. Many of the world's greatest challenges, from advancing science to improving government services and tackling climate change, require access to large amounts of data. In recent years, avenues that allow individuals to share and use data online have proliferated. Scientists use open platforms, such as GitHub or Zenodo, to collaborate on data projects and archive them. Governments establish data hubs for publishing and facilitating data use, and policymakers and industry develop "data spaces" for safe sharing across multiple parties. There are also more-informal platforms for data sharing and coding within the data science community, such as Kaggle for machine learning competitions, or Hugging Face, which focuses more on natural language processing technologies.
Using data created by others, however, is still challenging. Despite the increase in the number of data-sharing websites and platforms, there is still a lack of understanding about social interactions and collaborations when working with data. We argue that attention should be paid to these and other aspects of human-data interaction, focusing on the issues and affordances of how people actually use and interact with data.
→ Shifting focus from static data use to dynamic interaction enables more diverse ways of exploring and conceptualizing how people work with data.
→ Contexts are constantly evolving, and data evolves with them. Capturing the dynamic nature of data processes requires rethinking how to document them.
In this article, we draw on insights gained from our analysis of data practices on Kaggle [1]. This included creating a questionnaire to examine data discovery and evaluation behaviors, observing platform capabilities, and analyzing metadata. We also draw on prior work by ourselves and others [2,3,4] to illustrate how data interaction could be facilitated and to suggest the idea of data as a process.
We start with Kaggle as an example of a platform where diverse data interactions occur. We structure our argument around four themes, addressing factors related to the data, such as descriptions and quality conceptions, and factors related to engagement (e.g., discussion forums and colocated tools). We conclude by offering recommendations for data publishers, platforms, and the data science community, moving away from seeing data as static objects in order to facilitate more-fluid interactions.
Founded in 2010, Kaggle now has more than 15 million users. It initially focused on machine learning competitions and has since transformed into a platform for data science learning and collaboration. Users exchange datasets and code, collaborate on projects, and compete in data science challenges.
As of January 2024, more than 288,000 public datasets of different formats and more than 5,000 competitions have been hosted on Kaggle. The platform offers integrated notebooks for collaborative cloud-based data analysis, enabling users to create, share, and collaborate on code, data, and text. They can also browse, filter, and search datasets based on various criteria like file size, type, license, and tags. Kaggle is an interesting case study for several reasons:
- It is the largest platform of its kind.
- It hosts a large data science community.
- It directly supports data work on the platform itself, whereas other platforms, such as GitHub, focus on code or data publishing.
- It implements a diverse range of user-centric features.
Kaggle offers users a setting to share data and tools, collaborate on analyses, and sometimes team up to compete. In contrast to other online communities, such as peer-production systems, citizen science, or Q&A sites, there is limited literature on data platforms. We use Kaggle as an example to explore human-data interactions, highlighting the use of data beyond its original purpose.
From Data Use to Data Interaction
Traditionally, users first find data, download it, and work on it locally, possibly in a team. In this context, a dataset is understood as a resource shared by someone online or in a repository with the purpose of making it accessible to others. The data reuse we see on Kaggle is more fluid. Data, code, documentation, and reviews are available in the same place in the cloud. This allows potential users to try out data on the fly, talk to others about it, and make more-confident decisions about its suitability for new tasks.
Kaggle offers an example of how such an environment could be designed. Our analysis illustrates how the design of data-sharing and reuse platforms affects data use itself. The most popular datasets on Kaggle have colocated code and long, diverse discussions attached to them, suggesting that combining data, code, documentation, and discussion can lead to a higher degree of use. We also see that Kaggle and similar platforms do not cater only to experts. They open up data science practices to a wider range of learners by providing more real-time context, descriptions, and interactions with community members.
These types of discussions and community interactions are one way of providing context to data. Research has repeatedly shown that datasets cannot be isolated from their context without losing meaning [2]. They are situated within a platform, a community, and norms of production and use [5]. Context is also important in meeting what can be thought of as an individual's data needs, which are shaped by relationships between roles, professional norms, and intended uses for data [6].
Clearly defining context as it pertains to data remains challenging [2], although there is general consensus that data use requires contextual documentation [3,7]. This includes information from the data, such as relevance, usability, and quality criteria, as well as information about the data, including where and how it was created [6]. Contextual details can take the form of metadata, information about the reputation of the publishing institution, repository, or individual, and indications of prior data use, as well as social context provided via user interactions. We observed that context is also documented via interactions on Kaggle, where people review and discuss data just like any other digital resource. These interactions provide the social context of that data, such as its popularity and how frequently it is used. Kaggle enables dataset ranking through community feedback and provides integrated notebooks for browser-based data exploration. The dynamic nature of context is visible through these features.
Documenting Dynamic Contexts: Toward Data as Processes
On platforms such as Kaggle, datasets are used in varied ways: they are cleaned, merged, and placed back online. Data changes as it is updated and extended but also as it is used in other contexts (e.g., joining tables to produce new datasets that will find new uses). This suggests that describing data once, at the point of creation or publication, might not suffice. Metadata and context also shift when data takes on different meanings in different data-use scenarios. This can become particularly challenging when integrating data from diverse sources and levels of granularity as different contexts collide. Transformations of data often remain hidden. Some of them, such as the remixing of different data or the details of what gets added or removed during different uses, can potentially be traced automatically. Changes in context and meaning, however, often remain invisible and are difficult to document.
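To make the idea of automatically traceable transformations concrete, the following is a minimal sketch, not a description of how Kaggle or any other platform works. It assumes a table is a list of dictionaries, and the names `TracedTable`, `Step`, and `apply` are hypothetical. Each transformation is recorded with the number of rows before and after and the columns it added or removed. Changes in meaning and context are exactly what such a log cannot capture.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, object]
Table = list[Row]


@dataclass
class Step:
    """One recorded transformation: what ran and how the table's shape changed."""
    name: str
    rows_before: int
    rows_after: int
    columns_added: list[str]
    columns_removed: list[str]


def _columns(rows: Table) -> set[str]:
    """Union of the keys that appear in any row."""
    cols: set[str] = set()
    for row in rows:
        cols.update(row)
    return cols


@dataclass
class TracedTable:
    """Hypothetical wrapper: a table plus the history of transformations applied to it."""
    rows: Table
    history: list[Step] = field(default_factory=list)

    def apply(self, name: str, transform: Callable[[Table], Table]) -> "TracedTable":
        """Run `transform` and return a new table whose history includes this step."""
        before = _columns(self.rows)
        new_rows = transform(self.rows)
        after = _columns(new_rows)
        step = Step(
            name=name,
            rows_before=len(self.rows),
            rows_after=len(new_rows),
            columns_added=sorted(after - before),
            columns_removed=sorted(before - after),
        )
        return TracedTable(new_rows, self.history + [step])


# Toy data, invented for illustration only.
raw = TracedTable([
    {"city": "A", "temp_c": 21.5},
    {"city": "B", "temp_c": None},
])
cleaned = raw.apply(
    "drop rows with missing temp_c",
    lambda rows: [r for r in rows if r["temp_c"] is not None],
)
enriched = cleaned.apply(
    "add temp_f",
    lambda rows: [{**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in rows],
)
for step in enriched.history:
    print(step)
```

The log records structural change only. Why a row was dropped, or what a derived column means in its new setting, would still need to be written down by a person.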
The different questions that data is used to answer might change the meaning of the data beyond the purpose for which it was originally created. Literature that has examined the reuse of scientific data illustrates the extent to which data is influenced by contexts of creation and use, which define what counts as data, what gets disregarded, and how data is organized and processed to fit various purposes [7]. Using data for different purposes requires transformation and remixing—not just of data variables but also of context and meaning. Current metadata schemas arguably lack elements necessary to capture these interactional elements. These different types of changes—meaning, context, and the structure of data—foreground the idea of what we refer to as "data as a process," demonstrating how data is continually changing as it is being used.
How these changes can be documented in a meaningful way is a key question for the future of data reuse. This requires not just interaction with the data but also social interactions, such as asking questions and engaging in a dialogue around the data, and social signals (e.g., in the form of ratings). These social interactions and conversations themselves make the dynamic nature of data visible. In many established data repositories, such as those in science or government, however, such conversations are less supported and less common on the platforms themselves.
Platforms in other domains, such as those for open source code, have adapted to the need to collaboratively build up contextual knowledge and adapt content. Kaggle offers some options for documenting dynamic contexts of data, but we believe there are other interesting avenues to consider. Just like audio or video, data should be easily findable, flexibly partitioned, remixed, repurposed, and shared as part of the usage cycle, beyond notebooks and metadata, discussions, and documentation. We should be able to find, reuse, and provide attribution for a dataset, and track its evolution in the same way that we can for a Wikipedia article, for example.
Current data use practices and the tools attached to them feel too static and rigid to create the same type of effects we have noticed in other user-generated content that has democratized production and online consumption. One new possibility could be to document different levels of reuse, depending on the number of times data is transformed based on prior analysis efforts. To some extent, versioning also allows us to keep track of such interactions. Another way of making changes in data visible could be to indicate how a particular dataset was derived, by showing and summarizing the differences between the reused dataset and its original source alongside documentation. This could facilitate more-fluid usage of data, going beyond existing efforts for data summaries [8].
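As one illustration of what summarizing the differences between a reused dataset and its source could look like, here is a small sketch. It assumes both tables are lists of dictionaries that share a unique identifier column. The function name `summarize_derivation` and the toy data are hypothetical, and a real platform would likely need something more robust.

```python
from typing import Any

Row = dict[str, Any]


def summarize_derivation(source: list[Row], derived: list[Row], key: str) -> dict[str, list]:
    """Compare a derived table with its source, matching rows on the `key` column."""
    source_cols = {c for row in source for c in row}
    derived_cols = {c for row in derived for c in row}
    source_by_key = {row[key]: row for row in source}
    derived_by_key = {row[key]: row for row in derived}

    shared_keys = source_by_key.keys() & derived_by_key.keys()
    shared_cols = source_cols & derived_cols

    # A row counts as changed if any column present in both tables differs.
    changed = sorted(
        k for k in shared_keys
        if any(source_by_key[k].get(c) != derived_by_key[k].get(c) for c in shared_cols)
    )
    return {
        "columns_added": sorted(derived_cols - source_cols),
        "columns_removed": sorted(source_cols - derived_cols),
        "rows_added": sorted(derived_by_key.keys() - source_by_key.keys()),
        "rows_removed": sorted(source_by_key.keys() - derived_by_key.keys()),
        "rows_changed": changed,
    }


# Toy data, invented for illustration only.
source = [
    {"id": 1, "region": "north", "count": 10},
    {"id": 2, "region": "south", "count": 20},
    {"id": 3, "region": "east", "count": None},
]
derived = [
    {"id": 1, "region": "north", "share": 0.33},
    {"id": 2, "region": "South", "share": 0.67},
    {"id": 4, "region": "west", "share": 0.0},
]
summary = summarize_derivation(source, derived, key="id")
for label, items in summary.items():
    print(f"{label}: {items}")
```

A summary like this could sit next to a dataset's documentation, so that a prospective user sees at a glance how far a derived dataset has moved from its origin.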
We know that there are many different ways of working with data, due in part to the size of the data and storage requirements. However, we believe that bringing data and its uses together by making processes more transparent can support human-data interactions.
This can involve working with data in the cloud, where data is dynamically updated in notebooks or code. It can also involve thinking about how to introduce transparency for data that is not (or cannot be) openly shared. Shared data analyses of confidential data, at an aggregate level, are one way to make data use more visible. These analyses could be colocated with openly available metadata. Even if the data cannot be shared, data elements and traces of uses could be colocated. Bringing together data and its different uses could also help address challenges like inaccessibility due to missing or broken download links and the spread of data across multiple locations.
Kaggle is mainly used for learning exploratory data analysis and creating prototype solutions [1]. This model, however, does not directly apply to complex production environments that require debugging, testing, and deployment workflows integrated into existing IT infrastructure. While notebook environments are useful for initial and small-scale projects, they may not fully handle the complexity of real-world data science projects. Overall, we see the role of Kaggle as not only a competition platform but also a hub to gain access to useful learning resources and expertise in data science. Learning environments might not directly transfer to professional environments in which data is used in new contexts. At the same time, the use of Kaggle as a place to find and experiment with data, often in teams, resonates with other data-sharing platforms, such as GitHub, open government, and open science portals [9,10].
Thinking about the importance of human-data interaction and viewing data as a process has concrete implications for data publishers, as well as for environments supporting data science practices and teamwork. We believe Kaggle is an interesting example beyond its specific purpose, as many data portals are reconsidering their goals and efforts to shift from publishing more data to supporting data use and building communities. Within organizations (or across them, along supply chains and partner networks) the use of data is playing a bigger role. The challenges are in many ways similar to those of public platforms. There are decisions to be made about what data to share with other business units or partners and how to design an environment where people can make sense of shared datasets and put them to use. Viewing data as a process can help shape our thinking about how communities can form around datasets, how sensemaking as an iterative process can be supported by interaction design, and how data reuse itself is dynamic.
Implications for data publishers. Our arguments have direct implications for those developing data publishing software or managing existing data publishing programs. We advocate for a more user-centric idea of context and particularly of data quality to make it easier for people to judge whether a dataset fits their purpose. We recommend that other parties involved in the supply of data experiment with platform capabilities, including doing the following:
- Encouraging interaction/community development by defining clear tasks or questions provided alongside a dataset, open to a community who can engage around these tasks.
- Allowing and encouraging community ratings, such as votes.
- Setting up infrastructure and tools to capture and analyze user engagement and interaction flows.
- Considering meaningful data use indicators and metrics, assessing data use continuously, and acting on the insights.
- Providing rich documentation and other resources to describe data, including text explanations, key column headers and their expected value ranges, known issues, and so on.
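The last recommendation in the list above can be sketched in code. The following is an assumption-laden illustration rather than an established schema: `ColumnDoc` and `check_against_docs` are hypothetical names. It shows how documented column descriptions, expected value ranges, and known issues could be stored alongside a dataset and used to flag where the data departs from its own documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnDoc:
    """Hypothetical documentation record for one column of a dataset."""
    name: str
    description: str
    minimum: Optional[float] = None  # documented lower bound, if any
    maximum: Optional[float] = None  # documented upper bound, if any
    known_issues: str = ""


def check_against_docs(rows: list[dict], docs: list[ColumnDoc]) -> list[str]:
    """Return readable notes wherever a value is missing or outside its documented range."""
    notes: list[str] = []
    for doc in docs:
        for index, row in enumerate(rows):
            value = row.get(doc.name)
            if value is None:
                notes.append(f"row {index}: '{doc.name}' is missing")
            elif doc.minimum is not None and value < doc.minimum:
                notes.append(
                    f"row {index}: '{doc.name}'={value} is below the documented minimum {doc.minimum}"
                )
            elif doc.maximum is not None and value > doc.maximum:
                notes.append(
                    f"row {index}: '{doc.name}'={value} is above the documented maximum {doc.maximum}"
                )
    return notes


# Toy documentation and data, invented for illustration only.
docs = [
    ColumnDoc("age", "Respondent age in years", minimum=0, maximum=120),
    ColumnDoc("score", "Normalized score", minimum=0.0, maximum=1.0,
              known_issues="Not collected for every respondent"),
]
rows = [
    {"age": 34, "score": 0.7},
    {"age": 150, "score": 0.2},
    {"age": 28},
]
notes = check_against_docs(rows, docs)
for note in notes:
    print(note)
```

Notes like these could be shown to prospective users as part of a dataset's description, helping them judge whether the data fits their purpose.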
Data providers have arguably so far missed the opportunity to use approaches and techniques that have consistently proved their value in consumer-oriented domains, from online search to retail and user-generated content platforms. Our understanding of what makes data easier to use is still limited, especially when it comes to tangible evidence on the role of various publishing guidelines and platform affordances on engagement [9].
Implications for supporting human-data interaction in data science environments. Many professionals are not trained data scientists but increasingly have to use data on the job and need to gain more-specialized skills to be productive. As a community of data science learners, Kaggle is not that different from any organization adapting its staff skills to meet demands raised by digitalization and remote work. We recommend that organizations undergoing such transformations do the following:
- Facilitate learning through shared resources and community engagement.
- When designing for dataset discovery, such as in an enterprise data lake, remember that users, especially inexperienced ones, have varied search criteria based on their data needs.
- Colocate data and tools to enable ease of understanding, access, and real-time feedback.
- Provide (personalized) descriptions and explanations of datasets (e.g., a dataset's biography, including accounts of prior usage and example code).
- Invest in cleaning key datasets to allow learners to focus on data exploration and analysis.
Open data platforms can play a role in training the next generation of data practitioners. In many ways, Kaggle looks like an online community of practice, but there is not much research examining how existing frameworks and insights on learning and apprenticing in practice-based communities apply to data science. As much data science education happens online, there is a wealth of experience and best practices to draw from to inform the design of data publishing platforms.
Implications for data science teamwork. This applies to the broader context of remote data-centric work with dispersed team members and diverse expertise. The scale of problems and infrastructure requires us to think about solutions that support people with a range of backgrounds and expertise, working together with minimal direct contact. This and similar efforts prepare organizations for future challenges in remote data science work. Our recommendations for organizations using data science are the following:
- Facilitate discussions about access and visibility levels.
- Move away from designing for typical data users and instead think about typical data needs.
- Recognize diverse data needs within teams.
- Provide a platform for multiple perspectives to converge on understanding data risks, value, affordances, and ethical considerations.
- Consider competitions as a framework to encourage teamwork and learning.
1. Koesten, L., Yew, J., and Gregory, K. Kaggle data community survey: Data summary/analysis report. Zenodo, Apr. 19, 2024; https://zenodo.org/records/10998910
2. Faniel, I.M., Frank, R.D., and Yakel, E. Context from the data reuser's point of view. Journal of Documentation 75, 6 (2019), 1274–97.
3. Birnholtz, J.P. and Bietz, M.J. Data at work: Supporting sharing in science and engineering. Proc. of the 2003 ACM International Conference on Supporting Group Work. ACM, New York, 2003, 339–48.
4. Koesten, L., Gregory, K., Groth, P., and Simperl, E. Talking datasets—Understanding data sensemaking behaviours. International Journal of Human-Computer Studies 146 (2021), Article 102562.
5. Neff, G., Tanweer, A., Fiore-Gartland, B., and Osburn, L. Critique and contribute: A practice-based framework for improving critical data studies and data science. Big Data 5, 2 (2017), 85–97.
6. Gregory, K. and Koesten, L. Human-Centered Data Discovery. Springer Nature, Cham, Switzerland, 2022.
7. Borgman, C.L. Big Data, Little Data, No Data: Scholarship in the Networked World. MIT Press, Cambridge, 2015.
8. Phillips, D. and Smit, M. Toward best practices for unstructured descriptions of research data. Proc. of the Association for Information Science and Technology 58, 1 (2021), 303–14.
9. Koesten, L., Vougiouklis, P., Simperl, E., and Groth, P. Dataset reuse: Toward translating principles to practice. Patterns 1, 8 (2020), Article 100136.
10. Sicilia, M.-A., García-Barriocanal, E., and Sánchez-Alonso, S. Community curation in open dataset repositories: Insights from Zenodo. Procedia Computer Science 106 (2017), 54–60.
Laura Koesten is a postdoctoral researcher at the University of Vienna and an affiliate researcher at King's College London. She researches human-data interaction, aiming to understand data-centric work practices, data reuse, sensemaking of data and models, and collaboration among different user groups. She has a background in human factors and a Ph.D. in computer science from the University of Southampton.
Jude Yew is a user experience researcher at Google working on Fuchsia, a next-generation operating system. His current work focuses on improving developers' experiences on Fuchsia through tooling and workflows.
Kathleen Gregory is a researcher at the Centre for Science and Technology Studies at Leiden University in the Netherlands. Her current research focuses on critically investigating scholarly communication practices and infrastructures, including those related to open science, research data, and peer review.
This work is licensed under a Creative Commons Attribution International 4.0 License.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2025 ACM, Inc.