Data in the wild

Authors:
Chee Ang, Ania Bobrowicz, Diane Schiano, Bonnie Nardi

In recent years, the proliferation of online services such as social networking, gaming, Internet fora, and chat rooms has provided academic and corporate researchers opportunities to acquire and analyze large volumes of data on human activity and social interaction online. For example, massive corpora from Facebook, Twitter, and other user-generated data sources are being harvested "in the wild" on the Internet. Research based on such "found data" is increasingly common, as software tools become sophisticated enough to allow researchers to forage for data at fairly low cost. Researchers can now "not merely do more of the same, but in some cases conduct qualitatively new forms of analysis" [1]. Novel computational methods are being developed to integrate multiple distinct and often heterogeneous datasets (e.g., mobile location data and Twitter feeds) in the hope that important new relationships will emerge that cannot be found using a single data source. The growing tendency to apply new machine learning and other data-mining techniques to search for emerging patterns in existing datasets, rather than generate new data for planned hypothesis testing or qualitative exploration, is transforming the way research is being conducted.

When we say "data in the wild," we point to the fact that the datasets are not constructed and designed with research questions in mind, as in conventional surveys, censuses, interviews, logs, observational studies, and experimental studies. Conventional datasets are generally planned according to conceptual or theoretical interests, with articulated research questions. Researchers decide which attributes, variables, and data types are of interest prior to data collection. Conversely, if all the researcher has is, for example, a Twitter feed, it is not possible to ask questions such as "What is the political affiliation of this poster?" Using data-mining approaches, researchers can ask questions of the data, but the questions must be scoped to exactly what is available. Adhering to material that emerges in foraged data rather than more expansively creating data apropos a research question leads to difficulties deploying theory. Theories propose elements and relations that are tested with hypotheses or checked against ordered observations. Data must speak to the semantics of the elements and relations. A found corpus may or may not have pertinent data.

Trivial or obvious results may be reported in research centered on data-mining analysis because often such results are all the investigator could squeeze from the data. Findings that more contextualized methodologies easily capture may require more extensive processing of a large dataset or may evade the researcher altogether. Even grounded theory, which begins with data, often requires further data collection once interesting theoretical problems emerge. With found data, a loss of philosophical and epistemological grounding occurs as research is conducted with data over which the investigator has little control. Simple things, such as knowing the age, gender, educational background, and so forth, of those in a sample are often impossible. This is not to say that data mining does not produce valuable results, but rather to temper uncritical enthusiasm with some observations on its limitations.

Data in the wild may be copious, but not necessarily adequate to address important topics of inquiry. boyd and Crawford [2] note that even huge datasets lack certain kinds of data; for example, private messages on game servers that are not logged in the chat function will not be in the corpus. An ethnographer or linguist knows that in online games, for example, private messages range from gossip to appeals for practical help to cybersex, and that they are central to the social experience of gaming. A sociologist conducting a survey of Twitter use can ask people about their level of education, experience with other computational artifacts, and so on; a harvested Twitter corpus offers no such opportunity. Researchers must decide whether what is available is adequate to address significant research topics. In the wild, we do not have the ability to "backfill" data not already present in a dataset. With conventional methodologies it is possible to progress to another sequence of data collection if we recognize that we missed critical data, or if new topics of inquiry open upon a first round of analysis, as they so often do (see [3]). Bold statements heralding a new era of social science in which we can sweep away those pesky, cumbersome methods requiring us to collect the data ourselves must be seen in the context of the significant limitations of found data, including a lack of flexibility (see [2,3]). Easy acquisition of vast volumes of naturalistic data has a price in terms of increased analytical complexity and interpretive uncertainty [4]. Moreover, new ethical and legal challenges arise, some of which are discussed here.

An Ethical Challenge

One challenge is characterized by a shift in relationships between researchers and subjects of study. Subjects no longer "participate" in research or give consent to be studied. Instead, they are treated as the source of data in a large online archive, even though they are not aware that the data they produce (if indeed they even imagine they are "producing data") could be useful to some researchers. By contrast, if taking part in a survey or census, even at very large scale, the subject understands that someone will use the information they supply. Logistically, it would be impossible to ask for consent from every single person represented in data collected from Facebook, Twitter, chat rooms, blogs, game logs, and so on. Many researchers dealing with such data work on the assumption that since the data is available in the "public domain," it can be used for research purposes without consent. But game worlds, virtual worlds, Second Life, and so on are not public in the way that walking down the street is public. For example, participants may purchase monthly subscriptions for a virtual world, which they thus conceive of as a privileged space for which they are paying good money.

The ethical challenges of harvesting digital data have yet to be worked out. Our goal here is to draw attention to the issues in a nuanced way rather than to definitively answer them. We note that those supplying data in social media and email or by visiting websites, game worlds, and so on are providing unrecompensed labor to corporate and academic researchers. We may decide this form of labor is acceptable, but it should be recognized for what it is. We should consider certain questions. What if we offered people the ability to opt out of research even if their identities were kept anonymous? Or a micropayment in exchange for participation? If there is a micropayment, how should the labor be valued? What role should human subjects review committees play? These committees generally demand informed consent, a concern that grew out of abuses in Nazi Germany and the Tuskegee experiments (and other research; see [3]). Should we be thinking of ways to recast and reconfigure ethical arguments? The issues have been profoundly altered by digital technologies, and it may be imperative to find new ways to talk about them. For instance, Google's Street View allowed Internet users to see inside the front windows of some houses that had been photographed [5]. What if someone is unaware that his or her house is on view, or does not know the channels through which to appeal to have his or her privacy restored? In Europe, the creation of Google Street View may not be legal in all jurisdictions. But its images may end up in found data, and researchers may inadvertently use them against the wishes of those who appear in them. By the lights of any human subjects review committee, this is an ethical violation. Because of the vague provenance of found data, the controls that human subjects committees have labored to institute over the past 60-plus years are diminished. 

Some European countries have laws prohibiting filming a person in public for the purpose of public display without the person's consent. While most countries do not have such laws, human subjects review committees would not approve the use of images that violated local standards and laws. But such laws may disappear from view in found data.

A Legal Challenge

A second challenge is legal. Using found data could potentially put researchers at risk of legal liability. While the majority of uses of online communication services are legitimate, it is widely known that such platforms can be used for illegal activities including violation of intellectual property rights, copying and downloading copyrighted material, invasion of privacy through identity theft, and spamming, all the way to more sinister activities such as luring unsuspecting visitors (sometimes children) into illegal or dangerous online and offline activities. All these things can and do happen on the Internet every day, and illegal material could find its way into a researcher's database. It is not feasible to manually inspect a very large database for such material, or to trust an automated filtering system to capture illicit material (especially visual material). Therefore, the researcher who harvested the data may potentially be held liable for breaking the law if in possession of data that violates a law.

Some of the authors of this article started a process of sampling screenshots of live webcam streams from a publicly available social networking site in order to analyze issues of self-disclosure and privacy. As a result, a large amount of visual data was captured. However, storage became problematic, as there was no way of knowing what the dataset actually contained. It may have included, for example, images that could be deemed either illegal or disturbing (e.g., child pornography or images possibly related to terrorism). This type of data collection raises not only the ethical considerations, but also the legal consequences of unwittingly capturing and storing potentially controversial material. Serious concerns as to how to proceed with further collection of data emerged in the authors' research when a preliminary inspection of initial sets of screenshots revealed that an adult user account was used by minors. Due to the large amount of data, manual inspection was not feasible. Automated processes are not reliable enough to detect potentially illegal material. University management was approached, as well as external and internal legal personnel, in order to understand the researchers' legal position. However, discussions did not yield actionable guidelines beyond the advice not to pursue the research further. Another consideration was the effect possible negative media publicity might have on researchers' reputations and the university's image. Even if researchers do not actually break the law, there may be a temptation for the media or conservative political elements to sensationalize the story (see [6]). In this scenario, universities would be inclined to err on the side of caution, and would be hesitant to support researchers who undertake risky projects. In view of the above, it was decided that the collected data should be destroyed. A promising project with potential social impact was abandoned.

Conclusion

Data in the wild provides researchers with unprecedented access to large naturalistic datasets, resources that were not previously available. However, significant methodological, ethical, and legal concerns arise. The authors' own experience points to potential legal and ethical pitfalls in engaging with such data. Current ethical and methodological frameworks do not adequately address the gaps brought about by the scale and nature of this data. Because we are unsure of the ethical and legal ramifications of working with large datasets, there may be a "chilling effect" on research as we act conservatively to avoid pitfalls. Laws may be untested in court and difficult for the layperson to understand (see [7] and sidebar).

At present, we do not have a workable framework of guidelines for conducting large-scale research with data in the wild that would comprehensively address issues such as protecting individuals whose data is being captured online and informing researchers of risks. Internet laws are complicated, not only because they vary from country to country but also due to rapid changes in sharing information, which creates a regulatory gap [8]. For example, in the U.K. the legal framework for this area is complex and constantly evolving. Depending on the nature of the data being used, it potentially includes the Computer Misuse Act 1990, the Data Protection Act 1998, the regulatory framework of CEOP (Child Exploitation and Online Protection Centre), and European legislation concerning human rights, data protection, and privacy.

We are calling for multidisciplinary research involving law, computer science, social science, and the humanities to address the concerns we have discussed. Some topics for future discussion include how to work out realistic guidelines for conducting research with data in the wild, and how to undertake educating and involving human subjects review committees, legislators, the public, students, and researchers themselves.

References

1. Hannay, T. What can the Web do for science? Computer 43, 11 (2010), 84-87.

2. boyd, d. and Crawford, K. Six provocations for Big Data. A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society. Oxford Internet Institute, 2011; http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431

3. Boellstorff, T., Nardi, B., Pearce, C., and Taylor, T.L. Ethnography and Virtual Worlds: A Handbook of Method. Princeton University Press, Princeton, NJ, 2012.

4. Big Data white paper: Challenges and opportunities with Big Data. A community white paper developed by leading researchers across the United States. 2012; http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf

5. Mills, E. Google's street-level maps raising privacy concerns; http://www.usatoday.com/tech/news/internetprivacy/2007-06-01-google-maps-privacy_N.htm

6. Bainbridge, W. eGods: Fantasy Versus Faith. Oxford University Press, Oxford, U.K., 2013.

7. U.K. and European law documents: Child Exploitation and Online Protection Centre regulatory framework (http://ceop.police.uk/); Children Act 2004 (http://www.legislation.gov.uk/ukpga/2004/31/contents); Computer Misuse Act 1990 (http://www.legislation.gov.uk/ukpga/1990/18/contents); Data Protection Act 1998 (http://www.legislation.gov.uk/ukpga/1998/29/contents); The European Convention on Human Rights (http://www.hri.org/docs/ECHR50.html)

8. Internet Society. Understanding your online identity protecting your privacy. 2012; http://www.internetsociety.org/understanding-your-online-identity-protecting-your-privacy

Authors

Chee Siang Ang is a lecturer in the School of Engineering and Digital Arts, University of Kent. His main research interest lies in social computing, specifically virtual worlds, computer games, and social networking. He is also very keen to investigate the applications of these technologies in various domains such as healthcare.

Ania Bobrowicz is a senior lecturer in digital arts at the University of Kent at Canterbury, U.K. Her research interests include art history, computer-mediated communication, and emerging societal issues brought about by digital technologies. She is a fellow of the Royal Society of Arts and holds an M.Sc. in multimedia systems (London Guildhall University) and an M.A. in applied linguistics (University of Warsaw).

Diane Schiano is a user experience researcher specializing in social, psychological, and design implications of emerging patterns of mediated cognition, communication, and connection. She has a Ph.D. in experimental psychology (Princeton) and an M.A. in counseling (ITP), and has worked at Stanford, NASA/Ames, Interval Research, AT&T Labs, Xerox PARC, and as an independent consultant.

Bonnie Nardi is a professor at UC Irvine and the author of Ethnography and Virtual Worlds: A Handbook of Method (with T. Boellstorff, C. Pearce, and T.L. Taylor, Princeton Univ. Press, 2012) and My Life as a Night Elf Priest: An Anthropological Account of World of Warcraft (Univ. of Michigan Press, 2010).

Sidebar: Article 8 of the European Convention on Human Rights

Article 8 provides a right to respect for one's "private and family life, his home and his correspondence," subject to certain restrictions that are "in accordance with law" and "necessary in a democratic society." This article clearly provides a right to be free of unlawful searches, but the Court has given a broad interpretation to the protection for "private and family life" that this article provides, holding, for instance, that the prohibition of private, consensual homosexual acts violates this article. This may be compared with the jurisprudence of the U.S. Supreme Court, which has also adopted a somewhat broad interpretation of the right to privacy. Furthermore, Article 8 sometimes comprises positive obligations: Whereas classical human rights are formulated as prohibiting a State from interfering with rights, and thus as requiring it not to do something (e.g., not to separate a family under family life protection), the effective enjoyment of such rights may also include an obligation for the State to become active and to do something (e.g., to enforce access for a divorced father to his child).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ACM Interactions
