The Opportunities and Challenges of 'Big Data' Research

At the end of an extended trip to a range of conferences and symposia I've made my way to Vienna, where I'm attending the DGPuK Digital Methods conference at the University of Vienna. The conference is in German, but I'll try to blog the presentations in English nonetheless - wish me luck... We begin with a keynote by Jürgen Pfeffer, addressing - not surprisingly - the question of 'big data' in communications research.

Jürgen begins by asking what's different about 'big data' research. In our field, we're using 'big data' on communication and interaction to work towards a real-time analysis of large-scale, dynamic sociocultural systems, an effort that necessarily relies on computational approaches - this draws on the data available from major social networks and other participative sites, but it aims not to research "the Internet", but society, by examining communication patterns on the Internet (and elsewhere).

There are great hopes to use this to understand collective human behaviour and interaction, in much finer detail than previously possible and in close to real time. We may examine, for example, how social and conventional media interact: how journalists use Twitter as a radar pointing to breaking stories, how social media users disseminate mainstream media articles, and how online and offline events interact with and reinforce each other; this documents cross-media dynamics between the different platforms and channels.

Traditionally, this might have been researched by conducting user surveys, interviewing journalists, observing Websites, etc.; but now we can also do so by tracking relevant keywords in social media spaces and using advanced computational methods to identify and examine the patterns we find. Increasingly, this draws on very large-scale data-mining approaches, combining a mix of sources, and often takes a data-driven approach which begins with data access and develops research questions from the patterns observed in the data.
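As a rough illustration of that first step, keyword tracking can be as simple as filtering incoming posts against a watchlist and counting matches over time. A minimal sketch with invented posts and hypothetical field names, not any specific platform's API:

```python
# Count keyword-matching posts per hour from a stream of post records.
# The dict fields ("text", "hour") are hypothetical stand-ins for whatever
# a real data source would deliver.
from collections import Counter

KEYWORDS = {"election", "protest"}

posts = [
    {"text": "Live updates from the protest downtown", "hour": 14},
    {"text": "Election results expected tonight", "hour": 15},
    {"text": "Nothing to see here", "hour": 15},
]

matches_per_hour = Counter(
    post["hour"]
    for post in posts
    if KEYWORDS & set(post["text"].lower().split())
)
print(matches_per_hour)  # Counter({14: 1, 15: 1})
```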

An analysis of breaking news patterns about Syria shows the role of social media (both Facebook and Twitter) as a way of sharing breaking news, but also the quick drop-off of sharing after the first few hours; this drop-off is steeper on Twitter than on Facebook, probably due to the different site designs. Early sharing activity also serves as a predictor of how widely relevant articles will be read on mainstream news sites in the following days - Jürgen has worked with Al Jazeera to develop a news readership predictor tool, for example.
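The shape of that drop-off is what makes prediction possible: if early sharing decays in a regular way, a curve fitted to the first few hours can be extrapolated forward. A minimal sketch of the general idea, using invented numbers rather than Jürgen's or Al Jazeera's actual model:

```python
# Fit an exponential decay to the first hours of sharing activity, then
# project total attention over the next two days. All figures are invented.
import numpy as np
from scipy.optimize import curve_fit

def decay(t, a, k):
    """Exponential decay of hourly shares after publication."""
    return a * np.exp(-k * t)

hours = np.arange(6)                              # first six hours after publication
shares = np.array([950, 560, 330, 200, 115, 70])  # hypothetical hourly share counts

(a, k), _ = curve_fit(decay, hours, shares, p0=(1000, 0.5))
projected_total = decay(np.arange(48), a, k).sum()
print(f"fitted decay rate k={k:.2f}/hour; "
      f"projected 48-hour attention: {projected_total:.0f} shares")
```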

More generally, some principles for 'big data' work emerge from this: there is a strong tendency to collect as much data as possible (n = all), however messy it may be initially, and to trust in computational systems to decide later which data are useful. The research process is also reorganised as a result: rather than starting from a defined hypothesis, data mining-based approaches begin from data gathering and analytics methods, gather large datasets, analyse them and present the results, and only then develop an understanding of the problems that these results illustrate.

This is not always an effective and rigorous way of working, though - especially because correlation does not necessarily mean causation. Yet often the question of 'why' isn't asked at all - a mere recognition of patterns suffices. And as the number of independent variables examined in 'big data' research increases, some variables will always correlate simply through random fluctuation; this multiple-comparisons problem means that false positives will emerge.
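This is easy to demonstrate: testing hundreds of pure-noise variables against a pure-noise outcome at the conventional 5% significance level will reliably produce 'significant' correlations. A minimal simulation:

```python
# Simulate the multiple-comparisons problem: with enough candidate variables,
# some will correlate with the outcome purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_observations = 1000  # e.g. days of observation
n_variables = 500      # e.g. candidate predictors mined from a dataset

outcome = rng.normal(size=n_observations)                    # pure noise
predictors = rng.normal(size=(n_variables, n_observations))  # also pure noise

false_positives = 0
for predictor in predictors:
    r, p = stats.pearsonr(predictor, outcome)
    if p < 0.05:  # conventional significance threshold
        false_positives += 1

print(f"{false_positives} of {n_variables} 'significant' correlations "
      f"found in pure noise (expected: ~{0.05 * n_variables:.0f})")
```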

Further, if 'big data' research takes an inclusive approach and tries to track all available data, what does it actually cover - does 'all' mean all, or just all that is readily available? What is missing; which populations are not covered by the data; do the patterns in the data actually represent real-life human behaviours; and do the outcomes of the analysis actually represent these patterns accurately? These are difficult questions to answer.

The work conducted around the structure of social networks provides an example of this. There are plenty of studies on the structure of offline networks, showing a gradation between close circles of friends and broader social networks; but this potentially looks quite different in online social network spaces, also dependent on the technological affordances of these spaces. This may amplify the spread of information, drive the formation of network clusters, or result in the creation of echo chambers.

The developers of social networking platforms realise this, of course, and structure their technologies specifically to encourage such connection patterns - 70% of all new LinkedIn connections constitute triadic closures, for example, where two contacts of a central node close the loop by forming a direct connection themselves.
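Triadic closure is straightforward to operationalise in network terms: an 'open triad' is any pair of nodes that share a neighbour but are not yet directly linked, and these open triads are exactly what 'people you may know' features target. A small sketch using networkx and an invented toy graph:

```python
# Identify open triads (candidates for triadic closure) in a friendship graph.
import itertools
import networkx as nx

g = nx.Graph()
g.add_edges_from([("anna", "ben"), ("anna", "carl"), ("ben", "carl"),
                  ("anna", "dora"), ("dora", "emil")])

def open_triads(graph):
    """Yield (centre, u, v) where u and v share neighbour centre but are unlinked."""
    for node in graph:
        for u, v in itertools.combinations(graph.neighbors(node), 2):
            if not graph.has_edge(u, v):
                yield node, u, v

for centre, u, v in open_triads(g):
    print(f"{u} and {v} both know {centre} but not (yet) each other")
```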

But do the data being analysed actually represent the underlying system itself? One of Jürgen's projects compared Twitter firehose data with what's available through the free streaming API, for example, which delivers a maximum of 1% of all current tweets; how this 1% is selected from all tweets is far from transparent, however. Especially for the major trending topics, the streaming API does not provide an accurate representation of what's really happening on the platform.
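The mechanics of the distortion can be illustrated with a toy model - a fixed-cap sample, not the platform's actual, undisclosed selection logic: because the cap is absolute, the effective sampling rate falls exactly when overall volume spikes, so trending topics are undercounted relative to quiet periods.

```python
# Toy model of a capped sample stream: the cap flattens volume spikes,
# so fast-trending topics are under-represented. All figures are invented.
CAP_PER_HOUR = 1_000  # stand-in for the streaming API's fixed rate limit

# Hypothetical hourly volumes for one breaking story amid background chatter.
hours = {
    "quiet": {"background": 90_000, "breaking_story": 10_000},   # 100k total
    "spike": {"background": 100_000, "breaking_story": 400_000}, # 500k total
}

for label, volumes in hours.items():
    total = sum(volumes.values())
    sampling_rate = min(1.0, CAP_PER_HOUR / total)
    sampled = round(volumes["breaking_story"] * sampling_rate)
    print(f"{label}: true={volumes['breaking_story']:,} sampled={sampled:,} "
          f"(effective rate {sampling_rate:.2%})")

# The story's true volume grows 40x between the two hours, but its sampled
# count grows only 8x: the capped stream systematically underestimates spikes.
```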

This means that some important critical questions must continue to be asked of 'big data' research. And the data we've seen so far aren't even extremely big; Facebook, for example, gathers some 500 terabytes per day, including 300 million new photos. Total world Internet traffic, by comparison, is 1.1 exabytes (1.1 billion gigabytes) per day - several orders of magnitude larger again. At such sizes, not only data gathering but also data processing becomes a major issue - the machines required to do comprehensive work are rarely available, especially to publicly funded researchers in the social sciences without financial support from the NSA.

The metrics and measures used for such analyses must also be examined in more detail. Many of the network analysis approaches we use today were originally developed for the study of small (offline) rather than very large (online) networks - many of the assumptions underlying social network analysis algorithms may not be appropriate to the task.
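Scale alone already strains these tools: exact betweenness centrality via Brandes' algorithm, for instance, costs O(nm) time and becomes impractical on networks with millions of nodes, pushing analysts towards pivot-sampling approximations whose reliability on large, skewed online networks is itself an open question. A sketch of the trade-off, using networkx on a synthetic scale-free graph rather than any real platform data:

```python
# Compare exact betweenness centrality with a pivot-sampled approximation
# on a synthetic scale-free network.
import networkx as nx

g = nx.barabasi_albert_graph(n=2_000, m=3, seed=7)  # scale-free test graph

exact = nx.betweenness_centrality(g)                   # all 2,000 sources
approx = nx.betweenness_centrality(g, k=100, seed=7)   # 100 sampled sources

top_exact = sorted(exact, key=exact.get, reverse=True)[:10]
top_approx = sorted(approx, key=approx.get, reverse=True)[:10]
print("top-10 overlap:", len(set(top_exact) & set(top_approx)), "of 10")
```

Sampling recovers the most central nodes reasonably well here, but how such approximations behave on networks orders of magnitude larger, with heavy-tailed degree distributions, is precisely the kind of assumption that deserves scrutiny.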

The big hopes and dreams associated with 'big data', and especially with 'big social data', may need to be replaced by more realistic, more balanced perspectives. We must move beyond the excitement of possibilities and towards a sober understanding of what can and should be done using 'big social data'; this also means thinking more carefully about the privacy, surveillance, and commercial implications of this work.

We should not be blinded by 'big data', but ask what we actually learn from such work, why we are doing it, and what scientific value it has. We can always find 'interesting' needles in the 'big data' haystack, but is it actually a haystack, or in fact a pile of needles, Jürgen asks. An answer to this question needs more than just computational methods - it needs mixed-methods and interdisciplinary research approaches.