Snurblog — Axel Bruns

Building a Shareable n-Gram Dataset from Non-Shareable Social Media Data

Snurb — Wednesday 18 March 2026 21:10
'Big Data' | Social Media | Twitter | Social Media Access Days 2026 | Liveblog

The next speaker at the Social Media Access Days at the German National Library is Robert Jäschke. He begins by noting the legal constraints on social media data sharing, including Terms of Service, copyright, and other restrictions. One way of managing this is the approach that Twitter permitted: sharing datasets that list only tweet IDs, without any further content, was allowed, and researchers then needed to ‘rehydrate’ them by regathering the tweet data. Other approaches are to share only aggregate metrics, or to share derived datasets (like term matrices, n-gram datasets, or word embeddings), rather than the source data themselves.

Such n-gram data could be generated from Twitter datasets like the TweetsKB dataset of the 1% sample of the streaming API between 2013 and 2023, for instance; in a total dataset of 14.2 billion tweets, this contains some 2.1 billion original English-language tweets that are not duplicates and have not been deleted subsequently.

After removing URLs and @mentions from this dataset, these tweets were tokenised and normalised to lowercase, and 1-, 2-, and 3-grams extracted. These were collated into datasets for each month over the 11-year period covered by the dataset.
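The processing steps above could be sketched roughly as follows. This is only an illustration, not the pipeline presented in the talk: the regular expressions, the simple word-level tokeniser, and the function names (`tokenise`, `ngrams`, `monthly_ngram_counts`) are all assumptions, as is the input format of `(month, text)` pairs.

```python
import re
from collections import Counter, defaultdict

# Assumed patterns; the actual project may define URLs, mentions,
# and tokens differently.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"#?\w+(?:'\w+)?")


def tokenise(text):
    """Strip URLs and @mentions, lowercase, and split into tokens."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def monthly_ngram_counts(tweets, max_n=3):
    """Count 1- to max_n-grams per (month, n).

    `tweets` is an iterable of (month, text) pairs, where month is a
    string such as "2020-03" (a hypothetical input format).
    """
    counts = defaultdict(Counter)
    for month, text in tweets:
        tokens = tokenise(text)
        for n in range(1, max_n + 1):
            counts[(month, n)].update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    sample = [
        ("2020-03", "Stay home @user https://t.co/abc Stay SAFE"),
        ("2020-03", "stay home"),
    ]
    result = monthly_ngram_counts(sample)
    print(result[("2020-03", 2)].most_common(3))
```

Note that in this sketch, removing a URL or @mention makes the tokens on either side of it adjacent, so some n-grams will span the gap; whether that is acceptable is a design choice that a real pipeline would need to make explicitly.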

This, then, enables an analysis of large-scale tweeting patterns over time, showing for instance a slow decline in posting activity to 2020, and a substantial increase again from early 2020 (probably as the COVID-19 pandemic kicked in). There are also some gaps and errors in this dataset, however, and of course the 1% Twitter sample has its own limitations to begin with.
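As a rough illustration of this kind of over-time analysis, monthly n-gram counts could be turned into a relative frequency series for a single term. This is a minimal sketch under assumptions: the function name `relative_frequency_series`, the input format (a mapping from month strings to `Counter` objects), and the toy numbers are all invented for the example and do not come from the dataset.

```python
from collections import Counter


def relative_frequency_series(monthly_counts, term):
    """Share of all n-gram occurrences per month accounted for by `term`.

    `monthly_counts` maps a month string (e.g. "2020-03") to a Counter
    of n-gram frequencies for that month (a hypothetical format).
    """
    series = {}
    for month in sorted(monthly_counts):
        counter = monthly_counts[month]
        total = sum(counter.values())
        # Guard against empty months, e.g. gaps in the collection.
        series[month] = counter[term] / total if total else 0.0
    return series


if __name__ == "__main__":
    # Toy data for illustration only.
    toy = {
        "2020-02": Counter({"covid": 1, "home": 3}),
        "2020-03": Counter({"covid": 3, "home": 1}),
    }
    print(relative_frequency_series(toy, "covid"))
```

Normalising by the monthly total matters here because overall posting volume in the sample changes over time, so raw counts alone would conflate a term's popularity with general activity levels.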

This processed dataset is now publicly available for researchers to use.


Except where otherwise noted, this work is licensed under a Creative Commons BY-NC-SA 4.0 Licence.