Towards a Vocabulary for the On-Sharing of Research Data from Social Media Platforms

Snurb — Tuesday 17 March 2026 23:54


The next speaker in this session at the Social Media Access Days at the German National Library is Katharina Maubach, whose focus here is on data formats for archiving social media data. She works with a project exploring liking activities on social media platforms, especially relating to content from news sites; this covers Disqus, Facebook, YouTube, Xitter, and Instagram.

Ideally, such a cross-platform dataset should be shared with other researchers under FAIR principles (findable, accessible, interoperable, and reusable), but under the Terms of Service of such platforms and their data access conditions this is very difficult; the focus of Katharina’s talk here is on the data formats that might support such sharing, and the extent and limitations of such data.

While there are generic metadata formats which could be applied here (W3C Activity Streams, Dublin Core), these do not easily map onto the structure that social media activity data commonly take; the project therefore developed its own Canonical Social Media (CanSM) format. This covers data management information (collection time and modes, etc.), communication structure (message type and thread/tree information), provenance (platform, seed domain used in the collection, author information, etc.), temporal data (dates of creation and modification), message content (caption, text, tags), and metrics (e.g. engagement received).
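To make the six groups of information a little more concrete, here is a minimal sketch of what a single record in such a format might look like. This is not the actual CanSM schema: the class name, all field names, and the example values (including the seed domain) are hypothetical stand-ins chosen only to mirror the categories listed above.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import Optional


@dataclass
class CanonicalRecord:
    """Hypothetical record layout; field names are illustrative, not the CanSM spec."""

    # Data management information
    collected_at: datetime
    collection_mode: str
    # Communication structure
    message_type: str             # e.g. "post", "comment", "reply"
    message_id: str
    parent_id: Optional[str]      # None for the root of a thread
    # Provenance
    platform: str
    seed_domain: str
    author_id: str
    # Temporal data
    created_at: datetime
    modified_at: Optional[datetime] = None
    # Message content
    caption: str = ""
    text: str = ""
    tags: list[str] = field(default_factory=list)
    # Metrics (e.g. engagement received)
    metrics: dict[str, int] = field(default_factory=dict)


# Invented example values, for illustration only.
example = CanonicalRecord(
    collected_at=datetime(2025, 1, 15, 12, 0),
    collection_mode="api",
    message_type="post",
    message_id="m1",
    parent_id=None,
    platform="YouTube",
    seed_domain="example.org",
    author_id="a1",
    created_at=datetime(2025, 1, 14, 9, 30),
    text="An article worth reading",
    tags=["news"],
    metrics={"likes": 3},
)

# asdict() gives a plain dictionary that could be serialised for sharing.
record_dict = asdict(example)
```

Keeping thread structure as a simple `parent_id` reference is just one possible design choice; it allows reply trees from different platforms to be rebuilt in the same way, whatever the platform calls them.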

Some such data are anonymous, others pseudonymous or identifiable, and some are specially protected by applicable laws (e.g. data on personal opinion or religion); some were also deleted from the platforms after data collection. This further complicates any data sharing. To address this, the project identified several levels of data shareability: 0 – data related to project organisation; 1 – anonymisable without losing information; 2 – aggregate data and statistics; 3 – localisable in datasets, where anonymising causes information loss; 4 – identifiable using the textual content. A first extract of this has been published via OSF.
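As a rough sketch of how these levels could be put to work, the five categories can be represented as an ordered enumeration and used to filter a record before it is shared. Two assumptions here are mine rather than the project's: that a higher number means more restrictive handling, and the assignment of particular fields to particular levels, which is invented purely for illustration. The names `Shareability`, `FIELD_LEVELS`, and `shareable_view` are hypothetical.

```python
from enum import IntEnum


class Shareability(IntEnum):
    """The five levels from the talk; the identifiers are my own labels."""

    PROJECT_ORGANISATION = 0   # data related to project organisation
    ANONYMISABLE_LOSSLESS = 1  # anonymisable without losing information
    AGGREGATE = 2              # aggregate data and statistics
    LOCALISABLE = 3            # localisable; anonymising loses information
    IDENTIFIABLE_BY_TEXT = 4   # identifiable using the textual content


# Hypothetical field-to-level assignment, NOT taken from the project.
FIELD_LEVELS = {
    "collected_at": Shareability.PROJECT_ORGANISATION,
    "author_id": Shareability.ANONYMISABLE_LOSSLESS,
    "like_count": Shareability.AGGREGATE,
    "created_at": Shareability.LOCALISABLE,
    "text": Shareability.IDENTIFIABLE_BY_TEXT,
}


def shareable_view(record: dict, max_level: Shareability) -> dict:
    """Keep only fields at or below max_level; unclassified fields are dropped."""
    return {
        key: value
        for key, value in record.items()
        if key in FIELD_LEVELS and FIELD_LEVELS[key] <= max_level
    }


record = {
    "collected_at": "2025-01-15",
    "author_id": "a1",
    "like_count": 3,
    "created_at": "2025-01-14",
    "text": "An article worth reading",
}

public_extract = shareable_view(record, Shareability.AGGREGATE)
```

Dropping any field that has no assigned level is a deliberately cautious default: nothing is shared unless someone has explicitly classified it.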

The data vocabulary that underpins this was designed specifically for social media data, and is transferable to other projects; its categories make it possible to differentiate between the formats and modes of publishing such data. Many other questions still need to be addressed, however, including data protection, copyright, and ethical and moral rights. A further workshop on the publication of research data will be organised in early 2027.
