Defining Themes for Twitter Data Gathering

The next presentation in this Digital Methods panel is by Christoph Neuberger and Sanja Kapidzic, whose focus is on the question of how to define themes and topics in online communication. Using single keywords to define topics is too simplistic, and there often is an implication that we know what a topic is when we see it - but what exactly is a topic?

Sometimes, specific labels do emerge for given topics, which makes tracking them easier, but these labels themselves may evolve. In live topics it becomes necessary to track these themes and continue to update the markers of themes which are seen as relevant. Themes may be defined variously by broad news beats, by thematic areas, by single themes, or at the most specific level by specific events; these levels of specificity also overlap considerably, however.

The project is interested in tracking the thematic careers of specific topics, without predetermining which thematic areas are selected; for example, it has recently examined the coverage of the NSA surveillance affair in Germany as that affair emerged and developed. From this work, it is interested in developing a more comprehensive perspective of what types of thematic careers there may be.

But how can themes be tracked through the ad hoc selection of keywords, then? The study made a broad selection of thematic areas, focussing both on specific topics and on abstract theme types (scandals, crises); it then defined these themes from the perspective of topic experts; and it attempted to develop the dimensions of these themes by identifying the actors and actor types, the threats and opportunities, and the problem solutions offered.

Operationalising these themes for data gathering meant defining groups of keywords which best cover any given thematic area. This involved the analysis of expert texts on the topic, the search for synonymous terms in a thesaurus, the use of glossaries for specific thematic areas, and the analysis of journalistic texts, in order to identify a broad range of relevant keywords. This was combined with the dimensions as outlined above, generating a list of 49 keywords for the initial set of themes, which were used in a total of 106 grammatical inflections.
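
To make the keyword-group idea concrete, here is a minimal Python sketch of how base keywords and their inflected forms might be matched against tweet texts. The example terms, the `THEME_KEYWORDS` structure, and the function names are all hypothetical; the presentation did not describe the actual tooling, so this is an illustration under those assumptions rather than the project's method.

```python
import re

# Hypothetical keyword groups: base keyword -> inflected forms.
# These terms are illustrative only and are not taken from the study.
THEME_KEYWORDS = {
    "surveillance": ["surveillance", "surveil", "surveilled"],
    "flood": ["flood", "floods", "flooding"],
}


def build_patterns(theme_keywords):
    """Compile one case-insensitive whole-word pattern per inflected form."""
    patterns = {}
    for base, forms in theme_keywords.items():
        for form in forms:
            regex = re.compile(r"\b" + re.escape(form) + r"\b", re.IGNORECASE)
            patterns[form] = (base, regex)
    return patterns


def match_tweet(text, patterns):
    """Return the sorted list of inflected forms found in a tweet text."""
    return sorted(form for form, (_, regex) in patterns.items() if regex.search(text))


patterns = build_patterns(THEME_KEYWORDS)
hits = match_tweet("Flooding of leaks after the NSA surveillance story", patterns)
```

Whole-word matching is a design choice here: it keeps "flood" from also firing on "flooding", so that each inflection can be counted and evaluated separately.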

These 106 keywords resulted in some 10,000 matching tweets over the course of one week, with only about one keyword matched per tweet; some 42% of keywords did not result in any matching tweets. These tweets were then coded for their thematic relevance, and some 80% of keywords found relevant tweets in at least 50% of cases; 56% of keywords found relevant tweets in more than 90% of all cases. Match quality was lower for abstract thematic areas, where keywords often picked up metaphorical uses (e.g. for terms like "flood" or "storm"). Even many apparently unique search terms, such as politicians' names, often generated false positives.
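
The per-keyword evaluation described here can be sketched roughly as follows, in Python. The input format (pairs of keyword and a manually coded relevance flag), the function names, and the sample data are assumptions made for illustration. The sketch also assumes that keywords without any matching tweets are left out of the calculation, which the presentation did not specify.

```python
from collections import defaultdict


def keyword_precision(coded_matches):
    """Compute the share of relevant tweets per keyword.

    coded_matches: iterable of (keyword, is_relevant) pairs, one per
    matched tweet, where is_relevant comes from manual coding.
    Keywords with no matches never appear and so get no score.
    """
    totals = defaultdict(int)
    relevant = defaultdict(int)
    for keyword, is_relevant in coded_matches:
        totals[keyword] += 1
        if is_relevant:
            relevant[keyword] += 1
    return {kw: relevant[kw] / totals[kw] for kw in totals}


def share_at_or_above(precisions, threshold):
    """Share of scored keywords whose precision is at least `threshold`."""
    if not precisions:
        return 0.0
    passing = sum(1 for p in precisions.values() if p >= threshold)
    return passing / len(precisions)


# Made-up coding results, for illustration only.
sample = [("flood", True), ("flood", False), ("nsa", True), ("nsa", True)]
precisions = keyword_precision(sample)
half_or_better = share_at_or_above(precisions, 0.5)
ninety_or_better = share_at_or_above(precisions, 0.9)
```

Keywords that fall below a chosen threshold are the natural candidates for removal or rewording when the search terms are refined.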

This enabled a further refinement of search terms, and the thematic tracking proper has now been able to get started. It will be interesting to see how this can now be used for the main part of the study.