About a month ago I introduced my new Gawk script metrify.awk, which generates a wide range of Twitter metrics for a given Twapperkeeper/yourTwapperkeeper hashtag or keyword archive. Even as I was writing those posts, though, and certainly while playing with the language metrics I discussed in my last post, I started to find a few areas where metrify could provide even more information on the dataset. So, the time has come for a first service release which upgrades metrify.awk to add some more functionality (and fix a few inconsistencies along the way). This is a revision rather than a full rewrite of the script, so let’s call it metrify 1.2; it’s now available for download here, where it replaces the older version.
As before, the new version of metrify.awk is called as follows:
gawk -F , -f metrify.awk time="[year|month|day|hour|minute]" [divisions=x,y,z,…] [skipusers=1] input.csv >metrics.csv
(divisions defaults to ‘90,99’ – i.e. a 90%/9%/1% split of the userbase – if it is not specified).
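To make the effect of the divisions argument more concrete, here is a minimal sketch of how a setting of 90,99 could split a userbase into 90%/9%/1% groups. It is written in Python rather than Gawk, purely for illustration, and it is not metrify's own code: the function name split_userbase, the rank-based cut-off rule and the tie-breaking by username are all assumptions.

```python
def split_userbase(tweet_counts, divisions=(90, 99)):
    """Assign each user a group index: 0 = least active, len(divisions) = most active.

    tweet_counts: dict mapping username -> total number of tweets.
    divisions: ascending percentile boundaries, e.g. (90, 99) for a 90%/9%/1% split.
    Assumption: users are ranked by tweet count, with ties broken by username.
    """
    ranked = sorted(tweet_counts, key=lambda user: (tweet_counts[user], user))
    total = len(ranked)
    groups = {}
    for rank, user in enumerate(ranked):
        # A user moves up one group for every boundary that the share of users
        # ranked below them has reached (integer maths avoids rounding issues).
        groups[user] = sum(1 for boundary in divisions if rank * 100 >= boundary * total)
    return groups


# 100 made-up users who posted between 1 and 100 tweets each
counts = {"user%03d" % i: i for i in range(1, 101)}
groups = split_userbase(counts)
sizes = [list(groups.values()).count(g) for g in range(3)]
print(sizes)  # [90, 9, 1]
```

A rank-based rule like this always produces groups of the requested size, but it can place two users with the same number of tweets in different groups; a rule based on tweet-count thresholds avoids that at the cost of less exact group sizes.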
In this post, I won’t go from scratch through the entire range of metrics that metrify.awk generates; my original four-part post is still sufficient for that purpose. Rather, I’ll focus only on the major changes in this new revision, which relate mainly to part two of that series (and I’ve noted the updates in those posts as well, to avoid confusion): the metrics over time.
Changes to Metrics over Time
The first table generated by metrify shows the metrics over the chosen timeframe (e.g. day or hour), but it now contains a number of additional data points. The changes only concern the columns which contain metrics for the various user percentiles which are defined with the ‘divisions’ argument. Rather than providing information only on the number of users from each percentile which are actively participating during each timeframe (expressed as a percentage of the total number of currently active users), as metrify 1.0 did, revision 1.2 provides a number of further metrics:
Here’s a comparison of the relevant output columns between versions 1.0 and 1.2:
Least active x% of users:
metrify.awk 1.0: lowest x% users (<= u tweets)
metrify.awk 1.2: number of current users from least active x% (< u tweets)
metrify.awk 1.2: % of current users from least active x% (< u tweets)
metrify.awk 1.2: number of tweets from least active x% (< u tweets)
metrify.awk 1.2: % of tweets from least active x% (< u tweets)
Users above x%:
metrify.awk 1.0: users > x% (> u tweets; a of n users)
metrify.awk 1.2: number of current users from > x% group (> u-1 tweets; a of n users)
metrify.awk 1.2: % of current users from > x% group (> u-1 tweets; a of n users)
metrify.awk 1.2: tweets from > x% group (> u-1 tweets; a of n users)
metrify.awk 1.2: % of tweets from > x% group (> u-1 tweets; a of n users)
Users above y%:
metrify.awk 1.0: users > y% (> v tweets; b of n users)
metrify.awk 1.2: number of current users from > y% group (> v tweets; b of n users)
metrify.awk 1.2: % of current users from > y% group (> v tweets; b of n users)
metrify.awk 1.2: tweets from > y% group (> v tweets; b of n users)
metrify.awk 1.2: % of tweets from > y% group (> v tweets; b of n users)
(with the default settings, x% would be 90% and y% would be 99%; a, b, u, v, and n would depend on the dataset).
So, it now becomes possible not only to track what percentage of the total number of currently active users are from each of the percentiles we have defined, but also what percentage of the total volume of tweets during each period is contributed by each of the user percentiles. By way of example, here’s a comparison of those metrics for the #egypt dataset during February 2011:
Active users in the 90/9/1 user percentiles as percentage of total active userbase
Tweets by users in the 90/9/1 user percentiles as percentage of total current tweet volume
Unsurprisingly, the two charts move together – the greater the presence of a specific user group in the total active userbase, the greater their contribution to the current tweet volume – but only the second chart also tells the story of just how dominant the most active one per cent of users really is. Towards the end, they still only constitute slightly less than 20% of the total userbase participating during the final days of February – but more than half of all tweets posted at that time originate from them.
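The difference between the two charts comes down to what is being counted in each period: distinct active users, or tweets. As a rough sketch of that distinction (in Python rather than Gawk, and not taken from metrify itself), the following computes both shares per period. The function name period_shares, the input layout of (period, username) pairs and the group labels are assumptions made up for this example.

```python
from collections import defaultdict


def period_shares(tweets, groups):
    """Per period and user group: (% of active users, % of tweets).

    tweets: iterable of (period, username) pairs, one pair per tweet.
    groups: dict mapping username -> group label (e.g. a percentile group).
    """
    users_by = defaultdict(lambda: defaultdict(set))   # period -> group -> distinct users
    tweets_by = defaultdict(lambda: defaultdict(int))  # period -> group -> tweet count
    for period, user in tweets:
        group = groups[user]
        users_by[period][group].add(user)
        tweets_by[period][group] += 1

    shares = {}
    for period in tweets_by:
        total_users = sum(len(users) for users in users_by[period].values())
        total_tweets = sum(tweets_by[period].values())
        shares[period] = {
            group: (100.0 * len(users_by[period][group]) / total_users,
                    100.0 * tweets_by[period][group] / total_tweets)
            for group in tweets_by[period]
        }
    return shares


# Made-up data: one very active user and one occasional user on the same day
example = [("2011-02-27", "a"), ("2011-02-27", "a"),
           ("2011-02-27", "a"), ("2011-02-27", "b")]
result = period_shares(example, {"a": "top 1%", "b": "lowest 90%"})
print(result["2011-02-27"])
```

In this toy example each group makes up half of the day's active users, but the more active group contributes three quarters of the tweets, which is the same kind of gap that the second chart reveals.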
(At a later stage, I may also add functionality to track the use of different tweet types over time, by the different percentiles – but that’s a feature for metrify 1.5 or so.)
Other Changes
The only other notable change in this new revision is that the third of the tables generated by metrify.awk, which describes the participating users themselves, has gained a further column, ‘percentile’. This contains a simple descriptor of which of the various percentiles a user has been placed in, and thereby allows for an easier filtering of the list (using Excel’s data filter functions). For the standard 90/9/1 division of the userbase, fields in the column would contain one of the following four options for each user:
Additionally, and less obviously, I’ve also rewired how users are tracked through the dataset. In principle, this should be a very simple process: each user has both a unique numerical Twitter user ID, and a unique alphanumeric username. However, for some esoteric reason the user IDs returned by the Twitter search and streaming APIs, which Twapperkeeper uses to retrieve its datasets, do not always match, especially for older archives (or perhaps for older accounts?); the same user may have two completely different user IDs (thanks to John O’Brien for the details on this). This means that using the user IDs to track user activities in the dataset is unreliable. Usernames, however, may also be changed by the user at any point: @KRuddMP could become @KRuddPM when you least expect it. (Sorry, couldn’t resist!)
Still, as this doesn’t happen all too often, and given the unreliability of the numerical user IDs, metrify uses (lowercase) usernames as its internal tracking ID. Wherever possible, the final output shows each username in the capitalisation we first encounter in tweets by that user (users may choose to change that capitalisation at a later date, but we’re not checking for that); for users who are only mentioned, but don’t themselves tweet actively, we use the capitalisation which we first encounter.
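The tracking rule described above can be sketched in a few lines. This is a Python illustration rather than the Gawk code used in metrify, and the function name track_usernames and the input layout (sender plus a list of mentioned usernames per tweet) are assumptions for the example.

```python
def track_usernames(records):
    """Map lowercase tracking IDs to the username form shown in the output.

    records: iterable of (sender, mentioned_usernames) tuples in archive order.
    """
    display = {}
    seen_as_sender = set()
    for sender, mentions in records:
        key = sender.lower()
        if key not in seen_as_sender:
            # The first tweet by this user fixes the display form, overriding
            # any capitalisation previously seen only in mentions.
            display[key] = sender
            seen_as_sender.add(key)
        for name in mentions:
            # Mentioned-only users keep the first capitalisation encountered.
            display.setdefault(name.lower(), name)
    return display


# Made-up archive extract
names = track_usernames([
    ("Alice", ["kruddmp"]),   # kruddmp first appears as a mention
    ("KRuddMP", ["alice"]),   # then tweets, so this capitalisation wins
    ("kruddMP", []),          # a later change in capitalisation is ignored
    ("Bob", ["Carol"]),
    ("Bob", ["carol"]),       # Carol never tweets; the first mention form is kept
])
print(names)
```

Keying everything on the lowercase form means that all of a user's tweets and mentions are counted together regardless of how the name is typed, while the separate display lookup keeps the output readable.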
Finally, one caveat remains: as before, metrify will take quite some time to process a large dataset, and is likely to run out of memory if it’s trying to generate full user metrics for such datasets. (There doesn’t seem to be any way to allocate more memory to Gawk, or to the shell it runs in, so there’s little I can do to fix this.) Where full, detailed per-user metrics aren’t required, use the skipusers=1 command-line argument, and metrify will only output the number of tweets contributed by each user, and the percentile they’ve been allocated to on that basis. And it will take a lot less time to do so.
So much, then, for this service update of metrify.awk. In a follow-up post in a few days, I’ll show how metrify metrics can also be imported into Gephi to turbo-charge our network visualisations of Twitter @reply and retweet networks…
Another brief announcement: along with our CCI colleague Larissa Hjorth, Axel and I are looking forward to editing a special issue of the Journal of Broadcasting & Electronic Media (JOBEM) on the theme “Emerging Methods for Digital Media Research”, due for publication in March 2013. If you work in a related area, please consider submitting an abstract by the March deadline. Details follow below.
Emerging Methods for Digital Media Research
Special Themed Issue of the Journal of Broadcasting & Electronic Media (JOBEM), March 2013.
Guest Editors:
Jean Burgess (QUT)
Axel Bruns (QUT)
Larissa Hjorth (RMIT)
ARC Centre of Excellence for Creative Industries & Innovation (http://cci.edu.au/)
Editor: Zizi Papacharissi
With the rise of ‘big data’, locative media, and smartphones, existing media and communication studies methods are being recombined, reconfigured and replaced alongside their objects of study. This special issue of JOBEM seeks to expose new research methods for understanding the changing nature of the content industries, the impact of digital media on the practices of creative workers, and the experiences and practices of everyday users of digital media technologies.
We welcome papers based in the humanities and social sciences that reflect on, discuss or critique current methodological trends in digital media research, shedding light on the following questions:
1. Where are the emerging methodological gaps – are there pressing research problems that require the development of new methods, techniques and tools?
2. Where are there needs for new combinations of methods, within or across disciplines?
3. What are the implications for future pedagogical models in internet, media and communication studies, including doctoral education and other forms of research training?
We especially welcome papers grounded in the experience of conducting empirical digital media research. However, we will give preference to papers that contextualise, historicise, and reflect on current methodological trends, rather than simply report on the applications or results of new methods.
Abstracts of 250 words are due by 31 March, 2012. Depending on the number of abstracts received, we may shortlist submissions at this stage. Please email your abstract and a list of 3 or 4 suggested peer reviewers to: jobem.edm@gmail.com.
Full articles of no more than 7000 words should be submitted on or before 1 August, 2012 at: http://mc.manuscriptcentral.com/hbem (select “Special Issue: Emerging Digital Methods” as a manuscript type). Manuscripts should conform to the guidelines of the Journal of Broadcasting & Electronic Media.
OK, this may be a somewhat esoteric subject for researchers who mainly work with Twitter data from specific countries and cultures, but over the past few weeks I’ve been working on a paper that analyses Twitter activities in the #egypt and #libya hashtags – and as part of that work, I’ve been interested in exploring the interactions between users tweeting in Arabic and users tweeting in other languages (mainly in English). Unfortunately, there’s no reliable means of identifying the language of specific tweets, or of the users who post them; while the Twitter API provides an ISO language code (e.g. ‘en’ for English, ‘no’ for Norwegian, etc.) for each tweet, this is drawn simply from the overall language setting of the user’s account, and not specific to each individual tweet itself. For users who alternate between languages in their tweeting, all tweets will be tagged with their chosen language code; for users who haven’t bothered to change their Twitter profile settings away from the default English, all their tweets will be tagged ‘en’, regardless of their actual language.
So far, so unhelpful. Further, short of running every tweet through some form of automatic language recognition tool (using Google Translate or a similar mechanism, for example), which would be extremely time-consuming for Twitter archives upwards of a few thousand tweets, it is prohibitively difficult to identify the exact language of each tweet, not least because of the 140-character limit of tweets. In theory, if we had word corpora for all major languages, we could cross-check each tweet against those corpora to see what words from what language occur most frequently; but again, that process would be extremely time-consuming, and would probably have serious difficulties with the abbreviations and contractions which Twitter users commonly employ to stay within that limit.
A much simpler approach – which does generate somewhat less conclusive results, though – works by examining the character sets used in tweets. This is able to make only relatively broad distinctions, but it’s good enough for what I’m trying to achieve with my #egypt/#libya datasets: here, a quick qualitative look at the data suggests that the major division is between Arabic tweets and tweets in English (and to some extent in other European languages) – so the main challenge is to distinguish between Latin and Arabic character sets. This we can do, even just with a basic Gawk script.
Twitter datasets as they are generated by our standard hashtag tracking solution, yourTwapperkeeper, are available in UTF-8 encoding, leaving virtually all characters and character sets intact. Each character is assigned a specific character code, and for historical reasons, the basic characters of the Latin script (unaccented letters, standard punctuation marks, etc.) retain their traditional ASCII codes, with values below 128; beyond that range, we’re moving into accented letters, more unusual punctuation marks, and non-Latin character sets. Sadly, our preferred tool for processing yourTwapperkeeper datasets, Gawk, doesn’t cope all that well with advanced UTF-8 characters – it copes fine with single-byte character codes (i.e. below 256), but not with multi-byte character codes (above 255; it reads these as multiple single-byte characters). At least on a Windows PC, there doesn’t seem to be any way to change that behaviour, either.
However, that’s still good enough for our immediate purpose of distinguishing between Latin and non-Latin (i.e. mainly English and Arabic) tweets. As it turns out, Gawk consistently sees Arabic characters as a sequence of two codes: of either 216 (Ø) or 217 (Ù), followed by another character with a code above 127. So, for a basic distinction between tweets using Latin and tweets using non-Latin scripts, we simply need to count the number of high-ASCII characters (with a code above 127) which Gawk sees in each tweet, and to set a threshold below which a tweet is still classified as ‘Latin’ (to allow tweets that use accented characters or ‘fancy’ quotation marks to be classed as Latin). Through trial and error, I’ve found that a threshold of 20 (i.e. ten Arabic or other non-Latin characters) seems to work reasonably well: few tweets in languages using the Latin alphabet will be miscounted as ‘non-Latin’, even if they contain a number of umlauts or accented characters, while tweets in Arabic, Hebrew, Greek, Chinese, Korean, and other non-Latin alphabets are reliably recognised.
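To see why the byte-counting approach works, it helps to look at the UTF-8 bytes directly. The sketch below is in Python rather than Gawk and is only an illustration of the idea: it counts the bytes above 127 in a tweet's UTF-8 encoding and compares that count to a tolerance. The function names count_high_bytes and is_non_latin are made up for this example.

```python
def count_high_bytes(text):
    """Number of bytes with a value above 127 in the UTF-8 encoding of text."""
    return sum(1 for byte in text.encode("utf-8") if byte > 127)


def is_non_latin(text, tolerance=20):
    """Classify a tweet as non-Latin if it has more than `tolerance` high bytes."""
    return count_high_bytes(text) > tolerance


# Three Arabic letters (meem, sad, ra), written as escapes to keep this file ASCII
arabic = "\u0645\u0635\u0631"
arabic_bytes = list(arabic.encode("utf-8"))
print(arabic_bytes)                   # [217, 133, 216, 181, 216, 177]
print(count_high_bytes("caf\u00e9"))  # 2: the accented e takes two bytes
print(is_non_latin("Z\u00fcrich caf\u00e9"), is_non_latin(arabic * 4))
```

Each Arabic letter here shows up as a lead byte of 216 or 217 followed by a second byte above 127, so twelve Arabic letters produce 24 high bytes and pass a tolerance of 20, while a Latin-script tweet with a couple of accented letters stays well below it.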
We could use this to mark up the language of every line in a yourTwapperkeeper archive – but that’s not necessarily very useful or interesting. Instead, the script below operates on a user-by-user basis: for each user, it counts the number of their tweets which were above the ‘non-Latin’ threshold, and also calculates a language_ratio value: the percentage of their tweets which used non-Latin characters. The script accepts an optional ‘tolerance’ parameter, to set the ‘non-Latin’ threshold: a typical way to use it would be
gawk -F , -f userlanguage.awk tolerance=20 input.csv >output.csv
(tolerance defaults to zero if it isn’t set).
# userlanguage.awk - Extract stats on the language use of each user, as metrics for network visualisation in Gephi
#
# this script takes a Twapperkeeper CSV/TSV archive of tweets, and calculates for each user a ratio
# indicating how many of their tweets were in non-Latin charactersets
#
# output is in a format ready to be imported as a node list into the Gephi Data Laboratory
# on import, note that new data columns must be imported as 'float' type
#
# the script skips the first line, expecting that it contains header information
#
# script expects an optional numerical "tolerance" parameter, to set how many high-ASCII (non-Latin) characters a tweet may contain while still counted as Latin script
# set tolerance to ~20 to treat most accented European languages as Latin (note that Gawk will count some UTF-8 characters as two or more high-ASCII characters)
# default value for tolerance is 0
#
# expected data format:
# text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
#
# output format:
# nodes,id,label,user_tweets,user_highASCII_tweets,language_ratio
# (language_ratio is a value between 1 = no Latin tweets and 0 = 100% Latin tweets)
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

BEGIN {
    getline

    if(!tolerance) tolerance = 0;    # highASCII tolerance level: default 0

    for(char = 0; char < 256; char++) {
        charnum[sprintf("%c", char)] = char
    }

    print "Nodes" FS "Id" FS "Label" FS "user_tweets" FS "user_highASCII_tweets" FS "language_ratio"
}

{
    nodename[tolower($3)] = $3
    node[tolower($3),"tweets"]++

    highASCII = 0

    for(char = 1; char<=length($1); char++) {
        if(charnum[substr($1, char, 1)] > 127) highASCII++    # count number of high ASCII (>127) characters in tweet; note: some UTF-8 characters count as multiples
    }

    if(highASCII > tolerance) node[tolower($3),"highASCII"]++
}

END {
    for(name in nodename) {
        print name FS name FS nodename[name] FS node[name,"tweets"] FS node[name,"highASCII"] FS node[name,"highASCII"] / node[name,"tweets"]
    }
}
The resulting data can be used in a number of ways. For one, we might divide the total userbase into three groups: users who mainly used Latin characters (with a language_ratio below 0.33); users who mainly used non-Latin characters (language_ratio > 0.66); and users posting in a mix of languages (language_ratio between 0.33 and 0.66). If we further combine this grouping with the distinctions between lead users, highly active users, and less active users which the metrify.awk script makes possible, we now have the ability to examine the prevalence of different languages across these different groups – for #egypt during February 2011, this is what results, for example:
An interesting result: while ‘Latin’ (in this case, mainly English-speaking) users dominate overall, they’re mainly found amongst the less engaged 90% of users – they’re making or retweeting a small number of hashtagged comments about the situation in Egypt during February. The most engaged one per cent of users contain a much larger percentage of Arabic (i.e. non-Latin) speakers, as well as a sizeable proportion of users tweeting in a mix of languages and character sets.
(Note: of course, speakers of languages such as Chinese, Korean, Japanese, Greek, Hebrew, Russian, etc. will be included in the ‘non-Latin’ group here, and speakers of many European languages other than English will be counted amongst the ‘Latin’ group. In many cases, this will be a problem, and our approach here doesn’t allow for easy distinctions between, say, English and French, or Arabic and Hebrew. For our present purposes, however, that’s a negligible problem – few ‘non-Latin’ languages other than Arabic, and few ‘Latin’ languages other than English, are present in the #egypt dataset to any significant extent.)
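The three-way grouping by language_ratio, and its combination with the activity groups from metrify.awk, could be sketched as follows. This is a Python illustration rather than part of either Gawk script; the function names language_group and cross_tabulate, the group labels and the example users are assumptions.

```python
from collections import Counter


def language_group(language_ratio):
    """Group a user by the share of their tweets classed as non-Latin."""
    if language_ratio < 0.33:
        return "Latin"
    if language_ratio > 0.66:
        return "non-Latin"
    return "mixed"


def cross_tabulate(users):
    """Count users per (activity group, language group) combination.

    users: iterable of (activity_group, language_ratio) pairs.
    """
    return Counter((activity, language_group(ratio)) for activity, ratio in users)


# Made-up users: activity group label plus language_ratio
table = cross_tabulate([
    ("lowest 90%", 0.0),
    ("lowest 90%", 0.1),
    ("lowest 90%", 1.0),
    ("top 1%", 0.5),
    ("top 1%", 0.9),
])
print(table)
```

Dividing each count by the size of its activity group then gives the percentage of Latin, mixed and non-Latin users within that group.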
Additionally, the output of userlanguage.awk is designed to be easily imported into Gephi as an additional source of data on the users in the network. Assuming we’ve already created a network (for example showing @replies and retweets) for our dataset, using the Twitter usernames (normalised to lower case) as node IDs, we can now use the Data Laboratory to import the language data into the nodes table, as additional columns. Here, it’s important to make sure the numerical metrics generated by userlanguage.awk (user_tweets, user_highASCII_tweets, language_ratio) are imported as columns of the ‘Float’ type, in order to be able to use them effectively in Gephi.
(I’ll say much more about importing Twitter metrics data into Gephi in a future blog post – stay tuned.)
Once imported, these metrics are now available to be used for various purposes: as a means of sizing or colouring nodes in the network, or as criteria for filtering it. To finish off for now, here’s a simple example, which shows @replies and retweets in the #egypt hashtag during February 2011. I’ve used the language_ratio value as the guide for the colour scale here: blue indicates a language_ratio close to zero (predominantly tweeting in Latin characters); green a language_ratio close to one (predominantly tweeting in non-Latin characters); with a gradient of colours between them. Connections between users are coloured according to the language ratio of the sender. (Full graph here – PNG, 9 MB.)
There’s an obvious language divide here – English- and Arabic-speaking users are mainly tweeting amongst themselves. But there are also a good number of connections across the divide – and for these, given the graph above, the most active #egypt participants are disproportionately responsible: mixed-language users are much more likely to be found in that group than in any of the others.
And that’s it for now – more on my language analysis of #egypt and #libya when the paper gets published, and more on using Twitter metrics in Gephi in a future post!
In my new role as Deputy Director of the ARC Centre of Excellence for Creative Industries & Innovation (CCI for short), I’m excited to be leading the team that’s organising our most ambitious PhD and Early Career Researcher activity to date – the CCI Winter School, to be held in balmy Brisbane in late June this year. It’s a selective but free event (you or your institution only need to cover your travel), involving a fairly small group of promising PhD students and early career researchers from around the world. If you’re in the northern hemisphere and looking for a 2012 summer research school, why not consider being adventurous and coming down under instead? Axel and I will both be on hand as mentors, along with a bunch of other fabulous people.
Applications close on 31 January – don’t miss out!
CCI’s 2012 Winter School (coinciding with summer in the northern hemisphere) offers selected doctoral students and early career researchers a week-long program of interdisciplinary study, collaboration and social interaction in the broad area of creative industries and innovation research, drawing on the Centre’s expertise in media, cultural and communication studies, economics, education, policy and law, in relation to the creative economy.
We welcome applications from emerging scholars working on related topics including, but not limited to:
Participants will work with leading researchers, engage in intensive workshop activities and receive direct feedback and individual mentoring on their own work. Social activities will provide additional opportunities for participants to get to know each other and form collaborative relationships that will last for years to come.
For all the info, lists of mentors, an indicative program and the online application form, visit the CCI Winter School website.