The final speaker in this session at the Weizenbaum Conference is David Wegmann, who returns us to the Danish YouTube data donation study we discussed earlier: his work is to make sense of these data, with a particular focus on extracting features from the YouTube videos encountered by participating users.
Collectively, these users watched some 18 million videos, though some 3 million of these views were of advertisements inserted into organic YouTube videos. The remaining views cover some 7 million unique videos, indicating a typical long-tail distribution of user attention across these videos.
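As a very rough illustration of this kind of deduplication and distribution check (not the study's actual pipeline, and with hypothetical column names), one might do something like this in Python:

```python
# Minimal sketch: given a table of donated watch events with hypothetical
# columns "video_id" and "is_ad", drop the advertisements, count views per
# unique video, and inspect the long-tail distribution of attention.
import pandas as pd

events = pd.read_csv("watch_events.csv")     # hypothetical donated-data export
organic = events[~events["is_ad"]]           # remove inserted advertisements

views_per_video = organic["video_id"].value_counts()
print(f"{len(views_per_video):,} unique videos")

# In a long-tail distribution, a small share of videos captures most attention.
top_share = views_per_video.head(len(views_per_video) // 100).sum()
print(f"Top 1% of videos account for {top_share / views_per_video.sum():.1%} of views")
```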
Details about what types of videos these data represent are more difficult to extract, however; what is needed here is a method for classifying the content of these videos more efficiently. Such classification could draw on the audio and video tracks, for instance, and David explored this for a subset of some 20,000 videos from the dataset: he extracted video features from one keyframe every ten seconds, assessed the audio features, and used WhisperAI to explore their linguistic features.
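A hedged sketch of what such per-video feature extraction might look like follows below; the exact tools and features used in the study are not specified here, so this simply combines ffmpeg for keyframe sampling with the open-source Whisper model for transcription and language detection, and the file name is hypothetical.

```python
# Illustrative feature-extraction sketch, not the study's actual code.
import subprocess
import whisper

def extract_keyframes(video_path: str, out_dir: str, every_seconds: int = 10) -> None:
    """Save one frame every `every_seconds` seconds as JPEG files."""
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps=1/{every_seconds}",
         f"{out_dir}/frame_%04d.jpg"],
        check=True,
    )

def transcribe(video_path: str) -> dict:
    """Transcribe the audio track and detect its language with Whisper."""
    model = whisper.load_model("base")
    result = model.transcribe(video_path)
    return {"language": result["language"], "text": result["text"]}

# Hypothetical usage on a single downloaded video:
features = transcribe("example_video.mp4")
extract_keyframes("example_video.mp4", "keyframes")
print(features["language"], len(features["text"].split()), "words transcribed")
```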
On average, these videos are some 12.5 minutes long, have 300,000 likes, were watched some 800 days after upload, and contain English-language content in 79% of cases. David then also clustered these videos by their feature patterns, which enables the identification of groups of videos with similar characteristics. The accuracy of this approach remains low so far, however, and its scalability also remains limited; at this stage, the approach also works better for longer than for shorter videos.
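As a purely illustrative sketch of clustering videos by their feature patterns (assuming each video has already been reduced to a numeric feature vector of duration, engagement, audio and transcript descriptors, and so on), one could run something like the following; k-means and the number of clusters here are placeholder choices, not necessarily those of the study:

```python
# Hedged clustering sketch over placeholder feature vectors.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 16))             # stand-in for real per-video features

X_scaled = StandardScaler().fit_transform(X)  # put features on a common scale
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_scaled)

# Inspect cluster sizes to see which groups of similar videos emerge.
print(np.bincount(labels))
```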