Abstract
Background and aims
Smart devices, social media and streaming platforms have made watching videos, from movies to short social media clips, extremely popular. Alcohol portrayals are frequent in videos, yet their prevalence is difficult to quantify using traditional methods such as manual coding. Artificial intelligence (AI) offers a scalable way to analyse large volumes of video images. This study aimed to compare the accuracy of three AI models in detecting alcohol presence in video images.
Method
Experimental evaluation of three models: one supervised deep learning model (ABIDLA2) and two zero-shot learning models (ZSL-CLIP and ZSL-LLaVA). The models were tested on datasets of images and video frames that researchers had annotated for the presence or absence of alcohol. Three datasets of increasing complexity were used: (1) a Google/Bing image set of clearly visible alcohol and non-alcohol images; (2) a set of movie frames manually annotated as containing or not containing alcohol; and (3) a contextually challenging set of movie frames from alcohol-related settings (e.g. bars, parties) that may or may not include visible alcohol. Model performance was assessed using accuracy, unweighted average recall (UAR) and the F1 score, which represents the balance between precision and recall. Execution time per frame was also measured to evaluate computational efficiency.
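As an illustration of the three performance metrics named above, the following minimal sketch computes accuracy, UAR and F1 for a binary alcohol/no-alcohol task. It is not the study's evaluation code: the function name `binary_metrics`, the label coding (1 = alcohol, 0 = no alcohol) and the choice to report F1 for the alcohol class only are assumptions made for this example.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, UAR and F1 for binary labels (1 = alcohol, 0 = no alcohol).

    Hypothetical helper for illustration; F1 is computed for the
    positive (alcohol) class, which is an assumption of this sketch.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Confusion-matrix counts
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)

    # Per-class recall; UAR is their unweighted mean, so each class
    # counts equally regardless of how many frames it contains.
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    uar = (recall_pos + recall_neg) / 2

    # F1 is the harmonic mean of precision and recall.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall_pos
    f1 = 2 * precision * recall_pos / denom if denom else 0.0

    return {"accuracy": accuracy, "uar": uar, "f1": f1}


if __name__ == "__main__":
    # Toy labels, not data from the study
    truth = [1, 1, 1, 0, 0, 0, 0, 0]
    preds = [1, 1, 0, 0, 0, 0, 0, 1]
    print(binary_metrics(truth, preds))
```

Because UAR averages the recall of each class without weighting by class size, it is less sensitive than accuracy to an imbalance between alcohol and non-alcohol frames.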
Results
ABIDLA2, ZSL-CLIP and ZSL-LLaVA achieved accuracies of 90%, 91% and 92%, respectively, on the Google/Bing images; 70%, 65% and 95% on the diverse movie-scene dataset; and 67%, 63% and 94% on the most complex alcohol-related dataset. In terms of execution time, ABIDLA2 processed a single frame the fastest (0.21 seconds), followed by ZSL-LLaVA (0.45 seconds), while ZSL-CLIP was the slowest (0.58 seconds).
Conclusion
AI models appear to be able to detect alcohol imagery in videos at large scale with high accuracy and in near real time. Of the three AI models tested, ZSL-LLaVA achieved the best balance between accuracy and speed. Offering a cost- and time-efficient alternative to labour-intensive manual coding, ZSL-LLaVA could be used to monitor alcohol-related visual content in videos across diverse media platforms.