Abstract
Supervised machine learning on textual data has successful industrial/business applications, but it is an open question whether it can be utilized in social knowledge building outside the scope of hermeneutically more trivial cases. Combining sociology and data science raises several methodological and epistemological questions. In our study the discursive framing of depression is explored in online health communities. Three discursive frameworks are introduced: the bio-medical, psychological, and social framings of depression. ~80 000 posts were collected, and a sample of them was manually classified. Conventional bag-of-words models, Gradient Boosting Machine, word-embedding-based models and a state-of-the-art Transformer-based model with transfer learning, called DistilBERT were applied to expand this classification on the whole database. According to our experience ‘discursive framing’ proves to be a complex and hermeneutically difficult concept, which affects the degree of both inter-annotator agreement and predictive performance. Our finding confirms that the level of inter-annotator disagreement provides a good estimate for the objective difficulty of the classification. By identifying the most important terms, we also interpreted the classification algorithms, which is of great importance in social sciences. We are convinced that machine learning techniques can extend the horizon of qualitative text analysis. Our paper supports a smooth fit of the new techniques into the traditional toolbox of social sciences.