Investigating the phenomenon of NSFW posts in Reddit

Abstract

In this paper, we study the characteristics of NSFW (Not Safe For Work) posts in Reddit, highlighting their differences from SFW (Safe For Work) posts, which have been studied much more extensively in the past literature. Our investigation covers all Reddit posts from 2019. Using both descriptive analytics and social network analysis techniques, we extract three findings on the main differences between NSFW and SFW posts in Reddit. These findings allow us to better understand the dynamics (authors, subreddits, readers) behind NSFW posts. In particular, it becomes clear that this is a niche world whose authors are strongly cohesive. At the same time, however, the most popular authors show a clear openness to new authors, with whom they are willing to collaborate from the beginning.

Introduction

Reddit is currently one of the most active social media platforms. It has been extensively studied by researchers in the past [19]. In [25], the authors present an interesting longitudinal analysis of the evolution of this social medium. Furthermore, many papers have focused on specific aspects of this social network, concerning, for example, community structures and interactions [27], [8], [10], user behavior [3], [14], [16], structure and content of subreddits, posts and comments [24], structural properties [10], [13], [33], text classification [15], user migration [21], and political and ideological aspects [12], [31].

One aspect of Reddit worth analyzing involves NSFW (Not Safe For Work) posts. This term refers to user-submitted content not suitable to be viewed in public or in professional contexts. Although NSFW posts are very common in this social medium, the phenomenon has received very little attention; in fact, only a very small number of authors have analyzed it [17], [20]. The term “NSFW” was first proposed in 1998 and is one of the oldest acronyms of the Internet. Since its first appearance, many social media, such as Twitter, WhatsApp and Reddit, have adopted it to label certain sections or contents. In addition, several authors have analyzed this phenomenon in other social networks. Two examples are the study of the role of images and selfies in NSFW content on tumblr.com, presented in [28], and the analysis of the anonymity level of NSFW content in both Twitter and Whisper, described in [7].

In this paper, we contribute to this line of research by investigating the phenomenon of NSFW posts in Reddit and describing the whole context (authors, subreddits and readers) behind it. For this purpose, we consider a dataset that includes all the posts published in Reddit from January 1st, 2019 to December 31st, 2019.

During our investigation, we carried out three types of analysis, namely:

Descriptive Analysis, to study the distributions of the entities involved in the phenomenon (e.g., the distribution of NSFW posts against subreddits, authors, score and comments).

Social Network Analysis, to study the co-posting phenomenon, and therefore the interactions between authors of NSFW posts.

Assortativity Analysis, to extend and deepen the previous analyses by verifying whether forms of assortativity [22] exist among the authors of NSFW posts. Recall that assortativity is a particular case of homophily in social networks [18], which indicates the tendency of a node to cooperate with nodes having similar characteristics.
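To make the last two analyses more concrete, the following minimal sketch builds a toy co-posting network and measures its degree assortativity. It rests on assumptions that are not stated in the text above: we assume that two authors are linked when they have posted in at least one common subreddit, and we measure assortativity with respect to node degree, as the Pearson correlation between the degrees found at the two ends of each edge. The data and the function names are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (author, subreddit) pairs, one per post.
POSTS = [
    ("u1", "s1"), ("u2", "s1"), ("u3", "s1"),
    ("u3", "s2"), ("u4", "s2"),
    ("u5", "s3"),
]

def build_co_posting_edges(posts):
    """Assumed rule: link two authors if they posted in a common subreddit."""
    authors_by_subreddit = defaultdict(set)
    for author, subreddit in posts:
        authors_by_subreddit[subreddit].add(author)
    edges = set()
    for authors in authors_by_subreddit.values():
        # Sorting makes (a, b) canonical, so each undirected edge is stored once.
        for a, b in combinations(sorted(authors), 2):
            edges.add((a, b))
    return edges

def degree_assortativity(edges):
    """Pearson correlation of endpoint degrees, each edge taken in both directions."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    xs, ys = [], []
    for a, b in edges:
        xs += [degree[a], degree[b]]
        ys += [degree[b], degree[a]]
    n = len(xs)
    if n == 0:
        return None
    mean = sum(xs) / n  # xs and ys share the same mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var > 0 else None

edges = build_co_posting_edges(POSTS)
r = degree_assortativity(edges)
print(sorted(edges), r)
```

A positive value of r would suggest that authors with many co-posting links tend to be connected to each other, whereas a negative value would suggest that highly connected authors are mostly linked to weakly connected ones. Other node properties (e.g., the number of posts published by an author) could replace the degree in the same computation.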

These analyses allowed us to extract three findings regarding NSFW posts, NSFW authors and NSFW subreddits, respectively. In most cases, we compare each finding on NSFW posts with the corresponding one on SFW (Safe For Work) posts. Some of the questions these findings answer are the following:

What can be said about the spread of NSFW posts in the subreddits?

What can be said about the quantity of posts an NSFW author usually submits?

What can be said about the score of NSFW posts?

What can be said about the number and the score of comments to NSFW posts?

What can be said about the level of interconnection between authors of NSFW posts?

Is there a backbone among experienced authors of NSFW posts? In other words, do they tend to interact only with their peers (i.e., authors with the same level of experience), or are they open to collaborations with new authors who have just started publishing NSFW posts?

Finally, we suitably combine the knowledge represented by the three findings in order to describe the dynamics behind the phenomenon of NSFW posts in Reddit.

The rest of this paper is organized as follows: In Section 2, we present related literature. In Section 3, we describe the dataset used in our analysis. In Section 4, we provide an overview of our investigation activity. In Section 5, we study various distributions involving NSFW posts. In Section 6, we study several distributions regarding comments to NSFW posts. In Section 7, we investigate the co-posting activity of the authors of NSFW posts. In Section 8, we evaluate the assortativity of the authors of NSFW posts. In Section 9, we combine the three findings derived during our investigation in order to define an overall picture of this phenomenon. Finally, in Section 10, we draw our conclusions and outline some possible developments of our research efforts.

Section snippets

Related literature

The term “NSFW” was first proposed in 1998 and it is one of the oldest acronyms of the Internet. It refers to content that is not suitable to be viewed in a working environment. Since then, different online systems, like Twitter, WhatsApp, many forums, and Reddit, have adopted this term to label sections with posted content not adequate for everybody and, in general, not suitable for public and professional contexts. Specifically, Reddit has introduced a dedicated group of contents called NSFW

Dataset description

The dataset used for our analysis was downloaded from the website pushshift.io [1], one of the main Reddit data sources. In particular, we extracted all the posts published on Reddit from January 1st, 2019 to September 1st, 2019. The number of posts available for our analysis was 150,795,895. In Reddit, an NSFW post must be marked as such by its author.
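As an illustration of how NSFW and SFW posts could be separated in such a dataset, consider the following minimal sketch. It assumes that the posts are available as newline-delimited JSON records and that the author-provided NSFW mark is stored in a boolean field named `over_18`; both the record layout and the sample values are assumptions made for this example, not a description of the actual extraction procedure.

```python
import json

# Hypothetical sample in newline-delimited JSON.
# Assumption: the boolean field `over_18` carries the NSFW mark set by the author.
SAMPLE_DUMP = """\
{"id": "a1", "author": "alice", "subreddit": "pics", "over_18": false, "score": 12}
{"id": "a2", "author": "bob", "subreddit": "example_nsfw", "over_18": true, "score": 3}
{"id": "a3", "author": "carol", "subreddit": "news", "over_18": false, "score": 40}
"""

def split_by_nsfw(lines):
    """Partition post records into (nsfw, sfw) lists according to the NSFW mark."""
    nsfw, sfw = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        post = json.loads(line)
        # A record without the field is treated as SFW in this sketch.
        (nsfw if post.get("over_18", False) else sfw).append(post)
    return nsfw, sfw

nsfw_posts, sfw_posts = split_by_nsfw(SAMPLE_DUMP.splitlines())
print(len(nsfw_posts), len(sfw_posts))
```

With a real dump, the same function could be fed the lines of a file read incrementally, so that the whole dataset never needs to be kept in memory at once.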

Overview of our investigation activity

Our investigation of the phenomenon of NSFW posts in Reddit follows the workflow shown in Fig. 1.

For layout reasons, this figure shows the dataset as an input only to the first module. Actually, the dataset is provided as an input to each module of the workflow. Similarly, the descriptive (resp., co-posting) knowledge, which is shown as an input to the co-posting (resp., assortativity) analysis module, is also an output of the investigation activity.

As we can see in Fig. 1, the first phase of

Investigating distributions involving NSFW posts

In this section, we present some analyses directly involving NSFW and SFW posts. In particular, we study the distribution of subreddits and authors against posts and the distribution of posts against the scores assigned to them by Reddit users.
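The kind of distributions mentioned above can be sketched as follows. The example counts how many posts each subreddit receives, derives the number of subreddits having exactly k posts, and compares the median score of NSFW and SFW posts. The toy records, the tuple layout and the function names are hypothetical; the choice of the median as a summary statistic is also an assumption made for this illustration.

```python
from collections import Counter
from statistics import median

# Hypothetical toy records: (subreddit, author, score, is_nsfw).
POSTS = [
    ("sub_a", "u1", 5, True),
    ("sub_a", "u2", 1, True),
    ("sub_a", "u1", 2, True),
    ("sub_b", "u3", 7, True),
    ("sub_c", "u4", 120, False),
    ("sub_d", "u5", 30, False),
    ("sub_e", "u4", 15, False),
]

def posts_per_key(posts, key_index):
    """Count the posts of each subreddit (key_index=0) or author (key_index=1)."""
    return Counter(post[key_index] for post in posts)

def count_distribution(per_key):
    """Map k to the number of keys (subreddits or authors) having exactly k posts."""
    return dict(sorted(Counter(per_key.values()).items()))

nsfw = [post for post in POSTS if post[3]]
sfw = [post for post in POSTS if not post[3]]

nsfw_sub_dist = count_distribution(posts_per_key(nsfw, 0))
sfw_sub_dist = count_distribution(posts_per_key(sfw, 0))
nsfw_author_dist = count_distribution(posts_per_key(nsfw, 1))
nsfw_median_score = median(post[2] for post in nsfw)
sfw_median_score = median(post[2] for post in sfw)

print(nsfw_sub_dist, sfw_sub_dist, nsfw_author_dist)
print(nsfw_median_score, sfw_median_score)
```

On real data, plotting these k-to-count mappings on a log-log scale is a common way to check whether a distribution is heavy-tailed.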

Investigating distributions on comments to NSFW posts

In this section, we analyze the comments to NSFW posts investigating their authors, the scores they get and the subreddits they are submitted to.

Investigating co-posting activity of the authors of NSFW posts

The goal of this analysis is to verify whether there is any correlation between the authors of NSFW posts. As usual, we will extract the information of interest and we will compare the behavior of the authors of NSFW posts with that of the authors of SFW posts. In this activity, we will use a support data structure that we call co-posting network. Having observed in all the previous experiments that the results obtained for the Jan-Feb datasets (i.e., D and D‾) are stable, from now on we will refer to these

Evaluating assortativity of the authors of NSFW posts

The concept of “assortativity”, or “assortative mixing”, in a social network denotes the tendency of its nodes to be connected with other nodes that are somehow similar to them. This concept, introduced by Newman [22], can be seen as an evolution of the concept of homophily [18], typical of Social Network Analysis. Assortativity is orthogonal to the node similarity metric considered, even if most of the authors in the literature have studied it with respect to node degree. According to this

Discussion

Combining together all the previous results, we can define three main findings related to posts, authors and subreddits, respectively. Some of these findings are made up of several sub-findings.

The three findings are the following:

PF (Finding on NSFW posts).

1. NSFW posts are generally published in far fewer subreddits, have much lower scores and receive far fewer comments than SFW posts.

2. The scores of comments to NSFW posts are much lower than those of comments to SFW posts.

AF (Finding on NSFW authors).

1. NSFW

Conclusion

In this paper, we have presented an approach to investigate NSFW posts in Reddit. We have seen that this type of content is frequent in this social medium and, despite this, there are very few studies on this subject in the past literature. We have tried to fill this gap and we have proposed an approach that investigates the phenomenon of NSFW posts in Reddit with descriptive, co-posting and assortativity analyses.

In this way, we have obtained three findings, which, together with the principles

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partially supported by: (i) the Italian Ministry for Economic Development (MISE) under the project “Smarter Solutions in the Big Data World”, funded within the call “HORIZON2020” PON I&C 2014–2020 (CUP B28I17000250008), and (ii) the Department of Information Engineering at the Polytechnic University of Marche under the project “A network-based approach to uniformly extract knowledge and support decision making in heterogeneous application contexts” (RSAB 2018).