
Data Science Behind r/antiwork Upvotes

Abstract and 1. Introduction

2. Related work

3. Methodology

4. Results

5. Discussion

6. Conclusion, references and appendix

3 Methodology

3.1 Data

We downloaded all posts and comments on the r/antiwork subreddit from January 1, 2019 to July 31, 2022 using the Pushshift API[9] [3]. We only considered posts with at least one comment, as an alternative to filtering out reposts that refer to the same event, off-topic posts, and spam, as well as posts that did not receive any user engagement for other reasons. The resulting data set contained 304,096 posts and 12,141,548 comments. These posts were made by 119,746 users (posters) and the comments were made by 1,298,451 users (commenters).
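The collection step can be reproduced with a short script. The sketch below is a minimal illustration assuming the public api.pushshift.io submission endpoint and its documented subreddit/after/before/size parameters; the paper does not describe the authors' actual collection code, and an analogous loop over the comment endpoint would be needed for comments.

```python
# A sketch of paginated collection from the Pushshift submission endpoint,
# paging forward by created_utc; the comment endpoint works analogously.
import time
import requests

BASE = "https://api.pushshift.io/reddit/search/submission"

def fetch_submissions(subreddit: str, after: int, before: int, size: int = 500):
    """Yield submissions created in [after, before) as dicts."""
    while True:
        params = {"subreddit": subreddit, "after": after,
                  "before": before, "size": size, "sort": "asc"}
        resp = requests.get(BASE, params=params, timeout=30)
        resp.raise_for_status()
        batch = resp.json()["data"]
        if not batch:
            return
        yield from batch
        after = batch[-1]["created_utc"]  # resume after the newest item seen
        time.sleep(1)                     # stay well under the rate limit
```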

We further filtered the data set to remove comments that could bias our analysis. We removed comments that: (1) were deleted by users or moderators but remained in the data set as placeholders (such comments are typically removed for violating community guidelines), or (2) were made by bots (for example AutoModerator, or where the comment body begins with "I am a bot," as many do by convention). After filtering, 11,665,342 comments remained in the data set (96.1%). We removed posts that had no comments left after filtering, leaving 284,449 posts (93.5%).
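A minimal sketch of the two filtering rules is below, under assumed representations: removed or deleted comments appear as "[removed]"/"[deleted]" placeholder bodies, and bots are matched by the AutoModerator account name or the conventional "I am a bot" opener. The authors' exact matching rules are not specified.

```python
# Sketch of the two filtering rules described above (assumed representations).
PLACEHOLDERS = {"[removed]", "[deleted]"}

def keep_comment(comment: dict) -> bool:
    body = comment.get("body", "").strip()
    if body in PLACEHOLDERS:                      # rule (1): removal placeholders
        return False
    if comment.get("author") == "AutoModerator":  # rule (2): known bot account
        return False
    if body.lower().startswith("i am a bot"):     # rule (2): conventional bot opener
        return False
    return True

# Example usage: filtered = [c for c in raw_comments if keep_comment(c)]
```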

3.2 Definitions

3.2.1 Types of users. In our analysis, we compare the behavior of two groups of users that we refer to as "light" and "heavy" users of r/antiwork. We define light posters or commenters as those with only one post or only one comment in the data set, respectively. The majority of posters are light posters (75.1%) and a large percentage of commenters are light commenters (42.5%). We define heavy posters or commenters as the top 1% of users when ranked in descending order by number of posts or comments, respectively. Overall, heavy posters made 10.1% of the posts and heavy commenters were responsible for 29.8% of the comments.
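These definitions translate directly into code. A minimal sketch, assuming per-user activity counts are available as a Counter (the top-1%-by-rank rule follows the definition above):

```python
# "Light" = exactly one post (or comment); "heavy" = top 1% of users by count.
from collections import Counter

def classify_users(counts: Counter):
    ranked = [user for user, _ in counts.most_common()]  # descending by count
    n_heavy = max(1, round(0.01 * len(ranked)))          # top 1% of users
    heavy = set(ranked[:n_heavy])
    light = {user for user, n in counts.items() if n == 1}
    return light, heavy

# Example with a hypothetical list of post dicts carrying an "author" field:
posts = [{"author": "alice"}, {"author": "alice"}, {"author": "bob"}]
light_posters, heavy_posters = classify_users(Counter(p["author"] for p in posts))
```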

3.2.2 Time periods. For our topic modeling analysis, we divided the data set into three time periods:

• Period 1: January 1, 2019 – October 14, 2021

• Period 2: October 15, 2021 – January 24, 2022

• Period 3: January 25, 2022 – July 31, 2022

These periods are delimited by two mainstream media events: a Newsweek article[10], the first example of a mainstream media article covering a viral post[11] on r/antiwork (October 15, 2021), and the Fox News interview with Doreen Ford (January 25, 2022). Period 2 is highlighted as a gray box in all figures where the axis represents time.
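Bucketing items into these periods is a simple timestamp comparison; a minimal sketch, assuming Pushshift-style created_utc epoch timestamps:

```python
# Assign items to the three periods; boundary dates come from the list above.
from datetime import datetime, timezone

P2_START = datetime(2021, 10, 15, tzinfo=timezone.utc)  # Newsweek article
P3_START = datetime(2022, 1, 25, tzinfo=timezone.utc)   # Fox News interview

def period_of(created_utc: int) -> int:
    dt = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    if dt < P2_START:
        return 1
    return 2 if dt < P3_START else 3
```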

3.3 Change point detection

We use classification and regression trees (CART) to detect change points [5]. CART is a non-parametric method that uses a decision tree to recursively partition the predictor space into homogeneous intervals (a process often called "splitting"). This recursive splitting is complemented by a complexity parameter that regularizes the cost of growing the tree by adding a penalty for additional splits ("pruning").

Figure 2: The total number of daily posts submitted to r/antiwork that received at least one comment. A large percentage of posts (29.6%) were made by light posters. Red dashed lines are the results of change point detection.

Figure 3: The total number of daily comments on r/antiwork. A large percentage of comments (29.8%) were made by heavy commenters. Red dashed lines are the results of change point detection.

In our case, we fit the regression tree with the number of daily posts or comments as the dependent variable and each day from January 1, 2019 to July 31, 2022 as the predictor space. We used the rpart R package to build the regression models [32], with the Gini index for splitting and a complexity parameter of 0.01 for pruning.
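The paper fits these models with the rpart R package; the sketch below is an analogous Python version using scikit-learn's DecisionTreeRegressor, where ccp_alpha stands in for rpart's complexity parameter (the 0.01 value mirrors rpart's default but is not numerically equivalent, since the two libraries scale the penalty differently). The split thresholds of the fitted tree are the detected change points.

```python
# Analogous change point detection with a pruned regression tree: the day
# index is the sole predictor and the daily count is the response.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def change_points(daily_counts):
    X = np.arange(len(daily_counts)).reshape(-1, 1)  # day index as predictor
    y = np.asarray(daily_counts, dtype=float)
    tree = DecisionTreeRegressor(ccp_alpha=0.01).fit(X, y)
    t = tree.tree_
    # Internal nodes carry split thresholds; these are the change points.
    return sorted(t.threshold[t.children_left != -1])
```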

3.4 Topic modeling

We use Latent Dirichlet Allocation (LDA) [4] for topic modeling. LDA is a generative model that identifies a set of latent topics by estimating the topic distributions within documents and the word distributions within topics for a given number of topics. In our case, we treat each post as a document and the contents of that document as the sequence of all comments on that post. We do not include the text of the post itself as part of the document because a large percentage of posted items consist of images. We preprocessed the comments for topic modeling by removing URLs and stop words, replacing accented characters with their ASCII equivalents, replacing contractions with their component words, and lemmatizing all words. Finally, we filtered out posts with fewer than 50 comments, leaving 11,368,863 comments (97.5%) across 181,913 posts (64.0%) for topic modeling.
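A sketch of this preprocessing pipeline is below. The library choices (nltk for stop words and lemmatization, unidecode for accent folding, contractions for expansion) are assumptions for illustration; the paper does not name the tools used for this step.

```python
# Strip URLs, expand contractions, fold accents to ASCII, drop stop words,
# and lemmatize. Library choices here are assumed, not from the paper.
import re

import contractions                        # pip install contractions
from nltk.corpus import stopwords          # nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer    # nltk.download("wordnet")
from unidecode import unidecode            # pip install unidecode

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
URL_RE = re.compile(r"https?://\S+")

def preprocess(text: str) -> list[str]:
    text = URL_RE.sub(" ", text)               # remove URLs
    text = unidecode(contractions.fix(text))   # expand contractions, fold accents
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOP]
```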

LDA was applied to each of the three time periods separately (see Section 3.2.2). Periods 1, 2, and 3 contained 40,794; 71,470; and 69,649 posts, respectively. We evaluate the quality of the topic models using the coherence score [24] to determine the optimal number of topics. Each topic was labeled by a human annotator familiar with r/antiwork, and topics were aligned across the models using these labels and the Jensen-Shannon distance between the topics' word distributions. Topic modeling was implemented using the Gensim Python library [26].
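A minimal sketch of the Gensim fitting and coherence scoring described above, where each document is the token list of one post's concatenated comments and the candidate num_topics values are placeholders:

```python
# Fit LDA with Gensim and score topic coherence to choose the topic count.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def fit_lda(docs, num_topics):
    """docs: one token list per post (all of its comments, preprocessed)."""
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=10, random_state=0)
    score = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    return lda, score

# Example: choose the number of topics by coherence, separately per period.
# best_lda, best_score = max((fit_lda(period_docs, k) for k in (10, 20, 30, 40)),
#                            key=lambda pair: pair[1])
```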


[9] https://pushshift.io/

[10] https://www.newsweek.com/1639419

[11] https://www.reddit.com/r/antiWork/comments/q82VQK/
