A Python module for clustering creators of social media content into networks

Last update: Dec 30, 2022


Overview

sm_content_clustering

A Python module for clustering creators of social media content into networks.

It currently supports identifying potential networks of Facebook Pages from the CSV output files produced by CrowdTangle.

Installation

Install via pip with:

pip install git+https://github.com/jdallen83/sm_content_clustering

Installation requires pandas and fasttext.

Language Prediction

To enable language prediction, you will need to download a fasttext language model. The module was tested with lid.176.ftz.
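As a rough illustration of what such a model does on its own, the sketch below loads a fasttext language-identification model and predicts the language of a single string. It calls the fasttext Python API directly and does not reflect how this module uses the model internally. The helper names `strip_label` and `predict_language` are hypothetical, and the model path is a placeholder.

```python
def strip_label(label):
    """Turn a fasttext label such as '__label__en' into 'en'."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def predict_language(text, model_path="/path/to/lid.176.ftz"):
    """Return (language_code, probability) for one piece of text.

    Hypothetical helper for illustration; not part of sm_content_clustering.
    """
    # Imported lazily so strip_label can be used without fasttext installed.
    import fasttext

    model = fasttext.load_model(model_path)
    # fasttext predicts one line at a time, so newlines are replaced with spaces.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return strip_label(labels[0]), float(probs[0])
```

Calling `predict_language("some post text")` requires both the fasttext package and a downloaded model file at the given path.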

Usage

Command line

The package can be run as a module from the command line.

For usage guide:

python -m sm_content_clustering -h

Example that will create an output CSV with potential networks and predicted languages from several input CSVs:

python -m sm_content_clustering --add_language --ft_model_path /path/to/lid.176.ftz --output_path /path/to/output.csv --min_threshold 0.03 /path/to/input_1.csv /path/to/input_2.csv

Python

The module can also be called from within Python.

Example that will generate a pandas DataFrame containing potential networks:

import sm_content_clustering.sm_processor as sm_processor

input_files = ['/path/to/1.csv', '/path/to/2.csv', '/path/to/3.csv']
df = sm_processor.ct_generate_page_clusters(input_files, add_language=True, ft_model_path='/path/to/lid.176.ftz')
print(df)

Options

Arguments for sm_processor.ct_generate_page_clusters() are:

infiles: Input files to read content from. Required.
content_cols: Which columns from the input files to use as content for the purposes of clustering identical posts. Default: Message, Image Text, Link, Link Text
add_language: Whether to predict the page and network languages. Default: False
ft_model_path: Path to fasttext model file. Default: None
outfile: Path to write output CSV with potential networks. Default: None
update_every: How often to print clustering status (once every N pages). Default: 1000
min_threshold: Minimum similarity score for clustering. Ranges from 0 (no overlap) to 1 (perfect, high-confidence overlap). Default: 0.03
second_cluster_factor: The best-matched cluster for a page must score a factor X above the second-best-matched cluster. Default: 2.5
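A minimal sketch of passing several of these options together is shown below. The keyword names are taken from the list above; the values and file paths are illustrative only, and the wrapper name `run_clustering` is hypothetical.

```python
# Keyword names come from the option list above; the values are illustrative.
clustering_options = {
    "outfile": "/path/to/output.csv",
    "update_every": 500,
    "min_threshold": 0.05,           # stricter than the 0.03 default
    "second_cluster_factor": 3.0,    # stricter than the 2.5 default
}


def run_clustering(infiles, options=None):
    """Hypothetical wrapper: forward the options to ct_generate_page_clusters."""
    # Imported lazily so this sketch can be loaded without the package installed.
    import sm_content_clustering.sm_processor as sm_processor

    return sm_processor.ct_generate_page_clusters(
        infiles, **(options or clustering_options)
    )
```

Raising `min_threshold` or `second_cluster_factor` makes it harder for a page to join an existing cluster, which would be expected to yield fewer, tighter networks.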

Methodology

The module assumes you have social media content that includes the body of each message and the account that created it. It begins by grouping the data by message and finding which accounts have shared identical messages within the dataset. It then applies a basic agglomerative clustering algorithm to group the accounts into clusters that frequently share the same messages.
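The grouping step can be pictured with a small standard-library sketch. This is an illustration on toy data, not the module's actual code; the function names and the sample posts are made up for the example.

```python
from collections import defaultdict

# Toy (account, message) pairs; in practice these would come from the input CSVs.
posts = [
    ("page_a", "Breaking news!"),
    ("page_b", "Breaking news!"),
    ("page_c", "Breaking news!"),
    ("page_a", "Daily recipe"),
    ("page_b", "Daily recipe"),
    ("page_c", "Something unique"),
]


def accounts_by_message(posts):
    """Map each message text to the set of accounts that posted it."""
    groups = defaultdict(set)
    for account, message in posts:
        groups[message].add(account)
    return groups


def shared_messages(posts):
    """Keep only messages that more than one account posted."""
    return {
        message: accounts
        for message, accounts in accounts_by_message(posts).items()
        if len(accounts) > 1
    }


shared = shared_messages(posts)
```

Here `shared` keeps "Breaking news!" and "Daily recipe" and drops the message that only one page posted, since a message seen once carries no evidence of coordination.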

The clustering loops through the list of all accounts, normally sorted in descending order of size or popularity. For each account, it searches all existing clusters for a valid match, given the min_threshold and second_cluster_factor parameters. If there is a match, the account is added to the existing cluster. If there is no match and the account has enough messages to justify it, a new cluster is created with the account acting as the seed. Otherwise, the account is discarded.
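The loop described above might look roughly like the following sketch. It is not the module's implementation: the score is a simple stand-in (the fraction of an account's messages already present in a cluster), "a factor X above" is read as "at least X times the second-best score", and `min_messages` is a hypothetical name for the seed requirement.

```python
def overlap_score(messages, cluster_messages):
    """Stand-in score: fraction of the account's messages already in the cluster."""
    if not messages:
        return 0.0
    return len(messages & cluster_messages) / len(messages)


def greedy_cluster(accounts, min_threshold=0.03, second_cluster_factor=2.5,
                   min_messages=2):
    """Cluster accounts given as (name, set_of_messages), sorted largest first."""
    clusters = []  # each cluster is {"members": [...], "messages": set()}
    for name, messages in accounts:
        # Score the account against every existing cluster, best score first.
        scored = sorted(
            ((overlap_score(messages, c["messages"]), i)
             for i, c in enumerate(clusters)),
            reverse=True,
        )
        best_score, best_index = scored[0] if scored else (0.0, None)
        second_score = scored[1][0] if len(scored) > 1 else 0.0
        is_match = (
            best_index is not None
            and best_score >= min_threshold
            and best_score >= second_cluster_factor * second_score
        )
        if is_match:
            clusters[best_index]["members"].append(name)
            clusters[best_index]["messages"] |= messages
        elif len(messages) >= min_messages:
            # Enough messages to seed a new cluster.
            clusters.append({"members": [name], "messages": set(messages)})
        # Otherwise the account is discarded.
    return clusters


accounts = [
    ("a", {"m1", "m2", "m3"}),
    ("b", {"m1", "m2"}),
    ("c", {"x1", "x2"}),
    ("d", {"x1"}),
    ("e", {"z"}),
]
member_lists = [c["members"] for c in greedy_cluster(accounts)]
```

In this toy run, "a" and "c" seed clusters, "b" and "d" join them, and "e" is discarded because it matches nothing and has too few messages to seed a cluster.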

In theory, any measure could be used to determine whether a given account should be added to a given cluster, such as the fraction of the account's messages that match those within the cluster. Currently, the module combines message coverage, Normalized Pointwise Mutual Information, and a dampening factor that reduces the matching score when there are too few messages to be confident.
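To make the three ingredients concrete, the sketch below computes each one separately and then multiplies them. Only the NPMI formula is standard; the dampening form `n / (n + k)`, the constant `k`, the clipping of negative NPMI to zero, and the product used to combine the terms are all assumptions for illustration, since the exact formula the module uses is not stated here.

```python
import math


def npmi(p_x, p_y, p_xy):
    """Normalized Pointwise Mutual Information, in the range [-1, 1]."""
    if p_xy <= 0.0:
        return -1.0  # the two events never co-occur
    if p_xy >= 1.0:
        return 1.0   # degenerate case; avoids dividing by log(1) == 0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


def coverage(account_messages, cluster_messages):
    """Fraction of the account's messages that also appear in the cluster."""
    if not account_messages:
        return 0.0
    return len(account_messages & cluster_messages) / len(account_messages)


def dampening(n_messages, k=5.0):
    """Hypothetical dampening term: near 0 for few messages, approaching 1 for many."""
    return n_messages / (n_messages + k)


def match_score(account_messages, cluster_messages, total_messages, k=5.0):
    """Hypothetical combination of the three terms (not the module's formula)."""
    p_x = len(account_messages) / total_messages
    p_y = len(cluster_messages) / total_messages
    p_xy = len(account_messages & cluster_messages) / total_messages
    association = max(npmi(p_x, p_y, p_xy), 0.0)  # ignore negative association
    return (coverage(account_messages, cluster_messages)
            * association
            * dampening(len(account_messages), k))


score = match_score({"m1", "m2"}, {"m1", "m2", "m3"}, total_messages=10)
```

With only two messages, the dampening term keeps `score` well below 1 even though every message from the account appears in the cluster, which is the behavior the paragraph above describes.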

At the end, any clusters that are below a size threshold are discarded.

License

MIT License
