A Python module for clustering creators of social media content into networks

Overview

sm_content_clustering

The module currently supports identifying potential networks of Facebook Pages from the CSV output files produced by CrowdTangle.

Installation

Install via pip with:

pip install git+https://github.com/jdallen83/sm_content_clustering

Installation requires pandas and fasttext.

Language Prediction

To enable language prediction, you will need to download a fasttext language identification model. The module was tested with lid.176.ftz.
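As a rough illustration of what language prediction with such a model looks like, the sketch below calls the public fasttext Python API (`fasttext.load_model` and `model.predict`) directly. The helper names `label_to_lang` and `predict_language` are hypothetical and are not part of sm_content_clustering, whose internal handling may differ.

```python
def label_to_lang(label):
    """Turn a fastText label such as '__label__en' into 'en'."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def predict_language(text, model_path):
    """Return (language_code, confidence) for one piece of text.

    Hypothetical helper. Assumes model_path points to a downloaded
    language identification model such as lid.176.ftz.
    """
    import fasttext  # imported lazily so label_to_lang works without it

    model = fasttext.load_model(model_path)
    # fastText predicts one line at a time, so newlines are flattened first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return label_to_lang(labels[0]), float(probs[0])
```

Loading the model once and reusing it across many texts would be cheaper than reloading it on each call as this short sketch does.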

Usage

Command line

The package can be run as a module from the command line.

To print the usage guide:

python -m sm_content_clustering -h

Example that will create an output CSV with potential networks and predicted languages from several input CSVs:

python -m sm_content_clustering --add_language --ft_model_path /path/to/lid.176.ftz --output_path /path/to/output.csv --min_threshold 0.03 /path/to/input_1.csv /path/to/input_2.csv

Python

The module can also be called from within Python.

Example that will generate a pandas DataFrame containing potential networks:

import sm_content_clustering.sm_processor as sm_processor

input_files = ['/path/to/1.csv', '/path/to/2.csv', '/path/to/3.csv']
df = sm_processor.ct_generate_page_clusters(input_files, add_language=True, ft_model_path='/path/to/lid.176.ftz')
print(df)

Options

Arguments for sm_processor.ct_generate_page_clusters() are:

  1. infiles: Input files to read content from. Required.
  2. content_cols: Which columns from the input files to use as content for the purposes of clustering identical posts. Default: Message, Image Text, Link, Link Text
  3. add_language: Whether to predict the page and network languages. Default: False
  4. ft_model_path: Path to fasttext model file. Default: None
  5. outfile: Path to write output CSV with potential networks. Default: None
  6. update_every: How often to output clustering status (a status line is printed once every N pages). Default: 1000
  7. min_threshold: Minimum similarity score for clustering. Ranges from 0 to 1, where 0 means no overlap and 1 means a perfect, high-confidence overlap. Default: 0.03
  8. second_cluster_factor: Requirement that the best matched cluster for a page must score a factor X above the second best matched cluster. Default: 2.5
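To make the interaction between min_threshold and second_cluster_factor concrete, here is a minimal sketch of one way the two checks could be applied to a page's per-cluster scores. The function `pick_cluster` is hypothetical, and it assumes that "a factor X above" means the best score must be at least X times the second-best score; the module's actual rule may differ in detail.

```python
def pick_cluster(scores, min_threshold=0.03, second_cluster_factor=2.5):
    """Return the id of the cluster a page should join, or None.

    scores maps cluster id -> similarity score in [0, 1].
    Assumed rule: the best score must reach min_threshold and be at
    least second_cluster_factor times the second-best score.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score < min_threshold:
        return None  # no cluster is similar enough
    if len(ranked) > 1:
        second_score = ranked[1][1]
        if best_score < second_cluster_factor * second_score:
            return None  # best match is not clearly ahead of the runner-up
    return best_id
```

Under this reading, raising second_cluster_factor makes assignments more conservative, because a page that matches two clusters almost equally well is left unassigned.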

Methodology

The module assumes you have social media content that includes the body of each message and the account that created it. It begins by grouping the data by message and finding which accounts have shared identical messages within the dataset. It then applies a basic agglomerative clustering algorithm to group the accounts into clusters that frequently share the same messages.
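The grouping step can be sketched with the standard library alone. This is an illustration on made-up data, not the module's code: the function names and the (account, message) input format are assumptions, and the real module reads CrowdTangle CSV columns instead.

```python
from collections import defaultdict


def accounts_by_message(posts):
    """Map each message body to the set of accounts that posted it.

    posts is an iterable of (account, message) pairs.
    """
    sharers = defaultdict(set)
    for account, message in posts:
        if message:  # skip empty content
            sharers[message].add(account)
    return sharers


def shared_messages(posts):
    """Keep only messages posted by more than one account."""
    return {
        message: accounts
        for message, accounts in accounts_by_message(posts).items()
        if len(accounts) > 1
    }


# Toy data, invented for illustration only.
posts = [
    ("Page A", "Breaking: example story"),
    ("Page B", "Breaking: example story"),
    ("Page C", "Something unrelated"),
    ("Page A", "Another shared post"),
    ("Page C", "Another shared post"),
]
shared = shared_messages(posts)
```

Here `shared` keeps the two messages that appear under more than one page, which is the raw material the clustering step works from.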

The clustering loops through the list of all accounts, normally sorted in descending order of size or popularity. For each account, it searches all existing clusters for a valid match, given the min_threshold and second_cluster_factor parameters. If there is a match, the account is added to that cluster. If there is no match and the account has enough messages to justify it, a new cluster is created with the account acting as the seed. Otherwise, the account is discarded.
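The loop described above can be sketched as follows. This is a simplified stand-in rather than the module's implementation: it scores a match using only the fraction of the account's messages already present in the cluster, and the names `coverage`, `greedy_cluster`, and `min_seed_messages` are hypothetical.

```python
def coverage(account_msgs, cluster_msgs):
    """Fraction of the account's messages that also appear in the cluster."""
    if not account_msgs:
        return 0.0
    return len(account_msgs & cluster_msgs) / len(account_msgs)


def greedy_cluster(accounts, min_threshold=0.03, second_cluster_factor=2.5,
                   min_seed_messages=2):
    """Cluster accounts in a single greedy pass.

    accounts is a list of (name, set_of_messages), already sorted so the
    largest or most popular accounts come first.
    """
    clusters = []  # each cluster: {"members": [...], "messages": set()}
    for name, msgs in accounts:
        # Score the account against every existing cluster, best first.
        scored = sorted(
            ((coverage(msgs, c["messages"]), i) for i, c in enumerate(clusters)),
            reverse=True,
        )
        best_score, best_idx = scored[0] if scored else (0.0, None)
        second_score = scored[1][0] if len(scored) > 1 else 0.0
        if (best_idx is not None
                and best_score >= min_threshold
                and best_score >= second_cluster_factor * second_score):
            target = clusters[best_idx]
            target["members"].append(name)
            target["messages"] |= msgs
        elif len(msgs) >= min_seed_messages:
            # No valid match, but enough messages to seed a new cluster.
            clusters.append({"members": [name], "messages": set(msgs)})
        # Otherwise the account is discarded.
    return clusters
```

Because each account is visited once and clusters only grow, the result depends on the input order, which is why sorting the accounts beforehand matters.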

In theory, any measure could be used to decide whether a given account should be added to a given cluster, such as the fraction of the account's messages that match those within the cluster. Currently, the module combines message coverage, Normalized Pointwise Mutual Information (NPMI), and a dampening factor that reduces the matching score when there are too few messages to be confident.
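For reference, Normalized Pointwise Mutual Information can be computed from simple message counts, as in the sketch below. The `npmi` function follows the standard definition, ln(p(a,c) / (p(a) p(c))) / -ln p(a,c). The `dampen` function is purely an assumed example of a small-sample penalty, since the exact dampening form and the way the three components are combined are not specified here.

```python
import math


def npmi(n_both, n_account, n_cluster, n_total):
    """Normalized Pointwise Mutual Information from message counts.

    n_both: messages shared by both the account and the cluster
    n_account, n_cluster: messages shared by each one
    n_total: all distinct messages in the dataset
    Returns a value in [-1, 1]; 0 means statistical independence.
    """
    if n_both == 0:
        return -1.0  # the two never co-occur
    p_both = n_both / n_total
    if p_both == 1.0:
        return 1.0  # every message is in both; avoids dividing by log(1) = 0
    p_account = n_account / n_total
    p_cluster = n_cluster / n_total
    pmi = math.log(p_both / (p_account * p_cluster))
    return pmi / -math.log(p_both)


def dampen(score, n_messages, k=5):
    """Hypothetical dampening: shrink the score when n_messages is small.

    The n / (n + k) form and the constant k are assumptions for illustration.
    """
    return score * n_messages / (n_messages + k)
```

With these definitions, an account whose messages all appear in the cluster and nowhere else scores close to 1, while an account that overlaps only as much as chance would predict scores close to 0.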

At the end, any clusters that are below a size threshold are discarded.

License

MIT License
