Generating synthetic mobility data for a realistic population with RNNs to improve utility and privacy

lbs-data

Motivation

Location data is collected from the public by private firms via mobile devices. Can this data also be used to serve the public good while preserving privacy? Can we realize this goal by generating synthetic data for use instead of the real data? The synthetic data would need to balance utility and privacy.

Overview

What:

This project uses location-based services (LBS) data provided by a location intelligence company to train an RNN model to generate synthetic location data. The goal is for the synthetic data to maintain the properties of the real data, at the individual and aggregate levels, in order to retain its utility. At the same time, the synthetic data should sufficiently differ from the real data at the individual level, in order to preserve user privacy.

Furthermore, the system uses home and work areas as labels and inputs in order to generate location data for synthetic users with the given home and work areas.
This addresses the issue of limited sample sizes. Population data, such as census data, can be used to create the input necessary to output a synthetic location dataset that represents the true population in size and distribution.

Data

/data/

ACS data

data/ACS/ma_acs_5_year_census_tract_2018/

Population data is sourced from the 2018 American Community Survey 5-year estimates.

LBS data

/data/mount/

Privately stored on a remote server.

Geography and time period

  • Geography: The region of study is limited to 3 counties surrounding Boston, MA.
  • Time period: The training and output data cover the first 5-day workweek of May 2018.

Data representation

The LBS data are provided as rows with the following fields:

device ID, latitude, longitude, timestamp, dwelltime

The data are transformed into "stay trajectories", which are sequences where each index of a sequence represents a 1-hour time interval. Each stay trajectory represents the data for one user (device ID). The value at that index represents the location/area (census tract) where the user spent the most time during that 1-hour interval.

e.g.

[A,B,D,C,A,A,A,NULL,B...]

Where each letter represents a location. There are null values when no location data is reported in the time interval.
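
The transformation from LBS rows to a stay trajectory can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it assumes each row has already been assigned an area, attributes a row's whole dwell time to the hour of its timestamp (rather than splitting it across hours), and the function name build_stay_trajectory is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_stay_trajectory(rows, start, n_hours):
    """Build one device's stay trajectory.

    rows: iterable of (area, timestamp, dwell_time) tuples for one device.
    Returns a list of length n_hours; entry h is the area with the most
    dwell time in hour h, or None if nothing was reported in that hour.
    """
    # dwell[hour_index][area] -> total dwell time attributed to that hour
    dwell = defaultdict(lambda: defaultdict(float))
    for area, timestamp, dwell_time in rows:
        hour_index = int((timestamp - start).total_seconds() // 3600)
        if 0 <= hour_index < n_hours:
            dwell[hour_index][area] += dwell_time

    trajectory = []
    for h in range(n_hours):
        if h in dwell:
            # Area where the most time was spent during this hour
            trajectory.append(max(dwell[h], key=dwell[h].get))
        else:
            trajectory.append(None)  # the NULL case
    return trajectory


# Illustrative input; the start date and values are made up.
start = datetime(2018, 5, 7)
rows = [
    ("A", start + timedelta(minutes=10), 600),
    ("B", start + timedelta(minutes=40), 900),
    ("A", start + timedelta(hours=1, minutes=5), 300),
    ("C", start + timedelta(hours=3), 120),
]
print(build_stay_trajectory(rows, start, 4))  # ['B', 'A', None, 'C']
```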

Home and work locations are inferred for each user's stay trajectory. Stay trajectories are then prefixed with these home and work locations, and the home, work prefixes serve as labels.

[home,work,A,B,D,C,A,A,A,NULL,B...]

Where the home,work values are also elements that occur (frequently) in their associated stay trajectory (e.g. home=A).

These sequences are used to train the model and are also output by the model.
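
As a rough illustration of the prefixed representation, the sketch below serializes a trajectory to one text line and parses it back. The space-separated format, the "NULL" token spelling, and the function names are assumptions made for this example; the actual file format used in the repository may differ.

```python
NULL_TOKEN = "NULL"  # assumed spelling of the missing-data token


def to_prefixed_line(home, work, trajectory):
    """Serialize [home, work, a1, a2, ...] as one space-separated line."""
    tokens = [home, work] + [NULL_TOKEN if a is None else a for a in trajectory]
    return " ".join(str(t) for t in tokens)


def from_prefixed_line(line):
    """Split a line back into (home, work, trajectory)."""
    tokens = line.split()
    home, work = tokens[0], tokens[1]
    trajectory = [None if t == NULL_TOKEN else t for t in tokens[2:]]
    return home, work, trajectory


line = to_prefixed_line("A", "B", ["A", "B", None, "A"])
print(line)  # A B A B NULL A
```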

RNN

The RNN model developed in this work is meant to be simple and replicable. It was implemented via the open source textgenrnn library. https://github.com/minimaxir/textgenrnn.

Many models (>70) are trained with a variety of hyperparameter values. Each model is trained on the same training data and then given the same input (home, work labels) to generate synthetic output data. The output is evaluated with a variety of utility and privacy metrics in order to determine the best model/parameters.

Pipeline

Preprocessing

Define geography / shapefiles

./shapefile_shaper.ipynb

Our study uses 3 counties surrounding Boston, MA: Middlesex, Norfolk, Suffolk counties.

shapefile_shaper prunes MA shapefiles for this geography.

Output files are in ./shapefiles/ma/

Census tracts are used as "areas"/locations in stay trajectories.

Data filtering

./preprocess_filtering.ipynb

The LBS data is sparse. Some users report just a few datapoints, while other users report many. In order to confidently infer home and work locations, and learn patterns, we only include data from devices with sufficient reporting.

./preprocess_filtering.ipynb filters the data accordingly. It explores the data to determine an appropriate level of filtering and saves the filtered data to files. Namely, it saves a datafile with LBS data from devices that reported at least 3 days and 3 nights of data during the 1-workweek study period. This is the pruned dataset used in the following work.
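
The "at least 3 days and 3 nights" rule could look something like the sketch below. The night-hour window here is an assumption chosen for illustration, nights are counted per calendar date (so a late evening and the early morning of the same date count as one night), and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Assumed nighttime window (20:00-05:59); the notebook may use different hours.
NIGHT_HOURS = set(range(0, 6)) | set(range(20, 24))


def devices_with_sufficient_reporting(rows, min_days=3, min_nights=3):
    """Return the set of device IDs that reported on enough days and nights.

    rows: iterable of (device_id, timestamp) pairs.
    """
    days = defaultdict(set)    # device_id -> dates with a daytime report
    nights = defaultdict(set)  # device_id -> dates with a nighttime report
    for device_id, timestamp in rows:
        bucket = nights if timestamp.hour in NIGHT_HOURS else days
        bucket[device_id].add(timestamp.date())
    return {
        device_id
        for device_id in days
        if len(days[device_id]) >= min_days
        and len(nights[device_id]) >= min_nights
    }
```

The returned set can then be used to keep only the rows whose device ID passed the filter.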

Attach areas

./attach_areas.ipynb

Census areas are attached to LBS data rows.
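
Conceptually, attaching an area means finding the census tract polygon that contains each (latitude, longitude) point. The sketch below shows the idea with a plain ray-casting test on simple polygons. It is only an illustration under that assumption: the tract IDs and shapes are made up, and a real workflow would more likely run a spatial join with a GIS library over the shapefiles in ./shapefiles/ma/.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def attach_area(lon, lat, tracts):
    """Return the ID of the first tract containing the point, else None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return tract_id
    return None


# Two made-up square "tracts" side by side.
tracts = {
    "T1": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "T2": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(attach_area(0.5, 0.5, tracts))  # T1
```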

Home, work inference

./infer_home_work.ipynb

Defines functions to infer home and work locations (census tracts) for each device user, based on their LBS data. The home location is where the user spends the most time during nighttime hours. The "work" location is where the user spends the most time during workday hours. These locations can be the same.

This file helps determine good hours to use for nighttime hours. Once the functions are defined, they are used to evaluate the data representativeness by comparing the inferred population statistics to ACS 2018 census data.

Saves a mapping of LBS user IDs to the inferred home,work locations.
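
A minimal sketch of the inference rule described above, for one device. The nighttime and workday hour windows are assumptions picked for the example (the notebook is where the actual hours are chosen), and infer_home_work is a hypothetical name.

```python
from collections import Counter
from datetime import datetime

# Assumed windows; the notebook explores which hours work best.
NIGHT_HOURS = set(range(0, 6)) | set(range(20, 24))
WORK_HOURS = set(range(9, 17))


def infer_home_work(rows):
    """Infer (home, work) areas for one device.

    rows: iterable of (area, timestamp, dwell_time) tuples.
    Home is the area with the most dwell time in nighttime hours; work is
    the area with the most dwell time in workday hours. Either may be None
    if the device has no data in that window, and both may be the same area.
    """
    night_time = Counter()
    work_time = Counter()
    for area, timestamp, dwell_time in rows:
        if timestamp.hour in NIGHT_HOURS:
            night_time[area] += dwell_time
        elif timestamp.hour in WORK_HOURS:
            work_time[area] += dwell_time
    home = night_time.most_common(1)[0][0] if night_time else None
    work = work_time.most_common(1)[0][0] if work_time else None
    return home, work
```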

Stay trajectories setup

./trajectory_synthesis/trajectory_synthesis_notebook.ipynb

Transforms preprocessed LBS data into prefixed stay trajectories and outputs files for model training, data generation, and comparison.

Note: for the purposes of model training and data generation, the area tokens within stay trajectories can be arbitrary. What is important for the model’s success is the relationship between them. In order to save the stay trajectories in this repository yet keep real data private, we do the following. We map real census areas to integers, and map areas in stay trajectories to the integers representing the areas. We use the transformed stay trajectories for model training and data generation. The mapping between real census areas and their integer representations is kept private. We can then map the integers in stay trajectories back to the real areas they represent when needed (such as when evaluating trip distance metrics).
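
The relabeling idea can be sketched as below. This is an illustration only: the function names are hypothetical, and shuffling before assigning integers is an assumption made here so that the integer order does not mirror the order of the real area IDs. The mapping itself is what would be kept private.

```python
import random


def build_area_mapping(trajectories, seed=None):
    """Map each distinct real area to an arbitrary integer (1..n)."""
    areas = sorted({a for t in trajectories for a in t if a is not None})
    rng = random.Random(seed)
    rng.shuffle(areas)  # so integers do not follow the real IDs' sort order
    return {area: i + 1 for i, area in enumerate(areas)}


def relabel(trajectory, mapping):
    """Replace real areas with their integer representation."""
    return [None if a is None else mapping[a] for a in trajectory]


def restore(trajectory, mapping):
    """Map integers back to real areas, e.g. for trip distance metrics."""
    inverse = {v: k for k, v in mapping.items()}
    return [None if a is None else inverse[a] for a in trajectory]
```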

Output files:

./data/relabeled_trajectories_1_workweek.txt: D: Full training set of 22704 trajectories

./data/relabeled_trajectories_1_workweek_prefixes_to_counts.json: Maps D home,work label prefixes to counts

./data/relabeled_trajectories_1_workweek_sample_2000.txt: S: Random sample of 2000 trajectories from D.

./data/relabeled_trajectories_1_workweek_prefixes_to_counts_sample_2000.json: Maps S home,work label prefixes to counts

  • This is used as the input for data generation so that the output synthetic sample, S', has a home,work label pair distribution that matches S.
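
The sampling and prefix-count files could be produced along the lines of the sketch below. It assumes one space-separated trajectory per line with the home and work labels first, and uses a "home work" string as the JSON key; both are assumptions for this example, as are the function names.

```python
import json
import random
from collections import Counter


def prefix_counts(lines):
    """Count how often each (home, work) prefix occurs in a set of lines."""
    counts = Counter()
    for line in lines:
        home, work = line.split()[:2]
        counts[f"{home} {work}"] += 1
    return dict(counts)


def sample_trajectories(lines, k, seed=None):
    """Draw a random sample S of k trajectories from the full set D."""
    return random.Random(seed).sample(lines, k)


# Made-up relabeled trajectories standing in for D.
D = ["1 2 1 1 2", "1 2 1 2 2", "3 3 3 3 3", "4 2 4 2 2"]
S = sample_trajectories(D, k=2, seed=0)
print(json.dumps(prefix_counts(S)))
```

Generating as many synthetic trajectories per prefix as the counts specify is what makes the label distribution of S' match that of S.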

Model training and data generation

./trajectory_synthesis/textgenrnn_generator/

Models with a variety of hyperparameter combinations were trained and then used to generate a synthetic sample.

The files model_trainer.py and generator.py are the templates for the scripts used to train and generate.

The model (hyper)parameter combinations were tracked in a spreadsheet. ./trajectory_synthesis/textgenrnn_generator/textgenrnn_model_parameters_.csv
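
One way to enumerate hyperparameter combinations and record them in a CSV is sketched below. The parameter names and values are illustrative placeholders, not the grid actually used; the columns of the real spreadsheet are not reproduced here.

```python
import csv
import io
import itertools

# Illustrative grid only; names and values are assumptions.
GRID = {
    "rnn_layers": [2, 3],
    "rnn_size": [128, 256],
    "max_length": [24, 48],
    "temperature": [0.8, 1.0],
}


def expand_grid(grid):
    """Return one dict per combination of hyperparameter values."""
    keys = list(grid)
    combos = itertools.product(*(grid[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def to_csv(configs):
    """Render the configurations as CSV text with a model_id column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["model_id"] + list(configs[0]))
    writer.writeheader()
    for i, config in enumerate(configs, start=1):
        writer.writerow({"model_id": i, **config})
    return buffer.getvalue()


configs = expand_grid(GRID)
print(len(configs))  # 16
```

Each row could then be fed to a copy of the training and generation script templates, so that every model is traceable back to its parameters.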

Evaluation

./trajectory_synthesis/evaluation/evaluate_rnn.ipynb

A variety of utility and privacy evaluation tools and metrics were developed. Models were evaluated by their synthetic data outputs (S'). This was done in ./trajectory_synthesis/evaluation/evaluate_rnn.ipynb. The best model (i.e. best parameters) was determined by these evaluations. The results for this model are captured in trajectory_synthesis/evaluation/final_eval_plots.ipynb.
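
To give a flavor of what a utility and a privacy check can look like, here is a small sketch. These two metrics are examples chosen for illustration and are not claimed to be the ones in the evaluation notebooks: the utility side compares how visits are distributed over areas in S and S', and the privacy side measures how close a synthetic trajectory comes to any real one.

```python
from collections import Counter


def visit_distribution(trajectories):
    """Share of all reported hours spent in each area."""
    counts = Counter(a for t in trajectories for a in t if a is not None)
    total = sum(counts.values())
    return {area: c / total for area, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical)."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def min_hamming_fraction(synthetic, real_trajectories):
    """Smallest fraction of differing hours between one synthetic trajectory
    and any real trajectory of the same length (0.0 = exact copy)."""
    return min(
        (
            sum(a != b for a, b in zip(synthetic, real)) / len(synthetic)
            for real in real_trajectories
            if len(real) == len(synthetic)
        ),
        default=1.0,
    )


real = [[1, 1, 2, 2], [3, 3, 3, None]]
synth = [[1, 1, 2, 3], [3, 3, 3, None]]
utility_gap = total_variation(visit_distribution(real), visit_distribution(synth))
closest = [min_hamming_fraction(s, real) for s in synth]
print(closest)  # [0.25, 0.0]
```

A low distribution gap suggests aggregate utility is retained, while a synthetic trajectory with a distance of 0.0 is an exact copy of a real one and would be a privacy concern.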
