Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline

Table of Contents
  1. About
  2. Getting Started
  3. ML task description and evaluation procedure
  4. Dataset description
  5. Repository structure
  6. Utils documentation
  7. Roadmap
  8. Contributing
  9. Contact

About

This is a toy example of a standalone ML pipeline written entirely in Python. No external tools are incorporated into the master branch. I built this for two reasons:

  1. To experiment with my own ideas for MLOps tools, as it is hard to develop devtools in a vacuum :)
  2. To have something to integrate existing MLOps tools with so I can have real opinions

The following diagram describes the pipeline at a high level. The README describes it in more detail.

Diagram

Getting started

This pipeline is broken down into several components, which correspond at a high level to the directories in this repository. See the Makefile for the commands you can run. To serve the inference API locally, do the following:

  1. git clone the repository
  2. In the root directory of the repo, run make serve
  3. [OPTIONAL] In a new tab, run make inference to ping the API with some sample records

The Makefile handles virtual environment creation and all Python dependencies. See setup.py for the packages installed into the virtual environment, which are mainly basic Python packages such as pandas and sklearn.

ML task description and evaluation procedure

We train a model to predict whether a passenger in an NYC taxicab ride will give the driver a large tip. This is a binary classification task. A large tip is arbitrarily defined as greater than 20% of the total fare (before tip). We evaluate the model using the F1 score.
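The label and metric described above can be sketched in plain Python. This is a minimal illustration, not the repository's code: it assumes that "total fare (before tip)" corresponds to the Fare_amount column, which the text does not confirm, and the function names are hypothetical.

```python
def is_large_tip(tip_amount: float, fare_amount: float, threshold: float = 0.2) -> bool:
    """Label a ride as a 'large tip' when the tip exceeds `threshold` of the fare.

    Assumption: fare_amount stands in for "total fare (before tip)".
    """
    if fare_amount <= 0:
        # Assumption: rides without a positive fare are never labeled positive.
        return False
    return tip_amount > threshold * fare_amount


def f1_score(y_true, y_pred) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# (tip_amount, fare_amount) pairs for three made-up rides
rides = [(3.0, 10.0), (1.0, 10.0), (5.0, 20.0)]
print([is_large_tip(tip, fare) for tip, fare in rides])
print(f1_score([1, 0, 1, 1], [1, 0, 0, 1]))
```

F1 is the harmonic mean of precision and recall, so it rewards a model only when it finds large tippers without flagging too many small ones.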

The current best model is an instance of sklearn.ensemble.RandomForestClassifier with max_depth of 10 and other default parameters. The test set F1 score is 0.716. I explored this toy task earlier in my debugging ML talk.

Dataset description

We use the yellow taxicab trip records from the NYC Taxi & Limousine Commission public dataset, which is stored in a public AWS S3 bucket. The data dictionary can be found here and is also shown below:

Field Name | Description
VendorID | A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
tpep_pickup_datetime | The date and time when the meter was engaged.
tpep_dropoff_datetime | The date and time when the meter was disengaged.
Passenger_count | The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance | The elapsed trip distance in miles reported by the taximeter.
PULocationID | TLC Taxi Zone in which the taximeter was engaged.
DOLocationID | TLC Taxi Zone in which the taximeter was disengaged.
RateCodeID | The final rate code in effect at the end of the trip. 1= Standard rate, 2= JFK, 3= Newark, 4= Nassau or Westchester, 5= Negotiated fare, 6= Group ride
Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip, N= not a store and forward trip
Payment_type | A numeric code signifying how the passenger paid for the trip. 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip
Fare_amount | The time-and-distance fare calculated by the meter.
Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge | $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
Tip_amount | Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount | Total amount of all tolls paid in trip.
Total_amount | The total amount charged to passengers. Does not include cash tips.

Repository structure

The pipeline contains multiple components, each organized into the following high-level subdirectories:

  • etl
  • training
  • inference

Pipeline components

Any applied ML pipeline is essentially a series of functions applied one after the other, such as data transformations, models, and output transformations. This pipeline was initially built in a lightweight fashion to run on a regular laptop with around 8 GB of RAM. The logic in these components is a first pass; there is a lot of room to improve.
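To make the "series of functions" idea concrete, here is a small sketch that chains steps over a list of records. The step functions and the 5-mile threshold are hypothetical stand-ins and are not the components in this repository.

```python
from functools import reduce
from typing import Callable, Iterable, List

Record = dict
Step = Callable[[List[Record]], List[Record]]


def run_pipeline(records: List[Record], steps: Iterable[Step]) -> List[Record]:
    """Apply each step to the output of the previous one, in order."""
    return reduce(lambda data, step: step(data), steps, records)


def drop_zero_distance(rows: List[Record]) -> List[Record]:
    """Hypothetical data transformation: remove trips with no recorded distance."""
    return [r for r in rows if r["trip_distance"] > 0]


def add_long_trip_flag(rows: List[Record]) -> List[Record]:
    """Hypothetical feature: flag trips longer than 5 miles (arbitrary cutoff)."""
    return [{**r, "long_trip": r["trip_distance"] > 5} for r in rows]


rows = [{"trip_distance": 0.0}, {"trip_distance": 2.0}, {"trip_distance": 7.5}]
print(run_pipeline(rows, [drop_zero_distance, add_long_trip_flag]))
```

Because every step has the same input and output type, steps can be reordered, removed, or tested in isolation.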

The following table describes the components of this pipeline, in order:

Name | Description | How to run | File(s)
Cleaning | Reads the dataset (stored in a public S3 bucket) and performs very basic cleaning (drops rows outside the time range or with $0-valued fares) | make cleaning | etl/cleaning.py
Featuregen | Generates basic features for the ML model | make featuregen | etl/featuregen.py
Split | Splits the features into train and test sets | make split | training/split.py
Training | Trains a random forest classifier on the train set and evaluates it on the test set | make training | training/train.py
Inference | Locally serves an API that is essentially a wrapper around the predict function | make serve, make inference | [inference/app.py, inference/inference.py]
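The cleaning rule in the table (drop rows outside the time range or with $0-valued fares) could look roughly like the sketch below. It works on plain dictionaries rather than a dataframe, the half-open time window is an assumption, and the `clean` function is illustrative rather than the code in etl/cleaning.py.

```python
from datetime import datetime
from typing import List


def clean(rows: List[dict], start: datetime, end: datetime) -> List[dict]:
    """Keep rows picked up within [start, end) whose fare is not $0."""
    return [
        r
        for r in rows
        if start <= r["tpep_pickup_datetime"] < end and r["fare_amount"] != 0
    ]


rows = [
    {"tpep_pickup_datetime": datetime(2020, 1, 15, 8, 30), "fare_amount": 12.5},
    {"tpep_pickup_datetime": datetime(2020, 2, 1, 0, 0), "fare_amount": 9.0},   # outside window
    {"tpep_pickup_datetime": datetime(2020, 1, 20, 17, 0), "fare_amount": 0.0}, # $0 fare
]
cleaned = clean(rows, start=datetime(2020, 1, 1), end=datetime(2020, 2, 1))
print(len(cleaned))
```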

Data storage

The inputs and outputs for the pipeline components, as well as other artifacts, are stored in a public S3 bucket named toy-applied-ml-pipeline located in us-west-1. Read access is universal and doesn't require special permissions. Write access is limited to those with credentials. If you are interested in contributing and want write access, please contact me directly describing how you would like to be involved, and I can send you keys.

The bucket has a scratch folder, where random scratch files live. These random scratch files were likely generated by the write_file function in utils.io. The bulk of the bucket lies in the dev directory, or s3://toy-applied-ml-pipeline/dev.

The dev directory's subdirectories represent the components in the pipeline. These subdirectories contain the outputs of each component respectively, where the outputs are versioned with the timestamp the component was run. The utils.io library contains helper functions to write outputs and load the latest component output as input to another component. To inspect the filesystem structure further, you can call io.list_files(dirname), which returns the immediate files in dirname.
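One way to picture timestamp-versioned outputs is the sketch below. The key layout (`dev/<component>/<timestamp>.pq`), the timestamp format, and both function names are assumptions made for illustration; the actual naming scheme lives in utils/io.py.

```python
from datetime import datetime
from typing import List, Optional


def versioned_key(component: str, run_time: datetime, prefix: str = "dev") -> str:
    """Build an object key whose version is the run timestamp (hypothetical layout)."""
    version = run_time.strftime("%Y%m%d-%H%M%S")
    return f"{prefix}/{component}/{version}.pq"


def latest_key(keys: List[str], component: str, prefix: str = "dev") -> Optional[str]:
    """Return the newest key for a component, or None if it has no outputs.

    Zero-padded timestamps sort lexicographically in chronological order,
    so max() on the key strings picks the most recent run.
    """
    candidates = [k for k in keys if k.startswith(f"{prefix}/{component}/")]
    return max(candidates) if candidates else None


keys = [
    versioned_key("clean", datetime(2021, 3, 1, 12, 0, 0)),
    versioned_key("clean", datetime(2021, 3, 2, 9, 30, 0)),
    versioned_key("featuregen", datetime(2021, 3, 2, 10, 0, 0)),
]
print(latest_key(keys, "clean"))
```

A downstream component can then read the latest output of its upstream component without knowing when that component last ran.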

If you have write permissions, store your keys/ids in a .env file, and the Makefile will automatically pick it up. If you do not have write permissions, you will run into an error if you try to write to the S3 bucket.

Utils documentation

The utils directory contains helper functions and abstractions for expanding upon the current pipeline. Tests are in utils/tests.py. Note that only the io functions are tested as of now.

io

utils/io.py contains various helper functions to interface with S3. The two most useful functions are:

import pandas as pd

def load_output_df(component: str, dev: bool = True, version: str = None) -> pd.DataFrame:
    """
    This function loads the latest version of data that was produced by a component.
    Args:
        component (str): component name that we want to get the output from
        dev (bool): whether this is run in development or "production" mode
        version (str, optional): specified version of the data
    Returns:
        df (pd.DataFrame): dataframe corresponding to the data in the latest version of the output for the specified component
    """
    ...

def save_output_df(df: pd.DataFrame, component: str, dev: bool = True, overwrite: bool = False, version: str = None) -> str:
    """
    This function writes the output of a pipeline component (a dataframe) to a parquet file.
    Args:
        df (pd.DataFrame): dataframe representing the output
        component (str): name of the component that produced the output (ex: clean)
        dev (bool, optional): whether this is run in development or "production" mode
        overwrite (bool, optional): whether to overwrite a file with the same name
        version (str, optional): optional version for the output. If not specified, the function will create the version number.
    Returns:
        path (str): Full path that the file can be accessed at
    """
    ...

Note that save_output_df's default parameters are set such that you cannot overwrite an existing file. You can change this by setting overwrite = True.
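The overwrite guard can be illustrated with a local-file analogue. This sketch does not call save_output_df; `save_text` is a hypothetical helper, and raising FileExistsError is an assumption about how a refusal might surface, not a description of what utils/io.py does.

```python
import tempfile
from pathlib import Path


def save_text(path: Path, text: str, overwrite: bool = False) -> str:
    """Write `text` to `path`, refusing to replace an existing file by default."""
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} exists; pass overwrite=True to replace it")
    path.write_text(text)
    return str(path)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "clean.txt"
    save_text(target, "v1")
    try:
        save_text(target, "v2")  # default overwrite=False, so this is refused
    except FileExistsError:
        print("refused to overwrite")
    save_text(target, "v2", overwrite=True)
    print(target.read_text())
```

Defaulting to no overwrite protects earlier versions of a component's output from being clobbered by accident.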

Feature generators

utils/feature_generators.py contains a lightweight abstraction for a feature generator, which makes it easy to create a new feature. The abstraction is as follows:

import typing
from abc import ABC, abstractmethod

class FeatureGenerator(ABC):
    """Abstract class for a feature generator."""

    def __init__(self, name: str, required_columns: typing.List[str]):
        """Constructor stores the name of the feature and columns required in a df to construct that feature."""
        self.name = name
        self.required_columns = required_columns

    @abstractmethod
    def compute(self):
        pass

    @abstractmethod
    def schema(self):
        pass

See utils/feature_generators.py for examples of how to create specific feature types and etl/featuregen.py for an example of how to create the actual instances of the features themselves.
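As a rough illustration of subclassing the abstraction, the sketch below defines a hypothetical `TripMinutes` feature. The base class is repeated so the example is self-contained. Having `compute` take a single row as a dictionary and `schema` return a name-to-type mapping are assumptions; the real signatures in the repository may differ.

```python
import typing
from abc import ABC, abstractmethod
from datetime import datetime


class FeatureGenerator(ABC):
    """Abstract class for a feature generator (mirrors the abstraction above)."""

    def __init__(self, name: str, required_columns: typing.List[str]):
        self.name = name
        self.required_columns = required_columns

    @abstractmethod
    def compute(self):
        pass

    @abstractmethod
    def schema(self):
        pass


class TripMinutes(FeatureGenerator):
    """Hypothetical feature: trip duration in minutes."""

    def __init__(self):
        super().__init__(
            "trip_minutes", ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
        )

    def compute(self, row: dict) -> float:
        # Fail early if the input lacks a column this feature depends on.
        missing = [c for c in self.required_columns if c not in row]
        if missing:
            raise KeyError(f"missing columns: {missing}")
        delta = row["tpep_dropoff_datetime"] - row["tpep_pickup_datetime"]
        return delta.total_seconds() / 60.0

    def schema(self) -> dict:
        return {self.name: float}


feature = TripMinutes()
row = {
    "tpep_pickup_datetime": datetime(2020, 1, 1, 12, 0),
    "tpep_dropoff_datetime": datetime(2020, 1, 1, 12, 15),
}
print(feature.compute(row), feature.schema())
```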

Models

utils/models.py contains the ModelWrapper abstraction. This abstraction is essentially a wrapper around a model and consists of:

  • the model binary
  • pointer to dataset(s)
  • metric values

To use this abstraction, you must create a subclass of ModelWrapper and implement the preprocess, train, predict, and score methods. The base class also provides methods to save and load the ModelWrapper object. It will fail to save if the client has not added data paths and metrics to the object.
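The save-time check described above can be sketched with a much smaller stand-in. `MiniModelWrapper` and `MajorityClassWrapper` are hypothetical names, only `train` and `predict` are modeled (not `preprocess` or `score`), and state is written as JSON purely to keep the example self-contained; none of this is the code in utils/models.py.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from collections import Counter
from pathlib import Path


class MiniModelWrapper(ABC):
    """Hypothetical, simplified stand-in for the ModelWrapper idea."""

    def __init__(self):
        self.model = None      # stands in for the model binary
        self.data_paths = {}   # pointers to dataset(s)
        self.metrics = {}      # metric values

    def add_data_path(self, name: str, path: str) -> None:
        self.data_paths[name] = path

    def add_metric(self, name: str, value: float) -> None:
        self.metrics[name] = value

    @abstractmethod
    def train(self, rows, label_column): ...

    @abstractmethod
    def predict(self, rows): ...

    def save(self, directory: str) -> str:
        # Refuse to save a model that has no provenance or evaluation attached.
        if not self.data_paths or not self.metrics:
            raise ValueError("add data paths and metrics before saving")
        path = Path(directory) / "model.json"
        state = {"model": self.model, "data_paths": self.data_paths, "metrics": self.metrics}
        path.write_text(json.dumps(state))
        return str(path)


class MajorityClassWrapper(MiniModelWrapper):
    """Toy model that always predicts the most common training label."""

    def train(self, rows, label_column):
        self.model = Counter(r[label_column] for r in rows).most_common(1)[0][0]

    def predict(self, rows):
        return [self.model for _ in rows]


mw = MajorityClassWrapper()
mw.train([{"y": 1}, {"y": 1}, {"y": 0}], "y")
mw.add_data_path("train_df", "path/to/train")  # placeholder path
mw.add_metric("train_f1", 0.8)                 # placeholder value
with tempfile.TemporaryDirectory() as tmp:
    print(Path(mw.save(tmp)).name)
```

Tying data paths and metrics to the saved artifact means every stored model can be traced back to the data it was trained on and the score it achieved.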

An example of a subclass of ModelWrapper is the RandomForestModelWrapper, which is also found in utils/models.py. The RandomForestModelWrapper client usage example is in training/train.py and is partially shown below:

from utils import models

# Create and train model
mw = models.RandomForestModelWrapper(
    feature_columns=feature_columns, model_params=model_params)
mw.train(train_df, label_column)

# Score model
train_score = mw.score(train_df, label_column)
test_score = mw.score(test_df, label_column)

mw.add_data_path('train_df', train_file_path)
mw.add_data_path('test_df', test_file_path)
mw.add_metric('train_f1', train_score)
mw.add_metric('test_f1', test_score)

# Save model
print(mw.save('training/models'))

# Load latest model version
reloaded_mw = models.RandomForestModelWrapper.load('training/models')
test_preds = reloaded_mw.predict(test_df)

Roadmap

See the open issues for tickets corresponding to feature ideas. The issues in this repo are mainly tagged either data science or engineering.

Contributing

Having a toy example of an ML pipeline isn't just nice to have for people experimenting with MLOps tools. ML beginners or data science enthusiasts looking to understand how to build pipelines around ML models can also benefit from this repository.

Anyone is welcome to contribute, and your contribution is greatly appreciated! Feel free to either create issues or pull requests to address issues.

  1. Fork the repo
  2. Create your branch (git checkout -b YOUR_GITHUB_USERNAME/somefeature)
  3. Make changes and add files to the commit (git add .)
  4. Commit your changes (git commit -m 'Add something')
  5. Push to your branch (git push origin YOUR_GITHUB_USERNAME/somefeature)
  6. Make a pull request

Contact

Original author: Shreya Shankar

Email: [email protected]
