Easy pipelines for pandas DataFrames.

Overview

pdpipe


Easy pipelines for pandas DataFrames (learn how!).

Website: https://pdpipe.github.io/pdpipe/

Documentation: https://pdpipe.github.io/pdpipe/doc/pdpipe/

>>> import pandas as pd
>>> df = pd.DataFrame(
...     data=[[4, 165, 'USA'], [2, 180, 'UK'], [2, 170, 'Greece']],
...     index=['Dana', 'Jane', 'Nick'],
...     columns=['Medals', 'Height', 'Born']
... )
>>> import pdpipe as pdp
>>> pipeline = pdp.ColDrop('Medals').OneHotEncode('Born')
>>> pipeline(df)
      Height  Born_UK  Born_USA
Dana     165        0         1
Jane     180        1         0
Nick     170        0         0

1   Documentation

This is the repository of the pdpipe package, and this readme file is aimed at helping potential contributors to the project.

To learn more about how to use pdpipe, either visit pdpipe's homepage or read the online documentation of pdpipe.

2   Installation

Install pdpipe with:

pip install pdpipe

Some pipeline stages require scikit-learn; they will simply not be loaded if scikit-learn is not found on the system, and pdpipe will issue a warning. To use them you must also install scikit-learn.

Similarly, some pipeline stages require nltk; they will simply not be loaded if nltk is not found on your system, and pdpipe will issue a warning. To use them you must additionally install nltk.
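For example, to make both groups of optional stages available, install the optional dependencies alongside pdpipe:

pip install pdpipe scikit-learn nltk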

3   Contributing

The package author and current maintainer is Shay Palachy ([email protected]); you are more than welcome to approach him for help. Contributions are very welcome, especially since this package is very much in its infancy and many other pipeline stages can be added.

3.1   Installing for development

Clone:

git clone [email protected]:pdpipe/pdpipe.git

Install in development mode with test dependencies:

cd pdpipe
pip install -e ".[test]"

3.2   Running the tests

To run the tests, use:

python -m pytest

Note that pytest runs are configured by the pytest.ini file; read it to understand the exact pytest arguments used.

3.3   Adding tests

At the time of writing, pdpipe is maintained with a test coverage of 100%. Although challenging, I hope to maintain this status. If you add code to the package, please make sure you thoroughly test it. Codecov automatically reports changes in coverage on each PR, and so PRs that reduce test coverage will not be reviewed until that is fixed.

Tests reside under the tests directory in the root of the repository. Each module has a separate test folder, with each class - usually a pipeline stage - having a dedicated file (always starting with the string "test") containing several tests (each a global function starting with the string "test"). Please adhere to this structure, and try to separate test cases into different test functions; this allows us to quickly focus on problem areas and use cases. Thank you! :)
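As an illustration, a minimal test file following this convention might look like the sketch below (the file path, the chosen stage and the function name are hypothetical examples, not existing project files):

# tests/basic_stages/test_coldrop.py  (hypothetical path following the convention above)
import pandas as pd

import pdpipe as pdp


def test_coldrop_drops_single_column():
    # build a tiny frame, apply the stage, and assert on the result
    df = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['num', 'char'])
    res = pdp.ColDrop('num')(df)
    assert 'num' not in res.columns
    assert 'char' in res.columns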

3.4   Code style

pdpipe code is written to adhere to the coding style dictated by flake8. Practically, this means that one of the jobs run on the project's Travis CI for each commit and pull request checks for a successful run of the flake8 CLI command in the repository's root, which means pull requests will be flagged red by the Travis bot if non-flake8-compliant code was added.

To solve this, please run flake8 on your code (whether through your text editor/IDE or using the command line) and fix all resulting errors. Thank you! :)
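For example, from the repository's root:

flake8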

3.5   Adding documentation

This project is documented using the numpy docstring conventions, which were chosen as they are perhaps the most widespread conventions that are both supported by common tools such as Sphinx and result in human-readable docstrings (in my personal opinion, of course). When documenting code you add to this project, please follow these conventions.

Additionally, if you update this README.rst file, use python setup.py checkdocs to validate it compiles.

3.6   Adding doctests

Please note that for pdoc3 - the Python package used to generate the HTML documentation files for pdpipe - to successfully include doctests in the generated documentation, the whole doctest must be indented relative to the opening line of the multi-line docstring, like so:

class ApplyByCols(PdPipelineStage):
    """A pipeline stage applying an element-wise function to columns.

    Parameters
    ----------
    columns : str or list-like
        Names of columns on which to apply the given function.
    func : function
        The function to be applied to each element of the given columns.
    result_columns : str or list-like, default None
        The names of the new columns resulting from the mapping operation. Must
        be of the same length as columns. If None, behavior depends on the
        drop parameter: If drop is True, the name of the source column is used;
        otherwise, the name of the source column is used with the suffix
        '_app'.
    drop : bool, default True
        If set to True, source columns are dropped after being mapped.
    func_desc : str, default None
        A function description of the given function; e.g. 'normalizing revenue
        by company size'. A default description is used if None is given.


    Example
    -------
        >>> import pandas as pd; import pdpipe as pdp; import math;
        >>> data = [[3.2, "acd"], [7.2, "alk"], [12.1, "alk"]]
        >>> df = pd.DataFrame(data, [1,2,3], ["ph","lbl"])
        >>> round_ph = pdp.ApplyByCols("ph", math.ceil)
        >>> round_ph(df)
           ph  lbl
        1   4  acd
        2   8  alk
        3  13  alk
    """

4   Credits

Created by Shay Palachy ([email protected]).

Comments
  • My article about your package


    Hi,

    I found your package quite useful and pretty neat. I may contribute to it later with ideas and suggestions for enhancement.

    For now, I have written an article about it and it has gained traction pretty fast.

    Check out the article here.

    question 
    opened by tirthajyoti 14
  • Pickling of pipelines


    Hi,

    I stumbled upon your excellent library when researching ways of serializing and pickling pipelines for reuse on new datasets, and found your approach to be most promising. However, when attempting to pickle the following pipeline from your documentation:

    pipeline = pdp.ColDrop("Name") + pdp.OneHotEncode("Label")
    pipeline += pdp.MapColVals("Job", {"Part": True, "Full": True, "No": False})
    pipeline += pdp.PdPipeline([pdp.ColRename({"Job": "Employed"})])
    joblib.dump(pipeline, "pipeline.pkl")

    ... I get the following Exception:

    PicklingError: Can't pickle <function ColRename.__init__.<locals>._tprec at 0x000001CBC33864C0>: it's not found as pdpipe.basic_stages.ColRename.__init__.<locals>._tprec

    I get similar exceptions when attempting other function-based steps, like ApplyToRows or ColByFrameFunc.

    Is there a way to get around this? As an example: How would I solve applying a function that calculates age from a column containing birthdate and pickling this?

    In general I'd love to see more explanations and examples on pickling pipelines in your otherwise excellent documentation!

    bug complex issue 
    opened by MagBuchSB1 13
  • Application context objects should not be kept by default + add way to supply fit context


    pdpipe uses PdpApplicationContext objects in two ways:

    1. As the fit_context that should be kept as-is after a fit, and used by stages to pass to one another parameters that should also be used on transform time.
    2. As the application_context that should be discarded after a specific application is done, and is used by stages to feed consecutive stages with context. It can be added to by supplying apply(context={}), fit_transform(context={}) or transform(context={}) with a dict that will be used to update the application context.
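    For example (a minimal sketch relying only on the context keyword described in point 2; the stage and column names are arbitrary), an application context can be supplied like this:

    import pandas as pd
    import pdpipe as pdp

    df = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['num', 'char'])
    pipeline = pdp.PdPipeline([pdp.ColDrop('num')])
    # the provided dict updates the application context for this application
    res = pipeline.fit_transform(df, context={'run_id': 42})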

    Two changes are required:

    1. At the moment there is a single context parameter to application functions that is used to update both the fit and the application context. I think there should be two, one for each type of context.
    2. At the moment the application_context is not discarded when the application is done. It's as simple as adding a self.application_context = None expression at the PdPipeline level in the couple of right places.
    enhancement good first issue 
    opened by shaypal5 9
  • Docs/Improve documentation


    Summary

    Issue #28

    • Fix spellings
    • Fix docstrings
      • Example: Missing callable in columns : str or iterable, optional
    • Add code examples when use cases are complex
      • Trying to use existing tests when possible
      • If cannot, tests are added
    • Enable links for crucial references
      • Example: pdpipe.cq to pdpipe.cq

    Progress

    • [x] documentation.md
    • [x] pdpipe.basic_stages
    • [x] pdpipe.col_generation
    • [x] pdpipe.cond
    • [x] pdpipe.core
    • [x] pdpipe.cq
    • [x] pdpipe.nltk_stages
    • [x] pdpipe.skintegrate
    • [x] pdpipe.sklearn_stages
    • [x] pdpipe.text_stages
    • [x] pdpipe.wrappers
    opened by yarkhinephyo 8
  • Improve the docs


    Excellent work! I love this tool very much. I think you can improve the docs. I tried several times to figure out how to use AdHocStage to do sum/mean operations. The description of AggByCols is even worse.


    enhancement help wanted good first issue documentation 
    opened by gandad 7
  • Fix setup_required in setup.py


    In the setup.py there are the following lines:

    install_requires=INSTALL_REQUIRES,
    setup_requires=INSTALL_REQUIRES,

    I think there should only be install_requires and not setup_requires. The latter causes an issue where dependencies during conda packaging can't be met. Removal of the line fixes the problem.

    Does that sound reasonable? Sorry for the earlier confusion on my part about sklearn :)

    bug 
    opened by Silun 6
  • Issue #27 changes


    1.1 Not sure what exactly is needed as "description"; implemented the class name for now. 3. Maybe this can be implemented whenever #19 is done.

    enhancement 
    opened by naveenkaushik2504 6
  • About ColByFrameFunc



    Expected Result

    Execute the add_Col function according to a condition on the df['A'] column value.


    Actual Result

    ERROR: PipelineApplicationError: Exception raised in stage [ 0] PdPipelineStage: Applying a function to generate column A.

    Reproduction Steps

    import pandas as pd
    import pdpipe as pdp
    #%%
    data = [[3, 3], [2, 4], [1, 5]]
    df_usage = pd.DataFrame(data, [1,2,3], ["A","B"])
    
    def add_Col(a,b):
        if a > 2:
            return a+ b
        else:
            return a+10
    
    func = lambda df:add_Col(df['A'],df['B'])
    
    pipeline = pdp.PdPipeline([
        pdp.ColByFrameFunc("A",func,follow_column='B'),
        pdp.ColRename({'A':'CloA','B':'ColB'})
    ])
    
    df_usage = pipeline(df_usage,verbose=True)
    
    

    System Information

    macOS 10.15.7, Python version: 3.9.10


    invalid 
    opened by banduoba 5
  • Add example for GridSearchCV parameter tuning


    Hi, please add to the documentation a simple example for a common scenario in ML models: how to use pdpipe with GridSearchCV.

    It is not clear how to do it, or whether this scenario is supported at all.

    This basic example will help people to use pdpipe for all ML models and parameter-tuning techniques.

    Best Stefan

    enhancement question 
    opened by stefansimik 5
  • Feature Request: More information in the messages of PipelineApplicationError


    At the moment, there is too little information in the messages of PipelineApplicationError.

    Things that should be added:

    1. The type of the pipeline stage (so the name of the class). 1.1. Actually, maybe the description?! :)
    2. The index in the pipeline (obviously only when part of a pipeline; requires catching and rethrowing at the pipeline level).
    3. The label of the pipeline stage, when #19 is implemented.
    enhancement good first issue 
    opened by shaypal5 5
  • exclude_columns does not work for ColumnsBasedPipelineStage


    When setting exclude_columns as a list of strings in the constructor of ColumnsBasedPipelineStage, the member variable self._exclude_columns is set to a tuple instead of the passed-in list. This leads to errors with __get_cols_by_arg, which ends up returning the tuple and not the list itself.

    Code where the issue appears to be

    https://github.com/pdpipe/pdpipe/blob/763320db326e9a49f51bd7fb9ea65944f51869f2/pdpipe/core.py#L602

    The line which seems incorrect is linked above. _interpret_columns_param returns a tuple, so we should be assigning the individual components of the tuple, in that line, as opposed to assigning the entire return value to 'self._exclude_columns'; something like

    self._exclude_columns, _ = self._interpret_columns_param(...

    bug 
    opened by jjlee88 4
  • Read & write pipeline configuration from/to YAML


    This is a feature that has been discussed and requested several times.

    A couple of independent projects tried to do exactly that, but I believe it is worthwhile to have one official API exposed by the package, and also kept up-to-date:

    • https://github.com/blakeNaccarato/pdpipewrench
    • https://github.com/altescy/pdpcli
    • https://github.com/neilbartlett/datapipeliner
    enhancement complex issue 
    opened by shaypal5 0
  • Feature Request: Contextual params using application context


    I want a way to supply pipeline stage constructor parameters with future-like placeholders, so that actual values will be determined by prior stages at application time.

    I would assign the name of the future ApplicationContext key holding the value after it was calculated — wrapped by a unique class — to the parameter.

    The constructor will hold on to this object, and will — in application time — pull the value of the right key from either the fit context or the application context (depending on how I set it: I might want this value to be set on pipeline fit, or to be set on each application dynamically, even on transforms when the pipeline is fitted), and use it for the transformation. Default should probably be the fit context?

    Here's an example:

    import numpy as np; import pandas as pd; import pdpipe as pdp;
    
    def scaling_decider(X: pd.DataFrame) -> str:
        """Determines with type of scaling to apply by examining all numerical columns."""
        numX = X.select_dtypes(include=np.number)
        for col in numX.columns:
            # this is nonsense logic, just an example
            if np.std(numX[col]) > 2 * np.mean(numX[col]):
                return 'StandardScaler'
        return 'MinMaxScaler'
    
    pipeline = pdp.PdPipeline(stages=[
        pdp.ColDrop(pdp.cq.StartWith('n_')),
        pdp.ApplicationContextEnricher(scaling_type=scaling_decider),
        pdp.Scale(
            # fit=False means it will take it from the application context, and not fit context
            scaler=pdp.contextual('scaling_type', fit=False),  
            joint=True,
        ),
    ])
    

    Design

    This has to be implemented at the PdPipeline base class. Since the base class can't hijack constructor arguments, I think the contract with extending classes should be:

    1. When implementing a class extending PdPipeline, if you want to enjoy support for contextual constructor parameters, you MUST delay any initialization of inner state objects to fit_transform, so that the fit/application context is available on initialization (it is NOT available at pipeline stage construction and initialization, after all).

    2. pdp.contextual is a factory function that returns contextual parameter placeholder objects. Code using it shouldn't really care about it, as it should never interact with the resulting objects directly. I think.

    3. PdPipeline can auto-magically make sure that any attribute of a PdPipeline instance that is assigned a pdp.contextual object in the constructor (e.g. self.k = k, and the k constructor argument was provided with k=pdp.future('pca_k')) will be hot-swapped with a concrete value by the time we wish to use it in fit_transform or transform (for example, when we call self.pca_ = PCA(k=self.k)). It can also do so for any such object that is contained in any iterable or dict-like attribute (so if I have self._pca_kwargs = {...} in my constructor, I can safely call self.pca_ = PCA(**self._pca_kwargs) in fit_transform()).

    Implementation thoughts

    To make this efficient, since this means accessing sub-class attribute instance on pipeline transformations, I have a few thoughts:

    1. The contextuals module should have a global variable such as CONTEXTUALS_ARE_ON = False. Then, the pdp.contextual factory function sets CONTEXTUALS_ARE_ON = True (via a global statement) when called. Then, we condition the whole inspection-heavy logic on this indicator variable, so that if our user never called pdp.contextual during the current kernel, runtime is saved.

    2. I first thought pdp.contextual could somehow register _ContextualParam objects in a registry we could use to find what needed to be swapped, but actually this wouldn't help, as they won't know which attribute of which pipeline stage they were assigned to.

    3. We thus have to scan sub-class attributes, but we can do so only after pdp.contextual was called, and right after pipeline stage initialization. Moreover, we can create a literal list of all attribute names we should ignore, stored in pdpipe.core as a global, e.g. _IGNORE_PDPSTAGE_ATT = ['_desc', '_name'], etc. - everything we know isn't an attribute the sub-class declared. Then, we can check any attribute that isn't one of these. This can be done in pdpipe.PdPipelineStage.__init__(), since it's called (by contract; we can demand that from extending subclasses) at the end of the __init__() method of subclasses. When we find that such an attribute holds a pdp.contextual, we register it in a global dict (or something more sophisticated) in the pdp.contextuals module, keyed by the attribute name. We can also register the containing stage object.

    Then, in a stage's fit_transform and transform methods, if the current stage object is registered for contextual hot-swapping, we find the concrete contextual value of any attribute registered for this stage (in either the self.fit_context object or the self.application_context object that the pipeline injects into all stages during applications of the pipeline) and hot-swap it literally: this will look something like setattr(self, 'k', self.fit_context['pca_k']), since we're at pdpipe.PdPipelineStage.fit_transform(), and the self object is an instance of the subclass requiring the hot swap (in this case, pdp.Decompose).

    enhancement complex issue 
    opened by shaypal5 0
  • External file loading problem with pdpipe import


    Discussed in https://github.com/pdpipe/pdpipe/discussions/97

    Originally posted by Dranikf, March 14, 2022: I'll start with an example. I create a test.py file in some dir; it may look like this:

    from sys import path
    import pdpipe
    
    path.append("<path to external folder>/some_external_folder")
    import temp
    

    In ".../some_external_folder" i have added temp.py file, which contains 'hello world' printing. When I try to run first file i have a error:

    Traceback (most recent call last):
      File "<first file folder path>/test.py", line 6, in <module>
        import temp
    ModuleNotFoundError: No module named 'temp'
    

    But when I remove the import pdpipe line, the program starts. What can cause this behavior and how can I correct it?

    bug 
    opened by shaypal5 4
  • Feature: A generic, column-based stage wrapper for any matrix-to-matrix sklearn transformer.


    Basically, the column parameter - which can be a single column, a list of columns or a dynamic ColumnQualifier object - takes a subset of the input dataframe, and the wrapped sklearn transformer transforms just this sub-dataframe, which the stage puts back in the right place.

    enhancement 
    opened by shaypal5 0
  • Chore: Add pickling tests for all pipeline stages


    Issue #71 pointed at a pickling problem with the ColRename stage, with a fix released in v0.0.68. After fixing that, a couple of additional unpickle-able stages were found and fixed, with the fixes released in v0.0.69.

    Everything should be pickle-able now, BUT not all stages are tested for this. This situation should be remedied sooner rather than later.
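    A minimal sketch of the kind of pickling test this calls for (the specific stage and test name here are arbitrary examples):

    import pickle

    import pdpipe as pdp


    def test_coldrop_is_picklable():
        # round-trip the stage through pickle and check it survives intact
        stage = pdp.ColDrop('Medals')
        unpickled = pickle.loads(pickle.dumps(stage))
        assert isinstance(unpickled, pdp.ColDrop)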

    chore tests 
    opened by shaypal5 0