Clean APIs for data cleaning. Python implementation of R package Janitor

Overview

pyjanitor

pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data.

Why janitor?

Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm.

Data preprocessing usually consists of a series of steps that transform raw data into an understandable, usable format. These steps must be run in a particular sequence to succeed. We take a base data file as the starting point and perform actions on it, such as removing null/empty rows, replacing them with other values, adding/renaming/removing columns of data, filtering rows, and others. More formally, these steps, along with their relationships and dependencies, are commonly referred to as a Directed Acyclic Graph (DAG).

The pandas API has been invaluable for the Python data science ecosystem, and implements method chaining of a subset of methods as part of the API. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more, are accomplished via the appropriate pd.DataFrame method calls.
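As a small illustration of the chaining that pandas already supports natively, the sketch below strings together .dropna() and .reset_index(). The sample frame and its column names are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this illustration.
raw = pd.DataFrame({"city": ["A", "B", "C"], "temp": [21.5, np.nan, 19.0]})

# Each call returns a new DataFrame, so the calls can be chained.
cleaned = (
    raw
    .dropna()                # drop the row with the missing temperature
    .reset_index(drop=True)  # renumber the remaining rows from 0
)
print(cleaned)
```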

Inspired by the ease-of-use and expressiveness of the dplyr package of the R statistical language ecosystem, we have evolved pyjanitor into a language for expressing the data processing DAG for pandas users.

To accomplish this, actions that would otherwise require imperative-style statements can be replaced with method chains that allow one to read off the logical order of actions taken. Let us see the annotated example below. First off, here is the textual description of a data cleaning pathway:

  1. Create a DataFrame.
  2. Delete one column.
  3. Drop rows with empty values in two particular columns.
  4. Rename another two columns.
  5. Add a new column.

Let's import some libraries and begin with some sample data for this example:

# Libraries
import numpy as np
import pandas as pd
import janitor

# Sample Data curated for this example
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}

In pandas code, most users might type something like this:

# The Pandas Way

# 1. Create a pandas DataFrame from the company_sales dictionary
df = pd.DataFrame.from_dict(company_sales)

# 2. Delete a column from the DataFrame. Say 'Company1'
del df['Company1']

# 3. Drop rows that have empty values in columns 'Company2' and 'Company3'
df = df.dropna(subset=['Company2', 'Company3'])

# 4. Rename 'Company2' to 'Amazon' and 'Company3' to 'Facebook'
df = df.rename(
    {
        'Company2': 'Amazon',
        'Company3': 'Facebook',
    },
    axis=1,
)

# 5. Let's add some data for another company. Say 'Google'
df['Google'] = [450.0, 550.0, 800.0]

# Output looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

Slightly more advanced users might take advantage of the method chaining that pandas already provides:

df = (
    pd.DataFrame(company_sales)
    .drop(columns="Company1")
    .dropna(subset=['Company2', 'Company3'])
    .rename(columns={"Company2": "Amazon", "Company3": "Facebook"})
    .assign(Google=[450.0, 550.0, 800.0])
    )

# Output looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

With pyjanitor, we enable method chaining with method names that are verbs, which describe the action taken.

df = (
    pd.DataFrame.from_dict(company_sales)
    .remove_columns(['Company1'])
    .dropna(subset=['Company2', 'Company3'])
    .rename_column('Company2', 'Amazon')
    .rename_column('Company3', 'Facebook')
    .add_column('Google', [450.0, 550.0, 800.0])
)

# Output looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

As such, pyjanitor's etymology has a two-fold relationship to "cleanliness". Firstly, it's about extending pandas with convenient data cleaning routines. Secondly, it's about providing a cleaner, method-chaining, verb-based API for common pandas routines.

Installation

pyjanitor is currently installable from PyPI:

pip install pyjanitor

pyjanitor can also be installed with the conda package manager:

conda install pyjanitor -c conda-forge

pyjanitor can be installed with the pipenv environment manager too. This requires enabling prerelease dependencies:

pipenv install --pre pyjanitor

pyjanitor requires Python 3.6+.

Functionality

Current functionality includes:

  • Cleaning column names (multi-indexes are possible!)
  • Removing empty rows and columns
  • Identifying duplicate entries
  • Encoding columns as categorical
  • Splitting your data into features and targets (for machine learning)
  • Adding, removing, and renaming columns
  • Coalescing multiple columns into a single column
  • Date conversions (from MATLAB, Excel, Unix) to Python datetime format
  • Expanding a single column that has delimited, categorical values into dummy-encoded variables
  • Concatenating and deconcatenating columns, based on a delimiter
  • Syntactic sugar for filtering the dataframe based on queries on a column
  • Experimental submodules for finance, biology, chemistry, engineering, and pyspark
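
To make one item from the list concrete, here is a minimal sketch of what "coalescing multiple columns into a single column" means. It uses plain pandas only and is not pyjanitor's implementation; the table and column names are invented for the illustration.

```python
import pandas as pd

# Hypothetical contact table; the first non-null value, left to right, wins.
df = pd.DataFrame({
    "mobile": ["111", None, None],
    "home": [None, "222", None],
    "work": ["333", "444", "555"],
})

# bfill(axis=1) pulls the first non-null value of each row into the
# leftmost column; taking that column yields the coalesced result.
coalesced = df[["mobile", "home", "work"]].bfill(axis=1).iloc[:, 0]
result = df.assign(phone=coalesced)
print(result)
```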

API

The idea behind the API is two-fold:

  • Copy the R package function names, but enable Pythonic use with method chaining or pandas piping.
  • Add other utility functions that make it easy to do data cleaning/preprocessing in pandas.

Continuing with the company_sales dataframe previously used:

import pandas as pd
import numpy as np
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}

There are three ways to use the API. The first, and most strongly recommended, is to use pyjanitor's functions as if they were native to pandas.

import janitor  # upon import, functions are registered as part of pandas.

# This cleans the column names and removes any empty rows and columns
df = pd.DataFrame.from_dict(company_sales).clean_names().remove_empty()

The second is the functional API.

from janitor import clean_names, remove_empty

df = pd.DataFrame.from_dict(company_sales)
df = clean_names(df)
df = remove_empty(df)

The final way is to use the pipe() method:

from janitor import clean_names, remove_empty
df = (
    pd.DataFrame.from_dict(company_sales)
    .pipe(clean_names)
    .pipe(remove_empty)
)

Contributing

Follow the contribution docs for a full description of the process of contributing to pyjanitor.

Adding new functionality

In keeping with the etymology of pyjanitor, contributing a new function to pyjanitor is not difficult at all.

Define a function

First off, you will need to define the function that expresses the data processing/cleaning routine, such that it accepts a dataframe as the first argument, and returns a modified dataframe:

import pandas_flavor as pf

@pf.register_dataframe_method
def my_data_cleaning_function(df, arg1, arg2, ...):
    # Put data processing function here.
    return df

We use pandas_flavor to register the function natively on a pandas.DataFrame.
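
The contract described above (dataframe in as the first argument, modified dataframe out) can be sketched with plain pandas. The function name add_total and its parameters are hypothetical and exist only for this illustration; decorating it with @pf.register_dataframe_method, as in the template above, is what would expose it as a DataFrame method.

```python
import pandas as pd

def add_total(df, columns, new_column_name="total"):
    """Hypothetical cleaning routine: sum `columns` row-wise into a new column."""
    # assign() returns a new DataFrame, so the caller's frame is not mutated.
    return df.assign(**{new_column_name: df[columns].sum(axis=1)})

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Without registration, the function still chains through DataFrame.pipe().
out = df.pipe(add_total, columns=["a", "b"])
print(out)
```

Returning a new frame rather than modifying the input in place keeps the function safe to use in the middle of a method chain.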

Add a test case

Secondly, we ask that you contribute a test case, to ensure that it works as intended. Follow the contribution docs for further details.

Feature requests

If you have a feature request, please post it as an issue on the GitHub repository issue tracker. Even better, put in a PR for it! We are more than happy to guide you through the codebase so that you can make a contribution.

Because pyjanitor is currently maintained by volunteers and has no fiscal support, any feature requests will be prioritized according to what maintainers encounter as a need in our day-to-day jobs. Please temper expectations accordingly.

API Policy

pyjanitor only extends or aliases the pandas API (and other dataframe APIs), but will never fix or replace them.

Undesirable pandas behaviour should be reported upstream in the pandas issue tracker. We explicitly do not fix the pandas API. If at some point the pandas devs decide to take something from pyjanitor and internalize it as part of the official pandas API, then we will deprecate it from pyjanitor, while acknowledging the original contributors' contribution as part of the official deprecation record.

Credits

Test data for the chemistry submodule can be found at Predictive Toxicology.

Comments
  • Case_when function

    Brief Description

    I would like to propose a case_when function, similar to the same function in SQL and R, for conditions. It will be a wrapper around np.select.

    Example API

       Type  Set
    1  A     Z
    2  B     Z
    3  B     X
    4  C     Y

    import numpy as np
    import pandas_flavor as pf

    # Run checks based on conditions and create a new column, or modify an existing column.
    @pf.register_dataframe_method
    def case_when(df, conditions, column_name, default=None):
        # conditions maps each output label to a boolean mask
        condlist, choicelist = zip(*[(value, key) for key, value in conditions.items()])
        df = df.assign(**{column_name: np.select(condlist, choicelist, default=default)})
        return df

    cond = {"green": df.Set == "Z", "yellow": df.Set == "X"}
    df.case_when(conditions=cond, default=None, column_name='rag')

       Type  Set  rag
    1  A     Z    green
    2  B     Z    green
    3  B     X    yellow
    4  C     Y    None
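
For readers who want to try the underlying np.select call without registering a method, here is a minimal standalone sketch using the same table. The string default "none" is an assumption made here to keep the result a plain string column; the proposal above passes default=None.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Type": ["A", "B", "B", "C"], "Set": ["Z", "Z", "X", "Y"]},
    index=[1, 2, 3, 4],
)

conditions = [df["Set"] == "Z", df["Set"] == "X"]
choices = ["green", "yellow"]

# np.select picks, per row, the choice of the first condition that is True;
# rows matching no condition receive the default.
df = df.assign(rag=np.select(conditions, choices, default="none"))
print(df)
```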
    
    opened by samukweku 36
  • [ENH] Pandas String Methods in a Single Class

    PR Description

    Please describe the changes proposed in the pull request:

    • Single function to access pandas' string methods
    • Small subset to see if the current method is acceptable

    This PR resolves #360.

    PR Checklist

    Please ensure that you have done the following:

    1. [x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
    2. [x] If you're not on the contributors list, add yourself to AUTHORS.rst.
    3. [ ] Add a line to CHANGELOG.rst under the latest version header (i.e. the one that is "on deck") describing the contribution.
      • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

    Quick Check

    To do a very quick check that everything is correct, follow these steps below:

    • [x] Run the command make check from pyjanitor's top-level directory. This will automatically run:
      • black formatting
      • flake8 checking
      • running the test suite
      • docs build

    Once done, please check off the check-box above.

    If make check does not work for you, you can execute the commands listed in the Makefile individually.

    Code Changes

    If you are adding code changes, please ensure the following:

    • [x] Ensure that you have added tests.
    • [ ] Run all tests ($ pytest .) locally on your machine.
      • [ ] Check to ensure that test coverage covers the lines of code that you have added.
      • [ ] Ensure that all tests pass.

    Documentation Changes

    If you are adding documentation changes, please ensure the following:

    • [ ] Build the docs locally.
    • [ ] View the docs to check that it renders correctly.

    Relevant Reviewers

    Please tag maintainers to review.

    • @ericmjl
    opened by samukweku 29
  • Select_Columns Function Added Suggestion...

    I see there is a remove_columns function. I think a select_columns function would work nicely. It would be cleaner and easier to understand than df[['col1', 'col2', 'col3']].

    enhancement 
    opened by jcvall 29
  • [DOC] Adding minimal working examples to docstrings; a checklist

    Background

    This thread is born out of the discussion from #968, in an effort to make documentation more beginner-friendly and more understandable. One of the subtasks mentioned in that thread was to go through the function docstrings and add a minimal working example to each of the public functions in pyjanitor.

    Criteria reiterated here for the benefit of discussion:

    • It should fit with our existing choice to go with mkdocs, mkdocstrings, and mknotebooks.
    • The examples should be minimal and executable and complete execution within 5 seconds per function.
    • The examples should display in rich HTML on our docs page.
    • We should have an automatic way of identifying whether a function has an example provided or not so that every function has an example.

    A sample of what an MWE should look like is shown here.


    I'm thinking we can create a task list so that 1. we can encourage more users to join in the effort, and 2. make sure we don't do duplicate work. A lot of the groundwork can be covered by selectively copying one or two examples over from the software test suite.

    Then we can label this issue as a Help Wanted / Low-Hanging Fruit and get people to mention in this thread if they're intending to work on the files?

    Task list

    • [x] functions/add_columns.py
    • [x] functions/also.py
    • [x] functions/bin_numeric.py
    • [x] functions/case_when.py
    • [x] functions/change_type.py
    • [x] functions/clean_names.py
    • [x] functions/coalesce.py
    • [x] functions/collapse_levels.py
    • [x] functions/complete.py
    • [x] functions/concatenate_columns.py
    • [x] functions/conditional_join.py
    • [x] functions/convert_date.py
    • [x] functions/count_cumulative_unique.py
    • [x] functions/currency_column_to_numeric.py
    • [x] functions/deconcatenate_column.py
    • [x] functions/drop_constant_columns.py
    • [x] functions/drop_duplicate_columns.py
    • [x] functions/dropnotnull.py
    • [x] functions/encode_categorical.py
    • [x] functions/expand_column.py
    • [x] functions/expand_grid.py
    • [x] functions/factorize_columns.py
    • [x] functions/fill.py
    • [x] functions/filter.py
    • [x] functions/find_replace.py
    • [x] functions/flag_nulls.py
    • [x] functions/get_dupes.py
    • [x] functions/groupby_agg.py
    • [x] functions/groupby_topk.py
    • [x] functions/impute.py
    • [x] functions/jitter.py
    • [x] functions/join_apply.py
    • [x] functions/label_encode.py
    • [x] functions/limit_column_characters.py
    • [x] functions/min_max_scale.py
    • [x] functions/move.py
    • [x] functions/pivot.py
    • [x] functions/process_text.py
    • [x] functions/remove_columns.py
    • [x] functions/remove_empty.py
    • [x] functions/rename_columns.py
    • [x] functions/reorder_columns.py
    • [x] functions/round_to_fraction.py
    • [x] functions/row_to_names.py
    • [x] functions/select_columns.py
    • [x] functions/shuffle.py
    • [x] functions/sort_column_value_order.py
    • [x] functions/sort_naturally.py
    • [x] functions/take_first.py
    • [x] functions/then.py
    • [x] functions/to_datetime.py
    • [x] functions/toset.py
    • [x] functions/transform_columns.py
    • [x] functions/truncate_datetime.py
    • [x] functions/update_where.py
    • [ ] spark/backend.py
    • [ ] spark/functions.py
    • [x] xarray/functions.py
    • [x] biology.py
    • [x] chemistry.py
    • [x] engineering.py
    • [ ] errors.py
    • [x] finance.py
    • [x] io.py
    • [x] math.py
    • [x] ml.py
    • [x] timeseries.py
    good first issue 
    opened by thatlittleboy 27
  • [ENH] first pass on fill_missing_timestamps

    Closes #705

    PR Description

    Please describe the changes proposed in the pull request:

    • Introducing a "timeseries.py" module with data checking and data cleaning functions geared towards time series data
    • Adding two functions (a utility function and an actual user-facing function) to fill_missing_timestamps in a timeseries data set

    This PR resolves #705.
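
One common way to fill missing timestamps in plain pandas is to reindex against a complete date range. The sketch below illustrates that idea only; it is not necessarily how this PR implements fill_missing_timestamps, and the hourly data is invented for the example.

```python
import pandas as pd

# Hypothetical hourly readings with the 02:00 observation missing.
ts = pd.DataFrame(
    {"value": [1.0, 2.0, 4.0]},
    index=pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 03:00"]
    ),
)

# Build the complete expected index, then reindex; gaps become NaN rows.
full_index = pd.date_range(
    ts.index.min(), ts.index.max(), freq=pd.Timedelta(hours=1)
)
filled = ts.reindex(full_index)
print(filled)
```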

    PR Checklist

    Please ensure that you have done the following:

    1. [x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>. I think I am doing this, but I am really not sure, so please guide me through and I will be happy to resubmit.

    2. [x] If you're not on the contributors list, add yourself to AUTHORS.rst.

    3. [x] Add a line to CHANGELOG.rst under the latest version header (i.e. the one that is "on deck") describing the contribution.

      • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

    Quick Check

    I have made code and documentation changes, however I don't have enough experience in adding unit tests. I just added one unit test using an existing example from the code base, but I need some help with pytest. I have used unittest very briefly in the past, but any guidance regarding the testing piece will be helpful for me to progress further.

    To do a very quick check that everything is correct, follow these steps below:

    • [x] Run the command make check from pyjanitor's top-level directory. This will automatically run:
      • black formatting
      • flake8 checking
      • running the test suite
      • docs build
    tests\conftest.py:7: in <module>
        from janitor.testing_utils import date_data
    E   ModuleNotFoundError: No module named 'janitor'
    

    Code Changes

    If you are adding code changes, please ensure the following:

    • [ ] Ensure that you have added tests.
    • [ ] Run all tests ($ pytest .) locally on your machine.
      • [ ] Check to ensure that test coverage covers the lines of code that you have added.
      • [ ] Ensure that all tests pass.

    Documentation Changes

    If you are adding documentation changes, please ensure the following:

    • [ ] Build the docs locally.
    • [ ] View the docs to check that it renders correctly.

    Relevant Reviewers

    Please tag maintainers to review.

    • @ericmjl
    opened by UGuntupalli 27
  • [ENH] Attempt to fix issues 737, 198, 752, 209, 366, 382, 695

    PR Description

    Please describe the changes proposed in the pull request:

    • We have added the enhancements listed in the above issues, as well as written tests and made sure the continuous integration passed and complied with all linters. Do note: there may be better solutions.

    This PR resolves #737, #198, #752, #209, #366, #382, and #695.

    PR Checklist

    Please ensure that you have done the following:

    Automatic checks

    There will be automatic checks run on the PR. These include:

    • Building a preview of the docs on Netlify
    • Automatically linting the code
    • Making sure the code is documented
    • Making sure that all tests are passed
    • Making sure that code coverage doesn't go down.

    Relevant Reviewers

    Please tag maintainers to review.

    • @ericmjl
    opened by BaritoneBeard 25
  • [ENH] updates to update_where and find_replace functions

    PR Description

    Please describe the changes proposed in the pull request:

    • Changed update_where conditions to query style mode
    • Removed find_replace dependency on update_where function

    This PR resolves #663.

    PR Checklist

    Please ensure that you have done the following:

    1. [x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
    2. [x] If you're not on the contributors list, add yourself to AUTHORS.rst.
    3. [ ] Add a line to CHANGELOG.rst under the latest version header (i.e. the one that is "on deck") describing the contribution.
      • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

    Quick Check

    To do a very quick check that everything is correct, follow these steps below:

    • [x] Run the command make check from pyjanitor's top-level directory. This will automatically run:
      • black formatting
      • flake8 checking
      • running the test suite
      • docs build

    Once done, please check off the check-box above.

    If make check does not work for you, you can execute the commands listed in the Makefile individually.

    Code Changes

    If you are adding code changes, please ensure the following:

    • [x] Ensure that you have added tests.
    • [x] Run all tests ($ pytest .) locally on your machine.
      • [ ] Check to ensure that test coverage covers the lines of code that you have added.
      • [ ] Ensure that all tests pass.

    Documentation Changes

    If you are adding documentation changes, please ensure the following:

    • [ ] Build the docs locally.
    • [ ] View the docs to check that it renders correctly.

    Relevant Reviewers

    Please tag maintainers to review.

    • @ericmjl
    opened by samukweku 22
  • [ENH] Count the cumulative number of unique elements in a column

    Brief Description

    I would like to propose a function that counts the cumulative number of unique items in a column.

    Example implementation

    def cumulative_unique(df, column_name, new_column_name):
        """
        Compute the cumulative number of unique items in a column that have been seen.
        
        This function can be used to limit the number of source plates
        that we use to seed colonies.
        """
        df = df.copy()
        unique_elements = set()
        cumulative_unique = []
        for val in df[column_name].values:
            unique_elements.add(val)
            cumulative_unique.append(len(unique_elements))
        df[new_column_name] = cumulative_unique
        return df
    

    I'm happy for anybody else to run with the implementation above and improve upon it.
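
As one possible improvement on the loop above, the same count can be computed without a Python-level loop. This is a sketch under the assumption that plain pandas is acceptable; the function name cumulative_unique_vectorized is hypothetical, and its treatment of missing values may differ from the loop-based version.

```python
import pandas as pd

def cumulative_unique_vectorized(df, column_name, new_column_name):
    """Hypothetical vectorized variant of the loop-based proposal."""
    # duplicated() is False the first time a value appears, so negating it
    # marks first occurrences; the running sum of those marks is the count.
    first_seen = ~df[column_name].duplicated()
    return df.assign(**{new_column_name: first_seen.cumsum()})

df = pd.DataFrame({"plate": ["p1", "p2", "p1", "p3", "p2"]})
out = cumulative_unique_vectorized(df, "plate", "n_unique")
print(out["n_unique"].tolist())  # [1, 2, 2, 3, 3]
```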

    enhancement good first issue available for hacking 
    opened by ericmjl 22
  • [ENH] Pyjanitor for PySpark

    Brief Description

    I would like to know if there is any interest in creating pyjanitor for pyspark. I'm using pyspark a lot and I would really like to use custom method chaining to clean up my ETL code.

    I'm not sure if it is doable or how easy it is but I would be open to explore.

    enhancement question good advanced issue 
    opened by zjpoh 22
  • [ENH] Add case_when #736

    Adding a case_when function

    Please describe the changes proposed in the pull request:

    • Effectively, this is a wrapper around pd.Series.mask but with an easier to manage mechanism for passing arguments using variable arguments.
    • pd.Series.mask was chosen to support pandas extension arrays/dtypes (np.select would convert them to numpy dtypes, which won't be beneficial for nullable Integer dtypes, String dtypes, or other future pandas dtypes).
    • The function does not mutate the original pd.DataFrame and is vectorized.
    • Can be method chained
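
The idea of building a case_when on pd.Series.mask can be sketched as follows. This is an illustration of the technique, not the code in this PR: it starts from the default value and applies the conditions from lowest to highest priority so that the first matching condition wins.

```python
import pandas as pd

df = pd.DataFrame({"col1": list("ABBC"), "col2": list("ZZXY")})

# (condition, value) pairs in priority order: the first match should win.
cases = [
    (df.col2.eq("Z") & df.col1.eq("A"), "yellow"),
    (df.col2.eq("Z") & df.col1.eq("B"), "blue"),
    (df.col1.eq("B"), "purple"),
]

# Start from the default, then apply the masks from lowest to highest
# priority so that earlier conditions overwrite later ones.
color = pd.Series("black", index=df.index)
for condition, value in reversed(cases):
    color = color.mask(condition, value)

df = df.assign(color=color)
print(df.color.tolist())  # ['yellow', 'blue', 'purple', 'black']
```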

    This PR resolves #736.

    Speed comparison (to be taken with a pinch of salt)

    df = pd.DataFrame({'col1': list('ABBC'), 'col2': list('ZZXY')})
      col1 col2
    0    A    Z
    1    B    Z
    2    B    X
    3    C    Y
    
    df = pd.concat([df]*100_000)
    
    df.shape
    (400000, 2)
    
    
    %%timeit
    conditions = [
        (df['col2'] == 'Z') & (df['col1'] == 'A'),
        (df['col2'] == 'Z') & (df['col1'] == 'B'),
        (df['col1'] == 'B')
    ]
    
    choices = ['yellow', 'blue', 'purple']
    
    df['color'] = np.select(conditions, choices, default='black')
    
    167 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    
    df.case_when(
        df.col2.eq("Z") & df.col1.eq("A"), "yellow",  # first condition and value
        df.col2.eq("Z") & df.col1.eq("B"), "blue",    # second condition and value
        df.col1.eq("B"), "purple",                    # third condition and value 
        "black",                                      # default if no condition is True
        column_name = "color"
        )
    
    125 ms ± 449 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    
    # anonymous function
    df.case_when(
        lambda df: df.col2.eq("Z") & df.col1.eq("A"), "yellow", 
        lambda df: df.col2.eq("Z") & df.col1.eq("B"), "blue",    
        lambda df: df.col1.eq("B"), "purple",                         
        "black",                                                     
        column_name = "color"
        )
    
    134 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    
    # pd.eval
    %%timeit
    df.case_when(
        "col2 == 'Z' and col1 == 'A'", "yellow",
        "col2 == 'Z' and col1 == 'B'", "blue",
        "col1 == 'B'", "purple",
        "black",
        column_name = "color"
        )
    74.4 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    PR Checklist

    Please ensure that you have done the following:

    1. [x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
    2. [x] If you're not on the contributors list, add yourself to AUTHORS.rst.
    3. [x] Add a line to CHANGELOG.rst under the latest version header (i.e. the one that is "on deck") describing the contribution.
      • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

    Quick Check

    To do a very quick check that everything is correct, follow these steps below:

    • [x] Run the command make check from pyjanitor's top-level directory. This will automatically run:
      • black formatting
      • flake8 checking
      • running the test suite
      • docs build

    Once done, please check off the check-box above.

    If make check does not work for you, you can execute the commands listed in the Makefile individually.

    Code Changes

    If you are adding code changes, please ensure the following:

    • [x] Ensure that you have added tests.
    • [x] Run all tests ($ pytest .) locally on your machine.
      • [x] Check to ensure that test coverage covers the lines of code that you have added.
      • [x] Ensure that all tests pass.

    Relevant Reviewers

    Please tag maintainers to review.

    • @ericmjl
    • @samukweku
    opened by robertmitchellv 21
  • [ENH] Pyjanitor fill method

    PR Description

    Please describe the changes proposed in the pull request:

    • A method-chainable fill method for forward and backward fills on selected columns of a dataframe.

    This PR resolves #700.
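
For context, forward and backward fills on selected columns can already be chained in plain pandas through assign(). The sketch below shows that baseline, not the fill method added by this PR; the data is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "group": ["a", None, None, "b"],
    "score": [np.nan, 2.0, np.nan, 4.0],
})

# Forward-fill one column and backward-fill another within a single chain.
filled = df.assign(
    group=lambda d: d["group"].ffill(),
    score=lambda d: d["score"].bfill(),
)
print(filled)
```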

    Please ensure that you have done the following:

    1. [x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
    2. [x] If you're not on the contributors list, add yourself to AUTHORS.rst.
    3. [ ] Add a line to CHANGELOG.rst under the latest version header (i.e. the one that is "on deck") describing the contribution.
      • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

    Quick Check

    To do a very quick check that everything is correct, follow these steps below:

    • [x] Run the command make check from pyjanitor's top-level directory. This will automatically run:
      • black formatting
      • flake8 checking
      • running the test suite
      • docs build

    Once done, please check off the check-box above.

    If make check does not work for you, you can execute the commands listed in the Makefile individually.

    Code Changes

    If you are adding code changes, please ensure the following:

    • [x] Ensure that you have added tests.
    • [x] Run all tests ($ pytest .) locally on your machine.
      • [x] Check to ensure that test coverage covers the lines of code that you have added.
      • [x] Ensure that all tests pass.

    Documentation Changes

    If you are adding documentation changes, please ensure the following:

    • [ ] Build the docs locally.
    • [ ] View the docs to check that it renders correctly.

    Relevant Reviewers

    Please tag maintainers to review.

    • @ericmjl
    opened by samukweku 21
  • `mutate` function

    Brief Description

    I would like to propose a mutate function, similar to pandas' assign function, but more flexible - it will also serve as replacement for the transform functions

    Example API

    df.mutate(y='sum',n=lambda df: df.nth(1))
    df.mutate(y='sum',n=lambda df: df.nth(1), by='x')
    
    
    # replicate dplyr's across
    # https://stackoverflow.com/q/63200530/7175713
    # select_columns syntax can fit in nicely here
    mtcars.summarize(("*t", "mean"), ("*p", "sum"), {"cyl": lambda df: df + 1, "new_col": lambda df: df.select_columns("*t").sum(axis=1)})
    
    opened by samukweku 1
  • `summarize`

    Brief Description

    I would like to propose a summarize function, similar to dplyr's summarise function and pandas' agg function, but for grouping operations, and more flexible

    Example API

    df.summarize(y='sum',n=lambda df: df.nth(1), by='x')
    
    # summarize on multiple columns
    df.summarize((['a','b','c'], 'sum'), by = 'x')
    
    # replicate dplyr's across
    # https://stackoverflow.com/q/63200530/7175713
    # select_columns syntax can fit in nicely here
    mtcars.summarize(("*t", "mean"), ("*p", "sum"), by='cyl')
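
For comparison, the grouped aggregation that a summarize function would build on can be written today with pandas' named aggregation. The sketch below uses only existing pandas calls; the frame and the output column names are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": ["u", "u", "v"], "a": [1, 2, 3], "b": [10, 20, 30]})

# Named aggregation: one output column per (input column, function) pair.
summary = df.groupby("x").agg(a_sum=("a", "sum"), b_mean=("b", "mean"))
print(summary)
```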
    
    opened by samukweku 0
  • [ENH] Improve `conditional_join`

    PR Description

    Please describe the changes proposed in the pull request:

    • Improve selection in conditional_join - either only df or right dataframe can be returned
    • indicator parameter added - similar to the indicator parameter in pd.merge
    • rewrite numba code, such that numba is called only once - numpy used as much as possible, where possible
    • improve code for _range_indices algorithm, taking advantage of binary search for improved performance, for scenarios where both columns on the right dataframe are monotonically increasing

    This PR resolves #1223.
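
To illustrate why monotonically increasing columns allow a binary search, here is a minimal NumPy sketch of a range match (start <= value <= end). It is not the _range_indices code from this PR; the arrays are invented, and np.searchsorted stands in for the binary search.

```python
import numpy as np

# Hypothetical right-hand intervals; both start and end are sorted ascending.
start = np.array([0, 10, 20, 30])
end = np.array([5, 15, 25, 35])

# Left-hand values to match against the intervals (start <= value <= end).
values = np.array([3, 12, 17, 30])

# high: number of intervals whose start is <= value.
high = np.searchsorted(start, values, side="right")
# low: number of intervals whose end is < value (these cannot match).
low = np.searchsorted(end, values, side="left")

# Interval positions low..high-1 satisfy both conditions for each value.
matches = [list(range(lo, hi)) for lo, hi in zip(low, high)]
print(matches)  # [[0], [1], [], [3]]
```

Because both columns are sorted, each lookup is logarithmic in the number of intervals and no additional sorting pass is needed.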

    Speed test when both columns on the right are monotonically increasing

    In [3]: import pandas as pd
       ...: import janitor
       ...: import numpy as np
    
    # adapted from https://stackoverflow.com/a/41094529/7175713
    In [4]: # Sample df1.
       ...: n1 = 50_000
       ...: df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
       ...: 
       ...: # Sample df2.
       ...: n2 = 4_00
       ...: df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
       ...: 
       ...: # Randomly shift the start and end dates of the df2 intervals.
       ...: shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
       ...: shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
       ...: shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
       ...: df2['start_date'] += shift_start
       ...: df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
    
    # this PR 
    # _range_indices performs well here - the numpy option sorts only once (if at all)
    # and as such does less work than the numba option - sorting is relatively expensive,
    # so it is best avoided where possible
    # numba shines for overlaps or scenarios where the columns are not both monotonically increasing
    # and for relatively large dataframes
    In [5]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='), ('end', 'start_date','>='), use_numba = False)
    5.8 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [6]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='),('end', 'start_date','>='), use_numba = True)
    10.6 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # Dev
    In [7]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='), ('end', 'start_date','>='), use_numba = False)
    54 ms ± 8.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [8]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='), ('end', 'start_date','>='), use_numba = True)
    12.4 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    indicator parameter usage:

    In [5]: df1 = pd.DataFrame({'id': [1,1,1,2,2,3],
       ...:                     'value_1': [2,5,7,1,3,4]})
       ...: 
       ...: df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3],
       ...:                     'value_2A': [0,3,7,12,0,2,3,1],
       ...:                     'value_2B': [1,5,9,15,1,4,6,3]})
    
    In [6]: (df1
       ...:  .conditional_join(
       ...:      df2,
       ...:      ('id', 'id', '=='),
       ...:      ('value_1', 'value_2A', '>='),
       ...:      ('value_1', 'value_2B', '<='),
       ...:      how='left',
       ...:      sort_by_appearance=True,
       ...:      indicator=True,
       ...:      # this ensures only the `df` dataframe is returned
       ...:      right_columns=None
       ...:  ))
    Out[6]: 
       id  value_1     _merge
    0   1        2  left_only
    1   1        5       both
    2   1        7       both
    3   2        1       both
    4   2        3       both
    5   2        3       both
    6   3        4  left_only
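For comparison, the `indicator` parameter in `pd.merge`, which the PR describes the new parameter as being similar to, behaves as sketched below. The two small frames are made up for illustration.

```python
import pandas as pd

# Toy frames: id 3 exists only on the left, id 4 only on the right.
left = pd.DataFrame({"id": [1, 2, 3], "value_1": [2, 1, 4]})
right = pd.DataFrame({"id": [1, 2, 4], "value_2": [10, 20, 40]})

# indicator=True adds a categorical `_merge` column that records
# whether each row was found in the left frame, the right frame, or both.
merged = left.merge(right, on="id", how="left", indicator=True)
print(merged)
```

With `how="left"`, every left row is kept, so `_merge` can only be `both` or `left_only` here.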
    

    PR Checklist

    Please ensure that you have done the following:

    1. [ ] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
    2. [x] If you're not on the contributors list, add yourself to AUTHORS.md.
    3. [x] Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
      • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

    Automatic checks

    There will be automatic checks run on the PR. These include:

    • Building a preview of the docs on Netlify
    • Automatically linting the code
    • Making sure the code is documented
    • Making sure that all tests are passed
    • Making sure that code coverage doesn't go down.

    Relevant Reviewers

    Please tag maintainers to review.

    • @ericmjl
    opened by samukweku 7
  • [ENH] Improve selection/perf in `conditional_join`


    • Improve selection in conditional_join, so that a single dataframe (left or right) can be returned
    • simplify the numba implementation
    • add an indicator parameter, similar to the indicator parameter in pd.merge
    opened by samukweku 0
  • [ENH] `pivot_longer` - named groups in regex and dictionary support


    PR Description

    Please describe the changes proposed in the pull request:

    • support for named groups in regex, making it easier to associate a name with each sub-pattern
    • support for a dictionary in names_pattern

    Example:

    iris
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
    0           5.1          3.5           1.4          0.2     setosa
    1           5.9          3.0           5.1          1.8  virginica
    
    In [6]: iris.pivot_longer(
       ...:         index = 'Species',
       ...:         names_to = None,
       ...:         names_pattern = r'(?P<part>.+)\.(?P<dimension>.+)'
       ...:         )
    Out[6]: 
         Species   part dimension  value
    0     setosa  Sepal    Length    5.1
    1  virginica  Sepal    Length    5.9
    2     setosa  Sepal     Width    3.5
    3  virginica  Sepal     Width    3.0
    4     setosa  Petal    Length    1.4
    5  virginica  Petal    Length    5.1
    6     setosa  Petal     Width    0.2
    7  virginica  Petal     Width    1.8
    
    # retain sub parts of the columns as new headers:
    In [26]: iris.pivot_longer(
        ...:         index = 'Species',
        ...:         names_to = None,
        ...:         names_pattern = r'(?P<_>.+)\.(?P<dimension>.+)'
        ...:         )
    Out[26]: 
         Species dimension  Sepal  Petal
    0     setosa    Length    5.1    1.4
    1  virginica    Length    5.9    5.1
    2     setosa     Width    3.5    0.2
    3  virginica     Width    3.0    1.8
    
    # reshape with a dictionary: 
    
    In [27]: iris.pivot_longer(
        ...:         index = 'Species',
        ...:         names_to = None,
        ...:         names_pattern = {'Sepal':'Sepal','Petal':'Petal'})
        ...: 
        ...: 
    Out[27]: 
         Species  Sepal  Petal
    0     setosa    5.1    1.4
    1  virginica    5.9    5.1
    2     setosa    3.5    0.2
    3  virginica    3.0    1.8
    

    This PR resolves #1209 .
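For readers without pyjanitor installed, the first reshape above (named groups `part` and `dimension`) can be approximated with plain pandas. This is a minimal sketch using `melt` and `Series.str.extract`, not how `pivot_longer` is implemented; the intermediate names `long`, `parts` and `tidy` are arbitrary.

```python
import pandas as pd

iris = pd.DataFrame(
    {
        "Sepal.Length": [5.1, 5.9],
        "Sepal.Width": [3.5, 3.0],
        "Petal.Length": [1.4, 5.1],
        "Petal.Width": [0.2, 1.8],
        "Species": ["setosa", "virginica"],
    }
)

# Step 1: unpivot every measurement column into (column name, value) pairs.
long = iris.melt(id_vars="Species", var_name="column", value_name="value")

# Step 2: split the old column names with a regex; each named group
# becomes a new column with that name.
parts = long["column"].str.extract(r"(?P<part>.+)\.(?P<dimension>.+)")

# Step 3: stitch the pieces back together in the desired column order.
tidy = pd.concat([long[["Species"]], parts, long[["value"]]], axis=1)
print(tidy)
```

The named groups play the same role as in the `names_pattern` example: the group name becomes the header and the matched text becomes the cell value.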

    • @ericmjl
    opened by samukweku 3
  • [ENH] minor fixes


    PR Description

    Please describe the changes proposed in the pull request:

    • minor fixes for drop_constant_columns and get_dupes
    • improve performance for select when all entries are scalars and are the same dtype
    • move is more flexible with the select_columns syntax - multiple columns/rows can be moved at once
    • avoid mutation in collapse_levels
    • impute now supports multiple columns, making it easy to deprecate fill_empty
    • fix deprecation warning for np.bool8

    Please tag maintainers to review.

    • @ericmjl
    opened by samukweku 2
Releases(v0.24.0)
  • v0.24.0(Nov 12, 2022)

  • v0.23.1(May 3, 2022)

  • v0.22.0(Nov 21, 2021)

    Contribution details can be found in CHANGELOG.md

    What's Changed

    • [INF] simplify a bit linting, use pre-commit as CI linting checker by @Zeroto521 in https://github.com/pyjanitor-devs/pyjanitor/pull/892
    • [ENH] Updates to Pivot longer by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/886
    • [ENH] Pivot wider improve by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/874
    • [BUGFIX] Conditional Joins by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/910
    • [ENH] Add case_when #736 by @robertmitchellv in https://github.com/pyjanitor-devs/pyjanitor/pull/775
    • [ENH] Process text by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/878
    • [ENH] Improve complete function by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/933
    • [ENH] Handle null values in concatenate_columns function by @farhanreynaldo in https://github.com/pyjanitor-devs/pyjanitor/pull/935
    • [DOC] Hotfix for CHANGELOG.md by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/944
    • Improvements to Conditional Join by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/943
    • [ENH] Categoricals improve by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/930
    • Bump babel from 2.8.0 to 2.9.1 in /.requirements by @dependabot in https://github.com/pyjanitor-devs/pyjanitor/pull/950
    • [ENH] Complete dict fix by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/948
    • [ENH] Deprecate names_sort from pivot_wider by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/951
    • [DOC] Switch from Sphinx to MkDocs by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/897
    • XFAIL test_dict_extension_array for janitor.complete function (CI dtype mismatch) by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/953
    • [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in https://github.com/pyjanitor-devs/pyjanitor/pull/955
    • [ENH] conditional_join by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/954
    • [ENH] add softmax to math module by @loganthomas in https://github.com/pyjanitor-devs/pyjanitor/pull/941
    • [ENH] Conditional_join by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/956

    New Contributors

    • @robertmitchellv made their first contribution in https://github.com/pyjanitor-devs/pyjanitor/pull/775
    • @farhanreynaldo made their first contribution in https://github.com/pyjanitor-devs/pyjanitor/pull/935
    • @pre-commit-ci made their first contribution in https://github.com/pyjanitor-devs/pyjanitor/pull/955

    Full Changelog: https://github.com/pyjanitor-devs/pyjanitor/compare/v0.21.2...v0.22.0

  • v0.21.2(Sep 1, 2021)

    Contribution details can be found in CHANGELOG.md

    What's Changed

    • [ENH] Coalesce keyword args only by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/887
    • [INF] Add SciPy as explicit dependency in base.in by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/896

    Full Changelog: https://github.com/pyjanitor-devs/pyjanitor/compare/v0.21.1...v0.21.2

  • v0.21.1(Aug 29, 2021)

    Contribution details can be found in CHANGELOG.md

    What's Changed

    • [INF] GitHub Auto-Release Pointer by @loganthomas in https://github.com/pyjanitor-devs/pyjanitor/pull/842
    • [DOC] Updated broken links in the README and contributing docs files by @nvamsikrishna05 in https://github.com/pyjanitor-devs/pyjanitor/pull/839
    • [INF] Update pre-commit hook revs by @loganthomas in https://github.com/pyjanitor-devs/pyjanitor/pull/846
    • [INF] Update code_check github actions black version by @nvamsikrishna05 in https://github.com/pyjanitor-devs/pyjanitor/pull/848
    • [DOC] Fix AUTHORS.rst by @loganthomas in https://github.com/pyjanitor-devs/pyjanitor/pull/843
    • [ENH] Updated label_encode to use pandas factorize by @nvamsikrishna05 in https://github.com/pyjanitor-devs/pyjanitor/pull/847
    • Add codecov.io to test suite by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/851
    • [ENH] Added factorize_columns fuction by @nvamsikrishna05 in https://github.com/pyjanitor-devs/pyjanitor/pull/850
    • [ENH] Add drop_constant_columns function by @fireddd in https://github.com/pyjanitor-devs/pyjanitor/pull/852
    • [ENH] Add reset index flag to row_to_name function by @fireddd in https://github.com/pyjanitor-devs/pyjanitor/pull/849
    • [DOC] Scrub readthedocs from repo by @loganthomas in https://github.com/pyjanitor-devs/pyjanitor/pull/864
    • Doc updates by @loganthomas in https://github.com/pyjanitor-devs/pyjanitor/pull/865
    • [INF] Fix isort checks by @loganthomas in https://github.com/pyjanitor-devs/pyjanitor/pull/866
    • [DOC] Add a link to the documentation, update deprecated links by @adrienpacifico in https://github.com/pyjanitor-devs/pyjanitor/pull/875
    • Forgot the word documentation in my sentence by @adrienpacifico in https://github.com/pyjanitor-devs/pyjanitor/pull/877
    • [DOC]Add missing functions to API docs Index by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/872
    • [INF] Add docs build GitHub action by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/870
    • [ENH] Fill direction by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/879
    • [EHN] Set expand_column's sep default as "|" by @Zeroto521 in https://github.com/pyjanitor-devs/pyjanitor/pull/880
    • [ENH] variable args for complete by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/857
    • [ENH} Conditional Join by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/861
    • [INF] Speed up pytest via -n (pytest-xdist) option by @Zeroto521 in https://github.com/pyjanitor-devs/pyjanitor/pull/884
    • [DOC] Add list mark to keep select_columns's example same style by @Zeroto521 in https://github.com/pyjanitor-devs/pyjanitor/pull/888
    • [ENH] rename_columns function mapper enhancements by @nvamsikrishna05 in https://github.com/pyjanitor-devs/pyjanitor/pull/893

    New Contributors

    • @fireddd made their first contribution in https://github.com/pyjanitor-devs/pyjanitor/pull/852
    • @adrienpacifico made their first contribution in https://github.com/pyjanitor-devs/pyjanitor/pull/875

    Full Changelog: https://github.com/pyjanitor-devs/pyjanitor/compare/v0.21.0...v0.21.1

  • v0.21.0(Aug 28, 2021)

    Release Notes can be found in the Changelog.rst file

    What's Changed

    • Bump pyyaml from 5.3.1 to 5.4 in /.requirements by @dependabot in https://github.com/pyjanitor-devs/pyjanitor/pull/822
    • [ENH] General Fixes by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/823
    • Bump pygments from 2.7.1 to 2.7.4 in /.requirements by @dependabot in https://github.com/pyjanitor-devs/pyjanitor/pull/824
    • [ENH] fix bug pivot_longer by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/828
    • Pin py at 1.10.0 by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/830
    • [ENH] Minor fixes by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/833
    • [ENH] Pivot longer and other fixes by @samukweku in https://github.com/pyjanitor-devs/pyjanitor/pull/837
    • [ENH] Attempt to fix issues 737, 198, 752, 209, 366, 382, 695 by @BaritoneBeard in https://github.com/pyjanitor-devs/pyjanitor/pull/832
    • [ENH] Updated convert_excel_date to throw meaningful error by @nvamsikrishna05 in https://github.com/pyjanitor-devs/pyjanitor/pull/841

    New Contributors

    • @BaritoneBeard made their first contribution in https://github.com/pyjanitor-devs/pyjanitor/pull/832

    Full Changelog: https://github.com/pyjanitor-devs/pyjanitor/compare/v0.20.14...v0.21.0

  • v0.18.1(Aug 10, 2019)

    Contribution details can be found in CHANGELOG.rst.

    What's Changed

    • [ENH] add preserve_position kwarg to deconcatenate_column #478 by @shandou in https://github.com/pyjanitor-devs/pyjanitor/pull/484
    • [DOC] started release notes for version 0.18.1 #489 by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/490
    • [DOC] Added contributions that did not leave git trace #483 by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/491
    • [TST] merged tests for deconcatenate columns #492 by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/494
    • [DOC] Clarify Python version compatibility to >=py36 by @hectormz in https://github.com/pyjanitor-devs/pyjanitor/pull/497
    • [ENH] Add inflation adjustment function #384 by @rahosbach in https://github.com/pyjanitor-devs/pyjanitor/pull/485
    • [DOC] Update PR template to include changelog in release. by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/499
    • [DOC] add inflation adjustment to changelog by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/500
    • [DOC] Docstring n code guideline #488 by @shandou in https://github.com/pyjanitor-devs/pyjanitor/pull/505
    • [ENH] Add optional removal of accents on functions.clean_names, enabled by default. by @mralbu in https://github.com/pyjanitor-devs/pyjanitor/pull/506
    • [DOC] prettyfying changelog by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/508
    • [ENH] Add snake option to clean_names by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/509
    • [ENH] Added convert_units() method in engineering submodule by @rahosbach in https://github.com/pyjanitor-devs/pyjanitor/pull/507
    • [DOC] Add PyPI description by @hectormz in https://github.com/pyjanitor-devs/pyjanitor/pull/513
    • [ENH] Add Example Notebook for Finance Submodule by @rahosbach in https://github.com/pyjanitor-devs/pyjanitor/pull/517
    • [ENH] Add null_flag function #501 by @anzelpwj in https://github.com/pyjanitor-devs/pyjanitor/pull/510
    • [INF] add pip-wheel-metadata to gitignore by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/521
    • [DOC] Link to contribution docs for tests by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/522
    • [DOC] Fix markdown level for intl teach notebook #476 by @jk3587 in https://github.com/pyjanitor-devs/pyjanitor/pull/523
    • [BUG] changed assertion statements in deconcatenate_columns to error statements by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/525

    New Contributors

    • @mralbu made their first contribution in https://github.com/pyjanitor-devs/pyjanitor/pull/506

    Full Changelog: https://github.com/pyjanitor-devs/pyjanitor/compare/v0.18.0...v0.18.1

  • v0.1.1(Mar 29, 2018)

    First public release to PyPI and conda-forge.

    What's Changed

    • Initial Update by @pyup-bot in https://github.com/pyjanitor-devs/pyjanitor/pull/7
    • Updated column name cleaning functions. by @ericmjl in https://github.com/pyjanitor-devs/pyjanitor/pull/9
    • Clean names multiindex by @JoshuaC3 in https://github.com/pyjanitor-devs/pyjanitor/pull/11
    • Clean names remove outer underscores by @JoshuaC3 in https://github.com/pyjanitor-devs/pyjanitor/pull/13

    New Contributors

    • @pyup-bot made their first contribution in https://github.com/pyjanitor-devs/pyjanitor/pull/7

    Full Changelog: https://github.com/pyjanitor-devs/pyjanitor/commits/v0.1.1

Owner
Eric Ma
Find more about me at my personal website.
Build, test, deploy, iterate - Dev and prod tool for data science pipelines

Prodmodel is a build system for data science pipelines. Users, testers, contributors are welcome! Motivation · Concepts · Installation · Usage · Contr

Prodmodel 53 Nov 29, 2022
Microsoft Azure provides a wide number of services for managing and storing data

Microsoft Azure provides a wide number of services for managing and storing data. One product is Microsoft Azure SQL. Which gives us the capability to create and manage instances of SQL Servers hoste

Riya Vijay Vishwakarma 1 Dec 12, 2021
Directions overlay for working with pandas in an analysis environment

dovpanda Directions OVer PANDAs Directions are hints and tips for using pandas in an analysis environment. dovpanda is an overlay companion for workin

dovpandev 431 Dec 20, 2022
functional data manipulation for pandas

pandas-ply: functional data manipulation for pandas pandas-ply is a thin layer which makes it easier to manipulate data with pandas. In particular, it

Coursera 188 Nov 24, 2022
Pandas integration with sklearn

Sklearn-pandas This module provides a bridge between Scikit-Learn's machine learning methods and pandas-style Data Frames. In particular, it provides

2.7k Dec 27, 2022
BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

BatchFlow BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflo

Data Analysis Center 185 Dec 20, 2022
Tools for parsing messy tabular data.

Parsing for messy tables A library for dealing with messy tabular data in several formats, guessing types and detecting headers. See the documentation

Open Knowledge Foundation 382 Nov 10, 2022
dplyr for python

Dplython: Dplyr for Python Welcome to Dplython: Dplyr for Python. Dplyr is a library for the language R designed to make data analysis fast and easy.

Chris Riederer 754 Nov 21, 2022
A Python toolkit for processing tabular data

meza: A Python toolkit for processing tabular data Index Introduction | Requirements | Motivation | Hello World | Usage | Interoperability | Installat

Reuben Cummings 401 Dec 19, 2022
Easy pipelines for pandas DataFrames.

pdpipe ˨ Easy pipelines for pandas DataFrames (learn how!). Website: https://pdpipe.github.io/pdpipe/ Documentation: https://pdpipe.github.io/pdpipe/d

694 Jan 05, 2023