Tidy interface to polars

Overview

tidypolars

PyPI Latest Release

tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users.

Installation

$ pip3 install tidypolars

General syntax

tidypolars methods are designed to work like tidyverse functions:

import tidypolars as tp
from tidypolars import col, desc

df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])

(
    df
    .select('x', 'y', 'z')
    .filter(col('x') < 4, col('y') > 1)
    .arrange(desc('z'), 'x')
    .mutate(double_x = col('x') * 2,
            x_plus_y = col('x') + col('y'))
)
┌─────┬─────┬─────┬──────────┬──────────┐
│ xyzdouble_xx_plus_y │
│ ---------------      │
│ i64i64stri64i64      │
╞═════╪═════╪═════╪══════════╪══════════╡
│ 25b47        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 03a03        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 14a25        │
└─────┴─────┴─────┴──────────┴──────────┘

The key difference from R is that column names must be wrapped in col() in the following methods:

  • .filter()
  • .mutate()
  • .summarize()

The general idea - when doing calculations on a column you need to wrap it in col(). When doing simple column selections (like in .select()) you can pass the column names as strings.

Group by syntax

Methods operate by group by calling the by arg.

  • A single column can be passed with by = 'z'
  • Multiple columns can be passed with by = ['y', 'z']
(
    df
    .summarize(avg_x = tp.mean(col('x')),
               by = 'z')
)
┌─────┬───────┐
│ zavg_x │
│ ------   │
│ strf64   │
╞═════╪═══════╡
│ a0.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b2     │
└─────┴───────┘

Selecting/dropping columns

tidyselect functions can be mixed with normal selection when selecting columns:

df = tp.Tibble(x1 = range(3), x2 = range(3), y = range(3), z = range(3))

df.select(tp.starts_with('x'), 'z')
┌─────┬─────┬─────┐
│ x1x2z   │
│ --------- │
│ i64i64i64 │
╞═════╪═════╪═════╡
│ 000   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 111   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 222   │
└─────┴─────┴─────┘

To drop columns use the .drop() method:

df.drop(tp.starts_with('x'), 'z')
┌─────┐
│ y   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 2   │
└─────┘

Converting to/from pandas data frames

If you need to use a package that requires pandas data frames, you can convert from a tidypolars Tibble to a pandas DataFrame.

To do this you'll first need to install pyarrow:

pip3 install pyarrow

To convert to a pandas DataFrame:

df = df.to_pandas()

To convert from a pandas DataFrame to a tidypolars Tibble:

df = tp.from_pandas(df)

Speed Comparisons

A few notes:

  • Comparing times from separate functions typically isn't very useful. For example - the .summarize() tests were performed on a different dataset from .pivot_wider().
  • All tests are run 5 times. The times shown are the median of those 5 runs.
  • All timings are in milliseconds.
  • All tests can be found in the source code here.
  • FAQ - Why are some tidypolars functions faster than their polars counterpart?
    • Short answer - they're not! After all they're just using polars in the background.
    • Long answer - All python functions have some slight natural variation in their execution time. By chance the tidypolars runs were slightly shorter on those specific functions on this iteration of the tests. However one goal of these tests is to show that the "time cost" of translating syntax to polars is very negligible to the user (especially on medium-to-large datasets).
  • Lastly I'd like to mention that these tests were not rigorously created to cover all angles equally. They are just meant to be used as general insight into the performance of these packages.
┌─────────────┬────────────┬─────────┬──────────┐
│ func_testedtidypolarspolarspandas   │
│ ------------      │
│ strf64f64f64      │
╞═════════════╪════════════╪═════════╪══════════╡
│ arrange190.345169.478500.112  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ case_when87.34879.427152.623  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ distinct16.88816.28228.725   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ filter29.78929.91231.397  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ full_join236.784231.2831042.689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ inner_join49.7147.563630.98   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ left_join113.7921151100.607 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ mutate7.9797.408117.283  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ pivot_wider42.76439.93949.048   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ summarize59.43458.011453.707  │
└─────────────┴────────────┴─────────┴──────────┘

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

Comments
  • `drop` with error `RuntimeError: Any(NotFound(

    `drop` with error `RuntimeError: Any(NotFound("^x.*$"))`

    import sys
    import tidypolars as tp
    sys.version
    # '3.9.7 (default, Sep 16 2021, 13:09:58) \n[GCC 7.5.0]'
    tp.__version__
    # '0.2.1'
    ## error
    df = tp.Tibble(x1 = range(3), x2 = range(3), y=range(3), z = range(3))
    df.drop([tp.starts_with('x'), 'z'])
    df.drop()
    `
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    /tmp/ipykernel_12815/866601321.py in <module>
    ----> 1 df.drop(tp.starts_with('x'))
    
    ~/miniconda3/envs/py39/lib/python3.9/site-packages/polars/eager/frame.py in drop(self, name)
       2253             return df
       2254 
    -> 2255         return wrap_df(self._df.drop(name))
       2256 
       2257     def drop_in_place(self, name: str) -> "pl.Series":
    
    RuntimeError: Any(NotFound("^x.*$"))
    `
    
    
    
    opened by ztsweet 9
  • `AttributeError: arrange not found`

    `AttributeError: arrange not found`

    import tidypolars as tp
    from tidypolars import col, desc
    import sys
    sys.version
    # '3.10.0 | packaged by conda-forge | (default, Oct 12 2021, 21:24:52) [GCC 9.4.0]'
    tp.__version__
    # '0.2.1'
    df = tp.Tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
    df.arrange('x', 'y')
    `
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    ~/miniconda3/envs/py310/lib/python3.10/site-packages/polars/eager/frame.py in __getattr__(self, item)
        882         try:
    --> 883             return pl.eager.series.wrap_s(self._df.column(item))
        884         except RuntimeError:
    
    RuntimeError: Any(NotFound("arrange"))
    
    During handling of the above exception, another exception occurred:
    
    AttributeError                            Traceback (most recent call last)
    /tmp/ipykernel_21110/1194586334.py in <module>
    ----> 1 df.arrange('x', 'y')
    
    ~/miniconda3/envs/py310/lib/python3.10/site-packages/polars/eager/frame.py in __getattr__(self, item)
        883             return pl.eager.series.wrap_s(self._df.column(item))
        884         except RuntimeError:
    --> 885             raise AttributeError(f"{item} not found")
        886 
        887     def __iter__(self) -> Iterator[Any]:
    
    AttributeError: arrange not found
    `
    
    bug 
    opened by ztsweet 6
  • Missing attributes when chaining

    Missing attributes when chaining

    Hi Mark, thanks for putting this package together. It looks very cool.

    I'm having a tough time getting the motivating examples to work, though. For example, the following triggers an error:

    import tidypolars as tp
    from tidypolars import col, desc
    
    df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])
    
    df.filter(col('x') < 2).arrange(desc('z'), 'x')
    
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Untitled-1 in <cell line: 1>()
    ----> <a href='untitled:Untitled-1?line=7'>8</a> df.filter(col('x') < 2).arrange(desc('z'), 'x')
    
    AttributeError: 'DataFrame' object has no attribute 'arrange'
    

    What's genuinely odd about the above is that arrange works on its own and when it comes before filter.

    # All of these work as expected
    df.filter(col('x') < 2)
    df.arrange(desc('z'), 'x')
    df.arrange(desc('z'), 'x').filter(col('x') < 2)
    

    A seemingly related issue is that I can't pass two arguments to filter when it follows arrange (or other verbs likeselect for that matter).

    df.filter(col('x') < 2, col('y')>3) ## works
    df.arrange(desc('z'), 'x').filter(col('x') < 2, col('y')>3) ## errors with "filter() takes 2 positional arguments but 3 were given"
    

    Any ideas?

    I'm on Python 3.9.2 installed via Homebrew on a 2019 Macbook (so regular Intel chip) and running the latest version of tidypolars (0.2.15).

    bug 
    opened by grantmcdermott 5
  • Is it possible to have dplyr's `group_by` + `mutate` behavior?

    Is it possible to have dplyr's `group_by` + `mutate` behavior?

    First of all, I really like this package and I've started to use it a lot in my work. As a Pythonista whose first language is R, I really enjoy tidypolars.

    In R, we can do something like the following

    library(dplyr)
    data(iris)
    
    iris %>%
      group_by(Species) %>%
      mutate(
        result = Petal.Width - mean(Petal.Width)
      )
    

    Since we have a group_by(Species) call, dplyr will subtract the mean that corresponds to each group in the mutate() operation (not the mean across all observations from all species).

    As far as I understand, this is still not possible with tidypolars since we don't have a group_by function that behaves in a similar way to the one in dplyr. So my questions are

    • Is it possible to have this behavior in tidypolars now?
      • If yes, how?
      • If not, is it going to be possible? I could volunteer to try to implement it. I'm not familiar with the existing codebase, but I suspect that Python eager evaluation of function arguments is what makes it harder to have such a feature?

    Again, thanks for the fantastic library!

    opened by tomicapretto 5
  • idiomatic way to add list as column

    idiomatic way to add list as column

    Forgive what's probably a dumb question, but is there a way to get .mutate to return the same object as the .bind_cols line?

    import tidypolars as tp
    
    tb = tp.Tibble({'a': [1, 2, 3]})
    x = [4, 5, 6]
    # gives desired output 
    tb.bind_cols(tp.Tibble({'b': x}))
    # gives error: ValueError: could not convert value '[4, 5, 6]' as a Literal
    tb.mutate(b = x)
    
    feature 
    opened by eutwt 4
  • purrr functions!?

    purrr functions!?

    I noticed that in tidytable, you have purrr functions like map.(), but not in tidypolars.

    Using for loops + lambda functions are just not desirable for collaborative coding / code readability/comprehension. In Python, even if there is a bit of sacrifice in performance, if it allows better code readability, it would be really nice to have.

    Would something like map.() be in the scope of this repo?

    feature 
    opened by exsell-jc 3
  • ```ValueError``` with ```filter```

    ```ValueError``` with ```filter```

    When I chain filter expressions with | (error message said to use | and not or), I receive a ValueError message:

    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(col('chr_col') == 'this is a test 1' |
                col('chr_col') == 'this is a test 2')
    
    ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. 
    Hint: use '&' or '|' to chain Expr together, not and/or.
    

    It works fine if I do one or the other:

    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(# col('chr_col') == 'this is a test 1' |
                col('chr_col') == 'this is a test 2')
    
    # chr_col
    #   --
    #   str
    # "this is a test 2"
    
    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(col('chr_col') == 'this is a test 1' # |
                # col('chr_col') == 'this is a test 2'
                )
    
    # chr_col
    #   --
    #   str
    # "this is a test 1"
    
    opened by alexandro-ag 3
  • ```as_date``` with ```RuntimeError: please define a fmt```

    ```as_date``` with ```RuntimeError: please define a fmt```

    Good afternoon,

    I think I found an issue with the as_date method. In the example per the documentation, the following succeeds:

    import tidypolars as tp
    from tidypolars import col
    
    date_df = tp.Tibble(date = ['2021-12-31']) # Year-Month-Day (%Y-%m-%d)
    date_df.mutate(date_parsed = tp.as_date(col('date'))) # Success
    

    However when parsing different formats (using the fmt argument), the date fails to parse:

    import tidypolars as tp
    from tidypolars import col
    
    date_df = tp.Tibble(date = ['12/31/2021']) # Month/Day/Year (%m/%d/%Y)
    date_df.mutate(date_parsed = tp.as_date(col('date'), fmt='%m/%d/%Y')) # RuntimeError
    

    I also extend my appreciation for all the work on this package. I've been searching for a tidyverse implementation in python and this one knocks my expectations out of the park. Thank you.

    opened by alexandro-ag 3
  • Revisit `.rename()` syntax

    Revisit `.rename()` syntax

    Should the syntax be the same as pl.DataFrame.rename? Currently polars mimics pandas syntax. Or should it be something that attempts to mimic tidyverse syntax?

    Note: polars also has a .rename_col() with syntax df.rename_col('old', 'new').

    opened by markfairbanks 3
  • Compatibility with polars v0.14.0

    Compatibility with polars v0.14.0

    PR that caused the break: https://github.com/pola-rs/polars/pull/4309

    Old behavior that tidypolars relied on: https://github.com/pola-rs/polars/pull/2862

    feature 
    opened by markfairbanks 2
  • Basics: tp.read_csv(), df.drop(x1, x2, x3, ...), and df.colnames?

    Basics: tp.read_csv(), df.drop(x1, x2, x3, ...), and df.colnames?

    Really new to the library, but looking at the documentation did not really help with understanding.

    Problem 1

    import polars as pl
    import tidypolars as tp
    import csv
    import requests
    
    url = f'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-24/Scrumqueens-data-2022-05-23.csv'
    
    df = tp.read_csv(file = url) # does not work
    df = pl.read_csv(file = url) # works??
    

    Problem 2

    df = df.drop('...1', 'Notes') # does not work
    df = df.drop('...1') # works separately
    df = df.drop('Notes') # works separately
    

    Problem 3

    df.colnames
    df.names
    df.colnames()
    df.names()
    # None of these work
    

    What am I missing, exactly?

    opened by exsell-jc 2
  • plans for adding type hints

    plans for adding type hints

    Hi, it seems that the codebase is not annotated making the discoverability of methods difficult and static code analysis not working. Any plans on adding type hints?

    feature 
    opened by mr-majkel 1
  • `write_csv()` returns `'super' object has no attribute 'to_csv'`

    `write_csv()` returns `'super' object has no attribute 'to_csv'`

    Hi, There seems to be a problem with write_csv(). I can import tidypolars and the data just fine:

    import tidypolars as tp
    
    rents = tp.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv")
    

    But when I try to export the data frame as a csv file:

    rents.write_csv("rents.csv")
    

    I get an error stating 'super' object has no attribute 'to_csv'.

    The data come from the Tidytuesday repo. Python version is 3.10.8 and tidypolars is 0.2.19. I'm on macOS 13.

    bug 
    opened by alesvomacka 1
  • Calculating time

    Calculating time

    In R with lubridate, it would look like this:

    one_year_before = some_date - years(1)
    one_year_before = some_date - months(12)
    

    But in tidypolars functions list, there doesn't seem to be a years or months function: https://tidypolars.readthedocs.io/en/latest/reference.html

    feature 
    opened by exsell-jc 4
Releases(v0.2.19)
[ICML 2020] Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control

PG-MORL This repository contains the implementation for the paper Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Contro

MIT Graphics Group 65 Jan 07, 2023
This GitHub repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.'

About Repository This repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.' About Code

Arun Verma 1 Nov 09, 2021
Face Recognition plus identification simply and fast | Python

PyFaceDetection Face Recognition plus identification simply and fast Ubuntu Setup sudo pip3 install numpy sudo pip3 install cmake sudo pip3 install dl

Peyman Majidi Moein 16 Sep 22, 2022
Aircraft design optimization made fast through modern automatic differentiation

Aircraft design optimization made fast through modern automatic differentiation. Plug-and-play analysis tools for aerodynamics, propulsion, structures, trajectory design, and much more.

Peter Sharpe 394 Dec 23, 2022
GAN-STEM-Conv2MultiSlice - Exploring Generative Adversarial Networks for Image-to-Image Translation in STEM Simulation

GAN-STEM-Conv2MultiSlice GAN method to help covert lower resolution STEM images generated by convolution methods to higher resolution STEM images gene

UW-Madison Computational Materials Group 2 Feb 10, 2021
Fast Learning of MNL Model From General Partial Rankings with Application to Network Formation Modeling

Fast-Partial-Ranking-MNL This repo provides a PyTorch implementation for the CopulaGNN models as described in the following paper: Fast Learning of MN

Xingjian Zhang 3 Aug 19, 2022
🏅 Top 5% in 제2회 연구개발특구 인공지능 경진대회 AI SPARK 챌린지

AI_SPARK_CHALLENG_Object_Detection 제2회 연구개발특구 인공지능 경진대회 AI SPARK 챌린지 🏅 Top 5% in mAP(0.75) (443명 중 13등, mAP: 0.98116) 대회 설명 Edge 환경에서의 가축 Object Dete

3 Sep 19, 2022
Code for LIGA-Stereo Detector, ICCV'21

LIGA-Stereo Introduction This is the official implementation of the paper LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-based

Xiaoyang Guo 75 Dec 09, 2022
[ACL-IJCNLP 2021] Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning

CLNER The code is for our ACL-IJCNLP 2021 paper: Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning CLNER is a

71 Dec 08, 2022
Code for the paper Task Agnostic Morphology Evolution.

Task-Agnostic Morphology Optimization This repository contains code for the paper Task-Agnostic Morphology Evolution by Donald (Joey) Hejna, Pieter Ab

Joey Hejna 18 Aug 04, 2022
KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

Kakao Brain 799 Dec 28, 2022
This is the pytorch implementation of the paper - Axiomatic Attribution for Deep Networks.

Integrated Gradients This is the pytorch implementation of "Axiomatic Attribution for Deep Networks". The original tensorflow version could be found h

Tianhong Dai 150 Dec 23, 2022
Checking fibonacci - Generating the Fibonacci sequence is a classic recursive problem

Fibonaaci Series Generating the Fibonacci sequence is a classic recursive proble

Moureen Caroline O 1 Feb 15, 2022
Autotype on websites that have copy-paste disabled like Moodle, HackerEarth contest etc.

Autotype A quick and small python script that helps you autotype on websites that have copy paste disabled like Moodle, HackerEarth contests etc as it

Tushar 32 Nov 03, 2022
You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling Transformer-based models are widely used in natural language processi

Zhanpeng Zeng 12 Jan 01, 2023
Simple node deletion tool for onnx.

snd4onnx Simple node deletion tool for onnx. I only test very miscellaneous and limited patterns as a hobby. There are probably a large number of bugs

Katsuya Hyodo 6 May 15, 2022
PyTorch implementation of the paper: "Preference-Adaptive Meta-Learning for Cold-Start Recommendation", IJCAI, 2021.

PAML PyTorch implementation of the paper: "Preference-Adaptive Meta-Learning for Cold-Start Recommendation", IJCAI, 2021. (Continuously updating ) Int

15 Nov 18, 2022
HAR-stacked-residual-bidir-LSTMs - Deep stacked residual bidirectional LSTMs for HAR

HAR-stacked-residual-bidir-LSTM The project is based on this repository which is presented as a tutorial. It consists of Human Activity Recognition (H

Guillaume Chevalier 287 Dec 27, 2022
TensorFlow ROCm port

Documentation TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, a

ROCm Software Platform 622 Jan 09, 2023
Official Implementation of CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback

CoSMo.pytorch Official Implementation of CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback, Seungmin Lee*, Dongwan Kim*, Bohyung

Seung Min Lee 54 Dec 08, 2022