Tidy interface to polars

Overview

tidypolars

tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users.

Installation

$ pip3 install tidypolars

General syntax

tidypolars methods are designed to work like tidyverse functions:

import tidypolars as tp
from tidypolars import col, desc

df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])

(
    df
    .select('x', 'y', 'z')
    .filter(col('x') < 4, col('y') > 1)
    .arrange(desc('z'), 'x')
    .mutate(double_x = col('x') * 2,
            x_plus_y = col('x') + col('y'))
)
┌─────┬─────┬─────┬──────────┬──────────┐
│ x   ┆ y   ┆ z   ┆ double_x ┆ x_plus_y │
│ --- ┆ --- ┆ --- ┆ ---      ┆ ---      │
│ i64 ┆ i64 ┆ str ┆ i64      ┆ i64      │
╞═════╪═════╪═════╪══════════╪══════════╡
│ 2   ┆ 5   ┆ b   ┆ 4        ┆ 7        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ 3   ┆ a   ┆ 0        ┆ 3        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ 4   ┆ a   ┆ 2        ┆ 5        │
└─────┴─────┴─────┴──────────┴──────────┘

The key difference from R is that column names must be wrapped in col() in the following methods:

  • .filter()
  • .mutate()
  • .summarize()

The general rule: wrap a column in col() whenever you perform a calculation on it. For simple column selection (as in .select()), pass the column names as strings.
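To illustrate why calculations need an expression object rather than a bare string, here is a minimal plain-Python sketch. The `Col` class below is a hypothetical stand-in written for this example; it is not tidypolars' `col()` and makes no claim about how polars builds expressions internally. It only shows that a string is evaluated immediately by Python, while an expression object can record the operation and apply it to the data later.

```python
import operator


class Col:
    """Hypothetical stand-in for a column expression (not tidypolars' col())."""

    def __init__(self, name, fn=None):
        self.name = name
        # fn maps one row (a dict) to a value; by default it looks up the column.
        self.fn = fn or (lambda row: row[name])

    def _binary(self, other, op, symbol):
        # Accept either another expression or a plain Python value.
        other_fn = other.fn if isinstance(other, Col) else (lambda row: other)
        other_name = other.name if isinstance(other, Col) else repr(other)
        return Col(
            f"({self.name} {symbol} {other_name})",
            lambda row: op(self.fn(row), other_fn(row)),
        )

    def __mul__(self, other):
        return self._binary(other, operator.mul, "*")

    def __add__(self, other):
        return self._binary(other, operator.add, "+")

    def __lt__(self, other):
        return self._binary(other, operator.lt, "<")

    def evaluate(self, rows):
        # The recorded operation only runs here, once per row.
        return [self.fn(row) for row in rows]


rows = [{"x": 0, "y": 3}, {"x": 1, "y": 4}, {"x": 2, "y": 5}]

# A bare string is evaluated immediately and knows nothing about the data.
print("x" * 2)  # xx

# An expression object defers the work until it is given rows.
double_x = Col("x") * 2
print(double_x.name)            # (x * 2)
print(double_x.evaluate(rows))  # [0, 2, 4]
print((Col("x") + Col("y")).evaluate(rows))  # [3, 5, 7]
```

The values match the double_x and x_plus_y columns in the example above, before the rows are reordered by .arrange().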

Group by syntax

Methods operate by group when you pass the by argument.

  • A single column can be passed with by = 'z'
  • Multiple columns can be passed with by = ['y', 'z']

(
    df
    .summarize(avg_x = tp.mean(col('x')),
               by = 'z')
)
┌─────┬───────┐
│ z   ┆ avg_x │
│ --- ┆ ---   │
│ str ┆ f64   │
╞═════╪═══════╡
│ a   ┆ 0.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b   ┆ 2     │
└─────┴───────┘
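As a worked check of what the grouped summary computes, the following plain-Python sketch reproduces the averages by hand for the example data. The helper `grouped_mean` is hypothetical and exists only for this illustration; it does not use tidypolars or polars, and it simply mirrors the two forms of `by` described above (a single name or a list of names).

```python
from collections import defaultdict
from statistics import mean


def grouped_mean(rows, value, by):
    """Hypothetical helper: average `value` within each group defined by `by`."""
    # Accept a single column name or a list of names, like the by argument.
    by_cols = [by] if isinstance(by, str) else list(by)
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in by_cols)
        groups[key].append(row[value])
    return {key: mean(values) for key, values in groups.items()}


# Same data as the Tibble defined in the General syntax section.
rows = [
    {"x": 0, "y": 3, "z": "a"},
    {"x": 1, "y": 4, "z": "a"},
    {"x": 2, "y": 5, "z": "b"},
]

# Group "a" averages x values 0 and 1; group "b" holds only x = 2.
print(grouped_mean(rows, "x", by="z"))

# With two grouping columns every row is its own group in this data.
print(grouped_mean(rows, "x", by=["y", "z"]))
```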

Selecting/dropping columns

tidyselect functions can be mixed with normal selection when selecting columns:

df = tp.Tibble(x1 = range(3), x2 = range(3), y = range(3), z = range(3))

df.select(tp.starts_with('x'), 'z')
┌─────┬─────┬─────┐
│ x1  ┆ x2  ┆ z   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 2   ┆ 2   │
└─────┴─────┴─────┘

To drop columns use the .drop() method:

df.drop(tp.starts_with('x'), 'z')
┌─────┐
│ y   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 2   │
└─────┘
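The selection and drop results above can be reasoned about with a small plain-Python sketch. Both `starts_with` and `resolve_columns` below are hypothetical helpers written for this illustration, not the tidypolars functions of the same or similar name. The only detail borrowed from this page is the regex form `^x.*$`, which appears in an error message quoted in the Comments section.

```python
import re


def starts_with(prefix):
    """Hypothetical selector: a compiled regex matching names with this prefix."""
    return re.compile(f"^{re.escape(prefix)}.*$")


def resolve_columns(columns, *selectors):
    """Hypothetical helper: expand selectors into an ordered list of column names.

    A selector is either a plain column name or a compiled regex.
    """
    selected = []
    for selector in selectors:
        if isinstance(selector, re.Pattern):
            matches = [c for c in columns if selector.match(c)]
        elif selector in columns:
            matches = [selector]
        else:
            raise KeyError(f"column not found: {selector!r}")
        for name in matches:
            if name not in selected:
                selected.append(name)
    return selected


columns = ["x1", "x2", "y", "z"]

# Mixing a prefix selector with a plain name, as in the select example.
kept = resolve_columns(columns, starts_with("x"), "z")
print(kept)  # ['x1', 'x2', 'z']

# Dropping the same selection leaves the remaining columns.
dropped = [c for c in columns if c not in kept]
print(dropped)  # ['y']
```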

Converting to/from pandas data frames

If you need to use a package that requires pandas data frames, you can convert from a tidypolars Tibble to a pandas DataFrame.

To do this you'll first need to install pyarrow:

$ pip3 install pyarrow

To convert to a pandas DataFrame:

df = df.to_pandas()

To convert from a pandas DataFrame to a tidypolars Tibble:

df = tp.from_pandas(df)

Speed Comparisons

A few notes:

  • Comparing times across different functions typically isn't useful. For example, the .summarize() tests ran on a different dataset from the .pivot_wider() tests.
  • Each test is run 5 times. The times shown are the median of those 5 runs.
  • All timings are in milliseconds.
  • All tests can be found in the source code here.
  • FAQ: Why are some tidypolars functions faster than their polars counterparts?
    • Short answer: they're not! After all, they're just using polars in the background.
    • Long answer: all Python functions show slight natural variation in execution time. By chance, the tidypolars runs were slightly shorter for those specific functions in this iteration of the tests. One goal of these tests is to show that the "time cost" of translating syntax to polars is negligible to the user (especially on medium-to-large datasets).
  • Lastly, these tests were not rigorously designed to cover all angles equally. They are meant only as general insight into the performance of these packages.
┌─────────────┬────────────┬─────────┬──────────┐
│ func_testedtidypolarspolarspandas   │
│ ------------      │
│ strf64f64f64      │
╞═════════════╪════════════╪═════════╪══════════╡
│ arrange190.345169.478500.112  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ case_when87.34879.427152.623  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ distinct16.88816.28228.725   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ filter29.78929.91231.397  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ full_join236.784231.2831042.689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ inner_join49.7147.563630.98   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ left_join113.7921151100.607 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ mutate7.9797.408117.283  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ pivot_wider42.76439.93949.048   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ summarize59.43458.011453.707  │
└─────────────┴────────────┴─────────┴──────────┘
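The timing procedure described in the notes (5 runs per test, median reported, in milliseconds) can be sketched roughly as follows. This is an assumption about the general shape of such a benchmark, not the project's actual test code; `median_runtime_ms` and `workload` are hypothetical names, and the workload is a placeholder rather than a data frame operation.

```python
import statistics
import time


def median_runtime_ms(func, runs=5):
    """Hypothetical helper: run func several times, return the median time in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        # perf_counter returns seconds; convert to milliseconds.
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def workload():
    # Placeholder operation standing in for the function under test.
    return sorted(range(100_000), reverse=True)


elapsed = median_runtime_ms(workload)
print(f"median of 5 runs: {elapsed:.3f} ms")
```

Reporting the median rather than the mean keeps a single slow run from dominating the result. Run-to-run variation remains, which is consistent with the FAQ note that small differences between tidypolars and polars can fall either way.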

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

Comments
  • `drop` with error `RuntimeError: Any(NotFound("^x.*$"))`

    import sys
    import tidypolars as tp
    sys.version
    # '3.9.7 (default, Sep 16 2021, 13:09:58) \n[GCC 7.5.0]'
    tp.__version__
    # '0.2.1'
    ## error
    df = tp.Tibble(x1 = range(3), x2 = range(3), y=range(3), z = range(3))
    df.drop([tp.starts_with('x'), 'z'])
    df.drop()
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    /tmp/ipykernel_12815/866601321.py in <module>
    ----> 1 df.drop(tp.starts_with('x'))
    
    ~/miniconda3/envs/py39/lib/python3.9/site-packages/polars/eager/frame.py in drop(self, name)
       2253             return df
       2254 
    -> 2255         return wrap_df(self._df.drop(name))
       2256 
       2257     def drop_in_place(self, name: str) -> "pl.Series":
    
    RuntimeError: Any(NotFound("^x.*$"))
    
    
    
    opened by ztsweet 9
  • `AttributeError: arrange not found`

    import tidypolars as tp
    from tidypolars import col, desc
    import sys
    sys.version
    # '3.10.0 | packaged by conda-forge | (default, Oct 12 2021, 21:24:52) [GCC 9.4.0]'
    tp.__version__
    # '0.2.1'
    df = tp.Tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
    df.arrange('x', 'y')
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    ~/miniconda3/envs/py310/lib/python3.10/site-packages/polars/eager/frame.py in __getattr__(self, item)
        882         try:
    --> 883             return pl.eager.series.wrap_s(self._df.column(item))
        884         except RuntimeError:
    
    RuntimeError: Any(NotFound("arrange"))
    
    During handling of the above exception, another exception occurred:
    
    AttributeError                            Traceback (most recent call last)
    /tmp/ipykernel_21110/1194586334.py in <module>
    ----> 1 df.arrange('x', 'y')
    
    ~/miniconda3/envs/py310/lib/python3.10/site-packages/polars/eager/frame.py in __getattr__(self, item)
        883             return pl.eager.series.wrap_s(self._df.column(item))
        884         except RuntimeError:
    --> 885             raise AttributeError(f"{item} not found")
        886 
        887     def __iter__(self) -> Iterator[Any]:
    
    AttributeError: arrange not found
    
    bug 
    opened by ztsweet 6
  • Missing attributes when chaining

    Hi Mark, thanks for putting this package together. It looks very cool.

    I'm having a tough time getting the motivating examples to work, though. For example, the following triggers an error:

    import tidypolars as tp
    from tidypolars import col, desc
    
    df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])
    
    df.filter(col('x') < 2).arrange(desc('z'), 'x')
    
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Untitled-1 in <cell line: 1>()
    ----> 8 df.filter(col('x') < 2).arrange(desc('z'), 'x')
    
    AttributeError: 'DataFrame' object has no attribute 'arrange'
    

    What's genuinely odd about the above is that arrange works on its own and when it comes before filter.

    # All of these work as expected
    df.filter(col('x') < 2)
    df.arrange(desc('z'), 'x')
    df.arrange(desc('z'), 'x').filter(col('x') < 2)
    

    A seemingly related issue is that I can't pass two arguments to filter when it follows arrange (or other verbs like select, for that matter).

    df.filter(col('x') < 2, col('y')>3) ## works
    df.arrange(desc('z'), 'x').filter(col('x') < 2, col('y')>3) ## errors with "filter() takes 2 positional arguments but 3 were given"
    

    Any ideas?

    I'm on Python 3.9.2 installed via Homebrew on a 2019 Macbook (so regular Intel chip) and running the latest version of tidypolars (0.2.15).

    bug 
    opened by grantmcdermott 5
  • Is it possible to have dplyr's `group_by` + `mutate` behavior?

    First of all, I really like this package and I've started to use it a lot in my work. As a Pythonista whose first language is R, I really enjoy tidypolars.

    In R, we can do something like the following

    library(dplyr)
    data(iris)
    
    iris %>%
      group_by(Species) %>%
      mutate(
        result = Petal.Width - mean(Petal.Width)
      )
    

    Since we have a group_by(Species) call, dplyr will subtract the mean that corresponds to each group in the mutate() operation (not the mean across all observations from all species).

    As far as I understand, this is still not possible with tidypolars since we don't have a group_by function that behaves in a similar way to the one in dplyr. So my questions are

    • Is it possible to have this behavior in tidypolars now?
      • If yes, how?
      • If not, is it going to be possible? I could volunteer to try to implement it. I'm not familiar with the existing codebase, but I suspect that Python eager evaluation of function arguments is what makes it harder to have such a feature?

    Again, thanks for the fantastic library!

    opened by tomicapretto 5
  • idiomatic way to add list as column

    Forgive what's probably a dumb question, but is there a way to get .mutate to return the same object as the .bind_cols line?

    import tidypolars as tp
    
    tb = tp.Tibble({'a': [1, 2, 3]})
    x = [4, 5, 6]
    # gives desired output 
    tb.bind_cols(tp.Tibble({'b': x}))
    # gives error: ValueError: could not convert value '[4, 5, 6]' as a Literal
    tb.mutate(b = x)
    
    feature 
    opened by eutwt 4
  • purrr functions!?

    I noticed that in tidytable, you have purrr functions like map.(), but not in tidypolars.

    Using for loops and lambda functions is just not desirable for collaborative coding or for code readability and comprehension. In Python, even at a small cost in performance, anything that improves code readability would be really nice to have.

    Would something like map.() be in the scope of this repo?

    feature 
    opened by exsell-jc 3
  • `ValueError` with `filter`

    When I chain filter expressions with | (error message said to use | and not or), I receive a ValueError message:

    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(col('chr_col') == 'this is a test 1' |
                col('chr_col') == 'this is a test 2')
    
    ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. 
    Hint: use '&' or '|' to chain Expr together, not and/or.
    

    It works fine if I do one or the other:

    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(# col('chr_col') == 'this is a test 1' |
                col('chr_col') == 'this is a test 2')
    
    # chr_col
    #   --
    #   str
    # "this is a test 2"
    
    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(col('chr_col') == 'this is a test 1' # |
                # col('chr_col') == 'this is a test 2'
                )
    
    # chr_col
    #   --
    #   str
    # "this is a test 1"
    
    opened by alexandro-ag 3
  • `as_date` with `RuntimeError: please define a fmt`

    Good afternoon,

    I think I found an issue with the as_date method. In the example per the documentation, the following succeeds:

    import tidypolars as tp
    from tidypolars import col
    
    date_df = tp.Tibble(date = ['2021-12-31']) # Year-Month-Day (%Y-%m-%d)
    date_df.mutate(date_parsed = tp.as_date(col('date'))) # Success
    

    However when parsing different formats (using the fmt argument), the date fails to parse:

    import tidypolars as tp
    from tidypolars import col
    
    date_df = tp.Tibble(date = ['12/31/2021']) # Month/Day/Year (%m/%d/%Y)
    date_df.mutate(date_parsed = tp.as_date(col('date'), fmt='%m/%d/%Y')) # RuntimeError
    

    I also extend my appreciation for all the work on this package. I've been searching for a tidyverse implementation in python and this one knocks my expectations out of the park. Thank you.

    opened by alexandro-ag 3
  • Revisit `.rename()` syntax

    Should the syntax be the same as pl.DataFrame.rename? Currently polars mimics pandas syntax. Or should it be something that attempts to mimic tidyverse syntax?

    Note: polars also has a .rename_col() with syntax df.rename_col('old', 'new').

    opened by markfairbanks 3
  • Compatibility with polars v0.14.0

    PR that caused the break: https://github.com/pola-rs/polars/pull/4309

    Old behavior that tidypolars relied on: https://github.com/pola-rs/polars/pull/2862

    feature 
    opened by markfairbanks 2
  • Basics: tp.read_csv(), df.drop(x1, x2, x3, ...), and df.colnames?

    Really new to the library, but looking at the documentation did not really help with understanding.

    Problem 1

    import polars as pl
    import tidypolars as tp
    import csv
    import requests
    
    url = f'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-24/Scrumqueens-data-2022-05-23.csv'
    
    df = tp.read_csv(file = url) # does not work
    df = pl.read_csv(file = url) # works??
    

    Problem 2

    df = df.drop('...1', 'Notes') # does not work
    df = df.drop('...1') # works separately
    df = df.drop('Notes') # works separately
    

    Problem 3

    df.colnames
    df.names
    df.colnames()
    df.names()
    # None of these work
    

    What am I missing, exactly?

    opened by exsell-jc 2
  • plans for adding type hints

    Hi, it seems that the codebase is not annotated, which makes methods hard to discover and prevents static code analysis from working. Are there any plans to add type hints?

    feature 
    opened by mr-majkel 1
  • `write_csv()` returns `'super' object has no attribute 'to_csv'`

    Hi, There seems to be a problem with write_csv(). I can import tidypolars and the data just fine:

    import tidypolars as tp
    
    rents = tp.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv")
    

    But when I try to export the data frame as a csv file:

    rents.write_csv("rents.csv")
    

    I get an error stating 'super' object has no attribute 'to_csv'.

    The data come from the Tidytuesday repo. Python version is 3.10.8 and tidypolars is 0.2.19. I'm on macOS 13.

    bug 
    opened by alesvomacka 1
  • Calculating time

    In R with lubridate, it would look like this:

    one_year_before = some_date - years(1)
    one_year_before = some_date - months(12)
    

    But in tidypolars functions list, there doesn't seem to be a years or months function: https://tidypolars.readthedocs.io/en/latest/reference.html

    feature 
    opened by exsell-jc 4
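
A note on the `ValueError` with `filter` report above: the likely cause is Python operator precedence rather than anything specific to tidypolars. In Python, `|` binds more tightly than `==`, so an unparenthesized `a == 'one' | a == 'two'` is read as the chained comparison `a == ('one' | a) == 'two'`. A chained comparison asks for the truth value of its first half, which would explain the "truthiness of an Expr is ambiguous" message. The sketch below uses only the standard `ast` module to show how the two spellings parse; the variable name `a` is a placeholder for any column expression.

```python
import ast


def top_level_node(source):
    """Return the class name of the outermost AST node of an expression."""
    tree = ast.parse(source, mode="eval")
    return type(tree.body).__name__


unparenthesized = "a == 'one' | a == 'two'"
parenthesized = "(a == 'one') | (a == 'two')"

# Without parentheses the whole thing is one chained comparison.
print(top_level_node(unparenthesized))  # Compare

# With parentheses the outermost operation is the bitwise OR.
print(top_level_node(parenthesized))    # BinOp

# The chained comparison carries two == operators, with the | nested inside.
chained = ast.parse(unparenthesized, mode="eval").body
print(len(chained.ops))                         # 2
print(type(chained.comparators[0]).__name__)    # BinOp
```

Wrapping each comparison in its own parentheses before combining with `|` or `&` avoids the chained comparison.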
Releases(v0.2.19)