A simple and efficient tool to parallelize Pandas operations on all available CPUs

Overview

Pandaral·lel


(Demo animations: running a DataFrame apply without Pandarallel vs. with Pandarallel)

Installation

$ pip install pandarallel [--upgrade] [--user]

Requirements

On Windows, Pandaral·lel works only if the Python session (python, ipython, jupyter notebook, jupyter lab, ...) is executed from Windows Subsystem for Linux (WSL).

On Linux & macOS, nothing special has to be done.

Warning

  • Parallelization has a cost (instantiating new processes, sending data via shared memory, ...), so parallelization is efficient only if the amount of computation to parallelize is high enough. For small amounts of data, parallelization is not always worth it (see the sketch below).
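As a rough illustration of this trade-off, here is a minimal sketch (the heavy function and data sizes are invented for the example, not taken from the benchmark below):

import math
import time

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

# A deliberately CPU-heavy per-row function (hypothetical workload).
def heavy(row):
    return sum(math.sin(row.a * i) for i in range(1000))

df = pd.DataFrame({"a": range(10_000)})

start = time.perf_counter()
df.apply(heavy, axis=1)           # standard, sequential
sequential = time.perf_counter() - start

start = time.perf_counter()
df.parallel_apply(heavy, axis=1)  # parallel
parallel = time.perf_counter() - start

# For a tiny DataFrame and a cheap function, `parallel` would likely be
# slower than `sequential` because of the overhead described above.
print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")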

Examples

An example of each API is available here.

Benchmark

For a few examples, here is a comparative benchmark with and without Pandaral·lel.

Computer used for this benchmark:

  • OS: Linux Ubuntu 16.04
  • Hardware: Intel Core i7 @ 3.40 GHz - 4 cores

(Benchmark chart comparing standard and Pandaral·lel execution times)

For these examples, parallel operations run approximately 4x faster than the standard operations (except for series.map, which runs only about 3.2x faster).

API

First, you have to import pandarallel:

from pandarallel import pandarallel

Then, you have to initialize it.

pandarallel.initialize()

This method takes 5 optional parameters (a usage example follows the list):

  • shm_size_mb: Deprecated.
  • nb_workers: Number of workers used for parallelization. (int, all available CPUs by default)
  • progress_bar: Display progress bars if set to True. (bool, False by default)
  • verbose: The verbosity level (int, 2 by default)
    • 0 - don't display any logs
    • 1 - display only warning logs
    • 2 - display all logs
  • use_memory_fs: (bool, None by default)
    • If set to None and a memory file system is available, Pandarallel will use it to transfer data between the main process and workers. If no memory file system is available, Pandarallel will fall back to multiprocessing data transfer (pipe).
    • If set to True, Pandarallel will use memory file system to transfer data between the main process and workers and will raise a SystemError if memory file system is not available.
    • If set to False, Pandarallel will use multiprocessing data transfer (pipe) to transfer data between the main process and workers.
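For reference, a typical initialization combining these parameters could look like the following (the values shown are arbitrary examples, not recommended defaults):

from pandarallel import pandarallel

# 4 workers, progress bars on, default verbosity, and automatic detection
# of the memory file system (None lets Pandarallel decide).
pandarallel.initialize(
    nb_workers=4,
    progress_bar=True,
    verbose=2,
    use_memory_fs=None,
)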

Using a memory file system reduces data transfer time between the main process and workers, especially for big data.

The memory file system is considered available only if the directory /dev/shm exists and the user has read and write permissions on it.

In practice, a memory file system is only available on some Linux distributions (including Ubuntu).
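If you want to check the same criterion yourself, a minimal sketch could look like this (it mirrors the rule stated above; it is not Pandarallel's actual internal check):

import os

def memory_fs_available(path="/dev/shm"):
    # The directory must exist and the current user must have both read
    # and write permissions on it.
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK)

print(memory_fs_available())  # Usually True on Linux, False on macOS and Windows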

With df a pandas DataFrame, series a pandas Series, func a function to apply/map, args, args1, args2 some arguments, and col_name a column name:

Without parallelization → With parallelization

df.apply(func) → df.parallel_apply(func)
df.applymap(func) → df.parallel_applymap(func)
df.groupby(args).apply(func) → df.groupby(args).parallel_apply(func)
df.groupby(args1).col_name.rolling(args2).apply(func) → df.groupby(args1).col_name.rolling(args2).parallel_apply(func)
df.groupby(args1).col_name.expanding(args2).apply(func) → df.groupby(args1).col_name.expanding(args2).parallel_apply(func)
series.map(func) → series.parallel_map(func)
series.apply(func) → series.parallel_apply(func)
series.rolling(args).apply(func) → series.rolling(args).parallel_apply(func)

You will find a complete example here for each row in this table.
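As a quick end-to-end illustration of a few rows of this table, here is a self-contained sketch (the data and functions are invented for the example):

import math

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)

df = pd.DataFrame({"a": range(1, 1001), "b": range(1001, 2001)})

def func(x):
    return math.sin(x.a ** 2) + math.sin(x.b ** 2)

# DataFrame.apply -> DataFrame.parallel_apply
result = df.parallel_apply(func, axis=1)

# Series.map -> Series.parallel_map
squares = df["a"].parallel_map(lambda value: value ** 2)

# DataFrameGroupBy.apply -> DataFrameGroupBy.parallel_apply
group_sums = df.groupby(df["a"] % 3).parallel_apply(lambda g: g["b"].sum())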

Troubleshooting

I have 8 CPUs but parallel_apply speeds up computation only about x4. Why?

Pandarallel can only speed up computation up to roughly the number of physical cores your computer has. Most recent CPUs (such as the Intel Core i7) use hyper-threading: a 4-core hyper-threaded CPU presents 8 logical CPUs to the operating system, but really has only 4 physical computation units.

On Ubuntu, you can get the number of cores with $ grep -m 1 'cpu cores' /proc/cpuinfo.
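To compare the logical CPU count with the number of physical cores from Python, here is a small sketch (psutil is a third-party dependency used only for this check; it is not required by Pandarallel):

import multiprocessing

import psutil  # third-party: pip install psutil

print("Logical CPUs:  ", multiprocessing.cpu_count())      # e.g. 8
print("Physical cores:", psutil.cpu_count(logical=False))  # e.g. 4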


I use Jupyter Lab and, instead of progress bars, I see this kind of thing:
VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=625000), Label(value='0 / 625000')…

Run the following 3 lines, and you should be able to see the progress bars:

$ pip install ipywidgets
$ jupyter nbextension enable --py widgetsnbextension
$ jupyter labextension install @jupyter-widgets/jupyterlab-manager

(You may also have to install nodejs if asked)

Comments
  • Maximum size exceeded

    Hi, I set pandarallel.initialize(shm_size_mb=10000), and after applying parallel_apply to my column I get the error Maximum size exceeded (2GB).

    Why do I get this message when I set more than 2 GB?

    bug 
    opened by vvssttkk 23
  • Setting progress_bar=True freezes execution for parallel_apply before reaching 1% completion on all CPU's

    When progress_bar=True, I noticed that the execution of my parallel_apply task stopped right before all parallel processes reached 1% progress mark. Here are some further details of what I was encountering -

    • I turned on logging with DEBUG messages, but no messages were displayed when the execution stopped. There were no error messages either. The dataframe rows simply stopped processing further and the process seemed to be frozen.
    • I have two CPU's. It seems that the progress bar only updates in 1% increments. One of the progress bars reaches 1% mark, but when the number of processed rows reaches the 2% mark (which I assume is associated with the second progress bar updating to 1% as well), that's when the process froze.
    • The process runs fine with progress_bar=False.
    opened by abhineetgupta 22
  • pandarallel_apply crashes with OverflowError: int too big to convert

    Hi everyone,

    I am getting this error here using parallel_apply in pandas:

      File "extract_specifications.py", line 156, in <module>
        extracted_data = df.parallel_apply(extract_raw_infos, axis=1)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 367, in closure
        kwargs,
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 239, in get_workers_args
        zip(input_files, output_files, chunk_lengths)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 238, in <listcomp>
        for index, (input_file, output_file, chunk_length) in enumerate(
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 169, in wrapper
        time=time,
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 464, in inline
        func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in shift_instructions
        for instruction in instructions
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in <genexpr>
        for instruction in instructions
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 293, in shift_instruction
        return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 71, in int2python_bytes
        return int.to_bytes(item, nb_bytes, "little")
    OverflowError: int too big to convert
    

    I am using

    pandarallel == 1.4.2
    pandas == 0.24.2
    python == 3.6.9
    

    Any idea how to proceed from here? I have basically no idea what could cause this bug. I suspect it might be related to the size of the data I have in one column (I store HTML from web pages in there), but otherwise no idea. I would be happy to help remove this bug if I had some guidance. Thanks for helping.

    opened by yeus 21
  • Fails with "_wrap_applied_output() missing 1 required positional argument" where a simple pandas apply succeeds

    Hello,

    I'm using python 3.8.10 (anaconda distribution, GCC 7.5.10) in Ubuntu LTS 20 64bits x86

    From my pip freeze:

    pandarallel 1.5.2
    pandas 1.3.0
    numpy 1.20.3

    I'm working with a dataFrame that looks like this one:

    HoleID scaffold tpl strand base score tMean tErr modelPrediction ipdRatio coverage isboundary identificationQv context experiment isbegin_bondary isend_boundary isin_IES uniqueID No_known_IES_retention_this_CCS detailed_classif
    1025444 70189477 scaffold_024_with_IES 688203 0 T 2 0.517 0.190 0.555 0.931 11 True NaN TTAAATAGAAATTAAAATCAGCTGC NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025446 70189477 scaffold_024_with_IES 688204 0 A 4 1.347 0.367 1.251 1.077 13 True NaN TAAATAGAAATTAAAATCAGCTGCT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025448 70189477 scaffold_024_with_IES 688205 0 A 5 1.913 0.779 1.464 1.307 16 True NaN AAATAGAAATTAAAATCAGCTGCTT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025450 70189477 scaffold_024_with_IES 688206 0 A 4 1.535 0.712 1.328 1.156 18 True NaN AATAGAAATTAAAATCAGCTGCTTA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025452 70189477 scaffold_024_with_IES 688207 0 A 5 1.655 0.565 1.391 1.190 18 True NaN ATAGAAATTAAAATCAGCTGCTTAA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES

    I defined the following function

    def get_distance_from_nearest_criteria(df,criteria):
        begins = df[df[criteria]].copy()
        
        if len(begins) == 0:
            return pd.Series([np.nan for x in range(len(df))])
        else:
            list_return = []
    
            for idx, nt in df.iterrows():
                distances = [abs(nt["tpl"] - x) for x in begins["tpl"]]
                mindistance = min(distances,default=np.nan)
                list_return.append(mindistance)
    
            return pd.Series(list_return)
    

    Then using :

    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=False, nb_workers=12)
    out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))
    

    leads to :

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-49-02fc7c0589e3> in <module>
    ----> 1 out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))
    
    ~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
        463             )
        464 
    --> 465             return reduce(results, reduce_meta_args)
        466 
        467         finally:
    
    ~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/data_types/dataframe_groupby.py in reduce(results, df_grouped)
         14         keys, values, mutated = zip(*results)
         15         mutated = any(mutated)
    ---> 16         return df_grouped._wrap_applied_output(
         17             keys, values, not_indexed_same=df_grouped.mutated or mutated
         18         )
    
    TypeError: _wrap_applied_output() missing 1 required positional argument: 'values'
    
    

    For me, the error is not clear enough (I can't tell what's happening)

    However, when I run it with a simple pandas apply, it works:

    uniqueID           
    HT2_10354935    0      297.0
                    1      297.0
                    2      296.0
                    3      296.0
                    4      295.0
                           ...  
    NM9_10_9568952  502      NaN
                    503      NaN
                    504      NaN
                    505      NaN
                    506      NaN
    Length: 1028437, dtype: float64
    

    I'm running all of this in a jupyter notebook

    ipykernel 5.3.4
    ipython 7.22.0
    ipython-genutils 0.2.0
    notebook 6.4.0
    jupyter 1.0.0
    jupyter-client 6.1.12
    jupyter-console 6.4.0
    jupyter-core 4.7.1
    jupyter-dash 0.4.0
    jupyterlab-pygments 0.1.2
    jupyterlab-widgets 1.0.0

    I was wondering if someone could explain to me what's happening, and how to fix it if the error is mine. Because it works out of the box with a simple pandas apply, I suppose there is a small problem in pandarallel.

    NB: Note also that this code leaves unkilled processes even after I interrupt or restart the IPython kernel. EDIT: Could it be linked to the fact that I'm using a lambda function?

    opened by GDelevoye 18
  • Add `parallel_apply` for `Resampler` class

    I implemented parallel_apply for the Resampler class to add some important time series functionality. For now it still uses the default _chunk method, but that can lead to some processes terminating much more quickly than others, e.g. if the time series gets denser over time. A potential upgrade would be to randomly sample the contents of the chunks, so each chunk gets a similar distribution of workloads.

    P.S.: I noticed that 30/188 of the tests fail due to a ZeroDivisionError. This is unrelated to my pull request, but an important issue.
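    Assuming the new method mirrors the naming of the existing API, usage would presumably look like the following (a hypothetical sketch; the data and frequency are invented and the call is not taken from the pull request itself):

    import pandas as pd
    from pandarallel import pandarallel

    pandarallel.initialize()

    # Hypothetical minute-level time series, resampled to daily buckets.
    index = pd.date_range("2022-01-01", periods=10_000, freq="min")
    ts = pd.Series(range(10_000), index=index)

    # Assumed to mirror Resampler.apply, per the pull request title.
    daily_means = ts.resample("1D").parallel_apply(lambda chunk: chunk.mean())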

    opened by alvail 17
  • Connection to IPC socket failed for pathname

    Hello, may I ask why pandarallel.initialize() produces the following warning? Thank you:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    E0812 19:11:57.484051 2409853824 io.cc:168] Connection to IPC socket failed for pathname /var/folders/sp/vz74h1tx3jlb3jqrq__bjwh00000gp/T/pandarallel-32ts0h6r/plasma_sock, retrying 20 more times

    Please help me, thanks.

    opened by lsircc 10
  • ZeroDivisionError: float division by zero

    General

    • Operating System: windows 11
    • Python version: 3.8.3
    • Pandas version: 1.4.2
    • Pandarallel version: 1.6.1

    Acknowledgement

    • [ ] My issue is NOT present when using pandas alone (without pandarallel)
    • [ ] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    WARNING: You are on Windows. If you detect any issue with pandarallel, be sure you checked out the Troubleshooting page:

    0.99% | 46 / 4628 | 0.00% | 0 / 4627 |

    multiprocessing.pool.RemoteTraceback:
    """
    Traceback (most recent call last):
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\multiprocessing\pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\multiprocessing\pool.py", line 51, in starmapstar
        return list(itertools.starmap(args[0], args[1]))
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\core.py", line 158, in __call__
        results = self.work_function(
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\data_types\series.py", line 26, in work
        return data.apply(
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\series.py", line 4433, in apply
        return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\apply.py", line 1082, in apply
        return self.apply_standard()
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\apply.py", line 1137, in apply_standard
        mapped = lib.map_infer(
      File "pandas\_libs\lib.pyx", line 2870, in pandas._libs.lib.map_infer
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\progress_bars.py", line 206, in closure
        state.next_put_iteration += max(int((delta_i / delta_t) * 0.25), 1)
    ZeroDivisionError: float division by zero

    Observed behavior

    Write here the observed behavior

    Expected behavior

    Write here the expected behavior

    Minimal but working code sample to ease bug fix for pandarallel team

    Write here the minimal code sample to ease bug fix for pandarallel team

    bug 
    opened by heya5 8
  • OverflowError

    Using Python 3.8, I am getting an OverflowError running parallel_apply:

    OverflowError                             Traceback (most recent call last)
    <ipython-input-5-a78fd5119887> in <module>
         37     grouped_df = df.groupby("id")
         38 
    ---> 39     grouped_df.parallel_apply(lookahead) \
         40         .to_parquet(output_location_look_ahead, compression='snappy', engine='pyarrow')
         41 
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
        429         queue = manager.Queue()
        430 
    --> 431         workers_args, chunk_lengths, input_files, output_files = get_workers_args(
        432             use_memory_fs,
        433             nb_requested_workers,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in get_workers_args(use_memory_fs, nb_workers, progress_bar, chunks, worker_meta_args, queue, func, args, kwargs)
        284             raise OSError(msg)
        285 
    --> 286         workers_args = [
        287             (
        288                 input_file.name,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in <listcomp>(.0)
        293                 progress_bar == PROGRESS_IN_WORKER,
        294                 dill.dumps(
    --> 295                     progress_wrapper(
        296                         progress_bar >= PROGRESS_IN_FUNC, queue, index, chunk_length
        297                     )(func)
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in wrapper(func)
        203     def wrapper(func):
        204         if progress_bar:
    --> 205             wrapped_func = inline(
        206                 progress_pre_func,
        207                 func,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in inline(pre_func, func, pre_func_arguments)
        485 
        486     func_instructions = tuple(get_instructions(func))
    --> 487     shifted_func_instructions = shift_instructions(
        488         func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
        489     )
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in shift_instructions(instructions, qty)
        301     If Python version not in 3.{5, 6, 7}, a SystemError is raised.
        302     """
    --> 303     return tuple(
        304         shift_instruction(instruction, qty)
        305         if bytes((instruction[0],))
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in <genexpr>(.0)
        302     """
        303     return tuple(
    --> 304         shift_instruction(instruction, qty)
        305         if bytes((instruction[0],))
        306         in (
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in shift_instruction(instruction, qty)
        291     """
        292     operation, *values = instruction
    --> 293     return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
        294 
        295 
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in int2python_bytes(item)
         69 
         70     nb_bytes = 2 if python_version.minor == 5 else 1
    ---> 71     return int.to_bytes(item, nb_bytes, "little")
         72 
         73 
    
    OverflowError: int too big to convert
    
    opened by lfdversluis 7
  • TypeError: 'generator' object is not subscriptable error in colab - works in VScode

    General

    • Operating System: OSX
    • Python version: 3.7.13
    • Pandas version: 1.3.5
    • Pandarallel version: 1.6.2

    Acknowledgement

    • Issue happens on Colab only. When I use VScode, the problem does not happen

    Bug description

    I get the error:

    100.00%
    1 / 1
    100.00%
    1 / 1
    100.00%
    1 / 1
    100.00%
    1 / 1
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-16-b9ca1a6007f2> in <module>()
          1 data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
          2 df = pd.DataFrame(data)
    ----> 3 df['HalfAge'] = df.parallel_apply(lambda r: r.Age/2,axis=1)
    
    2 frames
    /usr/local/lib/python3.7/dist-packages/pandarallel/core.py in closure(data, user_defined_function, *user_defined_function_args, **user_defined_function_kwargs)
        324             return wrapped_reduce_function(
        325                 (Path(output_file.name) for output_file in output_files),
    --> 326                 reduce_extra,
        327             )
        328 
    
    /usr/local/lib/python3.7/dist-packages/pandarallel/core.py in closure(output_file_paths, extra)
        197         )
        198 
    --> 199         return reduce_function(dfs, extra)
        200 
        201     return closure
    
    /usr/local/lib/python3.7/dist-packages/pandarallel/data_types/dataframe.py in reduce(datas, extra)
         45             datas: Iterable[pd.DataFrame], extra: Dict[str, Any]
         46         ) -> pd.DataFrame:
    ---> 47             axis = 0 if isinstance(datas[0], pd.Series) else 1 - extra["axis"]
         48             return pd.concat(datas, copy=False, axis=axis)
         49 
    
    TypeError: 'generator' object is not subscriptable
    

    Minimal but working code sample to ease bug fix for pandarallel team

    !pip install pandarallel
    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=True)
    
    import pandas as pd
    
    data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
    df = pd.DataFrame(data)
    df['HalfAge'] = df.parallel_apply(lambda r: r.Age/2,axis=1)
    
    opened by agiveon 6
  • Library not working....

    Hello everyone!

    While trying to run the example cases, I'm receiving the following error:

    Code

    import math
    import numpy as np
    import pandas as pd
    from pandarallel import pandarallel

    pandarallel.initialize()
    
    def func(x):
        return math.sin(x.a**2) + math.sin(x.b**2)
    
    if __name__ == '__main__':
        df_size = int(5e6)
        df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                               b=np.random.rand(df_size)))
        res_parallel = df.parallel_apply(func, axis=1, progress_bar=True)
    

    Error

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <timed exec> in <module>
    
    ~\.conda\envs\ingestion-env\lib\site-packages\pandarallel\pandarallel.py in closure(data, func, *args, **kwargs)
        434         try:
        435             pool = Pool(
    --> 436                 nb_workers, worker_init, (prepare_worker(use_memory_fs)(worker),)
        437             )
        438 
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\context.py in Pool(self, processes, initializer, initargs, maxtasksperchild)
        117         from .pool import Pool
        118         return Pool(processes, initializer, initargs, maxtasksperchild,
    --> 119                     context=self.get_context())
        120 
        121     def RawValue(self, typecode_or_type, *args):
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
        174         self._processes = processes
        175         self._pool = []
    --> 176         self._repopulate_pool()
        177 
        178         self._worker_handler = threading.Thread(
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\pool.py in _repopulate_pool(self)
        239             w.name = w.name.replace('Process', 'PoolWorker')
        240             w.daemon = True
    --> 241             w.start()
        242             util.debug('added worker')
        243 
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\process.py in start(self)
        110                'daemonic processes are not allowed to have children'
        111         _cleanup()
    --> 112         self._popen = self._Popen(self)
        113         self._sentinel = self._popen.sentinel
        114         # Avoid a refcycle if the target function holds an indirect
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\context.py in _Popen(process_obj)
        320         def _Popen(process_obj):
        321             from .popen_spawn_win32 import Popen
    --> 322             return Popen(process_obj)
        323 
        324     class SpawnContext(BaseContext):
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
         87             try:
         88                 reduction.dump(prep_data, to_child)
    ---> 89                 reduction.dump(process_obj, to_child)
         90             finally:
         91                 set_spawning_popen(None)
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
         58 def dump(obj, file, protocol=None):
         59     '''Replacement for pickle.dump() using ForkingPickler.'''
    ---> 60     ForkingPickler(file, protocol).dump(obj)
         61 
         62 #
    
    AttributeError: Can't pickle local object 'prepare_worker.<locals>.closure.<locals>.wrapper'
    

    I'm working on a Jupyter Lab notebook with a conda environment: What can I do?

    opened by murilobellatini 6
  • IndexError when there are fewer DataFrame rows than workers

    When the number of rows is below the number of workers an IndexError is raised. Minimal example:

    Code

    import time
    import pandas as pd
    from pandarallel import pandarallel
    
    pandarallel.initialize(progress_bar=True)
    
    df = pd.DataFrame({'x':[1,2]})
    df.parallel_apply(lambda row: print('A'), time.sleep(2), print('B'), axis=1)
    

    Output

    INFO: Pandarallel will run on 6 workers.
    INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
    B
       0.00%                                          |        0 /        1 |                                                                                                                    
       0.00%                                          |        0 /        1 |                                                                                                                    Traceback (most recent call last):
      File "foo.py", line 8, in <module>
        df.parallel_apply(lambda row: print('A'), time.sleep(2), print('B'), axis=1)
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 446, in closure
        map_result,
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 382, in get_workers_result
        progress_bars.update(progresses)
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/utils/progress_bars.py", line 82, in update
        self.__bars[index][0] = value
    IndexError: list index out of range
    

    I'm using python version 3.7.4 with pandas 0.25.3 and pandarallel 1.4.4.

    bug 
    opened by elemakil 6
  • __main__ imposition in code

    Make sure that a novice who tries to use the module is aware that the calling code should be encapsulated within the main execution context, much like the usual requirements of multiprocessing in general. This will reduce wasted time and unnecessary digging into multiprocessing. A sketch of the pattern follows below.
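    A minimal sketch of the guard being suggested (standard multiprocessing practice; the example DataFrame and function are invented, and this is not code from pandarallel itself):

    import pandas as pd
    from pandarallel import pandarallel

    def main():
        pandarallel.initialize()
        df = pd.DataFrame({"x": range(1000)})
        print(df.parallel_apply(lambda row: row.x * 2, axis=1).head())

    if __name__ == "__main__":
        # Guard the entry point so worker processes spawned by multiprocessing
        # do not re-execute the module's top-level code (required with the
        # "spawn" start method used on Windows and macOS).
        main()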

    opened by karunakar2 4
  • how to use it in fastapi

    I have a server like:

    from fastapi import FastAPI
    from pandarallel import pandarallel
    pandarallel.initialize()
    app = FastAPI()
    
    @app.post("/area_quant")
    def create_item(event_data: Areadata):
        data = pd.DataFrame(event_data.data)
        data['type_score'] = data['EVENT_TYPE'].applymap(Config.type_map)
    
    if __name__ == '__main__':
        uvicorn.run(app)
    

    As a server, it can only run one time, then it will shut down. How can I use it in a server?

    opened by lim-0 0
  • 'functools.partial' object has no attribute '__code__' in Jupyter Notebooks

    General

    • Operating System: Ubuntu 20.04
    • Python version: 3.9.13
    • Pandas version: 1.4.4
    • Pandarallel version: 1.6.3

    Acknowledgement

    • [x] My issue is NOT present when using pandas alone (without pandarallel)
    • [x] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    functools.partial cannot be used in conjunction with Jupyter notebooks and pandarallel

    Observed behavior

    Write here the observed behavior

    Expected behavior

    Write here the expected behavior

    Minimal but working code sample to ease bug fix for pandarallel team

    import pandas as pd
    from functools import partial
    from typing import List
    import numpy as np
    from pandarallel import pandarallel
    
    pandarallel.initialize(progress_bar=True)
    
    def processing_fn(row: pd.Series, invalid_numbers: List[int], default: int) -> pd.Series:
        cond = row.isin(invalid_numbers)
        row[cond] = default
        return row
    
    data = pd.DataFrame(np.random.randint(low=-10, high=10, size=(10000, 5)))
    print("Before", (data.values == 100).sum())
    
    fn = partial(processing_fn, invalid_numbers=[-5, 2, 5], default=100)
    new_data = data.apply(fn, axis=1)
    
    print("After serial", (new_data.values == 100).sum())
    
    data = data.parallel_apply(fn, axis=1)
    print("After parallel", (data.values == 100).sum())
    
    

    Works fine in a standalone script, but fails if run in a Jupyter notebook.

    opened by Meehai 0
  • Choose which type of progress bar you want in a notebook

    Sometimes one doesn't have control over the environment of the Jupyter instance where one is working (think JupyterHub) and can't install the necessary extensions for progress bars. In this case it would be nice to have the option to manually request the simple progress bar, so that it doesn't just display a widget error.

    Is there already a way to do so that I've missed? Otherwise, if it's something you'd consider adding, I'd also be happy to try and draft a PR.

    opened by astrojarred 3
  • TypeError: cannot pickle 'sqlite3.Connection' object in pyCharm

    General

    • Operating System: Ubuntu 22.04
    • Python version: 3.9.7
    • Pandas version: 1.4.2
    • Pandarallel version: 1.6.3

    Acknowledgement

    • [x] My issue is NOT present when using pandas alone (without pandarallel)
    • [ ] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    I observe this when running parallel_apply in pyCharm. #76 sounds similar to me, but none of the tricks suggested there work for me. At this point I am also not sure if it's more of an issue with pyCharm.

    Observed behavior

    Traceback (most recent call last):
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-39-d230d86ff5ef>", line 1, in <module>
        iris.groupby("species").parallel_apply(lambda x: np.mean(x))
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/pandarallel/core.py", line 265, in closure
        dilled_user_defined_function = dill.dumps(user_defined_function)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 364, in dumps
        dump(obj, file, protocol, byref, fmode, recurse, **kwds)#, strictio)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 336, in dump
        Pickler(file, protocol, **_kwds).dump(obj)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 620, in dump
        StockPickler.dump(self, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 487, in dump
        self.save(obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1963, in save_function
        _save_with_postproc(pickler, (_create_function, (
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1154, in _save_with_postproc
        pickler._batch_setitems(iter(source.items()))
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 692, in save_reduce
        save(args)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 886, in save_tuple
        save(element)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 717, in save_reduce
        save(state)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 717, in save_reduce
        save(state)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 578, in save
        rv = reduce(self.proto)
    TypeError: cannot pickle 'sqlite3.Connection' object
    

    Minimal but working code sample to ease bug fix for pandarallel team

    import pandas as pd
    import numpy as np
    from pandarallel import pandarallel
    pandarallel.initialize(use_memory_fs=False)
    iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    iris.groupby("species").parallel_apply(lambda x: np.mean(x))
    
    opened by wiessall 1
Releases

  • v1.6.3

Owner

Manu NALEPA
Data Scientist / Data Engineer @ Clustree — Sousaphone & saxophone player