A simple and efficient tool to parallelize Pandas operations on all available CPUs

Overview

Pandaral·lel


(Demo animations: running a DataFrame apply without Pandarallel vs. with Pandarallel)

Installation

$ pip install pandarallel [--upgrade] [--user]

Requirements

On Windows, Pandaral·lel works only if the Python session (python, ipython, jupyter notebook, jupyter lab, ...) is executed from Windows Subsystem for Linux (WSL).

On Linux & macOS, nothing special has to be done.

Warning

  • Parallelization has a cost (instantiating new processes, sending data via shared memory, ...), so parallelization is efficient only if the amount of computation to parallelize is high enough. For small amounts of data, parallelization is not always worth it (see the sketch below).
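As a rough illustration of this trade-off, here is a minimal sketch (the heavy function and data sizes are invented for the example, not taken from the benchmark below):

import math
import time

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

# A deliberately CPU-heavy per-row function (hypothetical workload).
def heavy(row):
    return sum(math.sin(row.a * i) for i in range(1000))

df = pd.DataFrame({"a": range(10_000)})

start = time.perf_counter()
df.apply(heavy, axis=1)           # standard, sequential
sequential = time.perf_counter() - start

start = time.perf_counter()
df.parallel_apply(heavy, axis=1)  # parallel
parallel = time.perf_counter() - start

# For a tiny DataFrame and a cheap function, `parallel` would likely be
# slower than `sequential` because of the overhead described above.
print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")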

Examples

An example of each API is available here.

Benchmark

For a few examples, here is a comparative benchmark with and without Pandaral·lel.

Computer used for this benchmark:

  • OS: Linux Ubuntu 16.04
  • Hardware: Intel Core i7 @ 3.40 GHz - 4 cores

(Benchmark chart comparing standard and Pandaral·lel execution times)

For these examples, parallel operations run approximately 4x faster than the standard operations (except for series.map, which runs only about 3.2x faster).

API

First, you have to import pandarallel:

from pandarallel import pandarallel

Then, you have to initialize it.

pandarallel.initialize()

This method takes 5 optional parameters (a usage example follows the list):

  • shm_size_mb: Deprecated.
  • nb_workers: Number of workers used for parallelization. (int, all available CPUs by default)
  • progress_bar: Display progress bars if set to True. (bool, False by default)
  • verbose: The verbosity level (int, 2 by default)
    • 0 - don't display any logs
    • 1 - display only warning logs
    • 2 - display all logs
  • use_memory_fs: (bool, None by default)
    • If set to None and a memory file system is available, Pandarallel will use it to transfer data between the main process and workers. If no memory file system is available, Pandarallel will fall back to multiprocessing data transfer (pipe).
    • If set to True, Pandarallel will use memory file system to transfer data between the main process and workers and will raise a SystemError if memory file system is not available.
    • If set to False, Pandarallel will use multiprocessing data transfer (pipe) to transfer data between the main process and workers.
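For reference, a typical initialization combining these parameters could look like the following (the values shown are arbitrary examples, not recommended defaults):

from pandarallel import pandarallel

# 4 workers, progress bars on, default verbosity, and automatic detection
# of the memory file system (None lets Pandarallel decide).
pandarallel.initialize(
    nb_workers=4,
    progress_bar=True,
    verbose=2,
    use_memory_fs=None,
)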

Using a memory file system reduces data transfer time between the main process and workers, especially for big data.

The memory file system is considered available only if the directory /dev/shm exists and the user has read and write permissions on it.

In practice, a memory file system is only available on some Linux distributions (including Ubuntu).
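If you want to check the same criterion yourself, a minimal sketch could look like this (it mirrors the rule stated above; it is not Pandarallel's actual internal check):

import os

def memory_fs_available(path="/dev/shm"):
    # The directory must exist and the current user must have both read
    # and write permissions on it.
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK)

print(memory_fs_available())  # Usually True on Linux, False on macOS and Windows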

With df a pandas DataFrame, series a pandas Series, func a function to apply/map, args, args1, args2 some arguments, and col_name a column name:

Without parallelization → With parallelization

df.apply(func) → df.parallel_apply(func)
df.applymap(func) → df.parallel_applymap(func)
df.groupby(args).apply(func) → df.groupby(args).parallel_apply(func)
df.groupby(args1).col_name.rolling(args2).apply(func) → df.groupby(args1).col_name.rolling(args2).parallel_apply(func)
df.groupby(args1).col_name.expanding(args2).apply(func) → df.groupby(args1).col_name.expanding(args2).parallel_apply(func)
series.map(func) → series.parallel_map(func)
series.apply(func) → series.parallel_apply(func)
series.rolling(args).apply(func) → series.rolling(args).parallel_apply(func)

You will find a complete example here for each row in this table.
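As a quick end-to-end illustration of a few rows of this table, here is a self-contained sketch (the data and functions are invented for the example):

import math

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)

df = pd.DataFrame({"a": range(1, 1001), "b": range(1001, 2001)})

def func(x):
    return math.sin(x.a ** 2) + math.sin(x.b ** 2)

# DataFrame.apply -> DataFrame.parallel_apply
result = df.parallel_apply(func, axis=1)

# Series.map -> Series.parallel_map
squares = df["a"].parallel_map(lambda value: value ** 2)

# DataFrameGroupBy.apply -> DataFrameGroupBy.parallel_apply
group_sums = df.groupby(df["a"] % 3).parallel_apply(lambda g: g["b"].sum())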

Troubleshooting

I have 8 CPUs but parallel_apply speeds up computation only about x4. Why?

Pandarallel can only speed up computation up to roughly the number of physical cores your computer has. Most recent CPUs (such as the Intel Core i7) use hyper-threading: a 4-core hyper-threaded CPU presents 8 logical CPUs to the operating system, but really has only 4 physical computation units.

On Ubuntu, you can get the number of cores with $ grep -m 1 'cpu cores' /proc/cpuinfo.
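To compare the logical CPU count with the number of physical cores from Python, here is a small sketch (psutil is a third-party dependency used only for this check; it is not required by Pandarallel):

import multiprocessing

import psutil  # third-party: pip install psutil

print("Logical CPUs:  ", multiprocessing.cpu_count())      # e.g. 8
print("Physical cores:", psutil.cpu_count(logical=False))  # e.g. 4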


I use Jupyter Lab and, instead of progress bars, I see this kind of thing:
VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=625000), Label(value='0 / 625000')…

Run the following 3 lines, and you should be able to see the progress bars:

$ pip install ipywidgets
$ jupyter nbextension enable --py widgetsnbextension
$ jupyter labextension install @jupyter-widgets/jupyterlab-manager

(You may also have to install nodejs if asked)

Comments
  • Maximum size exceeded

    Hi, I set pandarallel.initialize(shm_size_mb=10000), and after applying parallel_apply to my column I get the error Maximum size exceeded (2GB).

    Why do I get this message when I set more than 2 GB?

    bug 
    opened by vvssttkk 23
  • Setting progress_bar=True freezes execution for parallel_apply before reaching 1% completion on all CPU's

    When progress_bar=True, I noticed that the execution of my parallel_apply task stopped right before all parallel processes reached 1% progress mark. Here are some further details of what I was encountering -

    • I turned on logging with DEBUG messages, but no messages were displayed when the execution stopped. There were no error messages either. The dataframe rows simply stopped processing further and the process seemed to be frozen.
    • I have two CPU's. It seems that the progress bar only updates in 1% increments. One of the progress bars reaches 1% mark, but when the number of processed rows reaches the 2% mark (which I assume is associated with the second progress bar updating to 1% as well), that's when the process froze.
    • The process runs fine with progress_bar=False.
    opened by abhineetgupta 22
  • pandarallel_apply crashes with OverflowError: int too big to convert

    Hi everyone,

    I am getting this error here using parallel_apply in pandas:

      File "extract_specifications.py", line 156, in <module>
        extracted_data = df.parallel_apply(extract_raw_infos, axis=1)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 367, in closure
        kwargs,
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 239, in get_workers_args
        zip(input_files, output_files, chunk_lengths)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 238, in <listcomp>
        for index, (input_file, output_file, chunk_length) in enumerate(
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 169, in wrapper
        time=time,
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 464, in inline
        func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in shift_instructions
        for instruction in instructions
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in <genexpr>
        for instruction in instructions
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 293, in shift_instruction
        return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 71, in int2python_bytes
        return int.to_bytes(item, nb_bytes, "little")
    OverflowError: int too big to convert
    

    I am using

    pandarallel == 1.4.2
    pandas == 0.24.2
    python == 3.6.9
    

    Any idea how to proceed from here? I have basically no idea what could cause this bug. I suspect it might be related to the size of the data I have in one column (I store HTML from web pages in there), but otherwise no idea. I would be happy to help remove this bug if I had some guidance. Thanks for helping.

    opened by yeus 21
  • Fails with "_wrap_applied_output() missing 1 required positional argument" where a simple pandas apply succeeds

    Hello,

    I'm using python 3.8.10 (anaconda distribution, GCC 7.5.10) in Ubuntu LTS 20 64bits x86

    From my pip freeze:

    pandarallel 1.5.2
    pandas 1.3.0
    numpy 1.20.3

    I'm working with a dataFrame that looks like this one:

    HoleID scaffold tpl strand base score tMean tErr modelPrediction ipdRatio coverage isboundary identificationQv context experiment isbegin_bondary isend_boundary isin_IES uniqueID No_known_IES_retention_this_CCS detailed_classif
    1025444 70189477 scaffold_024_with_IES 688203 0 T 2 0.517 0.190 0.555 0.931 11 True NaN TTAAATAGAAATTAAAATCAGCTGC NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025446 70189477 scaffold_024_with_IES 688204 0 A 4 1.347 0.367 1.251 1.077 13 True NaN TAAATAGAAATTAAAATCAGCTGCT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025448 70189477 scaffold_024_with_IES 688205 0 A 5 1.913 0.779 1.464 1.307 16 True NaN AAATAGAAATTAAAATCAGCTGCTT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025450 70189477 scaffold_024_with_IES 688206 0 A 4 1.535 0.712 1.328 1.156 18 True NaN AATAGAAATTAAAATCAGCTGCTTA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025452 70189477 scaffold_024_with_IES 688207 0 A 5 1.655 0.565 1.391 1.190 18 True NaN ATAGAAATTAAAATCAGCTGCTTAA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES

    I defined the following function

    def get_distance_from_nearest_criteria(df,criteria):
        begins = df[df[criteria]].copy()
        
        if len(begins) == 0:
            return pd.Series([np.nan for x in range(len(df))])
        else:
            list_return = []
    
            for idx, nt in df.iterrows():
                distances = [abs(nt["tpl"] - x) for x in begins["tpl"]]
                mindistance = min(distances,default=np.nan)
                list_return.append(mindistance)
    
            return pd.Series(list_return)
    

    Then using :

    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=False, nb_workers=12)
    out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))
    

    leads to :

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-49-02fc7c0589e3> in <module>
    ----> 1 out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))
    
    ~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
        463             )
        464 
    --> 465             return reduce(results, reduce_meta_args)
        466 
        467         finally:
    
    ~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/data_types/dataframe_groupby.py in reduce(results, df_grouped)
         14         keys, values, mutated = zip(*results)
         15         mutated = any(mutated)
    ---> 16         return df_grouped._wrap_applied_output(
         17             keys, values, not_indexed_same=df_grouped.mutated or mutated
         18         )
    
    TypeError: _wrap_applied_output() missing 1 required positional argument: 'values'
    
    

    For me, the error is not clear enough (I can't tell what's happening)

    However, when I run it with a simple pandas apply, it works:

    uniqueID           
    HT2_10354935    0      297.0
                    1      297.0
                    2      296.0
                    3      296.0
                    4      295.0
                           ...  
    NM9_10_9568952  502      NaN
                    503      NaN
                    504      NaN
                    505      NaN
                    506      NaN
    Length: 1028437, dtype: float64
    

    I'm running all of this in a jupyter notebook

    ipykernel 5.3.4
    ipython 7.22.0
    ipython-genutils 0.2.0
    notebook 6.4.0
    jupyter 1.0.0
    jupyter-client 6.1.12
    jupyter-console 6.4.0
    jupyter-core 4.7.1
    jupyter-dash 0.4.0
    jupyterlab-pygments 0.1.2
    jupyterlab-widgets 1.0.0

    I was wondering if someone could explain to me what's happening, and how to fix it if the error is mine. Because it works out of the box with a simple pandas apply, I suppose there is a small problem in pandarallel.

    NB: Note also that this code leaves unkilled processes even after I interrupt or restart the IPython kernel. EDIT: Could it be linked to the fact that I'm using a lambda function?

    opened by GDelevoye 18
  • Add `parallel_apply` for `Resampler` class

    I implemented parallel_apply for the Resampler class to add some important time series functionality. For now it still uses the default _chunk method, but that can lead to some processes terminating much more quickly than others, e.g. if the time series gets denser over time. A potential upgrade would be to randomly sample the contents of the chunks, so each chunk gets a similar distribution of workloads.

    P.S.: I noticed that 30/188 of the tests fail due to a ZeroDivisionError. This is unrelated to my pull request, but an important issue.
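    Assuming the new method mirrors the naming of the existing API, usage would presumably look like the following (a hypothetical sketch; the data and frequency are invented and the call is not taken from the pull request itself):

    import pandas as pd
    from pandarallel import pandarallel

    pandarallel.initialize()

    # Hypothetical minute-level time series, resampled to daily buckets.
    index = pd.date_range("2022-01-01", periods=10_000, freq="min")
    ts = pd.Series(range(10_000), index=index)

    # Assumed to mirror Resampler.apply, per the pull request title.
    daily_means = ts.resample("1D").parallel_apply(lambda chunk: chunk.mean())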

    opened by alvail 17
  • Connection to IPC socket failed for pathname

    Hello, may I ask why pandarallel.initialize() produces the following warning? Thank you:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    E0812 19:11:57.484051 2409853824 io.cc:168] Connection to IPC socket failed for pathname /var/folders/sp/vz74h1tx3jlb3jqrq__bjwh00000gp/T/pandarallel-32ts0h6r/plasma_sock, retrying 20 more times

    Please help me, thanks.

    opened by lsircc 10
  • ZeroDivisionError: float division by zero

    General

    • Operating System: windows 11
    • Python version: 3.8.3
    • Pandas version: 1.4.2
    • Pandarallel version: 1.6.1

    Acknowledgement

    • [ ] My issue is NOT present when using pandas alone (without pandarallel)
    • [ ] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    WARNING: You are on Windows. If you detect any issue with pandarallel, be sure you checked out the Troubleshooting page:

    0.99% | 46 / 4628 | 0.00% | 0 / 4627 |

    multiprocessing.pool.RemoteTraceback:
    """
    Traceback (most recent call last):
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\multiprocessing\pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\multiprocessing\pool.py", line 51, in starmapstar
        return list(itertools.starmap(args[0], args[1]))
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\core.py", line 158, in __call__
        results = self.work_function(
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\data_types\series.py", line 26, in work
        return data.apply(
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\series.py", line 4433, in apply
        return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\apply.py", line 1082, in apply
        return self.apply_standard()
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\apply.py", line 1137, in apply_standard
        mapped = lib.map_infer(
      File "pandas\_libs\lib.pyx", line 2870, in pandas._libs.lib.map_infer
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\progress_bars.py", line 206, in closure
        state.next_put_iteration += max(int((delta_i / delta_t) * 0.25), 1)
    ZeroDivisionError: float division by zero

    Observed behavior

    Write here the observed behavior

    Expected behavior

    Write here the expected behavior

    Minimal but working code sample to ease bug fix for pandarallel team

    Write here the minimal code sample to ease bug fix for pandarallel team

    bug 
    opened by heya5 8
  • OverflowError

    Using Python 3.8, I am getting an OverflowError running parallel_apply:

    OverflowError                             Traceback (most recent call last)
    <ipython-input-5-a78fd5119887> in <module>
         37     grouped_df = df.groupby("id")
         38 
    ---> 39     grouped_df.parallel_apply(lookahead) \
         40         .to_parquet(output_location_look_ahead, compression='snappy', engine='pyarrow')
         41 
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
        429         queue = manager.Queue()
        430 
    --> 431         workers_args, chunk_lengths, input_files, output_files = get_workers_args(
        432             use_memory_fs,
        433             nb_requested_workers,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in get_workers_args(use_memory_fs, nb_workers, progress_bar, chunks, worker_meta_args, queue, func, args, kwargs)
        284             raise OSError(msg)
        285 
    --> 286         workers_args = [
        287             (
        288                 input_file.name,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in <listcomp>(.0)
        293                 progress_bar == PROGRESS_IN_WORKER,
        294                 dill.dumps(
    --> 295                     progress_wrapper(
        296                         progress_bar >= PROGRESS_IN_FUNC, queue, index, chunk_length
        297                     )(func)
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in wrapper(func)
        203     def wrapper(func):
        204         if progress_bar:
    --> 205             wrapped_func = inline(
        206                 progress_pre_func,
        207                 func,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in inline(pre_func, func, pre_func_arguments)
        485 
        486     func_instructions = tuple(get_instructions(func))
    --> 487     shifted_func_instructions = shift_instructions(
        488         func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
        489     )
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in shift_instructions(instructions, qty)
        301     If Python version not in 3.{5, 6, 7}, a SystemError is raised.
        302     """
    --> 303     return tuple(
        304         shift_instruction(instruction, qty)
        305         if bytes((instruction[0],))
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in <genexpr>(.0)
        302     """
        303     return tuple(
    --> 304         shift_instruction(instruction, qty)
        305         if bytes((instruction[0],))
        306         in (
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in shift_instruction(instruction, qty)
        291     """
        292     operation, *values = instruction
    --> 293     return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
        294 
        295 
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in int2python_bytes(item)
         69 
         70     nb_bytes = 2 if python_version.minor == 5 else 1
    ---> 71     return int.to_bytes(item, nb_bytes, "little")
         72 
         73 
    
    OverflowError: int too big to convert
    
    opened by lfdversluis 7
  • TypeError: 'generator' object is not subscriptable error in colab - works in VScode

    General

    • Operating System: OSX
    • Python version: 3.7.13
    • Pandas version: 1.3.5
    • Pandarallel version: 1.6.2

    Acknowledgement

    • Issue happens on Colab only. When I use VScode, the problem does not happen

    Bug description

    I get the error:

    100.00%
    1 / 1
    100.00%
    1 / 1
    100.00%
    1 / 1
    100.00%
    1 / 1
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-16-b9ca1a6007f2> in <module>()
          1 data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
          2 df = pd.DataFrame(data)
    ----> 3 df['HalfAge'] = df.parallel_apply(lambda r: r.Age/2,axis=1)
    
    2 frames
    /usr/local/lib/python3.7/dist-packages/pandarallel/core.py in closure(data, user_defined_function, *user_defined_function_args, **user_defined_function_kwargs)
        324             return wrapped_reduce_function(
        325                 (Path(output_file.name) for output_file in output_files),
    --> 326                 reduce_extra,
        327             )
        328 
    
    /usr/local/lib/python3.7/dist-packages/pandarallel/core.py in closure(output_file_paths, extra)
        197         )
        198 
    --> 199         return reduce_function(dfs, extra)
        200 
        201     return closure
    
    /usr/local/lib/python3.7/dist-packages/pandarallel/data_types/dataframe.py in reduce(datas, extra)
         45             datas: Iterable[pd.DataFrame], extra: Dict[str, Any]
         46         ) -> pd.DataFrame:
    ---> 47             axis = 0 if isinstance(datas[0], pd.Series) else 1 - extra["axis"]
         48             return pd.concat(datas, copy=False, axis=axis)
         49 
    
    TypeError: 'generator' object is not subscriptable
    

    Minimal but working code sample to ease bug fix for pandarallel team

    !pip install pandarallel
    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=True)
    
    import pandas as pd
    
    data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
    df = pd.DataFrame(data)
    df['HalfAge'] = df.parallel_apply(lambda r: r.Age/2,axis=1)
    
    opened by agiveon 6
  • Library not working....

    Hello everyone!

    While trying to run the example cases, I'm receiving the following error:

    Code

    import math
    import numpy as np
    import pandas as pd
    from pandarallel import pandarallel

    pandarallel.initialize()
    
    def func(x):
        return math.sin(x.a**2) + math.sin(x.b**2)
    
    if __name__ == '__main__':
        df_size = int(5e6)
        df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                               b=np.random.rand(df_size)))
        res_parallel = df.parallel_apply(func, axis=1, progress_bar=True)
    

    Error

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <timed exec> in <module>
    
    ~\.conda\envs\ingestion-env\lib\site-packages\pandarallel\pandarallel.py in closure(data, func, *args, **kwargs)
        434         try:
        435             pool = Pool(
    --> 436                 nb_workers, worker_init, (prepare_worker(use_memory_fs)(worker),)
        437             )
        438 
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\context.py in Pool(self, processes, initializer, initargs, maxtasksperchild)
        117         from .pool import Pool
        118         return Pool(processes, initializer, initargs, maxtasksperchild,
    --> 119                     context=self.get_context())
        120 
        121     def RawValue(self, typecode_or_type, *args):
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
        174         self._processes = processes
        175         self._pool = []
    --> 176         self._repopulate_pool()
        177 
        178         self._worker_handler = threading.Thread(
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\pool.py in _repopulate_pool(self)
        239             w.name = w.name.replace('Process', 'PoolWorker')
        240             w.daemon = True
    --> 241             w.start()
        242             util.debug('added worker')
        243 
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\process.py in start(self)
        110                'daemonic processes are not allowed to have children'
        111         _cleanup()
    --> 112         self._popen = self._Popen(self)
        113         self._sentinel = self._popen.sentinel
        114         # Avoid a refcycle if the target function holds an indirect
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\context.py in _Popen(process_obj)
        320         def _Popen(process_obj):
        321             from .popen_spawn_win32 import Popen
    --> 322             return Popen(process_obj)
        323 
        324     class SpawnContext(BaseContext):
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
         87             try:
         88                 reduction.dump(prep_data, to_child)
    ---> 89                 reduction.dump(process_obj, to_child)
         90             finally:
         91                 set_spawning_popen(None)
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
         58 def dump(obj, file, protocol=None):
         59     '''Replacement for pickle.dump() using ForkingPickler.'''
    ---> 60     ForkingPickler(file, protocol).dump(obj)
         61 
         62 #
    
    AttributeError: Can't pickle local object 'prepare_worker.<locals>.closure.<locals>.wrapper'
    

    I'm working on a Jupyter Lab notebook with a conda environment: What can I do?

    opened by murilobellatini 6
  • IndexError when there are fewer DataFrame rows than workers

    When the number of rows is below the number of workers an IndexError is raised. Minimal example:

    Code

    import time
    import pandas as pd
    from pandarallel import pandarallel
    
    pandarallel.initialize(progress_bar=True)
    
    df = pd.DataFrame({'x':[1,2]})
    df.parallel_apply(lambda row: print('A'), time.sleep(2), print('B'), axis=1)
    

    Output

    INFO: Pandarallel will run on 6 workers.
    INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
    B
       0.00%                                          |        0 /        1 |                                                                                                                    
       0.00%                                          |        0 /        1 |                                                                                                                    Traceback (most recent call last):
      File "foo.py", line 8, in <module>
        df.parallel_apply(lambda row: print('A'), time.sleep(2), print('B'), axis=1)
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 446, in closure
        map_result,
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 382, in get_workers_result
        progress_bars.update(progresses)
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/utils/progress_bars.py", line 82, in update
        self.__bars[index][0] = value
    IndexError: list index out of range
    

    I'm using python version 3.7.4 with pandas 0.25.3 and pandarallel 1.4.4.

    bug 
    opened by elemakil 6
  • __main__ imposition in code

    Make sure that a novice who tries to use the module is aware that the calling code should be encapsulated within the main execution context, much like the usual requirements of multiprocessing in general. This will reduce wasted time and unnecessary digging into multiprocessing. A sketch of the pattern follows below.
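    A minimal sketch of the guard being suggested (standard multiprocessing practice; the example DataFrame and function are invented, and this is not code from pandarallel itself):

    import pandas as pd
    from pandarallel import pandarallel

    def main():
        pandarallel.initialize()
        df = pd.DataFrame({"x": range(1000)})
        print(df.parallel_apply(lambda row: row.x * 2, axis=1).head())

    if __name__ == "__main__":
        # Guard the entry point so worker processes spawned by multiprocessing
        # do not re-execute the module's top-level code (required with the
        # "spawn" start method used on Windows and macOS).
        main()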

    opened by karunakar2 4
  • how to use it in fastapi

    I have a server like:

    from fastapi import FastAPI
    from pandarallel import pandarallel
    pandarallel.initialize()
    app = FastAPI()
    
    @app.post("/area_quant")
    def create_item(event_data: Areadata):
        data = pd.DataFrame(event_data.data)
        data['type_score'] = data['EVENT_TYPE'].applymap(Config.type_map)
    
    if __name__ == '__main__':
        uvicorn.run(app)
    

    As a server, it can only run one time, then it will shut down. How can I use it in a server?

    opened by lim-0 0
  • 'functools.partial' object has no attribute '__code__' in Jupyter Notebooks

    General

    • Operating System: Ubuntu 20.04
    • Python version: 3.9.13
    • Pandas version: 1.4.4
    • Pandarallel version: 1.6.3

    Acknowledgement

    • [x] My issue is NOT present when using pandas alone (without pandarallel)
    • [x] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    functools.partial cannot be used in conjunction with Jupyter notebooks and pandarallel

    Observed behavior

    Write here the observed behavior

    Expected behavior

    Write here the expected behavior

    Minimal but working code sample to ease bug fix for pandarallel team

    import pandas as pd
    from functools import partial
    from typing import List
    import numpy as np
    from pandarallel import pandarallel
    
    pandarallel.initialize(progress_bar=True)
    
    def processing_fn(row: pd.Series, invalid_numbers: List[int], default: int) -> pd.Series:
        cond = row.isin(invalid_numbers)
        row[cond] = default
        return row
    
    data = pd.DataFrame(np.random.randint(low=-10, high=10, size=(10000, 5)))
    print("Before", (data.values == 100).sum())
    
    fn = partial(processing_fn, invalid_numbers=[-5, 2, 5], default=100)
    new_data = data.apply(fn, axis=1)
    
    print("After serial", (new_data.values == 100).sum())
    
    data = data.parallel_apply(fn, axis=1)
    print("After parallel", (data.values == 100).sum())
    
    

    Works fine in a standalone script, but fails if run in a Jupyter notebook.

    opened by Meehai 0
  • Choose which type of progress bar you want in a notebook

    Sometimes one doesn't have control over the environment of the Jupyter instance where one is working (think JupyterHub) and can't install the necessary extensions for progress bars. In this case it would be nice to have the option to manually request the simple progress bar, so that it doesn't just display a widget error.

    Is there already a way to do so that I've missed? Otherwise, if it's something you'd consider adding, I'd also be happy to try and draft a PR.

    opened by astrojarred 3
  • TypeError: cannot pickle 'sqlite3.Connection' object in pyCharm

    General

    • Operating System: Ubuntu 22.04
    • Python version: 3.9.7
    • Pandas version: 1.4.2
    • Pandarallel version: 1.6.3

    Acknowledgement

    • [x] My issue is NOT present when using pandas alone (without pandarallel)
    • [ ] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    I observe this when running parallel_apply in pyCharm. #76 sounds similar to me, but none of the tricks suggested there work for me. At this point I am also not sure if it's more of an issue with pyCharm.

    Observed behavior

    Traceback (most recent call last):
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-39-d230d86ff5ef>", line 1, in <module>
        iris.groupby("species").parallel_apply(lambda x: np.mean(x))
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/pandarallel/core.py", line 265, in closure
        dilled_user_defined_function = dill.dumps(user_defined_function)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 364, in dumps
        dump(obj, file, protocol, byref, fmode, recurse, **kwds)#, strictio)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 336, in dump
        Pickler(file, protocol, **_kwds).dump(obj)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 620, in dump
        StockPickler.dump(self, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 487, in dump
        self.save(obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1963, in save_function
        _save_with_postproc(pickler, (_create_function, (
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1154, in _save_with_postproc
        pickler._batch_setitems(iter(source.items()))
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 692, in save_reduce
        save(args)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 886, in save_tuple
        save(element)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 717, in save_reduce
        save(state)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 717, in save_reduce
        save(state)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 578, in save
        rv = reduce(self.proto)
    TypeError: cannot pickle 'sqlite3.Connection' object
    

    Minimal but working code sample to ease bug fix for pandarallel team

    import pandas as pd
    import numpy as np
    from pandarallel import pandarallel
    pandarallel.initialize(use_memory_fs=False)
    iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    iris.groupby("species").parallel_apply(lambda x: np.mean(x))
    
    opened by wiessall 1
Releases

  • v1.6.3

Owner

Manu NALEPA
Data Scientist / Data Engineer @ Clustree — Sousaphone & saxophone player