CUDA integration for Python, plus shiny features

Overview

PyCUDA lets you access Nvidia's CUDA parallel computation API from Python. Several wrappers of the CUDA API already exist, so what's so special about PyCUDA?

  • Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code. PyCUDA knows about dependencies, too, so (for example) it won't detach from a context before all memory allocated in it is also freed.
  • Convenience. Abstractions like pycuda.driver.SourceModule and pycuda.gpuarray.GPUArray make CUDA programming even more convenient than with Nvidia's C-based runtime.
  • Completeness. PyCUDA puts the full power of CUDA's driver API at your disposal, if you wish. It also includes code for interoperability with OpenGL.
  • Automatic Error Checking. All CUDA errors are automatically translated into Python exceptions.
  • Speed. PyCUDA's base layer is written in C++, so all the niceties above are virtually free.
  • Helpful Documentation and a Wiki.
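
The first bullet (cleanup tied to object lifetime, with dependency tracking) can be illustrated in plain Python. This is a minimal sketch, not PyCUDA's implementation: `Context` and `Allocation` are hypothetical stand-ins, and the release order relies on CPython's reference counting.

```python
import weakref

events = []  # records the order in which resources are released


class Context:
    """Hypothetical stand-in for a GPU context (not PyCUDA's class)."""

    def __init__(self, name):
        self.name = name
        # The callback runs once this object has been garbage-collected.
        weakref.finalize(self, events.append, f"detach {name}")


class Allocation:
    """Hypothetical stand-in for device memory that depends on a context."""

    def __init__(self, context, nbytes):
        # The strong reference keeps the context alive for as long
        # as this allocation exists.
        self.context = context
        self.nbytes = nbytes
        weakref.finalize(self, events.append, f"free {nbytes} bytes")


ctx = Context("ctx0")
buf = Allocation(ctx, 1024)

del ctx                # nothing is released yet: buf still references the context
assert events == []
del buf                # the memory goes first, then the context
print(events)
```

Because the allocation holds a reference to its context, the context cannot be torn down while memory allocated in it is still alive, which is the ordering guarantee the bullet describes.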

Relatedly, like-minded computing goodness for OpenCL is provided by PyCUDA's sister project PyOpenCL.

Comments
  • Shipped Boost.Python is incompatible with Python 3.11

    Hello,

    To test Python 3.11 with PyCUDA, I have just installed the latest PyCUDA version, 2022.1.

    I use CUDA 11.6.2 on a Windows 11 laptop.

    When I run my program (which uses PyCUDA), I get this error message:

    Traceback (most recent call last):
      File "D:\Alain\Astro\Soft\PC\PC_Traitement_Video_V7_51g.py", line 19, in <module>
        import pycuda.driver as drv
      File "C:\Users\apail\AppData\Local\Programs\Python\Python311\Lib\site-packages\pycuda\driver.py", line 65, in <module>
        from pycuda._driver import *  # noqa
    SystemError: type Boost.Python.enum has the Py_TPFLAGS_HAVE_GC flag but has no traverse function
    

    Any help will be appreciated.

    Alain

    bug 
    opened by easybob95 22
  • DeviceMemoryPool limited to 2**32 bytes (4GB)

    Working with arrays larger than 4 GB leads to a memory error when using a DeviceMemoryPool (cuMemcpyHtoD failed: invalid argument). The allocation itself (without a pool) is not a problem above 4GB. Tested under debian9/x86_64 (P6000) and ubuntu20.04/ppc64le with a V100.

    A simple test:

    import pycuda.autoinit
    import pycuda.gpuarray as cua
    import pycuda.tools as cut
    import numpy as np
    
    m = cut.DeviceMemoryPool()
    
    a = np.ones(2**30 - 1, dtype=np.float32)
    b = cua.to_gpu(a, allocator=m.allocate)  # Passes
    
    a = np.ones(2**30, dtype=np.float32)
    b = cua.to_gpu(a, allocator=m.allocate)  # Error: cuMemcpyHtoD failed: invalid argument
    

    This is going to become a more frequent issue as data sizes grow along with the cards' memory.

    The reason seems to be the use of Boost's Allocator::size_type, which apparently maps to uint32. However, given how intricate the Boost headers are, I failed to find exactly where that is defined, so I am not sure how to tackle this issue.
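
To make the suspected 32-bit limit concrete, here is a small sketch in plain Python that needs no GPU. It assumes, as the report does, that the byte count ends up in a 32-bit unsigned size type; `as_uint32` is a hypothetical helper used only for illustration.

```python
import ctypes

FLOAT32_ITEMSIZE = 4  # bytes per np.float32 element


def as_uint32(nbytes):
    """Return the value a 32-bit unsigned size type would hold for nbytes."""
    return ctypes.c_uint32(nbytes).value


passing = (2**30 - 1) * FLOAT32_ITEMSIZE  # 4 GiB minus 4 bytes
failing = 2**30 * FLOAT32_ITEMSIZE        # exactly 2**32 bytes

print(passing, as_uint32(passing))  # the value survives unchanged
print(failing, as_uint32(failing))  # the value wraps around to 0
```

The two sizes in the test above sit on either side of that boundary: the first still fits in 32 bits, the second is the smallest size that does not.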

    @inducer I'd be happy to work on a PR, but I'd really need some pointers on where to search for the origin of size_type. Also, is the Boost subset up to date?

    opened by vincefn 14
  • Allocating arrays greater than 4GB

    Setup:

    • pycuda: 2018.1.1+cuda100,
    • Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
    • Windows Server 2012 R2 Standard

    When I try to move an np.ndarray to the GPU as follows:

    x = np.random.rand(900, 700, 2000)
    gpuarray.to_gpu(x)
    

    I get:

    C:\ProgramData\Anaconda3\lib\site-packages\pycuda\gpuarray.py in __init__(self,
    shape, dtype, allocator, base, gpudata, strides, order)
        208         if gpudata is None:
        209             if self.size:
    --> 210                 self.gpudata = self.allocator(self.size * self.dtype.itemsize)
        211             else:
        212                 self.gpudata = None
    
    OverflowError: Python int too large to convert to C unsigned long
    

    It appears I can't allocate any arrays larger than 4 GB. Is there any way to fix this issue?
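
The size arithmetic behind this error can be checked without a GPU. The sketch below assumes the allocation size is passed to C as an `unsigned long`, as the error message suggests; that type is 32 bits wide on 64-bit Windows and 64 bits wide on 64-bit Linux. `fits_unsigned` is a hypothetical helper.

```python
import ctypes
from math import prod

shape = (900, 700, 2000)
itemsize = 8  # np.random.rand returns float64 values

needed = prod(shape) * itemsize  # bytes requested for the array


def fits_unsigned(nbytes, width_bytes):
    """True if nbytes is representable in an unsigned integer of that width."""
    return 0 <= nbytes < 2 ** (8 * width_bytes)


# Width of C `unsigned long` on the machine running this script.
ulong_width = ctypes.sizeof(ctypes.c_ulong)

print(needed)                              # 10080000000 bytes, roughly 9.4 GiB
print(fits_unsigned(needed, 4))            # 32-bit unsigned long: does not fit
print(fits_unsigned(needed, 8))            # 64-bit unsigned long: fits
print(fits_unsigned(needed, ulong_width))  # result for this platform
```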

    opened by rhacking 14
  • Fix operations issues on large arrays

    About

    This MR fixes a "pycuda hanging forever" issue when array sizes exceed 2**34 bytes. It's done by replacing some occurrences of unsigned (int) with size_t in template kernels (element-wise, reduction, scan).

    Close #375

    The tests had to be done on arrays of double to avoid numerical issues.
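
The numerical issue is presumably that single precision cannot represent every index of arrays this large exactly. The sketch below checks this with the standard library only; `to_float32` is a hypothetical helper that rounds a value to float32 and back.

```python
import struct


def to_float32(value):
    """Round a Python float to the nearest float32 and convert it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


n = 512 * 2048 * 2048  # number of elements in the test arrays (2**31)
last_index = n - 1

# float32 has a 24-bit significand, so integers above 2**24 are not all exact.
print(to_float32(float(2**24 + 1)))                 # 16777216.0
print(to_float32(float(last_index)) == last_index)  # False
# float64 represents every integer up to 2**53 exactly.
print(float(last_index) == last_index)              # True
```

With float32 arrays, a kernel such as `d_arr[i] = i` or a sum over 2**31 ones could not be compared exactly against the reference.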

    ElementWise

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as garray
    from pycuda.elementwise import ElementwiseKernel
    
    eltwise = ElementwiseKernel("double* d_arr", "d_arr[i] = i", "linspace")
    d_arr = garray.empty((512, 2048, 2048), np.float64)
    eltwise(d_arr)
    result = d_arr.get()[()]
    reference = np.arange(d_arr.size, dtype=np.float64).reshape(d_arr.shape)
    assert np.allclose(result, reference)
    

    Reduction

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as garray
    from pycuda.reduction import ReductionKernel
    
    reduction = ReductionKernel(np.float64, neutral="0", reduce_expr="a+b", map_expr="x[i]", arguments="double* x")
    d_arr = garray.zeros((512, 2048, 2048), np.float64)
    d_arr.fill(1) # elementwise
    result = reduction(d_arr.ravel()).get()[()]
    assert result == d_arr.size
    

    Scan

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as garray
    from pycuda.scan import InclusiveScanKernel
    
    cumsum = InclusiveScanKernel(np.float64, "a+b")
    d_arr = garray.zeros((512, 2048, 2048), np.float64)
    d_arr.fill(1)
    result = cumsum(d_arr.ravel()).get()[()]
    assert result[-1] == d_arr.size
    
    opened by pierrepaleo 13
  • pycuda._driver.Error: cuInit failed: unknown error

    Environment: WSL, Ubuntu 18. I made sure that my CUDA device is connected. [image] nvidia-smi output: [image]

    I get an error when I use PyCUDA. There is no error when I enter the following code: [image] I then ran the command sudo nvidia-modprobe -u and rebooted my device, but it still doesn't work. What should I do?

    opened by Mulbetty 13
  • copy() for some discontiguous arrays; __setitem__; get2() provisional…

    Adds a private function _copy() that copies a GPUArray or ndarray to another GPUArray or ndarray. The two arrays must have the same shape and dtype, and must have at most three dimensions. They must have the same order and must be contiguous along the minor axis, but otherwise don't have to have the same strides. Sorry that it's verbose; I can compact it later if it's decided to keep it.

    This function is used in copy() and __setitem__(), and in a dumbly-named get2() method which doesn't automatically reshape arrays with the same size but different shape. I wasn't sure what the right thing to do here was.

    There isn't an asynchronous version because I'm not familiar yet with how that works.

    opened by davidweichiang 13
  • pycuda with multiple Python host threads

    I want to use multiple threads with Python on the host. I tried the following methods:

    1. Use autoinit in the main thread and run TensorRT inference in a Python ThreadPoolExecutor. This gives a "no activity context" error when calling the CUDA API cuda.mem_alloc.
    2. Create the ThreadPoolExecutor with an initializer like this:
    def cuda_ctx_init():
        _device = cuda.Device(0)
        ctx = _device.make_context()
    

    After doing this, I can successfully allocate GPU memory with cuda.mem_alloc in the thread, but I get another error when running TensorRT inference:

    Parameter check failed at: ../rtSafe/safeContext.cpp::terminateCommonContext::165, condition: cudaEventDestroy(context.start) failure.
    Parameter check failed at: ../rtSafe/safeContext.cpp::terminateCommonContext::170, condition: cudaEventDestroy(context.stop) failure.
     ../rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 77 (an illegal memory access was encountered)
    terminate called after throwing an instance of 'nvinfer1::CudaError'
    

    I found a C++ solution, "DLA and GPU cores at the same time", and a Python solution, "How to use TensorRT by the multi-threading package of python".

    I noticed that, unlike the C++ version, the Python version requires additional operations:

    
    ctx.push()
    
    ctx.pop()
    

    In my tests, these operations take too much time on every call. Is there any other way to use PyCUDA with multiple Python threads?

    Thanks!
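
For reference, the push/pop pairing mentioned above can be wrapped in a context manager so that every worker thread leaves the context stack balanced, even when the work raises. This is only a sketch of the pattern: `FakeContext` is a hypothetical stand-in that records calls so the example runs without a GPU, and `active` is an illustrative helper name. In real code the object would be the context returned by `make_context()`. It does not reduce the cost of the push/pop calls themselves.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class FakeContext:
    """Hypothetical stand-in that records push/pop calls per thread."""

    def __init__(self):
        self.log = []
        self._lock = threading.Lock()

    def push(self):
        with self._lock:
            self.log.append(("push", threading.get_ident()))

    def pop(self):
        with self._lock:
            self.log.append(("pop", threading.get_ident()))


@contextmanager
def active(ctx):
    """Make ctx current for the calling thread, and always pop it again."""
    ctx.push()
    try:
        yield ctx
    finally:
        ctx.pop()


def work(ctx, x):
    with active(ctx):
        return x * x  # GPU work (allocation, inference) would go here


ctx = FakeContext()
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda x: work(ctx, x), range(4)))

print(results)
```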

    opened by trobr 10
  • Add test for reverse slicing

    Hi,

    I am currently trying to use reverse slicing with a gpuarray (e.g. [end:start:-step]), but I am getting an error. I have quickly written a new test to reproduce the bug (feel free to discard my merge request if you are able to work on it).

    >       copy.src_pitch = src_strides[1]
    E       OverflowError: can't convert negative value to unsigned int
    
    /usr/local/lib/python3.5/dist-packages/pycuda-2017.1.1-py3.5-linux-x86_64.egg/pycuda/gpuarray.py:1300: OverflowError
    

    I went through the code looking for the definition of src_pitch, and I suppose that it comes from cudaMemcpy as defined by Nvidia, right? If so, it would not be possible to change the unsigned int to a signed one.
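
The underlying conflict can be reproduced with the standard library alone: a reversed view has a negative stride, and a negative value cannot be stored in an unsigned field. The sketch uses `memoryview` and `struct` as stand-ins; the `struct.error` here is only an analogue of the `OverflowError` above, not the same code path.

```python
import struct
from array import array

data = array("d", [0.0, 1.0, 2.0, 3.0])
forward = memoryview(data)
reverse = forward[::-1]  # reversed view over the same buffer, no copy

print(forward.strides)   # (8,)
print(reverse.strides)   # (-8,)
print(reverse.tolist())  # [3.0, 2.0, 1.0, 0.0]

# Storing the negative stride in an unsigned 32-bit field is rejected.
try:
    struct.pack("I", reverse.strides[0])
    packed = True
except struct.error:
    packed = False

print(packed)            # False
```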

    opened by loikki 10
  • Windows+Py3.8: A dynamic link library (DLL) initialization routine failed.

    @cgohlke reported in #213, moved here:

    ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed

    > py -3.8 -c"import pycuda._driver"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.
    

    It seems that this error is not due to CUDA, boost, or pybind11 since the simple _pvt_struct extension and pyopencl-2019.1 also fail:

    > py -3.8 -c"import pycuda._pvt_struct"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.
    
    > py -3.8 -c"import pyopencl"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "X:\Python38\lib\site-packages\pyopencl\__init__.py", line 39, in <module>
        import pyopencl._cl as _cl
    ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.
    
    opened by inducer 9
  • Kernel cache not working  (hexdigest broken?)

    On my systems the kernel caching mechanism is not working. I have tested this on both Linux and Windows.

    The checksum.hexdigest() call seems to return a different value when passed the same kernel. I have no idea why... does anyone else see this behaviour?
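
As a debugging aid: a hash digest is deterministic for identical input bytes, so a changing key usually means that something other than the kernel source is being hashed. The sketch below is not PyCUDA's cache code. `cache_key` and its parameters are hypothetical, and whether an unstable input is the cause of this report is an assumption.

```python
import hashlib


def cache_key(source, options=(), extra=()):
    """Hypothetical cache key: hash the source plus everything that affects the build."""
    checksum = hashlib.sha256()
    checksum.update(source.encode("utf-8"))
    for part in (*options, *extra):
        checksum.update(str(part).encode("utf-8"))
    return checksum.hexdigest()


src = "__global__ void twice(float *a) { a[threadIdx.x] *= 2; }"

# Same source and options: the key is stable across calls.
stable_a = cache_key(src, ("-O3",))
stable_b = cache_key(src, ("-O3",))

# An input whose text differs between objects (the default repr embeds a
# memory address) changes the key although the kernel is identical.
first, second = object(), object()
unstable_a = cache_key(src, ("-O3",), extra=(first,))
unstable_b = cache_key(src, ("-O3",), extra=(second,))

print(stable_a == stable_b)      # True
print(unstable_a == unstable_b)  # False
```

Printing each component that goes into the checksum on two consecutive runs is one way to find which input changes.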

    opened by pwalsh0 9
  • Add in-place poisson random-number generation

    With this version, the input array can be used to supply the per-element lambda value. This is similar to what numpy.random.poisson() allows: either supplying a shape and one lambda value, or an array of lambda values.

    This is very useful when simulating detector data for imaging, where each point has a different expected value.

    Let me know if you want examples; there are currently none for the random-number generators.
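
To illustrate the in-place semantics on the CPU, here is a plain-Python reference sketch. It uses Knuth's multiplication method, which is not necessarily what the GPU generator uses; `poisson_knuth` and `poisson_inplace` are hypothetical names.

```python
import math
import random


def poisson_knuth(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method (suitable for small lam)."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    limit = math.exp(-lam)
    count = 0
    product = rng.random()
    # Multiply uniform draws until the product falls to exp(-lam) or below.
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def poisson_inplace(values, rng):
    """Replace each entry (its expected value) with a Poisson sample drawn from it."""
    for i, lam in enumerate(values):
        values[i] = poisson_knuth(lam, rng)


rng = random.Random(0)
expected_counts = [0.0, 0.5, 4.0, 20.0]  # e.g. per-pixel expected detector counts
poisson_inplace(expected_counts, rng)
print(expected_counts)  # four non-negative integers; the first is always 0
```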

    opened by vincefn 8
  • Update `_pvt_struct` based on Python 3.8

    This was a heavy-handed attempt at #395 until the simpler solution there emerged. This PR serves to preserve that effort, although there is not currently a need for this. Worse, the new struct module causes test failures.

    opened by inducer 0
  • rebase demo_cdpSimplePrint.py on the BSD licensed cdpSimplePrint.cu

    examples/demo_cdpSimplePrint.py comes with a problematic license:

     * Adapted from NVIDIA's "cdpSimplePrint - Simple Print (CUDA Dynamic Parallelism)" sample
     * http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-print--cuda-dynamic-parallelism-
     * http://ecee.colorado.edu/~siewerts/extra/code/example_code_archive/a490dmis_code/CUDA/cuda_work/samples/0_Simple/cdpSimplePrint/cdpSimplePrint.cu
     *
     * From cdpSimplePrint.cu (not sure if this is Ok with NVIDIA's 38-page EULA though...):
     * ---------------------------------------------------------------------------
     * Copyright 1993-2012 NVIDIA Corporation.  All rights reserved.
     *
     * Please refer to the NVIDIA end user license agreement (EULA) associated
     * with this source code for terms and conditions that govern your use of
     * this software. Any use, reproduction, disclosure, or distribution of
     * this software and related documentation outside the terms of the EULA
     * is strictly prohibited.
     * ---------------------------------------------------------------------------
    

    Fortunately NVIDIA recently released their code samples under the 3-clause BSD license: https://github.com/NVIDIA/cuda-samples

    Please rebase that example on the updated free source code https://github.com/NVIDIA/cuda-samples/blob/master/Samples/3_CUDA_Features/cdpSimplePrint/cdpSimplePrint.cu

    bug 
    opened by anbe42 0
  • `__add__` fails for empty array inputs

    Here's the MWE

    >>> import pycuda.autoinit
    >>> import pycuda.gpuarray as gpuarray
    >>> import numpy as np
    >>> empty_array = np.array([])
    >>> empty_array_gpu = gpuarray.to_gpu(empty_array)
    >>> result_array = empty_array + 0 # array([], dtype=float64)
    >>> result_array_gpu = empty_array_gpu + 0 # Fails
    

    Here's the error trace

    ---------------------------------------------------------------------------
    ArgumentError                             Traceback (most recent call last)
    Input In [29], in <cell line: 1>()
    ----> 1 empty_array_gpu + 0
    
    File ~/pycuda/pycuda/gpuarray.py:593, in GPUArray.__add__(self, other)
        590 elif np.isscalar(other):
        591     # add a scalar
        592     if other == 0:
    --> 593         return self.copy()
        594     else:
        595         result = self._new_like_me(_get_common_dtype(self, other))
    
    File ~/pycuda/pycuda/gpuarray.py:393, in GPUArray.copy(self)
        391 def copy(self):
        392     new = GPUArray(self.shape, self.dtype, self.allocator)
    --> 393     _memcpy_discontig(new, self)
        394     return new
    
    File ~/pycuda/pycuda/gpuarray.py:1570, in _memcpy_discontig(dst, src, async_, stream)
       1566         drv.memcpy_dtod_async(
       1567             dst.gpudata, src.gpudata, src.nbytes, stream=stream
       1568         )
       1569     else:
    -> 1570         drv.memcpy_dtod(dst.gpudata, src.gpudata, src.nbytes)
       1571 else:
       1572     # The arrays might be contiguous in the sense of
       1573     # having no gaps, but the axes could be transposed
       1574     # so that the order is neither Fortran or C.
       1575     # So, we attempt to get a contiguous view of dst.
       1576     dst = _as_strided(dst, shape=(dst.size,), strides=(dst.dtype.itemsize,))
    
    ArgumentError: Python argument types in
        pycuda._driver.memcpy_dtod(NoneType, NoneType, int)
    did not match C++ signature:
        memcpy_dtod(unsigned long long dest, unsigned long long src, unsigned long size)
    
    bug 
    opened by mitkotak 0
  • `__neg__` failing for empty arrays

    Here's the MWE

    >>> import pycuda.autoinit
    >>> import pycuda.gpuarray as gpuarray
    >>> import numpy as np
    >>> empty_array = np.array([])
    >>> empty_array_gpu = gpuarray.to_gpu(empty_array)
    >>> neg_empty_array = -empty_array # array([], dtype=float64)
    >>> neg_empty_array_gpu = -empty_array_gpu # Fails
    

    Here's the error trace

    ---------------------------------------------------------------------------
    error                                     Traceback (most recent call last)
    Input In [17], in <cell line: 1>()
    ----> 1 -gpuarray.to_gpu(a)
    
    File ~/pycuda/pycuda/gpuarray.py:643, in GPUArray.__neg__(self)
        641 def __neg__(self):
        642     result = self._new_like_me()
    --> 643     return self._axpbz(-1, 0, result)
    
    File ~/pycuda/pycuda/gpuarray.py:468, in GPUArray._axpbz(self, selffac, other, out, stream)
        463     raise RuntimeError(
        464         "only contiguous arrays may " "be used as arguments to this operation"
        465     )
        467 func = elementwise.get_axpbz_kernel(self.dtype, out.dtype)
    --> 468 func.prepared_async_call(
        469     self._grid,
        470     self._block,
        471     stream,
        472     selffac,
        473     self.gpudata,
        474     other,
        475     out.gpudata,
        476     self.mem_size,
        477 )
        479 return out
    
    File ~/pycuda/pycuda/driver.py:626, in _add_functionality.<locals>.function_prepared_async_call(func, grid, block, stream, *args, **kwargs)
        620     raise TypeError(
        621         "unknown keyword arguments: " + ", ".join(kwargs.keys())
        622     )
        624 from pycuda._pvt_struct import pack
    --> 626 arg_buf = pack(func.arg_format, *args)
        628 for texref in func.texrefs:
        629     func.param_set_texref(texref)
    
    error: required argument is not an integer
    
    bug 
    opened by mitkotak 2
  • `GPUArray.zeros_like | ones_like` failing for scalar inputs

    Here's the MWE

    >>> import pycuda.autoinit
    >>> import pycuda.gpuarray as gpuarray
    >>> import numpy as np
    >>> c = 42.0
    >>> zero_array = np.zeros_like(c) # array(0.)
    >>> zero_array_gpu = gpuarray.zeros_like(c) # Fails
    >>> one_array = np.ones_like(c) # array(1.)
    >>> one_array_gpu = gpuarray.ones_like(c) # Fails
    

    Here's the error trace

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Input In [15], in <cell line: 1>()
    ----> 1 gpuarray.zeros_like(c)
    
    File ~/pycuda/pycuda/gpuarray.py:1422, in zeros_like(other_ary, dtype, order)
       1421 def zeros_like(other_ary, dtype=None, order="K"):
    -> 1422     dtype, order, strides = _array_like_helper(other_ary, dtype, order)
       1423     result = GPUArray(
       1424         other_ary.shape, dtype, other_ary.allocator, order=order, strides=strides
       1425     )
       1426     zero = np.zeros((), result.dtype)
    
    File ~/pycuda/pycuda/gpuarray.py:1394, in _array_like_helper(other_ary, dtype, order)
       1392         order = "C"
       1393 elif order == "K":
    -> 1394     if other_ary.flags.c_contiguous or (other_ary.ndim <= 1):
       1395         order = "C"
       1396     elif other_ary.flags.f_contiguous:
    
    AttributeError: 'float' object has no attribute 'flags'
    
    bug 
    opened by mitkotak 0
Releases(v2022.2)
Python interface to GPU-powered libraries

Package Description scikit-cuda provides Python interfaces to many of the functions in the CUDA device/runtime, CUBLAS, CUFFT, and CUSOLVER libraries

Lev E. Givon 924 Dec 26, 2022
📊 A simple command-line utility for querying and monitoring GPU status

gpustat Just less than nvidia-smi? NOTE: This works with NVIDIA Graphics Devices only, no AMD support as of now. Contributions are welcome! Self-Promo

Jongwook Choi 3.2k Jan 04, 2023
cuML - RAPIDS Machine Learning Library

cuML - GPU Machine Learning Algorithms cuML is a suite of libraries that implement machine learning algorithms and mathematical primitives functions t

RAPIDS 3.1k Jan 04, 2023
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

NVIDIA DALI The NVIDIA Data Loading Library (DALI) is a library for data loading and pre-processing to accelerate deep learning applications. It provi

NVIDIA Corporation 4.2k Jan 08, 2023
ArrayFire: a general purpose GPU library.

ArrayFire is a general-purpose library that simplifies the process of developing software that targets parallel and massively-parallel architectures i

ArrayFire 4k Dec 29, 2022
Conda package for artifact creation that enables offline environments. Ideal for air-gapped deployments.

Conda-Vendor Conda Vendor is a tool to create local conda channels and manifests for vendored deployments Installation To install with pip, run: pip i

MetroStar - Tech 13 Nov 17, 2022
cuDF - GPU DataFrame Library

cuDF - GPU DataFrames NOTE: For the latest stable README.md ensure you are on the main branch. Resources cuDF Reference Documentation: Python API refe

RAPIDS 5.2k Jan 08, 2023
General purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing usecases.

Vulkan Kompute The general purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabl

The Institute for Ethical Machine Learning 1k Dec 26, 2022
Library for faster pinned CPU <-> GPU transfer in Pytorch

SpeedTorch Faster pinned CPU tensor - GPU Pytorch variable transfer and GPU tensor - GPU Pytorch variable transfer, in certain cases. Update 9-29-1

Santosh Gupta 657 Dec 19, 2022
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Introduction This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code her

NVIDIA Corporation 6.9k Dec 28, 2022
cuGraph - RAPIDS Graph Analytics Library

cuGraph - GPU Graph Analytics The RAPIDS cuGraph library is a collection of GPU accelerated graph algorithms that process data found in GPU DataFrames

RAPIDS 1.2k Jan 01, 2023
Python 3 Bindings for the NVIDIA Management Library

====== pyNVML ====== *** Patched to support Python 3 (and Python 2) *** ------------------------------------------------ Python bindings to the NVID

Nicolas Hennion 95 Jan 01, 2023
jupyter/ipython experiment containers for GPU and general RAM re-use

ipyexperiments jupyter/ipython experiment containers and utils for profiling and reclaiming GPU and general RAM, and detecting memory leaks. About Thi

Stas Bekman 153 Dec 07, 2022
A Python function for Slurm, to monitor the GPU information

Gpu-Monitor A Python function for Slurm, where I couldn't use nvidia-smi to monitor the GPU information. The whole repo is not finished. Installation TODO Mo

Squidward Tentacles 2 Feb 11, 2022
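QPT - Quick packaging tool: a forward-style quick packaging tool for Python environments

QPT - Quick packaging tool. GitHub home page | Gitee home page. QPT is a multi-purpose packaging tool that can "simulate" a development environment. A single command packages an ordinary Python script into an EXE executable, and it can also easily bundle deep learning acceleration libraries such as CUDA, reproducing your development environment as closely as possible on the user's machine.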

GT-Zhang 545 Dec 28, 2022
A NumPy-compatible array library accelerated by CUDA

CuPy : A NumPy-compatible array library accelerated by CUDA Website | Docs | Install Guide | Tutorial | Examples | API Reference | Forum CuPy is an im

CuPy 6.6k Jan 05, 2023
BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.

A lightweight, GPU accelerated, SQL engine built on the RAPIDS.ai ecosystem. Get Started on app.blazingsql.com Getting Started | Documentation | Examp

BlazingSQL 1.8k Jan 02, 2023
Python 3 Bindings for NVML library. Get NVIDIA GPU status inside your program.

py3nvml Documentation also available at readthedocs. Python 3 compatible bindings to the NVIDIA Management Library. Can be used to query the state of

Fergal Cotter 212 Jan 04, 2023
cuSignal - RAPIDS Signal Processing Library

cuSignal The RAPIDS cuSignal project leverages CuPy, Numba, and the RAPIDS ecosystem for GPU accelerated signal processing. In some cases, cuSignal is

RAPIDS 646 Dec 30, 2022
CUDA integration for Python, plus shiny features

PyCUDA lets you access Nvidia's CUDA parallel computation API from Python. Several wrappers of the CUDA API already exist-so what's so special about P

Andreas Klöckner 1.4k Jan 02, 2023