You like pytorch? You like micrograd? You love tinygrad! ❤️

For something in between a pytorch and a karpathy/micrograd

This may not be the best deep learning framework, but it is a deep learning framework.

Due to its extreme simplicity, it aims to be the easiest framework to add new accelerators to, with support for both inference and training. Support the simple basic ops, and you get SOTA vision extra/efficientnet.py and language extra/transformer.py models. We are working on support for the Apple Neural Engine.

Eventually, we will build custom hardware for tinygrad, and it will be blindingly fast. Now, it is slow.

Installation

pip3 install git+https://github.com/geohot/tinygrad.git --upgrade

Example

from tinygrad.tensor import Tensor

x = Tensor.eye(3)
y = Tensor([[2.0,0,-2.0]])
z = y.matmul(x).sum()
z.backward()

print(x.grad)  # dz/dx
print(y.grad)  # dz/dy

Same example in torch

import torch

x = torch.eye(3, requires_grad=True)
y = torch.tensor([[2.0,0,-2.0]], requires_grad=True)
z = y.matmul(x).sum()
z.backward()

print(x.grad)  # dz/dx
print(y.grad)  # dz/dy
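
To see where these gradients come from, here is a minimal NumPy sketch of the same computation done by hand. It assumes only that NumPy is installed and uses neither tinygrad nor torch; the variable names grad_out, x_grad and y_grad are illustrative. Since z = sum(y @ x), dz/dx is y transposed times a row of ones, and dz/dy is a row of ones times x transposed.

```python
import numpy as np

x = np.eye(3)
y = np.array([[2.0, 0.0, -2.0]])
z = (y @ x).sum()

# The gradient flowing back from sum() is all ones, shaped like y @ x.
grad_out = np.ones((1, 3))

# Matmul backward: each input's gradient uses the other input transposed.
x_grad = y.T @ grad_out   # dz/dx, shape (3, 3)
y_grad = grad_out @ x.T   # dz/dy, shape (1, 3)

print(x_grad)
print(y_grad)
```

Each row of x_grad is constant and equal to the matching entry of y, and y_grad is a row of ones.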

Neural networks?

It turns out, a decent autograd tensor library is 90% of what you need for neural networks. Add an optimizer (SGD, RMSprop, and Adam implemented) from tinygrad.optim, write some boilerplate minibatching code, and you have all you need.

Neural network example (from test/test_mnist.py)

from tinygrad.tensor import Tensor
import tinygrad.optim as optim

class TinyBobNet:
  def __init__(self):
    self.l1 = Tensor.uniform(784, 128)
    self.l2 = Tensor.uniform(128, 10)

  def forward(self, x):
    return x.dot(self.l1).relu().dot(self.l2).logsoftmax()

model = TinyBobNet()
# named `optimizer` so the `optim` module imported above is not shadowed
optimizer = optim.SGD([model.l1, model.l2], lr=0.001)

# ... and complete like pytorch, with (x,y) data

out = model.forward(x)
loss = out.mul(y).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
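
The "(x,y) data" step is the minibatching boilerplate mentioned earlier. Below is a minimal NumPy sketch of one way to build a batch. The helper name sample_minibatch, the random stand-in data, and the choice of a negative one-hot target (so that multiplying log-probabilities by y and taking the mean behaves like a scaled NLL loss) are assumptions for illustration, not code taken from test/test_mnist.py.

```python
import numpy as np

def sample_minibatch(X, Y, batch_size, rng):
    """Pick a random batch and encode labels as negative one-hot rows (hypothetical helper)."""
    idx = rng.integers(0, X.shape[0], size=batch_size)
    # Flatten each 28x28 image into a 784-vector to match the first layer.
    x = X[idx].reshape(batch_size, -1).astype(np.float32)
    # -1 at the target class, 0 elsewhere.
    y = np.zeros((batch_size, 10), dtype=np.float32)
    y[np.arange(batch_size), Y[idx]] = -1.0
    return x, y

rng = np.random.default_rng(0)
X = rng.random((100, 28, 28))        # stand-in for MNIST images
Y = rng.integers(0, 10, size=100)    # stand-in for MNIST labels
x, y = sample_minibatch(X, Y, batch_size=32, rng=rng)
print(x.shape, y.shape)
```

The resulting arrays would then be wrapped as Tensor(x) and Tensor(y) before the forward pass.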

GPU and Accelerator Support

tinygrad supports GPUs through PyOpenCL.

from tinygrad.tensor import Tensor
(Tensor.ones(4,4).gpu() + Tensor.ones(4,4).gpu()).cpu()

ANE Support?!

If all you want to do is ReLU, you are in luck! You can do very fast ReLU (at least 30 MEGAReLUs/sec confirmed)

Requires your Python to be signed with ane/lib/sign_python.sh to add the com.apple.ane.iokit-user-access entitlement, which also requires amfi_get_out_of_my_way=0x1 in your boot-args. Build the library with ane/lib/build.sh

from tinygrad.tensor import Tensor

a = Tensor([-2,-1,0,1,2]).ane()
b = a.relu()
print(b.cpu())

Warning: do not rely on the ANE port. It segfaults sometimes. So if you were doing something important with tinygrad and wanted to use the ANE, you might have a bad time.

Adding an accelerator

You need to support 14 first class ops:

Relu, Log, Exp                  # unary ops
Sum, Max                        # reduce ops (with axis argument)
Add, Sub, Mul, Pow              # binary ops (with broadcasting)
Reshape, Transpose, Slice       # movement ops
Matmul, Conv2D                  # processing ops

While more ops may be added, I think this base is stable.
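
As a rough picture of what "supporting" these ops means, here is a NumPy sketch of reference semantics for each of the 14 names. The REFERENCE_OPS table, the argument conventions (for example how Slice takes its ranges) and the stride-1, unpadded conv2d are assumptions made for illustration; they are not tinygrad's internal interface.

```python
import numpy as np

def conv2d(x, w):
    """Naive stride-1, unpadded convolution. x: (bs, cin, H, W), w: (cout, cin, kh, kw)."""
    bs, cin, H, W = x.shape
    cout, _, kh, kw = w.shape
    out = np.zeros((bs, cout, H - kh + 1, W - kw + 1), dtype=x.dtype)
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            patch = x[:, :, i:i + kh, j:j + kw]   # (bs, cin, kh, kw)
            # Contract channel and kernel axes, leaving (bs, cout).
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return out

# Hypothetical lookup table: op name -> NumPy reference implementation.
REFERENCE_OPS = {
    # unary ops
    "Relu": lambda x: np.maximum(x, 0),
    "Log": np.log,
    "Exp": np.exp,
    # reduce ops (with axis argument)
    "Sum": lambda x, axis=None: x.sum(axis=axis),
    "Max": lambda x, axis=None: x.max(axis=axis),
    # binary ops (NumPy broadcasts these automatically)
    "Add": np.add,
    "Sub": np.subtract,
    "Mul": np.multiply,
    "Pow": np.power,
    # movement ops
    "Reshape": lambda x, shape: x.reshape(shape),
    "Transpose": lambda x, order: x.transpose(order),
    "Slice": lambda x, arg: x[tuple(slice(a, b) for a, b in arg)],
    # processing ops
    "Matmul": np.matmul,
    "Conv2D": conv2d,
}

print(len(REFERENCE_OPS))
print(REFERENCE_OPS["Relu"](np.array([-2, -1, 0, 1, 2])))
```

A new backend would need to provide the same behavior for each entry on its own buffers.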

ImageNet inference

Despite being tiny, tinygrad supports the full EfficientNet. Pass in a picture to discover what it is.

ipython3 examples/efficientnet.py https://upload.wikimedia.org/wikipedia/commons/4/41/Chicken.jpg

Or, if you have a webcam and cv2 installed

ipython3 examples/efficientnet.py webcam

PROTIP: Set "GPU=1" environment variable if you want this to go faster.

PROPROTIP: Set "DEBUG=1" environment variable if you want to see why it's slow.

tinygrad also supports GANs

See examples/mnist_gan.py

The promise of small

tinygrad will always be below 1000 lines. If it isn't, we will revert commits until tinygrad becomes smaller.

Running tests

python3 -m pytest

TODO

  • Train an EfficientNet on ImageNet
  • Add a language model. BERT?
  • Add a detection model. EfficientDet?
  • Reduce code
  • Increase speed
  • Add features
Comments
  • GPU EfficientNet is weirdly slow

    did inference in 0.28 s
                     Mul : 163       29.18 ms
                     Add : 140       25.53 ms
                     Pow :  98       18.43 ms
                   Pad2D :  17       16.97 ms
                  Conv2D :  81       14.49 ms
                 Sigmoid :  65       10.23 ms
                 Reshape : 230        9.94 ms
                     Sub :  49        9.75 ms
               AvgPool2D :  17        5.93 ms
                     Dot :   1        1.06 ms
    

    Run with DEBUG=1 for profiling. Conv2D isn't even close to the top in time users.

    opened by geohot 18
  • made smoler -- and a couple of other things

    I should've done this in multiple branches, but here goes:

    1. made ctx.save_for_backward return its params, which made it possible to cut down on lines by not having an extra line just to save the params
    2. converted a bunch of the backwards into lambdas, which cut down on the lines for un-marshaling from the save
    3. replaced instances of input with _input since it's the name of a Python built-in
    opened by EvanSchalton 12
  • Backward Error Running on Windows Anaconda Environment

    torch forward pass: 20.993 ms
    torch backward pass: 210.071 ms


                                      Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
    

                              aten::addmm_        19.93%      51.973ms        19.93%      51.973ms      30.217us          1720
                  aten::threshold_backward        11.94%      31.134ms        12.04%      31.387ms       3.139ms            10
                           aten::threshold        11.57%      30.184ms        11.65%      30.384ms       3.038ms            10
                aten::thnn_conv2d_backward         9.75%      25.432ms        40.07%     104.499ms      10.450ms            10
                               aten::fill_         9.45%      24.652ms         9.45%      24.652ms      50.310us           490
             aten::max_pool2d_with_indices         8.99%      23.435ms         9.10%      23.720ms       2.372ms            10
                              aten::select         7.79%      20.309ms         9.17%      23.919ms       6.165us          3880
                 aten::thnn_conv2d_forward         5.39%      14.059ms        14.23%      37.100ms       3.710ms            10
    aten::max_pool2d_with_indices_backward         2.61%       6.814ms         6.66%      17.368ms       1.737ms            10
                                  aten::mm         2.47%       6.444ms         2.51%       6.540ms     435.980us            15
    

    Self CPU time total: 260.772ms

    E

    ERROR: test_mnist (__main__.TestConvSpeed)

    Traceback (most recent call last):
      File "c:\Users\Nehad Hirmiz\Documents\Programming\Python\Tutorials\tinygrad\test_speedynet.py", line 83, in test_mnist
        out.backward()
      File "c:\ProgramData\Anaconda3\envs\deeptorch\lib\site-packages\tinygrad\tensor.py", line 68, in backward
        t.backward(False)
      File "c:\ProgramData\Anaconda3\envs\deeptorch\lib\site-packages\tinygrad\tensor.py", line 68, in backward
        t.backward(False)
      File "c:\ProgramData\Anaconda3\envs\deeptorch\lib\site-packages\tinygrad\tensor.py", line 68, in backward
        t.backward(False)
      [Previous line repeated 1 more time]
      File "c:\ProgramData\Anaconda3\envs\deeptorch\lib\site-packages\tinygrad\tensor.py", line 63, in backward
        if g.shape != t.data.shape:
    AttributeError: 'tuple' object has no attribute 'shape'

    opened by BiophysNinja 12
  • Added support for strided max & average pooling

    This pull request adds support for strided max & average pooling, with a numpy solution. It's not optimal, but I don't know if it's possible to do with just reshape ops, so I had to resort to it. The numpy solution is only used if a stride is explicitly specified.

    Alternatively max & avg pooling ops could be first class ops, with CPU & GPU implementations with stride support.

    opened by skoshx 11
  • Currently at 989 lines. Reduce lines while improving readability!

    Remember, this isn't code golf. The goal is to make the code easier to maintain. In general that also means smaller, but smaller code that's harder to maintain can't be merged.

    opened by geohot 11
  • EfficientNet runs slower on GPU than CPU

    Running EfficientNet in examples/efficientnet.py runs slower on the GPU than CPU for some reason. Benchmarks:

    PYTHONPATH=. GPU=1 python3.8 examples/efficientnet.py https://image.shutterstock.com/image-illustration/compact-white-car-3d-render-260nw-405716083.jpg

    Output (GPU):

    656 7.561172 minivan
    did inference in 1.13 s
    

    PYTHONPATH=. python3.8 examples/efficientnet.py https://image.shutterstock.com/image-illustration/compact-white-car-3d-render-260nw-405716083.jpg

    Output (CPU):

    656 7.5611706 minivan
    did inference in 0.71 s
    

    What could be causing this? I'm running this with Python 3.8 on a MacBook Pro 2018 with Intel Iris Plus Graphics 1536 MB running macOS Catalina.

    opened by skoshx 10
  • Towards a faster (sum) reduce

    I looked into an optimized sum reduce following https://dournac.org/info/gpu_sum_reduction

    • This implementation uses local memory and does reduction by a size sz up to 256 using the above algorithm.
    • If the reduction size sz > 256, it computes partial sums over groups ~ sz/256 work groups and creates an intermediate output of size (osize, groups), which is reduced recursively by summing over groups.
    • It gets fast for long reductions, for example: testing [100000000] torch/tinygrad fp: 40.46 / 0.94 ms. That same test crashes after 5 minutes on the current implementation.
    • This loses most of the ability to modify the extra code pieces. Some can likely be built in, but it is only ever used in logsoftmax, which I think should really be a second kind op.
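
The staged reduction described in the bullets above can be pictured with a small CPU-side NumPy sketch. This is not the OpenCL kernel from the pull request: the function name chunked_sum is hypothetical, it handles only a flat 1-D sum, and it uses a loop where the description uses recursion. It only illustrates reducing in groups of 256 and then reducing the partial sums again.

```python
import numpy as np

def chunked_sum(x, chunk=256):
    """Sum a 1-D array by repeatedly reducing fixed-size groups (illustrative only)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    while x.size > chunk:
        groups = -(-x.size // chunk)              # ceil(size / chunk)
        padded = np.zeros(groups * chunk, dtype=x.dtype)
        padded[:x.size] = x                       # zero padding does not change the sum
        x = padded.reshape(groups, chunk).sum(axis=1)   # one partial sum per group
    return x.sum()                                # final pass over at most `chunk` values

total = chunked_sum(np.arange(1000))
print(total)
```

Each pass shrinks the problem by a factor of about 256, so even very long inputs need only a few passes.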

    Just want to get an opinion

    ToDo's

    • needs to be cleaned up a lot but passes test as far as I can see.
    • breaks backward in serious_mnist and there is a bug for general sums if sz>256
    • needs better test, definitely does not work in general
    • loop over recursion?
    opened by marcelbischoff 9
  • refactor/softmax

    Was just reading through the library and trying to understand it. Made some changes that might/might-not be worth it! Let me know what you guys think.

    Generalized logsoftmax and sigmoid with softmax.

    • 0.3% fewer lines 😊
    • fewer calculations
      • we need to calculate the gradient of logsoftmax in backward pass
      • previous impl does this by exponentiating the forward pass output -> this means the exp of a log of an exp
      • instead we can remove one of the exponentiation ops by going straight out of logspace
        • by calculating the softmax
          • taking the log for the forward pass
          • reusing the softmax in the backward pass -> since softmax is the derivative of logsumexp
    • more numerically stable (log of softmax vs exp of logsumexp)
    • maybe bad because more memory for sigmoid padding
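
The reuse described in the bullets above can be sketched in NumPy as follows. This is an illustration rather than the code from the pull request: the function names are hypothetical, inputs are assumed to be 2-D with classes on axis 1, and a finite-difference check stands in for a real test. The forward pass computes the softmax once and takes its log; the backward pass reuses that softmax, since the gradient of logsoftmax is grad_output - softmax * sum(grad_output).

```python
import numpy as np

def softmax(x):
    # Subtract the row max before exponentiating to avoid overflow.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def logsoftmax_forward(x):
    sm = softmax(x)
    return np.log(sm), sm            # keep the softmax for the backward pass

def logsoftmax_backward(grad_output, sm):
    # Reuses the saved softmax; no extra exp is needed here.
    return grad_output - sm * grad_output.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5))
w = rng.normal(size=(2, 5))          # arbitrary upstream gradient
out, sm = logsoftmax_forward(x)
analytic = logsoftmax_backward(w, sm)

# Central finite differences on L = sum(w * logsoftmax(x)).
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        d = np.zeros_like(x)
        d[i, j] = eps
        plus = (w * logsoftmax_forward(x + d)[0]).sum()
        minus = (w * logsoftmax_forward(x - d)[0]).sum()
        numeric[i, j] = (plus - minus) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```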

    References:

    • https://math.stackexchange.com/a/2340848
    • https://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/
    • https://blog.feedly.com/tricks-of-the-trade-logsumexp/
    opened by iainwo 9
  • Print out the tensor using numpy().

    This commit resolves issue https://github.com/geohot/tinygrad/issues/453

    When the example code in the README.md is run, it prints the tinygrad tensors as:

    <Tensor <LB (3, 3) op:MovementOps.RESHAPE> with grad None>
    <Tensor <LB (1, 3) op:MovementOps.RESHAPE> with grad None>

    But to be equivalent to the output of the Torch example, we need to use numpy() to get it to show:

    [[ 2.  2.  2.]
     [ 0.  0.  0.]
     [-2. -2. -2.]]
    [[1. 1. 1.]]

    opened by faisalmemon 0
  • Example code doesn't show actual tensor

    When you run the README.md example for tiny grad:

    from tinygrad.tensor import Tensor
    
    x = Tensor.eye(3, requires_grad=True)
    y = Tensor([[2.0,0,-2.0]], requires_grad=True)
    z = y.matmul(x).sum()
    z.backward()
    
    print(x.grad)  # dz/dx
    print(y.grad)  # dz/dy
    

    It prints out

    <Tensor <LB (3, 3) op:MovementOps.RESHAPE> with grad None>
    <Tensor <LB (1, 3) op:MovementOps.RESHAPE> with grad None>
    

    but the gradients should be printed via numpy, e.g. use x.grad.numpy() and y.grad.numpy() to get the output

    [[ 2.  2.  2.]
     [ 0.  0.  0.]
     [-2. -2. -2.]]
    [[1. 1. 1.]]
    
    opened by faisalmemon 0
  • Suggestion to add type hints to TinyGrad

    I want to suggest adding type hints to the TinyGrad library to improve code readability and maintainability.

    For example, consider the following without type hints:

    class Tensor:
      training, no_grad = False, False
    
      def __init__(self, data, device=Device.DEFAULT, requires_grad=None):
    

    Without type hints, it is not clear what types of values data should be. This can lead to confusion and potential bugs if the class is created with values of the wrong type.

    On the other hand, once type hints are added, the expected types become much clearer:

    class Tensor:
      training: bool = False
      no_grad: bool = False
    
      def __init__(self, data: Union[list, np.ndarray, LazyBuffer], device: Device=Device.DEFAULT, requires_grad: Optional[bool]=None):
    

    Overall, I believe that adding type hints to TinyGrad would significantly improve the readability and maintainability of the library, and would greatly benefit anyone working with it.

    If you think it would be useful, I'm happy to start adding them to the library.

    Love what you're doing here. Thanks for reading!

    opened by 0xArty 0
  • python3 examples/yolov3.py: assert self.lazydata.realized is None AssertionError

    python3 examples/yolov3.py:

    ...
    Modules length: 24
    Loading weights file (237MB). This might take a while…
    running inference…
    Traceback (most recent call last):
      File "/home/dh/tinygrad/examples/yolov3.py", line 628, in <module>
        prediction = infer(model, img)
      File "/home/dh/tinygrad/examples/yolov3.py", line 244, in infer
        prediction = model.forward(Tensor(img.astype(np.float32)))
      File "/home/dh/tinygrad/examples/yolov3.py", line 561, in forward
        x = predict_transform(x, inp_dim, anchors, num_classes)
      File "/home/dh/tinygrad/examples/yolov3.py", line 321, in predict_transform
        prediction.gpu_()
      File "/home/dh/tinygrad/tinygrad/tensor.py", line 74, in to_
        assert self.lazydata.realized is None
    AssertionError

    What is the problem? thank you!

    opened by dh-bccw 2
Releases(v0.4.0)
  • v0.4.0(Nov 8, 2022)

    So many changes since 0.3.0

    Fairly stable and correct, though still not fast. The hlops/mlops are solid, just needs work on the llops.

    The first automated release, so hopefully it works?

Owner
George Hotz
We will win self driving cars.