TensorFlow ROCm port

Overview


TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications.

TensorFlow was originally developed by researchers and engineers working on the Google Brain team within Google's Machine Intelligence Research organization to conduct machine learning and deep neural networks research. The system is general enough to be applicable in a wide variety of other domains, as well.

TensorFlow provides stable Python and C++ APIs, as well as APIs for other languages that come without a backward-compatibility guarantee.

Keep up to date with release announcements and security updates by subscribing to announce@tensorflow.org. See all the mailing lists.

TensorFlow ROCm port

Please follow the instructions here to set up your ROCm stack. A Docker container, rocm/tensorflow:latest (https://hub.docker.com/r/rocm/tensorflow/), is readily available:

alias drun='sudo docker run \
      -it \
      --network=host \
      --device=/dev/kfd \
      --device=/dev/dri \
      --ipc=host \
      --shm-size 16G \
      --group-add video \
      --cap-add=SYS_PTRACE \
      --security-opt seccomp=unconfined \
      -v $HOME/dockerx:/dockerx'

drun rocm/tensorflow:latest

We maintain tensorflow-rocm whl packages on PyPI here. To install the tensorflow-rocm package using pip:

# Install some ROCm dependencies
sudo apt install rocm-libs rccl

# Pip3 install the whl package from PyPI
pip3 install --user tensorflow-rocm --upgrade

For details on the TensorFlow ROCm port, please see the ROCm-specific README file.

Install

See the TensorFlow install guide for the pip package, to enable GPU support, use a Docker container, and build from source.

To install the current release, which includes support for CUDA-enabled GPU cards (Ubuntu and Windows):

$ pip install tensorflow

A smaller CPU-only package is also available:

$ pip install tensorflow-cpu

To update TensorFlow to the latest version, add the --upgrade flag to the above commands.

Nightly binaries are available for testing using the tf-nightly and tf-nightly-cpu packages on PyPI.
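
Because several differently named packages (tensorflow, tensorflow-cpu, tensorflow-rocm, tf-nightly, tf-nightly-cpu) provide the same tensorflow module, it can help to check which of them is actually installed. The following is a minimal sketch using only the Python standard library. It assumes Python 3.8 or newer (for importlib.metadata), and the helper name installed_tensorflow_packages is hypothetical, not part of TensorFlow.

```python
from importlib import metadata

# Distribution names mentioned in this README.
CANDIDATES = (
    "tensorflow",
    "tensorflow-cpu",
    "tensorflow-rocm",
    "tf-nightly",
    "tf-nightly-cpu",
)


def installed_tensorflow_packages(candidates=CANDIDATES):
    """Return {distribution name: version} for each candidate that is installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the current environment.
            pass
    return found


if __name__ == "__main__":
    packages = installed_tensorflow_packages()
    if not packages:
        print("No TensorFlow distribution found in this environment.")
    for name, version in packages.items():
        print(f"{name}=={version}")
```

If more than one distribution is listed, the environment probably contains conflicting installs, and uninstalling all but one is usually the safer choice.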

Try your first TensorFlow program

$ python
>>> import tensorflow as tf
>>> tf.add(1, 2).numpy()
3
>>> hello = tf.constant('Hello, TensorFlow!')
>>> hello.numpy()
b'Hello, TensorFlow!'

For more examples, see the TensorFlow tutorials.

Contribution guidelines

If you want to contribute to TensorFlow, be sure to review the contribution guidelines. This project adheres to TensorFlow's code of conduct. By participating, you are expected to uphold this code.

We use GitHub issues for tracking requests and bugs. Please see TensorFlow Discuss for general questions and discussion, and direct specific questions to Stack Overflow.

The TensorFlow project strives to abide by generally accepted best practices in open-source software development:

Fuzzing Status, CII Best Practices, Contributor Covenant

Continuous build status

Official Builds

Build Type | Status | Artifacts
Linux CPU | Status | PyPI
Linux GPU | Status | PyPI
Linux XLA | Status | TBA
macOS | Status | PyPI
Windows CPU | Status | PyPI
Windows GPU | Status | PyPI
Android | Status | Download
Raspberry Pi 0 and 1 | Status | Py3
Raspberry Pi 2 and 3 | Status | Py3
Libtensorflow MacOS CPU | Status | Nightly GCS, Official GCS
Libtensorflow Linux CPU | Status | Nightly GCS, Official GCS
Libtensorflow Linux GPU | Status | Nightly GCS, Official GCS
Libtensorflow Windows CPU | Status | Nightly GCS, Official GCS
Libtensorflow Windows GPU | Status | Nightly GCS, Official GCS

Community Supported Builds

Build Type | Status | Artifacts
Linux AMD ROCm GPU Nightly | Build Status | Nightly
Linux AMD ROCm GPU Stable Release | Build Status | Release 1.15 / 2.x
Linux s390x Nightly | Build Status | Nightly
Linux s390x CPU Stable Release | Build Status | Release
Linux ppc64le CPU Nightly | Build Status | Nightly
Linux ppc64le CPU Stable Release | Build Status | Release 1.15 / 2.x
Linux ppc64le GPU Nightly | Build Status | Nightly
Linux ppc64le GPU Stable Release | Build Status | Release 1.15 / 2.x
Linux aarch64 CPU Nightly (Linaro) | Build Status | Nightly
Linux aarch64 CPU Stable Release (Linaro) | Build Status | Release 1.x & 2.x
Linux aarch64 CPU Nightly (OpenLab), Python 3.6 | Build Status | Nightly
Linux aarch64 CPU Stable Release (OpenLab) | Build Status | Release 1.15 / 2.x
Linux CPU with Intel oneAPI Deep Neural Network Library (oneDNN) Nightly | Build Status | Nightly
Linux CPU with Intel oneAPI Deep Neural Network Library (oneDNN) Stable Release | Build Status | Release 1.15 / 2.x
Red Hat® Enterprise Linux® 7.6 CPU & GPU, Python 2.7, 3.6 | Build Status | 1.13.1 PyPI

Community Supported Containers

Container Type | Status | Artifacts
TensorFlow aarch64 Neoverse-N1 CPU Stable (Linaro), Debian | Static | Release 2.3

Resources

Learn more about the TensorFlow community and how to contribute.

License

Apache License 2.0

Comments
  • Seemingly random shape error during gradient calculation


    edit: Important point I forgot to mention: I did not encounter this issue with the CUDA backend.


    System information

    • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mint 19.1
    • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    • TensorFlow installed from (source or binary): binary (pypi)
    • TensorFlow version (use command below): v1.12.0-871-gf480b4a 1.12.0
    • Python version: 3.6.7
    • Bazel version (if compiling from source):
    • GCC/Compiler version (if compiling from source):
    • ROCm/MIOpen version: Rocm: 2.1.96, MiOpen: 1.7.1 (both installed through apt)
    • GPU model and memory: Radeon VII, 16GB (gfx906)

    You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

    Describe the current behavior

    After training a model for a variable number of epochs, the program throws an exception because of incompatible shapes during gradient calculation for a tile op inside a tf.while_loop. The exception occurs inside the _TileGrad method, which interleaves the multiples and the shapes of the original tile op by stacking, transposing and reshaping. From the behaviour I could observe by printing the input tensors and intermediate steps in _TileGrad, something seems to go wrong during the interleaving. The interleaved shape at times ends up as nonsense like [949434578 -1198049073 1 16 1 25], while something like [50 1 1 21 1 25] would be expected.
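
The interleaving described above can be illustrated without TensorFlow. The following is a plain-Python sketch of the stack, transpose and reshape steps, not the actual _TileGrad source. The helper name interleave_tile_shape is hypothetical, and the example values assume multiples of [50, 1, 1] and an input shape of [1, 21, 25], chosen so that the result matches the expected shape quoted in the report.

```python
def interleave_tile_shape(multiples, input_shape):
    """Mimic stack -> transpose -> reshape(-1) on two equal-length shape vectors."""
    if len(multiples) != len(input_shape):
        raise ValueError("multiples and input_shape must have the same rank")
    # stack: [[m0, m1, m2], [s0, s1, s2]]
    stacked = [list(multiples), list(input_shape)]
    # transpose: [[m0, s0], [m1, s1], [m2, s2]]
    transposed = [list(pair) for pair in zip(*stacked)]
    # reshape(-1): [m0, s0, m1, s1, m2, s2]
    split_shape = [value for pair in transposed for value in pair]
    # The even positions hold the multiples; the gradient sums over those axes.
    reduce_axes = list(range(0, len(split_shape), 2))
    return split_shape, reduce_axes


split_shape, reduce_axes = interleave_tile_shape([50, 1, 1], [1, 21, 25])
print(split_shape)   # [50, 1, 1, 21, 1, 25]
print(reduce_axes)   # [0, 2, 4]
```

Every entry of a healthy interleaved shape is a small positive integer, so values such as 949434578 or -1198049073 suggest that the inputs to this step were already corrupted.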

    The output of the transpose at one of these exceptions was:

     [[1036548730 1061580315]
     [-1110934980 -1085778476]
     [-1085903306 1061705196]]
    

    resulting in the following interleaved shape: [1036548730 1061580315 -1110934980 -1085778476 -1085903306 1061705196]
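
One way to examine such garbage values is to reinterpret each 32-bit integer as an IEEE 754 float32. The sketch below uses only the standard library, and the helper name int32_bits_as_float32 is hypothetical. If the results look like ordinary small floats, that would be consistent with the int32 shape tensor having been overwritten by floating-point data. This is a hypothesis for debugging, not something the report establishes.

```python
import struct


def int32_bits_as_float32(value):
    """Reinterpret the bit pattern of a signed 32-bit integer as a float32."""
    return struct.unpack("<f", struct.pack("<i", value))[0]


# Values copied from the transpose output quoted above.
GARBAGE_SHAPE = [
    1036548730, 1061580315,
    -1110934980, -1085778476,
    -1085903306, 1061705196,
]

for value in GARBAGE_SHAPE:
    print(value, "->", int32_bits_as_float32(value))
```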

    I wasn't able to find the related stack output or input shapes, so I can't tell if the shape error is caused by something further upstream. My reply to this issue includes an example with parallel_iterations=1, including all the steps.

    A full stacktrace can be found at the bottom of this issue.

    The error is somewhat hard to reproduce and seems to happen at random. I don't believe it is directly related to tf.while_loop, as the exception never occurred in an RNN layer.

    Describe the expected behavior

    No InvalidArgumentError during gradient calculation.

    Code to reproduce the issue

    I ran this code for about 25 minutes before the exception happened. It might not be the minimal code required to reproduce the error, but since it is not reliably reproducible I can't narrow it down easily.

    import tensorflow as tf
    import numpy as np
    
    def loop_cond_dist(i, _l, hs, __ow, _dist):
        return tf.less(i, tf.shape(hs)[1])
    
    
    def loop_body_dist(i, l, hs, out_weights, dist_lookup):
        dists = tf.nn.embedding_lookup(dist_lookup, tf.clip_by_value(tf.range(1, limit=tf.shape(hs)[1] - i + 1), 0, 50))
        dists = tf.expand_dims(dists, axis=0)
        dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1]) #Error seems to happen in gradients for this op
        cur = tf.einsum('ijk,kl -> ijl', dists, out_weights, name="out_mul")
        pre_pad = tf.zeros([tf.shape(l)[0], tf.shape(l)[1] - tf.reduce_sum(tf.range(tf.shape(hs)[1] - i + 1)), 2])
        post_pad = tf.zeros([tf.shape(l)[0], tf.reduce_sum(tf.range(tf.shape(hs)[1] - i)), 2])
        cur = tf.concat([pre_pad, cur, post_pad], axis=1)
        i += 1
        return i, tf.add(l, cur), hs, out_weights, dist_lookup
    
    def build():
        dist_lookup = tf.get_variable('distance_embeds', dtype=tf.float32, shape=[51, 25])
        hs = tf.placeholder(dtype=tf.float32, shape=[None, None, 50])
        out_weights = tf.get_variable('out_weights', dtype=tf.float32, shape=[25, 2])
        logits = tf.zeros([50, tf.cast(((tf.shape(hs)[1] * tf.shape(hs)[1]) - tf.shape(hs)[1]) / 2, dtype=tf.float32), 2])
        loop_vars = [1, logits, hs, out_weights, dist_lookup]
        logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits')[1]
    
        targets = tf.placeholder(tf.int32)
    
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
        train = tf.train.AdamOptimizer(0.005).minimize(loss)
        return train, targets, hs
    
    if __name__ == "__main__":
        with tf.Session() as sess:
            train, y, hs = build()
            sess.run([tf.global_variables_initializer()])
            while True:
                timesteps = np.random.randint(low=1, high=150)
                targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
                rand_hs = np.random.rand(50, timesteps, 50)
                _ = sess.run([train], {y: targets, hs: rand_hs})
    


    Other info / logs

    --------------------------------------------------------------------------
    InvalidArgumentError                      Traceback (most recent call last)
    ~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
       1333     try:
    -> 1334       return fn(*args)
       1335     except errors.OpError as e:
    
    ~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
       1318       return self._call_tf_sessionrun(
    -> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
       1320 
    
    ~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
       1406         self._session, options, feed_dict, fetch_list, target_list,
    -> 1407         run_metadata)
       1408 
    
    InvalidArgumentError: Size 2 must be non-negative, not -1110934980
    	 [[{{node gradients/clause_logits/Tile_grad/Reshape_1}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
    	 [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
    
    During handling of the above exception, another exception occurred:
    
    InvalidArgumentError                      Traceback (most recent call last)
    ~/.cargo/toponn/python/bug.py in <module>
         45             targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
         46             rand_hs = np.random.rand(50, timesteps, 50)
    ---> 47             _ = sess.run([train], {y: targets, hs: rand_hs})
         48 
    
    ~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
        927     try:
        928       result = self._run(None, fetches, feed_dict, options_ptr,
    --> 929                          run_metadata_ptr)
        930       if run_metadata:
        931         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
    
    ~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
       1150     if final_fetches or final_targets or (handle and feed_dict_tensor):
       1151       results = self._do_run(handle, final_targets, final_fetches,
    -> 1152                              feed_dict_tensor, options, run_metadata)
       1153     else:
       1154       results = []
    
    ~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
       1326     if handle is None:
       1327       return self._do_call(_run_fn, feeds, fetches, targets, options,
    -> 1328                            run_metadata)
       1329     else:
       1330       return self._do_call(_prun_fn, handle, feeds, fetches)
    
    ~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
       1346           pass
       1347       message = error_interpolation.interpolate(message, self._graph)
    -> 1348       raise type(e)(node_def, op, message)
       1349 
       1350   def _extend_graph(self):
    
    InvalidArgumentError: Size 2 must be non-negative, not -1110934980
    	 [[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
    	 [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
    
    Caused by op 'gradients/clause_logits/Tile_grad/Reshape_1', defined at:
      File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
        sys.exit(start_ipython())
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/__init__.py", line 125, in start_ipython
        return launch_new_instance(argv=argv, **kwargs)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
        app.initialize(argv)
      File "</home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/decorator.py:decorator-gen-112>", line 2, in initialize
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
        return method(app, *args, **kwargs)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/terminal/ipapp.py", line 323, in initialize
        self.init_code()
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 288, in init_code
        self._run_cmd_line_code()
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 408, in _run_cmd_line_code
        self._exec_file(fname, shell_futures=True)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 340, in _exec_file
        raise_exceptions=True)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2683, in safe_execfile
        self.compile if shell_futures else None)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/utils/py3compat.py", line 188, in execfile
        exec(compiler(f.read(), fname, 'exec'), glob, loc)
    
      File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
        train, y, hs = build()
      File "/home/seb/.cargo/toponn/python/bug.py", line 34, in build
        train = tf.train.AdamOptimizer(0.005).minimize(loss)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 400, in minimize
        grad_loss=grad_loss)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 519, in compute_gradients
        colocate_gradients_with_ops=colocate_gradients_with_ops)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 674, in gradients
        unconnected_gradients)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in _GradientsHelper
        lambda: grad_fn(op, *out_grads))
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 409, in _MaybeCompile
        return grad_fn()  # Exit early
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in <lambda>
        lambda: grad_fn(op, *out_grads))
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 599, in _TileGrad
        input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6482, in reshape
        "Reshape", tensor=tensor, shape=shape, name=name)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
        op_def=op_def)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
        return func(*args, **kwargs)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
        op_def=op_def)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
        self._traceback = tf_stack.extract_stack()
    
    ...which was originally created as op 'clause_logits/Tile', defined at:
      File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
        sys.exit(start_ipython())
    [elided 10 identical lines from previous traceback]
      File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
        train, y, hs = build()
      File "/home/seb/.cargo/toponn/python/bug.py", line 29, in build
        logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits', parallel_iterations=250)[1]
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3295, in while_loop
        return_same_structure)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3007, in BuildLoop
        pred, body, original_loop_vars, loop_vars, shape_invariants)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2942, in _BuildLoop
        body_result = body(*packed_vars_for_body)
      File "/home/seb/.cargo/toponn/python/bug.py", line 13, in loop_body_dist
        dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1])
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8805, in tile
        "Tile", input=input, multiples=multiples, name=name)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
        op_def=op_def)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
        return func(*args, **kwargs)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
        op_def=op_def)
      File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
        self._traceback = tf_stack.extract_stack()
    
    InvalidArgumentError (see above for traceback): Size 2 must be non-negative, not -1110934980
    	 [[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
    	 [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
    


    bug 
    opened by sebpuetz 130
  • Tensorflow 2.0 AMD support


    Does TensorFlow 2.0 work with the AMD Radeon VII?

    Also, if it is available, are there any benchmark comparisons with the 2080Ti on some standard network, to see whether we should invest in Radeon VII clusters?

    opened by Cvikli 58
  • Memory access fault by GPU node-1 (Agent handle: 0x2e0dbf0) on address 0x6dccc0000. Reason: Page not present or supervisor privilege.


    Hello guys,

    I am having an issue running ROCm TensorFlow. Details follow:

    System information

    • Have I written custom code: No. I am trying to run these Keras TensorFlow projects: Keras Mask RCNN (https://github.com/matterport/Mask_RCNN) and Keras SSD (https://github.com/pierluigiferrari/ssd_keras)
    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.1 LTS
    • TensorFlow installed from whl package : pip3 install --user tensorflow-rocm
    • TensorFlow version (use command below): 1.12
    • Python version: 3.6.7
    • ROCM version : 2.0
    • CPU Memory: 16GB
    • GPU model and memory: Radeon RX 580 8 GB, recognized as: name: Ellesmere [Radeon RX 470/480] AMDGPU ISA: gfx803 memoryClockRate (GHz) 1.34 pciBusID 0000:01:00.0 Total memory: 8.00GiB Free memory: 7.75GiB

    Describe the current behavior

    Epoch 1/30
    2019-01-29 22:25:46.392668: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
    2019-01-29 22:25:46.446704: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
    Memory access fault by GPU node-1 (Agent handle: 0x2e0dbf0) on address 0x6dccc0000. Reason: Page not present or supervisor privilege.
    Aborted (core dumped)

    Describe the expected behavior

    Training runs normally until epoch 30/30.

    Code to reproduce the issue

    Keras Mask RCNN: python3 platno.py train --dataset=/home/path/to/dataset --weights=coco always fails with the core dump shown above.

    Keras SSD: python3 ssd300_training.py runs normally after lowering the batch size from 32 to 8.

    python3 ssd7_training.py dumps core even with the batch size lowered to 1.

    Other info / logs

    I tried setting the following environment variables for debugging, but still get the error: HSA_ENABLE_SDMA=0 HSA_ENABLE_INTERRUPT=0 HSA_SVM_GUARD_PAGES=0 HSA_DISABLE_CACHE=1
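
For anyone repeating this experiment from a Python script rather than the shell, the sketch below sets the same variables before TensorFlow is imported. It assumes, reasonably but without confirmation from the report, that the ROCm runtime reads these variables when it is first loaded. The helper name apply_hsa_debug_env is hypothetical.

```python
import os

# Debug settings quoted in the report; their effect depends on the ROCm release.
HSA_DEBUG_ENV = {
    "HSA_ENABLE_SDMA": "0",
    "HSA_ENABLE_INTERRUPT": "0",
    "HSA_SVM_GUARD_PAGES": "0",
    "HSA_DISABLE_CACHE": "1",
}


def apply_hsa_debug_env(env=os.environ):
    """Set each variable unless the caller has already exported a value."""
    for name, value in HSA_DEBUG_ENV.items():
        env.setdefault(name, value)
    return {name: env[name] for name in HSA_DEBUG_ENV}


if __name__ == "__main__":
    print(apply_hsa_debug_env())
    # Import TensorFlow only after this point so the settings are in place, e.g.:
    # import tensorflow as tf
```

Using setdefault keeps any value already exported in the shell, which makes it easy to toggle a single variable from the command line while leaving the others at the values above.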

    Please advise on how to resolve this problem.

    Thanks and Regards

    bug gfx803 
    opened by fendiwira 44
  • errors in pin-in-place path in HCC unpinned copy engine


    Using latest develop-upstream branch and latest benchmarks master. Running the tf_cnn_benchmarks.py code like so:

    python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu
    

    Eventually produces during warmup the following message

    terminate called after throwing an instance of 'Kalmar::runtime_exception'
      what():  HCC unpinned copy engine error
    Aborted (core dumped)
    

    If you set --local_parameter_device=gpu instead, the problem doesn't manifest.

    However, the problem happens again even with --local_parameter_device=gpu during distributed training. Running 1 worker and 1 server like so:

    # worker
    python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=worker --task_index=0 --server_protocol=grpc
    # ps
    python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=ps --task_index=0 --server_protocol=grpc
    

    At least with the distributed training, my guess is that tensors are moving from GPU to CPU prior to being packed into protobufs and shipped via grpc. Not sure why this is also happening during warm-up except that I specified the parameter device to be CPU, forcing a device to host copy for storing the params.

    misc system info

    c++ (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609

    lscpu

    AMD EPYC 7551 32-Core Processor

    uname -a

    Linux prj47-rack-02 4.13.0-43-generic #48~16.04.1-Ubuntu SMP Thu May 17 12:56:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

    LD_LIBRARY_PATH /home/jdaily/openmpi-3.1.0-install/lib
    DYLD_LIBRARY_PATH is unset

    rocm-clang-ocl/Ubuntu 16.04,now 0.3.0-c1b678e amd64 [installed,automatic]
    rocm-dev/Ubuntu 16.04,now 1.8.151 amd64 [installed]
    rocm-device-libs/Ubuntu 16.04,now 0.0.1 amd64 [installed]
    rocm-dkms/Ubuntu 16.04,now 1.8.151 amd64 [installed]
    rocm-libs/Ubuntu 16.04,now 1.8.151 amd64 [installed]
    rocm-opencl/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
    rocm-opencl-dev/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
    rocm-profiler/Ubuntu 16.04,now 5.4.6797 amd64 [installed]
    rocm-smi/Ubuntu 16.04,now 1.0.0-42-g0ae1c36 amd64 [installed,automatic]
    rocm-utils/Ubuntu 16.04,now 1.8.151 amd64 [installed]
    rocminfo/now 1.0.7 amd64 [installed,local]

    opened by jeffdaily 39
  • Crash when performing inference


    System information

    • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): I suppose. I'm using a software package that uses tensorflow-gpu under the hood, but I manually installed tensorflow-rocm into their environment.

    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04

    • TensorFlow installed from (source or binary): from pypi binary

    • TensorFlow version (use command below): 1.14.1 (Problem also occurs on 1.14.0, I can't test it on 1.13.x)

    • Python version: 3.6.2

    • ROCm/MIOpen version:

    miopen-hip/Ubuntu 16.04,now 2.0.1.7405-rocm-rel-2.7-22-4e39a83 amd64 [installed]
      AMD's DNN Library
    
    miopen-opencl/Ubuntu 16.04 2.0.1.7405-rocm-rel-2.7-22-4e39a83 amd64
      AMD's DNN Library
    
    miopengemm/Ubuntu 16.04,now 1.1.6.645-rocm-rel-2.7-22-6275a87 amd64 [installed]
      A tool for generating OpenCL matrix multiplication (GEMM) kernels
    
    • GPU model and memory: Vega 7 16GB

    Describe the current behavior

    I don't know the entire lingo, as I'm new to all of this and I didn't implement any of the tensorflow stuff.

    So I used a software package that uses tensorflow-gpu to perform deep learning. My colleagues have generated a few networks, and they work on their machines and on others that have an NVIDIA card.

    When I try to use those trained networks for inference on my computer with tensorflow-rocm, it crashes my computer: the machine reboots itself.

    The networks are saved in h5 format. I haven't tried just generating a new network and training a new network.

    Describe the expected behavior

    It should not crash my whole computer; at worst it should only crash Python.

    Other info / logs

    It's been a while since I installed ROCm, so I don't remember how I did it, but is it normal that my MIOpen packages are labelled Ubuntu 16.04 when I'm on Ubuntu 18.04?

    The Vega is also driving my desktop environment.

    miopen 
    opened by thejinx0r 30
  • Unable to find a suitable algorithm for doing forward convolution


    Hi, I get a weird error, "Unable to find a suitable algorithm for doing forward convolution", when I run the session. From what I understand, a kernel is compiled with -DLOCAL_MEM_SIZE=19008, which does not come from my code. Even with a batch size of 1 I get the same error.

    ml_1  | 2018-08-23 21:03:11.045474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1451] Found device 0 with properties:
    ml_1  | name: Device 687f
    ml_1  | AMDGPU ISA: gfx900
    ml_1  | memoryClockRate (GHz) 1.63
    ml_1  | pciBusID 0000:0c:00.0
    ml_1  | Total memory: 7.98GiB
    ml_1  | Free memory: 7.73GiB
    ml_1  | 2018-08-23 21:03:11.045489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Adding visible gpu devices: 0
    ml_1  | 2018-08-23 21:03:11.045503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Device interconnect StreamExecutor with strength 1 edge matrix:
    ml_1  | 2018-08-23 21:03:11.045510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:995]      0
    ml_1  | 2018-08-23 21:03:11.045516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1008] 0:   N
    ml_1  | 2018-08-23 21:03:11.045547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1124] Created TensorFlow device (/device:GPU:0 with 7524 MB memory) -> physical GPU (device: 0, name: Device 687f, pci bus id: 0000:0c:00.0)
    ml_1  | 2018-08-23 21:03:26.581328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Adding visible gpu devices: 0
    ml_1  | 2018-08-23 21:03:26.581382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Device interconnect StreamExecutor with strength 1 edge matrix:
    ml_1  | 2018-08-23 21:03:26.581396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:995]      0
    ml_1  | 2018-08-23 21:03:26.581407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1008] 0:   N
    ml_1  | 2018-08-23 21:03:26.581440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1124] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7524 MB memory) -> physical GPU (device: 0, name: Device 687f, pci bus id: 0000:0c:00.0)
    ml_1  | 2018-08-23 21:04:20.430885: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
    ml_1  | 2018-08-23 21:04:20.495395: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
    ml_1  | 2018-08-23 21:04:20.557689: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
    ml_1  | error: local memory limit exceeded (76032) in Im2Col
    ml_1  | MIOpen Error: /data/repo/MIOpen/src/tmp_dir.cpp:18: Can't execute cd /tmp/miopen-MIOpenUtilKernels.cl-faa6-605d-295b-fc2e; /opt/rocm/bin/clang-ocl  -DNUM_CH_PER_WG=1 -DNUM_IM_BLKS_X=3 -DNUM_IM_BLKS=9 -DLOCAL_MEM_SIZE=19008 -DSTRIDE_GT_1=1 -DTILE_SZ_X=32 -DTILE_SZ_Y=8 -DUSE_IM_OFF_GUARD=1 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -mcpu=gfx900 -Wno-everything MIOpenUtilKernels.cl -o /tmp/miopen-MIOpenUtilKernels.cl-faa6-605d-295b-fc2e/MIOpenUtilKernels.cl.o
    ml_1  | 2018-08-23 21:04:20.879002: F tensorflow/stream_executor/rocm/rocm_dnn.cc:1803] Check failed: status == miopenStatusSuccess (7 vs. 0)Unable to find a suitable algorithm for doing forward convolution
    ml_1  | [I 21:04:21.291 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
    ml_1  | WARNING:root:kernel 0d0fea33-23e8-4e97-8fa9-0bda0c19ea6f restarted
    ml_1  | [I 21:04:37.435 NotebookApp] Saving file at /Road Segmentation.ipynb
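
The two numbers in the log above appear to be consistent with each other. Assuming LOCAL_MEM_SIZE counts 4-byte FP32 elements (an assumption, suggested by -DMIOPEN_USE_FP32=1 on the same compile line), 19008 elements occupy exactly the 76032 bytes named in the "local memory limit exceeded" message. The 64 KiB per-workgroup limit used below is likewise an assumption about gfx900 and should be confirmed with rocminfo. A minimal sketch of the arithmetic:

```python
# Sketch only: relates -DLOCAL_MEM_SIZE on the MIOpen compile line to the
# byte count in the compiler error. Both constants are assumptions.
FP32_BYTES = 4               # assumed element size (MIOPEN_USE_FP32=1)
LDS_LIMIT_BYTES = 64 * 1024  # assumed per-workgroup local memory limit


def local_mem_bytes(num_elements, element_bytes=FP32_BYTES):
    """Bytes of local memory needed for num_elements elements."""
    return num_elements * element_bytes


requested = local_mem_bytes(19008)
print(requested)                    # 76032
print(requested > LDS_LIMIT_BYTES)  # True
```

If these assumptions hold, the Im2Col kernel asks for more local memory than the device offers regardless of batch size, which would match the observation that a batch size of 1 does not help.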
    
    miopen 
    opened by Sumenia 30
  • Dramatic difference in perf between 1080ti and VEGA FE

    Hey there,

    I'm trialing some code to benchmark my VEGA against a colleague's 1080ti.

    I've noticed some very peculiar differences in time to fit per epoch; I'm guessing I'm messing up in some way.

    For one epoch on AMD 2990WX ~400 seconds

    For one epoch on 1080ti < 100s

    For one epoch on VEGA FE > 40 minutes
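
To put the three timings on one scale, the sketch below converts them to seconds and computes the ratios. The GPU figures are bounds ("< 100s", "> 40 minutes"), so the ratios are lower bounds on the real gap; the variable names are purely illustrative.

```python
# Epoch times as reported above, in seconds.
cpu_2990wx = 400     # "~400 seconds"
gtx_1080ti = 100     # "< 100s", so an upper bound
vega_fe = 40 * 60    # "> 40 minutes", so a lower bound


def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slow run takes than the fast one."""
    return slow_seconds / fast_seconds


print(slowdown(vega_fe, gtx_1080ti))  # 24.0
print(slowdown(vega_fe, cpu_2990wx))  # 6.0
```

A GPU that is at least six times slower than the host CPU may point to a setup or fallback problem rather than a raw hardware difference, though the timings alone cannot confirm that.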

    enhancement question 
    opened by PhilipDeegan 28
  • Integrate rocPRIM 0.3.1 milestone

    A couple of notes:

    • rocPRIM is still marked experimental
    • in this PR, most of the reduction kernels, l2loss, and softmax are converted from cub to rocPRIM
    • complex types are not supported out of the box - hence no complex reduction yet
    • some cub kernels are not yet converted (where, topk, ...) due to issues w/ rocPRIM and/or the TF interface to cub. There will be follow-up PRs for these.
    • all rocPRIM kernels in this PR are marked P (for in Progress) in the documentation until we have confirmed it works and the patch is accepted, then I'll mark done
    opened by iotamudelta 28
  • Fix for finding RCCL that works on 5.1 and 5.2

    With the various ROCm libs moving around, one fix got added that wasn't backward compatible with ROCm 5.1. This PR fixes the TF 2.7 build on 5.1 and also works on 5.2 (assuming we're finally settled on lib and include locations).

    opened by jayfurmanek 27
  • low performance ?

    If I take these benchmarks for reference, Inception v3 performs far slower on a Vega 56 than on an Nvidia 1080.

    I'm a bit disappointed by the performance of my cards. Are these results normal?

    python3.5 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=2 --model resnet50 --batch_size 64

    --> total images/sec: 192.75

    python3.5 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=2 --model inception3 --batch_size 64

    --> total images/sec: 92.29

    • CPU: AMD Threadripper 1900X
    • GPU 1: AMD Vega 56
    • GPU 2: AMD Vega 56
    • Memory: 32 GB DDR4

    opened by Sumenia 27
  • Building (and using) libtensorflow.so

    Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

    System information

    • OS Platform and Distribution: Linux Mint 19.1
    • TensorFlow installed from (source or binary): Source
    • TensorFlow version: 1.12
    • Python version: 3.6.7
    • Installed using virtualenv? pip? conda?: pyenv
    • Bazel version (if compiling from source): 0.16.0, 0.19.2 and 0.21.0
    • GCC/Compiler version (if compiling from source): 7.3.0
    • ROCm version: 2.1
    • GPU model and memory: Radeon VII, 16GB

    Describe the problem

    I want to use TensorFlow from Rust; to do so I need to build the libtensorflow.so shared library. Compilation succeeds on r1.12, but when I try to execute the graph I get a runtime exception (see the other info/logs section).

    I don't encounter any issues with TensorFlow in Python: running a graph and training a model work like a charm there. That build, however, was installed from PyPI rather than compiled from source.

    Provide the exact sequence of commands / steps that you executed before running into the problem

    # Install Bazel 0.19.2 as recommended in #304
    git clone -b r1.12-rocm git@github.com:ROCmSoftwarePlatform/tensorflow-upstream
    cd tensorflow-upstream
    ./configure  # answer n to everything except ROCm support
    bazel build --config=opt --config=rocm --action_env=HIP_PLATFORM=hcc tensorflow:libtensorflow.so
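
Before involving the Rust bindings, it can help to confirm that the freshly built library loads at all. The sketch below is one way to do that from Python with ctypes, calling TF_Version from the TensorFlow C API. The helper name tf_c_api_version and the bazel-bin path are assumptions for illustration, and the check only proves that the shared object loads; it does not launch any GPU kernel, so it would not by itself reproduce the runtime exception reported below.

```python
import ctypes


def tf_c_api_version(lib_path):
    """Return the TF_Version() string of a libtensorflow build.

    Returns None if the shared library cannot be loaded from lib_path.
    """
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        return None
    # TF_Version is declared as `const char* TF_Version(void)` in the C API.
    lib.TF_Version.restype = ctypes.c_char_p
    return lib.TF_Version().decode("utf-8")


if __name__ == "__main__":
    # Assumed output location of the bazel target built above.
    print(tf_c_api_version("bazel-bin/tensorflow/libtensorflow.so"))
```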
    

    Any other info / logs

    Runtime exception:

    2019-02-09 11:14:33.291267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1531] Found device 0 with properties: 
    name: Device 66af
    AMDGPU ISA: gfx906
    memoryClockRate (GHz) 1.802
    pciBusID 0000:28:00.0
    Total memory: 15.98GiB
    Free memory: 15.73GiB
    2019-02-09 11:14:33.291334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1642] Adding visible gpu devices: 0
    2019-02-09 11:14:33.291371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-02-09 11:14:33.291383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1059]      0 
    2019-02-09 11:14:33.291391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1072] 0:   N 
    2019-02-09 11:14:33.291489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Device 66af, pci bus id: 0000:28:00.0)
    terminate called after throwing an instance of 'std::runtime_error'
      what():  Missing metadata for __global__ function: _ZN10tensorflow7functor28FillPhiloxRandomKernelLaunchINS_6random19UniformDistributionINS2_12PhiloxRandomEfEEEEvS4_PNT_17ResultElementTypeExS6_
    [1]    11952 abort (core dumped)  LD_PRELOAD="/home/seb/.libtf/libtensorflow.so" 
    

    hipconfig

    HIP version  : 1.5.19025
    
    == hipconfig
    HIP_PATH     : /opt/rocm/hip
    HIP_PLATFORM : hcc
    CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__=   -I/opt/rocm/hip/include -I/opt/rocm/hcc/include
    
    == hcc
    HSA_PATH     : /opt/rocm/hsa
    HCC_HOME     : /opt/rocm/hcc
    HCC clang version 8.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 683c680a6bff215baa3bd9d3099ba1a43e24cf2e) (ssh://gerritgit/lightning/ec/llvm 6e349ce344586b4254654aea8f34444a13aedb67) (based on HCC 1.3.19045-fea3e2b-683c680-6e349ce )
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /opt/rocm/hcc/bin
    LLVM (http://llvm.org/):
      LLVM version 8.0.0svn
      Optimized build.
      Default target: x86_64-unknown-linux-gnu
      Host CPU: znver1
    
      Registered Targets:
        amdgcn - AMD GCN GPUs
        r600   - AMD GPUs HD2XXX-HD6XXX
        x86    - 32-bit X86: Pentium-Pro and above
        x86-64 - 64-bit X86: EM64T and AMD64
    HCC-cxxflags :  -hc -std=c++amp -I/opt/rocm/hcc/include
    HCC-ldflags  :  -hc -std=c++amp -L/opt/rocm/hcc/lib -Wl,--rpath=/opt/rocm/hcc/lib -ldl -lm -lpthread -lhc_am -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive
    
    === Environment Variables
    PATH=/opt/rocm/hcc/bin:/opt/rocm/hip/bin:/home/seb/.pyenv/shims:/home/seb/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin/x86_64:/home/seb/.pyenv/bin
    LD_LIBRARY_PATH=/opt/rocm/opencl/lib
    HIP_PATH=/opt/rocm/hip
    HCC_HOME=/opt/rocm/hcc
    
    == Linux Kernel
    Hostname     : seb-desktop
    Linux seb-desktop 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
    No LSB modules are available.
    Distributor ID:	LinuxMint
    Description:	Linux Mint 19.1 Tessa
    Release:	19.1
    Codename:	tessa
    

    hcc --version

    HCC clang version 8.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 683c680a6bff215baa3bd9d3099ba1a43e24cf2e) (ssh://gerritgit/lightning/ec/llvm 6e349ce344586b4254654aea8f34444a13aedb67) (based on HCC 1.3.19045-fea3e2b-683c680-6e349ce )
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /opt/rocm/hcc/bin
    

    rocminfo

    =====================    
    HSA System Attributes    
    =====================    
    Runtime Version:         1.1
    System Timestamp Freq.:  1000.000000MHz
    Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
    Machine Model:           LARGE                              
    System Endianness:       LITTLE                             
    
    ==========               
    HSA Agents               
    ==========               
    *******                  
    Agent 1                  
    *******                  
      Name:                    AMD Ryzen 7 2700X Eight-Core Processor
      Vendor Name:             CPU                                
      Feature:                 None specified                     
      Profile:                 FULL_PROFILE                       
      Float Round Mode:        NEAR                               
      Max Queue Number:        0                                  
      Queue Min Size:          0                                  
      Queue Max Size:          0                                  
      Queue Type:              MULTI                              
      Node:                    0                                  
      Device Type:             CPU                                
      Cache Info:              
        L1:                      32768KB                            
      Chip ID:                 0                                  
      Cacheline Size:          64                                 
      Max Clock Frequency (MHz):3700                               
      BDFID:                   0                                  
      Compute Unit:            16                                 
      Features:                None
      Pool Info:               
        Pool 1                   
          Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
          Size:                    49448920KB                         
          Allocatable:             TRUE                               
          Alloc Granule:           4KB                                
          Alloc Alignment:         4KB                                
          Acessible by all:        TRUE                               
        Pool 2                   
          Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
          Size:                    49448920KB                         
          Allocatable:             TRUE                               
          Alloc Granule:           4KB                                
          Alloc Alignment:         4KB                                
          Acessible by all:        TRUE                               
      ISA Info:                
        N/A                      
    *******                  
    Agent 2                  
    *******                  
      Name:                    gfx906                             
      Vendor Name:             AMD                                
      Feature:                 KERNEL_DISPATCH                    
      Profile:                 BASE_PROFILE                       
      Float Round Mode:        NEAR                               
      Max Queue Number:        128                                
      Queue Min Size:          4096                               
      Queue Max Size:          131072                             
      Queue Type:              MULTI                              
      Node:                    1                                  
      Device Type:             GPU                                
      Cache Info:              
        L1:                      16KB                               
      Chip ID:                 26287                              
      Cacheline Size:          64                                 
      Max Clock Frequency (MHz):1802                               
      BDFID:                   10240                              
      Compute Unit:            60                                 
      Features:                KERNEL_DISPATCH 
      Fast F16 Operation:      FALSE                              
      Wavefront Size:          64                                 
      Workgroup Max Size:      1024                               
      Workgroup Max Size Per Dimension:
        Dim[0]:                  67109888                           
        Dim[1]:                  671089664                          
        Dim[2]:                  0                                  
      Grid Max Size:           4294967295                         
      Waves Per CU:            40                                 
      Max Work-item Per CU:    2560                               
      Grid Max Size per Dimension:
        Dim[0]:                  4294967295                         
        Dim[1]:                  4294967295                         
        Dim[2]:                  4294967295                         
      Max number Of fbarriers Per Workgroup:32                                 
      Pool Info:               
        Pool 1                   
          Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
          Size:                    16760832KB                         
          Allocatable:             TRUE                               
          Alloc Granule:           4KB                                
          Alloc Alignment:         4KB                                
          Acessible by all:        FALSE                              
        Pool 2                   
          Segment:                 GROUP                              
          Size:                    64KB                               
          Allocatable:             FALSE                              
          Alloc Granule:           0KB                                
          Alloc Alignment:         0KB                                
          Acessible by all:        FALSE                              
      ISA Info:                
        ISA 1                    
          Name:                    amdgcn-amd-amdhsa--gfx906          
          Machine Models:          HSA_MACHINE_MODEL_LARGE            
          Profiles:                HSA_PROFILE_BASE                   
          Default Rounding Mode:   NEAR                               
          Default Rounding Mode:   NEAR                               
          Fast f16:                TRUE                               
          Workgroup Max Dimension: 
            Dim[0]:                  67109888                           
            Dim[1]:                  1024                               
            Dim[2]:                  16777217                           
          Workgroup Max Size:      1024                               
          Grid Max Dimension:      
            x                        4294967295                         
            y                        4294967295                         
            z                        4294967295                         
          Grid Max Size:           4294967295                         
          FBarrier Max Size:       32                                 
    *** Done ***            
    
    enhancement 
    opened by sebpuetz 26
  • Memory access fault by GPU node-2 (Agent handle: 0x38b6960) on address 0x1000. Reason: Page not present or supervisor privilege.

    Issue Type

    Bug

    Source

    source

    Tensorflow Version

    tensorflow-rocm 2.2

    Custom Code

    Yes

    OS Platform and Distribution

    Ubuntu 20.04

    Mobile device

    No response

    Python version

    3.8

    Bazel version

    No response

    GCC/Compiler version

    No response

    CUDA/cuDNN version

    ROCm v3.5

    GPU model and memory

    2 x RX 480 4 GB

    Current Behaviour?

    When switching my LSTM neurons from 'relu' activation function to 'tanh' I get the following error : `Memory access fault by GPU node-2 (Agent handle: 0x38b6960) on address 0x1000. Reason: Page not present or supervisor privilege.`
    
    Also, when this error doesn't occur (i.e. when the program works), these warnings are printed at the beginning:
    WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
    WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
    WARNING:tensorflow:Layer lstm_2 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
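
The warnings and the crash may be two sides of the same switch. According to the Keras documentation for tf.keras.layers.LSTM, the fused cuDNN kernel is used only when activation is tanh, recurrent_activation is sigmoid, recurrent_dropout is 0, unroll is False, and use_bias is True (plus conditions on masking and eager execution). With 'relu' the layers fall back to the generic GPU kernel and print the warning; with 'tanh' they would take the fused path, which on a ROCm build is presumably where the fault occurs. The helper below is a hypothetical restatement of the argument-level conditions only, not Keras code:

```python
def meets_cudnn_criteria(activation="tanh", recurrent_activation="sigmoid",
                         recurrent_dropout=0.0, unroll=False, use_bias=True):
    """Argument-level conditions under which Keras picks the fused LSTM kernel.

    Masking and eager-execution requirements are deliberately left out.
    """
    return (activation == "tanh"
            and recurrent_activation == "sigmoid"
            and recurrent_dropout == 0
            and not unroll
            and use_bias)


print(meets_cudnn_criteria(activation="relu"))  # False: generic kernel, warning
print(meets_cudnn_criteria(activation="tanh"))  # True: fused kernel path
```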
    

    Standalone code to reproduce the issue

    import os
    import random
    import time
    
    import numpy as np
    import tensorflow as tf
    from tqdm import tqdm
    from collections import deque
    
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
    
    
    print(tf.config.experimental.list_physical_devices("GPU"))
    mirrored_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(tf.distribute.experimental.CollectiveCommunication.RING)
    
    window_size = 5
    episodes = 20
    batch_size = 32
    NAME = f"Blackstonev1-LSTM-32x64x64-{int(time.time())}"
    tensorboard = tf.keras.callbacks.TensorBoard(log_dir=os.path.join("logs", NAME))  # portable path instead of "logs\{}"
    
    class AIAgent:
        def __init__(self, state_size, action_space=3, model_name=NAME):  # Stay, Buy, Sell
            self.state_size = state_size
            self.action_space = action_space
            self.memory = deque(maxlen=2000)
            self.inventory = []
            self.margin_inventory = []
            self.model_name = model_name
    
            self.gamma = 0.95
            self.epsilon = 1.0
            self.epsilon_final = 0.05
            self.epsilon_decay = 0.995
    
            self.model = self.model_builder()
    
        def model_builder(self):
            with mirrored_strategy.scope():
                model = tf.keras.models.Sequential()
    
                model.add(tf.keras.Input(shape=(window_size, 2)))
    
                model.add(tf.keras.layers.LSTM(units=32, activation='relu', return_sequences=True))
                model.add(tf.keras.layers.LSTM(units=64, activation='relu', return_sequences=True))
                model.add(tf.keras.layers.LSTM(units=64, activation='relu', return_sequences=False))
                model.add(tf.keras.layers.Dense(units=self.action_space, activation='linear'))
                model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(lr=0.001))
    
            return model
    
        def trade(self, state):
            rdm = random.random()
            if rdm <= self.epsilon:
                rdm_act = random.randrange(self.action_space)
                print(f"random: {rdm_act}")
                return rdm_act
    
            actions = self.model.predict(state)
            argmax = np.argmax(actions[0])
            print(f'model: {argmax}')
            return argmax
    
        def batch_train(self, batch_size):
            batch = []
            for i in range(len(self.memory) - batch_size + 1, len(self.memory)):
                batch.append(self.memory[i])
    
            for state, action, reward, next_state, done in batch:
                reward = reward
    
                if not done:
                    reward = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
    
                target = self.model.predict(state)
                target[0][action] = reward
    
                self.model.fit(state, target, epochs=1, verbose=0, callbacks=[tensorboard])
    
            if self.epsilon > self.epsilon_final:
                self.epsilon *= self.epsilon_decay
    
    
    def state_creator(data, timestep, window_size):
        starting_id = timestep - window_size + 1
    
        if starting_id >= 0:
            windowed_data = data[starting_id:timestep + 1]
        else:
            windowed_data = - starting_id * [data[0]] + list(data[0:timestep + 1])
    
        state = windowed_data
    
        return np.array([state])
    
    
    def main(batch_size, window_size, episodes):
        data = load_data(stock_name) # Replace with your own input here
        data_samples = len(data) - 1
        agent = AIAgent(window_size)
        agent.model.summary()
    
    
        for episode in range(1, episodes + 1):
            print("Episode: {}/{}".format(episode, episodes))
            state = state_creator(data, 0, window_size)
    
            total_profit = 0
            agent.inventory = []
    
            for t in tqdm(range(data_samples)):
                action = agent.trade(state)
    
                next_state = state_creator(data, t + 1, window_size)
                reward = 0
    
                if action == 1:
                    # Do that
                    continue
                elif action == 2:
                    # Do that
                    continue
    
                elif action == 0:
                    # Do that
                    continue
    
                if t == data_samples - 1:
                    done = True
                else:
                    done = False
    
                agent.memory.append((state, action, reward, next_state, done))
                state = next_state
    
                if len(agent.memory) > batch_size:
                    agent.batch_train(batch_size)
    
            agent.model.save(f"{agent.model_name}_{episode}.h5")
    

    Relevant log output

    No response

    opened by hugo-mrc 0
  • 7900 XTX Fails to Run

    Issue Type

    Bug

    Tensorflow Version

    Tensorflow-rocm v2.11.0-3797-gfe65ef3bbcf 2.11.0

    rocm Version

    5.4.1

    Custom Code

    Yes

    OS Platform and Distribution

    Archlinux: Kernel 6.1.1

    Python version

    3.10

    GPU model and memory

    7900 XTX 24GB

    Current Behaviour?

    I am not entirely sure whether this is an upstream (ROCm) issue or specific to tensorflow-rocm, so I am reporting it to both repos. A toy example refuses to run and dumps core. I would have expected it to train successfully.

    Standalone code to reproduce the issue

    import tensorflow as tf
    import numpy as np
    
    features = np.random.randn(10000,25)
    targets = np.random.randn(10000)
    
    model = tf.keras.Sequential([
         tf.keras.layers.Dense(1)
    ])
    
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss=tf.keras.losses.MeanSquaredError())
    
    model.fit(x=features, y=targets)
    

    Relevant log output

    [[email protected] code]$ pipenv run python testNN.py
    2022-12-24 11:18:37.178811: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    python: /build/hsa-rocr/src/ROCR-Runtime-rocm-5.4.1/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&, size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation failed"' failed.
    
    opened by Mushoz 0
  • I'm not sure if ROCm or the GPU are working properly based on two console outputs

    Issue Type

    Support

    Source

    binary

    Tensorflow Version

    2.11.0

    Custom Code

    Yes

    OS Platform and Distribution

    Kubuntu 20.04

    Mobile device

    No response

    Python version

    3.7

    Bazel version

    No response

    GCC/Compiler version

    No response

    CUDA/cuDNN version

    No response

    GPU model and memory

    No response

    Current Behaviour?

    ROCm Fusion seems to be enabled, but the GPU doesn't appear in tf.config.list_physical_devices('GPU'). This seems a bit contradictory to me.
    

    Standalone code to reproduce the issue

    import tensorflow as tf
    import numpy as np
    
    tensor = tf.constant(np.random.rand(117120,1))
    

    Relevant log output

    2022-12-21 09:16:33.582542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1990] Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6600 XT, pci bus id: 0000:0c:00.0) with AMDGPU version : gfx1032. The supported AMDGPU versions are gfx1030, gfx900, gfx906, gfx908, gfx90a.
    2022-12-21 09:16:35.013263: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
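
The first log line already states why the device list is empty: the card reports an ISA that is not in this build's supported list, so it is skipped. The "ROCm Fusion is enabled" message comes from a different source file (gpu_fusion_pass.cc) and presumably does not depend on a device having been registered. A small sketch that extracts both facts from the message; the function name and the regular expression are illustrative, not part of TensorFlow:

```python
import re

# Message text copied from the log above (timestamp and source location omitted).
LOG_LINE = (
    "Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6600 XT, "
    "pci bus id: 0000:0c:00.0) with AMDGPU version : gfx1032. "
    "The supported AMDGPU versions are gfx1030, gfx900, gfx906, gfx908, gfx90a."
)


def parse_ignored_device(line):
    """Return (detected_isa, supported_isas), or None if the line does not match."""
    match = re.search(
        r"with AMDGPU version : (gfx\w+)\. "
        r"The supported AMDGPU versions are (.+?)\.?$",
        line,
    )
    if match is None:
        return None
    detected = match.group(1)
    supported = [name.strip() for name in match.group(2).split(",")]
    return detected, supported


detected, supported = parse_ignored_device(LOG_LINE)
print(detected)               # gfx1032
print(detected in supported)  # False
```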
    
    opened by tvandraren 1
  • Unable to use profiler

    Issue Type

    Bug

    Source

    source

    Tensorflow Version

    2.11.0

    Custom Code

    Yes

    OS Platform and Distribution

    Archlinux kernel: 6.0.12

    Mobile device

    No response

    Python version

    3.9

    Bazel version

    No response

    GCC/Compiler version

    No response

    CUDA/cuDNN version

    ROCM 5.4.0

    GPU model and memory

    6900XT

    Current Behaviour?

    I am unable to run the profiler without tensorflow-rocm crashing.

    Standalone code to reproduce the issue

    This example triggers the issue:

    import tensorflow as tf
    import numpy as np
    
    tboard_callback = tf.keras.callbacks.TensorBoard(log_dir = '../logs/',
                                                     histogram_freq = 1,
                                                     profile_batch = '500,520')
    
    features = np.random.randn(10000,25)
    targets = np.random.randn(10000)
    
    model = tf.keras.Sequential([
         tf.keras.layers.Dense(1)
    ])
    
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss=tf.keras.losses.MeanSquaredError())
    
    model.fit(x=features, y=targets, callbacks=[tboard_callback])
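
A side observation about the example, separate from the crash (which the traceback places in the callback's constructor): with the Keras defaults for model.fit, assumed here to be batch_size=32 and epochs=1 since neither is passed, the run has fewer training steps than the requested profile range of 500 to 520. The arithmetic, as a sketch:

```python
import math

num_samples = 10000  # rows in `features`
batch_size = 32      # assumed Keras default when batch_size is not passed
epochs = 1           # assumed Keras default when epochs is not passed

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs
profile_start, profile_stop = 500, 520  # from profile_batch='500,520'

print(steps_per_epoch)               # 313
print(profile_start <= total_steps)  # False
```

So even once the crash is resolved, this particular script would probably need more epochs, a smaller batch size, or an earlier profile_batch range to capture a profile.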
    

    Relevant log output

    Fatal Python error: Aborted
    
    
    Main thread:
    Current thread 0x00007f3889bec740 (most recent call first):
      File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/tensorflow/python/profiler/profiler_v2.py", line 117 in start
      File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/keras/callbacks.py", line 2882 in _start_profiler
      File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/keras/callbacks.py", line 2672 in _init_profile_batch
      File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/keras/callbacks.py", line 2421 in __init__
      File "/home/jaap/Dropbox/Projects/Google_Trends_Analysis/code/testNN.py", line 12 in <module>
      File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/spyder_kernels/py3compat.py", line 356 in compat_exec
      File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/spyder_kernels/customize/spydercustomize.py", line 469 in exec_code
      File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/spyder_kernels/customize/spydercustomize.py", line 611 in _exec_file
      File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/spyder_kernels/customize/spydercustomize.py", line 524 in runfile
      File "/tmp/ipykernel_99678/3578018583.py", line 1 in <cell line: 1>
    
    
    Restarting kernel...
    
    opened by Mushoz 0
  • rocWMMA support?

    Issue Type

    Feature Request

    Source

    binary

    Tensorflow Version

    tf 2.10.0.530

    Custom Code

    No

    OS Platform and Distribution

    Linux Ubuntu 20.04

    Mobile device

    N/A

    Python version

    3.8

    Bazel version

    N/A

    GCC/Compiler version

    N/A

    CUDA/cuDNN version

    N/A

    GPU model and memory

    N/A

    Current Behaviour?

    With the impending release of GFX11 GPUs, which support WMMA instructions, there currently seems to be no support for those instructions in the ROCm TensorFlow stack. I'm a bit concerned, as the main competition already has TensorFloat-32 support for their GPUs in their TensorFlow stack. If tensorflow-rocm is to remain relevant and competitive, I believe support for WMMA instructions should be integrated into the ROCm TensorFlow stack ASAP. And if the ROCm + proprietary amdgpu stack can _also_ start fully supporting GFX11 GPUs within a few months of their launch, that would be a huge incentive for me to purchase an RX 7000 series GPU with at least 12 GB VRAM and rocWMMA support ;)
    

    Standalone code to reproduce the issue

    N/A
    

    Relevant log output

    N/A
    
    opened by tedliosu 0
Releases(v2.0.0-rocm)
Owner
ROCm Software Platform
ROCm Software Platform Repository

BudgetML is perfect for practitioners who would like to quickly deploy their models to an endpoint, but not waste a lot of time, money, and effort trying to figure out how to do this end-to-end.

1.3k Dec 25, 2022
Collapse by Conditioning: Training Class-conditional GANs with Limited Data

Collapse by Conditioning: Training Class-conditional GANs with Limited Data Moha

Mohamad Shahbazi 33 Dec 06, 2022
Seeing if I can put together an interactive version of 3b1b's Manim in Streamlit

streamlit-manim Seeing if I can put together an interactive version of 3b1b's Manim in Streamlit Installation I had to install pango with sudo apt-get

Adrien Treuille 6 Aug 03, 2022
PiRapGenerator - Make anyone rap the digits of pi

PiRapGenerator Make anyone rap the digits of pi (sample files are of Ted Nivison

7 Oct 02, 2022
PyTorch implementation of DCT fast weight RNNs

DCT based fast weights This repository contains the official code for the paper: Training and Generating Neural Networks in Compressed Weight Space. T

Kazuki Irie 4 Dec 24, 2022
1st Solution For NeurIPS 2021 Competition on ML4CO Dual Task

KIDA: Knowledge Inheritance in Data Aggregation This project releases our 1st place solution on NeurIPS2021 ML4CO Dual Task. Slide and model weights a

MEGVII Research 24 Sep 08, 2022
Code for the paper SphereRPN: Learning Spheres for High-Quality Region Proposals on 3D Point Clouds Object Detection, ICIP 2021.

SphereRPN Code for the paper SphereRPN: Learning Spheres for High-Quality Region Proposals on 3D Point Clouds Object Detection, ICIP 2021. Authors: Th

Thang Vu 15 Dec 02, 2022