hipCaffe: the HIP port of Caffe

Caffe

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors.

Check out the project site for all the details and step-by-step examples.

Join the chat at https://gitter.im/BVLC/caffe

Please join the caffe-users group or gitter chat to ask questions and talk about methods and models. Framework development discussions and thorough bug reports are collected on Issues.

Happy brewing!

License and Citation

Caffe is released under the BSD 2-Clause license. The BVLC reference models are released for unrestricted use.

Please cite Caffe in your publications if it helps your research:

@article{jia2014caffe,
  Author = {Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor},
  Journal = {arXiv preprint arXiv:1408.5093},
  Title = {Caffe: Convolutional Architecture for Fast Feature Embedding},
  Year = {2014}
}

hipCaffe: the HIP Port of Caffe

For details on the HIP Port of Caffe, please take a look at the README.ROCm.md file.

Comments
  • Multi-GPU training problem.

    Issue summary

    I succeeded in training the bvlc-alexnet and bvlc-googlenet models on a single MI25 GPU. When I changed the number of training GPUs from 1 to all, caffe showed the messages below.

    CPU memory: 256GB
    swap: 16GB
    db: imagenet lmdb
    batchsize: 64
    bvlc_alexnet:

    I0719 10:51:50.941951 2540 solver.cpp:279] Solving AlexNet
    I0719 10:51:50.941956 2540 solver.cpp:280] Learning Rate Policy: step
    I0719 10:51:50.955250 2540 solver.cpp:337] Iteration 0, Testing net (#0)
    I0719 10:54:02.507711 2540 solver.cpp:404] Test net output #0: accuracy = 0.00109375
    I0719 10:54:02.508229 2540 solver.cpp:404] Test net output #1: loss = 6.91062 (* 1 = 6.91062 loss)
    Memory access fault by GPU node-2 on address 0x422ea6b000. Reason: Page not present or supervisor privilege.
    *** Aborted at 1500432842 (unix time) try "date -d @1500432842" if you are using GNU date ***
    PC: @ 0x7f64489dc428 gsignal
    *** SIGABRT (@0x9ec) received by PID 2540 (TID 0x7f642c526700) from PID 2540; stack trace: ***
        @ 0x7f644ddd0390 (unknown)
        @ 0x7f64489dc428 gsignal
        @ 0x7f64489de02a abort
        @ 0x7f644d9401c9 (unknown)
        @ 0x7f644d9464e5 (unknown)
        @ 0x7f644d91e9d7 (unknown)
        @ 0x7f644ddc66ba start_thread
        @ 0x7f6448aae3dd clone
        @ 0x0 (unknown)

    db: imagenet lmdb
    batchsize: 32
    bvlc_googlenet:

    I0719 00:12:28.380522 7405 solver.cpp:279] Solving GoogleNet
    I0719 00:12:28.380544 7405 solver.cpp:280] Learning Rate Policy: step
    Memory access fault by GPU node-2 on address 0x42309ba000. Reason: Page not present or supervisor privilege.
    *** Aborted at 1500394348 (unix time) try "date -d @1500394348" if you are using GNU date ***
    PC: @ 0x7f4078d7a428 gsignal
    *** SIGABRT (@0x1ced) received by PID 7405 (TID 0x7f405c8c4700) from PID 7405; stack trace: ***
        @ 0x7f407e16e390 (unknown)
        @ 0x7f4078d7a428 gsignal
        @ 0x7f4078d7c02a abort
        @ 0x7f407dcde1c9 (unknown)
        @ 0x7f407dce44e5 (unknown)
        @ 0x7f407dcbc9d7 (unknown)
        @ 0x7f407e1646ba start_thread
        @ 0x7f4078e4c3dd clone
        @ 0x0 (unknown)

    Steps to reproduce

    Using the latest ROCm from debian packages.

    My caffe configuration:

    USE_CUDNN := 0
    USE_MIOPEN := 1
    USE_LMDB := 1
    BLAS := open
    BLAS_INCLUDE := /opt/openBlas/include
    BLAS_LIB := /opt/openBlas/lib

    Your system configuration

    Operating system: Ubuntu 16.04.2 LTS with 4.9.0-kfd-compute-rocm-rel-1.6-77
    Compiler: GCC v5.4.0, HCC clang 5.0
    CUDA version (if applicable): not applicable
    CUDNN version (if applicable): not applicable
    BLAS: OpenBlas
    Python or MATLAB version (for pycaffe and matcaffe respectively): not applicable

    opened by ginsongsong 17
  • Problem with lenet on GPU

    Issue summary

    I have a problem with lenet.prototxt.

    ./caffe time -model '../../examples/mnist/lenet.prototxt' : OK
    ./caffe time -model '../../examples/mnist/lenet.prototxt' -gpu 0 : fails with this message:

    Network initialization done.
    I0714 10:10:04.648849 13230 caffe.cpp:355] Performing Forward
    terminate called after throwing an instance of 'char const*'
    *** Aborted at 1500016205 (unix time) try "date -d @1500016205" if you are using GNU date ***
    PC: @     0x7f76dc314428 gsignal
    *** SIGABRT (@0x3e8000033ae) received by PID 13230 (TID 0x7f76e4675b00) from PID 13230; stack trace: ***
        @     0x7f76e3b0a390 (unknown)
        @     0x7f76dc314428 gsignal
        @     0x7f76dc31602a abort
        @     0x7f76e295970d __gnu_cxx::__verbose_terminate_handler()
        @     0x7f76e294f4b6 __cxxabiv1::__terminate()
        @     0x7f76e294f501 std::terminate()
        @     0x7f76e294f609 __cxa_throw
        @     0x7f76e294dddc hipblasSgemm
        @           0x6b9449 caffe::caffe_gpu_gemm<>()
        @           0x654ec4 caffe::InnerProductLayer<>::Forward_gpu()
        @           0x430b2f caffe::Layer<>::Forward()
        @           0x6ff4d7 caffe::Net<>::ForwardFromTo()
        @           0x6ff3ef caffe::Net<>::Forward()
        @           0x42fa5e time()
        @           0x431412 main
        @     0x7f76dc2ff830 __libc_start_main
        @           0x850cf9 _start
        @                0x0 (unknown)
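
The "Aborted at ... (unix time)" line in traces like the one above suggests decoding the value with GNU date. Where GNU date is not available, the same conversion can be sketched in Python. The function name glog_abort_time is made up for this illustration and is not part of Caffe or glog.

```python
from datetime import datetime, timezone


def glog_abort_time(epoch_seconds: int) -> str:
    """Format the epoch value from an 'Aborted at' line as a UTC timestamp.

    Hypothetical helper for illustration; it only mirrors `date -u -d @N`.
    """
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


# Epoch value copied from the trace above.
print(glog_abort_time(1500016205))
```

The result is in UTC, so it may differ by the reporter's timezone offset from the local timestamps that glog prints at the start of each log line.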
    

    Steps to reproduce

    Freshly compiled hipCaffe with the default Makefile.config. Compilation flags = -O3. I am using the latest ROCm from the Debian packages. Running the MNIST lenet example works fine only on the CPU. test_all.testbin shows problems with other networks on the GPU.

    Your system configuration

    Operating system: Ubuntu 16.04.2 LTS with 4.9.0-kfd-compute-rocm-rel-1.6-77
    Compiler: GCC v5.4.0, HCC clang 5.0
    CUDA version (if applicable): not applicable
    CUDNN version (if applicable): not applicable
    BLAS: ATLAS
    Python or MATLAB version (for pycaffe and matcaffe respectively): not applicable

    Hardware: Radeon Pro WX 7100

    opened by BeamOfLight 15
  • Caffe on Vega RX leads to segfault

    Issue summary

    A SIGABRT is received when running the unit tests or the examples on a Vega RX with Ubuntu.

    Steps to reproduce

    I followed all the steps on https://github.com/ROCmSoftwarePlatform/hipCaffe/blob/hip/README.ROCm.md and I got:

    $ ./build/test/test_all.testbin --gtest_filter='DataLayerTest/2.*'
    Cuda number of devices: 1
    Current device id: 0
    Current device name: Vega [Radeon RX Vega]
    Note: Google Test filter = DataLayerTest/2.*
    [==========] Running 12 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 12 tests from DataLayerTest/2, where TypeParam = caffe::GPUDevice<float>
    [ RUN      ] DataLayerTest/2.TestReadLevelDB
    Memory access fault by GPU node-1 on address 0x28000. Reason: Page not present or supervisor privilege.
    *** Aborted at 1507162515 (unix time) try "date -d @1507162515" if you are using GNU date ***
    PC: @     0x7f80572b077f gsignal
    *** SIGABRT (@0x3e800001f89) received by PID 8073 (TID 0x7f803a184700) from PID 8073; stack trace: ***
        @     0x7f805c6b1670 (unknown)
        @     0x7f80572b077f gsignal
        @     0x7f80572b237a abort
        @     0x7f805c220559 (unknown)
        @     0x7f805c226885 (unknown)
        @     0x7f805c1fe9d7 (unknown)
        @     0x7f805c6a76da start_thread
        @     0x7f8057383d7f clone
        @                0x0 (unknown)
    Aborted (core dumped)
    

    Stack with gdb:

    #0  0x00007ffff1c8f77f in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
    #1  0x00007ffff1c9137a in __GI_abort () at abort.c:89
    #2  0x00007ffff6bff559 in  () at /opt/rocm/hsa/lib/libhsa-runtime64.so.1
    #3  0x00007ffff6c05885 in  () at /opt/rocm/hsa/lib/libhsa-runtime64.so.1
    #4  0x00007ffff6bdd9d7 in  () at /opt/rocm/hsa/lib/libhsa-runtime64.so.1
    #5  0x00007ffff70866da in start_thread (arg=0x7fffd4b63700) at pthread_create.c:456
    #6  0x00007ffff1d62d7f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
    

    I have the same stack with ./examples/mnist/train_lenet.sh

    Your system configuration

    Operating system: Ubuntu 17.04 x86-64
    Compiler: g++ 6.3.0
    BLAS: libblas 3.7.0-1, hcblas 0.1.1, hipblas 0.4.4.6
    Python or MATLAB version (for pycaffe and matcaffe respectively):
    Hardware (should be supported by ROCm):

    • Vega RX 56
    • Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
    opened by Nicop06 14
  • Running caffe on AMD GPUs

    Hello! I am opening this issue here because I could not find a better place to discuss running Caffe on AMD GPUs. If there is a better place for this discussion, please say so.

    I was able to install ROCm 1.5 on Ubuntu 16.04, running on an i7 6700 and a Radeon RX 480 8GB.

    I managed to build hipCaffe and run the MNIST example on the AMD GPU. I also tried the same sample on the CPU (using multithreaded OpenBLAS), and the results were not that impressive.

    I am testing with the command time ./examples/mnist/train_lenet.sh. When using the CPU only (8 threads), the result is:

    real	7m38.942s
    user	22m0.692s
    sys	34m6.000s
    

    When using the GPU (RX 480 + ROCm 1.5):

    real	5m46.945s
    user	5m27.120s
    sys	0m7.256s
    

    The speed is better, and the CPU is less loaded. For comparison, our "gpu grid" server using 1x NVIDIA Titan X:

    real	2m0.855s
    user	1m34.332s
    sys	0m43.948s
    

    I do understand that the Titan X is much faster than the RX 480 and that NVIDIA has put MUCH more resources into deep learning optimisation. Are these results expected, or should I be getting better ones? Has anyone else tried hipCaffe on an AMD GPU?
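
To put the three timings above on one scale, the "real" values can be converted to seconds and compared as ratios. The sketch below does only that arithmetic; the dictionary keys and function names are labels invented for this example. With the quoted numbers, the RX 480 run comes out roughly 1.3x faster than the CPU run, and the Titan X run just under 2.9x faster than the RX 480 run.

```python
import re


def parse_time(value: str) -> float:
    """Convert a `time` duration such as '7m38.942s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Wall-clock ("real") times quoted in the report above.
real_times = {
    "cpu_openblas": parse_time("7m38.942s"),
    "rx480_rocm": parse_time("5m46.945s"),
    "titan_x": parse_time("2m0.855s"),
}


def speedup(slow: str, fast: str) -> float:
    """Return how many times faster the `fast` run was than the `slow` run."""
    return real_times[slow] / real_times[fast]


print(f"RX 480 vs CPU:     {speedup('cpu_openblas', 'rx480_rocm'):.2f}x")
print(f"Titan X vs RX 480: {speedup('rx480_rocm', 'titan_x'):.2f}x")
```

These are single runs on different machines, so the ratios are only a rough indication.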

    Your system configuration

    Operating system: Ubuntu 16.04 + ROCm 1.5 (kernel 4.9-kfd) + RX 480 8GB
    BLAS: OpenBLAS (and amd hip BLAS variant)
    Python or MATLAB version (for pycaffe and matcaffe respectively): no python/matlab

    opened by gsedej 12
  • Getting core dumps on "real" workloads

    Issue summary

    I initially reported in #19 that I am seeing test failures as well as core dumps; this issue covers only the core dumps for now.

    In short, NetTest/0.TestReshape, as well as my attempts at running MNIST and CaffeNet, all end with a core dump. The data for CIFAR-10 has an integrity problem...

    (I've been out of the game long enough that I'm not sure how to get the stack trace with gdb... I'm happy to look into this further.)

    Steps to reproduce

    If you are having difficulty building Caffe or training a model, please ask the caffe-users mailing list. If you are reporting a build error that seems to be due to a bug in Caffe, please attach your build configuration (either Makefile.config or CMakeCache.txt) and the output of the make (or cmake) command.

    Your system configuration

    Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0, ROCm & DKMS from the ROCm PPA.
    GPU: AMD RX 580, drivers from rocm PPA
    CPU: Threadripper 1900X on X399 chipset
    Compiler: (I think this is the one you want?) hcc version=1.2.18063-7e18c64-ac8732c-710f135, workweek (YYWWD) = 18063
    BLAS: ROCBLAS (I assume - that's the default in Makefile.config)
    Python or MATLAB version (for pycaffe and matcaffe respectively): standard Ubuntu Python 2.7, no matlab

    opened by davclark 11
  • Segmentation fault when re-running a train job that was interrupted previously

    Issue summary

    I am not sure if this is a problem. I am training ImageNet on the VGG-16 network with multiple GPUs (MI25 x 4). After interrupting the current training job and re-running the training, caffe shows the error message below. If I change the training command to use any single GPU in my system, no error occurs.
    The workaround env "export HCC_UNPINNED_COPY_MODE=2" was applied before training.

    Error Message:

    *** Aborted at 1509489472 (unix time) try "date -d @1509489472" if you are using GNU date ***
    PC: @ 0x7f36ae941faa (unknown)
    *** SIGSEGV (@0x15b475060) received by PID 8045 (TID 0x7ef26b7fe700) from PID 1531400288; stack trace: ***
        @ 0x7f36b4a1c390 (unknown)
        @ 0x7f36ae941faa (unknown)
        @ 0x7f36b34b1315 Mrg32k3aCreateStream()
        @ 0x7f36b34b1235 hcrngMrg32k3aCreateOverStreams()
        @ 0x7f36b352943f hiprngSetPseudoRandomGeneratorSeed
        @ 0x522ffc caffe::Caffe::SetDevice()
        @ 0x722dca caffe::P2PSync<>::InternalThreadEntry()
        @ 0x7f36b0eb15d5 (unknown)
        @ 0x7f36b4a126ba start_thread
        @ 0x7f36ae8fb3dd clone
        @ 0x0 (unknown)
    Segmentation fault (core dumped)

    Steps to reproduce

    1. Run the command below to train my network: [email protected]:/data/ImageNet/imagenet$ /home/amax/hipCaffe/build/tools/caffe train -solver=/data/ImageNet/models/vgg_net/vgg_solver.prototxt -gpu 0,1,2,3
    2. Use Ctrl-C to interrupt the job.
    3. Re-run the command from step 1. It shows the error message above.
    4. If I change '-gpu' to any single GPU, the program runs, but on only one GPU.
    5. If I change '-gpu' to any number of GPUs, the program still shows the error message above. During my testing, there was still about a 1/10 chance of it running without the error.
    6. The error also occurs when I run the training job right after the system boots.
    7. After waiting several minutes, the job can sometimes be run again, but sometimes not.

    I am not sure of the root cause; from the error message, it seems the device resource was not released after the previous job terminated.

    Your system configuration

    Operating system: Ubuntu 16.04.3
    Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
    CUDA version (if applicable):
    CUDNN version (if applicable):
    BLAS: USE_ROCBLAS := 1
    Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7.12
    Other: miopen-hip 1.1.4 miopengemm 1.1.5 rocm-libs 1.6.180
    Server: Inventec P47
    GPU: AMD MI25 x4
    CPU: AMD EPYC 7601 x2
    Memory: 512GB

    opened by dhzhd1 7
  • Improve hipcaffe port

    • Fix one memory leak

    • Reduce HIP stream numbers by using default HIP stream

    • Reduce MIOpen handle numbers by one handle for fwd / bwd weight / bwd data

    • Polish debug message output

    • Add debug message for CuDNNConvolutionLayer count

    • Remove unused CUDNN integration logic

    opened by whchung 4
  • Error building hipCaffe

    I followed the instructions, but I'm getting this error when building hipCaffe:

    .build_release/src/caffe/proto/caffe.pb.h:12:2: error: This file was generated by a newer version of protoc which is
    #error This file was generated by a newer version of protoc which is
    ^

    Clean set up of Ubuntu 16.04.3 and ROCm-1.6-180. Any advice?

    opened by briansp2020 4
  • test_all fails at InternalThreadTest.TestRandomSeed

    Issue summary

    test_all.testbin fails at the following test. All preceding tests are OK:

    [----------] 2 tests from InternalThreadTest
    [ RUN      ] InternalThreadTest.TestStartAndExit
    [       OK ] InternalThreadTest.TestStartAndExit (14 ms)
    [ RUN      ] InternalThreadTest.TestRandomSeed
    *** Aborted at 1499717161 (unix time) try "date -d @1499717161" if you are using GNU date ***
    PC: @     0x7f4a6bb45116 (unknown)
    *** SIGSEGV (@0x0) received by PID 7887 (TID 0x7f4a719e2b80) from PID 0; stack trace: ***
        @     0x7f4a70ea0390 (unknown)
        @     0x7f4a6bb45116 (unknown)
        @     0x7f4a4f35d75d HSACopy::syncCopyExt()
        @     0x7f4a4f35c8bc Kalmar::HSAQueue::copy_ext()
        @     0x7f4a715daa5b ihipStream_t::locked_copySync()
        @     0x7f4a716020bf hipMemcpy
        @     0x7f4a6f9da4f8 hiprngSetPseudoRandomGeneratorSeed
        @           0xc41b60 caffe::Caffe::set_random_seed()
        @           0x48827b caffe::InternalThreadTest_TestRandomSeed_Test::TestBody()
        @           0xb47884 testing::internal::HandleExceptionsInMethodIfSupported<>()
        @           0xb47745 testing::Test::Run()
        @           0xb488f0 testing::TestInfo::Run()
        @           0xb49137 testing::TestCase::Run()
        @           0xb4f517 testing::internal::UnitTestImpl::RunAllTests()
        @           0xb4ef64 testing::internal::HandleExceptionsInMethodIfSupported<>()
        @           0xb4ef19 testing::UnitTest::Run()
        @           0xf83cda main
        @     0x7f4a6ba17830 __libc_start_main
        @           0xf76c19 _start
        @                0x0 (unknown)
    

    Steps to reproduce

    Freshly compiled hipCaffe with the default Makefile.config. Compilation flags = -O3. I am using the latest ROCm from the Debian packages. Running the MNIST lenet example works fine (with or without MIOpen).

    Your system configuration

    Operating system: Linux Mint 18 (Debian derivative)
    Compiler: GCC v5.4.0, HCC clang 5.0
    CUDA version (if applicable): not applicable
    CUDNN version (if applicable): not applicable
    BLAS: ATLAS
    Python or MATLAB version (for pycaffe and matcaffe respectively): not applicable

    opened by ptsant 4
  • enabled miopen pooling

    This fix enables caffe to use MIOpen for MAX_POOLING. It had been disabled because of something cuDNN does not (or did not) support. Looking at NVIDIA's GitHub makes me believe that they have since fixed this limitation.

    I highly recommend running training with this fix to make sure it does not cause accuracy issues. I suspect that all this while we were not using MIOpen for pooling in our training runs.

    cc @ashishfarmer This might also be the reason for the discrepancy you saw between ocl and hip caffe for pooling.

    opened by dagamayank 3
  • Everything fails, hardcoded jenkins paths in binaries

    If I follow the ROCm readme to the letter, I end up successfully building everything, but tests don't run:

    $ ./build/test/test_all.testbin
    ### HCC RUNTIME ERROR: Fail to find compatible kernel at file:/home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.5/external/hcc-tot/lib/mcwamp.cpp line:344
    

    This is aside from my other problem, which is that installing ROCm completely destabilised my Ubuntu install, and on a fresh reinstall of 16.04 it stops any GUI from appearing, so I'm working from TTY1. When will ROCm work with a real kernel?

    opened by cathalgarvey 3
  • Please enable two factor authentication in your github account

    @xmaxk

    We are going to enforce two factor authentication in the (https://github.com/ROCmSoftwarePlatform/) organization on 29th April, 2022. Since we identified you as an outside collaborator for the ROCmSoftwarePlatform organization, you need to enable two factor authentication in your github account, or you will be removed from the organization after the enforcement. Please skip this if already done.

    To set up two factor authentication, please go through the steps in the link below:

    https://docs.github.com/en/[email protected]/github/authenticating-to-github/configuring-two-factor-authentication

    Please email "[email protected]" for queries

    opened by DeeptimayeeSethi 0
  • Why this repository isn't a simple fork?

    Why is "hipCaffe" not a simple fork of the BVLC caffe repo? I know it's a stupid question, but maybe that would make it easier to update the "hip" version from the other caffe versions (like the BVLC or OpenCL versions).

    opened by fabiosammy 1
  • ./build/test/test_all.testbin drops core

    Issue summary

    I'm reposting this after closing my original issue because I am now quite confident the install was completely canonical.

    Running ./build/test/test_all.testbin ultimately drops core. It drops core in the same place with either an RX 470 or an RX Vega 64.

    A few tests fail but at a certain point it always drops core.

    Steps to reproduce

    • Clean install Ubuntu 18.04.1 LTS Server.
    • Use Ubuntu's stock kernels provided only by the apt-get update / upgrade mechanism (i.e. the kernel in use is the one provided by Ubuntu after apt-get dist-upgrade; I have not installed an upstream kernel).
    • Install ROCm 2.0 & hipCaffe per the hipCaffe instructions.
    • Run several of the examples without error.
    • Run ./build/test/test_all.testbin... drops core.

    Problem:

    ...
    [ RUN      ] NetTest/2.TestForcePropagateDown
    [       OK ] NetTest/2.TestForcePropagateDown (2 ms)
    [ RUN      ] NetTest/2.TestAllInOneNetTrain
    [       OK ] NetTest/2.TestAllInOneNetTrain (3 ms)
    [ RUN      ] NetTest/2.TestAllInOneNetVal
    [       OK ] NetTest/2.TestAllInOneNetVal (4 ms)
    [ RUN      ] NetTest/2.TestAllInOneNetDeploy
    [       OK ] NetTest/2.TestAllInOneNetDeploy (1 ms)
    [----------] 26 tests from NetTest/2 (772 ms total)
    
    [----------] 26 tests from NetTest/3, where TypeParam = caffe::GPUDevice<double>
    [ RUN      ] NetTest/3.TestHasBlob
    [       OK ] NetTest/3.TestHasBlob (4 ms)
    [ RUN      ] NetTest/3.TestGetBlob
    [       OK ] NetTest/3.TestGetBlob (4 ms)
    [ RUN      ] NetTest/3.TestHasLayer
    [       OK ] NetTest/3.TestHasLayer (4 ms)
    [ RUN      ] NetTest/3.TestGetLayerByName
    [       OK ] NetTest/3.TestGetLayerByName (4 ms)
    [ RUN      ] NetTest/3.TestBottomNeedBackward
    [       OK ] NetTest/3.TestBottomNeedBackward (4 ms)
    [ RUN      ] NetTest/3.TestBottomNeedBackwardForce
    [       OK ] NetTest/3.TestBottomNeedBackwardForce (4 ms)
    [ RUN      ] NetTest/3.TestBottomNeedBackwardEuclideanForce
    [       OK ] NetTest/3.TestBottomNeedBackwardEuclideanForce (1 ms)
    [ RUN      ] NetTest/3.TestBottomNeedBackwardTricky
    [       OK ] NetTest/3.TestBottomNeedBackwardTricky (5 ms)
    [ RUN      ] NetTest/3.TestLossWeight
    [       OK ] NetTest/3.TestLossWeight (21 ms)
    [ RUN      ] NetTest/3.TestLossWeightMidNet
    [       OK ] NetTest/3.TestLossWeightMidNet (16 ms)
    [ RUN      ] NetTest/3.TestComboLossWeight
    [       OK ] NetTest/3.TestComboLossWeight (18 ms)
    [ RUN      ] NetTest/3.TestBackwardWithAccuracyLayer
    MIOpen Error: /home/dlowell/MIOpenPrivate/src/ocl/softmaxocl.cpp:59: Only alpha=1 and beta=0 is supported
    F0116 04:57:34.752313 24321 cudnn_softmax_layer_hip.cpp:27] Check failed: status == miopenStatusSuccess (7 vs. 0)  miopenStatusUnknownError
    *** Check failure stack trace: ***
        @     0x7f2ab9f720cd  google::LogMessage::Fail()
        @     0x7f2ab9f73f33  google::LogMessage::SendToLog()
        @     0x7f2ab9f71c28  google::LogMessage::Flush()
        @     0x7f2ab9f74999  google::LogMessageFatal::~LogMessageFatal()
        @          0x15364ce  caffe::CuDNNSoftmaxLayer<>::Forward_gpu()
        @           0x4cb540  caffe::Layer<>::Forward()
        @          0x1ea2073  caffe::SoftmaxWithLossLayer<>::Forward_gpu()
        @           0x4cb540  caffe::Layer<>::Forward()
        @          0x1b459d7  caffe::Net<>::ForwardFromTo()
        @          0x1b458f0  caffe::Net<>::Forward()
        @           0x967796  caffe::NetTest_TestBackwardWithAccuracyLayer_Test<>::TestBody()
        @          0x108be34  testing::internal::HandleExceptionsInMethodIfSupported<>()
        @          0x108bcf6  testing::Test::Run()
        @          0x108ceb1  testing::TestInfo::Run()
        @          0x108d5c7  testing::TestCase::Run()
        @          0x1093967  testing::internal::UnitTestImpl::RunAllTests()
        @          0x10933a4  testing::internal::HandleExceptionsInMethodIfSupported<>()
        @          0x1093359  testing::UnitTest::Run()
        @          0x201545a  main
        @     0x7f2ab4b00b97  __libc_start_main
        @          0x20148fa  _start
    Aborted (core dumped)
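
When a long test_all.testbin run aborts like this, the test that was executing at the time is the one with a "[ RUN      ]" line but no matching "[       OK ]" or "[  FAILED  ]" line. A minimal sketch for pulling that name out of a saved log follows; the function name unfinished_tests and the sample text are made up for this illustration, and the patterns assume the gtest line format shown in the log above.

```python
from __future__ import annotations

import re

# Patterns for the gtest status lines seen in the log above.
RUN = re.compile(r"^\s*\[ RUN\s+\] (\S+)")
DONE = re.compile(r"^\s*\[\s+(?:OK|FAILED)\s+\] (\S+)")


def unfinished_tests(log_text: str) -> list[str]:
    """Return tests that started but never reported OK or FAILED."""
    started: list[str] = []
    for line in log_text.splitlines():
        run = RUN.match(line)
        if run:
            started.append(run.group(1))
            continue
        done = DONE.match(line)
        if done and done.group(1) in started:
            started.remove(done.group(1))
    return started


# Shortened excerpt of the output above, used as sample input.
sample = """\
[ RUN      ] NetTest/3.TestComboLossWeight
[       OK ] NetTest/3.TestComboLossWeight (18 ms)
[ RUN      ] NetTest/3.TestBackwardWithAccuracyLayer
Aborted (core dumped)
"""
print(unfinished_tests(sample))
```

The reported name can then be passed to --gtest_filter to rerun only the crashing test.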
    
    

    Your system configuration

    I can provide whatever info you need, just tell me what you want.

    Operating system: Ubuntu 18.04.1 LTS
    Compiler: gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
    CUDA version (if applicable): N/A
    CUDNN version (if applicable): N/A
    BLAS: rocblas 2.0.0.0
    Python or MATLAB version (for pycaffe and matcaffe respectively): Python 2.7.15rc1

    Hardware:
    RX Vega 64, or RX 470 4GB.
    Ryzen 5 2600X
    16 GB RAM
    X470 mobo
    SR-IOV is turned off
    IOMMU enabled or disabled - same result.

    opened by emerth 0
  • Compilation errors - Function not defined

    Please use the caffe-users list for usage, installation, or modeling questions, or other requests for help. Do not post such requests to Issues. Doing so interferes with the development of Caffe.

    Please read the guidelines for contributing before submitting this issue.

    Issue summary

    Compilation error with the latest checkout.

    Steps to reproduce

    Usual process of CMake followed by make If you are having difficulty building Caffe or training a model, please ask the caffe-users mailing list. If you are reporting a build error that seems to be due to a bug in Caffe, please attach your build configuration (either Makefile.config or CMakeCache.txt) and the output of the make (or cmake) command.

    Your system configuration

    Operating system: ubuntu 18.04
    Compiler: hcc (latest version - CLang 8.0)
    CUDA version (if applicable):
    CUDNN version (if applicable):
    BLAS:
    Python or MATLAB version (for pycaffe and matcaffe respectively):

    [ 2%] Building CXX object src/caffe/CMakeFiles/caffe.dir/layers/absval_layer_hip.cpp.o
    /home/naths/srcs_new/hipCaffe/src/caffe/layers/absval_layer_hip.cpp:13:3: error: use of undeclared identifier 'caffe_gpu_abs'
    caffe_gpu_abs(count, bottom[0]->gpu_data(), top_data);
    ^
    /home/naths/srcs_new/hipCaffe/src/caffe/layers/absval_layer_hip.cpp:29:1: note: in instantiation of member function 'caffe::AbsValLayer::Forward_gpu' requested here
    INSTANTIATE_LAYER_GPU_FUNCS(AbsValLayer);
    ^
    /home/naths/srcs_new/hipCaffe/include/caffe/common.hpp:85:3: note: expanded from macro 'INSTANTIATE_LAYER_GPU_FUNCS'
    INSTANTIATE_LAYER_GPU_FORWARD(classname);
    ^
    /home/naths/srcs_new/hipCaffe/include/caffe/common.hpp:67:35: note: expanded from macro 'INSTANTIATE_LAYER_GPU_FORWARD'
    template void classname::Forward_gpu(
    ^
    /home/naths/srcs_new/hipCaffe/src/caffe/layers/absval_layer_hip.cpp:13:3: error: use of undeclared identifier 'caffe_gpu_abs'
    caffe_gpu_abs(count, bottom[0]->gpu_data(), top_data);
    ^
    /home/naths/srcs_new/hipCaffe/src/caffe/layers/absval_layer_hip.cpp:29:1: note: in instantiation of member function 'caffe::AbsValLayer::Forward_gpu' requested here
    INSTANTIATE_LAYER_GPU_FUNCS(AbsValLayer);
    ^
    /home/naths/srcs_new/hipCaffe/include/caffe/common.hpp:85:3: note: expanded from macro 'INSTANTIATE_LAYER_GPU_FUNCS'
    INSTANTIATE_LAYER_GPU_FORWARD(classname);
    ^
    /home/naths/srcs_new/hipCaffe/include/caffe/common.hpp:70:36: note: expanded from macro 'INSTANTIATE_LAYER_GPU_FORWARD'
    template void classname::Forward_gpu(
    ^
    /home/naths/srcs_new/hipCaffe/src/caffe/layers/absval_layer_hip.cpp:24:5: error: use of undeclared identifier 'caffe_gpu_sign'
    caffe_gpu_sign(count, bottom_data, bottom_diff);
    ^
    /home/naths/srcs_new/hipCaffe/src/caffe/layers/absval_layer_hip.cpp:29:1: note: in instantiation of member function 'caffe::AbsValLayer::Backward_gpu' requested here
    INSTANTIATE_LAYER_GPU_FUNCS(AbsValLayer);
    ^
    /home/naths/srcs_new/hipCaffe/include/caffe/common.hpp:86:3: note: expanded from macro 'INSTANTIATE_LAYER_GPU_FUNCS'
    INSTANTIATE_LAYER_GPU_BACKWARD(classname)
    ^
    /home/naths/srcs_new/hipCaffe/include/caffe/common.hpp:75:35: note: expanded from macro 'INSTANTIATE_LAYER_GPU_BACKWARD'
    template void classname::Backward_gpu(
    ^
    /home/naths/srcs_new/hipCaffe/src/caffe/layers/absval_layer_hip.cpp:25:5: error: use of undeclared identifier 'caffe_gpu_mul'
    caffe_gpu_mul(count, bottom_diff, top_diff, bottom_diff);
    ^
    /home/naths/srcs_new/hipCaffe/src/caffe/layers/absval_layer_hip.cpp:24:5: error: use of undeclared identifier 'caffe_gpu_sign'
    caffe_gpu_sign(count, bottom_data, bottom_diff);
    ^
    /home/naths/srcs_new/hipCaffe/src/caffe/layers/absval_layer_hip.cpp:29:1: note: in instantiation of member function 'caffe::AbsValLayer::Backward_gpu' requested here
    INSTANTIATE_LAYER_GPU_FUNCS(AbsValLayer);
    ^
    /home/naths/srcs_new/hipCaffe/include/caffe/common.hpp:86:3: note: expanded from macro 'INSTANTIATE_LAYER_GPU_FUNCS'
    INSTANTIATE_LAYER_GPU_BACKWARD(classname)
    ^
    /home/naths/srcs_new/hipCaffe/include/caffe/common.hpp:79:36: note: expanded from macro 'INSTANTIATE_LAYER_GPU_BACKWARD'
    template void classname::Backward_gpu(
    ^
    /home/naths/srcs_new/hipCaffe/src/caffe/layers/absval_layer_hip.cpp:25:5: error: use of undeclared identifier 'caffe_gpu_mul'
    caffe_gpu_mul(count, bottom_diff, top_diff, bottom_diff);
    ^
    6 errors generated.
    src/caffe/CMakeFiles/caffe.dir/build.make:254: recipe for target 'src/caffe/CMakeFiles/caffe.dir/layers/absval_layer_hip.cpp.o' failed
    make[2]: *** [src/caffe/CMakeFiles/caffe.dir/layers/absval_layer_hip.cpp.o] Error 1
    CMakeFiles/Makefile2:235: recipe for target 'src/caffe/CMakeFiles/caffe.dir/all' failed
    make[1]: *** [src/caffe/CMakeFiles/caffe.dir/all] Error 2
    Makefile:129: recipe for target 'all' failed
    make: *** [all] Error 2

    opened by skn123 2
  • Make target for pyCaffe, and a hardware question.

    Issue summary

    Two straightforward questions:

    1. To build the Python Caffe interface, do I run "make", "make py", or "make python"?

    2. Will this code work on an RX 470, or on an RX Vega 64? I'd rather experiment using the RX 470 if I can get away with it.

    Thanks!

    opened by emerth 2
  • build/test/test_all.testbin Core Dump MIOpen Error

    Issue summary

    Fresh install of Ubuntu 16.x Desktop and ROCm. I started to build hipCaffe, but ran into an error when running ./build/test/test_all.testbin

    Stack Trace:

    [ RUN      ] SoftmaxWithLossLayerTest/2.TestGradientUnnormalized
    [       OK ] SoftmaxWithLossLayerTest/2.TestGradientUnnormalized (89 ms)
    [----------] 4 tests from SoftmaxWithLossLayerTest/2 (2434 ms total)
    
    [----------] 4 tests from SoftmaxWithLossLayerTest/3, where TypeParam = caffe::GPUDevice<double>
    [ RUN      ] SoftmaxWithLossLayerTest/3.TestGradient
    MIOpen Error: /home/dlowell/MIOpenPrivate/src/ocl/softmaxocl.cpp:59: Only alpha=1 and beta=0 is supported
    F1227 11:47:35.865128 23305 cudnn_softmax_layer_hip.cpp:27] Check failed: status == miopenStatusSuccess (7 vs. 0)  miopenStatusUnknownError
    *** Check failure stack trace: ***
        @     0x7f6f6c7d15cd  google::LogMessage::Fail()
        @     0x7f6f6c7d3433  google::LogMessage::SendToLog()
        @     0x7f6f6c7d115b  google::LogMessage::Flush()
        @     0x7f6f6c7d3e1e  google::LogMessageFatal::~LogMessageFatal()
        @          0x14a1f2e  caffe::CuDNNSoftmaxLayer<>::Forward_gpu()
        @           0x4ed860  caffe::Layer<>::Forward()
        @          0x1d9a153  caffe::SoftmaxWithLossLayer<>::Forward_gpu()
        @           0x4ed860  caffe::Layer<>::Forward()
        @           0x51c010  caffe::GradientChecker<>::CheckGradientSingle()
        @           0x51b76d  caffe::GradientChecker<>::CheckGradientExhaustive()
        @           0xa21807  caffe::SoftmaxWithLossLayerTest_TestGradient_Test<>::TestBody()
        @          0x1033034  testing::internal::HandleExceptionsInMethodIfSupported<>()
        @          0x1032ef6  testing::Test::Run()
        @          0x1034051  testing::TestInfo::Run()
        @          0x10348b7  testing::TestCase::Run()
        @          0x103ada7  testing::internal::UnitTestImpl::RunAllTests()
        @          0x103a7e4  testing::internal::HandleExceptionsInMethodIfSupported<>()
        @          0x103a799  testing::UnitTest::Run()
        @          0x1ef98fa  main
        @     0x7f6f6783f830  __libc_start_main
        @          0x1ef8d99  _start
        @              (nil)  (unknown)
    Aborted (core dumped)
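
The MIOpen message above refers to two scaling factors, alpha and beta. The sketch below assumes MIOpen follows the cuDNN-style convention in which the result is blended with the previous contents of the output buffer as `out = alpha * softmax(x) + beta * out_prev`; that convention, and the helper name `scaled_softmax`, are assumptions for illustration and are not taken from the hipCaffe source. It only shows why alpha=1 and beta=0 corresponds to a plain softmax.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_softmax(xs, prev_out, alpha=1.0, beta=0.0):
    """Hypothetical helper: alpha * softmax(xs) + beta * prev_out (assumed convention)."""
    return [alpha * s + beta * p for s, p in zip(softmax(xs), prev_out)]


logits = [1.0, 2.0, 3.0]
stale = [9.0, 9.0, 9.0]  # whatever was already in the output buffer

# alpha=1, beta=0: the stale buffer contents are ignored entirely.
plain = scaled_softmax(logits, stale)

# Any other beta mixes the stale contents into the result.
blended = scaled_softmax(logits, stale, alpha=1.0, beta=0.5)

print(plain)
print(blended)
```

If the failing call really passes other values, checking which alpha and beta reach the softmax call for the `double` type parameter might be a reasonable first step.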
    

    Scrolling up to check for other errors, I found the following, which I believe came from make test:

    Expected: data[i]
    Which is: 1.6294847
    src/caffe/test/test_inner_product_layer.cpp:384: Failure
    Value of: data_t[i]
      Actual: 2.9882355
    Expected: data[i]
    Which is: 2.474798
    src/caffe/test/test_inner_product_layer.cpp:384: Failure
    Value of: data_t[i]
      Actual: 2.1015618
    Expected: data[i]
    Which is: 2.0466099
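
To get a feel for how far off the inner product values are, the relative error of the two complete actual/expected pairs in the log above can be computed as in the sketch below. The `1e-4` threshold is only a hypothetical stand-in for a typical floating-point tolerance, not the value the Caffe test actually uses.

```python
# (actual, expected) pairs copied from the failure log above.
pairs = [
    (2.9882355, 2.474798),
    (2.1015618, 2.0466099),
]

HYPOTHETICAL_TOLERANCE = 1e-4  # assumption, not taken from the test source

rel_errors = []
for actual, expected in pairs:
    # Relative error with respect to the expected value.
    rel = abs(actual - expected) / abs(expected)
    rel_errors.append(rel)
    verdict = "within" if rel <= HYPOTHETICAL_TOLERANCE else "outside"
    print(f"actual={actual} expected={expected} rel_err={rel:.4f} ({verdict} tolerance)")
```

Differences of this size are far larger than ordinary rounding noise, which suggests a real discrepancy in the computation and not a tolerance that is merely too tight.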
    

    Any thoughts or suggestions I could try?

    Your system configuration

    Operating system: Ubuntu 16 Desktop
    Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
    CUDA version (if applicable): NA
    CUDNN version (if applicable):
    BLAS:
    Python or MATLAB version (for pycaffe and matcaffe respectively):

    opened by gateway 1
Releases(rocm-1.7.1)
Owner
ROCm Software Platform