A Neural Net Training Interface on TensorFlow, with focus on speed + flexibility

Overview

Tensorpack

Tensorpack is a neural network training interface based on TensorFlow.


Features:

It's Yet Another TF high-level API, with speed and flexibility built together.

  1. Focus on training speed.

    • Speed comes for free with Tensorpack: it uses TensorFlow in an efficient way, with no extra overhead. On common CNNs, it runs training 1.2~5x faster than the equivalent Keras code. Your training can probably get faster if written with Tensorpack.

    • Data-parallel multi-GPU/distributed training strategies are available off the shelf. They scale as well as Google's official benchmark.

    • See tensorpack/benchmarks for some benchmark scripts.

  2. Focus on large datasets.

    • You don't usually need tf.data. Symbolic programming often makes data processing harder. Tensorpack helps you efficiently process large datasets (e.g. ImageNet) in pure Python with autoparallelization.

  3. It's not a model wrapper.

    • There are too many symbolic function wrappers in the world. Tensorpack includes only a few common models. But you can use any symbolic function library inside Tensorpack, including tf.layers/Keras/slim/tflearn/tensorlayer/....

See the tutorials and documentation to learn more about these features.

Examples:

We refuse toy examples. Instead of showing tiny CNNs trained on MNIST/Cifar10, we provide training scripts that reproduce well-known papers.

We refuse low-quality implementations. Unlike most open source repos which only implement papers, Tensorpack examples faithfully reproduce papers, demonstrating its flexibility for actual research.

Vision:

Reinforcement Learning:

Speech / NLP:

Install:

Dependencies:

  • Python 3.3+.
  • Python bindings for OpenCV. (Optional, but required by a lot of features)
  • TensorFlow ≥ 1.5, < 2
    • TF is not required if you only want to use tensorpack.dataflow alone as a data processing library
    • TF2 is supported if used in graph mode (and use tf.compat.v1 when needed)
pip install --upgrade git+https://github.com/tensorpack/tensorpack.git
# or add `--user` to install to user's local directories

Please note that tensorpack is not yet stable. If you use tensorpack in your code, remember to pin the exact version of tensorpack you use in your dependencies.
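
As a sketch of what recording an exact version can look like with the pip command above: pip's VCS syntax accepts a revision after `@`. The `<commit-sha>` below is a placeholder for whichever commit you tested against, not a real revision.

```shell
# Install one specific revision instead of the moving master branch.
# Replace <commit-sha> with the commit you actually tested (placeholder).
pip install --upgrade "git+https://github.com/tensorpack/tensorpack.git@<commit-sha>"
```

The same `git+https://...@<commit-sha>` string can be placed on its own line in a requirements file.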

Citing Tensorpack:

If you use Tensorpack in your research or wish to refer to the examples, please cite with:

@misc{wu2016tensorpack,
  title={Tensorpack},
  author={Wu, Yuxin and others},
  howpublished={\url{https://github.com/tensorpack/}},
  year={2016}
}
Comments
  • Run Inference after training


    Hello! I am sorry if this is unrelated to Tensorpack. I ran the ResNet on the Cifar10 dataset with Trained Ternary Quantization. Now I don't know how to run inference on the saved checkpoint after training. I have already read "Don’t Use Training Metagraph for Inference" in the Tensorpack documentation. However, I still don't know exactly how to use the snippet below:

    a, b = tf.placeholder(...), tf.placeholder(...)
    with TowerContext('', is_training=False):
        model.build_graph(a, b)
    

    Could you guide me on how to do that? Thank you in advance!

    usage 
    opened by minhson 58
  • error running alexnet_dorefa.py


    environment: tensorflow1.13.0(in docker) cuda8.0 cudnn6 anaconda2

    I get an error running alexnet_dorefa.py. Oddly, /root/tensorpack_data contains a caffe_ilsvrc12.tar.gz file that is only 4kb in size, when it should be 17MB. This is a little confusing to me. Any help is appreciated! @ppwwyyxx The error looks like this:

    [email protected]:/data/home/users/ccc/projects/tensorpack/examples/DoReFa-Net# ./alexnet-dorefa.py --dorefa 1,2,6 --data /data/data/ImageNetOrigin --gpu 4,5,6,7
    /root/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
      from ._conv import register_converters as _register_converters
    [0703 06:54:57 @logger.py:109] WRN Log directory train_log/alexnet-dorefa-1,2,6 exists! Use 'd' to delete it. 
    [0703 06:54:57 @logger.py:112] WRN If you're resuming from a previous run, you can choose to keep it.
    Press any other key to exit. 
    Select Action: k (keep) / d (delete) / q (quit):d
    [0703 06:54:58 @logger.py:74] Argv: ./alexnet-dorefa.py --dorefa 1,2,6 --data /data/data/ImageNetOrigin --gpu 4,5,6,7
    [0703 06:54:58 @alexnet-dorefa.py:222] Batch per tower: 64
    [0703 06:54:58 @fs.py:88] WRN Env var $TENSORPACK_DATASET not set, using /root/tensorpack_data for datasets.
    caffe_ilsvrc12.tar.gz: 8.19kB [00:00, 26.0kB/s]
    Succesfully downloaded caffe_ilsvrc12.tar.gz. 2942 bytes.
    Traceback (most recent call last):
      File "./alexnet-dorefa.py", line 224, in <module>
        config = get_config()
      File "./alexnet-dorefa.py", line 147, in get_config
        data_train = get_data('train')
      File "./alexnet-dorefa.py", line 143, in get_data
        args.data, dataset_name, BATCH_SIZE, augmentors)
      File "/data/home/users/ccc/projects/tensorpack/examples/DoReFa-Net/imagenet_utils.py", line 101, in get_imagenet_dataflow
        ds = dataset.ILSVRC12(datadir, name, shuffle=True)
      File "/root/anaconda2/lib/python2.7/site-packages/tensorpack/dataflow/dataset/ilsvrc.py", line 247, in __init__
        dir, name, meta_dir, shuffle, dir_structure)
      File "/root/anaconda2/lib/python2.7/site-packages/tensorpack/dataflow/dataset/ilsvrc.py", line 158, in __init__
        meta = ILSVRCMeta(meta_dir)
      File "/root/anaconda2/lib/python2.7/site-packages/tensorpack/dataflow/dataset/ilsvrc.py", line 32, in __init__
        self._download_caffe_meta()
      File "/root/anaconda2/lib/python2.7/site-packages/tensorpack/dataflow/dataset/ilsvrc.py", line 57, in _download_caffe_meta
        tarfile.open(fpath, 'r:gz').extractall(self.dir)
      File "/root/anaconda2/lib/python2.7/tarfile.py", line 1693, in open
        return func(name, filemode, fileobj, **kwargs)
      File "/root/anaconda2/lib/python2.7/tarfile.py", line 1751, in gzopen
        raise ReadError("not a gzip file")
    tarfile.ReadError: not a gzip file
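
    The log above reports a 2942-byte download where a much larger archive was expected, and tarfile then rejects the file. A small check like the following can show what was actually saved (for example an HTML error page) before extraction is attempted. This is only a sketch using the Python standard library; `looks_like_gzip` and `extract_if_valid` are hypothetical helper names, not part of Tensorpack.

```python
import tarfile

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def extract_if_valid(path, dest):
    """Extract a .tar.gz archive, or fail with a message showing its first bytes."""
    if not looks_like_gzip(path):
        with open(path, "rb") as f:
            head = f.read(200)
        raise ValueError(
            "{} is not gzip data; first bytes: {!r}".format(path, head))
    with tarfile.open(path, "r:gz") as tar:
        tar.extractall(dest)
```

    If the first bytes turn out to be text rather than binary data, deleting the partial file and downloading it again (or fetching it manually) is the usual next step.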
    
    examples 
    opened by brisker 55
  • Quantizing Gradients - Meaning of max0() operator in DoReFa v2 paper?


    Thank you for your help so far.

    (1) In section 2.5 on quantizing gradients you use an operator called max0 but do not define it. I did not find a definition in the XNOR or BNN papers either. What does this operator do? How is it different from the regular max() operator?

    (2) Second, you say that dr / 2max0(|dr|) + 1/2 is an affine transform to map the gradient into [0,1], but it seems like in your code you apply an additional step to manually clip the values. Why do you need this additional step?

    Code: https://github.com/ppwwyyxx/tensorpack/blob/master/examples/DoReFa-Net/dorefa.py

    def grad_fg(op, x):
        rank = x.get_shape().ndims
        assert rank is not None
        maxx = tf.reduce_max(tf.abs(x), list(range(1,rank)), keep_dims=True)
        x = x / maxx
        n = float(2**bitG-1)
        x = x * 0.5 + 0.5 + tf.random_uniform(
                tf.shape(x), minval=-0.5/n, maxval=0.5/n)
        x = tf.clip_by_value(x, 0.0, 1.0) # this is the extra step not in the paper
        x = quantize(x, bitG) - 0.5
        return x * maxx * 2
    

    (3) I am also having trouble understanding this line, could you please explain? - maxx = tf.reduce_max(tf.abs(x), list(range(1,rank)), keep_dims=True).

    It seems like list(range(1,rank)) is somehow related to your statement that "Here dr = ∂c/∂r is the back-propagated gradient of the output r of some layer, and the maximum is taken over all axis of the gradient tensor dr except for the mini-batch axis (therefore each instance in a mini-batch will have its own scaling factor)", but I do not understand this sentence either. Thank you for your help!
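
    For readers following questions (2) and (3), here is a NumPy sketch of what the reduction over `list(range(1,rank))` computes, and of one plausible reason a clip is useful after the noise is added. It is an illustration under assumptions, not the authors' code: the toy tensor, the choice `bitG = 4`, and the helper name `per_instance_max_abs` are all made up for the demo, and the quantize step is left out.

```python
import numpy as np


def per_instance_max_abs(dr):
    """Max of |dr| over every axis except axis 0 (the mini-batch axis)."""
    reduce_axes = tuple(range(1, dr.ndim))  # e.g. (1, 2) for a rank-3 tensor
    return np.max(np.abs(dr), axis=reduce_axes, keepdims=True)


# Toy "gradient": batch of 2 instances, each of shape (2, 3).
dr = np.array([[[0.1, -0.4, 0.2], [0.0, 0.3, -0.1]],
               [[2.0, -8.0, 1.0], [4.0, 0.5, -3.0]]])

maxx = per_instance_max_abs(dr)
print(maxx.shape)  # (2, 1, 1): one scaling factor per instance in the batch

# Affine map: after dividing by the per-instance max, values lie in [-1, 1],
# so x * 0.5 + 0.5 lies in [0, 1], with the largest |value| landing on 0 or 1.
scaled = dr / maxx * 0.5 + 0.5

# Adding uniform noise in [-0.5/n, 0.5/n] can push those extreme values
# slightly outside [0, 1]; clipping brings them back into range.
n = float(2 ** 4 - 1)  # bitG = 4, an arbitrary choice for this demo
rng = np.random.default_rng(0)
noisy = scaled + rng.uniform(-0.5 / n, 0.5 / n, size=dr.shape)
clipped = np.clip(noisy, 0.0, 1.0)
```

    Because `keepdims=True` leaves the reduced axes with size 1, `dr / maxx` broadcasts so that each instance is divided by its own maximum, which matches the "each instance in a mini-batch will have its own scaling factor" wording quoted above.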

    examples 
    opened by the-bobo 35
  • train on an Atari game: Breakout-v0 (Utilization of gpu and convergence)


    Hello Yuxin,

    I am training on an Atari game and noticed that GPU utilization (nvidia-smi -l) is very low (~10-50%). Could you comment on that, please?

    nvidia-smi-l.txt

    Could you also tell me whether my training is going all right, please? It has been running for quite a long time and I would like to make sure that there is progress.

    Part of the output:

    ................
    [0120 23:32:23 @timer.py:46] Epoch 273 (global_step 1638000) finished, time:2611.25sec.
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv0/W/rms: 0.0015963
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv0/b/rms: 0.034784
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv1/W/rms: 0.00075034
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv1/b/rms: 0.014863
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv2/W/rms: 0.00071202
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv2/b/rms: 0.0056869
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv3/W/rms: 0.00084989
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv3/b/rms: 0.0093001
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc-pi/W/rms: 0.0036259
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc-pi/b/rms: 0.0050046
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc-v/W/rms: 0.023725
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc-v/b/rms: 0.030802
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc0/W/rms: 0.00015396
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc0/b/rms: 0.0010555
    [0120 23:32:24 @stats.py:101] SummaryGradient/prelu/alpha/rms: 0.083734
    [0120 23:32:24 @stats.py:101] async_global_step: 1.638e+06
    [0120 23:32:24 @stats.py:101] cost: 0.010786
    [0120 23:32:24 @stats.py:101] input_queue_size: 2.3367e-37
    [0120 23:32:24 @stats.py:101] learning_rate: 0.0001
    [0120 23:32:24 @stats.py:101] policy_loss: -0.57677
    [0120 23:32:24 @stats.py:101] predict_reward: 2.8047
    [0120 23:32:24 @stats.py:101] rms_advantage: 0.20093
    [0120 23:32:24 @stats.py:101] value_loss: 2.9039
    [0120 23:32:24 @stats.py:101] xentropy_loss: -189.29
    [0120 23:32:25 @timer.py:42] Start Epoch 274 (global_step 1644000) ...
    100%|#####################################################################|6000/6000[43:57<00:00, 2.22it/s]
    [0121 00:16:22 @timer.py:46] Epoch 274 (global_step 1644000) finished, time:2637.45sec.
    [2017-01-21 00:16:24,998] Making new env: Breakout-v0
    [2017-01-21 00:16:25,189] Making new env: Breakout-v0
    100%|#########################################################################|16/16[06:02<00:00, 0.05it/s]
    [0121 00:22:28 @common.py:76] Waiting for all the workers to finish the last run...
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv0/W/rms: 0.0017033
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv0/b/rms: 0.030689
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv1/W/rms: 0.00074152
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv1/b/rms: 0.01373
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv2/W/rms: 0.00068949
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv2/b/rms: 0.005354
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv3/W/rms: 0.00080288
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv3/b/rms: 0.0079926
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc-pi/W/rms: 0.0033409
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc-pi/b/rms: 0.0056811
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc-v/W/rms: 0.01776
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc-v/b/rms: 0.026071
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc0/W/rms: 0.00015412
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc0/b/rms: 0.001081
    [0121 00:22:28 @stats.py:101] SummaryGradient/prelu/alpha/rms: 0.088892
    [0121 00:22:28 @stats.py:101] async_global_step: 1.644e+06
    [0121 00:22:28 @stats.py:101] cost: 0.0021201
    [0121 00:22:28 @stats.py:101] input_queue_size: 0.00082628
    [0121 00:22:28 @stats.py:101] learning_rate: 0.0001
    [0121 00:22:28 @stats.py:101] max_score: 864
    [0121 00:22:28 @stats.py:101] mean_score: 543.19
    [0121 00:22:28 @stats.py:101] policy_loss: -1.5347
    [0121 00:22:28 @stats.py:101] predict_reward: 2.6608
    [0121 00:22:28 @stats.py:101] rms_advantage: 0.19512
    [0121 00:22:28 @stats.py:101] value_loss: 2.762
    [0121 00:22:28 @stats.py:101] xentropy_loss: -191.18
    [0121 00:22:28 @group.py:42] Callbacks took 364.255 sec in total. Periodic-Evaluator: 363.350sec
    [0121 00:22:28 @timer.py:42] Start Epoch 275 (global_step 1650000) ...
    ......................

    examples 
    opened by ghost 33
  • Train Faster RCNN


    I get an error when training Faster RCNN based on your example. However, with your model, I am able to evaluate its performance and get the same results you posted on GitHub.

    Always include the following:

    1. What you did. (command you run if using examples; post or describe your code if not)

    ./examples/FasterRCNN/train.py --load snapshots/tensorpack/COCO-ResNet50-FasterRCNN.npz --gpu 2,3 --datadir /path/to/COCO14 --logdir snapshots/fasterRCNN-ResNet50

    2. What you observed. (training logs)
    [1116 16:23:10 @graph.py:70] Running Op sync_variables_from_main_tower ...
    2017-11-16 16:23:10.457645: E tensorflow/stream_executor/cuda/cuda_driver.cc:1299] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED
    [1116 16:23:14 @param.py:144] After epoch 0, learning_rate will change to 0.00300000
    [1116 16:23:15 @base.py:209] Start Epoch 1 ...


    and then the program idles there forever. Is it related to the line about CUDA_ERROR_NOT_INITIALIZED?

    3. Your environment (TF version, GPUs), if it matters. TF version 1.4.0, Python-3.6, CUDA 9, CUDNN-7. Tensorpack version: the newest commit.

    4. Others:

    • If I comment out ds = PrefetchDataZMQ(ds, 1) in the get_train_dataflow function of data.py, the training runs. Replacing ds = PrefetchDataZMQ(ds, 1) with ds = PrefetchData(ds, 500, 1) works as well.

    Thanks.

    opened by chunfuchen 32
  • Build ZMQ-operator


    I tried to compile your custom operator on my machine and got:

    Compiling user ops ...
    make: Entering directory '/home/patwie/git/tensorpack/tensorpack/user_ops'
    [dep] zmq_recv_op.cc ...
    In file included from zmq_conn.h:8:0,
                     from zmq_recv_op.cc:10:
    zmq.hpp:84:36: error: missing binary operator before token "("
     #if ZMQ_VERSION >= ZMQ_MAKE_VERSION(3, 3, 0)
    

    Could you briefly say which zmq version you use? I had to change

    //#include <zmq.hpp> into
    #include "zmq.hpp"
    

    and use https://github.com/zeromq/cppzmq

    But I am still getting the error.

    enhancement 
    opened by PatWie 32
  • Bug Reports: How to deal with ValueError: Cannot feed value of shape (224, 224, 3) for Tensor 'input:0', which has shape '(?, 224, 224, 3)'


    The first run after rebooting the server seems to be OK. On every following attempt, I get this error message.

    The log is below:

    [1026 20:26:32 @logger.py:74] Argv: main.py
    [1026 20:26:32 @tensor_net.py:46] Running on 2 towers. Batch size per tower: 64
    [1026 20:26:32 @fs.py:89] WRN Env var $TENSORPACK_DATASET not set, using /home/hgao/tensorpack_data for datasets.
    [1026 20:26:34 @prefetch.py:263] [PrefetchDataZMQ] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
    [1026 20:26:34 @ilsvrc.py:118] Assuming directory /tempspace2/hgao/data/imagenet/val has original structure.
    [1026 20:26:34 @param.py:189] Use ./logdir/hyper.txt to set hyperparam: 'learning_rate'.
    [1026 20:26:34 @inference_runner.py:83] InferenceRunner will eval on an InputSource of size 782
    [1026 20:27:04 @input_source.py:178] Setting up the queue 'QueueInput/input_queue' for CPU prefetching ...
    [1026 20:27:04 @input_source.py:459] Setting up StagingArea for GPU prefetching ...
    [1026 20:27:04 @training.py:41] Training a model of 2 towers
    [1026 20:27:04 @training.py:92] Building graph for training tower 0 on device LeastLoadedDeviceSetter-/gpu:0...
    [1026 20:27:06 @regularize.py:108] Add REGULARIZATION_LOSSES of 58 tensors on the total cost.
    [1026 20:27:07 @training.py:92] Building graph for training tower 1 on device LeastLoadedDeviceSetter-/gpu:1...
    [1026 20:27:08 @regularize.py:108] Add REGULARIZATION_LOSSES of 58 tensors on the total cost.
    [1026 20:27:10 @model_utils.py:47] Model Parameters:
    name  shape  dim  device
    conv_s/weights:0  [3, 3, 3, 32]  864  /device:GPU:0
    conv_s/batch_norm/gamma:0  [32]  32  /device:GPU:1
    conv_s/batch_norm/beta:0  [32]  32  /device:GPU:1
    conv_1_0/conv1/conv/weights:0  [3, 3, 32, 1]  288  /device:GPU:1
    conv_1_0/conv1/batch_norm/gamma:0  [32]  32  /device:GPU:1
    conv_1_0/conv1/batch_norm/beta:0  [32]  32  /device:GPU:1
    conv_1_0/conv2/weights:0  [1, 1, 32, 64]  2048  /device:GPU:1
    conv_1_0/conv2/batch_norm/gamma:0  [64]  64  /device:GPU:0
    conv_1_0/conv2/batch_norm/beta:0  [64]  64  /device:GPU:0
    conv_1_1/conv1/conv/weights:0  [3, 3, 64, 1]  576  /device:GPU:0
    conv_1_1/conv1/batch_norm/gamma:0  [64]  64  /device:GPU:0
    conv_1_1/conv1/batch_norm/beta:0  [64]  64  /device:GPU:0
    conv_1_1/conv2/weights:0  [1, 1, 64, 128]  8192  /device:GPU:0
    conv_1_1/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_1_1/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_1_2/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_1_2/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_1_2/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_1_2/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_1_2/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_1_2/conv2/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_1_3/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:0
    conv_1_3/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_1_3/conv1/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_1_3/conv2/weights:0  [1, 1, 128, 256]  32768  /device:GPU:0
    conv_1_3/conv2/batch_norm/gamma:0  [256]  256  /device:GPU:1
    conv_1_3/conv2/batch_norm/beta:0  [256]  256  /device:GPU:1
    conv_1_4/conv1/conv/weights:0  [3, 3, 256, 1]  2304  /device:GPU:1
    conv_1_4/conv1/batch_norm/gamma:0  [256]  256  /device:GPU:1
    conv_1_4/conv1/batch_norm/beta:0  [256]  256  /device:GPU:1
    conv_1_4/conv2/weights:0  [1, 1, 256, 256]  65536  /device:GPU:1
    conv_1_4/conv2/batch_norm/gamma:0  [256]  256  /device:GPU:0
    conv_1_4/conv2/batch_norm/beta:0  [256]  256  /device:GPU:0
    conv_1_5/conv1/conv/weights:0  [3, 3, 256, 1]  2304  /device:GPU:0
    conv_1_5/conv1/batch_norm/gamma:0  [256]  256  /device:GPU:0
    conv_1_5/conv1/batch_norm/beta:0  [256]  256  /device:GPU:0
    conv_1_5/conv2/weights:0  [1, 1, 256, 512]  131072  /device:GPU:0
    conv_1_5/conv2/batch_norm/gamma:0  [512]  512  /device:GPU:1
    conv_1_5/conv2/batch_norm/beta:0  [512]  512  /device:GPU:1
    conv_2/group_0_conv0/conv/weights:0  [1, 1, 4, 1, 1]  4  /device:GPU:1
    conv_2/group_0/conv_0/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_0/conv_0/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_0/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_0/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_0/conv_0/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_0/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_1/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_0/conv_1/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_1/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_1/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_0/conv_1/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_1/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_2/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_0/conv_2/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_2/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_2/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_0/conv_2/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_2/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_3/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_0/conv_3/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_3/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_3/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_0/conv_3/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_3/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_4/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_0/conv_4/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_4/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_0/conv_4/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_0/conv_4/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_0/conv_4/conv2/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_1_conv0/conv/weights:0  [1, 1, 4, 1, 1]  4  /device:GPU:0
    conv_2/group_1/conv_0/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:0
    conv_2/group_1/conv_0/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_1/conv_0/conv1/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_1/conv_0/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:0
    conv_2/group_1/conv_0/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_1/conv_0/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_1/conv_1/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_1/conv_1/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_1/conv_1/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_1/conv_1/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_1/conv_1/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_1/conv_1/conv2/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_1/conv_2/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:0
    conv_2/group_1/conv_2/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_1/conv_2/conv1/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_1/conv_2/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:0
    conv_2/group_1/conv_2/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_1/conv_2/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_1/conv_3/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_1/conv_3/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_1/conv_3/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_1/conv_3/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_1/conv_3/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_1/conv_3/conv2/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_1/conv_4/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:0
    conv_2/group_1/conv_4/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_1/conv_4/conv1/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_1/conv_4/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:0
    conv_2/group_1/conv_4/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_1/conv_4/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_2_conv0/conv/weights:0  [1, 1, 4, 1, 1]  4  /device:GPU:1
    conv_2/group_2/conv_0/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_2/conv_0/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_2/conv_0/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_2/conv_0/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_2/conv_0/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_2/conv_0/conv2/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_2/conv_1/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:0
    conv_2/group_2/conv_1/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_2/conv_1/conv1/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_2/conv_1/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:0
    conv_2/group_2/conv_1/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_2/conv_1/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_2/conv_2/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_2/conv_2/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_2/conv_2/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_2/conv_2/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_2/conv_2/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_2/conv_2/conv2/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_2/conv_3/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:0
    conv_2/group_2/conv_3/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_2/conv_3/conv1/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_2/conv_3/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:0
    conv_2/group_2/conv_3/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_2/conv_3/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_2/conv_4/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_2/conv_4/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_2/conv_4/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_2/conv_4/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_2/conv_4/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_2/conv_4/conv2/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_3_conv0/conv/weights:0  [1, 1, 4, 1, 1]  4  /device:GPU:0
    conv_2/group_3/conv_0/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:0
    conv_2/group_3/conv_0/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_3/conv_0/conv1/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_3/conv_0/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:0
    conv_2/group_3/conv_0/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_3/conv_0/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_3/conv_1/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_3/conv_1/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_3/conv_1/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_3/conv_1/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_3/conv_1/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_3/conv_1/conv2/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_3/conv_2/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:0
    conv_2/group_3/conv_2/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_3/conv_2/conv1/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_3/conv_2/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:0
    conv_2/group_3/conv_2/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_3/conv_2/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_3/conv_3/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:1
    conv_2/group_3/conv_3/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_3/conv_3/conv1/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_2/group_3/conv_3/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:1
    conv_2/group_3/conv_3/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_3/conv_3/conv2/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_3/conv_4/conv1/conv/weights:0  [3, 3, 128, 1]  1152  /device:GPU:0
    conv_2/group_3/conv_4/conv1/batch_norm/gamma:0  [128]  128  /device:GPU:0
    conv_2/group_3/conv_4/conv1/batch_norm/beta:0  [128]  128  /device:GPU:0
    conv_2/group_3/conv_4/conv2/weights:0  [1, 1, 128, 128]  16384  /device:GPU:0
    conv_2/group_3/conv_4/conv2/batch_norm/gamma:0  [128]  128  /device:GPU:1
    conv_2/group_3/conv_4/conv2/batch_norm/beta:0  [128]  128  /device:GPU:1
    conv_3_0/conv1/conv/weights:0  [3, 3, 512, 1]  4608  /device:GPU:1
    conv_3_0/conv1/batch_norm/gamma:0  [512]  512  /device:GPU:1
    conv_3_0/conv1/batch_norm/beta:0  [512]  512  /device:GPU:1
    conv_3_0/conv2/weights:0  [1, 1, 512, 1024]  524288  /device:GPU:1
    conv_3_0/conv2/batch_norm/gamma:0  [1024]  1024  /device:GPU:0
    conv_3_0/conv2/batch_norm/beta:0  [1024]  1024  /device:GPU:0
    conv_3_1/conv1/conv/weights:0  [3, 3, 1024, 1]  9216  /device:GPU:0
    conv_3_1/conv1/batch_norm/gamma:0  [1024]  1024  /device:GPU:0
    conv_3_1/conv1/batch_norm/beta:0  [1024]  1024  /device:GPU:0
    conv_3_1/conv2/weights:0  [1, 1, 1024, 1024]  1048576  /device:GPU:0
    conv_3_1/conv2/batch_norm/gamma:0  [1024]  1024  /device:GPU:1
    conv_3_1/conv2/batch_norm/beta:0  [1024]  1024  /device:GPU:1
    out/pool/batch_norm/gamma:0  [1024]  1024  /device:GPU:1
    out/pool/batch_norm/beta:0  [1024]  1024  /device:GPU:1
    out/dense/weights:0  [1024, 1000]  1024000  /device:GPU:1
    out/dense/biases:0  [1000]  1000  /device:GPU:0
    Total #vars=179, #param=3251000 (12.40 MB assuming all float32)
    [1026 20:27:10 @base.py:207] Setup callbacks graph ...
    [1026 20:27:11 @input_source.py:178] Setting up the queue 'DataParallelInferenceRunner/QueueInput/input_queue' for CPU prefetching ...
    [1026 20:27:11 @predictor_factory.py:54] Building predictor tower 'InferenceTower0' on device /gpu:0 ...
    [1026 20:27:12 @predictor_factory.py:54] Building predictor tower 'InferenceTower1' on device /gpu:1 ...
    [1026 20:27:13 @summary.py:34] Maintain moving average summary of 4 tensors.
    [1026 20:27:13 @graph.py:91] Applying collection UPDATE_OPS of 232 ops.
    [1026 20:27:16 @base.py:212] Creating the session ...
    [1026 20:27:19 @base.py:216] Initializing the session ...
    [1026 20:27:19 @base.py:223] Graph Finalized.
    [1026 20:27:21 @concurrency.py:36] Starting EnqueueThread DataParallelInferenceRunner/QueueInput/input_queue ...
    [1026 20:27:21 @concurrency.py:36] Starting EnqueueThread QueueInput/input_queue ...
    [1026 20:27:21 @input_source.py:418] Pre-filling staging area ...
    [1026 20:27:21 @input_source.py:140] ERR Exception in EnqueueThread DataParallelInferenceRunner/QueueInput/input_queue:
    Traceback (most recent call last):
      File "/tempspace/hgao/py3.6/lib/python3.6/site-packages/tensorpack/input_source/input_source.py", line 133, in run
        self.op.run(feed_dict=feed)
      File "/tempspace/hgao/py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2084, in run
        _run_using_default_session(self, feed_dict, self.graph, session)
      File "/tempspace/hgao/py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4542, in _run_using_default_session
        session.run(operation, feed_dict)
      File "/tempspace/hgao/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
        run_metadata_ptr)
      File "/tempspace/hgao/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1096, in _run
        % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
    ValueError: Cannot feed value of shape (224, 224, 3) for Tensor 'input:0', which has shape '(?, 224, 224, 3)'
    [1026 20:27:22 @input_source.py:146] EnqueueThread DataParallelInferenceRunner/QueueInput/input_queue Exited.
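
    The ValueError says that a single image of shape (224, 224, 3) was fed where a batched tensor of shape (?, 224, 224, 3) is expected. That usually points at datapoints reaching the input queue without a leading batch axis. The NumPy sketch below illustrates the shape difference and one way to group datapoints into batches. `batch_datapoints` is a hypothetical helper written for this illustration; in Tensorpack itself the `BatchData` dataflow is the component normally used for this, and whether a missing batching step is the actual cause here cannot be confirmed from the log alone.

```python
import numpy as np


def batch_datapoints(datapoints, batch_size):
    """Group [image, label] datapoints into batches with a leading batch axis.

    A trailing incomplete batch is dropped in this sketch.
    """
    images, labels = [], []
    for image, label in datapoints:
        images.append(image)
        labels.append(label)
        if len(images) == batch_size:
            yield [np.stack(images), np.array(labels)]
            images, labels = [], []


# Five unbatched datapoints, each image shaped like the one in the error.
single = [[np.zeros((224, 224, 3), dtype=np.float32), i] for i in range(5)]
print(single[0][0].shape)   # (224, 224, 3): what the error says was fed

batches = list(batch_datapoints(single, batch_size=2))
print(batches[0][0].shape)  # (2, 224, 224, 3): matches '(?, 224, 224, 3)'
```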

    opened by HongyangGao 31
  • MultiProcessRunner RuntimeError


    If you're asking about an unexpected problem whose root cause you do not know, use this template. PLEASE DO NOT DELETE THIS TEMPLATE, FILL IT:

    If you already know the root cause to your problem, feel free to delete everything in this template.

    1. What you did:

    (1) If you're using examples, what's the command you run:

    (2) If you're using examples, have you made any changes to the examples? Paste git status; git diff here:

    (3) If not using examples, tell us what you did:

    It's always better to copy-paste what you did than to describe them.

    Please try to provide enough information to let others reproduce your issues. Without reproducing the issue, we may not be able to investigate it.

    I tried to follow the "Efficient Dataflow" tutorial, continuing from https://github.com/tensorpack/tensorpack/issues/1209.

    2. What you observed:

    (1) Include the ENTIRE logs here:

    It's always better to copy-paste what you observed instead of describing them.

    It's always better to paste as much as possible, although sometimes a partial log is OK.

    Tensorpack typically saves stdout to its training log. If stderr is relevant, you can run a command with my_command 2>&1 | tee logs.txt to save both stdout and stderr to one file.

    [0528 10:55:08 @parallel.py:195] WRN MultiProcessRunner does support Windows. However, Windows requires more strict picklability on processes, which may lead of failure on some of the code.
    Traceback (most recent call last):
      File "", line 1, in
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 106, in spawn_main
        exitcode = _main(fd)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 115, in _main
        prepare(preparation_data)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 226, in prepare
        _fixup_main_from_path(data['init_main_from_path'])
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 278, in _fixup_main_from_path
        run_name="mp_main")
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\runpy.py", line 254, in run_path
        pkg_name=pkg_name, script_name=fname)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\runpy.py", line 96, in _run_module_code
        mod_name, mod_spec, pkg_name, script_name)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "C:\AI_Workspace\z_debug\load_lmdb.py", line 79, in
        load_lmdb3()
      File "C:\AI_Workspace\z_debug\load_lmdb.py", line 69, in load_lmdb3
        ds = MultiProcessRunner(ds, 5000, 1) # NOTE: PrefetchData() deprecated in May 2019
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\site-packages\tensorpack\dataflow\parallel.py", line 214, in init
        start_proc_mask_signal(self.procs)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\site-packages\tensorpack\utils\concurrency.py", line 244, in start_proc_mask_signal
        p.start()
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\contextlib.py", line 77, in exit
        self.gen.throw(type, value, traceback)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\site-packages\tensorpack\utils\concurrency.py", line 216, in mask_sigint
        yield True
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\site-packages\tensorpack\utils\concurrency.py", line 244, in start_proc_mask_signal
        p.start()
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\process.py", line 105, in start
        self._popen = self._Popen(self)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\context.py", line 212, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\context.py", line 313, in _Popen
        return Popen(process_obj)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\popen_spawn_win32.py", line 34, in init
        prep_data = spawn.get_preparation_data(process_obj._name)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 144, in get_preparation_data
        _check_not_importing_main()
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 137, in _check_not_importing_main
        is not going to be frozen to produce an executable.''')
    RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:
    
            if __name__ == '__main__':
                freeze_support()
                ...
    
        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
    

    I will attach the code here: z_debug.zip

    But please notice that the LMDB file I'm using is too large to be attached to the zip file. The LMDB file was created from the same "debug2.py" but with more images and data entries.

    In the load_lmdb3() function, the code crashed at MultiProcessRunner() with a RuntimeError. Maybe another Windows issue? I had the same error before PrefetchData() was renamed to MultiProcessRunner().
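
    The RuntimeError message above points at the main-module idiom that Python's multiprocessing needs when child processes are started with "spawn" (the only start method on Windows). Below is a minimal sketch of that idiom using only the standard library. The names square and main are hypothetical, and it does not use tensorpack; in the reported script the module-level call to load_lmdb3() would presumably move under the same guard.

```python
import multiprocessing as mp


def square(x):
    # Worker functions live at module top level so that child processes
    # started with "spawn" can import this module and find them.
    return x * x


def main():
    # Everything that starts processes goes inside main(). A spawned child
    # re-imports this file, and code at module level would run again there.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, range(5))


if __name__ == "__main__":
    # The guard is true only in the process that was launched as a script,
    # not in the re-imported copy inside each child.
    mp.freeze_support()  # only matters for frozen Windows executables
    print(main())
```

    Without the guard, each spawned child would re-run the process-starting code during import, which is the situation the error message describes.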

    (2) Other observations, if any: For example, CPU/GPU utilization, output images, tensorboard curves, if relevant to your issue.

    3. What you expected, if not obvious.

    If you expect higher speed, please read http://tensorpack.readthedocs.io/tutorial/performance-tuning.html before posting.

    If you expect certain accuracy, only in one of the two conditions can we help with it: (1) You're unable to reproduce the accuracy documented in tensorpack examples. (2) It appears to be a tensorpack bug.

    Otherwise, how to train a model to certain accuracy is a machine learning question. We do not answer machine learning questions and it is your responsibility to figure out how to make your models more accurate.

    4. Your environment:

    • Paste the output of this command: python -c 'import tensorpack.tfutils as u; print(u.collect_env_info())' If this command failed, tell us your version of Python/TF/tensorpack.
    • You can install Tensorpack master by pip install -U git+https://github.com/ppwwyyxx/tensorpack.git and see if your issue is already solved.
    • If you're not using tensorpack under a normal command line shell (e.g., using an IDE or jupyter notebook), please retry under a normal command line shell.
    • Include relevant hardware information, e.g. number of GPUs used for training, amount of RAM.

    You may often want to provide extra information related to your issue, but at the minimum please try to provide the above information accurately to save effort in the investigation.

    Windows 10. I think no GPU was used at the moment.

    enhancement 
    opened by dps42 30
  • how to adapt model-agnostic meta learning in tensorpack

    Hello,

    I would like to do model-agnostic meta learning in tensorpack. The training algorithm of a classification task using model-agnostic meta learning is below:

    We have fθ as the model with parameters θ; α and β are hyperparameters.

    1. In each iteration, sample [inputa, inputb, labela, labelb] from the training set.
    2. Forward inputa through fθ and evaluate the gradient of the cross-entropy loss.
    3. Compute adapted parameters with gradient descent: θ' = θ − α∇θfθ(inputa)
    4. Update θ ← θ − β∇θfθ'(inputb)

    https://arxiv.org/abs/1703.03400
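
    To make the two nested updates concrete, here is a minimal sketch on a toy problem. It assumes a scalar linear model f(x) = θ·x with a mean squared error loss instead of cross entropy, so the gradients can be written by hand. All function names are hypothetical and nothing here is tensorpack or TensorFlow code.

```python
def loss(theta, xs, ys):
    # Mean squared error of the scalar linear model f(x) = theta * x.
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad(theta, xs, ys):
    # Analytic derivative of loss() with respect to theta.
    return sum(2 * x * (theta * x - y) for x, y in zip(xs, ys)) / len(xs)


def meta_gradient(theta, task, alpha):
    # task = (inputa, labela, inputb, labelb)
    xa, ya, xb, yb = task
    # Inner step: theta' = theta - alpha * grad of the loss on (inputa, labela).
    theta_prime = theta - alpha * grad(theta, xa, ya)
    # Chain rule through the inner step: d(theta') / d(theta).
    dtheta_prime = 1.0 - alpha * sum(2 * x * x for x in xa) / len(xa)
    # Gradient of the loss on (inputb, labelb) at theta', taken w.r.t. theta.
    return grad(theta_prime, xb, yb) * dtheta_prime


def maml_step(theta, task, alpha, beta):
    # Outer step: theta <- theta - beta * meta-gradient.
    return theta - beta * meta_gradient(theta, task, alpha)
```

    The factor dtheta_prime is the second-order term; dropping it (treating it as 1) gives the first-order approximation, which is what stopping the gradient through the inner step amounts to.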

    The source code of model-agnostic meta learning from github is below:

    for j in range(num_updates - 1):
        loss = self.loss_func(self.forward(inputa, fast_weights, reuse=True), labela)
        grads = tf.gradients(loss, list(fast_weights.values()))
        if FLAGS.stop_grad:
            grads = [tf.stop_gradient(grad) for grad in grads]
        gradients = dict(zip(fast_weights.keys(), grads))
        fast_weights = dict(zip(fast_weights.keys(), [fast_weights[key] - self.update_lr*gradients[key] for key in fast_weights.keys()]))
        output = self.forward(inputb, fast_weights, reuse=True)
        task_outputbs.append(output)
        task_lossesb.append(self.loss_func(output, labelb))

    task_output = [task_outputa, task_outputbs, task_lossa, task_lossesb]
    

    https://github.com/cbfinn/maml/blob/master/maml.py

    I'd like to know how, in tensorpack with trainers, I can access the model weights θ within a training iteration, forward inputa, compute the gradient descent step to get the adapted θ', and then update the model weights θ using task_lossesb, as we normally do at the end of an iteration.

    usage 
    opened by john81923 30
  • Better ModelDesc

    The original design lacks enough consideration and it's not clear how the graph is built, and what one can and cannot do inside build_graph. E.g.:

    • Is it OK to create placeholders inside build_graph?
    • What symbolic functions are allowed and what are not (e.g. tf.layers.batch_norm? tf.train.input_producer?)?
    • What to put in get_inputs and what not? Is this interface even necessary?
    • FIXED (by introducing TowerTrainer, TowerFunc, TowerTensorHandle): How to access a tensor a bit later? Setting self.xxx sadly doesn't work (#287), and using the tensor names is not easy (#315, #317, #442).
    • RESOLVED (use return cost for single-cost ModelDesc; for other types of models you need to write your own trainer anyway, so you'll build the graph yourself): On the contrary, self.cost needs to be set. This seems very hard-coded, and the reason behind it is that self.cost is only set because some (but not all) trainers need it. This contract between Model and Trainer needs to be addressed in a clearer way.
    • FIXED: What's worse, some examples are now actually using self.xxx. Technically they should not rely on this unsupported use.
    • Fancy dynamic stuff might also be hard, but I'm not very familiar with it.

    Some example use cases that are hard or too tricky to do with the current interface:

    • Input data has different layout (needs different placeholder) in training vs inference.
    • Access some tensors in all towers.
    • Mix of data/model parallel. A special case is to create some variables (not reuse) in each tower.

    Nothing should be deprecated because the current interface works well for most problems. But I'm thinking about new ones which can expose more of the graph building process to users.

    enhancement 
    opened by ppwwyyxx 30
  • Stuck in Pre-filling StagingArea

    Hi there, thanks for tensorpack! I am training a segmentation model on Cityscapes. I wrote a dataflow, referring to get_imagenet_dataflow():

    def __iter__(self):
        for img_addr, gt_addr in self.lst:
            img = cv2.cvtColor(cv2.imread(img_addr, cv2.IMREAD_COLOR), cv2.COLOR_BGR2RGB)
            gt = cv2.imread(gt_addr, cv2.IMREAD_GRAYSCALE)
            yield [img, gt]
    

    I tested this dataflow with the code below. It prints the numpy arrays and reaches about 30 it/s (8 cores), but then suddenly stops somewhere, e.g. at 250/5000.

    ds = PrefetchDataZMQ(ds, parallel)
    ds = BatchData(ds, batch_size, remainder=False)
    ds.reset_state()
    print(next(ds.get_data()))
    TestDataSpeed(ds).start()
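
    When a speed test stalls partway, it can help to time the plain Python iterator on its own, before any prefetching or batching is added. The following is a minimal sketch using only the standard library; measure_throughput and fake_dataflow are hypothetical names, and this is a stand-in for the idea of a speed test, not tensorpack's TestDataSpeed implementation.

```python
import itertools
import time


def measure_throughput(iterable, num_items):
    """Consume up to num_items from iterable; return (count, items_per_second)."""
    start = time.perf_counter()
    count = 0
    for _ in itertools.islice(iterable, num_items):
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


def fake_dataflow():
    # Hypothetical stand-in for a real dataflow: yields [image, label] pairs.
    for i in itertools.count():
        yield [[i] * 4, i % 10]


if __name__ == "__main__":
    count, rate = measure_throughput(fake_dataflow(), 1000)
    print("{} items at {:.1f} it/s".format(count, rate))
```

    A returned count smaller than num_items means the iterator ended early, which is one way to tell a short dataset apart from a slow one.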
    

    Then I ran training with SyncMultiGPUTrainerParameterServer, and it gets stuck at "Pre-filling StagingArea", as shown below. At the start, CPU usage is at 104% with little GPU memory usage. After about 10-15 minutes, CPU usage drops and GPU memory usage increases, but there is no computation on the GPU (GPU-Util is 0%). I have no idea what I did wrong. Could you give me some insight? Thanks so much.

    [0926 11:25:49 @base.py:211] Initializing the session ...
    [0926 11:25:49 @base.py:218] Graph Finalized.
    [0926 11:25:50 @concurrency.py:37] Starting EnqueueThread QueueInput/input_queue ...
    [0926 11:26:01 @param.py:148] [HyperParamSetter] At global_step=0, learning_rate will change to 0.00025000
    [0926 11:26:03 @base.py:250] Start Epoch 1 ...
      0%|                                                                                                              |0/371[00:00<?,?it/s]
    [0926 11:26:03 @input_source.py:550] Pre-filling StagingArea ...
    [0926 11:26:05 @input_source.py:554] 1 element was put into StagingArea on each tower.
    

    My environment:

    • Python version: Python 2.7
    • TF version: tf 1.6.0
    • Tensorpack version: 0.8.9.
    • OS: Ubuntu 16.04
    • Hardware information: E5 2630, 4 1080Ti GPUs.
    usage 
    opened by s7ev3n 27
  • Add MMEval support for COCO detection evaluation

    Hi, thanks for this nice work!

    This PR adds a new evaluation tool for examples/FasterRCNN: MMEval.

    MMEval is a unified evaluation library for multiple machine-learning libraries. Its home page is https://github.com/open-mmlab/mmeval

    The coco_det_mmeval.py script supports multi-GPU and multi-node evaluation with MPI4PY:

    # run evaluation
    python tensorpack_mmeval.py --load <model_path>
    
    # launch multi-gpus evaluation by mpirun
    mpirun -np 8 python tensorpack_mmeval.py --load <model_path>
    

    We tested this evaluation script on COCO-MaskRCNN-R50C41x and got the same evaluation results as those Tensorpack reports.

    Related reference: https://github.com/open-mmlab/mmeval/tree/main/examples/tensorpack

    opened by ice-tong 0
  • Option to disable the tqdm progress bars

    Could you guys add the option to disable the tqdm progress bar? I made the code change here, adding a keyword argument "pbar_disable", but I'm not able to check it in.

    def send_dataflow_zmq(df, addr, hwm=50, format=None, bind=False, pbar_disable=False):
        """
        Run DataFlow and send data to a ZMQ socket addr.
        It will serialize and send each datapoint to this address with a PUSH socket.
        This function never returns.
    
        Args:
            df (DataFlow): Will infinitely loop over the DataFlow.
            addr: a ZMQ socket endpoint.
            hwm (int): ZMQ high-water mark (buffer size)
            format (str): The serialization format.
                 Default format uses :mod:`utils.serialize`.
                 This format works with :class:`dataflow.RemoteDataZMQ`.
                 An alternate format is 'zmq_ops', used by https://github.com/tensorpack/zmq_ops
                 and :class:`input_source.ZMQInput`.
            bind (bool): whether to bind or connect to the endpoint address.
            pbar_disable (bool): whether to disable the tqdm progress bar.
        """
        assert format in [None, 'zmq_op', 'zmq_ops']
        if format is None:
            dump_fn = dumps
        else:
            from zmq_ops import dump_arrays
            dump_fn = dump_arrays
    
        ctx = zmq.Context()
        socket = ctx.socket(zmq.PUSH)
        socket.set_hwm(hwm)
        if bind:
            socket.bind(addr)
        else:
            socket.connect(addr)
        try:
            df.reset_state()
            logger.info("Serving data to {} with {} format ...".format(
                addr, 'default' if format is None else 'zmq_ops'))
            INTERVAL = 200
            q = deque(maxlen=INTERVAL)
    
            try:
                total = len(df)
            except NotImplementedError:
                total = 0
            tqdm_args = get_tqdm_kwargs(
                leave=True, smoothing=0.8, disable=pbar_disable)
            tqdm_args['bar_format'] = tqdm_args['bar_format'] + "{postfix}"
            while True:
                with tqdm.trange(total, **tqdm_args) as pbar:
                    for dp in df:
                        start = time.time()
                        socket.send(dump_fn(dp), copy=False)
                        q.append(time.time() - start)
                        pbar.update(1)
                        if pbar.n % INTERVAL == 0:
                            avg = "{:.3f}".format(sum(q) / len(q))
                            pbar.set_postfix({'AvgSendLat': avg})
        finally:
            logger.info("Exiting send_dataflow_zmq ...")
            socket.setsockopt(zmq.LINGER, 0)
            socket.close()
            if not ctx.closed:
                ctx.destroy(0)
    
    opened by actuallyaswin 0
  • Issue when using automatic mixed precision in training with evaluation callback

    1. What you did:

    I tried to use automatic mixed precision via a graph rewrite when training a MaskRCNN model, as presented here: https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/train/experimental/enable_mixed_precision_graph_rewrite. I added the following line at the end of GeneralizedRCNN.optimizer() in generalized_rcnn: opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

    2. What you observed:

    When I train the model without evaluation callback, there is no issue at all. Once it is trained, if I load the model with OfflinePredictor, it also works well. However, if I train the model with evaluation callback, I get the following error during the first evaluation:

    InternalError                             Traceback (most recent call last)
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _do_call(self, fn, *args)
       1364     try:
    -> 1365       return fn(*args)
       1366     except errors.OpError as e:
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
       1349       return self._call_tf_sessionrun(options, feed_dict, fetch_list,
    -> 1350                                       target_list, run_metadata)
       1351 
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
       1442                                             fetch_list, target_list,
    -> 1443                                             run_metadata)
       1444 
    
    InternalError: 2 root error(s) found.
      (0) Internal: Blas GEMM launch failed : a.shape=(12032000, 1), b.shape=(1, 4), m=12032000, n=4, k=1
    	 [[{{node tower-pred-0/fpn/upsample_lat4/Tensordot/MatMul}}]]
      (1) Internal: Blas GEMM launch failed : a.shape=(12032000, 1), b.shape=(1, 4), m=12032000, n=4, k=1
    	 [[{{node tower-pred-0/fpn/upsample_lat4/Tensordot/MatMul}}]]
    0 successful operations.
    0 derived errors ignored.
    
    During handling of the above exception, another exception occurred:
    
    InternalError                             Traceback (most recent call last)
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/train/interface.py in launch_train_with_config(config, trainer)
         97         starting_epoch=config.starting_epoch,
         98         max_epoch=config.max_epoch,
    ---> 99         extra_callbacks=config.extra_callbacks)
        100 
        101 
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py in train_with_defaults(self, _sentinel, callbacks, monitors, session_creator, session_init, steps_per_epoch, starting_epoch, max_epoch, extra_callbacks)
        340         self.train(callbacks, monitors,
        341                    session_creator, session_init,
    --> 342                    steps_per_epoch, starting_epoch, max_epoch)
        343 
        344     def __new__(cls, *args, **kwargs):
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py in train(self, callbacks, monitors, session_creator, session_init, steps_per_epoch, starting_epoch, max_epoch)
        312         self.setup_callbacks(callbacks, monitors)
        313         self.initialize(session_creator, session_init)
    --> 314         self.main_loop(steps_per_epoch, starting_epoch, max_epoch)
        315 
        316     def train_with_defaults(
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/utils/argtools.py in wrapper(*args, **kwargs)
        166         cache.add(func)
        167 
    --> 168         return func(*args, **kwargs)
        169 
        170     return wrapper
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py in main_loop(self, steps_per_epoch, starting_epoch, max_epoch)
        284 
        285                     # trigger epoch outside the timing region.
    --> 286                     self._callbacks.trigger_epoch()
        287                 logger.info("Training has finished!")
        288             except (StopTraining, tf.errors.OutOfRangeError) as e:
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/base.py in trigger_epoch(self)
        154 
        155     def trigger_epoch(self):
    --> 156         self._trigger_epoch()
        157 
        158     def _trigger_epoch(self):
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/group.py in _trigger_epoch(self)
         93             display_name = str(cb)
         94             with tm.timed_callback(display_name):
    ---> 95                 cb.trigger_epoch()
         96         tm.log()
         97 
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/base.py in trigger_epoch(self)
        154 
        155     def trigger_epoch(self):
    --> 156         self._trigger_epoch()
        157 
        158     def _trigger_epoch(self):
    
    /opt/conda/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
        433                 raise CancelledError()
        434             elif self._state == FINISHED:
    --> 435                 return self.__get_result()
        436             else:
        437                 raise TimeoutError()
    
    /opt/conda/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
        382     def __get_result(self):
        383         if self._exception:
    --> 384             raise self._exception
        385         else:
        386             return self._result
    
    /opt/conda/lib/python3.7/concurrent/futures/thread.py in run(self)
         55 
         56         try:
    ---> 57             result = self.fn(*self.args, **self.kwargs)
         58         except BaseException as exc:
         59             self.future.set_exception(exc)
    
    /home/jovyan/eval.py in predict_dataflow()
    --> 157               outputs = predict_image(img, model_func)
    
    /home/jovyan/eval.py in predict_image(img, model_func)
    ---> 46     outputs = model_func(img)
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/predict/base.py in __call__(self, *dp)
         39             list[array]: list of outputs
         40         """
    ---> 41         output = self._do_call(dp)
         42         if self.return_input:
         43             return (dp, output)
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/predict/base.py in _do_call(self, dp)
        134         # run_metadata = tf.RunMetadata()
        135         # options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    --> 136         return self._callable(*dp)
        137 
        138 
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _generic_run(*feed_args, **kwargs)
       1230             feed: feed_val for feed, feed_val in zip(feed_list, feed_args)
       1231         }
    -> 1232         return self.run(fetches, feed_dict=feed_dict, **kwargs)
       1233 
       1234       return _generic_run
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
        954     try:
        955       result = self._run(None, fetches, feed_dict, options_ptr,
    --> 956                          run_metadata_ptr)
        957       if run_metadata:
        958         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
       1178     if final_fetches or final_targets or (handle and feed_dict_tensor):
       1179       results = self._do_run(handle, final_targets, final_fetches,
    -> 1180                              feed_dict_tensor, options, run_metadata)
       1181     else:
       1182       results = []
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
       1357     if handle is None:
       1358       return self._do_call(_run_fn, feeds, fetches, targets, options,
    -> 1359                            run_metadata)
       1360     else:
       1361       return self._do_call(_prun_fn, handle, feeds, fetches)
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _do_call(self, fn, *args)
       1382                     '\nsession_config.graph_options.rewrite_options.'
       1383                     'disable_meta_optimizer = True')
    -> 1384       raise type(e)(node_def, op, message)
       1385 
       1386   def _extend_graph(self):
    
    InternalError: 2 root error(s) found.
      (0) Internal: Blas GEMM launch failed : a.shape=(12032000, 1), b.shape=(1, 4), m=12032000, n=4, k=1
    	 [[node tower-pred-0/fpn/upsample_lat4/Tensordot/MatMul (defined at /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
      (1) Internal: Blas GEMM launch failed : a.shape=(12032000, 1), b.shape=(1, 4), m=12032000, n=4, k=1
    	 [[node tower-pred-0/fpn/upsample_lat4/Tensordot/MatMul (defined at /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
    0 successful operations.
    0 derived errors ignored.
    
    Original stack trace for 'tower-pred-0/fpn/upsample_lat4/Tensordot/MatMul':
      File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py", line 16, in <module>
        app.launch_new_instance()
      File "/opt/conda/lib/python3.7/site-packages/traitlets/config/application.py", line 845, in launch_instance
        app.start()
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelapp.py", line 612, in start
        self.io_loop.start()
      File "/opt/conda/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 199, in start
        self.asyncio_loop.run_forever()
      File "/opt/conda/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
        self._run_once()
      File "/opt/conda/lib/python3.7/asyncio/base_events.py", line 1786, in _run_once
        handle._run()
      File "/opt/conda/lib/python3.7/asyncio/events.py", line 88, in _run
        self._context.run(self._callback, *self._args)
      File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 688, in <lambda>
        lambda f: self._run_callback(functools.partial(callback, future))
      File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 741, in _run_callback
        ret = callback()
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 814, in inner
        self.ctx_run(self.run)
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 775, in run
        yielded = self.gen.send(value)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 374, in dispatch_queue
        yield self.process_one()
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 250, in wrapper
        runner = Runner(ctx_run, result, future, yielded)
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 741, in __init__
        self.ctx_run(self.run)
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 775, in run
        yielded = self.gen.send(value)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 358, in process_one
        yield gen.maybe_future(dispatch(*args))
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 234, in wrapper
        yielded = ctx_run(next, result)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 261, in dispatch_shell
        yield gen.maybe_future(handler(stream, idents, msg))
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 234, in wrapper
        yielded = ctx_run(next, result)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 538, in execute_request
        user_expressions, allow_stdin,
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 234, in wrapper
        yielded = ctx_run(next, result)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/ipkernel.py", line 302, in do_execute
        res = shell.run_cell(code, store_history=store_history, silent=silent)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/zmqshell.py", line 539, in run_cell
        return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2895, in run_cell
        raw_cell, store_history, silent, shell_futures)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2940, in _run_cell
        return runner(coro)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/async_helpers.py", line 68, in _pseudo_sync_runner
        coro.send(None)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3166, in run_cell_async
        interactivity=interactivity, compiler=compiler, result=result)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3357, in run_ast_nodes
        if (await self.run_code(code, result,  async_=asy)):
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-2-f9d37edbca59>", line 23, in <module>
        commit_hash = "unknown",
      File "/home/jovyan/train.py", line 315, in train_mask_rcnn
        launch_train_with_config(traincfg, trainer)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/train/interface.py", line 99, in launch_train_with_config
        extra_callbacks=config.extra_callbacks)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py", line 342, in train_with_defaults
        steps_per_epoch, starting_epoch, max_epoch)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py", line 312, in train
        self.setup_callbacks(callbacks, monitors)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/utils/argtools.py", line 168, in wrapper
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py", line 209, in setup_callbacks
        self._callbacks.setup_graph(weakref.proxy(self))
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/base.py", line 59, in setup_graph
        self._setup_graph()
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/group.py", line 68, in _setup_graph
        cb.setup_graph(self.trainer)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/base.py", line 59, in setup_graph
        self._setup_graph()
      File "/home/jovyan/eval.py", line 305, in _setup_graph
        self.predictors = [self._build_predictor(k % num_gpu) for k in range(self.num_predictor)]
      File "/home/jovyan/eval.py", line 305, in <listcomp>
        self.predictors = [self._build_predictor(k % num_gpu) for k in range(self.num_predictor)]
      File "/home/jovyan/eval.py", line 319, in _build_predictor
        return self.trainer.get_predictor(self._in_names, self._out_names, device=idx)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/train/tower.py", line 136, in get_predictor
        self.tower_func(*input.get_input_tensors())
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/tfutils/tower.py", line 291, in __call__
        output = self._tower_fn(*args)
      File "/home/jovyan/modeling/generalized_rcnn.py", line 129, in build_graph
        features = self.backbone(image)
      File "/home/jovyan/modeling/generalized_rcnn.py", line 307, in backbone
        p23456 = fpn_model('fpn', c2345)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/models/registry.py", line 173, in wrapped_func
        outputs = func(*args, **actual_args)
      File "/home/jovyan/modeling/model_fpn.py", line 65, in fpn_model
        lat = lat + upsample2x('upsample_lat{}'.format(6 - idx), lat_sum_5432[-1])
      File "/home/jovyan/modeling/model_fpn.py", line 51, in upsample2x
        data_format='channels_first')
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/models/registry.py", line 173, in wrapped_func
        outputs = func(*args, **actual_args)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/models/pool.py", line 127, in FixedUnPooling
        ret = tf.tensordot(x, mat, axes=1)  # bxcxhxwxshxsw
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/ops/math_ops.py", line 4071, in tensordot
        ab_matmul = matmul(a_reshape, b_reshape)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
        return target(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/ops/math_ops.py", line 2754, in matmul
        a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6136, in mat_mul
        name=name)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
        op_def=op_def)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
        attrs, op_def, compute_device)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
        op_def=op_def)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
        self._traceback = tf_stack.extract_stack()
    

    4. Your environment:

    sys.platform          linux
    Python                3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
    Tensorpack            v0.10.1-0-g8f831349
    Numpy                 1.19.5
    TensorFlow            1.15.5/v1.15.5-1-g7d0c58b5326
    TF Compiler Version   7.3.1 20180303
    TF CUDA support       True
    TF MKL support        False
    TF XLA support        False
    Nvidia Driver         /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.51.06
    CUDA                  /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0.221
    CUDNN                 /usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
    NCCL                  /usr/lib/x86_64-linux-gnu/libnccl.so.2.7.8
    CUDA_VISIBLE_DEVICES  Unspecified
    GPU 0                 Tesla T4
    Free RAM              21.86/29.45 GB
    CPU Count             8
    Horovod               0.21.3
    cv2                   4.4.0
    msgpack               1.0.2
    python-prctl          False
    

    Question: is it possible to run the evaluation callback while training with automatic mixed precision (inference already works outside of training), or are changes needed to make it work?

    opened by martinjammes 0
  • Is there an analogue for parallel Dataset.interleave in Dataflow?

    A typical data loading pipeline in TensorFlow using tf.data.Dataset might look something like this:

    dataset = tf.data.Dataset.from_tensor_slices(filenames)
    dataset = dataset.interleave(
        tf.data.TFRecordDataset,
        num_parallel_calls=reader_num_threads)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.map(
        lambda serialized_example: tf.io.parse_example(serialized_example, features),
        num_parallel_calls=parser_num_threads)
    

    Obviously, I'm not trying to use Dataflow to parse TFRecords, but it is somewhat of an analogous workflow of wanting to parallelize reading multiple file iterators at a time. I understand how to do the parallel map using Dataflow, but I don't quite see how to do the parallel interleave. Any tips?
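
    One possible way to get this behaviour in plain Python is to give each source iterator its own thread and merge the results through a bounded queue. The sketch below uses only the standard library; `parallel_interleave` and its parameters are hypothetical names chosen for illustration, not part of the Tensorpack API, and the output order across sources is nondeterministic.

```python
import queue
import threading

_DONE = object()  # sentinel: one source iterator has been exhausted


def parallel_interleave(make_iterators, buffer_size=64):
    """Yield items from several iterators as soon as any of them produces one.

    make_iterators: list of zero-argument callables, each returning an iterator
    (for example, one callable per file to read).
    """
    out = queue.Queue(maxsize=buffer_size)

    def worker(make_iter):
        try:
            for item in make_iter():
                out.put(item)  # blocks when the buffer is full
        finally:
            out.put(_DONE)  # always signal completion, even after an error

    threads = [
        threading.Thread(target=worker, args=(m,), daemon=True)
        for m in make_iterators
    ]
    for t in threads:
        t.start()

    remaining = len(threads)
    while remaining:
        item = out.get()
        if item is _DONE:
            remaining -= 1
        else:
            yield item


# Three hypothetical "files", each yielding three records.
sources = [lambda i=i: iter(range(i * 10, i * 10 + 3)) for i in range(3)]
merged = sorted(parallel_interleave(sources))
print(merged)
```

    Such a generator could then be wrapped in a DataFlow's `__iter__` and followed by a parallel map. Threads only help when the readers spend their time in I/O or in code that releases the GIL; errors raised inside a worker are not propagated to the consumer in this sketch.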

    enhancement 
    opened by cyc 6
  • Why doesn't MultiProcessMapData() stop?

    I tried something very simple with MultiProcessMapData():

    from tensorpack.dataflow import DataFlow, MultiProcessMapData
    
    class MyFlow(DataFlow):
        """Yields the integers 0..n-1."""
        def __init__(self, n):
            super().__init__()
            self.n = n
    
        def __iter__(self):
            for i in range(self.n):
                yield i
    
        def __len__(self):
            return self.n
    
    def f(i):
        return i * 10
    
    d0 = MyFlow(10)
    d1 = MultiProcessMapData(d0, num_proc=4, map_func=f, buffer_size=10, strict=False)
    d1.reset_state()
    
    for i in d1:
        print(i)
    print("end")
    

    In this example, the loop never stops; it just keeps producing more and more numbers. If I set strict to True instead, the code produces 5 numbers (0, 10, 20, 30, 40) and then freezes. Is this the expected behaviour? I am using the latest version of Tensorpack on macOS. Thank you.
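
    If the non-strict stream is in fact endless (this is a reading of the behaviour described above, not something confirmed in this thread), one way to consume a fixed number of items is to cap the iterator explicitly. A minimal sketch using only the standard library, where `endless_stream` is a hypothetical stand-in for the mapped DataFlow and `take` is an illustrative helper:

```python
import itertools


def endless_stream(n):
    """Hypothetical stand-in for a non-strict mapped DataFlow: cycles over 0..n-1 forever."""
    while True:
        for i in range(n):
            yield i * 10


def take(stream, count):
    """Return at most `count` items from `stream`, then stop iterating."""
    return list(itertools.islice(stream, count))


first_epoch = take(endless_stream(10), 10)
print(first_epoch)
```

    With a real DataFlow the same idea would be `itertools.islice(d1, len(d0))` after `d1.reset_state()`; whether the items of one such slice correspond exactly to one pass over `d0` is an assumption that should be checked against the Tensorpack documentation.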

    opened by hsinhaoyu 2
  • [Placeholder]Detectron2 fbnet backbone

    It is amazing to see Detectron2; it feels like the best of PyTorch and TensorFlow. Thank you for the great library.

    According to @wat3rbro (https://github.com/facebookresearch/detectron2/issues/12#issuecomment-565566046 and https://github.com/facebookresearch/detectron2/issues/12#issuecomment-566822670), mobile-friendly models are coming soon.

    I am creating this issue as a placeholder for supporting the FBNet backbone whenever those models become available.

    Once again thank you for the great library. Pardon if the category is wrong.

    opened by no-1ne 0
Owner
Tensorpack
Use TensorFlow in the right way