Deploy optimized transformer-based models on Nvidia Triton server

Overview

🤗 Hugging Face Transformer submillisecond inference 🤯 and deployment on Nvidia Triton server

Yes, you can perform inference with a transformer-based model in less than 1ms on the cheapest GPU available on Amazon (T4)!

The commands below have been tested on an AWS g4dn instance with the Deep Learning Base AMI (Ubuntu 18.04) Version 44.0. They may require some small adaptations to run on another Linux distribution.

You can find explanations of how it works in Hugging Face Transformer inference UNDER 1 millisecond latency

Baseline set by Hugging Face Infinity demo

Hugging Face infinity demo video

  • AWS virtual machine: g4dn.xlarge (T4 GPU)
  • model: "philschmid/MiniLM-L6-H384-uncased-sst2" (Hugging Face hub URL)
  • experiment 1: batch size 1, seq len 16 tokens -> 1.7ms
  • experiment 2: batch size 1, seq len 128 tokens -> 2.5ms

Install dependencies

These dependencies have to be installed on the remote machine directly (no container).

git clone git@github.com:ELS-RD/triton_transformers.git
pip3 install -r requirements.txt

Generate optimized models

We generate the models from a Docker image so we can also get measurements for TensorRT + ONNX Runtime.

cd triton_transformers
DOCKER_BUILDKIT=1 docker build --tag onnxruntime-trt:latest -f Dockerfile .
docker run -it --rm --gpus all -v $PWD:/project onnxruntime-trt bash -c "cd /project && python convert_onnx.py"

⚠️ WARNING ⚠️: if you run the conversion outside the Docker container, you may get very different timings, and TensorRT won't work.

It should produce something like this:

10/31/2021 11:35:08 INFO     inference done on Tesla T4
10/31/2021 11:35:08 INFO     timing [[TensorrtExecutionProvider] ./onnx_models/model-shape.onnx]: mean=0.61ms, sd=0.11ms, min=0.52ms, max=0.92ms, median=0.54ms, 95p=0.88ms, 99p=0.90ms
10/31/2021 11:35:08 INFO     timing [[CUDAExecutionProvider] ./onnx_models/model.onnx]: mean=1.10ms, sd=0.10ms, min=1.04ms, max=3.44ms, median=1.07ms, 95p=1.29ms, 99p=1.36ms
10/31/2021 11:35:08 INFO     timing [[CUDAExecutionProvider] ./onnx_models/model-optimized.onnx]: mean=0.63ms, sd=0.05ms, min=0.60ms, max=0.84ms, median=0.61ms, 95p=0.77ms, 99p=0.79ms
10/31/2021 11:35:08 INFO     timing [Pytorch_32]: mean=5.09ms, sd=0.16ms, min=4.88ms, max=6.11ms, median=5.07ms, 95p=5.28ms, 99p=5.35ms
10/31/2021 11:35:08 INFO     timing [Pytorch_FP16]: mean=6.04ms, sd=0.74ms, min=5.77ms, max=28.79ms, median=6.05ms, 95p=6.19ms, 99p=6.29ms

TensorRT and optimized ONNX Runtime provide very similar results on short sequences. In the following steps, we will continue with the ONNX Runtime model because dynamic axes are easier to work with than in TensorRT.
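
To make the comparison concrete, here is a hedged sketch (not the repository's convert_onnx.py) of how the optimized ONNX model can be loaded and timed with ONNX Runtime on GPU; the model name comes from the baseline above and the measurement protocol is deliberately simplified.

import time

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/MiniLM-L6-H384-uncased-sst2")
session = ort.InferenceSession(
    "./onnx_models/model-optimized.onnx", providers=["CUDAExecutionProvider"]
)
model_input_names = {i.name for i in session.get_inputs()}

encoded = tokenizer("This live event is great.", return_tensors="np")
# keep only the tensors the exported graph actually declares
inputs = {k: v for k, v in encoded.items() if k in model_input_names}

for _ in range(10):  # warm-up
    session.run(None, inputs)
timings = []
for _ in range(100):
    start = time.perf_counter()
    session.run(None, inputs)
    timings.append(time.perf_counter() - start)
print(f"mean latency: {1e3 * np.mean(timings):.2f} ms")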

The Docker build is very slow on a G4, be patient... The Docker image is only required for TensorRT support inside ONNX Runtime (and to measure the difference, if any, with plain ONNX Runtime).

FastAPI server

This is our baseline, easy to run, but not very performant.
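
A minimal sketch of what a FastAPI app like server_onnx.py might contain (hypothetical, the actual script in the repository may differ; model path and name are reused from the steps above):

import onnxruntime as ort
from fastapi import FastAPI
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("philschmid/MiniLM-L6-H384-uncased-sst2")
session = ort.InferenceSession(
    "./onnx_models/model-optimized.onnx", providers=["CUDAExecutionProvider"]
)
model_input_names = {i.name for i in session.get_inputs()}


@app.get("/predict")
def predict(query: str):
    encoded = tokenizer(query, return_tensors="np")
    # keep only the tensors the exported graph actually declares
    onnx_inputs = {k: v for k, v in encoded.items() if k in model_input_names}
    logits = session.run(None, onnx_inputs)[0]
    return {"logits": logits.tolist()}

The launch commands below assume the app object is importable as server_onnx:app.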

# launch the server, disable logging for best performance
python3 -m uvicorn --log-level warning server_onnx:app --port 8000 --host 0.0.0.0
# another option: gunicorn with a single uvicorn worker for best latency (loading the same model several times on a single GPU is not a good idea):
python3 -m gunicorn -w 1 -k uvicorn.workers.UvicornWorker --log-level warning server_onnx:app --bind 0.0.0.0:8000

# simple inference timing
time curl -G --data-urlencode query="This live event is great. I will sign-up for Infinity." localhost:8000/predict
# a slightly more serious measurement
sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo perf stat -r 50 -d curl -G --data-urlencode query="This live event is great. I will sign-up for Infinity." localhost:8000/predict -s > /dev/null

It should produce:

Performance counter stats for 'curl -G --data-urlencode query=This live event is great. I will sign-up for Infinity. localhost:8000/predict' (50 runs):

              6.14 msec task-clock                #    0.494 CPUs utilized            ( +-  0.59% )
                 3      context-switches          #    0.462 K/sec                    ( +-  1.84% )
                 0      cpu-migrations            #    0.000 K/sec                  
               577      page-faults               #    0.094 M/sec                    ( +-  0.06% )
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               
   <not supported>      L1-dcache-loads                                             
   <not supported>      L1-dcache-load-misses                                       
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

         0.0124429 +- 0.0000547 seconds time elapsed  ( +-  0.44% )

Triton server

Copy the ONNX model generated in the first step in this folder, then launch the Triton image. As you can see, we install Transformers and then launch the server itself. This is of course bad practice; you should build your own two-line Dockerfile with Transformers installed. (A sketch of the tokenizer's Python backend model is shown after the command below.)

# copy the generated model to triton model folder
cp ./onnx_models/model-optimized.onnx ./triton_models/sts/1/model.onnx
# install transformers (and its tokenizer) and launch the server in a single line, ugly but good enough for our demo
# --shm-size 256m -> needed to run several Python backends at the same time
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:21.10-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

Triton server perf analysis

You need to edit the script to select the 16- or 128-token sequence (both texts are already included); a sketch of the kind of client-side timing loop it runs is shown after the outputs below.

  • 16 tokens:
~/triton_transformers$ python3 triton_transformers.py
10/31/2021 12:09:34 INFO     timing [triton transformers]: mean=1.53ms, sd=0.06ms, min=1.48ms, max=1.78ms, median=1.51ms, 95p=1.66ms, 99p=1.74ms
[[-3.4355469  3.2753906]]
  • 128 tokens:
~/triton_transformers$ python3 triton_transformers.py
10/31/2021 12:12:00 INFO     timing [triton transformers]: mean=1.96ms, sd=0.08ms, min=1.88ms, max=2.24ms, median=1.93ms, 95p=2.17ms, 99p=2.23ms
[[-3.4589844  3.3027344]]
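
The measurements above come from triton_transformers.py. A hedged sketch of that kind of client-side timing loop (the input tensor name TEXT matches the perf_analyzer command below; the output tensor name, here output, is an assumption):

import time

import numpy as np
import tritonclient.http

client = tritonclient.http.InferenceServerClient(url="127.0.0.1:8000")
text = "This live event is great. I will sign-up for Infinity."

query = tritonclient.http.InferInput(name="TEXT", shape=[1], datatype="BYTES")
query.set_data_from_numpy(np.asarray([text], dtype=object))
output = tritonclient.http.InferRequestedOutput(name="output", binary_data=False)

timings = []
for _ in range(100):
    start = time.perf_counter()
    result = client.infer(model_name="transformers", inputs=[query], outputs=[output])
    timings.append(time.perf_counter() - start)
print(f"mean latency: {1e3 * np.mean(timings):.2f} ms")
print(result.as_numpy("output"))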

There is also a more rigorous performance analysis tool called perf_analyzer (it takes care of checking that measurements are stable, etc.; see its documentation). The tool needs to be run on Ubuntu >= 20.04 (it won't work on the Ubuntu 18.04 used by the official AWS deep learning image). It can also take measurements on torchserve and tensorflow.

# perf_analyzer needs this dependency
sudo apt install libb64-dev
# add -a for async measures, and -i grpc to use that protocol instead of http 
~/.local/bin/perf_analyzer -m transformers --percentile=95 --input-data perf_data.json --shape TEXT:1 # -i grpc -a

Call Triton HTTP API directly

If you don't want to use the tritonclient API, you can call the Triton server in the following ways (a requests-based sketch follows the commands below):

# if you like Python requests library
python3 triton_requests.py

# if you want a generic HTTP template, the @ means no data conversion
curl -X POST  http://localhost:8000/v2/models/transformers/versions/1/infer \
  --data-binary "@query_body.bin" \
  --header "Inference-Header-Content-Length: 160"

Use TensorRT model in Triton server (instead of ONNX)

To use a TensorRT model instead of the ONNX Runtime one:

  • we need to convert the ONNX model to a TensorRT engine
  • update the configuration, as TensorRT takes int32 inputs instead of int64
# we use the Docker container to guarantee the use of the right trtexec version (otherwise you will get a deserialization error)
# it's a basic conversion; in real life you want to provide at least the minimum, optimum and maximum shapes (see the Python API sketch after this block)
# it may take a few minutes...
docker run -it --rm --gpus all -v $PWD/onnx_models:/models nvcr.io/nvidia/tritonserver:21.10-py3 \
    /usr/src/tensorrt/bin/trtexec \
    --onnx=/models/model.onnx \
    --best \
    --shapes=input_ids:1x128,attention_mask:1x128 \
    --saveEngine="/models/model.plan" \
    --workspace=6000
# move to triton model folder
cp ./onnx_models/model.plan ./triton_models/sts/1/model.plan
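
As noted in the comments above, a production conversion should define explicit minimum/optimum/maximum shapes. A hedged sketch of the same conversion through the TensorRT Python API (shape values are illustrative assumptions; the input names match the trtexec command above):

import tensorrt as trt

trt_logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(trt_logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, trt_logger)
with open("./onnx_models/model.onnx", "rb") as f:
    assert parser.parse(f.read()), "ONNX parsing failed"

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.max_workspace_size = 6 * 1024 ** 3  # deprecated in recent TensorRT versions

# one optimization profile with explicit min / optimum / max shapes (illustrative values)
profile = builder.create_optimization_profile()
for name in ["input_ids", "attention_mask"]:
    profile.set_shape(name, (1, 16), (1, 128), (4, 128))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("./onnx_models/model.plan", "wb") as f:
    f.write(serialized_engine)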

You then need to update your config.pbtxt files in the STS and tokenizer folders and replace all TYPE_INT64 tensor types by TYPE_INT32. In the STS configuration file, replace platform: "onnxruntime_onnx" by platform: "tensorrt_plan". Finally, convert the numpy tensors to int32 in the tokenizer Python code, like below (notice the astype()):

input_ids = pb_utils.Tensor("INPUT_IDS", tokens['input_ids'].astype(np.int32))
attention = pb_utils.Tensor("ATTENTION", tokens['attention_mask'].astype(np.int32))

And you are done!

Comments
  • Support for large models (external data format)

    Support for large models (external data format)

    This PR closes #59.

    Changelog:

    • Refactored dockerfile and fixed dependencies to cope with python3: /root/gpgpu/MachineLearning/myelin/src/compiler/optimizer/reshape_ppg.cpp:950: void myelin::ir::reshape_ppg_t::transform_op(myelin::ir::bb_t*, myelin::ir::operation_t*): Assertion `op->outputs()[0]->dimensions().size() == 3' failed.
    • Bumped patch version
    • Added --fast argument to skip the fp16 conversion (saving GPU memory)
    • Updated logging (increased default verbosity for better understandability)
    • Added external data path for tensorrt to cope with models > 2G
    • Moved ONNX export post Pytorch benchmark to do conversion on CPU only (for larger models)
    bug documentation 
    opened by oborchers 16
  • Dynamic batching does not give better latency for Roberta running on TensorRT.

    Dynamic batching does not give better latency for Roberta running on TensorRT.

    Hi, I used your build_engine API to convert the Roberta model. While building, if I use a constant batch size for input_shapes, i.e. (min, optimal, max) -> (1, 1, 1) or (4, 4, 4), the model yields good results (faster than ort and torch).

    But when I convert it with dynamic batch size i.e. (min, optimal, max) -> (1, 4, 4), the model performs really slow compared to ort or torch.

    code to understand the problem better:

    # fast inference but constrained to use always 4 batches during inferencing
    tensor_shapes = list(zip([4, 4, 4], [1, 128, 128]))
    
    # slow inference
    tensor_shapes = list(zip([1, 4, 4], [1, 128, 128]))
    
    engine: ICudaEngine = build_engine(
        runtime=runtime,
        onnx_file_path=onnx_model_path,
        logger=trt_logger,
        min_shape=tensor_shapes[0],
        optimal_shape=tensor_shapes[1],
        max_shape=tensor_shapes[2],
        workspace_size=workspace_size * 1024**3,
        fp16=not quantization,
        int8=quantization,
        profiling=True,
    )
    
    save_engine(engine=engine, engine_file_path=tensorrt_path)
    

    the complete build and inference logs for slow inference case (when converting with dynamic batch)

    [06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +312, GPU +0, now: CPU 3789, GPU 2470 (MiB)
    [06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 3790, GPU 2470 (MiB)
    [06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 3790 MiB, GPU 2470 MiB
    [06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 3924 MiB, GPU 2504 MiB
    [06/02/2022-03:19:09] [TRT] [I] parsing TensorRT model
    [libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
    [libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1418322027
    [06/02/2022-03:19:22] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
    [06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
    [06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
    [06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
    [06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +512, GPU +226, now: CPU 5802, GPU 2730 (MiB)
    [06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +116, GPU +52, now: CPU 5918, GPU 2782 (MiB)
    [06/02/2022-03:19:43] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
    [06/02/2022-03:19:43] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
    [06/02/2022-03:19:43] [TRT] [W]  (# 1 (SHAPE input_ids))
    [06/02/2022-03:19:43] [TRT] [W]  (# 0 (SHAPE attention_mask))
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    [06/02/2022-03:25:32] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
    [06/02/2022-03:25:32] [TRT] [W]  (# 1 (SHAPE input_ids))
    [06/02/2022-03:25:32] [TRT] [W]  (# 0 (SHAPE attention_mask))
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    [06/02/2022-03:30:10] [TRT] [I] Detected 2 inputs and 1 output network tensors.
    [06/02/2022-03:30:10] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
    [06/02/2022-03:30:10] [TRT] [W]  (# 1 (SHAPE input_ids))
    [06/02/2022-03:30:10] [TRT] [W]  (# 0 (SHAPE attention_mask))
    [06/02/2022-03:30:32] [TRT] [I] Total Host Persistent Memory: 208
    [06/02/2022-03:30:32] [TRT] [I] Total Device Persistent Memory: 0
    [06/02/2022-03:30:32] [TRT] [I] Total Scratch Memory: 442827264
    [06/02/2022-03:30:32] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 774 MiB, GPU 2058 MiB
    [06/02/2022-03:30:32] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.038945ms to assign 4 blocks to 4 nodes requiring 443041280 bytes.
    [06/02/2022-03:30:32] [TRT] [I] Total Activation Memory: 443041280
    [06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5993, GPU 4298 (MiB)
    [06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 5993, GPU 4306 (MiB)
    [06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +1353, now: CPU 0, GPU 1353 (MiB)
    [06/02/2022-03:30:33] [TRT] [I] Loaded engine size: 1364 MiB
    [06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7354, GPU 4282 (MiB)
    [06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7355, GPU 4290 (MiB)
    [06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 1352 (MiB)
    [06/02/2022-03:30:38] [TRT] [I] Loaded engine size: 1364 MiB
    [06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7366, GPU 5636 (MiB)
    [06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7367, GPU 5644 (MiB)
    [06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 2704 (MiB)
    [06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 6002, GPU 5636 (MiB)
    [06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6002, GPU 5644 (MiB)
    [06/02/2022-03:30:43] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +423, now: CPU 0, GPU 3127 (MiB)
    
    latencies in ms
    --------------------------------------------------
    Pytorch 
    --------------------------------------------------
    [93.5968, 94.0308, 94.8224, 93.6746, 94.5972, 94.0188, 92.3105, 93.6535, 92.4908, 91.4413]
    --------------------------------------------------
    Onnxruntime 
     --------------------------------------------------
    [81.445, 81.3684, 80.2145, 81.5339, 82.9578, 83.6845, 83.6738, 82.6652, 81.5462, 82.8237]
    --------------------------------------------------
    TensorRT (FP16) 
     --------------------------------------------------
    [426.353, 425.1992, 426.0317, 425.8226, 426.8828, 428.0485, 426.3119, 426.4556, 425.4863, 426.0393]
    --------------------------------------------------
    

    Is this the expected behavior?

    I want to convert the model to use dynamic batches. When inferencing, the model should be able to handle a variable batch size and perform faster. How can I achieve that?

    Any help would be greatly appreciated, thank you in advance.

    bug 
    opened by Ki6an 12
  • Optimizations for T0

    Optimizations for T0

    I'm trying to replicate the T5 ONNX optimization notebook (the latest version, on the feat/t5_3b branch), but for T0_3B (which is itself a derivative of T5, with a slightly different config and no tie_word_embeddings).

    I installed ONNX runtime from source as described in the notebook.

    The only changes I made to the notebook are replacing "t5-3b" with "bigscience/T0_3B", and commenting out out_dec["last_hidden_state"] = out_dec["last_hidden_state"] * (pytorch_model.model_dim**-0.5) in the ExportT5 class, as T0 does not use tie word embeddings.

    However, the notebook fails on dec_if_ort_model = create_model_for_provider(dec_if_model_path, "CUDAExecutionProvider", log_severity=3), with the error: Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from ./test-dec-if/model.onnx failed:This is an invalid model. Error: the graph is not acyclic.

    Shouldn't T0 work because it is essentially T5? Your help would be greatly appreciated @pommedeterresautee. Thanks!

    opened by michaelroyzen 10
  • WIP - Support Token Classification

    WIP - Support Token Classification

    I think this makes it so TD can handle TokenClassification.

    However, the model I picked to test with seems to not convert well; both the ONNX and TensorRT conversions fail on the assert np.allclose, and I am not sure what this means...

    I am testing it with

    $ python src/transformer_deploy/convert.py -m dslim/bert-large-NER --backend onnx --seq-len 8 128 256 --batch-size 1 1 1 --task=TokenClassification --verbose

    Also, sorry for all the commits, I can squash them on my fork and make it clean later, I just wanted to know if you had any idea why this was failing.

    opened by sam-writer 9
  • got error in optimize onnx when ran gpt2 file from demo/generative-model

    got error in optimize onnx when ran gpt2 file from demo/generative-model

    I am getting an error when running this part of the code:

    logging.basicConfig()
    logging.getLogger().setLevel(logging.INFO)
    num_attention_heads, hidden_size = get_model_size(path=model_name)
    optimize_onnx(
        onnx_path="test-gpt2.onnx",
        onnx_optim_model_path="test-gpt2-opt.onnx",
        fp16=True,
        use_cuda=True,
        num_attention_heads=num_attention_heads,
        hidden_size=hidden_size,
        architecture='gpt2'
    )

    INFO:fusion_base:Fused LayerNormalization count: 25
    INFO:fusion_base:Fused FastGelu count: 12

    failed in shape inference <class 'AssertionError'>
    failed in shape inference <class 'AssertionError'>
    failed in shape inference <class 'AssertionError'>

    INFO:onnx_model:Graph pruned: 0 inputs, 0 outputs and 720 nodes are removed
    INFO:onnx_model_gpt2:postprocess: remove Reshape count:72
    INFO:fusion_base:Fused FastGelu(add bias) count: 12
    INFO:onnx_model_bert:opset verion: 13

    AssertionError                            Traceback (most recent call last)
    in ()
          9     num_attention_heads=num_attention_heads,
         10     hidden_size=hidden_size,
    ---> 11     architecture='gpt2'
         12 )

    7 frames

    /usr/local/lib/python3.7/dist-packages/onnxruntime/transformers/../tools/symbolic_shape_infer.py in _add_suggested_merge(self, symbols, apply)
        209
        210     def _add_suggested_merge(self, symbols, apply=False):
    --> 211         assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
        212         symbols = set(symbols)
        213         for k, v in self.suggested_merge.items():

    AssertionError:

    bug 
    opened by rohitmishra94 8
  • Support other tasks/architectures?

    Support other tasks/architectures?

    First off: thank you! This is a great project, I'm really grateful you released it publically.

    From what I can tell, this supports encoder-only architectures, and the Sequence Classification task (ex). Am I correct? If so, are there plans to support, or interest in supporting, other architectures (encoder/decoder, decoder-only) and/or tasks (Token Classification and Masked token prediction for encoder-only architectures, or Seq2SeqLM for the other architectures)?

    opened by sam-writer 7
  • GPT2 has slow inference

    GPT2 has slow inference

    Hello,

    your wrapper for gpt2 does not support 'past_key_values' as huggingface transformers natively does. I've seen your measurements in the gpt2 demo, and at least for pytorch they are not really correct: instead of simply calling the model with always the same input, you should call the generate method.

    I tried to run gpt2 in pytorch both on cpu and gpu (GPU: TESLA T4) with your sample text: "Here is some text to encode Hello World"

    here are my results (vanilla pytorch):
    gpu no cache: 14s/sequence
    gpu cache: 3.6s/sequence

    cpu no cache: 114s/sequence
    cpu cache: 9.8s/sequence

    For every measurement, the result is average out of ten runs of the generate method, I used number of beams=5

    when running greedy search, the difference is not so big, but still: cpu no cache: 29s, cpu cache: 4.8s

    CPU: Intel(R) Xeon(R) Platinum 8259CL CPU
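
    For context, a hedged sketch of the kind of measurement described above (vanilla Hugging Face, beam search with and without the KV cache; the generation length is an assumption):

    import time

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    inputs = tokenizer("Here is some text to encode Hello World", return_tensors="pt")

    def timed_generate(use_cache: bool) -> float:
        start = time.perf_counter()
        with torch.inference_mode():
            model.generate(**inputs, num_beams=5, max_new_tokens=64, use_cache=use_cache)
        return time.perf_counter() - start

    print("beam search with cache   :", timed_generate(True), "s")
    print("beam search without cache:", timed_generate(False), "s")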

    opened by kobzaond 6
  • Calibration failure occurred with no scaling factors detected

    Calibration failure occurred with no scaling factors detected

    Hey,

    first of all, thanks a lot for your great work. This repo was already a great help to me.

    With your quantization update for INT8, however, I ran into a problem. As soon as I activate --quantization, I get the following error:

    [01/14/2022-11:18:37] [TRT] [W] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
    [01/14/2022-11:18:37] [TRT] [E] 4: [standardEngineBuilder.cpp::initCalibrationParams::1402] Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)
    [01/14/2022-11:18:37] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
    
    Traceback (most recent call last):
      File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 326, in <module>
        entrypoint()
      File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 322, in entrypoint
        main(commands=args)
      File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 216, in main
        engine: ICudaEngine = build_engine(
      File "/data/repos/transformer-deploy/src/transformer_deploy/backends/trt_utils.py", line 181, in build_engine
        engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
    TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
        1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine
    
    Invoked with: <tensorrt.tensorrt.Runtime object at 0x7feb14128e30>, None
    

    The problem in the traceback is then just that the trt_engine will be None. I don't get any other warnings or errors, so I'm a bit at a loss. I've tried with distilroberta-base and also with bert-base-uncased, but I get the same error each time. Did you, by any chance, run into the same problem at some point in time or do you see what the issue may be?

    Thanks a lot in advance!

    opened by v1nc3nt27 6
  • Failed to load private model

    Failed to load private model

    Hi,

    I tried to convert a private model of sentence-transformer on the Hugging Face Hub:

    docker run -it --rm --gpus all \
        -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
        bash -c "cd /project && \
        convert_model -m \"Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2\" \
        --backend tensorrt onnx \
        --task embedding \
        --seq-len 16 128 128 \
        --auth-token XXX"
    

    However, the download of config.json file failed with the following message:

    =============================
    == Triton Inference Server ==
    =============================
    
    NVIDIA Release 22.01 (build 31237563)
    
    Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
    
    Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
    
    This container image and its contents are governed by the NVIDIA Deep Learning Container License.
    By pulling and using the container, you accept the terms and conditions of this license:
    https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
    
    Downloading: 100%|██████████| 451/451 [00:00<00:00, 650kB/s]
    Downloading: 100%|██████████| 4.83M/4.83M [00:02<00:00, 2.48MB/s]
    Downloading: 100%|██████████| 16.3M/16.3M [00:07<00:00, 2.41MB/s]
    Downloading: 100%|██████████| 280/280 [00:00<00:00, 425kB/s]
    401 Client Error: Unauthorized for url: https://huggingface.co/Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2/resolve/main/config.json
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 585, in _get_config_dict
        resolved_config_file = cached_path(
      File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 1846, in cached_path
        output_path = get_from_cache(
      File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 2050, in get_from_cache
        _raise_for_status(r)
      File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 1977, in _raise_for_status
        request.raise_for_status()
      File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 960, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2/resolve/main/config.json
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/local/bin/convert_model", line 8, in <module>
        sys.exit(entrypoint())
      File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 357, in entrypoint
        main(commands=args)
      File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 152, in main
        model_config: PretrainedConfig = AutoConfig.from_pretrained(pretrained_model_name_or_path=commands.model)
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/configuration_auto.py", line 612, in from_pretrained
        config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 537, in get_config_dict
        config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 618, in _get_config_dict
        raise EnvironmentError(
    OSError: We couldn't connect to 'https://huggingface.co/' to load this model and it looks like Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2 is not the path to a directory conaining a config.json file.
    Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
    

    Any advice?

    Thanks!

    opened by Matthieu-Tinycoaching 5
  • 'assert num_heads > 0' error with DistilBert

    'assert num_heads > 0' error with DistilBert

    I get the following error when I try to optimize distilbert:

    AssertionError                            Traceback (most recent call last)
    <timed eval> in <module>
    
    /opt/conda/lib/python3.7/site-packages/transformer_deploy/convert.py in main(input_args)
        245             onnx_path=onnx_model_path,
        246             onnx_optim_fp16_path=onnx_optim_fp16_path,
    --> 247             use_cuda=True,
        248         )
        249         onnx_model = create_model_for_provider(path=onnx_optim_fp16_path, provider_to_use="CUDAExecutionProvider")
    
    /opt/conda/lib/python3.7/site-packages/transformer_deploy/backends/ort_utils.py in optimize_onnx(onnx_path, onnx_optim_fp16_path, use_cuda)
         72         num_heads=0,  # automatic detection don't work with opset 13
         73         hidden_size=0,  # automatic detection
    ---> 74         optimization_options=optimization_options,
         75     )
         76 
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/optimizer.py in optimize_model(input, model_type, num_heads, hidden_size, optimization_options, opt_level, use_gpu, only_onnxruntime)
        289 
        290     if not only_onnxruntime:
    --> 291         optimizer.optimize(optimization_options)
        292 
        293     # Remove the temporary model.
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/onnx_model_bert.py in optimize(self, options, add_dynamic_axes)
        317             if options is not None:
        318                 self.attention_mask.set_mask_format(options.attention_mask_format)
    --> 319             self.fuse_attention()
        320 
        321         self.fuse_shape()
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/onnx_model_bert.py in fuse_attention(self)
         52 
         53     def fuse_attention(self):
    ---> 54         self.attention_fusion.apply()
         55 
         56     def fuse_gelu(self):
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_base.py in apply(self)
         41                     raise Exception("Can not find node in any graphs")
         42                 self.this_graph_name = graph.name
    ---> 43                 self.fuse(node, input_name_to_nodes, output_name_to_node)
         44 
         45         op_list = [node.op_type for node in self.nodes_to_add]
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_attention.py in fuse(self, normalize_node, input_name_to_nodes, output_name_to_node)
        444             new_node = self.create_attention_node(mask_index, matmul_q, matmul_k, matmul_v, add_q, add_k, add_v,
        445                                                   q_num_heads, self.hidden_size, root_input,
    --> 446                                                   attention_last_node.output[0], add_qk_str)
        447             if new_node is None:
        448                 return
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_attention.py in create_attention_node(self, mask_index, q_matmul, k_matmul, v_matmul, q_add, k_add, v_add, num_heads, hidden_size, input, output, add_qk_str)
        161             Union[NodeProto, None]: the node created or None if failed.
        162         """
    --> 163         assert num_heads > 0
        164 
        165         if hidden_size > 0 and (hidden_size % num_heads) != 0:
    
    AssertionError: 
    

    While trying to resolve the issue, I observed that it did not occur when optimizer from onnxruntime-tools was used with opt_level 99 (instead of the one in onnxruntime.transformers). But the code then threw Exceptions due to some skip layer normalization issues.
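
    For what it's worth, one hedged workaround sketch is to pass the head count and hidden size explicitly instead of relying on automatic detection (the values below are for distilbert-base-uncased and are assumptions, not a confirmed fix):

    from onnxruntime.transformers.optimizer import optimize_model

    optimized = optimize_model(
        "model.onnx",
        model_type="bert",
        num_heads=12,     # distilbert-base-uncased
        hidden_size=768,  # distilbert-base-uncased
        use_gpu=True,
    )
    optimized.convert_float_to_float16()
    optimized.save_model_to_file("model-optimized-fp16.onnx")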

    opened by vishalsrao 5
  • Unable to install transformer-deploy module

    Unable to install transformer-deploy module

    Any support would be appreciated:

    When running demo/torchdynamo/benchmark.ipynb, specifically this cell (pasted code), I run into the error below.

    from typing import Dict
    
    import numpy as np
    import torch
    from onnxruntime import GraphOptimizationLevel
    
    from transformers import AutoModel, PreTrainedModel
    from transformer_deploy.backends.ort_utils import convert_fp16
    from transformer_deploy.backends.onnx_utils import save_onnx
    
    from dynamo_utils import (
        benchmark,
        check_output,
        get_dynamo_optimizer,
        get_onnx_inference,
        get_pytorch_inference,
        get_pytorch_input,
        plot_benchmarks,
        print_pytorch_profile,
        get_tensorrt_inference,
        seq_lengths,
    )
    
    import gc
    import tensorrt as trt
    from tensorrt.tensorrt import ICudaEngine, Logger, Runtime
    import onnx
    from transformer_deploy.backends.trt_utils import build_engine, save_engine
    
    
    ModuleNotFoundError                       Traceback (most recent call last)
    Cell In [3], line 8
          5 from onnxruntime import GraphOptimizationLevel
          7 from transformers import AutoModel, PreTrainedModel
    ----> 8 from transformer_deploy.backends.ort_utils import convert_fp16
          9 from transformer_deploy.backends.onnx_utils import save_onnx
         11 from dynamo_utils import (
         12     benchmark,
         13     check_output,
        (...)
         21     seq_lengths,
         22 )
    ModuleNotFoundError: No module named 'transformer_deploy'
    
    opened by elvinagam 4
  • Question-Answering example not working for batch_size > 1

    Question-Answering example not working for batch_size > 1

    I'm running demo/question-answering/triton_client.py from the examples directory. The script returns expected result with batch_size=1. However, if you make the batch_size > 1 in this line, it outputs only the result of the first element in the batch and other elements are ignored.

    I saw #84 and #106 about the question-answering example and batch_size, but I don't think they are related to this. The Triton server does not yield any errors.

    Am I missing something here?

    opened by lakshaykc 0
  • Support for constrained beam-search in T5

    Support for constrained beam-search in T5

    HF T5 model (actually seq2seq model in general) supports complex decoding schemes such as constrained beam search https://huggingface.co/blog/constrained-beam-search. In my use case, I just really need the simplest constrained beam search where decoded sequences have to belong to a pre-defined Trie. This can be done via https://huggingface.co/docs/transformers/internal/generation_utils#transformers.PrefixConstrainedLogitsProcessor

    Is this possible for transformer-deploy ?
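
    For reference, in plain Hugging Face transformers this kind of constraint is expressed through prefix_allowed_tokens_fn (a hedged sketch, independent of transformer-deploy's own generation code; the model name and toy constraint are illustrative):

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # toy constraint: allow only a small vocabulary subset at every decoding step;
    # a real use case would walk a pre-built Trie of valid sequences instead
    allowed_ids = tokenizer("yes no maybe", add_special_tokens=False).input_ids

    def prefix_allowed_tokens_fn(batch_id, generated_ids):
        return allowed_ids

    inputs = tokenizer("translate English to German: hello", return_tensors="pt")
    output = model.generate(
        **inputs,
        num_beams=4,
        max_new_tokens=5,
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))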

    opened by junwang-wish 0
  • Attempting to run T5 ORT model in Triton inference server

    Attempting to run T5 ORT model in Triton inference server

    Hi there,

    Thanks again for this library!

    We're trying to convert a fine-tuned T5 model to ONNX and run it in Triton. We've managed to convert the model to ONNX and use the T5 notebook guide to run the model just fine in python.

    But trying to get it to run in Triton has been a challenge. In particular, we're not sure how to get past_key_values to be passed through in Triton. We have the decoder config as follows:

    name: "t5-dec-if-node_onnx_model"
    max_batch_size: 0
    platform: "onnxruntime_onnx"
    default_model_filename: "model.bin"
    input [
        {
            name: "input_ids"
            data_type: TYPE_INT32
            dims: [ -1, -1 ]
        },
        {
            name: "encoder_hidden_states"
            data_type: TYPE_FP32
            dims: [ -1, -1, 2048 ]
        },
        {
            name: "enable_cache"
            data_type: TYPE_BOOL
            dims: [ 1 ]
        },
        
            {
                name: "past_key_values.0.decoder.key"
                data_type: TYPE_FP32
                dims: [-1, 32, -1, 64]
            },
            {
                name: "past_key_values.0.decoder.value"
                data_type: TYPE_FP32
                dims: [-1, 32, -1, 64]
            },
            {
                name: "past_key_values.0.encoder.key"
                data_type: TYPE_FP32
                dims: [-1, 32, -1, 64]
            },
            {
                name: "past_key_values.0.encoder.value"
                data_type: TYPE_FP32
                dims: [-1, 32, -1, 64]
            }
         ...
    ]
    output [
        {
            name: "logits"
            data_type: TYPE_FP32
            dims: [ -1, -1, 32128 ]
        }
    ]
    instance_group [
        {
          count: 1
          kind: KIND_GPU
        }
    ]
    

    And when we do the following:

    input_1 = tritonclient.http.InferInput(name="input_ids", shape=(1, 24), datatype="INT32")
    input_2 = tritonclient.http.InferInput(name="encoder_hidden_states", shape=(1, 24, 2048), datatype="FP32")
    input_3 = tritonclient.http.InferInput(name="enable_cache", shape=(1, ), datatype="BOOL")
    
    input_1.set_data_from_numpy(input_ids)
    input_2.set_data_from_numpy(encoder_hidden_states)
    input_3.set_data_from_numpy(np.asarray([True]))
    
    result = triton_client.infer(
        model_name='t5-dec-if-node_onnx_model', 
        inputs=[input_1, input_2, input_3], 
        outputs=[tritonclient.http.InferRequestedOutput(name="logits", binary_data=False)]
    )
    

    We get this error:

    InferenceServerException: [request id: <id_unknown>] expected 99 inputs but got 3 inputs for model 't5-dec-if-node_onnx_model'
    

    Any idea how we can fix this?

    opened by samiur 1
  • Two GPU are slower than one

    Two GPU are slower than one

    Hi, I run the Triton web server on two NVIDIA RTX 3090 Ti GPUs with --shm-size 20g. When I do inference, I get a time near 1.56s. But if I run the web server with only one GPU (--gpus '"device=0"'), I get a time near 860ms. The length of the input sequence was 256 tokens. I optimized GPT2-medium with your script.

    convert_model -m gpt2-medium \
        --backend tensorrt onnx \
        --seq-len 32 512 512 \
        --task text-generation --atol=2"
    
    opened by OleksandrKorovii 0
  • Tensorrt engine

    Tensorrt engine

    I tried running TRT based-off three methods:

    1. python src/transformer-deploy/convert.py
    2. existing docker image
    3. build docker image from repo

    In all three instances, I got back the same response while running TRT backend.

    The command I have been trying to run (docker for example):

    docker run -it --rm --gpus all -v $PWD:/project ghcr.io/els-rd/transformer-deploy:latest bash -c "cd /project && \
      convert_model -m \"sentence-transformers/multi-qa-distilbert-cos-v1\" \
      --backend tensorrt onnx \
      --seq-len 128 128 256 \
      --batch-size 1 32 300"
    
    

    When I pass only 'onnx' as the backend param, everything runs pretty smoothly. But I face issues with the 'tensorrt' backend.

    [11/29/2022-10:58:17] [TRT] [E] 2: [optimizer.cpp::getFormatRequirements::2945] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. no supported formats)
    [11/29/2022-10:58:17] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
    Traceback (most recent call last):
      File "/usr/local/bin/convert_model", line 8, in <module>
        sys.exit(entrypoint())
      File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 417, in entrypoint
        main(commands=args)
      File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 308, in main
        engine: ICudaEngine = build_engine(
      File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/trt_utils.py", line 206, in build_engine
        engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
    TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
        1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine
    
    Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f88c85c12b0>, None
    free(): invalid pointer
    
    

    Would be great if I could have a workaround for this.

    Versions: Python: 3.8.15 transformers-deploy: 0.5.3 TensorRT: 8.4.1.5 Onnxruntime (GPU): 1.12.0 transformers: 4.24.0

    opened by imsiddhant07 1
  • Token type ids bug

    Token type ids bug

    Some models don't use token_type_ids in the forward pass. E.g. deberta has type_vocab_size=0 as a default value.

    What happens is the model ignores token_type_ids (https://github.com/huggingface/transformers/blob/bac2d29a802803a7f2db8e8597a2ec81730afcc9/src/transformers/models/deberta/modeling_deberta.py#L810)

    However, tokenizer doesn't know about this and token_type_ids is still in tokenizer.model_input_names.
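
    A hedged illustration of the mismatch (hypothetical check only, not the fix implemented by this PR):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base-mnli")
    print(tokenizer.model_input_names)   # still lists 'token_type_ids'

    # one possible workaround on the tokenizer side: drop the unused tensor
    encoded = tokenizer("some text", return_token_type_ids=False)
    print(list(encoded.keys()))          # ['input_ids', 'attention_mask']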

    This mismatch leads to

    docker run -it --rm --gpus all \
      -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.3 \
      bash -c "cd /project && \
        convert_model -m \"microsoft/deberta-base-mnli\" \
        --backend onnx \
        --seq-len 16 128 128"
    
    docker run -itd --rm --gpus '"device=3"' -p8000:8000 --shm-size 256m \
      -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
      bash -c "pip install transformers && tritonserver --model-repository=/models"
    

    And the triton inference server fails with

    I1123 13:49:09.821427 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fbf36000000' with size 268435456
    I1123 13:49:09.821983 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
    I1123 13:49:09.828017 1 model_repository_manager.cc:1206] loading: transformer_onnx_tokenize:1
    I1123 13:49:09.828058 1 model_repository_manager.cc:1206] loading: transformer_onnx_model:1
    I1123 13:49:09.830743 1 onnxruntime.cc:2458] TRITONBACKEND_Initialize: onnxruntime
    I1123 13:49:09.830786 1 onnxruntime.cc:2468] Triton TRITONBACKEND API version: 1.10
    I1123 13:49:09.830804 1 onnxruntime.cc:2474] 'onnxruntime' TRITONBACKEND API version: 1.10
    I1123 13:49:09.830814 1 onnxruntime.cc:2504] backend configuration:
    {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
    I1123 13:49:09.846110 1 onnxruntime.cc:2560] TRITONBACKEND_ModelInitialize: transformer_onnx_model (version 1)
    I1123 13:49:09.847111 1 onnxruntime.cc:666] skipping model configuration auto-complete for 'transformer_onnx_model': inputs and outputs already specified
    I1123 13:49:09.851839 1 onnxruntime.cc:2603] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_model_0 (GPU device 0)
    I1123 13:49:12.063610 1 onnxruntime.cc:2637] TRITONBACKEND_ModelInstanceFinalize: delete instance state
    I1123 13:49:12.063688 1 onnxruntime.cc:2583] TRITONBACKEND_ModelFinalize: delete model state
    E1123 13:49:12.063708 1 model_repository_manager.cc:1355] failed to load 'transformer_onnx_model' version 1: Invalid argument: unable to load model 'transformer_onnx_model', configuration expects 3 inputs, model provides 2
    None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
    I1123 13:49:13.744756 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_tokenize_0 (GPU device 0)
    None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
    I1123 13:49:14.298233 1 model_repository_manager.cc:1352] successfully loaded 'transformer_onnx_tokenize' version 1
    E1123 13:49:14.298380 1 model_repository_manager.cc:1559] Invalid argument: ensemble 'transformer_onnx_inference' depends on 'transformer_onnx_model' which has no loaded version
    I1123 13:49:14.298438 1 server.cc:559]
    +------------------+------+
    | Repository Agent | Path |
    +------------------+------+
    +------------------+------+
    
    I1123 13:49:14.298487 1 server.cc:586]
    +-------------+----------------------------------------------------------------+----------------------------------------------------------------+
    | Backend     | Path                                                           | Config                                                         |
    +-------------+----------------------------------------------------------------+----------------------------------------------------------------+
    | onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.s | {"cmdline":{"auto-complete-config":"true","min-compute-capabil |
    |             | o                                                              | ity":"6.000000","backend-directory":"/opt/tritonserver/backend |
    |             |                                                                | s","default-max-batch-size":"4"}}                              |
    |             |                                                                |                                                                |
    | python      | /opt/tritonserver/backends/python/libtriton_python.so          | {"cmdline":{"auto-complete-config":"true","min-compute-capabil |
    |             |                                                                | ity":"6.000000","backend-directory":"/opt/tritonserver/backend |
    |             |                                                                | s","default-max-batch-size":"4"}}                              |
    +-------------+----------------------------------------------------------------+----------------------------------------------------------------+
    
    I1123 13:49:14.298549 1 server.cc:629]
    +---------------------------+---------+---------------------------------------------------------------------------------------------------------+
    | Model                     | Version | Status                                                                                                  |
    +---------------------------+---------+---------------------------------------------------------------------------------------------------------+
    | transformer_onnx_model    | 1       | UNAVAILABLE: Invalid argument: unable to load model 'transformer_onnx_model', configuration expects 3 i |
    |                           |         | nputs, model provides 2                                                                                 |
    | transformer_onnx_tokenize | 1       | READY                                                                                                   |
    +---------------------------+---------+---------------------------------------------------------------------------------------------------------+
    
    I1123 13:49:14.351997 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
    I1123 13:49:14.352405 1 tritonserver.cc:2176]
    +----------------------------------+------------------------------------------------------------------------------------------------------------+
    | Option                           | Value                                                                                                      |
    +----------------------------------+------------------------------------------------------------------------------------------------------------+
    | server_id                        | triton                                                                                                     |
    | server_version                   | 2.24.0                                                                                                     |
    | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configu |
    |                                  | ration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace                         |
    | model_repository_path[0]         | /models                                                                                                    |
    | model_control_mode               | MODE_NONE                                                                                                  |
    | strict_model_config              | 0                                                                                                          |
    | rate_limit                       | OFF                                                                                                        |
    | pinned_memory_pool_byte_size     | 268435456                                                                                                  |
    | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                   |
    | response_cache_byte_size         | 0                                                                                                          |
    | min_supported_compute_capability | 6.0                                                                                                        |
    | strict_readiness                 | 1                                                                                                          |
    | exit_timeout                     | 30                                                                                                         |
    +----------------------------------+------------------------------------------------------------------------------------------------------------+
    
    I1123 13:49:14.352443 1 server.cc:260] Waiting for in-flight requests to complete.
    I1123 13:49:14.352453 1 server.cc:276] Timeout 30: Found 0 model versions that have in-flight inferences
    I1123 13:49:14.352460 1 model_repository_manager.cc:1230] unloading: transformer_onnx_tokenize:1
    I1123 13:49:14.352525 1 server.cc:291] All models are stopped, unloading models
    I1123 13:49:14.352534 1 server.cc:298] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
    I1123 13:49:15.352620 1 server.cc:298] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
    I1123 13:49:15.444143 1 model_repository_manager.cc:1335] successfully unloaded 'transformer_onnx_tokenize' version 1
    I1123 13:49:16.352790 1 server.cc:298] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
    error: creating server: Internal - failed to load all models
    

    The proposed solution fixes this bug

    opened by fursovia 2
Releases(v0.4.0)
  • v0.4.0(Feb 8, 2022)

    • add support for decoder based model (GPT-2) on both ONNX Runtime and TensorRT
    • refactor triton configuration generation (simplification)
    • add GPT-2 model documentation (notebook)
    • fix CPU quantization benchmark (was not using the quant model)
    • fix sentence transformers bug
  • v0.3.0(Dec 28, 2021)

    What's Changed

    • Update requirements_gpu.txt by @sam-writer in https://github.com/ELS-RD/transformer-deploy/pull/22
    • refactoring by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/27
    • add CPU inference support by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/28
    • Add QAT support to more models by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/29

    Full Changelog: https://github.com/ELS-RD/transformer-deploy/compare/v0.2.0...v0.3.0

  • v0.2.0(Dec 8, 2021)

    • support int-8 GPU quantization
    • add a tuto to perform quantization end to end
    • add QDQRoberta model
    • switch to ONNX opset 13
    • refactoring in the TensorRT engine creation
    • fix bugs
    • add auth token (for private HF repo)

    What's Changed

    • Update triton by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/11
    • fix README.md by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/13
    • Fix install errors by @sam-writer in https://github.com/ELS-RD/transformer-deploy/pull/20
    • Add auth token by @sam-writer in https://github.com/ELS-RD/transformer-deploy/pull/19
    • Support GPU INT-8 quantization by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/15

    New Contributors

    • @sam-writer made their first contribution in https://github.com/ELS-RD/transformer-deploy/pull/20

    Full Changelog: https://github.com/ELS-RD/transformer-deploy/compare/v0.1.1...v0.2.0

  • v0.1.1(Nov 24, 2021)

  • v0.1.0(Nov 23, 2021)

    • switch from a proof of concept to a library
    • add support for TensorRT Python API (for best performances)
    • improve documentation (separate Hugging Face Infinity thing from the doc, add benchmark, etc.)
    • fix issues with mixed precision
    • add license
    • add tests, Github actions, Makefile
    • change the way the Docker image is built
  • v0.0.1(Nov 8, 2021)

Owner
Lefebvre Sarrut Services
R&D department of the Lefebvre Sarrut group