BigDL: Distributed Deep Learning Framework for Apache Spark

Last update: Jan 09, 2023

Overview

BigDL: Distributed Deep Learning on Apache Spark

What is BigDL?

BigDL is a distributed deep learning library for Apache Spark. With BigDL, users can write their deep learning applications as standard Spark programs, which can run directly on top of existing Spark or Hadoop clusters. To make it easy to build Spark and BigDL applications, a high-level library, Analytics Zoo, is provided for end-to-end analytics + AI pipelines.

Rich deep learning support. Modeled after Torch, BigDL provides comprehensive support for deep learning, including numeric computing (via Tensor) and high-level neural networks. In addition, users can load pre-trained Caffe or Torch models into Spark programs using BigDL.
Extremely high performance. To achieve high performance, BigDL uses Intel MKL / Intel MKL-DNN and multi-threaded programming in each Spark task. Consequently, it is orders of magnitude faster than out-of-the-box open source Caffe, Torch or TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPUs). With the adoption of Intel DL Boost, BigDL significantly improves inference latency and throughput.
Efficient scale-out. BigDL can efficiently scale out to perform data analytics at "Big Data scale" by leveraging Apache Spark (a lightning-fast distributed data processing framework), as well as efficient implementations of synchronous SGD and all-reduce communications on Spark.
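The scale-out point above rests on synchronous SGD with all-reduce. Below is a minimal single-process sketch of that communication pattern, assuming plain Python lists stand in for per-worker gradients and that each worker "owns" one slice of the weights (a reduce-scatter followed by an all-gather). The function names are hypothetical, and this is an illustration of the idea rather than BigDL's implementation.

```python
# Simulated synchronous SGD with a sliced all-reduce across workers.
# Workers are plain Python lists here; no Spark is involved.


def split_slices(vec, n):
    # Split vec into n contiguous slices (trailing slices may be short or empty).
    size = (len(vec) + n - 1) // n
    return [vec[i * size:(i + 1) * size] for i in range(n)]


def sync_sgd_step(weights, worker_grads, lr):
    n = len(worker_grads)
    # Each worker splits its local gradient into n slices.
    sliced = [split_slices(g, n) for g in worker_grads]
    weight_slices = split_slices(weights, n)
    updated = []
    for i in range(n):
        # "Reduce-scatter": the owner of slice i averages slice i from all workers.
        avg = [sum(vals) / n for vals in zip(*(s[i] for s in sliced))]
        # The owner of slice i applies the SGD update to its slice of the weights.
        updated.append([w - lr * g for w, g in zip(weight_slices[i], avg)])
    # "All-gather": every worker reassembles the full updated weight vector.
    return [w for part in updated for w in part]
```

Because every worker applies the same averaged gradient, the result matches single-node SGD on the combined batch; slicing only spreads the aggregation work and network traffic across workers.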

Why BigDL?

You may want to write your deep learning programs using BigDL if:

You want to analyze a large amount of data on the same Big Data (Hadoop/Spark) cluster where the data are stored (in, say, HDFS, HBase, Hive, etc.).
You want to add deep learning functionalities (either training or prediction) to your Big Data (Spark) programs and/or workflow.
You want to leverage existing Hadoop/Spark clusters to run your deep learning applications, which can be then dynamically shared with other workloads (e.g., ETL, data warehouse, feature engineering, classical machine learning, graph analytics, etc.)

How to use BigDL?

For the technical overview of BigDL, please refer to the BigDL white paper
More information can be found at the BigDL project website:

https://bigdl-project.github.io/

In particular, you can check out the Getting Started page for a quick overview of how to use BigDL
For step-by-step deep learning tutorials on BigDL (using Python), you can check out the BigDL Tutorials project
You can join the BigDL Google Group (or subscribe to the mailing list) for more questions and discussions on BigDL
You can post bug reports and feature requests at the Issue Page
You may refer to Analytics Zoo for high level pipeline APIs, built-in deep learning models, reference use cases, etc. on Spark and BigDL

Citing BigDL

If you've found BigDL useful for your project, you can cite the paper as follows:

@inproceedings{SOCC2019_BIGDL,
  title={BigDL: A Distributed Deep Learning Framework for Big Data},
  author={Dai, Jason (Jinquan) and Wang, Yiheng and Qiu, Xin and Ding, Ding and Zhang, Yao and Wang, Yanzhang and Jia, Xianyan and Zhang, Li (Cherry) and Wan, Yan and Li, Zhichao and Wang, Jiao and Huang, Shengsheng and Wu, Zhongyuan and Wang, Yang and Yang, Yuhao and She, Bowen and Shi, Dongjie and Lu, Qi and Huang, Kai and Song, Guoqiong},
  booktitle={Proceedings of the ACM Symposium on Cloud Computing},
  publisher={Association for Computing Machinery},
  pages={50--60},
  year={2019},
  series={SoCC'19},
  doi={10.1145/3357223.3362707},
  url={https://arxiv.org/pdf/1804.05839.pdf}
}

Comments

How to save a BigDL model in the following example? Is there any API doc?

https://github.com/mrafayaleem/transfer-learning-bigdl/blob/master/transfer-learning-bigdl.ipynb

It was not saved as xx.model when I ran antbeeModel.save("/root/Desktop/model.model").
user issue

opened by 704572066 49
Example test on yarn
Change code to add a deploymode option. Reference: https://github.com/intel-analytics/BigDL/blob/branch-2.0/python/dllib/src/bigdl/dllib/models/inception/inception.py, https://github.com/intel-analytics/BigDL/blob/branch-2.0/python/dllib/src/bigdl/dllib/models/lenet/lenet5.py
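A deploy-mode option like the one requested could be sketched with argparse. The option name `--deploy-mode`, its choices, and `build_parser` are hypothetical; the referenced inception.py and lenet5.py scripts may spell them differently.

```python
import argparse


def build_parser():
    # Hypothetical option name and choices, for illustration only.
    parser = argparse.ArgumentParser(description="Example runner")
    parser.add_argument(
        "--deploy-mode",
        choices=["local", "yarn-client", "yarn-cluster"],
        default="local",
        help="Where the example runs; forwarded as the cluster mode.",
    )
    return parser
```

The parsed value (`args.deploy_mode`) would then be forwarded to context creation, e.g. `init_nncontext(cluster_mode=args.deploy_mode)`, assuming the installed BigDL version accepts these mode strings.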

Add test to python/dllib/src/bigdl/dllib/examples/run-example-tests-yarn-integration.sh

Run jenkins http://10.112.231.51:18888/view/ZOO-PR/job/ZOO-PR-Python-integration-test/

TODO: Move run-example-tests-yarn-integration.sh to python/dllib/examples/. (xin)

[ ] dllib examples

[ ] orca examples

dllib examples (using init_nncontext):

| Module | Example | Added | Client Mode | Cluster Mode |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| autograd | custom.py | Y | Succeed | Succeed |
| autograd | customloss.py | Y | Succeed | Succeed |
| nnframes | imageInference | Y | Succeed | Succeed |
| nnframes | imageTransferLearning | Y | Succeed | Succeed |

orca examples (using init_orca_context):

| Module | Example | Added | Client Mode | Cluster Mode |
| - | - | - | - | - |
| automl | autoestimator/autoestimator_pytorch.py | Y | Succeed | Succeed |
| automl | autoxgboost/AutoXGBoostClassifier.p (https://github.com/intel-analytics/analytics-zoo/issues/5049) | Y | Succeed | Succeed |
| automl | autoxgboost/AutoXGBoostRegressor.py (https://github.com/intel-analytics/analytics-zoo/issues/5049) | Y | Succeed | Succeed |
| data | spark_pandas.py | Y | Succeed | Succeed |
| bigdl | learn/bigdl/attention/transformer.py | Y | Succeed | Failed |
| bigdl | learn/bigdl/imageInference/imageInference.py | Y | Succeed | Failed |
| horovod | learn/horovod/pytorch_estimator.py | Y | Succeed | Succeed |
| horovod | simple_horovod_pytorch.py | Y | Succeed | |
| mxnet | learn/mxnet/lenet_mnist.py | Y | Succeed | |
| openvino | learn/openvino/predict.py | Not Added | | |
| pytorch | learn/pytorch/cifar10/cifar10.py | Y | Succeed | Failed |
| pytorch | learn/pytorch/fashion_mnist/fashion_mnist.py | Y | Succeed | Failed |
| pytorch | learn/pytorch/super_resolution/super_resolution.py | Y | Succeed | Failed |
| tf | learn/tf/basic_text_classification/basic_text_classification.py | Y | Succeed | Failed |
| tf | learn/tf/image_segmentation/image_segmentation.py | Y | Succeed | Failed |
| tf | learn/tf/inception/inception.py | Y | Succeed | Failed |
| tf | learn/tf/transfer_learning/transfer_learning.py | Y | Failed | Failed |
| tf2 | learn/tf2/resnet/resnet-50-imagenet.py | Y | Failed | Failed |
| tf2 | learn/tf2/yolov3/yoloV3.py | Y | Succeed | Succeed |
| ray_on_spark | ray_on_spark/parameter_server/async_parameter_server.py | Y | Succeed | Failed |
| ray_on_spark | ray_on_spark/parameter_server/sync_parameter_server.py | Y | Succeed | Succeed |
| ray_on_spark | ray_on_spark/rl_pong/rl_pong.py | Y | Succeed | Succeed |
| ray_on_spark | ray_on_spark/rllib/multiagent_two_trainers.py | Y | Succeed | Succeed |
| tfpark | tfpark/estimator/estimator_dataset.py | Y | Succeed | Failed |
| tfpark | tfpark/estimator/estimator_inception.py | Y | Succeed | Failed |
| tfpark | tfpark/estimator/pre-made-estimator.py | Not Added | | |
| tfpark | tfpark/gan/gan_train_and_evaluate.py | Y | Failed | |
| tfpark | tfpark/keras/keras_dataset.py | Y | Succeed | Failed |
| tfpark | tfpark/keras/keras_ndarray.py | Y | Succeed | Failed |
| tfpark | tfpark/tf_optimizer/evaluate.py | Y | Succeed | Failed |
| tfpark | tfpark/tf_optimizer/train.py | Y | Succeed | Failed |
| torchmodel | torchmodel/train/imagenet/main.py | Y | Succeed | Failed |
| torchmodel | torchmodel/train/mnist/main.py | Y | Succeed | Failed |
| torchmodel | torchmodel/train/resnet_finetune/resnet_finetune.py | Y | Succeed | Failed |
opened by qiuxin2012 29

The consistency of preprocessing between the NNFrame transform and a customized one

I did transfer learning for image classification and deployed the saved model in Cluster Serving. Then I found that the prediction results from the PySpark pipeline transform and from the Cluster Serving HTTP API inference are different. It turns out that the model inputs (i.e., the preprocessing outputs) are different. The NNFrame preprocessing is a ChainedPreprocessing, and my customized one seems to be the same as it. The original issue: https://github.com/intel-analytics/BigDL/issues/3764

transfer learning preprocessing code:

import os.path as osp

from pyspark import SparkConf
from bigdl.dllib.nncontext import init_nncontext
from bigdl.dllib.nn.criterion import *
from bigdl.dllib.nn.layer import *
from bigdl.dllib.nnframes import *
from bigdl.dllib.feature.image import *


def build_transforms(params):
    from bigdl.dllib.feature.common import ChainedPreprocessing
    transformer = ChainedPreprocessing(
        [RowToImageFeature(), ImageResize(256, 256), ImageCenterCrop(224, 224),
         ImageChannelNormalize(123.0, 117.0, 104.0), ImageMatToTensor(), ImageFeatureToTensor()])    
    return transformer

def build_classifier():
    ...

def train(task_path, dataset_path, params):
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType, StringType

    from bigdl.dllib.nnframes import NNImageReader
    from bigdl.dllib.utils.common import redire_spark_logs

    spark_conf = SparkConf().set("spark.driver.memory", "10g") \
        .set("spark.driver.cores", 4)

    sc = init_nncontext(spark_conf, cluster_mode="local")
    # redire_spark_logs("float", osp.join(task_path, 'out.log'))

    getFileName = udf(lambda row: osp.basename(row[0]), StringType())
    getLabel = udf(lambda row: 1.0 if 'ants' in row[0] else 2.0, DoubleType())

    trainingDF = NNImageReader.readImages(osp.join(dataset_path, 'train/*'), sc, resizeH=300, resizeW=300, image_codec=1)
    trainingDF = trainingDF.withColumn('filename', getFileName('image')).withColumn('label', getLabel('image'))

    validationDF = NNImageReader.readImages(osp.join(dataset_path, 'val/*'), sc, resizeH=300, resizeW=300, image_codec=1)
    validationDF = validationDF.withColumn('filename', getFileName('image')).withColumn('label', getLabel('image'))


    transformer = build_transforms(params)
    preTrainedNNModel = NNModel(Model.loadModel(osp.join(dataset_path,'analytics-zoo_resnet-50_imagenet_0.1.0.model')), transformer) \
        .setFeaturesCol("image") \
        .setPredictionCol("embedding")

    classifier = build_classifier()
    pipeline = Pipeline(stages=[preTrainedNNModel, classifier])

    antbeeModel = pipeline.fit(trainingDF)
    predictionDF = antbeeModel.transform(validationDF).cache()

customized preprocessing code:

import cv2
import numpy as np


def resize_img(img, target_size):
    # Resize to a square target_size x target_size image.
    img = cv2.resize(img, (target_size, target_size))
    return img


def center_crop_img(im, target_size):
    # OpenCV/NumPy images are laid out as (height, width, channels).
    h, w = im.shape[0], im.shape[1]
    th, tw = target_size, target_size
    assert (h >= th) and (w >= tw), \
            "image height({}) and width({}) should be larger than crop size({})".format(h, w, target_size)
    y1 = int(round((h - th) / 2.))
    x1 = int(round((w - tw) / 2.))
    im = im[y1:y1 + th, x1:x1 + tw]
    return im


def normalize_image(x, mean=(0., 0., 0.), std=(1.0, 1.0, 1.0)):
    '''Normalization.

    Args:
        x: input image.
        mean: mean value of the input image.
        std: standard deviation value of the input image.

    Returns:
        Normalized image.
    '''

    x = np.asarray(x, dtype=np.float32)
    if len(x.shape) == 4:
        for dim in range(3):
            x[:, :, :, dim] = (x[:, :, :, dim] - mean[dim]) / std[dim]
    if len(x.shape) == 3:
        for dim in range(3):
            x[:, :, dim] = (x[:, :, dim] - mean[dim]) / std[dim]
    return x


def preprocess(img):
    # cv2.imread returns BGR; reversing the channel axis gives RGB.
    img = img[..., ::-1]
    image_resize = resize_img(img, 256)
    ccrop_img = center_crop_img(image_resize, 224)
    # Subtract per-channel means (std defaults to 1.0).
    n_img = normalize_image(ccrop_img, [123.0, 117.0, 104.0])
    return n_img


if __name__ == '__main__':
    bee = cv2.imread("/root/Desktop/ws/datasets/D0002/val/bees/586041248_3032e277a9.jpg")
    t_bee = preprocess(bee)

result of NNFrame preprocessing:

-49.0	-48.0	-48.0	...	-12.0	-12.0	-11.0	
-50.0	-49.0	-49.0	...	-12.0	-12.0	-11.0	
-51.0	-50.0	-51.0	...	-12.0	-12.0	-10.0

result of customized preprocessing:

[-56., -38., -49.],
[-55., -37., -48.],
[-54., -36., -47.],
...,
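Before concluding that the values differ, it may be worth checking that the two printouts use the same layout: the NNFrame output above looks like rows of a single channel, while the customized output is a list of per-pixel triplets. The sketch below assumes that ImageMatToTensor emits channel-first (C, H, W) tensors, which we believe is its default, and that the tensor has already been converted to a NumPy array. `to_chw` and `max_abs_diff` are hypothetical helpers. The optional channel flip covers the fact that the customized code reverses OpenCV's BGR order to RGB, which the NNFrame pipeline may not do.

```python
import numpy as np


def to_chw(hwc):
    # Move the channel axis first: (H, W, C) -> (C, H, W).
    return np.transpose(hwc, (2, 0, 1))


def max_abs_diff(hwc_image, chw_tensor, flip_channels=False):
    # Compare an HWC image with a CHW tensor after aligning their layouts.
    chw = to_chw(hwc_image)
    if flip_channels:
        # Reverse the channel order, e.g. RGB <-> BGR.
        chw = chw[::-1]
    return float(np.max(np.abs(chw - chw_tensor)))
```

If the difference is large with and without `flip_channels`, the mismatch lies in the pixel values themselves (for example, resizing to 300x300 in NNImageReader before resizing again to 256x256) rather than in the printout layout.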

Image link: https://user-images.githubusercontent.com/23404868/148499487-d0b723c9-df53-4213-8779-65f1e6b8c5f3.jpg
model link: https://github.com/704572066/online-ai-platform/raw/master/model/20220106.model

user issue

opened by 704572066 23

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.

Hello, I'm using a Docker container built from this BigDL image. When I tried to collect the predictions using collect(), this error occurred: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.

this is the code:

import random

import numpy as np


def retrain(self, batch_size):
    minibatch = random.sample(self.experience_replay, batch_size)
    for state, action, reward, next_state in minibatch:
        state = np.asmatrix(state)
        next_state = np.asmatrix(next_state)
        print('state type', state)
        print('next state type', next_state)
        target = self.q_network.predict(state)
        p = target.collect()
        tt = self.target_network.predict(next_state)
        t = tt.collect()
        p[0][action] = reward + self.gamma * np.amax(t)
        self.q_network.fit(state, p, verbose=0)
    self.dqn_update_time -= 1
    print("***********", self.dqn_update_time, "************ ")
    if self.dqn_update_time == 0:
        self.dqn_update_time = 100  # dqn_time
        self.alighn_target_model()
        print('model updated')

this is the error:

/tmp/ipykernel_1032/2958540146.py in retrain(self, batch_size)
     71             print('next state type',next_state)
     72             target = self.q_network.predict(state)
---> 73             p= target.collect()
     74 
     75             tt = self.target_network.predict(next_state)

/opt/work/spark-3.1.2/python/lib/pyspark.zip/pyspark/rdd.py in collect(self)
    947         """
    948         with SCCallSiteSync(self.context) as css:
--> 949             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    950         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    951 

/usr/local/envs/bigdl/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/opt/work/spark-3.1.2/python/lib/pyspark.zip/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

/usr/local/envs/bigdl/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.

Could anyone explain why this error occurred and how to fix it?

user issue

opened by fatenlouati 22

Provide quantization for nano
Description

Propose to integrate quantization methods into Nano to reduce model size and accelerate inference. Neural Compressor provides a set of methods to quantize a model, which simplifies usage.
The details are discussed in the comments below.
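To make the size argument concrete, here is a minimal sketch of symmetric per-tensor int8 post-training quantization in plain Python. It is a generic illustration, not the Intel Neural Compressor API; `quantize_int8` and `dequantize_int8` are hypothetical names. Each float32 weight (4 bytes) becomes one int8 code (1 byte) plus a single shared float scale, so the weights shrink by roughly 4x.

```python
def quantize_int8(values):
    # Symmetric per-tensor quantization: one float scale, int8 codes in [-127, 127].
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    # Map int8 codes back to approximate float values.
    return [c * scale for c in codes]
```

The round trip loses at most half a quantization step (`scale / 2`) per value, which is why calibration (choosing a good scale) matters for accuracy.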

Related tasks

Intel Neural Compressor (INC)

Post-training Quantization

[ ] PyTorch quantization API.

PR: #3602

[ ] Keras quantization API.

Issue: #3651

PR: #3856

[ ] Examples

Quantization-Aware Training

[ ] TBD

OpenVINO

[ ] OpenVINO Support

NNCF

[ ] TBD

POT

[ ] TBD
opened by zhentaocc 20
How many epochs will be run in Bigdl.sh?

I have 64 cores, 1 node and 32g of driver memory allocated. I am running the training, and it has already passed 12 epochs. I wanted to check how many epochs this bigdl.sh will run, and for how long, with such a configuration.

opened by jaymahesh 20
Orca: Align the data analysis method of dataloader and dataframe
Description

When the model has only one input, which is a list or tuple of tensors, we should not unpack it into separate args.

Basic assumption: there are only three possible types of features: torch.Tensor, list/tuple and dict.

Feature and label types:

| | features | labels |
| ----- | ---- | ---- |
| DataFrame/XShards | a list or tuple of tensors, or a dict (XShards only) | a single tensor, tuple or list |
| RayDataset | a single tensor or a list of tensors | a single tensor or a list of tensors |
| dataloader | a single tensor, or a list of the user's inputs (all elements except the last one, which is the label) | the last element, which is the label |

When will features be a single tensor?

The dataloader yields features consisting of only one tensor.

Only one feature_column is specified by the user (DataFrame, XShards and RayDataset).

When will features be a list or tuple?

The dataloader yields features consisting of more than one tensor, or of objects that are not tensors.

More than one feature_column is specified by the user (DataFrame, XShards and RayDataset).

When will features be a dict? Only when the input is XShards of dictionaries.

1. Why the change?

In https://github.com/intel-analytics/BigDL/issues/5762, the model does take x, a list of two tensors, as its input:

def forward(self, x, bboxes=None):
    x = x[:]  # avoid pass by reference
    x = self.s1(x)

but our torchrunner will unpack it into two separate arguments: https://github.com/intel-analytics/BigDL/blob/affe54803c320afd4fc0631dc3fa02f8be1cfcdc/python/orca/src/bigdl/orca/learn/pytorch/training_operator.py#L279

2. User API changes

none

3. Summary of the change

~~before: output = self.model(*features)~~ ~~after: output = self.model(*features) if not isSingleListInput else self.model(features)~~

If the data is a PyTorch dataloader creator, reload_dataloader_creator will combine all elements except the label into a list; if the features consist of only one tensor, they remain unchanged:

def make_dataloader_list_wrapper(func):
    import torch

    def make_feature_list(batch):
        if func is not None:
            batch = func(batch)
        *features, target = batch
        if len(features) == 1 and torch.is_tensor(features[0]):
            features = features[0]
        return features, target

    return make_feature_list

The features are then parsed here:

features, target = batch
# Compute output.
with self.timers.record("fwd"):
    if torch.is_tensor(features):
        output = self.model(features)
    elif isinstance(features, (tuple, list)):
        output = self.model(*features)

This ensures consistency between *features and the user's input.

The current DataFrame, XShards and RayDataset logic is already correct, so we leave it unchanged.
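The packing rule above can be checked without PyTorch. This sketch uses a stand-in tensor type: `FakeTensor` and `is_tensor` are hypothetical placeholders for `torch.Tensor` and `torch.is_tensor`, and `record_args` is a dummy model that returns the positional arguments it received.

```python
class FakeTensor:
    # Hypothetical stand-in for torch.Tensor.
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return "FakeTensor({!r})".format(self.name)


def is_tensor(obj):
    # Hypothetical stand-in for torch.is_tensor.
    return isinstance(obj, FakeTensor)


def make_feature_list(batch):
    # All elements except the last are features; the last one is the label.
    *features, target = batch
    # A single tensor feature is passed through on its own.
    if len(features) == 1 and is_tensor(features[0]):
        features = features[0]
    return features, target


def call_model(model, features):
    # Mirrors the dispatch in the training operator.
    if is_tensor(features):
        return model(features)
    elif isinstance(features, (tuple, list)):
        return model(*features)
    raise TypeError("unsupported feature type: {}".format(type(features)))


def record_args(*args):
    # Dummy model: returns exactly what it was called with.
    return args
```

With this rule, a batch `(x, label)` calls `model(x)`, `(x, y, label)` calls `model(x, y)`, and `([x, y], label)` calls `model([x, y])`, so a single list-valued input reaches `forward` intact.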

4. How to test?

[ ] N/A

[x] Unit test

[ ] Application test

[ ] Document test

[ ] ...

orca
opened by leonardozcm 19
Spark 3 with Scala 2.12

Hi devs!

Not an issue but a question (feel free to add the label):

Given that Spark 3 was released a few months ago, do you have any plans to support it? Updating to Spark 3 also requires moving to Scala 2.12. I'm also wondering whether this Spark 3 deployment script can still work with the current version, which is still on Scala 2.11.

Thanks in advance.
user issue

opened by LorenzBuehmann 18
Add Back PR Template
Description

This PR adds a pull request template.

Motivation and Context

It is good practice to add a short, structured description to each submitted PR. This way, reviewers and future developers can better understand the changes and their context.

API Usage or Code Design.

The proposed PR template is presented as in this description.

Related Link or Issue

N/A

How was this PR tested?

N/A

[ ] jenkins passed (please provide link):

[ ] documentation build passed (please provide link):

License & Dependency

N/A
opened by yangw1234 17
Replace DL_ENGINE_TYPE env variable with property bigdl.engineType for now, and also make it default to mklblas

What changes were proposed in this pull request?

Replace DL_ENGINE_TYPE env variable with property bigdl.engineType for now, and also make it default to mklblas. This makes it work with the only value that's supported out of the box.
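Since the engine type becomes a JVM system property rather than an environment variable, it has to reach both the driver and executor JVMs. One way to do that uses Spark's standard `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` settings. A minimal sketch, where `engine_type_conf` is a hypothetical helper and `mklblas` is the default named in this PR:

```python
def engine_type_conf(engine_type="mklblas"):
    # Hypothetical helper: Spark settings that pass the JVM system property
    # bigdl.engineType to the driver and to every executor.
    opt = "-Dbigdl.engineType={}".format(engine_type)
    return {
        "spark.driver.extraJavaOptions": opt,
        "spark.executor.extraJavaOptions": opt,
    }
```

The pairs could be applied with `SparkConf().setAll(engine_type_conf().items())`. Note that these settings overwrite any existing extra Java options. In client mode, the driver JVM has already started when application code runs, so the driver option should instead be passed at launch, e.g. with `spark-submit --driver-java-options`.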

How was this patch tested?

Existing tests.

Related links or issues (optional)

This is a small subset of https://github.com/intel-analytics/BigDL/issues/788

opened by srowen 17
add backwardGraphPruning feature

What changes were proposed in this pull request?

Add a backwardGraph pruning feature for [[StaticGraph]]: remove from [[backwardGraph]] all pre-processing nodes whose element is an [[Operation]], or that depend only on [[Operation]]-based nodes during backward.

I think this feature will be useful when building a graph with numerous operations for data pre-processing. Furthermore, it makes it possible to treat modules with parameters as operations (such as nn.Abs), as long as they depend fully on operation nodes during backward.

And we are working on data pre-processing operations, mainly for data-mining. :)
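One reading of the pruning rule above can be sketched as a single pass over the graph in topological order. A node is marked prunable if its element is an Operation, or if it has inputs and every input is already prunable. The graph representation and `prunable_nodes` are hypothetical; this is not the StaticGraph code in the PR.

```python
def prunable_nodes(nodes, inputs, is_operation):
    """Return the names of nodes that could be dropped from the backward graph.

    nodes: node names in topological order (inputs before consumers).
    inputs: dict mapping a node name to the names of its input nodes.
    is_operation: dict mapping a node name to True if its element is an Operation.
    """
    pruned = set()
    for name in nodes:
        preds = inputs.get(name, [])
        if is_operation.get(name, False):
            # Operations do not take part in backward propagation.
            pruned.add(name)
        elif preds and all(p in pruned for p in preds):
            # A node fed only by prunable nodes is treated like an operation.
            pruned.add(name)
    return pruned
```

For example, in a graph where `decode` is an Operation feeding `abs`, and `abs` joins a real `features` input at `concat`, only `decode` and `abs` are pruned; `concat` still receives a gradient path from `features`.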

How was this patch tested?

unit test

opened by sperlingxx 16
Chronos: add forecaster alg choose guide and some cleaning for how to guide
Description

1. Why the change?

This is a "remake" for https://github.com/intel-analytics/BigDL/pull/5364, since previous PR has too many conflicts.

2. User API changes

nothing

3. Summary of the change

Forecaster algorithm selection guide: https://bigdl-junweid.readthedocs.io/en/forecaster-alg-choose/doc/Chronos/Howto/how_to_choose_forecasting_alg.html

Renewed how-to guide overview: https://bigdl-junweid.readthedocs.io/en/forecaster-alg-choose/doc/Chronos/Howto/index.html

4. How to test?

[ ] Document test

document Chronos
opened by TheaperDeng 0
Nano: fix keras onnx model output shape
Description

Fix the output shape of keras onnx model

1. Why the change?

tf.py_function(func, ..., Tout=tf.float32) converts the output of func into a single Tensor, which means that [t] and [[t]] (where t is a tf.Tensor) are both converted to t.

Moreover, if the original model returns a single tensor t, the onnxruntime model returns [t] after tracing.

Together, these two rules break the output shape when the original output is a single tensor, a list containing only one tensor, a list of lists containing only one tensor, and so on.

Before this PR, we simply checked whether the output was a list with only one element and, if so, returned its first element. We use this method to

However, this method cannot distinguish whether the original model returns t or [t] (the onnxruntime model returns [t] in both cases), and it does not handle the effect of tf.py_function.

This PR fixes both: it first saves the correct output shape, then converts the final output to that shape.
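The fix described above (record the original output structure, then reshape the runtime's output to match) can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's code: strings stand in for tf.Tensor values, the runtime is assumed to return its outputs as a flat list in left-to-right order, and `get_structure`, `flatten` and `restore` are hypothetical helpers. Tuples come back as lists in this sketch.

```python
def get_structure(output):
    # None marks a tensor leaf; a list of specs marks a (nested) list.
    if isinstance(output, (list, tuple)):
        return [get_structure(o) for o in output]
    return None


def flatten(output):
    # Collect the leaves in left-to-right order.
    if isinstance(output, (list, tuple)):
        return [leaf for o in output for leaf in flatten(o)]
    return [output]


def restore(flat_outputs, structure):
    # Rebuild the nested output from a flat list using the saved structure.
    it = iter(flat_outputs)

    def build(spec):
        if spec is None:
            return next(it)
        return [build(s) for s in spec]

    return build(structure)
```

Because the structure is captured from the original model before tracing, `t`, `[t]` and `[[t]]` each map back to their own shape even though the runtime returns `[t]` for all of them.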

2. User API changes

N/A

3. Summary of the change

N/A

4. How to test?

[ ] Unit test

Nano
opened by MeouSker77 0
chronos

Traceback (most recent call last):
  File "autolstm_nyc_taxi.py", line 21, in <module>
    from bigdl.chronos.data import get_public_dataset
ImportError: cannot import name 'get_public_dataset' from 'bigdl.chronos.data'
Chronos

opened by YUKUN-XIAO 12
Nano: fix documentation typos in tensorflow inference and quantization

Description

The quantization API of the TensorFlow inference engine has no "calib_dataset" keyword argument; it simply uses "x" instead.
Nano

opened by HensonMa 0
[Nano] Add a generalized how-to guide for accelerating PyTorch CV data processing pipelines
Description

Add a generalized how-to guide for accelerating PyTorch CV data processing pipelines

1. Why the change?

CV data processing pipeline acceleration is exactly the same for PyTorch and PyTorch Lightning applications, so there is no need for separate how-to guides in the PyTorch/PyTorch Lightning Training sections.

2. Summary of the change

Add a Nano how-to guide section "Preprocessing"

Add how to guide “How to accelerate a computer vision data processing pipeline” for PyTorch

Restyled quote blocks for better note/warning/related reading box styles

3. How to test?

[x] Document test: https://yuwentestdocs.readthedocs.io/en/nano-pytorch-cv-pipeline/doc/Nano/Howto/index.html

[x] Github Notebook preview: https://github.com/Oscilloscope98/BigDL/blob/nano-pytorch-cv-pipeline/python/nano/tutorial/notebook/preprocessing/pytorch/accelerate_pytorch_cv_data_pipeline.ipynb

[ ] Notebook test locally (conda create an empty environment with python=3.7)

document Nano
opened by Oscilloscope98 0

Releases(v2.1.0)

v2.1.0(Sep 28, 2022)
Highlights

Note: BigDL v2.1.0 has been updated to include functional and security updates. Users should update to the latest version.

Orca

Improve user experience and API consistency for Orca Estimators.

Support directly saving and loading the TensorFlow model format in the Orca TensorFlow2 Estimator.

Provide more examples (e.g. PyTorch brain image segmentation, XShards tutorials for distributed Python data processing), etc.

Support customized metrics in Orca PyTorch Estimator.

Nano

New inference optimization pipelines, with more optimization methods and a new InferenceOptimizer

More training optimization methods (bf16, channel last)

Add TorchNano support for PyTorch model customized training loop

Auto-scale learning rate for multi-instance training

Built-in AutoML support through hyperparameter optimization

Support a wide range of PyTorch (1.9-1.12) and TensorFlow (2.7-2.9) versions

DLlib

Add LightGBM support

Improve Keras-style model summary API

Add Python support for loading HDFS files

Chronos

Add new Autoformer (https://arxiv.org/abs/2106.13008) Forecaster and pipeline that are optimized on CPU

Tensorflow 2 support for LSTM, Seq2Seq, TCN and MTNet Forecasters

Add light-weight auto tuning (does not rely on Spark/Ray Tune)

Better support for distributed workflows (Spark DataFrame and distributed pandas processing)

Add more installation options to make the installation lighter

Friesian

Integration of DeepRec (https://github.com/alibaba/DeepRec) with Friesian.

Add more reference examples, e.g. multi-task recommendation, TFRS (https://www.tensorflow.org/recommenders) list-wise ranking, LightGBM training, etc.

Add a reference example for offline distributed similarity search (using FAISS)

More operations in FeatureTable (e.g. string embeddings with BERT, etc.).

PPML

Upgrade BigDL PPML on Gramine.

Improve the attestation and key management process

More Big Data frameworks on BigDL PPML (including Spark, Flink, Hive, HDFS, etc.)

Add PPMLContext API for encryption IO and KMS, supporting different file formats, encryption algorithms and KMS services

Support PSI, PyTorch NN, Keras NN and FGBoost (federated XGBoost) in VFL scenarios, as well as linear regression & logistic regression for VFL

v2.0.0(Mar 9, 2022)

Highlights

Note: BigDL v2.0.0 has been updated to include functional and security updates. Users should update to the latest version.
v0.13.0(Jul 9, 2021)

v0.12.2(Apr 21, 2021)

v0.12.1(Jan 5, 2021)

v0.11.1(Jan 5, 2021)

v0.10.0(Nov 5, 2019)
Highlights

Continue RNN optimization. We support both LSTM and GRU integration with MKL-DNN, which achieves a ~3x performance speedup

ONNX support. We support loading third party framework models via ONNX

Richer data preprocessing support and segmentation inference pipeline support

Details

[New Feature] Full MaskRCNN model support with data processing

[New Feature] Support variable-size Resize

[New Feature] Support batch input for region proposal

[New Feature] Support samples of different size in one minibatch

[New Feature] MAP validation method implementation

[New Feature] ROILabel enhancement to support both object detection and segmentation

[New Feature] Grey image support for segmentation

[New Feature] Add TopBlocks support for Feature Pyramid Networks (FPN)

[New Feature] GRU integration with MKL-DNN support

[New Feature] MaskHead support for MaskRCNN

[New Feature] BoxHead support for MaskRCNN

[New Feature] RegionalProposal support for MaskRCNN

[New Feature] Shape operation support for ONNX

[New Feature] Gemm operation support for ONNX

[New Feature] Gather operation support for ONNX

[New Feature] AveragePool operation support for ONNX

[New Feature] BatchNormalization operation support for ONNX

[New Feature] Concat operation support for ONNX

[New Feature] Conv operation support for ONNX

[New Feature] MaxPool operation support for ONNX

[New Feature] Reshape operation support for ONNX

[New Feature] Relu operation support for ONNX

[New Feature] SoftMax operation support for ONNX

[New Feature] Sum operation support for ONNX

[New Feature] Squeeze operation support for ONNX

[New Feature] Const operation support for ONNX

[New Feature] ONNX model loader implementation

[New Feature] RoiAlign layer support

[Enhancement] Align batch normalization layer between mklblas and mkl-dnn

[Enhancement] Python API enhancement to support nested list input

[Enhancement] Multi-model training/inference support with MKL-DNN

[Enhancement] BatchNormalization fusion with Scale

[Enhancement] SoftMax companion object supports no-argument initialization

[Enhancement] Python support for training with MKL-DNN

[Enhancement] Docs enhancement

[Bug Fix] Fix model version comparison

[Bug Fix] Fix graph backward bug for ParallelTable

[Bug Fix] Fix memory leak for training with MKL-DNN

[Bug Fix] Fix performance issue caused by denormal values during training

[Bug Fix] Fix SoftMax segmentation fault issue under MKL-DNN

[Bug Fix] Fix TimeDistributedCriterion python API inconsistent with Scala
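The BatchNormalization fusion with Scale listed above is a standard algebraic folding. Here is a minimal per-channel sketch, assuming inference-time BatchNormalization with fixed running statistics and no affine terms of its own, followed by a Caffe-style Scale layer (z = w * y + b). `fuse_bn_scale` is a hypothetical helper, and BigDL's actual fusion may fold the terms differently.

```python
import math


def fuse_bn_scale(mean, var, eps, scale_w, scale_b):
    # BatchNorm without affine terms: y = (x - mean) / sqrt(var + eps)
    # followed by Scale:               z = scale_w * y + scale_b
    # Folded into one per-channel op:  z = a * x + b
    a = [w / math.sqrt(v + eps) for w, v in zip(scale_w, var)]
    b = [sb - ai * m for sb, ai, m in zip(scale_b, a, mean)]
    return a, b
```

Expanding `a * x + b` gives `a * (x - mean) + scale_b`, which is exactly the unfused result, so the fused layer does one multiply-add per element instead of two separate layers.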

v0.9.0(Jul 22, 2019)
Highlights

Continue VNNI acceleration support: we add optimization for more CNN models, including object detection models, and enhance model scale generation support for VNNI.

Add attention-based model support: we add a Transformer implementation for both language models and translation models.

RNN optimization: we support LSTM integration with MKL-DNN, which achieves a ~3x performance speedup.

Details

[New Feature] Add attention layer support

[New Feature] Add FeedForwardNetwork layer support

[New Feature] Add ExpandSize layer support

[New Feature] Add TableOperation layer to support table calculation with different input sizes

[New Feature] Add LayerNormalization layer support

[New Feature] Add Transformer support for both language and translation models

[New Feature] Add beam search support in Transformer model

[New Feature] Add Layer-wise adaptive rate scaling optim method

[New Feature] Add LSTM integration with MKL-DNN support

[New Feature] Add dilated convolution integration with MKL-DNN support

[New Feature] Add parameter process for LarsSGD optim method

[New Feature] Support Affinity binding option with mkl-dnn

[Enhancement] Document enhancement for configuration and build

[Enhancement] Reflection enhancement to get default values for constructor parameters

[Enhancement] Use one AllReduceParameter for multi-optim method training

[Enhancement] CAddTable layer enhancement to support input expansion along specific dimension

[Enhancement] Resnet-50 preprocessing pipeline enhancement to replace RandomCropper with CenterCropper

[Enhancement] Calculate model scales for arbitrary mask

[Enhancement] Enable global average pooling

[Enhancement] Check input shape and underlying MKL-DNN layout consistency

[Enhancement] Threadpool enhancement to throw proper exception at executor runtime

[Enhancement] Support mkl-dnn format conversion from ntc to tnc

[Bug Fix] Fix backward graph generation topology ordering issue

[Bug Fix] Fix MemoryData hash code calculation

[Bug Fix] Fix log output for BCECriterion

[Bug Fix] Fix setting mask for container quantization

[Bug Fix] Fix validation accuracy issue when multiple executors run on the same worker

[Bug Fix] Fix setting mask for container quantization

[Bug Fix] Fix INT8 layer fusion between convolution with multi-group masks and BatchNormalization

[Bug Fix] Fix JoinTable scales generation issue

[Bug Fix] Fix CMul forward issue with special input format

[Bug Fix] Fix weights changing after model fusion

[Bug Fix] Fix SpatialConvolution primitive initialization issue
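The Layer-wise adaptive rate scaling (LARS) method added in this release gives each layer its own learning rate, scaled by the ratio of the layer's weight norm to its gradient norm. The following is a minimal plain-Python sketch of that per-layer rule, assuming the commonly published formulation with a trust coefficient `eta` and weight decay `wd`; `lars_update` and `l2_norm` are hypothetical helpers, not BigDL's LarsSGD API.

```python
import math
from typing import List


def l2_norm(values: List[float]) -> float:
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_update(weights: List[float], grads: List[float], base_lr: float,
                eta: float = 0.001, wd: float = 0.0) -> List[float]:
    """One SGD step for a single layer with a LARS-scaled learning rate.

    Hypothetical helper, not BigDL's LarsSGD implementation.
    local_lr = eta * ||w|| / (||g|| + wd * ||w||)
    w_new    = w - base_lr * local_lr * (g + wd * w)
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    denom = g_norm + wd * w_norm
    # In this sketch, fall back to a trust ratio of 1 when a norm is zero.
    local_lr = eta * w_norm / denom if w_norm > 0 and denom > 0 else 1.0
    return [w - base_lr * local_lr * (g + wd * w) for w, g in zip(weights, grads)]
```

Because the ratio is computed per layer, layers whose gradients are small relative to their weights take proportionally larger steps, which is what makes the method useful for very large batch sizes.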

v0.8.0(Mar 28, 2019)
Highlights

Add MKL-DNN Int8 support, especially VNNI acceleration. Low-precision inference significantly improves both latency and throughput

Add support for running MKL-BLAS models under MKL-DNN. We leverage MKL-DNN to speed up both training and inference for MKL-BLAS models

Add Spark 2.4 support. Our examples and APIs are fully compatible with Spark 2.4; we release the binary for Spark 2.4 together with the other Spark versions

Details

[New Feature] Add MKL-DNN Int8 support, especially for VNNI support

[New Feature] Add support for running MKL-BLAS models under MKL-DNN

[New Feature] Add Spark 2.4 support

[New Feature] Add auto fusion to speed up model inference

[New Feature] Memory reorder support for low precision inference

[New Feature] Add bytes support for DNN Tensor

[New Feature] Add SAME padding in MKL-DNN layers

[New Feature] Add combined (and/or) triggers for training completion

[Enhancement] Inception-V1 python training support enhancement

[Enhancement] Distributed Optimizer enhancement to support customized optimizer

[Enhancement] Add compute output shape for DNN supported layers

[Enhancement] New MKL-DNN computing thread pool

[Enhancement] Add MKL-DNN support for Predictor

[Enhancement] Documentation enhancement for Sparse Tensor, MKL-DNN support, etc.

[Enhancement] Add ceil mode for AvgPooling and MaxPooling layers

[Enhancement] Add binary classification support for DLClassifierModel

[Enhancement] Support conversion between NHWC and NCHW for memory reorder

[Bug Fix] Fix SoftMax layer with narrowed input

[Bug Fix] TensorFlow loader to support checking all data types

[Bug Fix] Fix Add operation bug to support double type when loading TensorFlow graph

[Bug Fix] Fix one-step weight update missing issue in validation during training

[Bug Fix] Fix Scala compiler security issue in 2.10 and 2.11

[Bug Fix] Fix model broadcast cache UUID issue

[Bug Fix] Fix predictor issue for batch size == 1
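The combined (and/or) triggers added in this release let training stop on a compound condition. A minimal sketch of the idea, assuming a trigger is modeled as a predicate over a training-state dict; `max_epoch`, `min_loss`, `trigger_and` and `trigger_or` are hypothetical names for illustration, not BigDL's Trigger API.

```python
from typing import Callable, Dict

# Hypothetical model of a trigger: a predicate over the current training state.
State = Dict[str, float]
Trigger = Callable[[State], bool]


def max_epoch(n: int) -> Trigger:
    """Fires once the epoch counter reaches n."""
    return lambda state: state["epoch"] >= n


def min_loss(threshold: float) -> Trigger:
    """Fires once the loss drops below threshold."""
    return lambda state: state["loss"] < threshold


def trigger_and(*triggers: Trigger) -> Trigger:
    """Fires only when every sub-trigger fires."""
    return lambda state: all(t(state) for t in triggers)


def trigger_or(*triggers: Trigger) -> Trigger:
    """Fires when any sub-trigger fires."""
    return lambda state: any(t(state) for t in triggers)


# Stop after 10 epochs, or earlier once the loss falls below 0.05.
end_when = trigger_or(max_epoch(10), min_loss(0.05))
```

Composing triggers as functions keeps each stop condition small and testable, and nesting `trigger_and`/`trigger_or` expresses arbitrary compound conditions.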

v0.7.0(Oct 13, 2018)
Highlights

MKL-DNN support enhancements, including training optimization, training support for more models and model serialization support

A new distributed optimizer for models powered by MKL-DNN. This optimizer can overlap training and communication during distributed training, which leads to better scalability across multiple nodes

Details

[New Feature] A new optim method, ParallelAdam, which leverages multi-threading capacity

[New Feature] Add a new validation method, HitRate, which is widely used in recommendation

[New Feature] Add a new validation method, NDCG, which is widely used in recommendation

[New Feature] Support communication priority when synchronizing parameters in distributed training

[New Feature] Support ModelBroadcast customization

[New Feature] Add a new distributed optimizer for models powered by MKL-DNN. This optimizer can overlap training and communication during distributed training, which leads to better scalability across multiple nodes

[API Change] Add batch size into the Python model.predict API

[Enhancement] Add MKL-DNN training example for LeNet

[Enhancement] Improve training performance by getting rid of narrowing gradients and zero gradients for models powered by MKL-DNN

[Enhancement] Add training example for VGG-16 based on MKL-DNN

[Enhancement] Support nested table in Graph output

[Enhancement] Enhance the thread pool to make it compatible with the MKL-DNN engine

[Enhancement] MKL-DNN model serialization support

[Enhancement] Add VGG-16 validation example

[Bug Fix] Fix JoinTable throwing exception during backward if batch size is changed

[Bug Fix] Change Reshape to InferReShape in ReshapeLoadTF

[Bug Fix] Fix splitBatch issue in Predictor, where the model has multiple Graphs and each Graph outputs a table

[Bug Fix] Fix MKL-DNN inference performance issue by not copying weights at inference

[Bug Fix] Fix training crashing when there is unlabeled data

[Bug Fix] Fix the issue where the input is a grey image while the model needs 3-channel input

[Bug Fix] Correct the style check job to use UTF-8 for both input and output files

[Bug Fix] Load the relevant library only if the MKL-DNN engine is specified

[Bug Fix] Shade org.tensorflow.framework to avoid conflict

[Bug Fix] Fix dlframes not being packaged in pip

[Bug Fix] Fix LocalPredictor not being serializable because of a nested logger variable

[Bug Fix] Clear Recurrent preTopology's output in cloneCells

[Bug Fix] Fix MM layer producing different outputs for the same input when run multiple times

[Bug Fix] Fix distributed predictor sending the model twice in mapPartition

[Document] Kubernetes programming guide for Spark 2.3

[Document] Add documentation for wrapping the preprocessor and model in one graph, and add its Python API
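HitRate and NDCG, added as validation methods in this release, are usually computed by ranking one positive item against a set of sampled negatives. A minimal sketch of HR@k and NDCG@k for a single positive item, assuming the common leave-one-out setup where index 0 of the score list holds the positive item and ties count against it; `hit_rate_at_k`, `ndcg_at_k` and `rank_of_positive` are hypothetical helpers, not BigDL's validation-method classes.

```python
import math
from typing import Sequence


def rank_of_positive(scores: Sequence[float]) -> int:
    """1-based rank of the positive item (assumed at index 0) among all candidates.

    Ties are counted against the positive item (pessimistic ranking).
    """
    positive = scores[0]
    return 1 + sum(1 for s in scores[1:] if s >= positive)


def hit_rate_at_k(scores: Sequence[float], k: int) -> float:
    """1.0 if the positive item lands in the top k, else 0.0."""
    return 1.0 if rank_of_positive(scores) <= k else 0.0


def ndcg_at_k(scores: Sequence[float], k: int) -> float:
    """With a single relevant item, NDCG@k reduces to 1 / log2(rank + 1) inside the top k."""
    rank = rank_of_positive(scores)
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0
```

Averaging these per-user values over a validation set gives the reported HR@k and NDCG@k; NDCG additionally rewards placing the positive item higher within the top k.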

v0.6.0(Jun 29, 2018)
Highlights

We integrate MKL-DNN as an alternative execution engine for CNN models. MKL-DNN provides better training/inference performance and lower memory consumption. On some CNN models, our experiments show a 2x throughput improvement.

Support using different optimization methods to optimize different parts of the model. This is necessary when training some models.

Spark 2.3 support. We have tested our code and examples on Spark 2.3. We release the binary for Spark 2.3, and Spark 1.5 is no longer supported.

Details

[New Feature] MKL-DNN integration. We integrate MKL-DNN as an alternative execution engine for CNN models. It speeds up layers such as AvgPooling, MaxPooling, CAddTable, LRN, JoinTable, Linear, ReLU, SpatialConvolution, SpatialBatchNormalization and Softmax. MKL-DNN provides better training/inference performance and lower memory consumption.

[New Feature] Layer fusion. Support layer fusion on conv + relu, batchnorm + relu, conv + batchnorm and conv + sum (some fusions can only be applied during inference). Layer fusion provides better performance, especially for inference. Currently, layer fusion is only available for MKL-DNN related layers.

[New Feature] Multiple optimization method support in optimizer. Support using different optimization methods to optimize different parts of the model.

[New Feature] Add a new optimization method Ftrl, which is often used in recommendation model training.

[New Feature] Add a new example: Training Resnet50 on the ImageNet dataset.

[New Feature] Add new OpenCV based image preprocessing transformer ChannelScaledNormalizer.

[New Feature] Add new OpenCV based image preprocessing transformer RandomAlterAspect.

[New Feature] Add new OpenCV based image preprocessing transformer RandomCropper.

[New Feature] Add new OpenCV based image preprocessing transformer RandomResize.

[New Feature] Support loading Tensorflow Max operation.

[New Feature] Allow users to specify the input port when loading a TensorFlow model. If the input operation accepts multiple tensors, users can specify which one to feed data to instead of feeding all tensors.

[New Feature] Support loading Tensorflow Gather operation.

[New Feature] Add random split for ImageFrame

[New Feature] Add setLabel and getURI API into ImageFrame

[API Change] Add batch size into the Python model.predict API.

[API Change] Add generateBackward to the load TensorFlow model API, which lets users choose whether to generate the backward path when loading a TensorFlow model.

[API Change] Add feature() and label() to the Sample.

[API Change] Deprecate the DLClassifier/DLEstimator in org.apache.spark.ml. Prefer using DLClassifier/DLEstimator under com.intel.analytics.bigdl.dlframes.

[Enhancement] Refine StridedSlice. Support begin/end/shrinkAxis mask just like Tensorflow.

[Enhancement] Add layer sync to SpatialBatchNormalization. SpatialBatchNormalization can calculate mean/std over a larger batch size, so a model with a SpatialBatchNormalization layer can converge to better accuracy even when the local batch size is small.

[Enhancement] Code refactor in DistriOptimizer for advanced parameter operations, e.g. global gradient clipping.

[Enhancement] Add more models into the LoadModel example.

[Enhancement] Share Const values when broadcasting the model. Const values never change, so they can be shared among multiple models used for inference on the same node, which reduces memory usage.

[Enhancement] Refine the getTime and time counting implementation.

[Enhancement] Support group serializer so that layers of the same hierarchy could share the same serializer.

[Enhancement] Dockerfile uses Python 2.7.

[Bug Fix] Fix memory leak problem when using quantized model in predictor.

[Bug Fix] Fix Py4J Java gateway incompatibility in Spark local mode for Spark 2.3.

[Bug Fix] Fix a bug in the Python inception example.

[Bug Fix] Fix a bug when running TensorFlow models that use loops.

[Bug Fix] Fix a bug in the Squeeze layer.

[Bug Fix] Fix Python API for random split.

[Bug Fix] Use parameters() instead of getParameterTable() to get weight and bias in serialization.

[Document] Fix errors in the quantized model document.

[Document] Fix incorrect instructions for generating Sequence files for the ImageNet 2012 dataset in the document.

[Document] Move bigdl-core build document into a separate page and refine the format.

[Document] Fix incorrect command in Tensorflow load and transfer learning examples.
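The conv + batchnorm fusion listed in this release works at inference time because batch normalization with fixed statistics is an affine transform per output channel, so it can be folded into the convolution's weights and bias. A minimal per-channel sketch, assuming the standard inference-mode formula y = gamma * (x - mean) / sqrt(var + eps) + beta; `fold_batchnorm` is a hypothetical helper, not BigDL's fusion code.

```python
import math
from typing import List, Tuple


def fold_batchnorm(
    weight: List[List[float]],  # weight[c] = flattened kernel of output channel c
    bias: List[float],
    gamma: List[float],
    beta: List[float],
    mean: List[float],
    var: List[float],
    eps: float = 1e-5,
) -> Tuple[List[List[float]], List[float]]:
    """Return conv weights/bias equivalent to conv followed by inference-mode BN.

    Hypothetical helper illustrating the fusion; not BigDL's implementation.
    """
    new_weight, new_bias = [], []
    for c, kernel in enumerate(weight):
        scale = gamma[c] / math.sqrt(var[c] + eps)
        # BN(w.x + b) = scale * (w.x + b - mean) + beta
        #             = (scale * w).x + scale * (b - mean) + beta
        new_weight.append([w * scale for w in kernel])
        new_bias.append(scale * (bias[c] - mean[c]) + beta[c])
    return new_weight, new_bias
```

After folding, the batch normalization layer can be dropped entirely, which removes one pass over the activations; this only holds for inference, since training updates the BN statistics.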

v0.5.0(Mar 30, 2018)
Highlights

Bring in a Keras-like API (Scala and Python). Users can easily run their Keras code (training and inference) on Apache Spark through BigDL. For more details, see this link.

Support loading TensorFlow dynamic models (e.g. LSTM, RNN) in BigDL and support more TensorFlow operations; see this page.

Support combining data preprocessing and neural network layers in the same model (to make model deployment easy)

Speedup various modules in BigDL (BCECriterion, rmsprop, LeakyRelu, etc.)

Add DataFrame-based image reader and transformer

New Features

Tensor can be converted to OpenCVMat

Bring in a new Keras-like API for Scala and Python

Support loading TensorFlow dynamic models (e.g. LSTM, RNN)

Support loading more TensorFlow operations (InvertPermutation, ConcatOffset, Exit, NextIteration, Enter, RefEnter, LoopCond, ControlTrigger, TensorArrayV3, TensorArrayGradV3, TensorArrayGatherV3, TensorArrayScatterV3, TensorArrayConcatV3, TensorArraySplitV3, TensorArrayReadV3, TensorArrayWriteV3, TensorArraySizeV3, StackPopV2, StackPop, StackPushV2, StackPush, StackV2, Stack)

ResizeBilinear supports NCHW

ImageFrame supports loading Hadoop sequence files

ImageFrame supports gray images

Add Kv2Tensor Operation(Scala)

Add PGCriterion to compute the negative policy gradient given action distribution, sampled action and reward

Support gradually increasing the learning rate in the learning rate scheduler

Add FixExpand and add more options to AspectScale for image preprocessing

Add RowTransformer(Scala)

Support adding preprocessors to Graph, which allows users to combine preprocessing and the trainable model into one model

ResNet on CIFAR-10 example supports loading images from HDFS

Add CategoricalColHashBucket operation(Scala)

Predictor supports Table as output

Add BucketizedCol operation(Scala)

Support using DenseTensor and SparseTensor together to create Sample

Add CrossProduct Layer (Scala)

Provide an option to allow users to bypass exceptions in transformers

DenseToSparse layer supports disabling backward propagation

Add CategoricalColVocaList Operation(Scala)

Support ImageFrame in Python optimizer

Support getting the executor number and executor cores in Python

Add IndicatorCol Operation(Scala)

Add TensorOp, which is an operation with Tensor[T]-formatted input and output, and provides shortcuts to build Operations for tensor transformation by closures. (Scala)

Provide a Dockerfile to make it easy to set up a BigDL testing environment

Add CrossCol Operation(Scala)

Add MkString Operation(Scala)

Add a prediction service interface that supports concurrent calls and accepts bytes input

Add SparseTensor.cast & SparseTensor.applyFun

Add DataFrame-based image reader and transformer

Support loading TensorFlow model files saved by the tf.saved_model API

SparseMiniBatch supports multiple TensorDataTypes

Enhancement

ImageFrame supports serialization

A default implementation of zeroGradParameter is added to AbstractModule

Improve the style of the document website

Models in different threads share weights in model training

Speed up leaky relu

Speed up Rmsprop

Speed up BCECriterion

Support Calling Java Function in Python Executor and ModelBroadcast in Python

Add detailed instructions to run-on-ec2

Optimize padding mechanism

Fix maven compiling warnings

Check duplicate layers in the container

Refine the document that describes how to automatically deploy BigDL on a Dataproc cluster

Refactor adding extra jars/Python packages for Python users. Now users only need to set the environment variables BIGDL_JARS and BIGDL_PACKAGES

Implement appendColumn and avoid the error caused by API mismatch between different Spark versions

Add python inception training on ImageNet example

Update "can't find locality partition for partition ..." to warning message

API change

Move DataFrame-based API to dlframes package

Refine the Container hierarchy. The add method (used in Sequential, Concat, …) is moved to a subclass, DynamicContainer

Refine the serialization code hierarchy

Dynamic Graph is now an internal class that is only used to run TensorFlow models

Operations are not allowed to be used outside Graph

The getParameters method is marked final and private[bigdl], and should only be used in model training

Remove the updateParameter method, which is only used in internal tests

Some Tensorflow related operations are marked as internal, which should be only used when running Tensorflow models

Bug Fix

Fix Sparse sample batch bug. It should add another dimension instead of concatenating the original tensor

Fix some activations or layers not working in TimeDistributed and RnnCell

Fix a bug in SparseTensor resize method

Fix a bug when converting SparseTensor to DenseTensor

Fix a bug in SpatialFullConvolution

Fix a bug in Cosine equal method

Fix optimization state being messed up when calling optimizer.optimize() multiple times

Fix a bug in Recurrent forward after invoking reset

Fix a bug in inplace leakyrelu

Fix a bug when saving/loading bi-RNN layers

Fix getParameters() in a submodule creating new storage when parameters have been shared by the parent module

Fix some incompatible syntax between Python 2.7 and 3.6

Fix saving/loading a graph losing stop-gradient information

Fix a bug in SReLU

Fix a bug in DLModel

Fix sparse tensor dot product bug

Fix Maxout serialization issue

Fix some serialization issues in some customized Faster R-CNN models

Fix and refine some example document instructions

Fix a bug in export_tf_checkpoint.py script

Fix a bug in setting up the Python package.

Fix picklers initialization issues

Fix a race condition in Spark 1.6 when broadcasting the model

Fix wrong return type of Model.load in Python

Fix a bug when using pyspark-with-bigdl.sh to run jobs on YARN

Fix calling size and stride on an empty tensor throwing a null exception
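The BucketizedCol operation listed under New Features in this release maps a numeric feature to a bucket index, a common step when building recommendation features. A minimal sketch of that mapping, assuming sorted boundaries and left-closed buckets (the convention used by TensorFlow's bucketized columns); `bucketize` is a hypothetical helper, not BigDL's Scala operation.

```python
import bisect
from typing import List, Sequence


def bucketize(values: Sequence[float], boundaries: Sequence[float]) -> List[int]:
    """Return, for each value, the index of the bucket it falls into.

    Hypothetical helper. With sorted boundaries [b0, b1, ...], bucket 0 is
    (-inf, b0), bucket 1 is [b0, b1), and so on, up to [b_last, +inf).
    """
    # bisect_right counts how many boundaries are <= v, which is the bucket index.
    return [bisect.bisect_right(boundaries, v) for v in values]
```

With n boundaries there are n + 1 buckets, so the resulting indices can feed directly into a lookup table or one-hot encoding of size n + 1.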

v0.4.0(Jan 4, 2018)
Highlights

Support all Keras layers and Keras 1.2.2 model loading. See keras-support for details

Python 3.6 support

OpenCV support, with a dozen new image transformers based on OpenCV

More layers/operations

New Features

Models & Layers & Operations & Loss function

Add layers for Keras: Cropping2D, Cropping3D, UpSampling1D, UpSampling2D, UpSampling3D, Masking, Maxout, Highway, GaussianDropout, GaussianNoise, CAveTable, VolumetricAveragePooling, HardSigmoid, SReLU, LocallyConnected1D, LocallyConnected2D, SpatialSeparableConvolution, ActivityRegularization, SpatialDropout1D, SpatialDropout2D, SpatialDropout3D

Add Criterion for keras: PoissonCriterion, KullbackLeiblerDivergenceCriterion, MeanAbsolutePercentageCriterion, MeanSquaredLogarithmicCriterion, CosineProximityCriterion

Support NHWC for LRN and BatchNormalization

Add LookupTableSparse (lookup table for multivalue)

Add activation argument for recurrent layers

Add MultiRNNCell

Add SpatialSeparableConvolution

Add MSRA filler

Support SAME padding in 3D conv and allow users to configure the padding size in convlstm and convlstm3d

TF operations: SegmentSum, conv3d related operations, Dilation2D, Dilation2DBackpropFilter, Dilation2DBackpropInput, Digamma, Erf, Erfc, Lgamma, TanhGrad, depthwise, Rint, All, Any, Range, Exp, Expm1, Round, FloorDiv, TruncateDiv, Mod, FloorMod, TruncateMod, InTopK, Maximum, Minimum, BatchMatMul, Sqrt, SqrtGrad, Square, RsqrtGrad, AvgPool, AvgPoolGrad, BiasAddV1, SigmoidGrad, Relu6, Relu6Grad, Elu, EluGrad, Softplus, SoftplusGrad, LogSoftmax, Softsign, SoftsignGrad, Abs, LessEqual, GreaterEqual, ApproximateEqual, Log, LogGrad, Log1p, Log1pGrad, SquaredDifference, Div, Ceil, Inv, InvGrad, IsFinite, IsInf, IsNan, Sign, TopK. See details at tensorflow_ops_list

Add object detection related layers: PriorBox, NormalizeScale, Proposal, DetectionOutputSSD, DetectionOutputFrcnn, Anchor

Transformer

Add image Transformer based on OpenCV: Resize, Brightness, ChannelOrder, Contrast, Saturation, Hue, ChannelNormalize, PixelNormalize, RandomCrop, CenterCrop, FixedCrop, DetectionCrop, Expand, Filler, ColorJitter, RandomSampler, MatToFloats, AspectScale, RandomAspectScale, BytesToMat

Add Transformer: RandomTransformer, RoiProject, RoiHFlip, RoiResize, RoiNormalize

API change

Add predictImage function in LocalPredictor

Add partition number option for ImageFrame read

Add an API to get node from graph model with given name

Support List of JTensors for label in Python API

Expose local optimizer and predictor in Python API

Install & Deploy

Support BigDL on Spark on k8s

Model Save/Load

Support large models (parameters exceeding 2.1G) for both Java and protobuf serialization

Support keras model loading

Training

Allow users to set new training data or a new criterion when reusing an optimizer

Support gradient clipping (constant clip and clip by L2-norm)

Enhancement

Speed up BatchNormalization.

Speed up MSECriterion

Speed up Adam

Speed up static graph execution

Support reading TFRecord files from HDFS

Support reading raw binary files from HDFS

Check input size in concat layer

Add proper exception handling for CaffeLoader & Persister

Add serialization support for multiple tensor numeric types

Add an Activity wrapper for Python to simplify the return value

Override joda-time in hadoop-aws to reduce compile time

LocalOptimizer uses a ModelBroadcast-like method to clone modules

Time counting for ParallelTable's forward/backward

Use shade to package jar-with-dependencies to manage some package conflicts

Support loading bigdl_conf_file in multiple python zip files

Bug Fix

Fix getModel failed in DistriOptimizer when model parameters exceed 2.1G

Fix core number being 0 when there is only one core in the system

Fix SparseJoinTable throwing an exception if the input's nElement changes

Fix some issues found when saving a BigDL model to a TensorFlow format file

Fix return object type error of DLClassifier.transform in Python

Fix graph generateBackward being lost in serialization

Fix resizing a tensor to an empty tensor not working properly

Fix Adapter layer not supporting different batch sizes at runtime

Fix Adapter layer not being directly serializable

Fix calling the wrong function when setting user-defined MKL threads

Fix SmoothL1Criterion and SoftmaxWithCriterion not handling the input's offset

Fix L1Regularization throwing NullPointerException while broadcasting the model

Fix CMul layer crashing for certain configurations
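The two gradient clipping modes listed under Training in this release bound different things: constant clipping clamps each gradient element into [min, max], while L2-norm clipping rescales the whole gradient vector when its norm exceeds a threshold. A minimal sketch of both on one flat gradient vector, in plain Python; `clip_constant` and `clip_by_l2_norm` are hypothetical names, not BigDL's optimizer API.

```python
import math
from typing import List


def clip_constant(grads: List[float], min_value: float, max_value: float) -> List[float]:
    """Clamp every gradient element into [min_value, max_value] (hypothetical helper)."""
    return [min(max(g, min_value), max_value) for g in grads]


def clip_by_l2_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale the gradient so its L2 norm is at most max_norm, keeping its direction.

    Hypothetical helper; the gradient is left untouched when already within the bound.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    factor = max_norm / norm
    return [g * factor for g in grads]
```

Constant clipping can change the gradient's direction, because large components are cut independently, while L2-norm clipping preserves direction and only shrinks the step size.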

v0.3.0(Nov 8, 2017)
Highlights

New protobuf-based model storage format

Support model quantization

Support sparse tensor and model

Easier and broader Tensorflow model load support

More layers/operations

Apache Spark 2.2 support

New Features

Models & Layers & Operations & Loss function

Support convlstm3D model

Support Variational Auto Encoder

Support Unet

Support PTB model

Add SpatialWithinChannelLRN layer

Add 3D-deconv layer

Add BifurcateSplitTable layer

Add KLD criterion

Add Gaussian layer

Add Sampler layer

Add RNN decoder layer

Support NHWC data format in 2D-conv, 2D-pooling layers

Support same/valid padding type in 2D-conv and 2D-pooling layers

Support dynamic execution flow in Graph

Graph node can pass nested tensors

Layers/Operations can support different numeric types for input and output tensors

Start to support operations in BigDL, adding the following operations: LogicalNot, LogicalOr, LogicalAnd, 1D Max Pooling, Squeeze, Prod, Sum, Reshape, Identity, ReLU, Equals, Greater, Less, Switch, Merge, Floor, L2Loss, RandomUniform, Rank, MatMul, SoftMax, Conv2d, Add, Assert, Onehot, Assign, Cast, ExpandDims, MaxPool, Realdiv, BiasAdd, Pad, Tile, StridedSlice, Transpose, Negative, AssignGrad, BiasAddGrad, Deconv2D, Conv2DBackFilter, CrossEntropy, MaxPoolGrad, NoOp, ReluGrad, Select, Pow, BroadcastGradientArgs, Control Dependency

Start to support sparse layers in BigDL, adding the following sparse layers: SparseLinear, SparseJoinTable, DenseToSparse

Tensor

Support sparse tensor

Support scalar (0-D tensor)

Tensor supports more numeric types: boolean, short, int, long, string, char, bytestring

Tensor no longer displays its full content in toString when there are too many elements

API change

Expose evaluate API to python

Add a predictClass API to model to simplify the code when users want to use a model for classification

Change model.test to model.evaluate in Python

Refine Recurrent, BiRecurrent and RnnCell API

Sample.features from ndarray to JTensor/List[JTensor]

Sample.label from ndarray to JTensor

Install & Deploy

Support Apache Spark 2.2

Add script to run BigDL on Google DataProc platform

Refine run-example.sh scripts to run BigDL examples on AWS with built-in Spark

Pip install now automatically installs spark-2.2

Add a docker file

Model Save/Load

New model persistence format (protobuf-based) to provide a better user experience when saving/loading BigDL models

Support load more operations from Tensorflow

Support read tensor content from Tensorflow checkpoint

Support load a subset of Tensorflow graph

Support load Tensorflow preprocessing graph(read/parse tfrecord data, image decoders and queues)

Automatically convert data in a TensorFlow queue to an RDD and feed it to model training in BigDL

Support loading deconv layers from Caffe and TensorFlow

Support save/load SpatialCrossLRN torch module

Training

Allow users to modify the optimization algorithm status when resuming training in Python

Allow users to specify optimization algorithms, learning rate and learning rate decay when using BigDL in a Spark ML pipeline

Allow users to stop gradients on some layers in backpropagation

Allow users to freeze layer parameters in training

Add ML pipeline Python API, so users can use BigDL with ML pipelines in Python code

Enhancement

Support model quantization. Users can speed up model inference by quantizing the model

Display BigDL models in TensorBoard

Users can easily convert a sequential model to a graph model by invoking the newly added toGraph method

Remove unnecessary contiguous check in 3D conv

Support global average pooling

Support regularizer in 3D convolution layer

Add regularizer for convlstmpeephole3d

Throw more meaningful messages in layers and criterions

Migrate GRU/LSTM/RNN/LSTM-Peephole definitions from sequence to graph

Switch to pytest for python unit tests

Speed up tanh layer

Speed up sigmoid layer

Speed up recurrent layer

Support batch normalization in recurrent

Speed up Python ndarray to Scala tensor conversion

Improve gradient sync performance in distributed training

Speedup tensor dot operation with mkl dot

Speedup copy operation in recurrent container

Speedup logsoftmax

Move classes.lst and img_class.lst to the model example folder, so users can find them more easily.

Ensure spark.speculation is set to false for better training performance

Make it easier to turn on performance data in the distributed training log

Optimize memory usage when broadcasting the model

Support MLlib vectors as features for BigDL

Support creating multi-tensor Samples in Python

Support resizing in BytesToBGRImg

Bug Fix

Fix TemporalConv layer being unable to return its parameter table

Fix some bugs when loading dilated group convolution from Caffe

Fix some bugs when loading Caffe v1 layers

Fix a bug in TimeDistributed layer

Fix incorrect execution time in recurrent layers

Fix inplace layer clear state bug

Fix incorrect training data sample count for some inputs

Remove label check in BytesToGreyImg

Fix a bug in concat table when it contains no layer

Fix a bug in MapTable

Fix some typos in document

Use newInstance method to obtain FileSystem
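Model quantization, highlighted in this release, speeds up inference by storing weights as 8-bit integers plus a scale factor. A minimal sketch of symmetric per-tensor int8 quantization, assuming a max-absolute-value scale; it illustrates the general technique and is not necessarily the scheme BigDL's quantized layers use. `quantize_int8` and `dequantize_int8` are hypothetical helpers.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to int8 codes in [-127, 127] using one shared scale (max |v| / 127).

    Hypothetical helper illustrating symmetric quantization.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # round() returns an int; clamp guards against floating-point overshoot.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]
```

Each reconstructed value differs from the original by at most half a quantization step (scale / 2), which is the accuracy cost traded for smaller weights and faster integer arithmetic.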

v0.2.0(Jul 24, 2017)
New feature

A new BigDL documentation website is online at https://bigdl-project.github.io/, replacing the original BigDL wiki

Added New Models & Layers

TreeLSTM and examples for sentiment analytics

convLSTM layer

1D convolution layer

Mean Absolute Error (MAE) metrics

TimeDistributed Layer

VolumetricConvolution(3D convolution)

VolumetricMaxPooling

RoiPooling layer

DiceCoefficient loss

bi-recurrent layers

API change

Allow users to set regularization per layer

Allow users to set learning rate per layer

Add predictClass API for Python

Add DLEstimator for Spark ML pipeline

Add Functional API for model definition

Add MovieLens dataset API

Add 4D normalize support

Add evaluator API to simplify model test

Install & Deploy

Allow users to install BigDL from pip

Support win64 platform

A new script to automatically pack and distribute Python dependencies in YARN cluster mode

Model Save/Load

Allow users to save a BigDL model as a Caffe model file

Allow users to load/save some TensorFlow models (covering TensorFlow Slim APIs)

Support save/load model file from/to s3/hdfs

Optimization

Add plateau learning rate schedule

Allow users to adjust the optimization process based on loss and score

Add Exponential learning rate decay

Add natural exp decay learning rate schedule

Add multistep learning rate policy

Enhancement

Optimization method API refactor

Allow users to load a Caffe model without pre-defining a BigDL model

Optimize Recurrent Layers performance

Refine the ML pipeline related API, and add more examples

Optimize JoinTable layer performance

Allow users to use the NIO block manager on Spark 1.5

Refine layer parameter initialization algorithm API

Refine Sample class to reduce memory usage when caching train/test datasets in tensor format

Refine MiniBatch API to support padding and multiple tensors

Remove bigdl.sh. BigDL will set MKL behavior through the MKL Java API, and users can control this via Java properties

Allow users to remove Spark logs from the redirected log file

Allow users to create a SpatialConvolution layer without bias

Refine validation metrics API

Refine smoothL1Criterion and reduce tensor storage usage

Use reflection to handle differences between Spark 2 platforms, so users need not recompile BigDL for different Spark 2 platforms

Optimize FlattenTable performance

Use maven package instead of script to copy dist artifacts together

Bug Fix

Fix some errors in the Text-classifier document

Fix a bug when calling JoinTable after clearState()

Fix a bug in Concat layer when the dimension concatenated along is larger than 2

Fix a bug in MapTable layer

Fix some multi-thread errors not being caught

Fix maven artifact dependency issue

Fix model save method not closing the stream

Fix a bug in BCECriterion

Fix some ConcatTable not clearing the gradInput buffer

Fix SpatialDilatedConvolution not clearing gradInput content
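The learning rate schedules added under Optimization in this release follow well-known formulas. A minimal sketch, assuming the common TensorFlow-style definitions with a base rate `lr` and an iteration counter `step`; the function and parameter names are illustrative and may not match BigDL's schedule classes.

```python
import bisect
import math
from typing import Sequence


def exponential_decay(lr: float, step: int, decay_step: int, decay_rate: float) -> float:
    """Exponential decay: lr * decay_rate ** (step / decay_step)."""
    return lr * decay_rate ** (step / decay_step)


def natural_exp_decay(lr: float, step: int, decay_step: int, gamma: float) -> float:
    """Natural exp decay: lr * exp(-gamma * step / decay_step)."""
    return lr * math.exp(-gamma * step / decay_step)


def multistep(lr: float, step: int, milestones: Sequence[int], gamma: float) -> float:
    """Multistep policy: multiply lr by gamma once for every milestone already reached."""
    passed = bisect.bisect_right(sorted(milestones), step)
    return lr * gamma ** passed
```

The first two decay smoothly with every step, while the multistep policy keeps the rate constant between milestones and drops it sharply at each one.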

v0.1.1(Jun 16, 2017)
Release Notes

API Change

Use bigdl as the top-level package name for all BigDL Python modules

Allow users to change the model in the optimizer

Allow users to define a model in the Python API

Allow users to invoke BigDL Scala code from Python in third-party projects

Allow users to use the BigDL random generator in Python

Allow users to use the forward/backward methods in Python

Add BiRnn layer to python

Remove useless CriterionTable layer

Enhancement

Load libjmkl.so in the class load phase

Support python 3.5

Initialize gradient buffer at the start of backward to reduce the memory usage

Automatically pack Python dependencies in YARN cluster mode

Bug Fix

Fix optimizer continuing without failing after the maximum number of retries

Fix LookupTable Python API throwing a NoSuchMethodError

Fix an addmv bug for 1x1 matrix

Fix LeNet Python example error

Fix Python text file loading encoding issue

Fix HardTanh performance issue

Fix data possibly being distributed unevenly in the VGG example when the input partition is too large

Fix a bug in SpatialDilatedConvolution

Fix a bug in BCECriterion loss function

Fix a bug in Add layer

Fix runtime error when running BigDL on PySpark 1.5
