FWIW, I believe the current state of the art for batch-size-1, fp32 ResNet-50 inference on Intel CPUs is AWS's work in https://arxiv.org/abs/1809.02697. Once the low-hanging fruit outside of model execution is picked, this kind of work becomes quite relevant.
Hey! Author here, thanks for linking the paper. The article was written from an infrastructure perspective, but we're definitely diving deeper into graph execution optimizations after this :)
Inference isn't prohibitively slow on CPU, especially for network requests that already carry quite a bit of latency, so plenty of companies use CPUs in the cloud for lambda/flexible loads where GPUs aren't available.
Cool work! It feels like the improvement is a bit overstated because of how you're measuring: your measurements include import/setup time, so you get big gains just by improving imports. In reality you won't be creating a new client for each request, and client import/setup time is unrelated to TF Serving performance. TF Serving performance is really about the time elapsed between request received and response returned.
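To make that concrete, here's a minimal sketch of measuring only request-to-response time against TF Serving's REST API, with connection setup kept outside the timed loop. The port, model name, and input shape are illustrative assumptions, not taken from the post:

```python
import json
import statistics
import time
from http.client import HTTPConnection

def summarize(latencies_ms):
    """Reduce per-request latencies (ms) to p50/p99."""
    ordered = sorted(latencies_ms)
    p99_idx = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {"p50": statistics.median(ordered), "p99": ordered[p99_idx]}

def bench(host="localhost", port=8501, model="resnet", n=100):
    # One-time setup: open the connection once, OUTSIDE the timed loop.
    conn = HTTPConnection(host, port)
    # A dummy 224x224x3 image; real benchmarks should use real inputs.
    body = json.dumps({"instances": [[[0.0] * 3] * 224] * 224})
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        conn.request("POST", f"/v1/models/{model}:predict", body,
                     {"Content-Type": "application/json"})
        conn.getresponse().read()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return summarize(latencies)
```

Setup cost then shows up nowhere in the reported numbers, which is the regime TF Serving actually runs in.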
> containers are run on a 4 core, 15GB, Ubuntu 16.04 host machine
What CPU is being used?
Assuming the benchmark is done with something like an EC2 C5 instance, the results in this post are quite slow. Somewhere around 14x slower than benchmarks from a year ago on EC2 C5 instances. [1]
Hi bwasti, the host's CPU platform is Intel Broadwell. While the CPU architecture of our production hosts is the same, the resources allocated are much higher than 4 cores. This post details an overview of the relative improvements that can be made from a vanilla setup :)
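For anyone wanting to check what silicon their own host actually has, here's a rough sketch that pulls the model name and AVX-family flags out of `/proc/cpuinfo`-style text (the helper is illustrative, not from the post; the sample model number in the comment below is an assumption):

```python
import re

def cpu_features(cpuinfo_text):
    """Extract model name and SIMD flags from /proc/cpuinfo-style text."""
    model = re.search(r"model name\s*:\s*(.+)", cpuinfo_text)
    flags = re.search(r"flags\s*:\s*(.+)", cpuinfo_text)
    flag_set = set(flags.group(1).split()) if flags else set()
    return {
        "model": model.group(1).strip() if model else "unknown",
        # Broadwell tops out at AVX2; AVX-512 parts (Skylake-SP and
        # later, e.g. the chips behind EC2 C5) have wider fp32 SIMD.
        "simd": sorted(f for f in flag_set if f.startswith("avx")),
    }

# On Linux: print(cpu_features(open("/proc/cpuinfo").read()))
```

That distinction matters here, since a Broadwell host and a C5's Skylake-SP aren't an apples-to-apples comparison for fp32 throughput.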
You may want to check out Intel's optimized version of TensorFlow Serving[1] for further improvements (on the order of 2x for ResNet-50[2]).
As an aside, I took the resource allocation into account in the parent comment. The c5.2xlarge has 8 cores, 16GB RAM [3] and does a single fp32 inference in ~17ms. If we chop that down to 4 cores and assume linear scaling, we can estimate ResNet-50 at ~35ms, compared to the ~500ms achieved here. I'd recommend comparing to a known baseline rather than a "vanilla setup" to ensure you aren't missing any simple changes that may dramatically improve performance.
@bwasti, really good points - this is something we look forward to evaluating! Our post does indeed outline optimizations from tensorflow/serving to tensorflow/serving:*-devel [1]. The next logical improvement (given the Intel architecture and the docs linked) is to start building on top of the *-devel-mkl image.
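For reference, a sketch of what that next step might look like as a `docker run` invocation. The image tag and the MKL/OpenMP knob values here are assumptions based on Intel's MKL-DNN tuning docs, not something we've validated; check your tensorflow/serving release for the exact tag:

```python
import shlex

# Common Intel MKL / OpenMP tuning knobs (values are assumptions
# for a 4-core host; tune for your own hardware).
MKL_ENV = {
    "OMP_NUM_THREADS": "4",  # match the number of allocated cores
    "KMP_BLOCKTIME": "1",    # ms a worker spins before sleeping
    "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
}

def docker_cmd(model_dir="/models/resnet", model_name="resnet",
               image="tensorflow/serving:latest-devel-mkl"):
    """Build a docker run command string for an MKL-enabled serving image."""
    args = ["docker", "run", "-p", "8501:8501",
            "-v", f"{model_dir}:/models/{model_name}",
            "-e", f"MODEL_NAME={model_name}"]
    for key, value in MKL_ENV.items():
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return " ".join(shlex.quote(a) for a in args)
```

The env vars matter because an MKL build with default threading settings can easily underperform the plain image.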
There is an optimized version of TensorFlow based on Clear Linux and MKL-DNN (https://clearlinux.org/stacks); it would be interesting to see the performance difference between the natively compiled version and this one.
Hey! That's super interesting - so far we've gone with TensorFlow's Ubuntu-based official Docker devel image, but a Clear Linux base definitely looks worth looking into!