FWIW, I believe the current state of the art for batch-size-1, fp32 ResNet-50 inference on Intel CPUs is AWS's work in https://arxiv.org/abs/1809.02697. Once the low-hanging fruit outside of model execution is picked, this kind of work becomes quite relevant.
Hey! Author here, thanks for linking the paper. The article was written from an infrastructure perspective, but we're definitely diving deeper into graph execution optimizations after this :)
Inference isn't prohibitively slow on CPU, especially for network requests that already carry quite a bit of latency, so plenty of companies use CPUs in the cloud for lambda/flexible loads where GPUs aren't available.
Cool work! It feels like the improvement is a bit overstated because of how you're measuring: your measurements include import/setup time, so you get big gains just by improving imports. In reality you won't be creating a new client for each request, and client import/setup time is unrelated to TF Serving performance. TF Serving performance is really about the time elapsed between request received and response returned.
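To make that concrete, here's a minimal sketch of measuring only request-to-response time against TF Serving's REST API, with connection setup kept outside the timed loop. The port, model name, and input shape are illustrative assumptions, not taken from the post:

```python
import json
import statistics
import time
from http.client import HTTPConnection

def summarize(latencies_ms):
    """Reduce per-request latencies (ms) to p50/p99."""
    ordered = sorted(latencies_ms)
    p99_idx = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {"p50": statistics.median(ordered), "p99": ordered[p99_idx]}

def bench(host="localhost", port=8501, model="resnet", n=100):
    # One-time setup: open the connection once, OUTSIDE the timed loop.
    conn = HTTPConnection(host, port)
    # A dummy 224x224x3 image; real benchmarks should use real inputs.
    body = json.dumps({"instances": [[[0.0] * 3] * 224] * 224})
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        conn.request("POST", f"/v1/models/{model}:predict", body,
                     {"Content-Type": "application/json"})
        conn.getresponse().read()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return summarize(latencies)
```

Setup cost then shows up nowhere in the reported numbers, which is the regime TF Serving actually runs in.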
> containers are run on a 4 core, 15GB, Ubuntu 16.04 host machine
What CPU is being used?
Assuming the benchmark is done with something like an EC2 C5 instance, the results in this post are quite slow. Somewhere around 14x slower than benchmarks from a year ago on EC2 C5 instances. [1]
Hi bwasti, the host's CPU platform is Intel Broadwell. While the CPU architecture of our production hosts is the same, the resources allocated are much higher than 4 cores. This post details an overview of the relative improvements that can be made from a vanilla setup :)
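For anyone wanting to check what silicon their own host actually has, here's a rough sketch that pulls the model name and AVX-family flags out of `/proc/cpuinfo`-style text (the helper is illustrative, not from the post; the sample model number in the comment below is an assumption):

```python
import re

def cpu_features(cpuinfo_text):
    """Extract model name and SIMD flags from /proc/cpuinfo-style text."""
    model = re.search(r"model name\s*:\s*(.+)", cpuinfo_text)
    flags = re.search(r"flags\s*:\s*(.+)", cpuinfo_text)
    flag_set = set(flags.group(1).split()) if flags else set()
    return {
        "model": model.group(1).strip() if model else "unknown",
        # Broadwell tops out at AVX2; AVX-512 parts (Skylake-SP and
        # later, e.g. the chips behind EC2 C5) have wider fp32 SIMD.
        "simd": sorted(f for f in flag_set if f.startswith("avx")),
    }

# On Linux: print(cpu_features(open("/proc/cpuinfo").read()))
```

That distinction matters here, since a Broadwell host and a C5's Skylake-SP aren't an apples-to-apples comparison for fp32 throughput.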
You may want to check out Intel's optimized version of TensorFlow Serving[1] for further improvements (on the order of 2x for ResNet-50[2]).
As an aside, I took the resource allocation into account in the parent comment. The c5.2xlarge has 8 cores, 16GB RAM [3] and does a single fp32 inference in ~17ms. If we chop that down to 4 cores and assume linear scaling, we can estimate ResNet-50 at ~35ms, compared to the ~500ms achieved here. I'd recommend comparing to a known baseline rather than a "vanilla setup" to ensure you aren't missing any simple changes that may dramatically improve performance.
@bwasti, really good points - this is something we look forward to evaluating! Our post does indeed outline optimizations from tensorflow/serving to tensorflow/serving:*-devel [1]. The next logical improvement (given the Intel architecture and the docs linked) is to start building on top of the *-devel-mkl image.
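For reference, a sketch of what that next step might look like as a `docker run` invocation. The image tag and the MKL/OpenMP knob values here are assumptions based on Intel's MKL-DNN tuning docs, not something we've validated; check your tensorflow/serving release for the exact tag:

```python
import shlex

# Common Intel MKL / OpenMP tuning knobs (values are assumptions
# for a 4-core host; tune for your own hardware).
MKL_ENV = {
    "OMP_NUM_THREADS": "4",  # match the number of allocated cores
    "KMP_BLOCKTIME": "1",    # ms a worker spins before sleeping
    "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
}

def docker_cmd(model_dir="/models/resnet", model_name="resnet",
               image="tensorflow/serving:latest-devel-mkl"):
    """Build a docker run command string for an MKL-enabled serving image."""
    args = ["docker", "run", "-p", "8501:8501",
            "-v", f"{model_dir}:/models/{model_name}",
            "-e", f"MODEL_NAME={model_name}"]
    for key, value in MKL_ENV.items():
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return " ".join(shlex.quote(a) for a in args)
```

The env vars matter because an MKL build with default threading settings can easily underperform the plain image.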
There is an optimized version of TensorFlow based on Clear Linux and MKL-DNN (https://clearlinux.org/stacks); it would be interesting to see the performance difference between the natively compiled version and this one.
Hey! That's super interesting - so far we've gone with TensorFlow's Ubuntu-based official Docker devel image, but a Clear Linux base definitely looks worth looking into!