Show and Tell: Image captioning open sourced in TensorFlow (googleblog.com)
138 points by runesoerensen on Sept 22, 2016 | 26 comments


Why on earth would they distribute this without a trained model? Several weeks of training time on a multi-thousand-dollar piece of specialized hardware are required before you can actually run this.

Google clearly has many different trained versions of this network sitting around. It must have been a conscious decision not to release them. Is the point to artificially create a barrier to entry for hobbyists that might want to apply this research? If so, why bother releasing it at all? I'm really scratching my head here.


I feel the same. Maybe we should pool some money to train it on AWS. Is there a community where Machine Learning hobbyists can pool money to train models that are open sourced afterwards?


I haven't heard of one, but this is a good idea!


This network shows how it is possible to represent meaning in a vector of 100-600 real numbers - mapping first from images to vectors, then from vectors to text. Philosophers have always wondered about the nature of thought, but this representation model creates for the first time the ability to work with sensorially grounded abstract concepts in AI. It's not so mysterious after all. It is possible to merge multiple sensory modalities in a common meaning space.

In order to get full AI we need to add behavior and embodiment to these meaning vectors. They need to be trained by reinforcement learning, to learn the behavior that maximizes rewards. Meaning vectors are just a small part of the final system, equivalent to our ability to see and speak. The most difficult part is that of learning behavior.
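To make the first idea concrete: in a shared meaning space, similarity is just geometry. A toy sketch in plain NumPy, with made-up vectors standing in for real encoder outputs:

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    rng = np.random.default_rng(0)
    image_vec = rng.normal(size=300)    # pretend output of an image encoder
    caption_vec = image_vec + rng.normal(scale=0.1, size=300)  # a matching caption
    unrelated_vec = rng.normal(size=300)                       # an unrelated sentence

    print(cosine_similarity(image_vec, caption_vec))    # high, ~0.99
    print(cosine_similarity(image_vec, unrelated_vec))  # near 0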


Well put. I think it's really valuable to take inventory of human cognitive capabilities that we don't yet know how to implement in neural nets.

We seem to have more or less solved perception. Given a high dimensional, "raw" input space, we know how to process it into a more usable, more abstract representation.

Work on perception has also given us certain limited kinds of behavior. We can generate images from abstract representations by inverting our image recognition architectures. RNNs can do perception over sequences (e.g. of words), but can also generate language output or a sequence of control commands for robotics.
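Caption generation is a good example: the image representation initializes the RNN's state, and words come out one per step. A rough sketch of that loop (the rnn_step stand-in and tiny vocabulary are invented for illustration; the real model uses a trained LSTM and beam search rather than greedy decoding):

    import numpy as np

    vocab = ["<start>", "<end>", "a", "dog", "on", "grass"]

    def rnn_step(state, token_id):
        # Stand-in for one LSTM step: returns (new_state, logits over vocab).
        return state, np.random.randn(len(vocab))

    state = np.zeros(512)        # would be initialized from the image features
    token = vocab.index("<start>")
    caption = []
    for _ in range(20):          # cap the caption length
        state, logits = rnn_step(state, token)
        token = int(np.argmax(logits))       # greedy: pick the likeliest word
        if vocab[token] == "<end>":
            break
        caption.append(vocab[token])
    print(" ".join(caption))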

One area in which I think human cognition is far ahead of neural network research is control flow. Maybe there's a better name for this. We seem to be able to encounter a new cognitive challenge and quickly design a mental program to solve it. We can attend to relevant sensory streams, various kinds of memory, and then design (sometimes novel) behavior to solve the problem.

Work on attention, and on architectures with more sophisticated internal representations like stack-augmented RNNs, is definitely moving in this direction, but it seems like we have much further to go on this front than in visual perception, for example.


I'd be much more hesitant to talk about "meaning" in this context. Are we working with abstract concepts? We're working with images and text, and creating connections between the two. It may be that this approach, ramped up in processing power and complexity, can completely mimic a human's response to images; we may also hit a wall where new techniques are needed to address what you'd call "meaning."


Yes, could be. Word vectors and other kinds of embeddings seem promising, a little too good to be true. There might be a glass ceiling we're not seeing yet.


Thinking about behavior as a sequence of actions, plus operators to combine actions... it may not be all that different from speech generation. Speech is a behavior, after all.


Yes, and also other functions such as attention, memory reading and writing, and comparison - they form an assembly language for the "Neural Turing Machine". Doing all sorts of mental operations is behavior as well, so it can be learned by reinforcement.


The paper 'Zero-Shot Learning Through Cross-Modal Transfer' might be worth reading.

It's one of the papers that kicked off this approach to image captioning, but it is much more ambitious. Lots of things in it don't quite work, but it shows where this work is going.


Wondering if one can train TensorFlow to catch bugs in source code.

Train it on github.com's commits and logs - automatically learn what software bugs look like, then scan for new ones...
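A hypothetical first step, just to sketch the data-collection side - harvest likely bug-fix commits from a local clone by grepping commit messages (the "fix" heuristic is crude, so labels mined this way would be noisy):

    import subprocess

    def bugfix_commits(repo_path):
        # List commits whose message mentions "fix" (case-insensitive).
        out = subprocess.run(
            ["git", "-C", repo_path, "log", "--grep=fix", "-i",
             "--pretty=format:%H %s"],
            capture_output=True, text=True, check=True)
        return out.stdout.splitlines()

    for line in bugfix_commits(".")[:10]:
        print(line)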


There is something like this:

Automatic Patch Generation by Learning Correct Code

https://people.csail.mit.edu/rinard/paper/popl16.pdf


s/TensorFlow/a deep RNN like this/ would make more sense.

TensorFlow is just a framework (as are Theano, Torch or DL4J) for expressing the network architecture. Framework:Network ~ ProgrammingLanguage:Algorithm
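To illustrate: in (0.x-era) TensorFlow you describe a computation graph, then run it in a session; the "network" is whatever graph you choose to write down. A toy softmax layer, unrelated to the im2txt architecture:

    import numpy as np
    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 4])  # input features
    W = tf.Variable(tf.random_normal([4, 3]))  # weights to be learned
    y = tf.nn.softmax(tf.matmul(x, W))         # the "architecture"

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        print(sess.run(y, feed_dict={x: np.ones((2, 4), np.float32)}))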


Very cool. I've been toying with the idea of using something like this, or perhaps the Cloud Vision API, to automatically generate image captions for screen readers (e.g. through a browser extension), but the cost of running something like an EC2 GPU instance is prohibitive for a project like that, which I wouldn't want to charge for.

Training it locally on the user's machine would take far too long, especially as you would have to use the CPU in the majority of cases, since many people don't have a discrete GPU.


While you would never do this kind of training on your users' machines (it takes multiple weeks even with a powerful GPU), you should be able to apply the trained model to a single photo nearly instantaneously. So the real roadblock is mostly that they don't appear to have included a completely pre-trained model with this release, and it will take you as a developer a lot of GPU time to train one. But your users would not necessarily have a problem captioning images on their machines.
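A sketch of that split (the checkpoint path and tensor names here are hypothetical, not from the im2txt release): training produces a checkpoint once, and every later use is just a restore plus one cheap forward pass.

    import tensorflow as tf

    # Rebuild the graph and load the weights someone else already trained.
    saver = tf.train.import_meta_graph("model.ckpt.meta")
    with tf.Session() as sess:
        saver.restore(sess, "model.ckpt")
        graph = tf.get_default_graph()
        image = graph.get_tensor_by_name("image_feed:0")        # hypothetical name
        caption = graph.get_tensor_by_name("caption_output:0")  # hypothetical name
        # One forward pass per photo - fast even on a CPU.
        print(sess.run(caption,
                       feed_dict={image: open("photo.jpg", "rb").read()}))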


I hadn't considered that (this is really out of my depth). Any idea what the actual size of a trained model would be to distribute? Taking 150 GB on the user's hard drive is probably out as well.


Depends on the model and dataset. Inception v3 trained on ImageNet is about 150 MB, but you can quantize the weights to 8-bit and prune the model much smaller without affecting performance much.
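For the quantization part, the basic trick is just mapping each float32 weight onto one of 256 evenly spaced levels, so each weight takes one byte instead of four - a rough sketch in NumPy:

    import numpy as np

    w = np.random.randn(1000).astype(np.float32)  # pretend these are weights
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / 255.0
    q = np.round((w - lo) / scale).astype(np.uint8)  # store 1 byte per weight
    w_restored = q.astype(np.float32) * scale + lo   # dequantize at inference
    print(np.abs(w - w_restored).max())              # small quantization error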


Here's a complete model for image recognition that works fine on a notebook: https://www.tensorflow.org/versions/r0.10/tutorials/image_re...


You can run this model on a Raspberry Pi.

Training is another matter.


It would be much more useful if they released a pretrained model. Most people don't have the hardware required to train their own - unless they want to wait months.


This is super awesome! It would be nice if a pretrained model were also available, so people could play with this without spending weeks on training.


Would it be possible to run this on a MBP, or would it require significantly more computing power?


From their note on the GitHub page [1]:

> The time required to train the Show and Tell model depends on your specific hardware and computational capacity. In this guide we assume you will be running training on a single machine with a GPU. In our experience on an NVIDIA Tesla K20m GPU the initial training phase takes 1-2 weeks. The second training phase may take several additional weeks to achieve peak performance (but you can stop this phase early and still get reasonable results).

> It is possible to achieve a speed-up by implementing distributed training across a cluster of machines with GPUs, but that is not covered in this guide.

> Whilst it is possible to run this code on a CPU, beware that this may be approximately 10 times slower.

So I assume it will take a veeery long time to train on a MBP, unless they publish their pre-trained model.

[1] https://github.com/tensorflow/models/tree/master/im2txt#a-no...


You could run this locally. You could also spin up an AWS GPU instance and run the library as a web service.


If someone has the model running already, could you please share the captions generated for the images on page 25 of http://cims.nyu.edu/~brenden/1604.00289v2.pdf .


I wonder if, as someone learning ML, I should try to train and run it on my PC.



