Hacker News

>Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points)

Which function is non-differentiable at points? If you're referring to my example, it is only non-differentiable at x = 0 and x = inf, which are both uninteresting points since they aren't divisors of N; everywhere else, the f(x) I gave is differentiable and Lipschitz continuous of order infinity.
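For concreteness, vanilla GD on a smooth function looks like this (a minimal sketch; the quadratic here is a hypothetical stand-in, not the divisor-finding f(x) from this thread):

```python
# Minimal vanilla gradient descent; df below is the derivative of a
# hypothetical smooth stand-in f(x) = (x - 3)^2, which is differentiable
# and Lipschitz-smooth everywhere.
def gradient_descent(df, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= lr * df(x)  # standard GD update: x <- x - lr * f'(x)
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

With a smooth, Lipschitz function like this, the iterates contract toward the minimizer x = 3 geometrically; it is exactly the non-smooth pathological cases where this guarantee breaks down.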

This is in contrast to your pathological example of f(x) = { 1 (x != 0); 0 (x == 0) }. Of course GD cannot work there, and I wouldn't fault the paper for it...

Don't misunderstand me: the paper is interesting, but the title and certain phrasings are very misleading IMHO.

Still, I think the approach taken by others is more interesting: fix an underlying NN as "ground truth", train a second NN of the same architecture to match it, and observe the absolute error between the two.
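Sketched out, that teacher-student setup might look like the following (an assumed toy version, not the exact setup from the referenced papers; architecture sizes and learning rate are ad hoc):

```python
import numpy as np

# Teacher-student sketch: a fixed random "ground truth" net labels the
# data; a student net of the SAME architecture is trained by plain GD to
# match it, and we track the error against the teacher.
rng = np.random.default_rng(0)
n, d, h = 64, 3, 8
X = rng.normal(size=(n, d))

def forward(W1, W2):
    return np.maximum(X @ W1, 0.0) @ W2  # one ReLU hidden layer

W1_t, W2_t = rng.normal(size=(d, h)), rng.normal(size=(h, 1))
y = forward(W1_t, W2_t)                       # teacher ("ground truth") labels

W1, W2 = rng.normal(size=(d, h)), rng.normal(size=(h, 1))
loss0 = float(np.mean((forward(W1, W2) - y) ** 2))
lr = 1e-3
for _ in range(2000):
    H = np.maximum(X @ W1, 0.0)
    err = H @ W2 - y                          # error vs. the teacher net
    gW2 = H.T @ err / n
    gW1 = X.T @ ((err @ W2.T) * (H > 0)) / n  # ReLU subgradient backprop
    W1, W2 = W1 - lr * gW1, W2 - lr * gW2

loss = float(np.mean((forward(W1, W2) - y) ** 2))
```

The interesting quantity is how `loss` (the discrepancy from the planted teacher) evolves, which is exactly the "recover the underlying network" question the planted-model line of work studies.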



> which function is non-differentiable at points?

Sorry, this was referring to the construction provided in the paper referenced.

I do agree that the title is somewhat misleading: when I first read it (and thought, "this is probably wrong"), I imagined it proved that, given any resnet, you can show convergence to the global optimum via GD, not just that a resnet of a given size converges to a global optimum, via GD, for a specific training set.

That being said, the paper does not prove (nor claim to prove) general, globally-optimal convergence of GD, which is what I think you're arguing against (given, for example, what you mentioned about finding the factorization of a semiprime in the GGP, and your specific function construction), and which is what I was pushing back on a bit. In particular, even in the title, they only claim to prove this for a specific class of problems (i.e. NNs).

> Still, I think the approach taken by others is more interesting: fix an underlying NN as "ground truth", train a second NN of the same architecture to match it, and observe the absolute error between the two.

I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?


Thanks for your added comments; it's really helpful to see a more candid breakdown of others' views of a paper.

>I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?

They are referenced in the paper, in this section:

>Another way to attack this problem is to study the dynamics of a specific algorithm for a specific neural network architecture. Our paper also belongs to this category. Many previous works put assumptions on the input distribution and assume the label is generated according to a planted neural network. Based on these assumptions, one can obtain global convergence of gradient descent for some shallow neural networks [Tian, 2017, Soltanolkotabi, 2017, Brutzkus and Globerson, 2017, Du et al., 2018a, Li and Yuan, 2017, Du et al., 2017b]. Some local convergence results have also been proved [Zhong et al., 2017a,b, Zhang et al., 2018]. In comparison, our paper does not try to recover the underlying neural network. Instead, we focus on the empirical loss minimization problem and rigorously prove that randomly initialized gradient descent can achieve zero training loss.

I had this idea independently but never pursued it for lack of time, which is another reason I like this paper: it references this approach. I still haven't had time to read those references, but from the description in this section it appears they investigate nearly, if not exactly, what I wanted to look into "some day" :)
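The paper's headline claim, by contrast, needs no planted teacher at all. As a toy illustration (an assumed sketch with ad-hoc width and step size, not the paper's construction): a sufficiently overparameterized, randomly initialized one-hidden-layer net, trained by plain GD on the first layer only (as in the NTK-style analyses), drives the empirical training loss toward zero even on arbitrary labels.

```python
import numpy as np

# Toy illustration of "randomly initialized GD achieves zero training
# loss" (assumed toy setting, not the paper's exact construction).
rng = np.random.default_rng(1)
n, d, m = 10, 5, 400                   # few samples, many hidden units
X = rng.normal(size=(n, d))
y = rng.normal(size=(n, 1))            # arbitrary labels, no planted net

W = rng.normal(size=(d, m)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=(m, 1)) / np.sqrt(m)  # fixed output layer

def loss_of(W):
    return float(np.mean((np.maximum(X @ W, 0.0) @ a - y) ** 2))

loss0, lr = loss_of(W), 0.3
for _ in range(5000):
    H = np.maximum(X @ W, 0.0)
    err = H @ a - y
    W -= lr * X.T @ ((err @ a.T) * (H > 0)) / n  # GD on first layer only

final_loss = loss_of(W)
```

The point of overparameterization (m much larger than n) is that the random-features Gram matrix stays positive definite along the trajectory, so the training loss contracts geometrically; this is the regime where "zero training loss" becomes provable rather than empirical.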



