I don’t know much machine learning, but I’ve heard folks in some fields prefer a gradient descent algorithm even though it doesn’t have a nice convergence rate, because the Hessian (or even a Hessian approximation) would be too large or too expensive to compute.
There are always tricks to keep the problem size small for second-order methods (such as L-BFGS). In ML, stochastic gradient descent (a bad name for doing gradient descent on batches of the data at a time) seems to have a bit of magic in its ability to go uphill to escape local minima, as well as letting the problem scale to much larger sets of data.
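A minimal mini-batch SGD sketch (all names here are illustrative, not from any library). It fits y = w*x + b by descending the gradient of the squared error on small random batches; the batch-to-batch noise in the gradient is exactly what can jitter the iterate out of a shallow local minimum.

```python
import numpy as np

# Synthetic data: y = 3x + 0.5 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200)
y = 3.0 * X + 0.5 + rng.normal(0, 0.1, 200)

w, b, lr, batch = 0.0, 0.0, 0.1, 16
for step in range(500):
    idx = rng.integers(0, len(X), batch)   # draw a random mini-batch
    xb, yb = X[idx], y[idx]
    err = (w * xb + b) - yb
    # Gradient of mean squared error on this batch only -- a noisy
    # estimate of the full-data gradient.
    w -= lr * 2 * np.mean(err * xb)
    b -= lr * 2 * np.mean(err)

print(w, b)  # close to the true (3.0, 0.5)
```

The same loop with `idx = np.arange(len(X))` would be plain (full-batch) gradient descent; the only change SGD makes is the subsampling.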
Newton's method finds zeros of functions, which is different from optimisation (finding maxima or minima). If you apply Newton's method to a function which is the derivative of some other function you're trying to optimise, then it will find critical points, which may be the optima you're looking for (or inflection points, saddle points, etc.).
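A small sketch of that point, using f(x) = x³ − 3x as an assumed example. Running Newton's method on g = f′ (so the update is x ← x − f′(x)/f″(x)) converges to a critical point of f, but whether that point is a minimum or a maximum depends entirely on where you start:

```python
def newton_on_derivative(x, steps=50):
    # f(x) = x**3 - 3x, so f'(x) = 3x**2 - 3 and f''(x) = 6x.
    # Newton's method on f' finds zeros of f', i.e. critical points of f.
    for _ in range(steps):
        x = x - (3 * x**2 - 3) / (6 * x)
    return x

print(newton_on_derivative(0.5))   # converges to x = 1, a local minimum of f
print(newton_on_derivative(-0.5))  # converges to x = -1, a local maximum of f
```

Both runs succeed as root-finding on f′, yet only one of them is the minimiser; checking the sign of f″ at the result (positive at x = 1, negative at x = −1) is what distinguishes them.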
This is why in practice you do something like Levenberg-Marquardt, which combines gradient descent with a trust-region-style quadratic (Gauss-Newton) step, with a damping parameter varied depending on whether the cost function is decreasing. And of course it is iterative, so a single bad step doesn't destroy everything. The advantage comes from faster convergence in practice.
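A toy Levenberg-Marquardt sketch (illustrative only, not how a production solver like SciPy's is implemented), fitting y = a·exp(b·x) by least squares. The damping term `lam` interpolates between a Gauss-Newton step (small `lam`) and a short gradient-descent step (large `lam`): it shrinks after a step that decreases the cost and grows after one that doesn't, so a bad step is simply rejected.

```python
import numpy as np

# Data generated from the true parameters (a, b) = (2.0, 1.5).
x = np.linspace(0, 1, 20)
y = 2.0 * np.exp(1.5 * x)

def residuals(p):
    return p[0] * np.exp(p[1] * x) - y

def jacobian(p):
    e = np.exp(p[1] * x)
    return np.column_stack([e, p[0] * x * e])  # d r/d a, d r/d b

p = np.array([1.0, 1.0])   # initial guess
lam = 1e-2                 # damping parameter
cost = np.sum(residuals(p) ** 2)
for _ in range(100):
    r, J = residuals(p), jacobian(p)
    # Damped normal equations: (J^T J + lam I) step = -J^T r
    A = J.T @ J + lam * np.eye(2)
    step = np.linalg.solve(A, -J.T @ r)
    new_cost = np.sum(residuals(p + step) ** 2)
    if new_cost < cost:          # cost decreased: accept, trust the quadratic model more
        p, cost, lam = p + step, new_cost, lam * 0.5
    else:                        # cost increased: reject, fall back toward gradient descent
        lam *= 2.0

print(p)  # close to the true (2.0, 1.5)
```

The accept/reject test is what makes a single bad step harmless: the parameters only move when the cost actually goes down.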