Hacker News

Nah, you really don’t. The only notable difference would be in Stackless Python, and since for all practical purposes only CPython and PyPy matter when discussing Python usage on applied problems, it can safely be omitted.


You really do. "Generators" are an API; their implementations vary, even across CPython versions, and Stackless is not the only implementation (though a significant one) to consider when talking about differences in memory footprint. Numba, Cython, Nuitka, and a few other, more exotic compilers all have their own implementations of compiled generators, and some of them allocate the required tracking structure quite compactly on the stack [1].

But let's say I agree with you. To make the case demonstrably true, I'd like to see a context where the total volume of generator instantiations becomes less efficient than the corresponding list/tuple allocations with their contained data. Do you mind sharing your stats? What kind of codebase is it? I'd like to understand its layout and draw my own conclusion on whether the mentioned veteran code is sane.

[1] http://numba.pydata.org/numba-doc/0.20.0/developer/generator...


Hey, I actually used to work on the numba compiler team at Continuum. I can say I don’t understand the last rant paragraph at all or why you think it’s related to my point about generators.


> I can say I don’t understand the last rant paragraph at all or why you think it’s related to my point about generators.

This is not a rant. You made a claim about generators' allocation inefficiencies compared to actual list() and tuple() allocations; I'd like to see a case where it's true.


It is true in many, many cases. For a very simple illustration, consider

    def f():
        x = list(range(10000))
        yield from x
vs

    list(range(10000))
The former pays the memory overhead of both the materialized list and the generator’s execution frame.
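The extra cost is easy to observe in CPython (exact byte counts vary by version; this sketch only illustrates that the generator path pays for the list *and* for its own objects on top):

```python
import sys

def f():
    x = list(range(10000))
    yield from x

gen = f()
next(gen)  # run to the first yield; the 10000-element list now lives in gen's frame

lst = list(range(10000))

list_size = sys.getsizeof(lst)            # the materialized list alone
gen_size = sys.getsizeof(gen)             # the generator object, on top of the list
frame_size = sys.getsizeof(gen.gi_frame)  # plus its suspended frame (CPython-specific)

print(list_size, gen_size, frame_size)
```

Note that `gi_frame` is a CPython implementation detail; other implementations track the suspended state differently, which is part of the point above.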

The same thing happens when people naively split up generators that hold onto a lot of data, for example (this comes from real life experience where someone wanted to essentially “memoize with generators” a membership check on the response from a database call).

    def check_expensive_in():
        s = large_db_call()
        while True:
            x = yield
            yield x in s

    def expensive_filter():
        f = check_expensive_in()
        next(f)

        def helper(item):
            v = f.send(item)
            next(f)
            return v

        while True:
            items = yield
            for item in items:
                yield helper(item)
            yield None

    e = expensive_filter()
    next(e)
    e.send(some_list)
    # iterate e until None.
(Sorry for any typos or minor glitches with send(), I am writing this on my phone as I eat breakfast.)

It’s a very simple issue: memoizing with generators keeps the memory footprint of the memoized data _and_ the additional footprint of the generator itself (and of large values sent into the generator, too, though that’s less common).

It’s better to just materialize the things you need in memory and reuse them in regular function calls, which, unlike suspended generators, don’t carry a fixed overhead for a long lifetime.
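A sketch of that plain-function shape, mirroring the hypothetical names from the generator example (large_db_call() is a stand-in, not a real API):

```python
import functools

def large_db_call():
    # Hypothetical stand-in for the expensive query in the example above.
    return {1, 2, 3, 5, 8}

@functools.lru_cache(maxsize=1)
def expensive_set():
    # Materialize the result once; later calls reuse the cached frozenset,
    # with no suspended frames kept alive in between.
    return frozenset(large_db_call())

def check_expensive_in(x):
    return x in expensive_set()

def expensive_filter(items):
    return [item for item in items if check_expensive_in(item)]

print(expensive_filter([1, 4, 5]))  # [1, 5]
```

The caller composes ordinary function calls instead of driving two coupled generators with send()/next().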

This can absolutely happen with small-data examples too, where generators are less efficient than just materializing everything, but normally nobody cares because with small data the effect of any inefficiency won’t be noticed.

Even in that case, though, generators often lead to spaghetti code like my example above, because depending on laziness as you compose multiple functions is just a poor conceptual way to organize code. Very rarely, but sometimes, it’s worth it to avoid memory bottlenecks or to do stream processing. But how often this matters is generally overstated - it’s very rare unless you’re in a specialized domain where that’s all you do.

Lastly I’d like to say that your tone comes across as needlessly antagonistic and it seems extremely obvious you are engaging in bad faith. You don’t seem open to consider what I am saying, rather in a rush to demand some kind of “proof” with no willingness to think through it, and likely not seeking proof to learn anything but just to try to create shallow, undermining retorts.

I won’t be continuing to check back here or engage any further with you. If you want the last word in the thread, take it.


> You don’t seem open to consider what I am saying, rather in a rush to demand some kind of “proof” with no willingness to think through it

It doesn't work this way: if you state something to be true, the burden of proof lies on your shoulders. I am open to considering proof, not just words alone.

It's funny that this subthread started with the statement that my example of complexity-estimation difficulties was deeply contrived, yet the examples offered as proof of generators' inefficiencies materialize all intermediate values into lists. Why would I want to materialize a list just to yield from it later, when I can pass the range() stream to the caller instead?
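That is, instead of the materializing f() above, something like:

```python
def f():
    # Stream the values directly; no 10000-element list is ever held
    # in the generator's frame.
    yield from range(10000)

print(sum(f()) == sum(range(10000)))  # True
```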

Your second example is an exemplary code smell, as there's no need for the helper() closure. And there's no need to hold resources in check_expensive_in(), since most DBs have lazy query cursors in one form or another. After all, it's not about doing everything in one expensive DB call, and laziness doesn't negate buffered data; it just puts an explicit boundary on internal buffers within a single iteration, and each iteration may perform a separate, less expensive DB call.
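For instance, with the standard-library sqlite3 (the schema and table name here are made up for illustration), each membership check can be one small indexed query instead of a long-lived suspended generator holding the whole result set:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO members (id) VALUES (?)",
                 [(i,) for i in range(0, 1000, 2)])  # even ids only

def check_expensive_in(x):
    # One cheap indexed lookup per item; buffering stays inside the DB,
    # bounded per query, instead of the process holding everything.
    row = conn.execute("SELECT 1 FROM members WHERE id = ?", (x,)).fetchone()
    return row is not None

evens = [x for x in (1, 2, 3, 4) if check_expensive_in(x)]
print(evens)  # [2, 4]
```

Whether per-item queries beat one bulk fetch depends on the workload, of course; the point is only that laziness doesn't require a generator wrapper here.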



