
I would consider the following optimizations first before attempting to rewrite an HTTP API since you already did the hard part:

1. For multiple processes, use `gunicorn` [1]. It runs your app across multiple processes without you having to touch your code much. It's the same as having n instances of the same backend app, where n is the number of CPU cores you're willing to throw at it. One backend process per core, full isolation.

2. For multiple threads, use `gunicorn` + `gevent` workers [2]. This gives you multiprocess + multithreaded behavior out of the box if your workload is I/O intensive. It's not perfect, but it works very well in some situations.

3. Lastly, if CPU is your bottleneck, that means you have some memory to spare (even if it's not much). Throw an LRU cache or cachetools [3] over functions that return the same result for the same arguments, or functions that do expensive I/O.

[1]: https://www.joelsleppy.com/blog/gunicorn-sync-workers/

[2]: https://www.joelsleppy.com/blog/gunicorn-async-workers-with-...

[3]: https://pypi.org/project/cachetools/
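Option 3 above can be as little as one decorator. A minimal sketch with the stdlib's `functools.lru_cache` (the function name and return value are made up for illustration; cachetools works the same way but adds TTL/LFU policies and size-by-weight):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def fetch_user_profile(user_id: int) -> dict:
    # Pretend something expensive happens here: a DB query, an API call, ...
    return {"id": user_id, "name": f"user-{user_id}"}

fetch_user_profile(42)  # computed (cache miss)
fetch_user_profile(42)  # served from the cache (cache hit)
print(fetch_user_profile.cache_info())  # hits=1, misses=1
```

Note the cache is per-process, so with gunicorn each worker warms its own copy.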



These don't really apply to the parent commenter's scenario.

1) gunicorn or any solution with multiple processes is going to just multiply the RAM usage. Using 10-100GB of RAM per effective thread makes this sort of problem very RAM bound, to the point that it can be hard to find hardware or VM support.

2) This isn't I/O bound.

3) If your service is fundamentally just looking up data in a huge in-memory data store, adding LRU caching around that is unlikely to make much of a difference because you're a) still doing a lookup in memory, just for the cache rather than the real data, and b) you're still subject to the GIL for those cache lookups.

I've also written services like this. We only loaded ~5GB of data, but that was enough to make the service difficult to manage in these kinds of ways. The GIL-ectomy will probably have a significant impact on these sorts of use cases.


For #1, would copy-on-write help? Or does Python store the reference counters on the objects themselves?


Ha! Yes! Unfortunately I know this for terrible reasons. Python objects are reference counted, and the refcount lives in the object's header, so copy-on-write doesn't work for them: merely reading an object writes to its refcount, which dirties the page. (Note: if your Python object is actually just a reference to a native object in a library, all bets are off; it may work or may not.)
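You can see the header write with `sys.getrefcount` (a small sketch; the list contents are arbitrary):

```python
import sys

# Even "read-only" access to a Python object mutates it: binding another
# name to it bumps the refcount stored in the object's header. After a
# fork(), that write dirties the shared page, defeating copy-on-write.
data = [object() for _ in range(3)]

before = sys.getrefcount(data[0])
alias = data[0]  # a mere read-only reference to the object...
after = sys.getrefcount(data[0])

print(after - before)  # 1 -- the object's header was written
```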

We had an issue with the service I mentioned above where VMs with ~6GB RAM weren't working: at the point that gunicorn forked, there was instantaneously >10GB of RAM usage because everything got copied. We had to make sure the data file was only loaded after the daemon fork, which unfortunately limits the benefits of that fork; part of the idea is that you do all your setup before forking so you know you've started cleanly.


> 1. For multiples processes use `gunicorn`

This will load up multiple processes, like you say. OP loads a large dataset, and Gunicorn would copy that dataset into each process. I have never figured out shared memory with Gunicorn.


> Gunicorn would copy that dataset in each process

Assuming you're on Linux/BSD/macOS, sharing read-only memory is easy with Gunicorn (as opposed to actual POSIX shared memory, for which there are multiprocessing wrappers, but those are much harder to use).

To share memory in copy-on-write mode, add a call that loads your dataset into something global (e.g. a module-level global, a class variable, or an lru_cache on a free/class/static method) in gunicorn's "when_ready" config hook[1].
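A minimal sketch of that hook as a `gunicorn.conf.py` (the `myapp.dataset` module, its `load()` function, and the `DATA` global are all assumptions standing in for your own loading code):

```python
# gunicorn.conf.py -- config fragment, not a standalone script.

workers = 4

def when_ready(server):
    # Runs in the master process, before workers are forked.
    # Populating a module-level global here means every forked worker
    # inherits the loaded data via copy-on-write instead of reloading it.
    from myapp import dataset  # hypothetical module
    dataset.DATA = dataset.load()  # e.g. reads a multi-GB file once
```

Workers then read `dataset.DATA` directly; as long as they never write to it (and modulo Python's refcount writes discussed elsewhere in this thread), the pages stay shared.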

This will load your dataset once on server start, before any processes are forked. After processes are forked, they'll gain access to that dataset in copy-on-write mode (this behavior is not specific to Python/gunicorn; rather, it's a core behavior of fork(2)). If those processes do need to mutate the dataset, they'll only mutate their own copy-on-write copies of it, so their mutations won't be visible to other parallel Gunicorn workers. In other words, if one request in a two-worker gunicorn mutates the dataset, a subsequent request has only a 50% chance of observing that mutation.
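That mutation-isolation behavior is plain fork(2) semantics and can be demonstrated in a few lines (Unix only; a minimal sketch, not gunicorn-specific):

```python
import os

data = {"x": 1}  # stands in for the big preloaded dataset

pid = os.fork()
if pid == 0:
    data["x"] = 2  # the child writes only to its copy-on-write pages
    os._exit(0)

os.waitpid(pid, 0)
print(data["x"])  # 1 -- the child's mutation was private to the child
```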

If you do need mutable shared memory, you could either check out databases/caches as other commenters have mentioned (Redislite[2] is a good way to embed Redis as a per-application cache into Python without having to run or configure a separate server at all; you can launch it in gunicorn's "when_ready" as well), or try true shared memory[3][4].
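For reference, the stdlib route [3][4] looks roughly like this (Python 3.8+; a minimal single-process sketch, but the attach-by-name step in the comment is exactly what a second worker would do):

```python
from multiprocessing import shared_memory

# Create a small block of true (mutable) shared memory.
shm = shared_memory.SharedMemory(create=True, size=16)
try:
    shm.buf[:5] = b"hello"  # any process attached to the block sees this
    # Another worker would attach to the same block by name:
    #   shared_memory.SharedMemory(name=shm.name)
    print(bytes(shm.buf[:5]).decode())  # prints "hello"
finally:
    shm.close()
    shm.unlink()  # free the block once no one needs it
```

The hard part the parent alludes to is that you get a raw byte buffer, so anything structured has to be serialized into it (or use `multiprocessing.Manager` proxies at a performance cost).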

1. https://docs.gunicorn.org/en/stable/settings.html#when-ready

2. https://pypi.org/project/redislite/

3. https://docs.python.org/3/library/multiprocessing.html#share...

4. https://docs.python.org/3/library/multiprocessing.shared_mem...


One way to achieve similar performance is Redis or memcached running on the same node. It really depends on the workload, too. If it is lookups by key without much post-processing, that architecture will probably work well. If it's a lot of scanning, or a lot of post-processing, in-process caching might be the way to go, maybe with some kind of request affinity so that the cache isn't duplicated across each process.



