I'm not sure operators are a good example; they can be idempotent or not depending on the implementation. An operator is a function that takes two values and returns a result: result = op(left, right).
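To make that concrete, here is a minimal illustration (the function names are made up): a binary operator op is idempotent when op(x, x) == x, and that property depends entirely on what the function does.

```python
def op_max(left, right):
    # idempotent: combining a value with itself changes nothing
    return max(left, right)

def op_add(left, right):
    # not idempotent: combining a value with itself gives a new result
    return left + right

assert op_max(5, 5) == 5    # reprocessing the same event is harmless
assert op_add(5, 5) == 10   # reprocessing the same event double-counts
```

This is why "exactly-once vs. at-least-once" matters more for operators like addition than for operators like max.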
But that safety net doesn't extend to startup founders, does it? AFAIR, in Norway you need to have worked in a company as a regular employee for at least one or two years before you can claim unemployment benefits. Health insurance is always there, and so are basic social benefits (but the unemployment benefit is the only one you can live on).
Yes, they do serve different purposes but they also share similarities. You could easily write a task queue on top of Faust.
It's important to remember that users had difficulty understanding the concepts behind Celery as well; perhaps it's more approachable now that you're used to it.
Using an asynchronous iterator to process events enables us to maintain state. It's no longer just a callback that handles a single event; you can do things like "read 10 events at a time", or "wait for events on two separate topics and join them".
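A toy sketch of the idea (plain asyncio, not Faust's actual implementation): because the consumer drives the iteration, it can keep state across events, e.g. accumulate them into batches.

```python
import asyncio

async def event_source():
    # stand-in for a stream of events from a topic
    for i in range(25):
        yield i

async def take(stream, max_batch):
    # We control the loop, so we can hold state (the partial batch)
    # across events instead of handling one event per callback.
    batch = []
    async for event in stream:
        batch.append(event)
        if len(batch) == max_batch:
            yield batch
            batch = []
    if batch:
        yield batch

async def main():
    batches = []
    async for batch in take(event_source(), 10):
        batches.append(batch)
    return batches

batches = asyncio.run(main())
assert [len(b) for b in batches] == [10, 10, 5]
```

The same pattern generalizes to joining two iterators, windowing, and so on.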
Faust is a library that you can import into your Python program, and all it requires is Kafka. Most other stream processing systems require additional infrastructure. Kafka Streams has similar goals, but Faust additionally enables you to use Python libraries and perform async I/O operations while processing the stream.
There's an examples/django project in the distribution. I think they removed the gevent bridge from PyPI for some reason, but you can still use the eventlet one. Gevent is production quality and the concept of bridging them is sound, so I hope someone will work on it.
Faust uses Kafka for message passing. The new messages you create can be pushed to a new topic and you could have another agent consuming from this new topic. Check out the word count example here: https://github.com/robinhood/faust/blob/9fc9af9f213b75159a54...
Also note that the Table is persisted in a log-compacted Kafka topic. This means we can recover the state of the table in the case of a failure. However, you can always write to any other datastore while processing a stream within an agent. We have some services that process streams and store state in Redis and Postgres.
What I'm thinking of is flushing the table to a secondary storage so that other services can query that data.
I think Storm/Flink have the concept of a "tick tuple": a message that arrives every n seconds to tell the worker to flush the table to some other store. I've been looking over the project, and I'm not sure how I would do this in Faust yet; as far as I understand, the "Table" is partitioned, so you'd have to send a tick to every worker.
In addition to that, you can have an arbitrary method run at an interval using the app.timer decorator. You can use this to flush every n seconds. You could also use stream.take to read batches of messages (or wait t seconds, whichever happens first) and write to an external store.
Hey, I am a huge user of Apache Flink in personal and work projects. I skimmed through your readme, but I still have the following questions.
What are the advantages of using Faust, feature-wise or otherwise? Are you guys planning on having feature parity with Flink (triggers, evictors, a process function equivalent, etc.)? I can definitely see its use in having better compatibility with the ML/AI environment and the other tools in the Python toolkit. But specifically regarding ML/AI, I can usually export those things into the JVM. And as regards Flask/Django, I can use Scala's http4s.
Sorry if that seemed like a lot! Trust me, I'm very excited about this project!
When it comes to batch, I still think Apache Spark and the standard scientific computing stack in Python are king.
But when it comes to streaming, I don't think Spark Structured Streaming is very close, although Spark 2.3.0 might be sufficient for your use cases; I use Spark Structured Streaming only when I have to deploy a Spark pipeline/ML model.
But, besides the features I listed, I think Flink is just better written for streaming.
For example, defining checkpoints and state management in Flink is so much more expressive and easier to write (and perhaps more performant from what I saw at Flink Forward).
Flink Async is amazing!!!!
The Flink table/sql API is not as nice as Spark SQL for batch right now, but it is getting there.
Flink also seems to perform better than Spark Structured Streaming when you turn on object reusability, at least according to Alibaba and Netflix.
And then, the features I listed are amazing. Evictors and triggers are super helpful. Process Functions let me write custom aggregations without a lot of mental overhead.
Time Characteristics are really nice in Flink.
Flink also has a lot of nice connectors and such already written (although Parquet files are troublesome if they come from Spark, because of the way Spark handles its read and write support classes). For example, in Spark Structured Streaming you have to write your own Kinesis connector (or use the one Databricks put out on their blog).
Sorry if this seems all over the place. I just posted what came to mind. And I should add, Databricks is doing a lot of work to make Spark Structured Streaming really comparable, so who knows what stuff looks like in 2-3 years. Like I said above, I use Spark Structured Streaming (2.3.0+) for some very specific ML stuff that my Flink environment cannot handle nicely yet.
My only complaint is that there isn't a lot of community written code out there, but the Data Artisans team is super active, helpful, and nice.
Thanks, this is super helpful and exactly what I was looking for. My main prospective use of Flink is as a target for Apache Beam; GCP Dataflow is great, but I want to have portable jobs, and Flink looks like the best target (over Spark).
Never used Beam before, and I am unsure if it uses all the features of Flink. And the documentation seems a bit sparse (I couldn't find how to tune checkpoints for example on a quick glance). But if you know how to tune it to use more of Flink or it works for your use case (it supports Python and Go), then go for it. It looks pretty neat.
I'd love to see some docs/examples around supporting pluggable brokers; Faust looks like exactly what I want from a python stream processor, but for various technical and operational reasons it would be nice to be able to provide a RabbitMQ/AMQP broker.
I'm looking at Redis Streams and it seems to lack the partitioning component of Kafka. One of the properties we rely on is the ability to manually assign partitions, such that a worker assigned partition 0 of one topic will also be assigned partition 0 of the other topics it consumes from.
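A small illustration of why that co-partitioning property matters (the partitioner below is a made-up stand-in, not Kafka's actual one): if two topics share the same deterministic partitioner and partition count, all events for a given key land on the same partition number in both topics, so one worker sees everything for that key.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Deterministic: the same key always maps to the same partition,
    # regardless of which topic the event is on.
    return zlib.crc32(key) % num_partitions

# The worker assigned partition N of 'orders' can also be assigned
# partition N of 'payments' and will see all events for this customer.
p_orders = partition_for(b'customer-42', 8)
p_payments = partition_for(b'customer-42', 8)
assert p_orders == p_payments
```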
Simplicity is of course a goal, but this may mean we have to sacrifice some features when Redis Streams is used as a backend.
I'm glad you like Celery; this project in many ways realizes what I wanted it to be.
In Faust we've had the ability to take a very different approach to quality control. We have a cloud integration testing setup that actually runs a series of Faust apps in production. We do chaos testing: randomly terminating workers, randomly blocking network access to Kafka, and much more, all the while monitoring the health of the apps and the consistency of the results they produce.
If you have a Faust app that depends on a particular feature we strongly suggest you submit it as an integration test for us to run.
Hopefully some day Celery will be able to take the same approach, but running cloud servers costs money that the project does not have.
My name is Ask and I'm a co-creator of this project, along with Vineet Goel.
This answer sums up why we wanted to use Python for this project: it's the most popular language for data science and people can learn it quickly.
Performance is not really a problem either: with Python 3 and asyncio we can process tens of thousands of events per second. I have seen 50k events/s, and there are still many optimizations to be made.
That said, Faust is not just for data science. We use it to write backend services that serve WebSockets and HTTP from the same Faust worker instances that process the stream.
You don't think Equifax have lawyers? Of course they do, and they know that the clause exists. Just having the clause in the first place is a conspiracy to abuse consumers.
What I mean is that I don't think anybody specifically was thinking "Aha! We can put this clause in there today, and we've got a free out! Hooray!", precisely because even if the clause did work as putatively designed, it wouldn't hold up, and they'd completely know it. It isn't even useful as a "throw the spaghetti against the wall and see what sticks" maneuver.
To believe that this clause is related to this matter is to require not merely mendacity (believable), not merely stupidity (believable), but an unbelievably precise combination of mendacity and stupidity that can only be read as constructing a rationalization for a pre-supposed conclusion.