On Apache Kafka, PostgreSQL, and Apache Flink. Q&A with Francesco Tisiot

Q1. Many enterprises are shifting towards real-time, event-driven architectures. How do Apache Kafka, PostgreSQL, and Apache Flink complement each other to form the backbone of a modern data platform, and what fundamental business capability does this combination unlock compared to traditional data warehouses?

PostgreSQL, Apache Kafka, Apache Flink, and ClickHouse are fundamentally reshaping how modern enterprises manage and leverage data, forming a powerful quartet for real-time data processing and analytics. PostgreSQL, with its widespread availability, rich features, scalability, and strong community, has become the de-facto relational database for developers and enterprises, serving as a robust system of record. Apache Kafka is the industry standard for real-time event streaming, decoupling producers and consumers to enable high-efficiency operations for microservices, notification services, and data synchronization, further enhanced by the real-time, scalable, and customizable Kafka Connect framework. Apache Flink acts as the “orchestrator tool,” transforming data in-flight from Kafka to enable near-real-time pipelines for tasks like PII data obfuscation, aggregations, alerting, and pattern detection, making it essential for industries where time-to-decision is paramount, like banking and e-commerce. ClickHouse, a column-oriented database, excels at analytical queries over large datasets, providing immediate insights from the data streams processed by Kafka and Flink and enabling businesses to move beyond batch processing towards real-time, data-driven decisions.
 

Q2. Adopting powerful open-source technologies can be challenging. What are the most significant hurdles—whether technical, operational, or cultural—that enterprises face when migrating to this stack, and what is your top recommendation for a smooth transition?

I wouldn’t call the adoption of open-source tech a challenging task nowadays. From a technical and operational point of view, technologies like PostgreSQL, Kafka, Flink, and ClickHouse have proven to be secure, scalable, and maintainable pieces of software, and their usage across a huge variety of users, plus their growing communities, is a testament to their long-term solidity. Of course, when migrating across any technology the devil is in the details: niche features can have very different performance and behavioural profiles, so thorough testing needs to be ensured. However, the same applies to any migration, whether it involves open-source or proprietary tooling.

From a cultural point of view, companies might be reluctant to invest in open source because they don’t feel the customer-vendor relationship that is ingrained in a proprietary system. Nowadays, however, this is no longer a concern, thanks to the availability of managed services like PostgreSQL across all hyperscalers from a variety of vendors, including Aiven. This provides the best of both worlds: the use of an open-source tool, with the benefits of its community and the avoidance of any lock-in, combined with the automation, expertise, and talent that the companies offering the managed service bring to the market.

A suggestion for a smooth transition is to avoid adopting a technology for its own sake and to start with a well-scoped small project that demonstrates the value it brings to the business. This approach provides two benefits: a laser-focused implementation and short-term ROI. The last important piece of advice is to concentrate on what matters: knowing all the internals of PostgreSQL, its backup mechanisms, and its High Availability options is a must-have in the long term, but it can be avoided at the beginning of the journey by adopting production-ready managed services.
 

Q3. Could you walk us through a specific, powerful use case, such as real-time fraud detection or dynamic customer personalization? Please explain the distinct role each component (Kafka for ingestion, Flink for processing, and PostgreSQL for storage/serving) plays in taking data from an event to an action.

Let’s cover a case of real-time fraud detection, which is pretty common for any shop with an online presence. The usual operational backend for such companies is a PostgreSQL database where data for customers, orders, and inventory is properly stored across various tables. Applying the KISS principle, you could start immediately by performing analytical queries directly against the PostgreSQL database to understand unexpected order patterns on your website.
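
As a minimal sketch of that KISS starting point, and assuming a simplified orders table with hypothetical customer_id, total_amount, and created_at columns, such an analytical query run directly against the operational database could look like this (the threshold of 10 orders is purely illustrative):

    -- Hypothetical schema: orders(order_id, customer_id, total_amount, created_at)
    -- Spot customers placing an unusually high number of orders in the last 5 minutes.
    SELECT customer_id,
           COUNT(*)          AS orders_last_5_min,
           SUM(total_amount) AS amount_last_5_min
    FROM   orders
    WHERE  created_at >= now() - interval '5 minutes'
    GROUP  BY customer_id
    HAVING COUNT(*) > 10
    ORDER  BY orders_last_5_min DESC;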

However, this pattern has two main problems:

  • It works in “batch mode”: you run the analytical query on demand, every hour or every 5 minutes, leaving the time between batches free for bad behaviour to happen. E.g. if a customer makes 20 orders in 2 minutes, you may only find out 3 minutes later, which in some cases is too late.
  • It puts stress on the operational database: the main responsibility of the PostgreSQL database is to reply to operational queries, while you’re also using it for a completely different query pattern. While this might not be a problem on small data volumes, it will slow the database down once the traffic (and the volume of analytical queries) grows.

The first evolution could be to add a PostgreSQL read-only replica, where all the analytical queries could land with no impact on the operational database. However, this only solves point 2 above, not the need to work in real time. This is where Kafka and Flink enter the scene.

Apache Kafka, with the Kafka Connect framework, can extract all the changes happening to one or more tables in real time by reading from the PostgreSQL WAL (Write-Ahead Log). Once the data is in Kafka, Apache Flink can create a pipeline using, for example, window functions or pattern matching to spot unusual behaviour in customer orders. A classic example is analysing in real time whether a customer made a large number of orders within a small timeframe, or a number of orders from a different set of locations. All these examples are pretty trivial to create in Apache Flink using its powerful SQL layer.
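
One way such a pipeline could be sketched in Flink SQL, assuming the CDC connector publishes order changes as flat JSON to a hypothetical Kafka topic (table, topic, and field names, the bootstrap servers, and the thresholds are all assumptions):

    -- Kafka-backed source table over the change stream coming from PostgreSQL.
    -- With a Debezium-based connector, 'format' = 'debezium-json' could be used instead.
    CREATE TABLE kafka_orders (
      order_id     STRING,
      customer_id  STRING,
      total_amount DOUBLE,
      order_time   TIMESTAMP(3),
      WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'shop.public.orders',
      'properties.bootstrap.servers' = 'kafka:9092',
      'format' = 'json',
      'scan.startup.mode' = 'latest-offset'
    );

    -- Sliding window: flag customers with more than 10 orders in any 5-minute window.
    SELECT customer_id, window_start, window_end, COUNT(*) AS orders_in_window
    FROM TABLE(
      HOP(TABLE kafka_orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE, INTERVAL '5' MINUTE))
    GROUP BY customer_id, window_start, window_end
    HAVING COUNT(*) > 10;

The “orders from a different set of locations” pattern could be expressed in a similar way with Flink SQL’s MATCH_RECOGNIZE clause.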

Finally, once the fraud has been detected in real time, the affected users could immediately be blocked by having Flink write directly into a dedicated column of the Customer table and/or by sending a dedicated alert to the abuse team.
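
To close the loop, the detected frauds can be pushed straight back to PostgreSQL through Flink’s JDBC connector. The sketch below writes to a hypothetical flagged_customers table (which would need to exist in PostgreSQL; connection details and names are placeholders) rather than updating the Customer table directly, just to keep the example self-contained:

    -- JDBC sink pointing at the operational PostgreSQL database (upsert on the primary key).
    CREATE TABLE flagged_customers (
      customer_id STRING,
      flagged_at  TIMESTAMP(3),
      reason      STRING,
      PRIMARY KEY (customer_id) NOT ENFORCED
    ) WITH (
      'connector'  = 'jdbc',
      'url'        = 'jdbc:postgresql://db-host:5432/shop',  -- credentials omitted for brevity
      'table-name' = 'flagged_customers'
    );

    -- Continuously write every customer exceeding the threshold back into PostgreSQL.
    INSERT INTO flagged_customers
    SELECT customer_id, window_end AS flagged_at, 'order burst' AS reason
    FROM TABLE(
      HOP(TABLE kafka_orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE, INTERVAL '5' MINUTE))
    GROUP BY customer_id, window_start, window_end
    HAVING COUNT(*) > 10;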

Q4. While Kafka and Flink are native to stream processing, PostgreSQL is often seen as a classic system of record. How has its role evolved in modern data architectures, and how do its advanced features, like JSONB support and extensibility, make it a uniquely powerful partner in a real-time stack?

Sometimes technology needs to be boring, even more so for a database. If a database is boring, it means that it works and is resilient, secure, and scalable. For PostgreSQL this is all true, but the innovation happening around it in the community has also started bringing cutting-edge data features like JSONB support, vector/hybrid search, and the ability to support mixed OLTP/OLAP workloads.

The ever-growing set of PostgreSQL extensions makes it usable in a huge variety of real-time and AI solutions. A few examples, illustrated in the combined sketch after this list:

  • JSON support: the debate between relational and document databases has been around in the database industry for ages. With JSONB support, PostgreSQL closes the gap by offering the benefits of a relational structure where it matters and the flexibility of a JSON document column, with dedicated indexing to speed up performance.
  • PGVector, PGAI, PGVectorScale: these extensions provide full vector support, enabling vector/hybrid similarity search, which is key for AI workloads like Retrieval-Augmented Generation (RAG). In modern solutions where RAG pipelines need to perform vector search in real time, PostgreSQL offers a scalable and performant option.
  • Logical replication: whether it is used for primary to read-only replication or for change data capture flows involving other technologies like Kafka, PostgreSQL’s WAL-based replication opens the door to propagating operational changes to other systems in near real time.
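
A combined sketch of those three capabilities, with purely illustrative table names and a toy 3-dimensional embedding, could look like this on a PostgreSQL instance with the relevant extensions installed:

    -- JSONB: flexible document data next to relational columns, with a GIN index for fast containment queries.
    CREATE TABLE customer_profiles (
      customer_id BIGINT PRIMARY KEY,
      profile     JSONB
    );
    CREATE INDEX idx_profiles_gin ON customer_profiles USING GIN (profile);
    SELECT customer_id FROM customer_profiles WHERE profile @> '{"tier": "gold"}';

    -- pgvector: similarity search for RAG-style lookups (real embeddings have hundreds of dimensions).
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE TABLE documents (
      doc_id    BIGINT PRIMARY KEY,
      content   TEXT,
      embedding vector(3)
    );
    SELECT doc_id, content
    FROM   documents
    ORDER  BY embedding <-> '[0.1, 0.2, 0.3]'::vector  -- query embedding produced by the application
    LIMIT  5;

    -- Logical replication: publish operational tables so replicas or CDC tools like Kafka Connect can subscribe.
    CREATE PUBLICATION orders_pub FOR TABLE customer_profiles, documents;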
     
Q5. Looking ahead, what major trends or other open-source projects do you see integrating with this ecosystem? Specifically, how might developments in real-time AI/ML or technologies like Apache Iceberg further enhance what’s possible with this powerful trio?

The last few months have showcased the success of Apache Iceberg as the default format for the data lake, with many major technologies building integrations towards this powerful data format. A clear example is Apache Kafka, which is moving towards making Iceberg the default format for long-retention data. This will provide a new paradigm: ingest once (via Kafka) and query effectively both in real time and in batch, reshaping how data is ingested, transformed, and made available.
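
While the Kafka-native Iceberg tiering described above is still taking shape, the “ingest once, query in real time and in batch” pattern can already be sketched today with Flink’s Iceberg integration. The catalog settings, warehouse location, and names below are assumptions, and kafka_orders refers to a Kafka-backed source table like the one sketched in Q3:

    -- Register an Iceberg catalog backed by object storage (location is hypothetical).
    CREATE CATALOG lake WITH (
      'type'         = 'iceberg',
      'catalog-type' = 'hadoop',
      'warehouse'    = 's3://my-bucket/warehouse'
    );
    CREATE DATABASE IF NOT EXISTS lake.lakedb;

    -- Long-retention copy of the order stream as an Iceberg table.
    CREATE TABLE lake.lakedb.orders_history (
      order_id     STRING,
      customer_id  STRING,
      total_amount DOUBLE,
      order_time   TIMESTAMP(3)
    );

    -- Stream the events in once; batch engines can query the same Iceberg table later.
    INSERT INTO lake.lakedb.orders_history
    SELECT order_id, customer_id, total_amount, order_time
    FROM   kafka_orders;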

If we are talking about the future, we need to cover AI. Generative AI is changing the way we build and work. Major innovations are happening in this space across technologies where AI is assisting humans to perform their tasks better and faster. A clear example is AI database optimizers like Aiven’s, which give database professionals all the visibility needed to understand what to improve in the database, as well as suggestions on how to improve performance. In the future, I can clearly state that AI and data will work very closely together: real-time inference and training will be part of a future where behaviours and models are not trained on the last 6 months of data but rather on the last three hours, and will therefore be much quicker to adapt to new trends and understand anomalies. On the flip side, with AI changing and reshaping itself at such a fast pace, metrics, observability, and automatic control of models will be key, and the ability to quickly take action if a model’s behaviour changes too drastically will be a need for any user and company.

The last comment is about the pace of change in AI, models, and vendors. We are witnessing an evolution that can’t be compared to anything before, with new models arriving on the market on a weekly basis from an increasing variety of providers. The ability to switch the model powering an application is a must-have, and therefore the ability to serve data to the model in a performant and cost-effective way is key. This is why open source will be crucial in the future: the ability to seamlessly move data to wherever the new model is will be key to following innovation at the best price point. It’s far cheaper to move a dataset once from one hyperscaler to another in order to use a new model available from a particular vendor than to perform cross-cloud inference. Open source provides this ability to move your solution across clouds seamlessly, so the adoption of open-source technologies like PostgreSQL, Apache Kafka, Apache Flink, and ClickHouse is crucial to creating cost-effective, secure, scalable solutions that are future proof.

……………………………………………….

Francesco Tisiot, Field CTO at Aiven, is dedicated to helping customers achieve success through data-driven innovation. Francesco believes that technology should be an enabler, not a barrier, and works closely with clients to understand their unique needs and develop tailored solutions that deliver measurable results. A seasoned technologist with a proven track record, Francesco is a trusted advisor to organizations across industries.
