Kafka partition calculator for geolocation data workloads

Best practices

Geolocation data streams are central to vehicle tracking and routing applications. Let’s look at the geolocation data in a ride-sharing system. In this example the geolocation data messages are small, only around 500 bytes. The messages typically include GPS coordinates, timestamps, and vehicle IDs.

The message rate of geolocation data streams is high, at around 2,000 messages per second (msg/s). The streaming pattern tends to be cyclical, with peak activity during commuting hours.
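To put those two figures together, the short sketch below converts message size and message rate into raw payload bandwidth. The function name and the decimal definition of a megabyte (1 MB = 1,000,000 bytes) are assumptions made for illustration; the 500-byte size and 2,000 messages-per-second rate come from the example above.

```python
MESSAGE_SIZE_BYTES = 500      # approximate message size from the example
MESSAGES_PER_SECOND = 2_000   # message rate from the example


def bandwidth_mb_per_s(message_size_bytes: int, messages_per_second: int) -> float:
    """Return raw payload bandwidth in MB/s, assuming 1 MB = 1,000,000 bytes."""
    return message_size_bytes * messages_per_second / 1_000_000


print(bandwidth_mb_per_s(MESSAGE_SIZE_BYTES, MESSAGES_PER_SECOND))  # 1.0
```

At roughly 1 MB/s of payload, this workload is dominated by message count rather than by bandwidth. The figure excludes protocol overhead, replication, and compression.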

The Kafka partition calculator inputs below are good starting points for geolocation data streaming projects with small, medium, and large fleets of vehicles generating data.

Assumptions

The brokers are of similar capability.
The load on the brokers’ machines is similar.
The messages don't diverge too much in size.
The messages are evenly distributed across all partitions.
The number of brokers makes sense in this context.
Brokers have similar latencies between producers and consumers.
The throughput per producer is less than 10 MB/s.
Individual brokers have fewer than 40k partitions.
The cluster has fewer than 200k partitions in total.
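The three numeric limits in the list above can be checked mechanically before trusting the calculator's output. The following is a minimal sketch, not part of the calculator itself: the function name, the returned labels, and the input shape (one partition count per broker) are all hypothetical.

```python
# Limits taken from the assumptions list; each is a strict "less than" bound.
MAX_PRODUCER_MB_PER_S = 10
MAX_PARTITIONS_PER_BROKER = 40_000
MAX_PARTITIONS_IN_CLUSTER = 200_000


def check_limits(producer_mb_per_s, partitions_per_broker):
    """Return a list naming every numeric assumption that is violated.

    partitions_per_broker holds one partition count per broker.
    """
    problems = []
    if producer_mb_per_s >= MAX_PRODUCER_MB_PER_S:
        problems.append("producer throughput")
    if any(count >= MAX_PARTITIONS_PER_BROKER for count in partitions_per_broker):
        problems.append("partitions per broker")
    if sum(partitions_per_broker) >= MAX_PARTITIONS_IN_CLUSTER:
        problems.append("partitions in cluster")
    return problems


print(check_limits(1.0, [64, 64, 64]))  # []
```

An empty list means the numeric assumptions hold. The qualitative assumptions, such as similar broker capability and even message distribution, still need to be verified by other means.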

Results

number of producers

number of consumers

192

Expected lag

100

number of partitions

192

Business size preset

The size of the company in terms of the volume of geolocation data its fleet generates (not number of employees, market capitalization, etc.).

It affects the calculations because it mainly influences data volume and resource needs.

Medium

Number of brokers

The number of brokers in the cluster where the topic will be created.

This affects calculations as the load generated by producers and consumers should be evenly distributed across all the brokers.

1 to 140

Producer processing time

The average time it takes to produce a message, in ms.

This affects calculations as it puts a hard limit on how fast a single producer can produce messages.

0.1 ms to 1,000 ms

Consumer processing time

The average time it takes to consume a message, in ms.

This affects calculations as it puts a hard limit on how fast a single consumer can consume messages.

0.1 ms to 1,000 ms

Throughput

The number of messages that the system should process per second.

This affects calculations as it will determine how much parallelism is required to achieve the desired throughput.
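The way processing time and throughput combine into a parallelism requirement can be sketched as follows. This is a common back-of-the-envelope heuristic, not the calculator's actual formula, which this page does not publish. The function names and the 5 ms and 50 ms processing times are hypothetical. The one Kafka-specific fact relied on is that, within a consumer group, each partition is read by at most one consumer, so the partition count must be at least the consumer count.

```python
import math


def min_workers(throughput_msg_s: float, processing_time_ms: float) -> int:
    """Smallest number of workers that can sustain the target throughput.

    A worker needing processing_time_ms per message handles at most
    1000 / processing_time_ms messages per second.
    """
    return math.ceil(throughput_msg_s * processing_time_ms / 1000)


def size_topic(throughput_msg_s: float, producer_ms: float, consumer_ms: float) -> dict:
    """Estimate producers, consumers, and a lower bound on partitions."""
    producers = min_workers(throughput_msg_s, producer_ms)
    consumers = min_workers(throughput_msg_s, consumer_ms)
    # Assumption: one partition per unit of parallelism on the busier side.
    partitions = max(producers, consumers)
    return {"producers": producers, "consumers": consumers, "partitions": partitions}


# Hypothetical inputs: 2,000 msg/s, 5 ms to produce, 50 ms to consume.
print(size_topic(2_000, producer_ms=5, consumer_ms=50))
# {'producers': 10, 'consumers': 100, 'partitions': 100}
```

The result is a floor, with no headroom for traffic peaks or rebalancing, so real deployments typically provision above it.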

1 msg/s to 10,000 msg/s

Recommended configuration for a medium business size

Expected lag (L) of X1, a reasonable latency given the rapid update frequency.
Recommended number of producers (RP) is X2, facilitating the high throughput.
Recommended number of consumers (RC) is X3, allowing for timely consumption of the messages.
The number of partitions for topics (NP) should be X4, ensuring efficient data distribution within the prescribed partition limits.

Each element of this configuration plays a significant role in optimizing Kafka's use in the ride-sharing industry. This configuration ensures that drivers and passengers experience minimal latency in location updates, contributing to smooth and responsive services.

Understanding the specific needs of a medium-sized company in this context, with moderate data flow and balanced Kafka configurations, is key to achieving optimal message processing.

In conclusion, the effective use of Kafka in ride-sharing services depends on a nuanced understanding of the industry's unique demands and a precise alignment of Kafka's powerful features to meet those needs.