Real-time big data processing with Spark Streaming

Big Data is a trending topic in the IT sector and has been for quite some time. Nowadays vast amounts of data are being produced, especially by web applications, HTTP logs, or Internet of Things devices.

For such volumes, traditional tools like Relational Database Management Systems are no longer suitable. Terabytes or even petabytes are quite common numbers in big data context, which is definitely not the capacity that MySQL, PostgreSQL, or any other database can pick up.

To harness huge amounts of data, Apache Hadoop would generally be the first and natural choice, and it’s probably right, with one assumption: Apache Hadoop is a great tool for batch processing. It proved to be extremely successful for many companies, such as Spotify. Their recommendations, radio, playlist workloads, etc. are suitable for batch processing. However, it has one downside – you need to wait for your turn. It usually takes about one day to process everything, scheduled accordingly and executed in a fail-over manner.

But what if we don’t want or can’t wait?

Read more