Real-time big data processing with Spark Streaming
2016-09-09

Big Data has been a trending topic in the IT sector for quite some time. Nowadays vast amounts of data are being produced, especially by web applications, HTTP logs, and Internet of Things devices.
For such volumes, traditional tools like Relational Database Management Systems are no longer suitable. Terabytes or even petabytes are quite common numbers in a big data context, well beyond what MySQL, PostgreSQL, or any other traditional database can handle.
To harness huge amounts of data, Apache Hadoop is usually the first and natural choice, and rightly so, with one caveat: Apache Hadoop is a great tool for batch processing. It has proved extremely successful for many companies, such as Spotify, whose recommendation, radio, and playlist workloads are well suited to batch processing. However, it has one downside – you need to wait for your turn. A full run typically takes about a day, with jobs scheduled accordingly and re-executed on failure.
But what if we don’t want or can’t wait?
In this instance, streaming technology comes to the rescue. It all started with the Apache Storm project, open-sourced in September 2011 after its creators were acquired by Twitter. Storm is still a significant player in the streaming market, but nowadays Apache Spark and its streaming module have gained incredible popularity. Spark Streaming provides scalable, high-throughput, and fault-tolerant processing of live data streams.
Unlike Apache Storm, which is purely a stream event processor, Spark Streaming can be combined with Spark's other libraries, such as machine learning, SQL, or graph processing. That opens up endless possibilities and covers a wide range of use cases.
Overall, Spark Streaming doesn't differ that much from regular Spark batch jobs. In both cases, we operate on some input, apply transformations and/or compute something out of our input data, and then output it somewhere. The only difference is the continuous character of streaming jobs: they run indefinitely until we terminate them (like a stream of water in a river compared to a bucket filled with water, as an analogy to batch processing). With that in mind, we can choose one of the following sources for our streams (a minimal example follows the list):
- Apache Kafka
- Apache Flume
- HDFS or S3 filesystem
- Amazon Kinesis
- TCP socket
- MQTT
- ZeroMQ
- Custom sources (not yet available for Python, only Java and Scala).
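As a rough illustration of the input, transform, output flow described above, here is a minimal word-count sketch in Scala that uses the TCP socket source from the list; the application name, host, port, and batch interval are arbitrary choices for the example:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
    // Arriving data is grouped into micro-batches (RDDs) every 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))

    // Input: a plain TCP socket (e.g. fed by `nc -lk 9999` on localhost)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations: the same map/reduce-style API as a batch job
    val pairs  = lines.flatMap(_.split("\\s+")).map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // Output: print the first results of every batch
    counts.print()

    ssc.start()             // start the continuous job...
    ssc.awaitTermination()  // ...and run until it is terminated
  }
}
```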
As seen in the list above, there is quite a rich choice of inputs. Two of them are particularly interesting: Kafka and Flume.
Apache Kafka is the natural way forward when one needs to deal with huge throughput. Designed by LinkedIn's engineering team, it is capable of handling millions of requests per second and guarantees the order of messages within a partition. It was built with scalability in mind and is one of the easiest sources to integrate with Spark Streaming. It is also worth highlighting that Kafka is very convenient to install and to publish to or subscribe from, providing client bindings for the most popular languages.
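As a sketch of that integration, the snippet below uses the direct stream API from the spark-streaming-kafka-0-10 module; it assumes an existing StreamingContext named `ssc`, a broker on localhost, and a topic called `events`, all of which are illustrative:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Consumer configuration; group.id and offset policy depend on your use case
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-streaming-example",
  "auto.offset.reset"  -> "latest"
)

// Each Kafka partition maps to a partition of the resulting DStream
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("events"), kafkaParams)
)

// Every record exposes its key, value, topic, partition and offset
stream.map(record => record.value).print()
```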
On the other hand, it often happens that stream sources are diverse and come in different formats. In that case, Apache Flume is an intelligent choice. It is a standalone Linux program that takes care of collecting and moving large amounts of data. The concept is quite similar to Spark Streaming, with one exception: its only responsibility is to move data from a source to a sink.
Flume supports various sources, such as command output, spooled directories, Kafka, Syslog, and HTTP. There is also a decent choice of sinks, like HDFS, file roll, Kafka, and Elasticsearch, to name just a few. But the real power of Flume lies in custom sinks and sources: Flume can either push data to Spark directly (through an Avro sink) or, using the custom Spark Sink, buffer the data so that Spark can pull it. There is an impressive list of Flume plugins, both sinks and sources, which lets you connect multiple sources to the Spark Sink for further stream processing.
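For the pull-based variant, the sketch below assumes a Flume agent already configured with the custom Spark Sink and an existing StreamingContext named `ssc`; the hostname and port are placeholders:

```scala
import java.nio.charset.StandardCharsets
import org.apache.spark.streaming.flume.FlumeUtils

// Spark pulls events buffered by the Spark Sink running inside the Flume agent
val flumeStream = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9988)

// Each SparkFlumeEvent wraps an Avro event: headers plus a binary body
flumeStream
  .map(event => new String(event.event.getBody.array(), StandardCharsets.UTF_8))
  .print()
```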
There are a few points to keep in mind when working with streaming as opposed to regular batch processing:
- Stateful operations – each DStream (discretized stream) consists of RDDs delivered in defined intervals. We can process them with pretty much the same API functions as regular Map-Reduce tasks. The only difference is that we start from a blank slate in every interval. To relate to data processed in previous intervals, state functions need to be applied to the stream (mapWithState or updateStateByKey); see the sketch after this list,
- Window operations – transformations can be applied over a sliding window of data,
- Checkpointing – to ensure resilience and fault tolerance, Spark Streaming comes with checkpointing, which is useful for recovering from node failures,
- Machine Learning operations – MLlib can be used in streaming jobs, and there are dedicated streaming algorithms (like Streaming Linear Regression or Streaming KMeans). For other applications, traditional MLlib models already trained on historical data can be applied to streaming data.
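As a rough sketch of the first three points, the snippet below applies a stateful update and a sliding window to a DStream of (word, 1) pairs such as `pairs` from the earlier word-count example; the checkpoint directory, window length, and slide interval are illustrative (the latter two must be multiples of the batch interval):

```scala
import org.apache.spark.streaming.Seconds

// Checkpointing is required for stateful operations; the path is a placeholder
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

// Running total per word across all intervals, carried over as state
val runningCounts = pairs.updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
}

// Counts over a sliding 60-second window, recomputed every 10 seconds
val windowedCounts = pairs.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

runningCounts.print()
windowedCounts.print()
```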
As seen above, Apache Spark is a tremendously powerful way to deal with streaming workloads. It leverages the common API for RDD transformations, which smooths the path to a clean and coherent Lambda Architecture and lets a single Apache Spark setup be used for all kinds of big data processing.