Google BigQuery – querying repeated fields

Google BigQuery is probably one of the best data warehouses on the market today. It has come to dominate the Big Data landscape with its near-unlimited scaling (querying over petabytes of data), ANSI SQL support, and ease of use. It has proven its worth in many use cases.

One of the least used and least appreciated features, in my opinion, is repeated fields. The name doesn’t convey the intent well, so for the sake of simplicity think of it as an array field or nested field. You can define any structure you like inside a repeated field, using the same types that regular columns can have. The important part is to set the mode to REPEATED on a field of type RECORD.
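To make this concrete, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and field names are made up for illustration. It creates a table with a REPEATED RECORD field and then queries it with UNNEST:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

# "items" is a RECORD with mode REPEATED: each row stores an array of (name, quantity) structs
schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField(
        "items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("name", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
        ],
    ),
]
client.create_table(bigquery.Table("my-project.my_dataset.orders", schema=schema))

# Querying a repeated field typically means flattening it with UNNEST
query = """
    SELECT order_id, item.name, item.quantity
    FROM `my-project.my_dataset.orders`, UNNEST(items) AS item
"""
for row in client.query(query):
    print(row.order_id, row.name, row.quantity)
```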

Read more

Quickly ingest initial data to Redis

Imagine you have a massive data pipeline where thousands of requests per second need to read (that’s easy) or write (that’s harder) data. The obvious, and often right, choice is to use Redis to handle all of that.

But what happens when you launch it in production and need some historical data in place, in order to keep consistency? Of course, that data needs to be imported first. There are many ways to achieve that, including writing a custom script. I urge you to have a look at the redis-cli --pipe option, also called Redis Mass Insertion, where you leverage the Redis protocol to ingest a lot of data very quickly (way faster than a custom script that migrates data through a Redis client library).
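As a rough illustration (not taken from the post itself), here is a small Python script that encodes commands in the Redis protocol and writes them to a file; the keys, values, and file name are made up. The file can then be piped into redis-cli --pipe:

```python
def gen_redis_proto(*args):
    """Encode a single Redis command in the Redis protocol (RESP)."""
    proto = f"*{len(args)}\r\n"
    for arg in args:
        arg = str(arg)
        proto += f"${len(arg.encode('utf-8'))}\r\n{arg}\r\n"
    return proto

# newline="" keeps the \r\n sequences untouched on every platform
with open("commands.txt", "w", newline="") as f:
    for i in range(1_000_000):
        f.write(gen_redis_proto("SET", f"key:{i}", f"value:{i}"))

# Then ingest everything at once:
#   cat commands.txt | redis-cli --pipe
```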

Read more

Producing AVRO messages with PHP for Kafka Connect

Apache Kafka has become an obvious choice and industry standard for data streaming. When streaming large amounts of data it’s often reasonable to use the AVRO format, which has at least three advantages:

  • it’s one of the most size-efficient formats (compared to JSON, protobuf, or parquet); an AVRO-serialized payload can be 10 times smaller than its JSON equivalent,
  • it enforces the use of a schema,
  • it works out of the box with Kafka Connect (it’s a requirement if you’d like to use the BigQuery sink connector).

Let’s see how to send data to Kafka in AVRO format from a PHP producer, so that Kafka Connect can parse it and push the data to a sink.
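The post itself walks through this with PHP; purely as an illustration of the same flow, here is a minimal sketch using the confluent-kafka Python client and a Schema Registry, with a made-up topic, schema, and endpoints. The Schema Registry is what lets Kafka Connect’s AVRO converter decode the messages on the other side:

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer

# Hypothetical AVRO schema for the message value
value_schema = """
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "user_id", "type": "long"}
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "key.serializer": StringSerializer("utf_8"),
    # Registers the schema and prefixes each payload with its schema ID,
    # which is the framing Kafka Connect's AVRO converter expects
    "value.serializer": AvroSerializer(schema_registry, value_schema),
})

producer.produce(topic="page_views", key="42", value={"url": "/home", "user_id": 42})
producer.flush()
```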

Read more

Real-time big data processing with Spark Streaming

Big Data has been a trending topic in the IT sector for quite some time. Nowadays vast amounts of data are being produced, especially by web applications (think HTTP logs) and Internet of Things devices.

For such volumes, traditional tools like Relational Database Management Systems are no longer suitable. Terabytes or even petabytes are quite common numbers in a big data context, which is definitely not a capacity that MySQL, PostgreSQL, or any other relational database can handle.

To harness huge amounts of data, Apache Hadoop is generally the first and natural choice, and it’s probably the right one, with one caveat: Apache Hadoop is a great tool for batch processing. It has proven extremely successful for many companies, such as Spotify, whose recommendations, radio, playlist, and similar workloads suit batch processing well. However, it has one downside: you need to wait for your turn. Processing everything usually takes about a day, with jobs scheduled accordingly and executed with failover in place.

But what if we don’t want to, or simply can’t, wait?
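For a taste of what the post covers, here is a minimal PySpark sketch of the classic streaming word count over a socket source; the host, port, and batch interval are arbitrary assumptions, not taken from the post:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Read lines from a TCP socket (e.g. `nc -lk 9999`) and count words in each batch
lines = ssc.socketTextStream("localhost", 9999)
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
word_counts.pprint()

ssc.start()
ssc.awaitTermination()
```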

Read more