Google BigQuery – querying repeated fields
2020-10-25Google BigQuery is probably one of the best data warehouses in the market nowadays. It dominated Big Data landscape with its infinite scaling capabilities (querying over petabytes of data), ANSI SQL support and ease of use. It has proven its worth in many use cases.
One of the least used and least appreciated features, in my opinion, is repeated fields. The name doesn’t indicate well enough the intention, so for a sake of simplicity please consider it as an array field or nested field. You can define any structure inside the repeated field you like, leveraging types of columns which regular columns can be. The important part is to set mode REPEATED for the field of type RECORD.
In below example, you’ll see the table definition and SQL INSERT query, which will populate it.
insert into test.oscars values
('Best Actor', 2019, [
struct('Joaquin Phoenix', 'Arthur Fleck / Joker', 'Joker', true),
struct('Antonio Banderas', 'Salvador Mallo', 'Pain and Glory', false),
/* .. */
]),
('Best Actor', 2018, [
struct('Rami Malek', 'Freddie Mercury', 'Bohemian Rhapsody', true),
struct('Christian Bale', 'Dick Cheney', 'Vice', false),
/* .. */
]));
Now, you can query this table and immediately you’ll see a benefit of it – the records will be flattened. What’s worth noting – if you use BigQuery SDK for any language or even bq
CLI tool, you’ll get 3 records and every one of them will have our nominee field.
Now, bear in mind, that our example is very simple. Imagine that in real-world scenarios, you’ll have 100+ columns in the repeated field, but you only need a few. Also, due to column-oriented nature of underneath storage in BigQuery, it’s always the best to limit the number of columns (see: Control projection – Avoid SELECT *
).
Let’s say, we want to get only an award and year and the actor who has been nominated and the flag for the winner. My first bet was to use:
SELECT award, year nominees.actor, nominees.winner
FROM test.oscars
Wrong! It turned out, it’s not the way how it works. You need to use ARRAY mechanism and UNNEST the nominees in order to subquery it and return a RECORD in the form you need. Here goes:
SELECT
award,
year,
ARRAY(
SELECT STRUCT(actor, winner)
FROM UNNEST(nominees)
ORDER BY winner DESC, actor
) AS nominees
FROM test.oscars
ORDER BY year
By using ARRAY construct you can leverage full query on repeated field which gives you a lot of flexibility. Hope you’ll find it out useful!
Cover photo created by natanaelginting – www.freepik.com