Dynamic Typing in SQL
In this post, we look at Rockset's approach to building the first schemaless SQL data platform.
Querying unstructured data using SQL is a painful process. Moreover, developers frequently prefer dynamic programming languages, so interacting with SQL's strict type system is a barrier.
We at Rockset have built the first schemaless SQL data platform. In this post, we'd like to introduce you to our approach. We'll walk you through our motivations, a few examples, and some interesting technical challenges that we discovered while building our system.
Many of us are fans of the Python programming language. We like its pragmatism, its no-nonsense “There should be one — and preferably only one — obvious way to do it” attitude (The Zen of Python), and, importantly, its simple but powerful type system.
Python is strongly and dynamically typed:
- Strong, because values have one specific type (or None), and values of incompatible types don't automatically convert to each other. Strings are strings, numbers are numbers, booleans are booleans, and they do not mix except in clear, well-defined ways. Contrast with JavaScript, which is weakly typed: JavaScript allows (for example) addition and comparison between numbers and strings, with confusing results.
- Dynamic, because variables acquire type information at runtime, and the same variable can, at different points in time, hold values of different types. a = 5 makes a hold an integer; a subsequent assignment a = 'hello' makes a hold a string. Contrast with Java and C, which are statically typed: variables must be declared, and they may only hold values of the type specified at declaration.
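Both properties are easy to demonstrate at a Python prompt:

```python
# Strong typing: incompatible types refuse to mix implicitly.
try:
    result = "5" + 5
except TypeError as e:
    print(type(e).__name__)   # TypeError

# Dynamic typing: one variable, different types over its lifetime.
a = 5
print(type(a).__name__)       # int
a = 'hello'
print(type(a).__name__)       # str
```

The same "5" + 5 expression in JavaScript would silently produce the string "55", which is exactly the weak-typing behavior contrasted above.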
Of course, no single language falls neatly into one of these categories, but they nevertheless form a useful classification for a high-level understanding of type systems.
Most SQL databases, in contrast, are strongly and statically typed. Values in the same column always have the same type, and the type is defined at the time of table creation and is difficult to modify later.
What's Wrong With SQL's Static Typing?
This impedance mismatch between dynamically typed languages and SQL's static typing has driven development away from SQL databases and towards NoSQL systems. It's easier to build apps on NoSQL systems, especially early on, before the data model stabilizes. Of course, dropping traditional SQL databases means you also tend to lose efficient indexes and the ability to perform complex queries and joins.
Also, modern data sets are often in a semi-structured form (JSON, XML, YAML) and don't follow a well-defined static schema. One often has to build a pre-processing pipeline to determine the correct schema to use, clean up the input data, and transform it to match the schema, and such pipelines are brittle and error-prone.
Even more, SQL doesn't traditionally deal very well with deeply nested data (JSON arrays of arrays of objects containing arrays...). The data pipeline then has to flatten the data or at least the features that need to be accessed quickly. This adds even more complexity to the process.
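As an illustration of that complexity, here is a hypothetical pre-processing step (the flatten helper is ours, not part of any particular pipeline) that forces nested documents into a flat schema; note how arrays simply don't fit and have to be serialized or exploded separately:

```python
import json

def flatten(doc, prefix=""):
    """Recursively turn nested objects into dotted column names so the
    result fits a fixed relational schema (a hypothetical helper)."""
    row = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        elif isinstance(value, list):
            # Arrays have no place in a flat row; serialize them,
            # losing the ability to query inside them efficiently.
            row[name] = json.dumps(value)
        else:
            row[name] = value
    return row

doc = {"name": "Ada", "address": {"zip": "90210"}, "jobs": [{"company": "FB"}]}
print(flatten(doc))
# {'name': 'Ada', 'address.zip': '90210', 'jobs': '[{"company": "FB"}]'}
```

Every schema change upstream forces a change to a helper like this, which is why such pipelines tend to be brittle.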
What's the Alternative?
What if we tried to build a SQL database that is dynamically typed from the ground up without sacrificing any of the power of SQL?
Rockset's data model is similar to JSON: values are either
- scalars (numbers, booleans, strings, etc.)
- arrays, containing any number of arbitrary values
- maps (which, borrowing from JSON, we call “objects”), mapping string keys to arbitrary values
We extend JSON's data model to support other scalar types as well (such as types related to date and time).
Crucially, documents don't have to have the same fields. It's perfectly okay if a field occurs in (say) 10 percent of documents; queries will behave as if that field were NULL in the other 90 percent.
Different documents may have values of different types in the same field. This is important; many real data sets are not clean, and you'll find (for example) ZIP codes that are stored as integers in some parts of the data set and as strings in others. Rockset will let you ingest and query such documents. Depending on the query, values of unexpected types can be ignored, treated as NULL, or reported as errors.
There will be slight performance degradation caused by the dynamic nature of the type system. It is easier to write efficient code if you know that you're processing a large chunk of integers, for instance, rather than having to type-check every value. But, in practice, truly mixed-type data is rare — maybe there will be a few outlier strings in a column of integers, so type-checks can, in practice, be hoisted out of critical code paths. This is, at a high level, similar to what Just-In-Time compilers do for dynamic languages today: yes, variables may change types at runtime, but they usually don't, so it's worth optimizing for the common case.
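To make the idea concrete, here is a sketch of type-check hoisting (the names sum_column and column_type are ours, purely illustrative): when metadata says a column is homogeneous, the hot loop can skip per-value type checks entirely, falling back to a generic path only for mixed data.

```python
def sum_column(values, column_type=None):
    # Fast path: metadata (e.g., gathered at ingest time) says every
    # value is an int, so the loop needs no per-value type dispatch.
    if column_type is int:
        return sum(values)
    # Slow path: mixed or unknown types; check each value individually.
    total = 0
    for v in values:
        if isinstance(v, (int, float)):
            total += v
    return total

print(sum_column([1, 2, 3], column_type=int))   # 6
print(sum_column([1, "oops", 2.5]))             # 3.5
```

A real execution engine would make this decision per block of values rather than per column, but the shape of the optimization is the same.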
Traditional relational databases originated in a time when storage was expensive, so they optimized the representation of every single byte on disk. Thankfully, this is no longer the case, which opens the door to internal representation formats that prioritize features and flexibility over space usage, which we believe to be a worthwhile trade-off.
A Simple Example
I'd like to walk you through a simple example of how you can use dynamic types in Rockset SQL. We'll start with a trivially small dataset, consisting of basic biographical information for six imaginary people, given as a file with one JSON document per line (which is a format that Rockset supports natively):
{"name": "Tudor", "age": 40, "zip": 94542}
{"name": "Lisa", "age": 21, "zip": "91126"}
{"name": "Hana"}
{"name": "Igor", "zip": 94110.0}
{"name": "Venkat", "age": 35, "zip": "94020"}
{"name": "Brenda", "age": 44, "zip": "90210"}
As is often the case with real-world data, this data set is not clean. Some documents are missing certain fields, and the zip code field (which should be a string) is an int for some documents and a float for others.
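As an aside, you can reproduce this type mix with plain Python before ingesting anything. The snippet below tallies (field, type) pairs across the six documents; its counts line up with the describe output shown next (apart from Rockset's internal _meta field).

```python
import json
from collections import Counter

lines = [
    '{"name": "Tudor", "age": 40, "zip": 94542}',
    '{"name": "Lisa", "age": 21, "zip": "91126"}',
    '{"name": "Hana"}',
    '{"name": "Igor", "zip": 94110.0}',
    '{"name": "Venkat", "age": 35, "zip": "94020"}',
    '{"name": "Brenda", "age": 44, "zip": "90210"}',
]

# Count how many documents contain each (field, type) combination.
types = Counter()
for line in lines:
    for field, value in json.loads(line).items():
        types[(field, type(value).__name__)] += 1

for (field, typ), count in sorted(types.items()):
    print(field, typ, count)
```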
Rockset ingests this data set with no problem:
$ rock add tudor_example1 /tmp/example_docs
COLLECTION ID STATUS ERROR
tudor_example1 3e117812-4b50-4e55-b7a6-de03274fc7df-1 ADDED None
tudor_example1 3e117812-4b50-4e55-b7a6-de03274fc7df-2 ADDED None
tudor_example1 3e117812-4b50-4e55-b7a6-de03274fc7df-3 ADDED None
tudor_example1 3e117812-4b50-4e55-b7a6-de03274fc7df-4 ADDED None
tudor_example1 3e117812-4b50-4e55-b7a6-de03274fc7df-5 ADDED None
tudor_example1 3e117812-4b50-4e55-b7a6-de03274fc7df-6 ADDED None
And we can see that it preserved the original types of the fields:
$ rock sql
> describe tudor_example1;
+-----------+---------------+---------+--------+
| field | occurrences | total | type |
|-----------+---------------+---------+--------|
| ['_meta'] | 6 | 6 | object |
| ['age'] | 4 | 6 | int |
| ['name'] | 6 | 6 | string |
| ['zip'] | 1 | 6 | float |
| ['zip'] | 1 | 6 | int |
| ['zip'] | 3 | 6 | string |
+-----------+---------------+---------+--------+
Note that the zip field exists in 5 out of the 6 documents and is a float in one document, an int in another, and a string in the other three.
Rockset treats the documents where the zip field does not exist as if the field were NULL:
> select name, zip from tudor_example1;
+--------+---------+
| name | zip |
|--------+---------|
| Brenda | 90210 |
| Lisa | 91126 |
| Venkat | 94020 |
| Tudor | 94542 |
| Hana | <null> |
| Igor | 94110.0 |
+--------+---------+
> select name from tudor_example1 where zip is null;
+--------+
| name |
|--------|
| Hana |
+--------+
And Rockset supports a variety of cast and type-introspection functions that let you query across types:
> select name, zip, typeof(zip) as type from tudor_example1
where typeof(zip) <> 'string';
+--------+--------+---------+
| name | type | zip |
|--------+--------+---------|
| Igor | float | 94110.0 |
| Tudor | int | 94542 |
+--------+--------+---------+
> select name, zip::string as zip_str from tudor_example1;
+--------+-----------+
| name | zip_str |
|--------+-----------|
| Hana | <null> |
| Venkat | 94020 |
| Tudor | 94542 |
| Igor | 94110 |
| Lisa | 91126 |
| Brenda | 90210 |
+--------+-----------+
> select name, zip::string zip from tudor_example1
where zip::string = '94542';
+--------+-------+
| name | zip |
|--------+-------|
| Tudor | 94542 |
+--------+-------+
Querying Nested Data
Rockset also enables you to query deeply nested data efficiently by treating nested arrays as top-level tables and letting you use full SQL syntax to query them.
Let's augment the same dataset, and add information about where these people work:
{"name": "Tudor", "age": 40, "zip": 94542, "jobs": [{"company":"FB", "start":2009}, {"company":"Rockset", "start":2016}] }
{"name": "Lisa", "age": 21, "zip": "91126"}
{"name": "Hana"}
{"name": "Igor", "zip": 94110.0, "jobs": [{"company":"FB", "start":2013}]}
{"name": "Venkat", "age": 35, "zip": "94020", "jobs": [{"company": "ORCL", "start": 2000}, {"company":"Rockset", "start":2016}]}
{"name": "Brenda", "age": 44, "zip": "90210"}
Add the documents to a new collection:
$ rock add tudor_example2 /tmp/example_docs
COLLECTION ID STATUS ERROR
tudor_example2 a176b351-9797-4ea1-9869-1ec6205b7788-1 ADDED None
tudor_example2 a176b351-9797-4ea1-9869-1ec6205b7788-2 ADDED None
tudor_example2 a176b351-9797-4ea1-9869-1ec6205b7788-3 ADDED None
tudor_example2 a176b351-9797-4ea1-9869-1ec6205b7788-4 ADDED None
tudor_example2 a176b351-9797-4ea1-9869-1ec6205b7788-5 ADDED None
tudor_example2 a176b351-9797-4ea1-9869-1ec6205b7788-6 ADDED None
We support the semi-standard UNNEST SQL table function, which can be used in a join or subquery to “explode” an array field:
> select p.name, j.company, j.start from
tudor_example2 p cross join unnest(p.jobs) j
order by j.start, p.name;
+-----------+--------+---------+
| company | name | start |
|-----------+--------+---------|
| ORCL | Venkat | 2000 |
| FB | Tudor | 2009 |
| FB | Igor | 2013 |
| Rockset | Tudor | 2016 |
| Rockset | Venkat | 2016 |
+-----------+--------+---------+
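If you'd like to experiment with this "explode" pattern without Rockset, SQLite's json_each table-valued function (available when SQLite is built with the JSON1 extension, as most modern builds are) approximates UNNEST over documents stored as JSON text. This is an illustrative sketch, not Rockset's implementation:

```python
import json
import sqlite3

# Two of the sample documents, stored as raw JSON text in a single column.
docs = [
    {"name": "Tudor", "jobs": [{"company": "FB", "start": 2009},
                               {"company": "Rockset", "start": 2016}]},
    {"name": "Igor", "jobs": [{"company": "FB", "start": 2013}]},
]

conn = sqlite3.connect(":memory:")
conn.execute("create table people (doc text)")
conn.executemany("insert into people values (?)",
                 [(json.dumps(d),) for d in docs])

# json_each plays the role of UNNEST: one output row per array element.
rows = conn.execute("""
    select json_extract(p.doc, '$.name')      as name,
           json_extract(j.value, '$.company') as company,
           json_extract(j.value, '$.start')   as start
    from people p, json_each(p.doc, '$.jobs') j
    order by start, name
""").fetchall()
print(rows)
```

The key difference is ergonomic: here the nesting has to be navigated through explicit JSON path expressions, whereas Rockset lets you address nested fields with ordinary SQL identifiers.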
Testing for existence can be done with the usual semijoin (IN / EXISTS subquery) syntax. Our optimizer recognizes that you are querying a nested field of the same collection and is able to execute the query efficiently. Let's get the list of people who worked at Facebook:
> select name from tudor_example2
where 'FB' in (select company from unnest(jobs) j);
+--------+
| name |
|--------|
| Tudor |
| Igor |
+--------+
If you only care about nested arrays (but don't need to correlate with the parent collection), we have special syntax for this; any nested array of objects can be exposed as a top-level table:
> select * from tudor_example2.jobs j;
+-----------+---------+
| company | start |
|-----------+---------|
| ORCL | 2000 |
| Rockset | 2016 |
| FB | 2009 |
| Rockset | 2016 |
| FB | 2013 |
+-----------+---------+
I hope you can see the benefit of Rockset's ability to ingest raw data, without any preparation or schema modeling, and still power efficient, strongly typed SQL queries.
Published at DZone with permission of Tudor Bosman, DZone MVB. See the original article here.