Databricks: An Understanding Inside the Warehouse
This article presents an overview of Databricks, with helpful links and explanations based on the author's experience with the platform.
Below is a summarized write-up of Databricks and my understanding of it. There are many different data warehouses on the market, but here we are going to focus on Databricks alone.
Databricks is conceptually similar to a data catalog backed by a Hive metastore. Your data resides in S3 rather than in a storage database living on a local HDD or SSD.
Once the data is in S3, the process is similar to the data catalog in AWS Glue, where crawlers read the data and make it ready for users.
The source data can be in any format, but it is stored internally in Parquet format.
The data stays in your S3 bucket, and on top of it sits Unity Catalog, which applies fine-grained governance to your data before it is consumed.
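To illustrate what that fine-grained governance can look like, here is a minimal sketch of granting and revoking table-level privileges through Unity Catalog; the catalog, schema, table, and group names are hypothetical.
# Minimal Unity Catalog governance sketch (run in a Databricks notebook, where `spark` is predefined).
# The main.sales.orders table and the `data_analysts` group are hypothetical placeholders.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")     # allow reads on one table
spark.sql("REVOKE MODIFY ON TABLE main.sales.orders FROM `data_analysts`")  # disallow writes on the same table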
Data Loading
One interesting feature that I liked in Databricks is Auto Loader. Incremental file loading is a common practice in any OLTP or OLAP database, but the small difference here is that the file can be in any format; after understanding the data structure, Auto Loader loads it into Parquet format. Let's say you have a CSV with four rows and four columns: the data will be loaded into Databricks in Parquet format as soon as Auto Loader identifies a file in the specified location.
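Here is a minimal sketch of that flow, assuming CSV files landing in S3; the bucket paths and target table name are hypothetical placeholders.
# Auto Loader sketch (run in a Databricks notebook, where `spark` is predefined).
# All S3 paths and the bronze.orders table are hypothetical placeholders.
df = (
    spark.readStream
        .format("cloudFiles")                                                # Auto Loader source
        .option("cloudFiles.format", "csv")                                  # incoming files are CSV
        .option("cloudFiles.schemaLocation", "s3://my-bucket/schemas/orders/")  # where the inferred schema is tracked
        .load("s3://my-bucket/landing/orders/")                              # monitored landing location
)

(
    df.writeStream
      .option("checkpointLocation", "s3://my-bucket/checkpoints/orders/")    # progress tracking
      .trigger(availableNow=True)                                            # process files already landed, then stop
      .toTable("bronze.orders")                                              # written as a Delta (Parquet-backed) table
)
When run on a schedule, this picks up only the files that arrived since the last run, which mirrors the behavior described above.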
There are many other ways to load data, such as dbt, and we can also use Glue (if you are using AWS); you can read more on Glue with Delta Lake.
Another way to load data, similar to other data warehouses, is COPY INTO. Using a SQL query, you can just give the source path and copy the files into a Delta table by name.
You can also play around using SQL, Python, R, and Scala.
-- my_table and the source path are placeholders; FILEFORMAT takes a concrete format such as CSV, JSON, or PARQUET.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
Reference SQL Source: COPY INTO
The data is stored in Delta Lake in three tiers: Bronze, Silver, and Gold; a minimal sketch of the flow follows the list.
Bronze: Raw data ingestion and any historical data.
Silver: Cleansed, filtered, or otherwise augmented data.
Gold: Business-level aggregations.
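The sketch below shows one way data might move through the tiers; the table and column names (orders, order_id, amount, and so on) are hypothetical, and the cleansing and aggregation rules are only illustrative.
# Hypothetical Bronze -> Silver -> Gold flow (run in a Databricks notebook, where `spark` is predefined).
bronze = spark.read.table("bronze.orders")                       # Bronze: raw ingested data

silver = (
    bronze.dropDuplicates(["order_id"])                          # Silver: remove duplicate rows
          .filter("order_status IS NOT NULL")                    # and drop incomplete records
)
silver.write.mode("overwrite").saveAsTable("silver.orders")

gold = (
    spark.read.table("silver.orders")
         .groupBy("order_date")
         .agg({"amount": "sum"})                                 # Gold: business aggregation per day
)
gold.write.mode("overwrite").saveAsTable("gold.daily_order_totals")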
Instance Types
There are two ways you can spin up Databricks compute: through serverless or through on-demand instances. These on-demand instances can be Photon-enabled and run on Graviton or other instance types.
It's easy to spin up one on the Databricks page: Databricks on AWS.
You can also calculate your instance pricing on the Pricing Calculator page.
IDE Setups
IDEs are important from a developer's perspective: they shape how your teams collaborate, run code, and commit it to the relevant code repositories.
There are a couple of options: notebooks, which development teams can collaborate in, the SQL Editor, or one of the many extensions available for popular IDEs. The one I liked was the VS Code plugin.
Conclusion
Databricks provides a data warehouse ecosystem that reads data directly from S3, so there is no need for a separate storage layer. The combination of ingestion, a data/AI platform, and data warehousing is the Databricks Lakehouse.
Above is my understanding of how Databricks works, based on my initial knowledge. I will keep adding more details to these blogs. Please share your experience with Databricks.
Links are given at regular checkpoints.