Exploring CockroachDB With Jupyter Notebook and Python
Today, we're going to explore CockroachDB from the Data Science perspective, using a popular exploratory web tool called Jupyter Notebook and Python language.
Join the DZone community and get the full member experience.
Join For FreeThis is the next installment in the series of tutorials on CockroachDB and Docker Compose.
I was inspired to write this post based on this article. The article goes over using Jupyter with Oracle, MySql, and Postgresql, we're going to do the same with Cockroach! One caveat here is the heavy reliance on ipython-sql
library. We're going to use Pandas library as the ipython-sql
magic functions are not compatible with Cockroach today. Hopefully, you will find it useful.
You can find the older posts here: Part 1, Part 2, Part 3, Part 4.
- Information on CockroachDB can be found here.
- Information on Jupyter Notebook can be found here.
We're going to continue using our trusty docker-compose
. Jupyter project publishes its images on Docker Hub. We're going to use the minimal image for this tutorial.
Here's my compose file:
version: '3.9' services: crdb: image: cockroachdb/cockroach:v21.2.3 container_name: crdb-1 ports: - "26257:26257" - "8080:8080" command: start-single-node --insecure volumes: - ${PWD}/cockroach-data/crdb:/cockroach/cockroach-data:rw jupyter: image: jupyter/minimal-notebook container_name: jupyter environment: - GRANT_SUDO=yes ports: - "8888:8888" volumes: - $PWD:/home/jovyan/work
If you've been following my posts, you may notice that I switched to CockroacihDB version 19.2.3. For the Jupyter environment, I'm passing GRANT_SUDO=yes
as an environment variable in case, we need to install some additional software inside the container. I'm also exposing the notebook at port 8888
.
- Open Jupyter notebook
When docker-compose up
is executed, Jupyter will output an access token to the Jupyter notebook.
To access the notebook, open this file in a browser: file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html Or copy and paste one of these URLs: http://1b8009ed3252:8888/?token=37a818efee156c6e3b2d3bd21db07827ce250560a336987b or http://127.0.0.1:8888/?token=37a818efee156c6e3b2d3bd21db07827ce250560a336987b
You can also access this information with the `docker logs ' command. Copy one of the provided URLs and paste it into a browser.
- Once you're in the notebook, you will notice a
work
directory. Clicking into it will expose all of the files in your host's current directory.
This behavior can be explained by the following volume mapping.
volumes: - $PWD:/home/jovyan/work
Let's create a new notebook by clicking New
and selecting Python 3
. You can click on the title of the notebook and name it to your liking.
- Install prerequisite software
The beauty of Jupyter's environment is its ability to adapt to different scenarios. In the next step, we're going to install Python libraries using pip
from within the notebook!
!pip install sqlalchemy cockroachdb pandas psycopg2-binary matplotlib
NOTE: Since the release of this post, cockroachdb
adapter for sqlalchemy
has been renamed to sqlalchemy-cockroachdb
.
Then click on the Run
icon to execute.
Before we go to the next step, let's switch to Cockroach and load some data.
- Initialize a workload.
docker exec -it crdb-1 ./cockroach workload init movr
This will initialize a workload for our application. We've chosen a sample movr application.
10:18 $ docker exec -it crdb-1 ./cockroach workload init movr I200211 15:18:30.420866 1 workload/workloadsql/dataload.go:135 imported users (0s, 50 rows) I200211 15:18:30.431182 1 workload/workloadsql/dataload.go:135 imported vehicles (0s, 15 rows) I200211 15:18:30.468309 1 workload/workloadsql/dataload.go:135 imported rides (0s, 500 rows) I200211 15:18:30.498384 1 workload/workloadsql/dataload.go:135 imported vehicle_location_histories (0s, 1000 rows) I200211 15:18:30.528526 1 workload/workloadsql/dataload.go:135 imported promo_codes (0s, 1000 rows)
Now we're ready to populate the database:
docker exec -it crdb-1 ./cockroach workload run movr --duration=1m
The workload will run for 1m, specified by the --duration
parameter.
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 59.9s 0 439 7.3 1.9 1.9 2.4 2.5 4.5 addUser _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 59.9s 0 124 2.1 3.6 3.7 4.5 4.5 5.0 addVehicle ...
There are different workload generators available with Cockroach, feel free to explore them here. When load completes, you can log in to the database and poke around in the database.
docker exec -it crdb-1 ./cockroach sql --insecure
root@:26257/defaultdb> show databases; database_name +---------------+ bank defaultdb movr postgres system (5 rows) Time: 2.847ms root@:26257/defaultdb> use movr; SET Time: 856.8µs root@:26257/movr> show tables; table_name +----------------------------+ promo_codes rides user_promo_codes users vehicle_location_histories vehicles (6 rows) Time: 2.459ms root@:26257/movr>
While we're there, let's also add a user with access to our database.
- Create a user and grant access
CREATE USER IF NOT EXISTS maxroach; GRANT ALL ON DATABASE movr TO maxroach; GRANT ALL ON TABLE movr.* TO maxroach; \q
We can test access for users maxroach
with:
docker exec -it crdb-1 ./cockroach sql --insecure --user=maxroach --database=movr
- Access the database in Jupyter.
import pandas as pd from sqlalchemy.engine import create_engine engine = create_engine('cockroachdb://maxroach@crdb-1:26257/movr') df = pd.read_sql('select count(*) from vehicles', engine)
If you're familiar with Python, sqlalchemy
is a popular Python framework for working with databases. Feel free to explore our docs for more in-depth examples.
Once a database connection is established, we use pandas
data frame to read the results of the select
statement. Click on Run
to execute. Nothing happened!
Because we stored the result in a variable called df
, we need to use the methods provided by the library.
Type the following and hit Run
.
df.head()
The good news, the result matches the total count of vehicles in the database!
Let's try another query:
df = pd.read_sql('select city, count(city) from vehicles group by city', engine) print(df)
- Use
matplotlib
to graph a plot
Jupyter is also capable of graphing your results, here's a simple graph chart of the data frame.
df.plot(x ='city', y='count', kind = 'bar')
- Graph a pie chart.
df.plot.pie(y='count',figsize=(5, 5),autopct='%1.1f%%', startangle=90)
This is just a taste of the powerful Jupyter Notebook and CockroachDB Python capabilities. Feel free to visit the docs for more.
Please share your feedback in the comments.
Published at DZone with permission of Artem Ervits. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments