The Power of ShardingSphere With Spring Boot
Learn how to use Apache ShardingSphere with Spring Boot to enable sharding, read-write splitting, and data masking.
I have developed a small-scale application that concentrates on a straightforward business scenario of customer reviews for the purpose of illustrating various use cases. This application is implemented using the Spring Boot framework and communicates with a MySQL database via Spring Data JPA, with the code being written in Kotlin. It exposes a simple REST API featuring CRUD operations on reviews.
Spoiler alert: The use cases illustrated in this article are intentionally simplistic, intended solely to showcase the integration with ShardingSphere functionalities; the problems discussed here can be solved in various, maybe even better, ways, so don't spend too much time thinking "why." So, without further ado, let's dive into the code.
Here is the main entity:
@Entity
@Table(name = "reviews")
data class Review(
var text: String,
var author: String,
@Column(name = "author_telephone")
var authorTelephone: String? = null,
@Column(name = "author_email")
var authorEmail: String? = null,
@Column(name = "invoice_code")
var invoiceCode: String? = null,
@Column(name = "course_id")
var courseId: Int? = null
) : AbstractEntity()
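The AbstractEntity base class isn't listed in the article. Judging by the columns that appear in the SQL logs later (id, created_at, last_modified_at), a plausible sketch looks like this; it is my assumption, not the actual class from the repo, and on Spring Boot 3 the imports would be jakarta.persistence instead of javax.persistence:

```kotlin
import java.time.LocalDateTime
import javax.persistence.Column
import javax.persistence.GeneratedValue
import javax.persistence.Id
import javax.persistence.MappedSuperclass

// Hypothetical base class holding the fields seen in the generated SQL:
// id, created_at, and last_modified_at.
@MappedSuperclass
abstract class AbstractEntity(
    @Id
    @GeneratedValue
    var id: Int? = null,

    @Column(name = "created_at")
    var createdAt: LocalDateTime = LocalDateTime.now(),

    @Column(name = "last_modified_at")
    var lastModifiedAt: LocalDateTime = LocalDateTime.now()
)
```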
And we have the following REST API:
@RestController
@RequestMapping("/api/v1/reviews")
class ReviewController(val reviewService: ReviewService) {
@GetMapping("/filter", params = ["author"])
fun getReviewsByAuthor(@RequestParam("author") author: String): ResponseEntity<List<Review>> {
val reviews = reviewService.findAllByAuthor(author)
return ResponseEntity.ok(reviews)
}
@GetMapping("/filter", params = ["courseId"])
fun getReviewsByCourseId(@RequestParam("courseId") courseId: Int): ResponseEntity<List<Review>> {
val reviews = reviewService.findAllByCourseId(courseId)
return ResponseEntity.ok(reviews)
}
@GetMapping("/{id}")
fun getReviewById(@PathVariable("id") id: Int): ResponseEntity<Review> {
val review = reviewService.findById(id)
return ResponseEntity.ok(review)
}
@PostMapping
fun createReview(@RequestBody review: Review): ResponseEntity<Review> {
val savedReview = reviewService.save(review)
return ResponseEntity.status(HttpStatus.CREATED).body(savedReview)
}
@DeleteMapping("/{id}")
fun deleteReviewById(@PathVariable("id") id: Int): ResponseEntity<Unit> {
reviewService.deleteById(id)
return ResponseEntity.noContent().build()
}
}
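The ReviewService itself isn't shown in the article either; with Spring Data JPA it is presumably a thin wrapper around a repository with derived query methods. A sketch consistent with the controller's calls (the repository name and method bodies are my assumptions) could be:

```kotlin
import org.springframework.data.jpa.repository.JpaRepository
import org.springframework.stereotype.Service

// Hypothetical repository: Spring Data derives the SQL from the method names.
interface ReviewRepository : JpaRepository<Review, Int> {
    fun findAllByAuthor(author: String): List<Review>
    fun findAllByCourseId(courseId: Int): List<Review>
}

// Hypothetical service delegating to the repository.
@Service
class ReviewService(val reviewRepository: ReviewRepository) {
    fun findAllByAuthor(author: String) = reviewRepository.findAllByAuthor(author)
    fun findAllByCourseId(courseId: Int) = reviewRepository.findAllByCourseId(courseId)
    fun findById(id: Int): Review = reviewRepository.findById(id).orElseThrow()
    fun save(review: Review): Review = reviewRepository.save(review)
    fun deleteById(id: Int) = reviewRepository.deleteById(id)
}
```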
Here is the MySQL container from docker-compose.yml:
mysql-master:
image: 'bitnami/mysql:latest'
ports:
- '3306:3306'
volumes:
- 'mysql_master_data:/bitnami/mysql/data'
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
environment:
- MYSQL_USER=my_user
- MYSQL_PASSWORD=my_password
- MYSQL_DATABASE=reviews-db
Note: The init.sql file contains the DDL for the reviews table.
Now let's execute some HTTP requests, such as creating a review (all the requests mentioned here are in requests.http):
POST http://localhost:8070/api/v1/reviews/
Content-Type: application/json
{
"text": "This is a great course!",
"author": "John Doe",
"authorTelephone": "555-1234",
"authorEmail": "johndoe@example.com",
"invoiceCode": "ABC123",
"courseId": 123
}
We'll observe the following query (p6spy enabled):
INFO 16784 --- [nio-8070-exec-1] p6spy : #1681730984533 | took 4ms | statement | connection 2| url jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false
insert into reviews (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values ('2023-04-17T14:29:44.450+0300', '2023-04-17T14:29:44.450+0300', 'John Doe', 'johndoe@example.com', '555-1234', 123, 'ABC123', 'This is a great course!', 2);
Data Sharding
The reviews application discussed above involves a database storing a large number of reviews for courses. As the number of courses and reviews grows, the reviews table in the database can become very large, making it difficult to manage and slowing down performance. To address this issue, we can implement data sharding, which involves breaking up the reviews table into smaller, more manageable pieces called shards. Each shard contains a subset of the data in the reviews table and is responsible for a specific range of data based on a shard key, such as the course id.
Table sharding can help the reviews application manage its growing reviews table more effectively, improving performance and scalability while also making backups and maintenance tasks easier to manage.
But, there is always a but: implementing table sharding manually can be a complex and challenging task. It requires a deep understanding of database design and architecture, as well as knowledge of the specific sharding implementation being used, and many challenges can arise when implementing table sharding in a Spring Boot application.
It is time we met ShardingSphere:
Apache ShardingSphere is an ecosystem to transform any database into a distributed database system, and enhance it with sharding, elastic scaling, encryption features and more.
Apache ShardingSphere comes in two flavors:
- ShardingSphere-JDBC is a lightweight Java framework that provides additional services at Java's JDBC layer.
- ShardingSphere-Proxy is a transparent database proxy, providing a database server that encapsulates database binary protocol to support heterogeneous languages.
ShardingSphere provides a range of features, including data sharding, distributed transaction, read/write splitting, high availability, data migration, query federation, data encryption, and shadow database for full-link online load testing scenarios, to help manage large volumes of data and ensure data integrity and security.
For now, we'll focus on ShardingSphere-JDBC's data sharding, and we'll use the following dependencies:
implementation("org.apache.shardingsphere:shardingsphere-jdbc-core:5.3.2")
implementation("org.apache.shardingsphere:shardingsphere-cluster-mode-core:5.3.2")
implementation("org.apache.shardingsphere:shardingsphere-cluster-mode-repository-zookeeper:5.3.2")
implementation("org.apache.shardingsphere:shardingsphere-cluster-mode-repository-api:5.3.2")
Note: The ShardingSphere team shipped Spring Boot starters in the 5.1.x versions, but they moved away from starters for the sake of consistency across the project and now recommend the latest (non-starter) version, which is configured a bit differently but is still fairly simple. In my repo, you can find examples of the Spring Boot starter configuration in earlier commits, too.
ShardingSphere-JDBC can be configured mainly in two ways: YAML configuration and Java configuration. I picked the YAML configuration for this article.
So at the moment, my datasource configuration from application.yaml looks like this:
spring:
datasource:
username: my_user
password: my_password
url: jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false
tomcat:
validation-query: "SELECT 1"
test-while-idle: true
jpa:
properties:
hibernate:
dialect: org.hibernate.dialect.MySQL8Dialect
open-in-view: false
hibernate:
ddl-auto: none
To enable ShardingSphere JDBC, we'll have to make it look like this:
spring:
datasource:
driver-class-name: org.apache.shardingsphere.driver.ShardingSphereDriver
url: jdbc:shardingsphere:classpath:sharding.yaml
jpa:
properties:
hibernate:
dialect: org.hibernate.dialect.MySQL8Dialect
open-in-view: false
hibernate:
ddl-auto: none
We specified ShardingSphereDriver as the driver for the data source, and the url tells it to load its configuration from sharding.yaml on the classpath. Okay, pretty simple, right? Let's continue and create the sharding.yaml:
dataSources:
master:
dataSourceClassName: com.zaxxer.hikari.HikariDataSource
    driverClassName: com.mysql.cj.jdbc.Driver
jdbcUrl: jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false
username: my_user
password: my_password
connectionTimeoutMilliseconds: 30000
idleTimeoutMilliseconds: 60000
maxLifetimeMilliseconds: 1800000
maxPoolSize: 65
minPoolSize: 1
mode:
type: Standalone
repository:
type: JDBC
rules:
- !SHARDING
tables:
reviews:
actualDataNodes: master.reviews_$->{0..1}
tableStrategy:
standard:
shardingColumn: course_id
shardingAlgorithmName: inline
shardingAlgorithms:
inline:
type: INLINE
props:
algorithm-expression: reviews_$->{course_id % 2}
allow-range-query-with-inline-sharding: true
props:
proxy-hint-enabled: true
sql-show: true
Now let's analyze the most important properties:
- dataSources.master – here lies the definition of our master data source.
- mode – which can be either Standalone with the JDBC repository type or Cluster with the ZooKeeper type (recommended for production); this is used for configuration information persistence.
- rules – here we can enable various ShardingSphere features, like !SHARDING.
- tables.reviews – here we describe the actual tables based on the inline syntax rules, meaning that we'll have two tables, reviews_0 and reviews_1, sharded by the course_id column.
- shardingAlgorithms – here we describe the manual inline sharding algorithm via a Groovy expression, telling ShardingSphere that the reviews table is divided into two tables based on the course_id column.
- props – here we enabled logging the actual SQL queries (p6spy can be disabled/commented out).
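To make the inline expression concrete, here is a tiny Kotlin sketch (an illustration, not ShardingSphere code) of what reviews_$->{course_id % 2} evaluates to for a given course id:

```kotlin
// Illustrative only: mimics the Groovy inline expression
// reviews_$->{course_id % 2} used by the INLINE sharding algorithm.
fun targetTable(courseId: Int): String = "reviews_${courseId % 2}"

fun main() {
    println(targetTable(123)) // odd course id -> reviews_1
    println(targetTable(124)) // even course id -> reviews_0
}
```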
Important: Before starting our application, we need to make sure that our defined shards exist, so I created two tables in my database, reviews_0 and reviews_1 (see init.sql).
Now we are ready to start our application and do some requests:
### POST a new review
POST http://localhost:8070/api/v1/reviews/
Content-Type: application/json
{
"text": "This is a great course!",
"author": "John Doe",
"authorTelephone": "555-1234",
"authorEmail": "johndoe@example.com",
"invoiceCode": "ABC123",
"courseId": 123
}
We can see the following log:
INFO 35412 --- [nio-8070-exec-2] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_1 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 15:42:01.8069745, 2023-04-17 15:42:01.8069745, John Doe, johndoe@example.com, 555-1234, 123, ABC123, This is a great course!, 4]
If we are to execute one more request with a different payload:
INFO 35412 --- [nio-8070-exec-8] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_1 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 15:43:47.3267788, 2023-04-17 15:43:47.3267788, Mike Scott, mikescott@example.com, 555-1234, 123, ABC123, This is an amazing course!, 5]
We can notice that both our reviews, placed by Mike and John, went into the reviews_1 table. What if we change the course_id to 124 and execute the same POST request again?
INFO 35412 --- [nio-8070-exec-4] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_0 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 15:44:42.7133688, 2023-04-17 15:44:42.7133688, Mike Scott, mikescott@example.com, 555-1234, 124, ABC123, This is an amazing course!, 6]
We can see that our new review got saved in the reviews_0 table.
Now we can execute two GET requests based on the course_id:
### GET reviews by course ID
GET http://localhost:8070/api/v1/reviews/filter?courseId=123
GET http://localhost:8070/api/v1/reviews/filter?courseId=124
And observe in the logs how routing between our two tables took place.
INFO 35412 --- [nio-8070-exec-9] ShardingSphere-SQL: Actual SQL: master ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_1 review0_ where review0_.course_id=? ::: [123]
INFO 35412 --- [nio-8070-exec-5] ShardingSphere-SQL: Actual SQL: master ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_0 review0_ where review0_.course_id=? ::: [124]
The first select was directed to the reviews_1 table and the second one to reviews_0. Sharding in action!
Read-Write Splitting
Now let's imagine another problem: the reviews application may experience high load during peak hours, leading to slow response times and a degraded user experience. To address this issue, we can implement read/write splitting to balance the load and improve performance.
And how lucky we are that ShardingSphere offers us a read/write splitting solution. Read-write splitting involves directing read queries to replica databases and write queries to a master database, ensuring that read requests do not interfere with write requests and that database performance is optimized.
Before configuring read-write splitting, we'll have to make some changes to our docker-compose in order to have some replicas for our master DB (credits to Bitnami for providing this):
mysql-master:
image: 'bitnami/mysql:latest'
ports:
- '3306:3306'
volumes:
- 'mysql_master_data:/bitnami/mysql/data'
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
environment:
- MYSQL_REPLICATION_MODE=master
- MYSQL_REPLICATION_USER=repl_user
- MYSQL_REPLICATION_PASSWORD=repl_password
- MYSQL_ROOT_PASSWORD=master_root_password
- MYSQL_USER=my_user
- MYSQL_PASSWORD=my_password
- MYSQL_DATABASE=reviews-db
mysql-slave:
image: 'bitnami/mysql:latest'
ports:
- '3306'
depends_on:
- mysql-master
environment:
- MYSQL_USER=my_user
- MYSQL_PASSWORD=my_password
- MYSQL_REPLICATION_MODE=slave
- MYSQL_REPLICATION_USER=repl_user
- MYSQL_REPLICATION_PASSWORD=repl_password
- MYSQL_MASTER_HOST=mysql-master
- MYSQL_MASTER_PORT_NUMBER=3306
- MYSQL_MASTER_ROOT_PASSWORD=master_root_password
Let's start our containers like this (one master and two replicas):
docker-compose up --detach --scale mysql-master=1 --scale mysql-slave=2
Now we need the mapped ports for our slaves.
$ docker port infra-mysql-slave-1
3306/tcp -> 0.0.0.0:49923
$ docker port infra-mysql-slave-2
3306/tcp -> 0.0.0.0:49922
Okay, now that we have our master and its replicas in place, we are ready to configure two new dataSources like this:
dataSources:
master:
dataSourceClassName: com.zaxxer.hikari.HikariDataSource
    driverClassName: com.mysql.cj.jdbc.Driver
jdbcUrl: jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false
username: my_user
password: my_password
connectionTimeoutMilliseconds: 30000
idleTimeoutMilliseconds: 60000
maxLifetimeMilliseconds: 1800000
maxPoolSize: 65
minPoolSize: 1
slave0:
dataSourceClassName: com.zaxxer.hikari.HikariDataSource
    driverClassName: com.mysql.cj.jdbc.Driver
jdbcUrl: jdbc:mysql://localhost:49922/reviews-db?allowPublicKeyRetrieval=true&useSSL=false
username: my_user
password: my_password
connectionTimeoutMilliseconds: 30000
idleTimeoutMilliseconds: 60000
maxLifetimeMilliseconds: 1800000
maxPoolSize: 65
minPoolSize: 1
slave1:
dataSourceClassName: com.zaxxer.hikari.HikariDataSource
    driverClassName: com.mysql.cj.jdbc.Driver
jdbcUrl: jdbc:mysql://localhost:49923/reviews-db?allowPublicKeyRetrieval=true&useSSL=false
username: my_user
password: my_password
connectionTimeoutMilliseconds: 30000
idleTimeoutMilliseconds: 60000
maxLifetimeMilliseconds: 1800000
maxPoolSize: 65
minPoolSize: 1
And then, we can add the read-write splitting rule to the rules:
- !READWRITE_SPLITTING
dataSources:
readwrite_ds:
staticStrategy:
writeDataSourceName: master
readDataSourceNames:
- slave0
- slave1
loadBalancerName: readwrite-load-balancer
loadBalancers:
readwrite-load-balancer:
type: ROUND_ROBIN
Here I think everything is self-explanatory: we specified the write data source name to be master and the read data source names to point to our slaves, slave0 and slave1; and we picked the ROUND_ROBIN load-balancing algorithm.
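For intuition, a ROUND_ROBIN load balancer simply cycles through the configured read replicas. A minimal Kotlin sketch of the idea (my own illustration, not ShardingSphere's implementation) might look like this:

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Illustrative round-robin selection over the configured read data sources.
class RoundRobinBalancer(private val readDataSources: List<String>) {
    private val counter = AtomicLong(0)

    fun next(): String {
        // Each call advances the counter and wraps around the replica list.
        val index = (counter.getAndIncrement() % readDataSources.size).toInt()
        return readDataSources[index]
    }
}

fun main() {
    val balancer = RoundRobinBalancer(listOf("slave0", "slave1"))
    println(balancer.next()) // slave0
    println(balancer.next()) // slave1
    println(balancer.next()) // slave0 again
}
```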
Important: One last change to be made is regarding the sharding rule, which knows nothing about the newly configured read-write splitting rule and points directly to the master:
- !SHARDING
tables:
reviews:
actualDataNodes: readwrite_ds.reviews_$->{0..1}
Now our sharding rule will be wrapped by the read-write splitting rule, too, and the data source decision will be made before picking the correct table (pay attention to readwrite_ds.reviews_$->{0..1}).
Okay, we can start our application, run the same POST request and observe the logs:
INFO 22860 --- [nio-8070-exec-1] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_0 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 16:12:07.25473, 2023-04-17 16:12:07.25473, Mike Scott, mikescott@example.com, 555-1234, 124, ABC123, This is an amazing course!, 7]
Nothing surprising here; the sharding still works, and the query took place in the master data source (the write data source). But if we run a couple of GET requests, we'll observe the following:
INFO 22860 --- [nio-8070-exec-2] ShardingSphere-SQL: Actual SQL: slave0 ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_0 review0_ where review0_.course_id=? ::: [124]
INFO 22860 --- [nio-8070-exec-4] ShardingSphere-SQL: Actual SQL: slave1 ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_0 review0_ where review0_.course_id=? ::: [124]
You can observe read-write splitting in action: our write queries take place in the master data source, but our read queries take place in the master's replicas (slave0 and slave1), all while maintaining the correct sharding rule.
Data Masking
Off to another imaginary problem regarding our application. Imagine that sensitive information such as customer email, phone number, and invoice code may need to be accessed by certain users or applications while remaining hidden from others due to data privacy regulations.
To address this issue, we could implement a data masking solution to mask sensitive data when mapping the result or at the SQL level. But as you guessed, why bother? ShardingSphere is here to save the day with another easy-to-enable feature – data masking.
So let's update our configuration with this new rule:
- !MASK
tables:
reviews:
columns:
invoice_code:
maskAlgorithm: md5_mask
author_email:
maskAlgorithm: mask_before_special_chars_mask
author_telephone:
maskAlgorithm: keep_first_n_last_m_mask
maskAlgorithms:
md5_mask:
type: MD5
mask_before_special_chars_mask:
type: MASK_BEFORE_SPECIAL_CHARS
props:
special-chars: '@'
replace-char: '*'
keep_first_n_last_m_mask:
type: KEEP_FIRST_N_LAST_M
props:
first-n: 3
last-m: 2
replace-char: '*'
Let's see what we have here:
- tables.reviews – we defined a masking algorithm for each of the three columns mentioned before.
- maskAlgorithms.md5_mask – we specified the MD5 algorithm type for the invoice_code column.
- maskAlgorithms.mask_before_special_chars_mask – we configured the MASK_BEFORE_SPECIAL_CHARS algorithm for the author_email column, meaning that all the characters before the @ symbol will be replaced with the * symbol.
- maskAlgorithms.keep_first_n_last_m_mask – we configured the KEEP_FIRST_N_LAST_M algorithm for the author_telephone column, meaning that only the first 3 and last 2 characters of a telephone number stay unchanged; everything in between is masked with the * symbol.
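To build intuition for the last two algorithms, here is a standalone Kotlin sketch of the masking logic; it is my own approximation of the documented behavior, not ShardingSphere's code, and edge cases (such as values shorter than first-n + last-m) may be handled differently by the real implementation:

```kotlin
// Approximation of MASK_BEFORE_SPECIAL_CHARS: replace every character
// before the first occurrence of specialChars with replaceChar.
fun maskBeforeSpecialChars(value: String, specialChars: String, replaceChar: Char): String {
    val index = value.indexOf(specialChars)
    if (index < 0) return value
    return replaceChar.toString().repeat(index) + value.substring(index)
}

// Approximation of KEEP_FIRST_N_LAST_M: keep the first n and last m
// characters and mask everything in between with replaceChar.
fun keepFirstNLastM(value: String, firstN: Int, lastM: Int, replaceChar: Char): String {
    if (value.length <= firstN + lastM) return value
    val masked = replaceChar.toString().repeat(value.length - firstN - lastM)
    return value.take(firstN) + masked + value.takeLast(lastM)
}

fun main() {
    println(maskBeforeSpecialChars("mikescott@example.com", "@", '*')) // *********@example.com
    println(keepFirstNLastM("555-1234", 3, 2, '*'))                    // 555***34
}
```

Note how the two expected outputs match the masked fields in the GET response shown below.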
Note: You can find a lot of other masking algorithms here.
All right, let's start our application and do the same POST request.
INFO 35296 --- [nio-8070-exec-1] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_0 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 16:26:51.8188306, 2023-04-17 16:26:51.8188306, Mike Scott, mikescott@example.com, 555-1234, 124, ABC123, This is an amazing course!, 9]
We'll see nothing new here; the exact values that we provided in the body are the ones that went into the database. You can check the master/slave DBs, too, to confirm that.
But the magic comes in if we want to execute our GET request, which gives us the following body.
[
{
"text": "This is an amazing course!",
"author": "Mike Scott",
"authorTelephone": "555***34",
"authorEmail": "*********@example.com",
"invoiceCode": "bbf2dead374654cbb32a917afd236656",
"courseId": 124,
"id": 9,
"lastModifiedAt": "2023-04-17T15:44:43"
}
]
As you can see, the data stays unchanged in the database, but when queried and delivered, the telephone, email, and invoice codes got masked according to our defined algorithm in the data masking rule.
Conclusion
That's it for today, folks! I hope this article gave you a good understanding of how ShardingSphere can be used for database sharding, read-write splitting, and data masking with Spring Boot. We covered the basics of ShardingSphere, how to configure it for sharding, how to use read-write splitting, and how to apply data masking to sensitive data. Make sure to check out all the features that ShardingSphere provides and its amazing documentation here.
All the code mentioned in this article can be found here.
Happy coding!