How to Handle One Million Transactions per Second on a Single CPU Core
Get an in-depth understanding of the complex concept of asynchronous processing with in-memory databases with a simple example involving planes, trains, and automobiles!
Join the DZone community and get the full member experience.
Join For FreeIn my couple of my last articles, I was talking about persistence with in-memory databases. Check this out here and here.
In this article, I would like to touch upon the performance problems of in-memory databases. For starters, let's just talk about performance in the simplest case: when you change the value of a specified key. And let's simplify this case even further: assume there's no database server at all — I mean no client-server interaction over the network. So, the database resides totally inside your application's RAM space.
If you didn't have a database server, then you would probably store key-value pairs inside your application's memory in a hash table. In C/C++, it would be a data structure like this:
std::unordered_map
In order to check how fast this data structure is, I created a file named 1.cpp
with the following content:
#include <map>
#include <unordered_map>
#include <iostream>
const int SIZE = 1000000;
int main()
{
std::unordered_map<int,int> m;
m.reserve(1000000);
long long c = 0;
for (int i = 0; i < SIZE;++i)
c += m[i*i] += i;
std::cout << c << std::endl;
}
Then I compiled and ran it:
g++ -std=c++11 -O3 1.cpp -o 1
MacBook-Air-anikin:Downloads anikin$ time ./1
And got this result:
real 0.465s
user 0.422s
sys 0.032s
What observations can we get from that? What I know is:
- I love C/C++!
- I love my good old MacBook Air (actually, I don't, because it's getting slow over time, but let me leave that for another article).
- I love using the
-O3
option. Some folks are afraid of it, but please don't be. If you are, performance could be as poor as this:
The result without O3 is twice as worse:MacBook-Air-anikin:Downloads anikin$ g++ -std=c++11 1.cpp -o 1 MacBook-Air-anikin:Downloads anikin$ time ./1
real 0.883s user 0.835s sys 0.033s
- The application spent most of its time in the user mode. A little bit of time spent for the system mode here was a fixed cost needed, I believe, to initially allocate pages for the hash table (and there is no
O3
helping to reduce this time) to perform and load the executable file. - This application inserts roughly one million keys into the hash table. The word roughly here means that it can be less than one million because of repeating keys caused by an overflow at
i*i
. So, insertions can become updates. But the number of operations with the hash table is still one million. - This application inserts one million keys and runs around 0.5 seconds, so it makes roughly two million set-a-value-by-a-key operations per second. This observation is very interesting to me. You can consider that you already have an in-memory key-value storage engine named
std::unordered_map
that's able to perform two million key-value operations per second on a single CPU core on a good old MacBook Air:MacBook-Air-anikin:Downloads anikin$ uname -a
Notice that I used integer keys and values. I could use strings, but I didn't just because I didn't want memory allocation and copying to interfere with the test. Also, if you use std::unordered_map>
, then you're likely to have collisions that will reduce performance.
What we see now is that an in-memory hash table is able to perform two million operations per second on a single CPU core. But remember, I was going to talk about in-memory databases. How is an in-memory database different from an in-memory hash table? Well, a database is a server application, whilst a hash table is a library. So, a database is a hash table plus something more! And this something more includes at least a server application.
Let's build a database server application around std::unordered_map<int, int>
. A naive approach could look like this:
Accept connections in the main thread.
Create a new thread for each accepted connection (or have a thread pool with threads created in advance).
Protect
std::unordered_map
with some synchronization primitive, i.e.mutex
.And don't forget about persistence — log each update to a transaction log.
I don't want to bore you with writing code for this server, so assume we have already done that. Indeed, look at any database server based on a one-thread-per-connection architecture (MySQL, MariaDB, Postgres, etc.) — it can perform tens of thousands of requests per second on a single CPU core in the best case scenario. The best performance for traditional databases that I could find on the Internet was around one million queries per second. It was MariaDB running on a 20-core machine. See details here. So, it is 50K requests per second. One of the best databases on the market, tuned by probably the best database people on earth can only provide 50K requests per second on a single CPU core.
50K vs. 2 million: The difference is 40 times as compared with std::unordered_map
. How do you like that?! You just surrounded a data structure with a server in order to grant other applications remote access to this data structure and got a 40x performance penalty! This is so sad that I think that we should forget about multitier architecture and write all the business logic and database logic in a single application inside a single process. Or... we can try to optimize the database server.
Let's look closer at what's happening when a database server with the above-mentioned architecture processes a transaction in terms of system calls.
Read a request from the network.
Lock the hash.
Unlock the hash.
Write to the transaction log.
Write to the network.
These are at least five syscalls to a database per request. Each syscall requires entering the kernel mode and exiting the kernel mode.
On entering/exiting the kernel mode, there is a context switching. Context switching implies tons of extra work — you need to back up and to restore all the registers and other sensitive information. See details here.
To give you an idea of how bad syscalls are, I wrote another program (in C):
#include <stdio.h>#include <fcntl.h>#include <unistd.h>
What it does is to just call read byte-by-byte from the /dev/zero
file one million times. The result of this test is:
MacBook-Air-anikin:Downloads anikin$ time ./2
First of all, the program spends nearly all the time in the kernel mode. Second of all, it makes roughly 1.5 million syscalls per second. Remember that the hash lookup rate was about two million times per second. Isn't it interesting that a single read syscall is 30% slower than a hash table lookup? And that was a very simple syscall. It didn't access disk or network. It just returned zeros.
As you can see above, within a database server, we need at least five syscalls to handle a hash table lookup. So, we need at least 5*1.3=6.5x time just for syscalls! If you think about syscalls as taxes, then this is like an 85% tax. Would you be happy if you paid an 85% tax on your salary? I mean, what if you got only $15 out of $100 that you made? Now, let's think about the real read, write, and other syscalls. They do a lot of work — read from a network buffer, allocate slabs in the Linux kernel, look up and change internal kernel structures, etc. A 95%+ tax does not look that fantastic. Indeed, you can strace
MySQL or any other traditional database and check out how many syscalls it issues during request processing.
OK, so syscalls are evil. Can we abandon syscalls and move all the database logic to the kernel? That's a good idea. But there's probably something more feasible. Let's look at another example:
#include <stdio.h>#include <fcntl.h>#include <unistd.h>
And the result is:
MacBook-Air-anikin:Downloads anikin$ time ./2
This program does exactly the same thing as the previous one; it copies one million bytes from /dev/zero
. And its running time is 7 ms compared with 639 ms in the previous example. That is almost 100x faster! What is the trick? The trick here is that we did fewer syscalls and more work per syscall. It turns out that syscalls are not bad if they do a lot of work. You pay only fixed cost per syscall, and then you can use them almost for free. It's like in the Disneyland — pay once for admission and use it all day long, or less than all day long, still paying the same admission, but paying more per attraction.
So, in order to speed up our database server, we just need to issue fewer syscalls and do more work within each syscall. How do we do that? Let's consider group requests and processing:
Read 1,000 requests from the network (via a single read syscall).
Lock the hash.
Process 1,000 requests.
Unlock the hash.
Write 1,000 transactions to the transaction log (via a single write/writev syscall).
Write 1,000 responses to the network (via a single write syscall).
Great! But wait a minute. A database is not a batch processor. It is an online transaction processor. As soon as it gets a request, it should process it immediately with the best latency it can. It can't wait for a whole batch of 1,000 requests to come in because it can wait forever.
How can we solve this problem?
Let's look at public transport. They have already solved this problem in the last hundred years. A bus ticket is cheaper than a taxi cab because buses have higher throughput and less cost per passenger. But the latency (waiting time) of a bus (or a train, depending on the public transport strategy of a specific city) is roughly the same as that of a taxi cab in a busy downtown (at least, I saw that trend in New York City, Moscow, Paris, Berlin, and Tokyo).
What's the trick?
The trick is that a bus never waits for a hundred passengers to arrive at a bus stop. It always has enough passengers at a bus stop to pick them up because the downtown is busy (the workload is high). This way, a bus pulls in, picks everybody up (until the bus is full or until there is nobody at the bus stop), and pulls out immediately. Then, the next bus pulls in and picks up enough people again because they've already arrived at the bus stop since the previous bus was gone.
To do the same thing to a database, we need to treat a network subsystem, a transaction processor, and a disk subsystem as independent buses (or trains if you wish). Each of those buses is working asynchronously with respect to the others. And each of those buses boards as many people as it has at the bus stop. If there are not enough people at the bus stop, then, yes, we use CPU inefficiently because buses with high fixed costs are almost empty. But on the other hand, who cares? We perfectly serve the workload that we get. This way, we can have 99% CPU usage on 10K requests per second and the same 99% CPU usage on 1000K requests per second with the same good latency because the number of syscalls remains the same. And that number matters more than the number of bytes transferred within a syscall. The CPU is always busy, but it magically stretches up to higher workloads, staying as busy as it was with smaller workloads. Remember that the latency of a syscall doing 100x work is almost the same as that of a syscall doing 1x work because of an extremely high fixed cost per a syscall.
What's the best way to implement the whole thing? Just do everything asynchronously:
In Tarantool, we maintain three threads: the network thread, the transaction processing (we call it TX) thread, and the disk thread.
The network thread reads requests from the network (whatever amount of requests it can read from a network buffer without blocking on I/O, whether it's 1 or 1,000 or more). And then it puts all the requests to the TX. Also, it gets responses from the TX and writes them to the network. And again, it does so within a single network package, no matter if it contains a single response or thousands of responses.
The TX thread just processes transactions in-memory, group-by-group, taking them from the network thread. After it's done processing a group in-memory, TX passes the group to the disk thread. And again, it passes transactions to the disk thread group-by-group, handling as much data as it has — just like when people step out of the train within a big group and they all go to the bus. The bus takes everyone who reaches the bus until it has no people at the bus stop. Folks who are running late are going to take another bus. A bus doesn't wait even a millisecond; if there is nobody after the last person, then it hits the road. As long as the disk thread is done with the group, it returns this group to the TX thread to commit transactions and to return all the requests within this group to the network thread.
That's how we significantly reduced the number of syscalls in Tarantool. And this is how it works and handles one million transactions on a single CPU core.
You can see the entire workflow at this picture:
The main thing here is that each thread is working in parallel and does not interfere with other threads making their job. The more parallel and heavy the workload is, the fewer syscalls per request and the more requests per second the system can handle.
At the same time, the latency is good because threads don't wait for other threads — they just handle as much work as they have right now, and while they're handling it, a new piece of work is being prepared in parallel.
There will be new articles about transaction processing in an in-memory database. Stay tuned!
Published at DZone with permission of Dennis Anikin, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments