This is the first post on building a NoSQL key-value database known as CrabDB. An important caveat is that CrabDB is in the early stages of development, so whatever is stated in this blog post can and most likely will change. This post should be treated as a point-in-time snapshot of CrabDB in its current state.
Also, the code shown in this blog post is mostly devoid of any real error handling and the like; all code is purely illustrative. If you want to see the actual code, go to the CrabDB repo.
About CrabDB
A couple of months ago, I decided that I wanted to build a database from scratch so that I could better understand how the internals of the programs I use every day actually work. I made the following decisions about my database:
- It will be a NoSQL database as at the time I just wasn't in the mood to make a SQL compiler.
- I will use no external dependencies since, one, I do really like programming and, two, I think I will learn a lot more if I code everything by hand. So in general, if it is not offered by the language's standard library, I won't use it.
- The database engine will be programmed in Rust so that it can be blazingly fast 🔥 and since it currently is my favourite programming language.
Now, after deciding all this, I settled on making a key/value database, since I had recently read a blog post on implementing a concurrent hashmap and thought it would be cool to build one myself. I then named the database CrabDB, since "crab" sounds like "grab" and a crab is nice and rusty.
Without further ado, let's get into the meat of the blog.
CrabDB Internals
CrabDB is split into the following main sections:
- object: Concerns itself with how objects are stored and represented as well as what data types can be stored in the database.
- server: This system is responsible for listening for connections as well as receiving data and sending data back.
- storage: All logic pertaining to how data is stored in the database is handled here.
- engine: Everything comes together here; this can be thought of as the manager of all the other systems, and it contains the main.rs file.
- There are other miscellaneous crates littered around; they are not integral to the functionality but will still be described in this post.
Object
An object is an item stored in CrabDB. For now, the types I want CrabDB to support are as follows:
- String (known as Text in CrabDB)
- Int
- List
- Map (aka json object, aka dictionary, aka record, aka struct)
This is where the most changes and rewrites occurred in these initial stages of the database, and at the current rate it will be rewritten again at some point. Since I had to suffer through my own insanity and indecisiveness, I will walk readers through every implementation; I will even number them for clarity.
Attempt 1 - Enums
The most natural way to model what I wanted in Rust is with an enum with a variant per data type, as follows:
struct Text { /* ... */ }
struct Int { /* ... */ }
struct List { /* ... */ }
struct Map { /* ... */ }

enum Object {
    Text(Text),
    Int(Int),
    List(List),
    Map(Map),
}
This is exactly what I did, and it worked very well. All data is received as bytes over TCP, so a payload can easily be structured as a single byte for the type followed by the bytes needed to build the object. The code to serialize and deserialize objects is as follows:
impl Object {
    fn serialize(&self) -> Vec<u8> {
        // It does suck that I am just constantly repeating code here.
        // Each variant's serialize is assumed to include its type byte.
        match self {
            Self::Text(obj) => obj.serialize(),
            Self::Int(obj) => obj.serialize(),
            Self::List(obj) => obj.serialize(),
            Self::Map(obj) => obj.serialize(),
        }
    }

    fn deserialize(bytes: &[u8]) -> Self {
        // the first byte tags which type the rest of the bytes build
        let type_id = bytes[0];
        match type_id {
            0 => Self::Text(Text::from_bytes(&bytes[1..])),
            1 => Self::Int(Int::from_bytes(&bytes[1..])),
            2 => Self::List(List::from_bytes(&bytes[1..])),
            3 => Self::Map(Map::from_bytes(&bytes[1..])),
            _ => panic!("unknown type id"), // the real code returns an error here
        }
    }
}
This is great: it's easy, and it's super simple for future Michael to come back and debug. Each variant has its own struct that handles the underlying logic. It is also important to note that, at this time, data would be deserialized and stored in its deserialized form (this will be important later).
Attempt 2 - Boxes, Traits and Dynamic Dispatch
I am a programmer, so I love to struggle and suffer; why do things the easy way when we can suffer instead? Essentially, I had the idea: what if the user wants to create a custom type with some bespoke format? I could just add another variant to the enum to handle this, but instead, what if we made an Object trait?
trait Object {
    fn serialize(&self) -> Vec<u8>;
}
Have Text, Int, List and Map implement that, then make use of dynamic dispatch with Box<dyn Object>, and then create an ObjectFactory in which you can register types with a function that builds them from a byte slice and returns a Box<dyn Object>.
type TypeId = u8;

// Simplified; the real version would also want the objects to be at least Send
struct ObjectFactory {
    factories: HashMap<TypeId, Box<dyn Fn(&[u8]) -> Box<dyn Object>>>,
}

impl ObjectFactory {
    // registers a factory, returning the one it replaced if there was one
    fn add_factory(
        &mut self,
        type_id: TypeId,
        factory: Box<dyn Fn(&[u8]) -> Box<dyn Object>>,
    ) -> Option<Box<dyn Fn(&[u8]) -> Box<dyn Object>>> {
        // code
    }

    // builds an object from bytes using the factory registered for that type id
    fn create(&self, type_id: TypeId, bytes: &[u8]) -> Option<Box<dyn Object>> {
        // code
    }
}
Now the ObjectFactory was very cool, and I had it as a "singleton", but it was difficult to work with, and the whole time I was doubting whether it was even worth it. Besides the difficulty of implementing it, there were easier ways to achieve the same thing that would also probably be faster since, as a general rule of thumb, static dispatch with enums beats dynamic dispatch.
sigh
Attempt 3 - Enums... again
In the end, everything was complicated; the code, while not spaghetti, gave me a headache to look at, so I rewrote it back to the original enum implementation. While I was at it, I also deleted all the other code I had, effectively starting from scratch. And you know what: even though all I did was go back to the original implementation, coding it from scratch somehow resulted in less code and, more importantly, cleaner code. A good rule is that you will always code something better the second, third, ..., nth time around.
Internal Representation
Originally, all data was deserialized on receipt into its actual type. In other words, data arrived as raw bytes and I converted it to whatever type it was declared as. For example, a list would be deserialized into a type like this:
struct List {
    items: Vec<Object>, // remember the Object definition from above
}
This has some benefits: for example, if you want to perform operations on the data, it is already in the correct format. Naturally, though, it also has downsides. The above representation is less memory efficient, and deserializing costs time, so every database operation gets slower; worse, you pay the cost twice, once when you deserialize and again when you serialize to send the data back over TCP. These costs compound when you consider that Map and List can themselves contain maps and lists, making them recursive types.
Due to the above considerations, there was one more major change to the object crate: store the data as a flat byte array alongside a tag specifying which data type is stored. If any operation needs to be performed on the data, it first has to be deserialized and then reserialized, which obviously has a cost, but considering that the most common database operation is a read, having data sent to clients faster because no serialization needs to take place is worth it. The current implementation looks something like this:
enum ObjectKind {
    Text,
    Int,
    List,
    Map,
}

struct Object {
    object_kind: ObjectKind,
    data: Vec<u8>,
}
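With this layout, handing an object back to a client becomes nearly free. Here is a rough sketch of what serialization might now look like; the tag values are my assumption for illustration, not necessarily CrabDB's wire format:

impl Object {
    // Serializing is now just prepending the kind tag to the stored
    // bytes; no recursive walk of nested lists/maps is needed
    fn serialize(&self) -> Vec<u8> {
        let tag = match self.object_kind {
            ObjectKind::Text => 0u8,
            ObjectKind::Int => 1,
            ObjectKind::List => 2,
            ObjectKind::Map => 3,
        };
        let mut bytes = Vec::with_capacity(1 + self.data.len());
        bytes.push(tag);
        bytes.extend_from_slice(&self.data);
        bytes
    }
}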
Server
This part of the database is responsible for listening for client connections and for receiving and sending data. The networking is not all that interesting: all it requires is designing a protocol for sending and receiving data and deciding what to build that protocol on. CrabDB uses TCP as the underlying protocol since it has low overhead (relative to HTTP) and is pretty fast. Like all good network protocols, CrabDB's starts by stating how much data is being sent; this is followed by the command being executed (GET, SET, etc.), then the type id (Text, Int, etc.), and finally the actual data for the type, if there is any. A full overview of the protocol can be found on the GitHub for CrabDB.
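To make that concrete, here is a rough sketch of how a request frame might be read off a TcpStream. The field widths and ordering are my guesses for illustration; the real layout is documented in the repo:

use std::io::{self, Read};
use std::net::TcpStream;

// Hypothetical frame: a u32 payload length, a command byte,
// a type id byte, then the payload itself
fn read_frame(stream: &mut TcpStream) -> io::Result<(u8, u8, Vec<u8>)> {
    let mut header = [0u8; 6];
    stream.read_exact(&mut header)?;
    let len = u32::from_be_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let (command, type_id) = (header[4], header[5]);
    let mut payload = vec![0u8; len];
    stream.read_exact(&mut payload)?;
    Ok((command, type_id, payload))
}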
In general, as you can see, the networking part is relatively boring, and I haven't made any cool optimizations yet, but the following commands are supported:
- GET: gets an object
- SET: sets an object
- DELETE: deletes an object
- CLOSE: closes the connection
Storage
This is again where things become more interesting. CrabDB is designed to be a hyper-performant database, which means most, if not all, of the data should be kept in memory. To this end, the initial implementation of CrabDB only supported in-memory storage with zero persistence. First a trait was defined:
trait Storage {
    // returns the previous object under that key if there is one
    fn set(&mut self, key: &str, object: Object) -> Option<Object>;
    fn get(&self, key: &str) -> Option<Object>;
    // returns the deleted object
    fn delete(&mut self, key: &str) -> Option<Object>;
}
Remember, this is not exactly how the trait looks in the actual implementation; there it is much more sophisticated, using a Key type and returning a Result for better error handling. The code shown in this blog post strips out all the boring fluff.
An In-Memory-Only Storage Solution
The simplest way to create a type that implements this trait is with a plain Rust HashMap.
struct InMemoryStorage {
    map: HashMap<String, Object>,
}

impl Storage for InMemoryStorage {
    // ... you can probably figure out what goes here
}
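If you would rather not figure it out, a minimal sketch might look like this (assuming Object derives Clone; the real code avoids some of this copying):

impl Storage for InMemoryStorage {
    fn set(&mut self, key: &str, object: Object) -> Option<Object> {
        // returns the object previously stored under this key, if any
        self.map.insert(key.to_string(), object)
    }

    fn get(&self, key: &str) -> Option<Object> {
        // cloning keeps the sketch simple
        self.map.get(key).cloned()
    }

    fn delete(&mut self, key: &str) -> Option<Object> {
        self.map.remove(key)
    }
}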
And while this works and is fast, it's a little bit sad because it's single-threaded, and multithreaded code is always better, right? Right? So now you want your code to be multithreaded, so you slap on a mutex, or even better a reader-writer lock, and call it a day. Unfortunately, at your grand database unveiling, people notice that if they have a workload with many mixed reads and writes, your database starts taking a bit too long to, well, do anything. We in the business call that contention.

If the problem is not immediately obvious: the database performs fine when you write infrequently and almost always read, but when you write a lot, you need to take the writer's lock, which blocks all other access to the data. Now you end up going down a rabbit hole trying to figure out how to implement a truly concurrent hashmap since, if you recall, we are not allowed to use third-party packages and everything must be home-grown. sigh. In your travels across the wider internet, you keep coming across people mentioning Java's (I must assume at this point) legendary concurrent hash map. As an aside, this irks me, since I think Java is one of the worst programming languages ever made.
Here is a rundown of a simple implementation of a concurrent hashmap; if you would like a more fully featured one, I would recommend checking out dashmap, flashmap or even, if you must, Java's ConcurrentHashMap.
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

struct ConcurrentHashMap<K, V> {
    // The trick is to have a list of hashmaps
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K, V> ConcurrentHashMap<K, V> {
    // Creates a new concurrent hashmap with the specified number of shards
    pub fn new(num_shards: usize) -> Self {
        let mut shards = Vec::with_capacity(num_shards);
        for _ in 0..num_shards {
            shards.push(RwLock::new(HashMap::default()));
        }
        Self { shards }
    }
}

impl<K, V> ConcurrentHashMap<K, V>
where
    K: Hash + Eq, // required for hashmap
{
    // Gets the shard index that a key belongs to
    fn shard_index(&self, key: &K) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        hasher.finish() as usize % self.shards.len()
    }

    // Gets the shard that a key belongs to
    fn shard(&self, key: &K) -> &RwLock<HashMap<K, V>> {
        &self.shards[self.shard_index(key)]
    }

    // Inserts a new element into the ConcurrentHashMap and returns
    // the element that was already there if one exists
    pub fn insert(&self, key: K, value: V) -> Option<V> {
        // only this key's shard is write-locked; the others stay available
        let mut map = self.shard(&key).write().unwrap();
        map.insert(key, value)
    }

    // Reads an element from the map and returns None if no element is found.
    // The value is cloned so the read lock can be released immediately.
    pub fn get(&self, key: &K) -> Option<V>
    where
        V: Clone,
    {
        let map = self.shard(key).read().unwrap();
        map.get(key).cloned()
    }
}
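As a quick sanity check of the sketch, sharing it across threads might look like this:

use std::sync::Arc;
use std::thread;

fn main() {
    // 16 shards is an arbitrary choice for this example
    let map = Arc::new(ConcurrentHashMap::new(16));
    let handles: Vec<_> = (0..8)
        .map(|i| {
            let map = Arc::clone(&map);
            // writes to different shards proceed in parallel
            thread::spawn(move || map.insert(format!("key{i}"), i))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(map.get(&"key3".to_string()), Some(3));
}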
I left out delete, since it is not complicated to implement, along with any other niceties. The trick to making a concurrent hashmap work should now be obvious: have a list of non-concurrent hashmaps, each protected by a reader-writer lock, so that multiple hashmaps can be accessed concurrently. The actual mechanism is very simple: first hash the key as you normally would, then mod the hash by the number of shards to decide which shard the key belongs to. The ideal number of shards, though, is a heuristic that must be tuned by testing; an even cooler approach would be to have the shard count grow and shrink dynamically.
This also makes use of interior mutability, so only a &self needs to be taken. The Storage trait was likewise updated to take only &self so that it can safely be shared across threads, with the implementing type responsible for upholding the safety guarantees.
Now the above implementation does have some issues that a more seasoned concurrent hashmap wouldn't have. For example:
- The get clones data; surely returning a reference would be better, and it is. The actual implementation of CrabDB's concurrent hashmap does this.
- The default Rust hasher is used, but there are faster ones out there.
- If one shard keeps getting hit, we essentially have our previous contention problem, so the hash function should try to spread keys evenly across shards.
Overall, this is a good enough first attempt at a concurrent hashmap to get CrabDB going. Next, we are going to look at persistence, since losing all your data on restart does kind of suck.
Persistent Storage
Being able to concurrently read and write data in memory is very cool, but unfortunately many users would ideally want their data persisted between database restarts or, god forbid, a crash. There are typically two main mechanisms used for persisting data in an in-memory database like Memcached or Redis:
- Append-only file (AOF), which basically means any write operation (for us, currently only set and delete) is written to a log file. On startup, this log file can be read and the in-memory part of the database recreated by sequentially applying the operations.
- Taking periodic snapshots, which is just what it sounds like, where you on some interval take a snapshot of the database.
These two solutions both have their pros and cons. AOF is very fast when writing, since it just appends to a log file, but reading the data back in can take a variable amount of time, and pointless operations can accumulate. For example:
set key1 0
del key1
set key1 0
The exact same set appears twice, and since a set occurs after the delete, the delete is pointless. This can be mitigated by periodically compacting the log file, but compaction pauses all writes to the database while it runs. Snapshots are much faster to read back, but while a snapshot is being created, no writes to the database can occur; on top of that, any data written after a snapshot is not saved until the next one.
Most databases implement both of these: they first load from a snapshot and only read the log for data written after the snapshot was taken, offering the best of both worlds.
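To make the AOF idea concrete, here is a minimal sketch of replaying a log on startup, assuming a toy line-based format of set <key> <value> and del <key> (a real format would be binary):

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

// Rebuilds the in-memory state by applying every logged operation in order
fn replay_aof(path: &str) -> std::io::Result<HashMap<String, String>> {
    let mut map = HashMap::new();
    for line in BufReader::new(File::open(path)?).lines() {
        let line = line?;
        let mut parts = line.splitn(3, ' ');
        match (parts.next(), parts.next(), parts.next()) {
            (Some("set"), Some(key), Some(value)) => {
                map.insert(key.to_string(), value.to_string());
            }
            (Some("del"), Some(key), _) => {
                map.remove(key);
            }
            _ => {} // a real implementation would error on malformed lines
        }
    }
    Ok(map)
}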
For the initial implementation I went with AOF, for which there are two ways to implement it; both are discussed below.
Multiple Files with Sharding
This is the current implementation that CrabDB uses. It works in almost the same fashion as the concurrent hashmap: there are multiple files, and each shard in the hashmap maps to a file 1:1. Since data is only written to the in-memory storage after it has been successfully written to the file, having one file per shard reduces contention. The biggest downside of doing it this way is that the number of shards cannot be changed, since each file maps to a shard, and weird things can happen if you are not careful. It does have the benefit that all files can be reingested in parallel.
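A sketch of what the write path might look like under this scheme, assuming each shard pairs its map with a Mutex-protected file handle (the names and the text log format here are mine, not CrabDB's):

use std::collections::HashMap;
use std::fs::File;
use std::io::Write;
use std::sync::{Mutex, RwLock};

// Hypothetical shard pairing an append-only log file with its in-memory map
struct Shard {
    log: Mutex<File>,
    map: RwLock<HashMap<String, Vec<u8>>>,
}

impl Shard {
    fn set(&self, key: String, value: Vec<u8>) -> std::io::Result<()> {
        // the write is durable on disk before it becomes visible in memory
        writeln!(self.log.lock().unwrap(), "set {key} {value:?}")?;
        self.map.write().unwrap().insert(key, value);
        Ok(())
    }
}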
Single File with Channel
This is most likely what CrabDB will switch to in the near future. Instead of multiple files where each shard maps to a file, every write operation sends the operation and its data through a channel, and a separate thread receives it and writes it to a single file. This makes it possible to change the number of shards, since there is only a single source of truth. It naturally has some downsides too: for example, there is an upper limit to how fast you can write to one file (probably higher than you think), and you can no longer read the data back in concurrently.
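A rough sketch of the channel approach using std::sync::mpsc and a dedicated writer thread (again, the names and log format are mine):

use std::fs::File;
use std::io::Write;
use std::sync::mpsc;
use std::thread;

// Hypothetical log record sent from the storage layer to the writer thread
enum LogOp {
    Set(String, Vec<u8>),
    Delete(String),
}

fn spawn_log_writer(mut file: File) -> mpsc::Sender<LogOp> {
    let (tx, rx) = mpsc::channel::<LogOp>();
    // a single thread owns the file, so writes never contend on it
    thread::spawn(move || {
        for op in rx {
            let _ = match op {
                LogOp::Set(key, value) => writeln!(file, "set {key} {value:?}"),
                LogOp::Delete(key) => writeln!(file, "del {key}"),
            };
        }
    });
    tx
}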
Engine
This crate is fairly boring and just acts as the glue for all the other crates; it creates the server using the server crate and then passes data to the storage crate through the medium of the object crate. Yup, that's it, nothing special or really interesting.
Other Crates
There are other crates involved, but they are not considered as core as the crates already mentioned.
Threadpool
This is just a threadpool that I implemented which prespawns a set number of threads and allows for queueing up jobs. Every time a client connects to the database, a new job is submitted to the threadpool for it.
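A stripped-down version of the idea, with a channel feeding prespawned workers; this is essentially the textbook design from the Rust book, not necessarily CrabDB's exact code:

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

struct ThreadPool {
    sender: mpsc::Sender<Job>,
}

impl ThreadPool {
    fn new(num_threads: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        // the receiver is shared, so workers take turns locking it for the next job
        let receiver = Arc::new(Mutex::new(receiver));
        for _ in 0..num_threads {
            let receiver = Arc::clone(&receiver);
            thread::spawn(move || loop {
                // hold the lock only while receiving, not while running the job
                let job = receiver.lock().unwrap().recv();
                match job {
                    Ok(job) => job(),
                    Err(_) => break, // channel closed: the pool was dropped
                }
            });
        }
        Self { sender }
    }

    fn execute<F: FnOnce() + Send + 'static>(&self, job: F) {
        let _ = self.sender.send(Box::new(job));
    }
}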
Concurrent Map
This is just an implementation of a concurrent map as described above.
Conclusion
This is just the first part of a series, and much will still change. This was mainly made as a learning project for now, but who knows what will happen. Check out the code on GitHub.