Storing data on a public blockchain is very different from storing it in a traditional database. One of the main differences is who can access the data. With a database, the assumption is that the data is private and will remain private. With a public blockchain, the assumption should be that the data is public and will remain public. In the first case private data may become public, which happens every time there is a leak. In the second case public data will never become private again. Once it is public, it stays public. The genie is out of the bottle. Data privacy concerns might not seem that important when storing data in a private database. At least until the data is leaked… and the damage is done.
Surprisingly, few blockchains worry about privacy at all. All data stored on them is public for everyone to read. Monero, by contrast, has privacy built in, so no one can figure out who owns what or to whom funds were transferred. This is an invaluable feature of Monero that hasn’t received enough credit from the crypto community. But Monero focuses on financial transactions only. Smart-contract-based blockchains pose a worse kind of privacy problem, since arbitrary data may be stored on them, yet they provide no built-in way to encrypt it. This lack of privacy should be a deal breaker for most businesses considering adopting smart contract technologies.
One might argue that anyone wanting to store data on a blockchain should implement their own encryption mechanism for their use case. But encryption is hard and poses many challenges. One of them is data search and retrieval. Fully encrypted data is not searchable. This means that the only way to retrieve data from an encrypted dataset is to decrypt everything and then look for what is being searched for. This is very inefficient and should not even be an option in many cases, because the data might be stored on a remote server managed by an untrustworthy third party like a cloud service provider or, ahem… a blockchain node. Fortunately, there are some strategies to get around this problem, each with different compromises.
Circling back to BlockBase: it was designed from the ground up to deal with data privacy. All data stored on BlockBase is encrypted by default. This is the opposite of what is done on other data storage platforms, especially on blockchains. BlockBase can store data in an unencrypted format too, but it needs to be explicitly told to do so. In fact, users can decide exactly what they want to encrypt and what to leave unencrypted. For that purpose, we’ve built our own data querying language, which we call BBSQL. It is very similar to standard ANSI SQL but has additional syntax built in for dealing with data encryption, giving programmers a very powerful way to express exactly what data they want to encrypt.
All encrypted data stored on a BlockBase sidechain can be queried without revealing its contents to the providers. Very few solutions out there even try to tackle this problem. We based our solution on a concept called Order Preserving Encryption (OPE), but we took it a step further. OPE is a type of encryption in which the encrypted data keeps the same alphanumeric ordering as its unencrypted counterpart. However, this method has been shown to leak information under sophisticated statistical analysis attacks.
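To make the order-preserving property concrete, here is a toy sketch in Python. It is not a real OPE scheme and not what BlockBase uses: the function name `toy_ope_table`, the key, and the sample ages are all made up for illustration, and the mapping is not secure. It only shows that ciphertexts sort in the same order as plaintexts, which is both what makes range queries possible and what an attacker can exploit.

```python
import hashlib
import hmac


def toy_ope_table(key: bytes, domain_size: int) -> list:
    """Build a strictly increasing table: table[x] is the toy 'ciphertext' of x.

    Illustration only. Each step adds a keyed pseudo-random positive gap,
    so larger plaintexts always map to larger ciphertexts.
    """
    table = []
    current = 0
    for x in range(domain_size):
        digest = hmac.new(key, x.to_bytes(4, "big"), hashlib.sha256).digest()
        gap = 1 + int.from_bytes(digest[:2], "big")  # always >= 1
        current += gap
        table.append(current)
    return table


table = toy_ope_table(b"demo-key", 100)  # hypothetical key, plaintexts 0..99
ages = [42, 7, 19, 88]
encrypted = [table[a] for a in ages]

# The storage side can answer "age > 18" by comparing ciphertexts only.
threshold = table[18]
matches = [a for a, e in zip(ages, encrypted) if e > threshold]
print(matches)  # [42, 19, 88]

# The same property is the leak: sorting ciphertexts reveals plaintext ranking.
ranking = [a for _, a in sorted(zip(encrypted, ages))]
print(ranking)  # [7, 19, 42, 88]
```

Anyone holding the ciphertexts learns the full ordering of the underlying values, even without the key.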
We’ve built a new encryption method that allows developers to fine-tune how much order they want to reveal per column of every table in their databases. We call this The Bucketing Strategy. For each column with encrypted data, it adds an extra column that holds a bucket id for each encrypted row value. A query built to fetch information from the encrypted column is actually transformed to use bucket ids, not the encrypted values themselves. This is done transparently for the user who issued the query. The requester node translates the user query, substituting all encrypted elements with their corresponding bucket ids. When the results are received, all data is decrypted on the requester side and presented to the user. With this model, the data storage provider never gets to know the contents of the data.
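The flow described above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not BlockBase's implementation: the article does not say how bucket ids are derived, so the sketch assigns them with a keyed hash (HMAC), and the `Provider` and `Requester` classes, the key, and the `toy_encrypt`/`toy_decrypt` helpers are all hypothetical. The "cipher" is a stand-in built from the standard library so the example runs on its own; it must not be used for real data.

```python
import hashlib
import hmac
import secrets

KEY = b"demo-key-not-for-production"  # hypothetical secret held by the requester


def bucket_id(column: str, value: str, n_buckets: int) -> int:
    """Assumed scheme: derive a bucket id from a keyed hash of the plaintext."""
    digest = hmac.new(KEY, f"{column}:{value}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def _keystream(nonce: bytes, length: int) -> bytes:
    """Toy keystream for the stand-in cipher below (not secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(4, "big")
        out += hmac.new(KEY, block, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def toy_encrypt(value: str) -> bytes:
    data = value.encode()
    nonce = secrets.token_bytes(16)
    return nonce + bytes(a ^ b for a, b in zip(data, _keystream(nonce, len(data))))


def toy_decrypt(blob: bytes) -> str:
    nonce, body = blob[:16], blob[16:]
    return bytes(a ^ b for a, b in zip(body, _keystream(nonce, len(body)))).decode()


class Provider:
    """Untrusted storage: it only ever sees bucket ids and ciphertext."""

    def __init__(self):
        self.rows = []  # list of (bucket_id, ciphertext)

    def insert(self, bucket: int, ciphertext: bytes) -> None:
        self.rows.append((bucket, ciphertext))

    def select_by_bucket(self, bucket: int) -> list:
        return [c for b, c in self.rows if b == bucket]


class Requester:
    """Trusted side: translates queries to bucket ids and decrypts results."""

    def __init__(self, provider: Provider, column: str, n_buckets: int):
        self.provider = provider
        self.column = column
        self.n_buckets = n_buckets

    def insert(self, value: str) -> None:
        bucket = bucket_id(self.column, value, self.n_buckets)
        self.provider.insert(bucket, toy_encrypt(value))

    def find(self, value: str) -> list:
        # The provider is asked for a bucket, never for the value itself.
        bucket = bucket_id(self.column, value, self.n_buckets)
        candidates = self.provider.select_by_bucket(bucket)
        # Decrypt locally and drop rows that merely share the bucket.
        return [v for v in map(toy_decrypt, candidates) if v == value]


provider = Provider()
requester = Requester(provider, column="name", n_buckets=4)
for name in ["alice", "bob", "carol", "alice"]:
    requester.insert(name)
print(requester.find("alice"))  # ['alice', 'alice']
```

The provider returns every row in the matching bucket, so the requester does the final exact-match filtering after decryption. That extra local work is the price paid for the provider not learning which value was searched for.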
For example, if all rows of an encrypted column had one and the same bucket id, any query issued by the user for a value in that column would be translated to that one bucket id, and would therefore return all the rows, because all of them would match. But the cool thing is that users can specify the number of bucket ids they want per column when designing their tables. So, if the user defines ten buckets for another column, a query for a value in that column will return only about one tenth of the rows, assuming the values are spread evenly across the buckets.
As such, the user can decide how many buckets each encrypted column will have. Increasing the number of buckets exposes more statistical information about the corresponding encrypted data, so the goal is to keep the number of buckets low compared with the variety of values that the column may hold. An awesome feature of this approach is that if the user searches with more than one data point, the number of rows to examine shrinks multiplicatively with each additional data point, even with a low number of buckets per encrypted column.
For example, if the user query provides two data points for two separate encrypted columns, each with 10 buckets, the query will need, on average, to go through only 1/100th of the dataset (10 × 10 = 100). If, in a similar example, we had three data points for three columns instead of two, the query would only need to go through 1/1000th of the dataset. These averages assume that values are spread evenly across the buckets and that the columns are independent of each other.
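The arithmetic above is just a product of per-column fractions, which a few lines of Python can make explicit. The helper name `expected_fraction` and the one-million-row dataset are made up for illustration, and the result is only an average under the assumption that values spread evenly across buckets and that the columns are independent.

```python
from fractions import Fraction
from math import prod


def expected_fraction(buckets_per_column):
    """Average share of rows a query must scan when it supplies one value
    for each listed column (assumes even, independent bucket assignment)."""
    return prod((Fraction(1, b) for b in buckets_per_column), start=Fraction(1))


total_rows = 1_000_000  # hypothetical dataset size

two_columns = expected_fraction([10, 10])
three_columns = expected_fraction([10, 10, 10])

print(two_columns, int(total_rows * two_columns))      # 1/100 10000
print(three_columns, int(total_rows * three_columns))  # 1/1000 1000
```

Columns do not need the same number of buckets: a query over columns with 5 and 20 buckets would also scan about 1/100th of the rows on average.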
For all of this to work properly and securely, the designer of the encrypted database has to think carefully about the number of buckets to assign to each table column, and that is hard, especially since this is a new field of data modeling. But please note that the fact that this is hard to model is actually a good thing, precisely because it’s a new way of data modeling that is centered on data privacy and search. There are more dimensions for the developer to consider, but for a good reason. Finally, we can store and retrieve encrypted data quickly without revealing everything to the ones storing our data!
Published by Ricardo Schiller, Lead Architect of BlockBase.Network