Proof Provision
How to trust the source without knowing the source.
Hi folks, we are back again!
At GeoDB we keep working hard so that users can benefit from our services as soon as possible. Over the last few days we have released new versions of GeoCash for both Android and iOS, so we encourage you to help us test the app: download it and start sharing your data in exchange for GEOs*.
In this post, however, we are going to present something different: the IOTA-based proof provision protocol that we use to add trust to data.
GeoDB aims to allow users to commercialize their data. This requires mechanisms that give credibility to the data provided, so certain properties must be guaranteed as a form of validation. For example, it is important to know when the data was created and when it was provided.
With this constraint in mind, we have defined a DLT-based protocol to publish data validation information. In particular, we use the IOTA ledger because it is immutable and public, which makes the validation process transparent. For the verification mechanism itself we use SHA-256 [1].
“OK, this is fine, but if all the information is meant to be released to the public, what happens to the privacy of the user who provides the data? Anyone with access to IOTA could check who received rewards for which proofs, and infer which data belongs to the same user.”
(GeoDB enthusiast, 2020)
We are aware of this issue, and that is why we use mechanisms such as asymmetric cryptography to reduce the risk of a user privacy breach.
With all of the above in mind, we have defined a proof provision protocol that meets these requirements.
Proof Provision Protocol
The two main elements of the provision protocol are the proof provision channels and the provision strategies. Both elements, as well as the workflow of the protocol, are described below.
Note that the provision protocol has two parts: (i) proof provision and (ii) data provision. In this post we focus only on (i), so stay tuned for future posts covering (ii).
Proof Provision Channels
These channels are inspired by MAM channels [2]. MAM channels are based on Merkle trees [3]: when a new message is written, it references the root of the next Merkle tree, which will correspond to the following message. Moreover, MAM channels control visibility of and access to the data by offering public, private and restricted channel modes.
The provision protocol does not need all the features that MAM channels provide. Moreover, the MAM implementation is currently in alpha [4], so adopting it would have required developing our own implementation anyway. Instead of implementing the whole MAM specification, we opted to design our own channel tailored to the specifics of the proof provision protocol.
Following a similar idea, a proof provision channel consists of an array and a seed that generates key pairs (public and private). Each position (index) of the array is associated with a key pair. These key pairs are compatible with Ethereum, as they are generated over the same elliptic curve: secp256k1 [5].
The key pair associated with each index is used for signing and for asymmetric ciphering. In particular, the IOTA address is derived from the public key, and that address is where we write all the information used to validate the data.
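To make this concrete, the sketch below shows, in Python, how a key pair could be derived for each channel index from the seed. The derivation (hashing the seed together with the index) and the use of the ecdsa package are illustrative assumptions; the exact scheme we use is not the point here.

import hashlib

from ecdsa import SigningKey, SECP256k1  # pip install ecdsa

def channel_keypair(seed: bytes, index: int):
    """Derive a secp256k1 key pair for one index of a proof provision channel.

    Hashing seed || index is an illustrative derivation, not necessarily
    the scheme used in production.
    """
    private_bytes = hashlib.sha256(seed + index.to_bytes(4, "big")).digest()
    signing_key = SigningKey.from_string(private_bytes, curve=SECP256k1)
    return signing_key, signing_key.get_verifying_key()

# Key pair for index 0 of a channel; the IOTA address for this index
# would be derived from the public key (derivation details omitted here).
sk0, pk0 = channel_keypair(b"channel seed shared with the recipient", 0)

Since the keys live on secp256k1, each pair is also Ethereum-compatible, as noted above.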
Every proof provision channel is generated for a fixed recipient and is ciphered with their credentials. This means that, even though the channel’s proofs can be read by anyone, only the owner can read the identity of the user who generated the data, as they hold the private key that deciphers the channel identity.
Why does the proof provision channel contain both public and private information? Because it has to meet the needs of two parties: the dataset acquirer and the provision application.
- Dataset acquirers need access to the proofs to check the quality of the dataset they just purchased; this is why the proofs are stored in plain text.
- The provision application needs to know who generated each data point in order to reward users accordingly. However, if we published the proofs together with the information of the user who generated them, a malicious party could try to infer who the real user behind the data is. Therefore, we cipher this sensitive part so that it is only accessible to the provision application.
At this point, all data proofs are represented in the channels, but when are they written to the IOTA Tangle? That is the responsibility of the provision strategy.
Provision Strategy
The provision strategy defines when and how the computed proofs are sent to IOTA. Proofs are first stored in a buffer until a condition is met; then the process that sends them to the IOTA Tangle is triggered.
Given the nature of the data we manage, which consists of GPS locations, we temporarily store all of it in a buffer before sending it to IOTA. The main reason is that we have a constant flow of small data entries, so buffering optimizes the writing process: we do not need to write to IOTA every single time new data is received.
We have defined a simple provision strategy with the following rules for writing to IOTA: (i) when the buffer is full (it holds 24 proofs), and (ii) after a predefined amount of time without new data (a timeout).
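As a rough illustration, the strategy could be implemented along these lines. The 24-proof capacity comes from the rationale below; the timeout value and the send_to_iota callback are placeholders in this sketch, not production code.

import time

class ProvisionStrategy:
    """Buffer proofs and flush them when the buffer is full or a timeout expires."""

    MAX_PROOFS = 24          # proofs that fit in one IOTA message (see below)
    TIMEOUT_SECONDS = 300    # hypothetical value, chosen for illustration only

    def __init__(self, send_to_iota):
        self.send_to_iota = send_to_iota   # callback that writes a batch of proofs
        self.buffer = []
        self.last_received = time.monotonic()

    def add_proof(self, proof: str) -> None:
        self.buffer.append(proof)
        self.last_received = time.monotonic()
        if len(self.buffer) >= self.MAX_PROOFS:   # rule (i): buffer is full
            self.flush()

    def check_timeout(self) -> None:
        # Rule (ii): called periodically; flush after a quiet period with no new data.
        if self.buffer and time.monotonic() - self.last_received > self.TIMEOUT_SECONDS:
            self.flush()

    def flush(self) -> None:
        self.send_to_iota(list(self.buffer))
        self.buffer.clear()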
The rationale for writing 24 proofs per transaction is the following. We generate the proof of a single data entry using SHA-256 and encode the result in Base64. This yields 44 ASCII characters, which take 88 trytes. Example:
data entry: {“value”: “1”}
sha256 in base64: lsAQZqFogyVN4rwbbGQz2IXn4RXgW3jGg6+/qsxdkrI=
trytes: 9DGDKB9CICEDPBCDVCMDECXBYAFDKDQCQCQB9CNDWASBGCBDYAACGCVCFCXAYCQBVC9BPATAEDGDLDSCZCFDSBGB
Given that an IOTA message has a capacity of 2187 trytes, we can fit 24 proofs (24 × 88 = 2112 trytes), and the remaining 75 trytes are padded with 9s.
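You can reproduce the size calculation with a few lines of Python. The ASCII-to-trytes conversion below is the standard IOTA encoding (two trytes per byte); the exact hash you obtain for the example depends on how the data entry is serialized before hashing.

import base64
import hashlib

TRYTE_ALPHABET = "9ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def proof_of(data_entry: str) -> str:
    """SHA-256 of a data entry, encoded in Base64: always 44 ASCII characters."""
    digest = hashlib.sha256(data_entry.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

def ascii_to_trytes(text: str) -> str:
    """Standard IOTA ASCII-to-trytes encoding: each byte becomes two trytes."""
    out = []
    for byte in text.encode("ascii"):
        out.append(TRYTE_ALPHABET[byte % 27])
        out.append(TRYTE_ALPHABET[byte // 27])
    return "".join(out)

proof = proof_of('{"value": "1"}')
trytes = ascii_to_trytes(proof)
assert len(proof) == 44 and len(trytes) == 88

# 24 proofs per message: 24 * 88 = 2112 trytes, leaving 2187 - 2112 = 75
# trytes of padding, which are filled with '9' characters.
message = "".join(ascii_to_trytes(p) for p in [proof] * 24).ljust(2187, "9")
assert len(message) == 2187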
Workflow
The proof provision workflow has the following steps:
- Compute data hashes. As new data is received, a hash is computed for each data entry and stored in a buffer. This hash is generated with SHA-256 [1] and encoded in Base64 [6], and it serves as the proof of the data: applying the hash function to a particular piece of data always returns the same hash, yet it is practically impossible to recover the data from the hash alone. This property is what lets us use the hash in the verification process.
- Store and empty the buffer. Once the provision strategy triggers a write, all the hashes in the buffer are stored under the corresponding index of the proof provision channel. The buffer is then emptied, ready for the next time the provision strategy fires.
- Generate the bundle. The next step is to generate a bundle with all the information (a sketch of the bundle contents is shown after this list). The bundle consists of three transactions:
Tx1: The first transaction contains the proofs computed in the first step, up to a maximum of 24.
Tx2: The second transaction includes metadata consisting of: (i) the public key of the next index of the proof provision channel (which also determines the next address), and (ii) the identity of the user who generated the data. To preserve their anonymity, this identity is ciphered with a public key issued by the provision application, so only the provision application can read it. The ciphertext is generated using the Elliptic Curve Integrated Encryption Scheme (ECIES) [7] with a random symmetric key, so the ciphertext of the same identity varies between bundles.
Tx3: The third transaction contains an ECDSA signature of a JSON object composed of the metadata written in Tx2 and the data itself. Note that the private key used to sign is not the user’s identity key, but the channel private key derived from the seed and the current index.
- Write to IOTA. Finally, the bundle is written to the IOTA Tangle at the address that belongs to the public key of the corresponding index of the proof provision channel.
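Here is a simplified sketch of the bundle contents in Python, assuming the ecdsa package for secp256k1 signatures and the eciespy package for ECIES. The field names, the JSON layout and the exact payload that is signed are simplifications for illustration; the tryte conversion and the attachment to the Tangle are omitted.

import json

from ecdsa import SigningKey, SECP256k1  # pip install ecdsa
from ecies import encrypt                # pip install eciespy (ECIES on secp256k1)

def build_bundle(proofs, next_public_key_hex, user_identity,
                 app_public_key_hex, channel_signing_key):
    """Assemble the contents of the three transactions described above."""
    # Tx1: up to 24 proofs (Base64-encoded SHA-256 hashes).
    tx1 = proofs[:24]

    # Tx2: metadata. The user identity is ciphered for the provision
    # application with ECIES, so only that application can recover it;
    # the random symmetric key makes the ciphertext differ between bundles.
    ciphered_identity = encrypt(app_public_key_hex, user_identity.encode("utf-8"))
    tx2 = {
        "next_public_key": next_public_key_hex,   # next channel index / next address
        "ciphered_identity": ciphered_identity.hex(),
    }

    # Tx3: ECDSA signature over the metadata and the proofs, made with the
    # channel key of the current index (not the user's identity key).
    payload = json.dumps({"metadata": tx2, "proofs": tx1}, sort_keys=True)
    tx3 = channel_signing_key.sign(payload.encode("utf-8")).hex()

    return tx1, tx2, tx3

# Example usage with the channel_keypair helper sketched earlier:
# sk, pk = channel_keypair(seed, index)
# bundle = build_bundle(proofs, next_pk_hex, "user-identity", app_pk_hex, sk)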
It is worth highlighting that all information stored in the proof provision channel must be communicated to the provision app (a third-party app). The channel information, together with the shared data itself and some authorization information, allows the third-party app to verify the proofs, request rewards from GeoDB and give the corresponding reward to each user identity. As mentioned earlier, we will cover this part in a future post.
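To give a rough idea of that verification, the sketch below only checks that the shared data entries match the proofs read from the channel. Signature checks, authorization and the reward flow are left out, and it assumes the data entries are serialized exactly as they were when the proofs were computed.

import base64
import hashlib

def proof_of(data_entry: str) -> str:
    # Same proof computation as before: SHA-256 encoded in Base64.
    digest = hashlib.sha256(data_entry.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

def verify_proofs(shared_data_entries, channel_proofs) -> bool:
    """Recompute the proof of every shared data entry and check that it
    appears among the proofs read from the proof provision channel."""
    stored = set(channel_proofs)
    return all(proof_of(entry) in stored for entry in shared_data_entries)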
Concluding remarks
We have presented a proof provision protocol based on IOTA that guarantees the validity of data. This is absolutely essential for commercializing data, as it provides a mechanism for trusting the quality of the data itself.
“To be trusted is a greater compliment than being loved.” George MacDonald.
We hope you have found this post interesting. If you have any questions or feedback, please feel free to contact us.
Keep in touch!
References
[1] “SHA-2 Cryptographic functions” https://en.wikipedia.org/wiki/SHA-2
[2] Handy, P. “Introducing Masked Authenticated Messaging” https://blog.iota.org/introducing-masked-authenticated-messaging-e55c1822d50e
[3] “Merkle tree” https://en.wikipedia.org/wiki/Merkle_tree
[4] “IOTA Streams alpha” https://blog.iota.org/iota-streams-alpha-7e91ee326ac0
[5] Cook, J.D. “Bitcoin key mechanism and elliptic curves over finite fields” https://www.johndcook.com/blog/2018/08/14/bitcoin-elliptic-curves/
[6] “Base64” https://en.wikipedia.org/wiki/Base64
[7] “Integrated Encryption Scheme” https://en.wikipedia.org/wiki/Integrated_Encryption_Scheme
* Note that the app is currently running on GeoDB’s Testnet.