Bionetta: Ultimate Client-Side ZKML. Technical Overview

May 05, 2025

Dmytro Zakharov, Lasha Antadze, Oleksandr Kurbatov

Introduction

The security of current neural networks is comparable to the security of the Internet in its early days. While all major companies are focused on performance, there is a massive gap in AI safety. We can't trust that there aren't hundreds of low-paid workers behind the model. We can't be sure the model doesn't lie. We can't verify the model's output. We can't be sure it doesn't collect personal data about its clients (actually, it does) and doesn't share it with anyone.

As AI expands across the world, we need a fundamentally different approach to how it operates. We introduce Bionetta🌿 -- a zkML framework designed to resolve some of these security concerns and achieve the following set of properties:

  • Provability. The verifier can check the correctness of the model execution.
  • Zero-knowledge. The verifier cannot extract any information except that the statement is correct. For example, a user can scan their face and create a proof of likeness, but no biometric data is shared with the verifier.
  • Client-side. We try to avoid any data leakage to a trusted provider: the framework allows for creating models whose proofs are computationally cheap to generate.

What are neural networks, anyway?

Simply explained, a neural network f is a function parameterized by a large number of parameters (which we call θ). The neural network takes an input x and outputs a prediction y. Formally, we write this as y = f(x; θ). The inference process is illustrated below in Figure 1.

Image

 Figure 1. High-level abstraction of the neural network.

As an example, consider the linear regression model, taking three inputs x = (x₁, x₂, x₃). The model f can be parameterized by weights θ = (θ₀, θ₁, θ₂, θ₃), and the function itself looks as follows: f(x₁, x₂, x₃; θ₀, θ₁, θ₂, θ₃) = θ₀ + θ₁ x₁ + θ₂ x₂ + θ₃ x₃

When we train the model (aka neural network), we adjust the parameters θ₀, …, θ₃ to minimize the so-called loss value, representing how poorly our model performs on the training data. After the training, we fix values θ̃₀, …, θ̃₃, and whenever we need to gain some information from the incoming input x₁, x₂, x₃, we simply compute f( x; θ̃ ) = θ̃₀ + θ̃₁ x₁ + θ̃₂ x₂ + θ̃₃ x₃

Note! The reason we suddenly decided to recall machine learning theory (although quite superficially) is that one of the key optimizations we came up with is closely related to how we interpret the parameters θ. Keep that in mind!

So, where is ZK?

Now, having the neural network f(x;θ), how do we wrap it in a zero-knowledge system? More importantly, what are the goals we are pursuing? In ZKML, we have two primary use-cases:

  1. The model parameters are private (the architecture is still public!), but the input is public. For example, OpenAI might want to hide the weights of their new model (which they do, by the way), but in order to prove to the user that it is indeed this neural network that generated the output, they might use some existing ZKML framework (such as EZKL).
  2. The model parameters are public, but the input is private. This is the case we are primarily interested in, and it is called client-side proving. For instance, the neural network might be a face recognition system, while the input is the person’s photo.

This way, to implement the neural network circuit, we need to provide two sets of signals: the input values x̂ (formatted in a specific way to fit in the circuit, which we call the “quantized value” and denote with a hat) and the quantized weights θ̂. Depending on the use case, we either conceal x̂ or θ̂. Then, we represent all the calculations of the original neural network inside the circuit f̂ so that y = f(x; θ) and ŷ = f̂( x̂; θ̂ ) represent approximately the same value. That is pretty much it.

If the only goal is to check the output correctness, we might add another public signal ŷc — claimed quantized output. The circuit can then check whether f̂( x̂; θ̂ ) ≈ ŷc (for example, by checking that the distance (norm) between f̂( x̂; θ̂ ) and ŷc is below a certain small threshold ε).

The whole process is depicted below, in Figure 2. First, we calculate the neural network output y = f(x; θ) and quantize this value to get the claimed output ŷ. Then, given the neural network circuit f̂ and quantized input x̂, we compute f̂( x̂; θ̂ ) and, inside the zero-knowledge prover, assert that ŷ ≈ f̂( x̂; θ̂ ), yielding the proof π. Finally, the user can publish (ŷ, π).

Image

 Figure 2. Overview of ZKML Systems
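
To make this check concrete, below is a minimal Circom sketch of the claimed-output comparison (our illustration, not Bionetta's actual VerifyDist template that appears later in this post). It assumes circomlib's LessThan comparator and that both the squared distance and the threshold fit into B bits.

include "circomlib/circuits/comparators.circom";

// Assert that the squared distance between the in-circuit output y
// and the claimed quantized output yc is below a public threshold eps.
// Assumption: the squared distance and eps fit into B bits (B <= 252),
// so circomlib's LessThan(B) is sound.
template CheckClaimedOutput(n, B) {
    signal input y[n];    // f^(x^; theta^), computed in-circuit
    signal input yc[n];   // claimed quantized output (public)
    signal input eps;     // threshold (public)

    signal diff[n];
    signal sq[n];
    var acc = 0;
    for (var i = 0; i < n; i++) {
        diff[i] <== y[i] - yc[i];
        // Squaring also works for "negative" (wrapped-around) differences,
        // since (p - a)^2 = a^2 in the field.
        sq[i] <== diff[i] * diff[i];
        acc += sq[i];
    }
    signal dist2 <== acc;

    component lt = LessThan(B);
    lt.in[0] <== dist2;
    lt.in[1] <== eps;
    lt.out === 1;
}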

Bionetta Improvements

Circuit-embedded Weights: Exploiting R1CS

So what is so special about Bionetta🌿? As said earlier, we specifically target the case when the input x is private while the weights θ are public, which leads to very neat optimization tricks when implemented in an R1CS system. To see the trick, suppose you want to implement in zero knowledge the calculation of the following simple model, consisting of a single matrix-vector multiplication taking three elements and outputting two elements (in fancy terms, a multidimensional linear regression):

f(x; W) = Wx, where W = (wᵢ,ⱼ) is a 2×3 weight matrix and x = (x₁, x₂, x₃) is the input vector.

For simplicity, assume that all parameters are already properly quantized, so we are working over elements from some field. So, what is the number of gates/constraints needed to implement the verification of the function f(x;W)?

All current approaches (including EZKL and keras2circom) do the following: we introduce public weight signals w₁,₁ … w₂,₃ and private input signals x₁, x₂, x₃, and compute six products wᵢ,ⱼxⱼ for i ∈ {1, 2} and j ∈ {1, 2, 3}. Finally, we compute the corresponding sums to get the output vector elements. An example implementation in Circom might look as follows:

template MatVecMul2x3() {
    // Public input: weights matrix W of size 2x3
    signal input W[2][3];

    // Private input: three input neurons x
    signal input x[3];

    // Output: matrix-vector product result y = W*x
    signal output y[2];

    // Each signal-by-signal product needs its own intermediate signal
    // (and thus its own constraint): six in total.
    signal prod[2][3];

    for (var i = 0; i < 2; i++) {
        var sum = 0;
        for (var j = 0; j < 3; j++) {
            prod[i][j] <== W[i][j] * x[j];
            sum += prod[i][j];
        }
        y[i] <== sum;
    }
}

Halo2/PlonK-ish circuits would have a similar structure. It seems like the end of the story: we have successfully implemented the circuit.

🪄 The Trick

However, what if we tell you that calculating such f(x;W) costs zero constraints in Bionetta🌿? The trick is almost obvious and lies on the surface, but we have not seen it anywhere in the ZKML jungle! Behold:

All neural network weights must not be signals, but must be embedded into the circuit as constants. Such construction would save LOADS of signals.

Indeed, when implemented in R1CS, any multiplication by a constant, as well as any addition, is essentially free. Consider the general example of f(x; W) = Wx, where the weight matrix W is of size n×m (and thus x is of size m).

Image

Note that such an optimization might probably be exploited in PlonK somehow, but a direct implementation would not give the same savings: additions and multiplications by constants in PlonK still require separate gates and thus cannot be implemented as efficiently as in R1CS.

With such an approach, classical machine learning methods such as linear regression, PCA/LDA, or even logistic regression cost roughly zero constraints.
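
As an illustration, here is a minimal sketch of the same 2×3 matrix-vector product with the weights embedded as compile-time constants (our example, not Bionetta's generated code; passing the weight matrix as a template argument is the same pattern the autogenerated circuit below uses with get_dense_W()):

// W is a compile-time constant array, not a signal: every term W[i][j] * x[j]
// is a constant times a signal, i.e. a linear expression.
template MatVecMulConst2x3(W) {
    signal input x[3];
    signal output y[2];

    for (var i = 0; i < 2; i++) {
        var sum = 0;
        for (var j = 0; j < 3; j++) {
            sum += W[i][j] * x[j]; // linear: no multiplication constraint
        }
        // A purely linear constraint, which the compiler's linear
        // simplification typically eliminates entirely.
        y[i] <== sum;
    }
}

Compared to the previous template, the only difference is where the weights live, yet the six multiplication constraints disappear.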

Bad News

Free linearities are great news, but here is the bad news: the number of non-linear activations in modern neural networks is still overwhelming. Indeed, consider, for example, the simplest fully connected neural network with a single layer:

f(x; W, β) = ϕ(Wx + β), where the non-linear activation ϕ is applied element-wise.

And the corresponding illustration of this execution:

Image

 Figure 3. Illustration of the MLP computation.

Note that this is just the previous matrix-vector product example, but here we additionally apply the non-linear function ϕ to each output value.

While PlonKish-based approaches can handle such a problem quite efficiently by utilizing lookup tables, R1CS in its vanilla form does not possess such power. For this very reason, verifying the computation of y = ϕ(x) requires, for the most basic neural network activations such as ReLU, roughly 255 constraints per call (corresponding roughly to the bitsize of the prime field). In the MLP above, for example, verifying the computation of ϕ(Wx + β) would cost zero constraints for computing Wx + β, but applying ϕ element-wise would add roughly 255n constraints (see the sketch below for where this cost comes from). How do we solve this problem?
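
To see where these constraints come from, here is a naive ReLU sketch (our illustration, assuming circomlib's bit decomposition and the sign-aware encoding described later in this post):

include "circomlib/circuits/bitify.circom";

// Naive ReLU in vanilla R1CS: to learn the sign of x, we bit-decompose it,
// which costs roughly one constraint per bit of the field. (At the full field
// width, a strict decomposition such as Num2Bits_strict is needed to avoid
// aliasing; we omit this detail here.)
template NaiveReLU(FIELD_BITS) {
    signal input x;
    signal output out;

    component bits = Num2Bits(FIELD_BITS); // ~FIELD_BITS constraints
    bits.in <== x;

    // Under the sign-aware encoding, the top bit tells whether x is "negative".
    signal isNeg <== bits.out[FIELD_BITS - 1];
    out <== (1 - isNeg) * x; // x if non-negative, 0 otherwise
}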

As of now, we propose the following two solutions:

  1. Build a custom neural network architecture that uses an insignificant number of non-linearity calls, but can use a high degree of linearity (e.g., large matrix products and such).
  2. Implement the custom look-up logic for R1CS.

Both approaches have their own solutions and implementations, each worth its own blog post, so we will skip the details here and move on.

Summary

  • All linear operations inside our framework are free.
  • Every non-linear call costs roughly 255 constraints. However, this number will be significantly reduced once the upcoming optimization tricks are introduced.
  • If the neural network is built in a specific manner — that is, it utilizes many linear calls with a relatively low number of activations — we can build efficient circuits requiring a negligible amount of resources for both proving and verification.
  • Nonetheless, our benchmarks show that Bionetta🌿 (V1) is faster than existing approaches even when assessed on models with a significant number of activations.

Effective Quantization

How to perform operations over floats?

Now, the previous discussion completely omitted the fact that we need to somehow work not over the prime field, but over floats/doubles. In other words, proving that x × y = z is very straightforward in the circuit if x, y, and z are assumed to be field elements, but how do we prove such statements when x, y, and z are arbitrary real numbers (say, 0.2 × 0.3 = 0.06)?

While there are many amazing papers out there explaining how to prove such statements (see “Zero-Knowledge Location Privacy via Accurate Floating-Point SNARKs” by Jens Ernstberger et al., for instance), we are not looking for perfect accuracy. For that reason, we are going to use a standard technique: instead of multiplying x × y directly, we quantize the values first: that is, we find x̂ = ⌊2^ρ x⌉ and ŷ = ⌊2^ρ y⌉ (where ⌊·⌉ denotes rounding), and then compute the product ẑ ← x̂ · ŷ. We call ρ the precision coefficient, which specifies the degree of accuracy we want from the computation. The question, though, is how to interpret the result of such a product: in other words, how do we get the value of z? The answer is very simple: simply divide by 2^{2ρ}! That being said, z ≈ 2^{-2ρ} ẑ.

Example. Suppose we want to multiply x = 0.2 by y = 0.3. We choose the precision coefficient ρ := 10. This way, the quantization is performed as follows:

x̂ = ⌊2¹⁰ · 0.2⌉ = ⌊204.8⌉ = 205

ŷ = ⌊2¹⁰ · 0.3⌉ = ⌊307.2⌉ = 307

Then, one multiplies the two values to get

ẑ = x̂ · ŷ = 205 · 307 = 62935.

To get the “real” value of z, we divide the result by 2²⁰, so we get

z ≈ 62935 · 2⁻²⁰ ≈ 0.0600195.

As one might see, this is close to the expected value of 0.06.

Now, the idea that we need to divide by 2^{2ρ} is not random: notice that x̂ · ŷ is precisely ⌊2^ρ x⌉ · ⌊2^ρ y⌉, but since we expect both 2^ρ x and 2^ρ y to be large enough (ρ is in practice larger than 20), we can drop the rounding operation:

⌊2^ρ x⌉ · ⌊2^ρ y⌉ ≈ 2^ρ x · 2^ρ y = 2^{2ρ} xy.

So, all in all,

z ≈ 2^{-2ρ} x̂ · ŷ,

as specified above.

➖ How to handle negative floats?

Now, since we need to encode numbers into finite-field elements, we need to somehow handle negative values. The idea here is simple: during the quantization step, if the number x is negative, we simply set x̂ ← p + ⌊2^ρ x⌉ (where p is the prime field order).

In this scenario, to check whether the quantized value x̂ encodes a positive or a negative number, one typically checks whether it falls in the range [0, (p − 1)/2] or ((p − 1)/2, p), so that we allocate roughly p/2 “slots” equally for positive and negative integers. In the former case, we assume the integer is positive, while in the latter that it is negative.

Note. However, we noticed that the range check on [0, (p − 1)/2] can sometimes be suboptimal, so we decided to change it to [0, 2^{b−1}), where b is the bit size of the prime field order p. This way, to check the sign of some integer, we simply inspect its (b − 1)-th bit. This still does not save us from b constraints per range check, but certain complex operations (such as computing LeakyReLU) might be better optimized.

Bad News (again)

This sounds awesome, but we encounter the same issue that keras2circom once faced. Suppose our goal is to compute f(x) = x¹⁰ with the precision coefficient ρ := 35. In this case, if we compute the quantization x̂ and calculate the result as x̂¹⁰, we would be screwed: raising a (roughly) 35-bit integer to the 10th power results in a (roughly) 350-bit integer, which exceeds the allowed prime field order. Needless to say, the result of such an operation would be completely incorrect (it would, in fact, be x̂¹⁰ mod p).

One way to fix this problem is to use a smaller ρ. For example, setting ρ := 20 would suffice: in that case, x̂¹⁰ would result in a (roughly) 200-bit integer, which is below the allowed 253 bits. However, what if we told you to compute g(x) = x¹⁰⁰? You would need to set ρ := 2 (for ρ = 3, the result would already have 300 precision bits), resulting in a drastic drop in accuracy.

Example. To see why, suppose we want to compute g(0.99) in-circuit. You would proceed as follows: you quantize the value of x = 0.99 as

x̂ ← ⌊2^ρ · 0.99⌉ = ⌊2² · 0.99⌉ = ⌊3.96⌉ = 4.

Then, you compute

g(x̂) = x̂¹⁰⁰ = 4¹⁰⁰

and divide the result by

2^(100 ρ) = 2²⁰⁰.

So the result is 🥁…

z = 4¹⁰⁰ / 2²⁰⁰ = 1,

while the real result equals approximately 0.366.

To resolve the issue, we propose to simply cut the precision at certain intermediate steps. Now, again, suppose we are computing  f(x) = x¹⁰ with ρ = 35. We can split the function execution into two parts: f(x) = x⁶·x⁴. Then, we compute the quantization x̂ and compute the first part: x̂ ⁶.

If we were to extract the “real” value of this intermediate result, we would divide it by 2^{6ρ} = 2²¹⁰ (note that we have no issues here, since we have 210 bits of precision — significantly less than the allowed 253 bits). However, instead of dividing by 2^{6ρ}, we are going to divide by 2^{5ρ}.

Why? Because such division would correspond (again, approximately) to the quantized value of this intermediate result. This way, denoting the obtained value by r̂, we can compute the final result as r̂ · x̂⁴, which would have 5ρ = 175 bits of precision. To get the final result, we divide by 2¹⁷⁵.
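
Here is a sketch of this splitting for x¹⁰ with an intermediate precision cut (our own illustration under the stated assumptions, not Bionetta's actual code; the template name and parameters are ours). The cut reuses circomlib's bit decomposition and assumes x̂ encodes a non-negative value whose sixth power fits into NBITS bits:

include "circomlib/circuits/bitify.circom";

// f(x) = x^10 = x^6 * x^4 with a single precision cut in between.
// Input:  xq = round(2^rho * x), assumed non-negative.
// Output: x^10 quantized with 5*rho precision bits.
template PowTenWithCut(rho, NBITS) {
    signal input xq;
    signal output out;

    // First part: x^6 -- about 6*rho precision bits, still below the field size.
    signal x2 <== xq * xq;
    signal x3 <== x2 * xq;
    signal x6 <== x3 * x3;

    // Precision cut: drop the low 5*rho bits, leaving ~rho precision bits,
    // i.e. (approximately) the quantized value of the real number x^6.
    component bits = Num2Bits(NBITS); // also acts as a range check on x^6
    bits.in <== x6;
    var r = 0;
    for (var i = 5 * rho; i < NBITS; i++) {
        r += bits.out[i] * 2 ** (i - 5 * rho);
    }
    signal rq <== r;

    // Second part: multiply by x^4, ending up at 5*rho precision bits,
    // which are removed outside the circuit during dequantization.
    signal x4 <== x2 * x2;
    out <== rq * x4;
}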

Example. Suppose we are computing f(0.85) in-circuit with ρ = 35. The quantization of

x = 0.85 yields x̂ = 0x6cccccccd.

Then, we compute x̂⁶ and cut its precision by dividing by 2^{5ρ}, yielding the intermediate result

r̂ = 0x30466f718.

We finally multiply this value by x̂⁴ to get

0x19332e33e75e84b365b56f25da9eaf711fe077c49800.

To interpret this result back to floats, we divide it by 2^{5ρ} = 2¹⁷⁵, getting roughly 0.196874, which almost precisely coincides with the “real” value 0.85¹⁰ ≈ 0.19687.

The whole idea of our arithmetization scheme is depicted below.

Image

Figure 4. Illustration of the proposed quantization. First, we split the program into subprograms (in our case, separate neural network layers) and after completing every subprogram, we cut the output precision and proceed further. We repeat this until the final result is obtained, which can be easily dequantized.

The biggest issue, though, is that the shift operation is very costly to compute inside the circuit (in fact, the cost of such an operation is roughly 255 constraints — the bitsize of the prime field). We solve this issue by noticing the following:

Computing ReLU and then cutting precision costs 255 constraints, in contrast to the expected 510. This comes from the fact that both the ReLU and the precision cut require the bit decomposition of an integer, and thus we can reuse a single decomposition for both operations.

This way, we can naturally cut precision after each activation applied during the neural network inference, simultaneously applying the non-linearity and taking care of the accumulated precision. A sketch of such a fused operation is given below.
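
For instance, a fused ReLU-with-precision-cut could look roughly like the following (our simplified illustration, not Bionetta's actual VectorLeakyReLUwithCutPrecision template): a single bit decomposition serves both the sign check and the right shift by SHIFT bits.

include "circomlib/circuits/bitify.circom";

// ReLU fused with a precision cut: one Num2Bits call (~FIELD_BITS constraints)
// is reused for both the sign check and the shift. (As before, a strict
// decomposition is required at the full field width to avoid aliasing.)
template ReLUCutPrecision(FIELD_BITS, SHIFT) {
    signal input x;
    signal output out;

    component bits = Num2Bits(FIELD_BITS); // the only expensive part
    bits.in <== x;

    // Sign bit under the sign-aware encoding described above.
    signal isNeg <== bits.out[FIELD_BITS - 1];

    // Reassemble x >> SHIFT from the same bits -- a free linear combination.
    var shifted = 0;
    for (var i = SHIFT; i < FIELD_BITS; i++) {
        shifted += bits.out[i] * 2 ** (i - SHIFT);
    }
    signal xShifted <== shifted;

    // Output the precision-cut value for non-negative inputs and 0 otherwise.
    out <== (1 - isNeg) * xShifted;
}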

In fact, there are even more tricks under the hood (for example, you can compute the LeakyReLU function max{ x, αx } in roughly 256 constraints when α is a negative power of two). Yet, we will describe them more formally in our upcoming paper.

Summary

  • The proposed quantization scheme provides almost perfect accuracy for large enough ρ (in practice, values of ρ around 20-35 suffice).
  • The proposed quantization scheme does not impose any additional constraints besides precision cuts, which can be applied together with the non-linearity calls.
  • The proposed quantization scheme allows using a wide range of modern activation functions: ReLU, LeakyReLU, and even functions like hard sigmoid or hard swish, without losing a single percent of accuracy.

👨🏻‍💻 Example Circuit

Suppose we want to implement a very simple neural network that recognizes digits from an image of size 28 × 28. For that reason, we specify the following architecture: we flatten the input neurons, connect them to 64 neurons with LeakyReLU used as the non-linearity (with α = 1/32), and finally use 10 output neurons. Simply speaking, each output neuron represents the model’s confidence that the corresponding digit is depicted (for example, if the output were [0, 0, 0, 1.0, 0.01, 0.2, 0, 0, 0, 0], it would mean that the neural network is most confident that the depicted digit is three). When this neural network is trained and built in Bionetta🌿 with ρ = 20, we get the following Circom circuit:

// AUTOGENERATED
pragma circom 2.1.6;

include "./weights.circom";
include "../bionetta-circom/circuits/zkml/includes.circom";
include "../bionetta-circom/circuits/hash/poseidonAny.circom";
include "../bionetta-circom/circuits/matrix/matrix.circom";

                   
template Core(FIELD_BITS) {
    var p = 20;
    signal input image[1][28][28];

    signal input address;
    signal addressSquare <== address * address;

    signal input threshold;

    signal input nonce;
    signal nonceSquare <== nonce * nonce;

    signal input features[10];

    component hash_features = PoseidonAny(10);
    hash_features.inp <== features;
    signal output featuresHash <== hash_features.out;

    component layer1Comp = Flatten(28,28,1,64,get_dense_W(),get_dense_b());
    layer1Comp.inp <== image;

    component layer2Comp = VectorLeakyReLUwithCutPrecision(64,p,2*p,5,FIELD_BITS);
    layer2Comp.inp <== layer1Comp.out;

    component layer3Comp = Dense(64,10,get_dense_1_W(),get_dense_1_b());
    layer3Comp.inp <== layer2Comp.out;

    component lastLayerComp = VectorLeakyReLUwithCutPrecision(10,p,2*p,0,FIELD_BITS);
    lastLayerComp.inp <== layer3Comp.out;

    component verifier = VerifyDist(10, FIELD_BITS);
    verifier.in1 <== lastLayerComp.out;
    verifier.in2 <== features;
    verifier.threshold <== threshold;

    signal output isCompleted <== verifier.out;
}

component main{public [threshold, nonce, address]} = Core(254);

Notice that instead of simply executing the layer logic (as keras2circom does, for instance), we also cut the precision to handle overflows (see, for example, the VectorLeakyReLUwithCutPrecision template). Our automatic tools spot the places where an overflow might occur, so you do not need to care much about handling precision when working with Bionetta🌿.

Note that, in contrast to specifying matrices and biases as signals, we use them as constants hardcoded in the file weights.circom, which looks… well… like this:

function get_dense_W(){
	var dense_W[64][784];
	dense_W = [[
		0xb75d, 0xad2f, 0x2ebd, 0x5e67,
		0xa981, -0xeaba, -0x60e6, 0x16bc,
		0x126b, -0x7c81, -0x13ba9, -0xb458,
		0x11a1, 0x511b, 0xb117, 0x5c92,
		-0xf179, -0x4855, 0x15a9, -0x10b14,
		0x2e42, 0x146d, -0x541c,...
	]];
	return dense_W;
}

// ...Endless parameters specified here...

function get_dense_1_b(){
	var dense_1_b[10];
	dense_1_b = [
		0x12ff2dc000,0x137ab6e00,0x13d35b2000,
		0xf43e25000,0xfec98a000,0x14e6c40000,
		0x125e2fa000,0x144bc40000,
		0x18c2c4c000,0x12979f8000
	];
	return dense_1_b;
} // Oofff, that's the last one

Now, to pass the input to this neural network, we multiply the input image (formally, a tensor of shape (28, 28, 1) filled with float32 entries) by 2²⁰ element-wise and round the result. For example, suppose we pass the following quantized image:

Image

Figure 5. Example input to the neural network.

The result of such an operation, if we check the witness, is the following (to make the notation smaller, if an integer exceeded 2²⁵³, we subtracted p to get the “sign-aware” format):

Image

To get the “real” result, we divide all values by 2²⁰.

Image

As can be seen, the neural network successfully identified the digit and predicted two. In fact, the number of constraints turned out to be just slightly above 19,500: such a circuit is very small, and the statement can be proven and verified cheaply.
