[=========================================================================] [-------------------------[ ECDLP Quick & Dirty ]-------------------------] [=========================================================================] _==####==_ .############. .################. .##################.__ __ __ __ _ _ __ __ ##############/¯_`|#\ V V // _` | '_/ -_) ##############\__,|# \_/\_/ \__,_|_| \___| ###########*¯*###### *#########{ }####* ##########._.##### ################ (C) 2006 by .aware computer security research group *############* ¯==####==¯ Web: http://awarenetwork.org Author: rattle@awarenetwork.org ------------------------------------------------[ Table of Contents ]----- i. Abstract ii. Preface iii. Mathematical Background iv. Implementation Issues - Fields v. Implementation Issues - Curves vi. Implementation vii. Conclusion viii. References -[ i ]----------------------------------------------------[ Abstract ]----- We aim to implement a very small and versatile assymetric key exchange algorithm for x86 platforms, based on the Elliptic Curve Discrete Logarithm Problem (ECDLP). This algorithm allows 2 parties to exchange a key of sufficient length for subsequently encrypting traffic with a symmetric cipher. NOTE: This paper might offend the mathematically schooled among you because I was not exactly precise, let alone thorough. -[ ii ]----------------------------------------------------[ Preface ]----- "I encrypt all my data with one-time pad's that consist of an electrometre measuring me urinating into the toilet. I'm secure." - sirukin It has come to my attention that modern cryptography, while being on everyone's favourite feature list, is hardly found explained in papers with the due clarity it requires. More accurately, most articles give you a rough idea of how certain cryptographic schemes work and, succeed in completely failing to give you enough insight (or tools) to implement an actually secure, working algorithm. In particular, I am referring to Elliptic Curve Cryptography (ECC), which I will be revolving around for the rest of the night. If everyone has such a compulsive thirst for knowledge - why the utter lack of enlightenment? Pretty simple. Just like the less sophisticated asymmetric schemes (like RSA), the basic idea of ECC can be wrapped up in a couple of fancy sketches and some first semester algebra. However, once you delve deeper into the topic, you will find that it is headblowingly complicated and not nearly as easy as they told you. Don't be intimidated, though. I will work around most problems of ECC with dirty tricks and hardcoded parameters. What we will get, though, is a working and secure ECDLP key exchange algorithm implemented in C for x86 machines. Mac users, rejoice. -[ iii ]-----------------------------------[ Mathematical Background ]----- "Well, just because you don't understand it, does not mean it's complicated." - My Algebra Professor (1) Fields. A field is a mathematical structure F which behaves similar to the real or the rational numbers. For in time, I recommend you think of these structures as we analyze them in a more abstract fashion. Elements of a field can be added and multiplied. Each element a has a uniquely defined additive inverse (-a) such that a+(-a)=0. Also, each element a!=0 has a multiplicative inverse (1/a) such that a * (1/a) = 1. Also, you have a*b=b*a, a*(b*c)=(a*b)*c and (a+b)*c = ac+bc. All this is obvious for real or rational numbers - but there are much more interesting structures that obey the same laws. As you might have already concluded, each field contains the Elements 0 and 1, the neutral Elements of Addition and Multiplication respectively. (Neutral because a+0=a, a*1=a.) Well, since it seems to be the smallest possible one, how about the field F2 = { 0, 1 } with Addition: Multiplication: + | 0 | 1 * | 0 | 1 ---+---+--- ---+---+--- 0 | 0 | 1 0 | 0 | 0 ---+---+--- ---+---+--- 1 | 1 | 0 1 | 0 | 1 ' ' ' ' The addition is a simple XOR operation, the multiplication works just as though 1 and 0 were integer numbers. This means, you use + and * on 0 and 1 as though they were integer numbers, but the result of your operation is the parity of the calculated number. Since there aren't that many scenarios, you can easily convince yourself that you can operate inside F2 just like you can operate on real and rational numbers. While it's an interesting and entertaining example, let's try to generalize the concept. For any prime p, let F(p) = { 0, 1, ..., p-1 } be the field where every operation is performed modulo p. This means, for any a,b in F(p), you have add(a,b) := (a + b) % p mul(a,b) := (a * b) % p It can be shown that the resulting structure is a field. Why primes? I will not go into detail here, but p being a prime is necessary for the existence of a multiplicative inverse for every a != 0. For instance, if we'd try p = 6, you had mul(2,3) = (2*3)%6 = 0, and consequentially, neither 2 or 3 would have a multiplicative inverse. However, for p=7: mul(1,1) = 1 % 7 = 1 mul(2,4) = (2 * 4) % 7 = 8 % 7 = 1 mul(3,5) = (3 * 5) % 7 = 15 % 7 = 1 mul(6,6) = (6 * 6) % 7 = 36 % 7 = 1 These fields are known as prime fields. It can be shown that every finite field (a field with finitely many elements) has p to the power of m elements, where p is a prime and m a positive integer number. We have covered the fields with p elements already, what do the others look like? Consider polynomials with coefficients in F(p). As you might know, polynomials can be added and multiplied. Most axioms of a field are already given, but we are lacking a multiplicative inverse for every polynomial of degree 1 or more. In fact, the polynomials over F(p) are, as a structure, very similar to the integer numbers - there's a polynomial division with rest, just like you have a division with rest for integers. Accordingly, you will also find prime polynomials. You will have to believe me when I tell you the following two things (or read a big book on algebra): - For any m, there exists a prime polynomial of degree m over F(p). - Performing every operation in F(p) modulo a prime polynomial of degree m yields a field structure with p to the power of m elements. We are particularly excited about polynomials over F2, because they can easily be represented as a bitstring (the coefficients of the polynomial are either 1 or 0). We will refer to this structure as F(2,m) where (m+1) is the degree of the prime polynomial we use for reduction. I cannot venture too deep into the maths here. For now, I just want you to know and trust me that F(2,m) works just like real or rational numbers - except for nasty things like a + a = (1 + 1) * a = 0 * a = 0. (2) Elliptic curves. What is an elliptic curve? An elliptic curve, for us, will be the set of all points (x,y) from any field which satisfy an equation of the form y² + xy = x³ + ax² + b (*) Note that this is not the general definition of an elliptic curve. Indeed, this is a "non-supersingular" curve. However, frankly, that's all we are interested in. We will write, for given coefficients a and b: E(F) = { (x,y) : x in F, y in F, y² + xy = x³ + ax² + b } + { @ } In words: E(F) is the set of all tuples (x,y) with x and y in F, such that x and y satisfy the curve equation (*). We also say that x=infinity and y=infinity satisfy the curve equation and call the point @ the "point at infinity". The sketch below shows how these kinds of curves look over the real numbers: | . | | . | . | . |. . . . .| . . .. . | . | . | ----------------------+----------------- . | . | . . .. . | . . . .| |. | . | . | . | | . There is something particularly interesting about these elliptic curves. Provided that P and Q are points on the curve, draw a line through P and Q. It will intersect the curve in a third point R'. The sum of P and Q is then defined as the negative of R', which is simply the reflection of R' about the x-axis. To add a point P to itself, you draw the tangent instead. | / | . | /. R' | / | / . | / | / . |/ / . /| . / | . / |. . . . / .| . . .. . | . / Q | . / | --------------/-------+----------------- . / | . / | . / . .. . | . . . .| / P |. / | . / | . / | . / | / | . / | / | . / | / | . / | X -R' = (P + Q) To be precise, for P = (x,y) we set -P = (x,-y). Set P + (-P) := @. As an interesting result, E(F) together with this addition law is an abelian group (that means, adding these points technically works like like adding integer numbers) with @ serving as the neutral element (the "zero"). How can we put this structure to cryptographic use? First of all, we consider elliptic curves over finite fields. Also, define a function f(n,P) = nP = P + P + ... + P <-- n times --> For very large n, this function is still easy to calculate for a given point P. However - given Q and P, it is very hard to find an integer Number n such that Q = f(n,P). This is called the Elliptic Curve Discrete Logarithm Problem. Now, due to the abelian structure, you have f(n*m,P) = f(n,f(m,P)) = f(m,f(n,P)) for any two integers n and m. Let me show you how we can exchange keys with this little trick. (1) Alice and Bob agree on a Point P in E(F). (2) Alice secretly chooses a big random number n and calculates nP. Bob also chooses big random number m and calculates mP. (3) Alice sends Bob (nP), Bob sends Alice (mP). (4) Alice calculates n(mP) = nmP =: Q. Bob also calculates m(nP) = nmP = Q. (5) Bob and Alice use Q as the key for encrypting traffic. What does Eve know? She knows P, (nP) and (mP). To get Q, she'd need the product nm. However, this would only be possible by solving the ECDLP for (nP) and (mP) (or solving a similarly hard problem). And that's that, folks. Let's get down to business! -[ iv ]-----------------------------[ Implementation Issues - Fields ]----- "If I recally correctly, the compiler does issue optimizer notes for this, it's just that they don't particularly stand out from the other notes so you might not realize that these particular optimization notes are OPTIMIZATION PROBLEMS WHICH WILL BURN ALL YOUR CPU CYCLES IN THE DEEPEST PITS OF COMPUTATIONAL HELL." - William Newman on run-time casting with SBCL (#lisp 2003/03/29) First of all, the ECDLP is hard only if the two following rather strong conditions are met: - The number of elements in your finite Field is larger than at least 2 to the power of 128. 256 bits is better, 512 scares the NSA. - The number of points on your elliptic curve is almost prime, which means that it is of the form |E(F)| = h * p, where p is prime and h very small. This leads to the following problems: (A) Implement sufficiently fast finite field arithmetics for fields with approximately 1.34e+154 elements (that's around 512 bits). (B) Find a curve which has appropriate order. Problem (A) is our biggest problem, as (B) has been solved by government agencies and smart mathematicians around the world. So we're left with the task of implementing F(2,m) for m ~= 512. Polynomials will be implemented as bitstrings of length m, stored in a little endian array of T = (m/32) machine words (32 bits). (1) Shifting / Multiplication with x. Because we store polynomials as bitstrings, shifting is a cheap linear operation. Only consider shifts of l<32 (for shifts larger than our wordsize of 32, we can shift word-wise first in linear time). We now have to shift and rotate through the carry bit at most 31*(m/32) ~ m times. (2) Addition. Adding two polynomials in F(2,m) is trivially linear. It is implemented as T XOR operations. (3) Multiplication. Multiplication might seem more complicated, but is easily implemented in O(m²). It works like this: -------------------------------------------------------------- Bitwise Comb Polynomial Multiplication Algorithm -------------------------------------------------------------- Input: Polynomials p(x), q(x) as bitstrings Output: c(x) = p(x) * q(x) SET c(x) = 0 FOR i=0 TO m DO: IF bit i of p(x) is set: c(x) := c(x) + (q(x) << i) RETURN c(x) -------------------------------------------------------------- You can achieve better complexity with a divide-and-conquer approach as explained in [K1]. However - for m as small as approximately 512 bit, benchmarks for bitwise comb turn out to be faster than benchmarks for divide&conquer, on several systems. Thus, for our purposes, the algorithm is completely sufficient, if not optimal. (4) Squaring. Calculating p(x)² can be implemented as a very fast linear operation due to the following observation which only applies to fields where 1+1=0: (a + b + c)² = (a + b + c) (a + b + c) = a² + ab + ac + ba + b² + bc + ca + cb + c² = a² + b² + c² + (1+1)ab + (1+1)ac + (1+1)bc = a² + b² + c² Thus, given a polynomial p(x) as bitstring, we can calculate p(x)² by simply inserting zeros between all bits. For instance, (1100111)² = (1010000010101) With a wee bit of precomputation, this is easily done in 4T = m/8 steps. (5) Reduction. The algorithms for squaring and multiplication yield polynomials of degree > m. These polynomials need to be reduced modulo a prime polynomial f(x) of degree m+1. For now, assume we have such an algorithm. (6) Inversion. Last, but definitely not least, we need a "division" algorithm. This means, for a given polynomial p(x), find the (uniquely defined) multiplicative inverse polynomial q(x), such that: p(x) * q(x) = 1 mod f(x) where f(x) is our reduction polynomial. To find the Inverse, we use the polynomial version of euclid's extended algorithm: -------------------------------------------------------------- Euclid's Extended Algorithm (Polynomial) -------------------------------------------------------------- Input: Polynomials p(x) and f(x) Output: Polynomials a(x),b(x) and d(x) such that a(x) * p(x) + b(x) * f(x) = d(x) SET a(x) = 1, a´(x) = 0, b(x) = 0, b´(x) = 1 SET u(x) = p(x), d(x) = f(x) WHILE u(x) != 0 DO: SET j := deg(u(x)) - deg(v(x)) IF j < 0: u <-> d a <-> a´ b <-> b´ j := -j u(x) := u(x) + ( d(x)< b" means "exchange a and b". In this particular case, we always have d(x) = 1 because f(x) is prime, and we don't care about b(x). The only thing that is left to be done is reduction. This is the only algorithm that depends on the form of the reduction polynomial f(x). This is also the only algorithm that requires a lot of work and hand-tuned optimization. You can skip the rest of this chapter if you are not interested in this kind of stuff. We will now try to find a reduction polynomial of degree > 512 which is particularly suited for reduction. On the other hand, why try to find one if one has already been found? The [F1] standard suggests reduction modulo the pentanomial -------------------------------------------------------------- Pentanomial f571(x) -------------------------------------------------------------- f(x) = x**571 + x**10 + x**5 + x**2 + 1 ~ (1<<571) + (1<<10) + (1<<5) + (1<<2) + 1 = 0x 08000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000425 -------------------------------------------------------------- Write f(x) = h(x) + r(x). where h(x)= (1<<571). Also assume we have a polynomial p(x) of degree at most 2m = 1140. Write p(x) = b(x)*h(x) + a(x). with polynomials b(x),a(x) of degree m. Because b(x)h(x) = b(x)f(x) + b(x)r(x) We get that b(x)h(x) = b(x)r(x) mod f(x), thus we have p(x) = b(x)r(x) + a(x) mod f(x). Now we're getting down to the hard part. We need to devise a fast reduction algorithm. Assume that the bitstring of P is stored as an array of unsigned little endian machine words p(x) = P[2T], P[2T-1], ..., P[T], P[T-1], ..., P[0] In this case, our polynomial b(x) starts at bitoffset 26 of P[T-1], because we assume a(x) to be a 570 bit polynomial which does not need to be reduced. Consider: <---- P[2T] ---> <--- P[T] ---> <------------ P[T-1] ---------------> {1151} .. {1120} ... {608} .. {576} {575}{574}{573}{572}{571} {570} ..... <---------------------------- B ----------------------------> <--- A ---> This means we have to understand B[i] == (P[i+T]<<5) ^ (P[i+T-1]>>27). Also, obviously A[i] == P[i]. Now, let's forget this for a while and think about how b(x)r(x) is calculated. We have r(x) = (1<<10)+(1<<5)+(1<<2)+1 = 0x425. We get B[i]*R = B[i] ^ (B[i] << 10) ^ (B[i] << 5) ^ (B[i] << 2) and a (maximum 10-bit) carry: C(B,i) := (B[i] >> 22) ^ (B[i] >> 27) ^ (B[i] >> 30) Thus, in one step of our algorithm, we will have to do ------------------------------------------------------------------------- FOR i = 1, ..., T-1 DO P[i] := A[i] ^ B[i] ^ (B[i] << 10) ^ (B[i] << 5) ^ (B[i] << 2) P[i+1] := A[i+1] ^ C(B,i) ------------------------------------------------------------------------- Consider what we had before and we get P[i] = ((P[i+T]<<5) ^ (P[i+T-1]>>27)) ^ ( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) << 5 ) ^ ( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) << 2 ) ^ ( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) << 10 ) ^ P[i] = P[i] ^ (P[i+T]<<5) ^ (P[i+T]<<7) ^ (P[i+T]<<10) ^ (P[i+T]<<15) ^ (P[i+T-1]>>27) ^ (P[i+T-1]>>27)<<2 ^ (P[i+T-1]>>27)<<5 ^ (P[i+T-1]>>27)<<10 (*) P[i+1] = ( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) >> 22) ^ ( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) >> 27) ^ ( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) >> 30) ^ P[i+1] = P[i+1] ^ (P[i+T]<<5)>>22 ^ (P[i+T]<<5)>>27 ^ (P[i+T]<<5)>>30 however, one step later, P[i+1] will be XOR'd with (*). We have, if we look at a bitwise representation of X: Bits, numbered from A to 6: Variable: Label: ABCDEFGHIJKLMNOPQRSTUVWXYZ123456 (X) ^ 0000000000000000000000ABCDE00000 (X>>27)<<5 (A1) ^ 000000000000000000000000000FGHIJ (X<<5)>>27 (A2) ^ 0000000000000000000000FGHIJKLMNO (X<<5)>>22 (B1) ^ 000000000000000000000000000000FG (X<<5)>>30 (C1) ^ 000000000000000000000000000ABCDE (X>>27) (D) ^ 0000000000000000000000000ABCDE00 (X>>27)<<2 (C2) ^ 00000000000000000ABCDE0000000000 (X>>27)<<10 (B2) /\ /\ /\ || || || \/ \/ \/ ABCDEFGHIJKLMNOPQRSTUVWXYZ123456 (X) ^ 0000000000000000000000ABCDEFGHIJ (X>>22) (A) = (A1)^(A2) ^ 000000000000000000000000000ABCDE (X>>27) (D) ^ 0000000000000000000000000ABCDEFG (X>>25) (C) = (C1)^(C2) ^ 00000000000000000ABCDEFGHIJKLMNO (X>>17) (B) = (B1)^(B2) Hence, we can rewrite our Algorithm to: ------------------------------------------------------------------------- FOR i = T-1, ..., 0 DO SET X := P[i+T] P[i] := P[i] ^ (X<<5) ^ (X<<7) ^ (X<<10) ^ (X<<15) P[i+1] := P[i+1] ^ (X>>17) ^ (X>>22) ^ (X>>25) ^ (X>>27) ------------------------------------------------------------------------- The Polynomial we get after reduction will, of course, still have degree > 570. However, since deg(C(B,T)) <= 10 and deg(r(x)) <= 10, we have deg(C(B,T)r(x)) <= 20, which fits into one single machine word. Hence, we should be able to complete the reduction with one more operation (and a few shifts). Without proof I will tell you that we can complete the reduction algorithm to: ------------------------------------------------------------------------- Polynomial Reduction Algorithm Modulo f571 ------------------------------------------------------------------------- Input: Polynomial p(x) of degree 1140 or less, stored as an array of 2T machinewords. Output: p(x) mod f571(x) FOR i = T-1, ..., 0 DO SET X := P[i+T] P[i] := P[i] ^ (X<<5) ^ (X<<7) ^ (X<<10) ^ (X<<15) P[i+1] := P[i+1] ^ (X>>17) ^ (X>>22) ^ (X>>25) ^ (X>>27) SET X := P[T-1] >> 27 P[0] := P[0] ^ X ^ (X<<2) ^ (X<<5) ^ (X<<10) P[T-1] := P[T-1] & 0x07ffffff RETURN P[T-1],...,P[0] ------------------------------------------------------------------------- We're done! We have each and every function to implement a fast 570 bit binary field. -[ iv ]-----------------------------[ Implementation Issues - Curves ]----- "Premature Optimization is the root of all evil." - Tony Hoare Now that we have a field to work with, we need to sort out some final issues with the curve implementation. As I told you before, the ECDLP is hard only if the curve does not have a smooth order. At this point we shortcut a huge problem of elliptic curve cryptography by choosing a hardcoded (Koblitz) curve over F(2,570) which has almost prime order. That curve is E0: y² + xy = x³ + 1 This is a non-supersingular Koblitz curve over F2. Fancy words, huh? The laws for point addition work as follows: 1. P + @ = @ + P = P for all P in E(F(2,m)) 2. The negative of P = (x,y) is -P = (x, x + y). Also, -@ = @ and of course, P + (-P) = @. 3. Point addition: Let P=(x1,y1) and Q=(x2,y2), where P is neither Q nor -Q. Then we have P + Q = (x3,y3), where x3 = l² + l + x1 + x2 (y1 + y2) with l := --------- y3 = l(x1+x3) + x3 + y1 (x1 + x2) 4. Point doubling. Let P = (x, y) and P is not -P. Then we will write P+P = 2P = (a, b), where a = l² + l = x² + (1/x²) (x + y) with l := ------- b = l(x+a) + a + y x But wait! You said in (iii.) that -(x,y) = (x,-y)! Yea. I lied. That law in (iii) is not the law for non-supersingular curves over F2, but for curves over the real numbers. You see, in F(2,570), many things are very different from other fields because we have 1+1=0. Simple things like dividing by 2 turn out to be real explosive. Hence, most algorithms and laws have to be tweaked a bit. Know that Koblitz curves have special traits which gave birth to a lot of literally insane optimization algorithms. Also know that I did not implement any of these. It might be a topic for later papers. Instead, the above laws have been implemented naively. What requires a wee bit of attention is the point multiplication algorithm. First of all, point multiplication is defined as the n-fold addition of a point P. Here, n is any integer number. We will use a standard square & multiply (double & add) algorithm to implement the point multiplication using point addition. Hence, we will view the integer number as a bitstring and consequentially, we can use the same data type for 576-bit polynomials and 576-bit integer numbers. --------------------------------------------------------- Double & Add Point Multiplication Algorithm --------------------------------------------------------- Input: Point P, Polynomial k(x) over Z Output: k(2)*P SET Q := P FOR i = deg(k), ..., 0 DO: Q := Q+Q IF bit i of k(x) is set THEN: Q := Q+P RETURN Q --------------------------------------------------------- Our work here is almost done. The only remaining problem is - how do we find a point on the curve? (1,1) and (1,0) are trivial solutions, but they are of such small order that they are not really suitable for key exchange via ECDLP. On the other hand, |E(F)| ~= q, where q = |F| (there are approximately as many points on the curve as there are elements in the underlying field). Since we have q² possible point combinations, we have a probability of 1/q ~= 2.59e-172 that a random point actually satisfies the curve equation. We will work around this problem the same way we worked around most problems so far - we won't solve it. The following point is, according to [F1], a generator point of the koblitz curve we chose in little Endian notation: PGEN = { 0xA01C8972,0xE2945283,0x4DCA88C7,0x988B4717,0x494776FB,0xBBD1BA39, 0xB4CEB08C,0x47DA304D,0x93B205E6,0x43709584,0x01841CA4,0x60248048, 0x0012D5D4,0xAC9CA297,0xF8103FE4,0x82189631,0x59923FBC,0x026EB7A8, 0x3EF1C7A3,0x01CD4C14,0x591984F6,0x320430C8,0x7BA7AF1B,0xB620B01A, 0xF772AEDC,0x4FBEBBB9,0xAC44AEA7,0x9D4979C0,0x006D8A2C,0xFFC61EFC, 0x9F307A54,0x4DD58CEC,0x3BCA9531,0x4F4AEADE,0x7F4FBF37,0x0349DC80 } What is a generator point? A point P of E(F) is called a generator point if and only if for every Q in E(F), there is an integer number n such that nP = Q. In plain English, this point generates the entire curve. In step (1) of the key exchange, explained in the last paragraph of chapter (iii), we will always choose the point PGEN. Now, having cheated our way around the more cumbersome problems of ECC, we shall have some fun. -[ vi ]---------------------------------------------[ Implementation ]----- "I considered life, I implemented life, now, life is boring." - thE AWKz In the Appendix, you will find the following files: binfields.h curve.h binfields.c curve.c The binfields module implements polynomial extension fields modulo the prime polynomial presented in (iv). All operations are implemented as described in (iv) and provide sufficiently fast arithmetics. The curve module implements the koblitz curve E0 as described in (iv). Note that the curve module has not been optimized for speed. I have tested the Module, and for a full-blown key exchange, the algorithm takes around 3-4 seconds, which means an effort of max. 2 seconds for each party. This is reasonably fast and definitely fast enough for any real life application. On a slow box, this might grow to time spans of 10 seconds, but I still think it is within acceptable range. Now, how does the Module work? Basically, include curve.h and then, all you need are the following functions: word* pointMul(word* R, word* k, word* P); word* pointGen(); word* polyRand(); polyRand() generates a random polynomial in F(2,570). For the sake of point Multiplication, it can also be interpreted as a 570-bit integer number. pointGen() gives you a hardcoded generator point of E0. For key exchange, each side performs the following operation (after initializing the random number generator): -------------------------------------------------------------- word* n = polyRand(); word* P = pointGen(); pointMul(P,n,P); // P := nP sendToBob(P); // send (nP) to bob recvFromBob(P); // get (mP) from bob pointMul(P,n,P); // calculate Q = nmP = n(mP) free(n); /* use the 36-word array P or any substring of it for symmetric encryption. */ -------------------------------------------------------------- And voila, you're good to go. -[ vii ]------------------------------------------------[ Conclusion ]----- "I know that you believe you understand what you think I said, but I'm not sure you realize that what you heard is not what I meant." - Robert McCloskey We have implemented exactly what we wanted - a quick and dirty ECDLP key exchange algorithm. It is far from being a comprehensive and complete ECC library - it is limited. On the other hand, it is small and easily incorporated with applications you might already have. Any kind of remote networking software can now be equipped with full-blown assymetric encryption. I can imagine a whole lot of ways to put this option to good use. And now, enjoy the code. -[ viii ]-----------------------------------------------[ References ]----- [K1] Karatsuba - Ofman Multiplication using Divide & Conquer http://mathworld.wolfram.com/KaratsubaMultiplication.html [F1] Federal Information Processing Standards Publication FIPS-186-2 - Digital Signature Standard http://csrc.nist.gov/publications/fips/fips186-2/fips186-2-change1.pdf [=========================================================================] [------------------------------[ APPENDIX ]-------------------------------] [=========================================================================] All files have been compiled using Microsoft Visual Studio 6 on Windows. Since there is a lot of inline Assembler, you might have to hand-tweak this code for other platforms. -------------------------------------------------------------- #ifndef _BINFIELDS__H #define _BINFIELDS__H typedef unsigned char byte; typedef unsigned long word; #define SIZE_BITS 0x0000023a #define SIZE_WORDS 0x00000012 #define SIZE_BYTES 0x00000048 #define EMPTY_MASK 0x07ffffff #define SIZE_BITS2 0x00000474 #define SIZE_WORDS2 0x00000024 #define SIZE_BYTES2 0x00000090 void __fastcall lshift(word* a, word n); void __fastcall rshift(word* a, word n); int deg(word *a); int __fastcall polyCmp(word *a, word *b); int __fastcall polyIsZero(word *a); word* __fastcall polyAdd(word* c, word* a, word* b); word* __fastcall polyAddTo(word* a, word* b); word* polySqr(word* c, word* a); word* polyMul(word* c, word *a, word *b); word* polyDiv(word* c, word *a, word *b); word* polyInv(word *c, word *a); word* polyGen(); word* polyRand(); word* polyDup(word* a); #endif -------------------------------------------------------------- #include #include #include #include #include "binfields.h" // Precomputed Array. For each byte a1,a2,...,a8, this // array contains the halfword a1,0,a2,0,a3,0,...,0,a8. // This is used for fast squaring. const static unsigned short POW[0x100] = { 0x0000,0x0001,0x0004,0x0005,0x0010,0x0011,0x0014,0x0015, 0x0040,0x0041,0x0044,0x0045,0x0050,0x0051,0x0054,0x0055, 0x0100,0x0101,0x0104,0x0105,0x0110,0x0111,0x0114,0x0115, 0x0140,0x0141,0x0144,0x0145,0x0150,0x0151,0x0154,0x0155, 0x0400,0x0401,0x0404,0x0405,0x0410,0x0411,0x0414,0x0415, 0x0440,0x0441,0x0444,0x0445,0x0450,0x0451,0x0454,0x0455, 0x0500,0x0501,0x0504,0x0505,0x0510,0x0511,0x0514,0x0515, 0x0540,0x0541,0x0544,0x0545,0x0550,0x0551,0x0554,0x0555, 0x1000,0x1001,0x1004,0x1005,0x1010,0x1011,0x1014,0x1015, 0x1040,0x1041,0x1044,0x1045,0x1050,0x1051,0x1054,0x1055, 0x1100,0x1101,0x1104,0x1105,0x1110,0x1111,0x1114,0x1115, 0x1140,0x1141,0x1144,0x1145,0x1150,0x1151,0x1154,0x1155, 0x1400,0x1401,0x1404,0x1405,0x1410,0x1411,0x1414,0x1415, 0x1440,0x1441,0x1444,0x1445,0x1450,0x1451,0x1454,0x1455, 0x1500,0x1501,0x1504,0x1505,0x1510,0x1511,0x1514,0x1515, 0x1540,0x1541,0x1544,0x1545,0x1550,0x1551,0x1554,0x1555, 0x4000,0x4001,0x4004,0x4005,0x4010,0x4011,0x4014,0x4015, 0x4040,0x4041,0x4044,0x4045,0x4050,0x4051,0x4054,0x4055, 0x4100,0x4101,0x4104,0x4105,0x4110,0x4111,0x4114,0x4115, 0x4140,0x4141,0x4144,0x4145,0x4150,0x4151,0x4154,0x4155, 0x4400,0x4401,0x4404,0x4405,0x4410,0x4411,0x4414,0x4415, 0x4440,0x4441,0x4444,0x4445,0x4450,0x4451,0x4454,0x4455, 0x4500,0x4501,0x4504,0x4505,0x4510,0x4511,0x4514,0x4515, 0x4540,0x4541,0x4544,0x4545,0x4550,0x4551,0x4554,0x4555, 0x5000,0x5001,0x5004,0x5005,0x5010,0x5011,0x5014,0x5015, 0x5040,0x5041,0x5044,0x5045,0x5050,0x5051,0x5054,0x5055, 0x5100,0x5101,0x5104,0x5105,0x5110,0x5111,0x5114,0x5115, 0x5140,0x5141,0x5144,0x5145,0x5150,0x5151,0x5154,0x5155, 0x5400,0x5401,0x5404,0x5405,0x5410,0x5411,0x5414,0x5415, 0x5440,0x5441,0x5444,0x5445,0x5450,0x5451,0x5454,0x5455, 0x5500,0x5501,0x5504,0x5505,0x5510,0x5511,0x5514,0x5515, 0x5540,0x5541,0x5544,0x5545,0x5550,0x5551,0x5554,0x5555 }; void __fastcall lshift2(word* a, word n); word* __fastcall __reduce(word* c, word* temp) { register int i; register word T; if (!deg(&c[SIZE_WORDS]) && deg(c)<=SIZE_BITS) goto __reduce_done; for (i=SIZE_WORDS2-1;i>=SIZE_WORDS;i--) { T = c[i]; c[i-SIZE_WORDS] ^= (T<<5) ^ (T<<7) ^ (T<<10) ^ (T<<15); c[i-SIZE_WORDS+1] ^= (T>>27) ^ (T>>25) ^ (T>>22) ^ (T>>17); } T = c[SIZE_WORDS-1] >> 27; c[0] = c[0] ^ T ^ (T<<2) ^ (T<<5) ^ (T<<10); c[SIZE_WORDS-1] = c[SIZE_WORDS-1] & EMPTY_MASK; __reduce_done: return memcpy(temp, c, SIZE_BYTES); } word* polyMul(word* r, word* a, word* b) { register int k,j; word c[SIZE_WORDS2] = {0}; for (k=31;k>=0;k--) { for (j=0;j>k)&1) polyAddTo(&c[j],b); if (k) lshift2(c,1); } return __reduce(c,r); } word* polySqr(word* r, word *a) { register word i; word c[SIZE_WORDS2] = {0}; for (i=0;i>0x00) & 0xFF]; c[2*i] += POW[(a[i]>>0x08) & 0xFF] << 0x10; c[2*i+1] = POW[(a[i]>>0x10) & 0xFF]; c[2*i+1] += POW[(a[i]>>0x18) & 0xFF] << 0x10; } return __reduce(c,r); } word* polyDiv(word *c, word *a, word *b) { word t[SIZE_WORDS]; return polyMul(c,a,polyInv(t,b)); } word* polyInv(word *r, word* a) { register word t; register int j; word x[5*SIZE_WORDS] = {0}, *v = &x[1*SIZE_WORDS], *u = &x[2*SIZE_WORDS], *g = &x[3*SIZE_WORDS], *f = &x[4*SIZE_WORDS]; memcpy(u,a,SIZE_BYTES); v[0] = 0x425; v[SIZE_WORDS-1] = 0x08000000; g[0] = 1; inv_loop: if (u[0]==1 || u[0]==0) { for (j=1;j u t = (word) g; g = f; f = (word*) t; // g <-> f j = -j; } memcpy(x,v,SIZE_BYTES); lshift(x,j); polyAddTo(u,x); // u = u + v>>j memcpy(x,f,SIZE_BYTES); lshift(x,j); polyAddTo(g,x); // g = g + f>>j goto inv_loop; inv_done: memcpy(r, g,SIZE_BYTES); return r; } __declspec(naked) word* __fastcall polyAddTo(word* a, word* b) { __asm { mov edi, ecx mov ecx, SIZE_WORDS _add_to_loop: mov eax, [edx+4*ecx-4] xor [edi+4*ecx-4], eax loop _add_to_loop mov ecx, edi ret } } word* __fastcall polyAdd(word* c, word* a, word* b) { __asm { mov ebx,ecx mov esi,b mov ecx, SIZE_WORDS _add_loop: mov eax, [edx+4*ecx-4] xor eax, [esi+4*ecx-4] mov [ebx+4*ecx-4], eax loop _add_loop } } _declspec(naked) void __fastcall lshift(word* a, word n) { __asm { test edx,edx jnz _lshift_nonzeroshift ret _lshift_nonzeroshift: mov edi,ecx mov esi,edi mov eax,edx shr eax,3 test eax,eax je _lshift_go mov ecx, SIZE_BYTES sub ecx, eax js _lshift_zero add edi, eax _lshift_preloop: dec ecx mov bl, byte ptr[esi+ecx] mov byte ptr [edi+ecx], bl test ecx, ecx jne _lshift_preloop mov ecx, eax _lshift_zloop: mov byte ptr [esi+ecx-1],0 loop _lshift_zloop ;; add esi, eax shl eax, 3 sub edx, eax jz _lshift_done _lshift_go: mov ecx,SIZE_WORDS mov edi,esi clc _lshift_subloop: rcl dword ptr [edi], 1 inc edi inc edi inc edi inc edi loop _lshift_subloop dec edx jne _lshift_go _lshift_done: ret _lshift_zero: mov ecx,SIZE_BYTES _lshift_zloop2: mov byte ptr [esi+eax],0 loop _lshift_zloop2 ret } } _declspec(naked) void __fastcall lshift2(word* a, word n) { __asm { test edx,edx jnz _lshift2_nonzeroshift ret _lshift2_nonzeroshift: mov edi,ecx mov esi,edi mov eax,edx shr eax,3 test eax,eax je _lshift2_go mov ecx, SIZE_BYTES2 sub ecx, eax js _lshift2_zero add edi, eax _lshift2_preloop: dec ecx mov bl, byte ptr[esi+ecx] mov byte ptr [edi+ecx], bl test ecx, ecx jne _lshift2_preloop mov ecx, eax _lshift2_zloop: mov byte ptr [esi+ecx-1],0 loop _lshift2_zloop ;; add esi, eax shl eax, 3 sub edx, eax jz _lshift2_done _lshift2_go: mov edi, esi mov ecx, SIZE_WORDS2 clc _lshift2_subloop: rcl dword ptr [edi],1 inc edi inc edi inc edi inc edi loop _lshift2_subloop dec edx jne _lshift2_go _lshift2_done: ret _lshift2_zero: mov ecx,SIZE_BYTES2 _lshift2_zloop2: mov byte ptr [esi+eax],0 loop _lshift2_zloop2 ret } } _declspec(naked) void __fastcall rshift(word* a, word n) { __asm { mov edi,ecx mov esi,edi mov eax,edx shr eax,3 test eax,eax je _rshift_go push eax mov ebx, SIZE_BYTES sub ebx, eax js _rshift_zero xor ecx, ecx add edi, eax _rshift_preloop: mov al, byte ptr[edi+ecx] mov byte ptr [esi+ecx], al inc ecx cmp ecx, ebx jl _rshift_preloop mov edi,esi add edi,ebx pop ecx mov eax,ecx _rshift_zloop: mov byte ptr [edi+ecx-1],0 loop _rshift_zloop shl eax, 3 sub edx, eax jz _rshift_done _rshift_go: mov ecx,SIZE_WORDS mov edi,esi sub edi,4 shr dword ptr [edi+4*ecx],1 dec ecx _rshift_subloop: rcr dword ptr [edi+4*ecx], 1 loop _rshift_subloop dec edx jne _rshift_go _rshift_done: ret _rshift_zero: mov ecx,SIZE_BYTES _rshift_zloop2: mov byte ptr [esi+eax],0 loop _rshift_zloop2 ret } } word* polyGen() { return calloc(SIZE_WORDS,sizeof(word)); } word* polyRand() { register int i; register word* p = calloc(SIZE_WORDS,sizeof(word)); for (i=SIZE_WORDS-1;i>=0;i--) p[i] = (rand()<<16) | rand(); p[SIZE_WORDS-1] &= EMPTY_MASK; return p; } int deg(word *a) { __asm { mov edx, a mov ecx, SIZE_WORDS _deg_loop: mov ebx, [edx+4*ecx-4] test ebx,ebx jnz _deg_done loop _deg_loop xor eax,eax jmp _deg_zero _deg_done: finit push 1 fild dword ptr [esp] mov dword ptr [esp],0 push ebx fild qword ptr [esp] fyl2x fstcw word ptr [esp] ; round down fstcw word ptr [esp+2] and word ptr[esp],0xF3FF or word ptr[esp], 0x400 fldcw [esp] frndint fldcw [esp+2] fistp qword ptr [esp] pop ebx pop eax mov eax,32 dec ecx mul ecx add eax,ebx inc eax _deg_zero: } } __declspec(naked) int __fastcall polyCmp(word* a, word* b) { __asm { mov edi, ecx ; edi=a/edx=b xor eax, eax mov ecx, SIZE_WORDS __loop_cmp: mov ebx, [edi+ecx*4-4] ; ebx = a[n] mov esi, [edx+ecx*4-4] ; esi = b[n] test ebx,ebx jz __a_n_zero test esi,esi jnz __b_n_notzero inc eax ret __b_n_notzero: cmp ebx, esi jz __next_cmp jl __lesser_cmp inc eax ret __lesser_cmp: dec eax ret __next_cmp: loop __loop_cmp ret __a_n_zero: test esi,esi jz __next_cmp dec eax ret } } __declspec(naked) int __fastcall polyIsZero(word* a) { __asm { mov edi, ecx ; edi=a/edx=b xor eax, eax mov ecx, SIZE_WORDS __iszero_loop: cmp [edi+ecx*4-4],0 jne __is_not_zero loop __iszero_loop inc eax __is_not_zero: ret } } word* dup(word* a) { register word* b = polyGen(); memcpy(b,a,SIZE_BYTES); return b; } ------------------------------------------------------------------ #ifndef _CURVE_H #define _CURVE_H #include "binfields.h" // We will use, by default, the Koblitz Curve // y2 + xy = x3 + 1 word* pointMul(word* R, word* k, word* P); word* pointAdd(word* R, word* P, word* Q); word* pointDbl(word* R, word* P); word* pointNeg(word* P); word* pointZero(word* P); int isZero(word* P); int isValidPoint(word* P); int pointCmp(word* P, word* Q); word* pointDup(word* P); word* pointGen(); word* pointGenZero(); #endif ------------------------------------------------------------------ #include #include #include "curve.h" #define Y(__P) (&__P[SIZE_WORDS]) #define X(__P) __P word PGEN[]= { 0xA01C8972,0xE2945283,0x4DCA88C7,0x988B4717,0x494776FB,0xBBD1BA39, 0xB4CEB08C,0x47DA304D,0x93B205E6,0x43709584,0x01841CA4,0x60248048, 0x0012D5D4,0xAC9CA297,0xF8103FE4,0x82189631,0x59923FBC,0x026EB7A8, 0x3EF1C7A3,0x01CD4C14,0x591984F6,0x320430C8,0x7BA7AF1B,0xB620B01A, 0xF772AEDC,0x4FBEBBB9,0xAC44AEA7,0x9D4979C0,0x006D8A2C,0xFFC61EFC, 0x9F307A54,0x4DD58CEC,0x3BCA9531,0x4F4AEADE,0x7F4FBF37,0x0349DC80 }; int isValidPoint(word* P) { // y2+xy=x3+1 word t[SIZE_WORDS] = {0}, l[SIZE_WORDS] = {0}; polyAdd(t,X(P),Y(P)); polyMul(t,t,Y(P)); // left side polySqr(l,X(P)); polyMul(l,l,X(P)); l[0] ^= 1; // right side return !polyCmp(t,l); } word* pointDup(word *P) { return memcpy(malloc(SIZE_WORDS2*sizeof(word)),P,SIZE_BYTES2); } int pointCmp(word* P, word* Q) { switch (polyCmp(X(P),X(Q))) { case 1: return 1; case -1: return -1; default: return polyCmp(Y(P),Y(Q)); } } word* pointGen() { word* P = malloc(SIZE_WORDS2 * sizeof(word)); return memcpy(P,PGEN,SIZE_BYTES2); } word* pointGenZero() { word* pt = calloc(SIZE_WORDS2, sizeof(word)); pt[SIZE_WORDS-1] = 1 << 31; return pt; } int isZero(word* P) { return (P[SIZE_WORDS-1]>>31); } word* pointZero(word* P) { memset(P,0,SIZE_BYTES2); P[SIZE_WORDS-1] = 1 << 31; return P; } word* pointDbl(word* R, word* P) { word t1[SIZE_WORDS] = {0}; word t2[SIZE_WORDS]; if (isZero(P)) return (R==P)?R:memcpy(R,P,SIZE_BYTES2); else if (!polyCmp(t1,P)) return pointZero(R); polyDiv(Y(R),Y(P),X(P)); polyAddTo(Y(R),X(P)); R[SIZE_WORDS] ^= 1; polySqr(t2,X(P)); // t2 = x2 polyInv(t1,t2); polyAddTo(t1,t2); // t1 = x' polyMul(Y(R),Y(R),t1); polyAddTo(Y(R),t2); return memcpy(R,t1,SIZE_BYTES); } word* pointNeg(word* P) { if (isZero(P)) return P; polyAddTo(Y(P),X(P)); return P; } word* pointAdd(word* R, word* P, word* Q) { word l[SIZE_WORDS], x[SIZE_WORDS], y[SIZE_WORDS]; if (isZero(Q)) { if (R==P) return R; else return memcpy(R,P,SIZE_BYTES2); } else if (isZero(P)) { if (R==Q) return R; else return memcpy(R,Q,SIZE_BYTES2); } else { if (!pointCmp(P,Q)) { return pointDbl(R,P); } else { word* N = pointNeg(pointDup(Q)); if (!pointCmp(P,N)) { pointZero(R); free(N); return R; } else free(N); polyAdd(l, Y(P), Y(Q)); polyAdd(x, X(P), X(Q)); polyDiv(l, l, x); polyAddTo(x,polySqr(y,l)); polyAddTo(x,l); polyAdd(y, x, X(P)); polyMul(y, y, l); polyAddTo(y, x); polyAdd(Y(R), y, Y(P)); return memcpy(X(R),x,SIZE_BYTES); } } } word* pointSub(word* R, word* P, word* Q) { word* N = pointNeg(pointDup(Q)); pointAdd(R,P,N); free(N); return R; } word* pointMul(word* R, word* k, word* P) { register int i; word* Q = pointDup(P); for (i=deg(k)-1;i>=0;i--) { pointDbl(Q,Q); if ((k[i/32]>>(i%32))&1) pointAdd(Q, Q,P); } memcpy(R,Q,SIZE_BYTES2); free(Q); return R; } --------------------------------[ EOF ]--------------------------------