[=========================================================================]
[-------------------------[ ECDLP Quick & Dirty ]-------------------------]
[=========================================================================]
_==####==_
.############.
.################.
.##################.__ __ __ __ _ _ __ __
##############/¯_`|#\ V V // _` | '_/ -_)
##############\__,|# \_/\_/ \__,_|_| \___|
###########*¯*######
*#########{ }####*
##########._.#####
################ (C) 2006 by .aware computer security research group
*############*
¯==####==¯ Web: http://awarenetwork.org
Author: rattle@awarenetwork.org
------------------------------------------------[ Table of Contents ]-----
i. Abstract
ii. Preface
iii. Mathematical Background
iv. Implementation Issues - Fields
v. Implementation Issues - Curves
vi. Implementation
vii. Conclusion
viii. References
-[ i ]----------------------------------------------------[ Abstract ]-----
We aim to implement a very small and versatile assymetric key exchange
algorithm for x86 platforms, based on the Elliptic Curve Discrete
Logarithm Problem (ECDLP). This algorithm allows 2 parties to exchange a
key of sufficient length for subsequently encrypting traffic with a
symmetric cipher.
NOTE: This paper might offend the mathematically schooled among you
because I was not exactly precise, let alone thorough.
-[ ii ]----------------------------------------------------[ Preface ]-----
"I encrypt all my data with one-time pad's that consist
of an electrometre measuring me urinating into the
toilet. I'm secure."
- sirukin
It has come to my attention that modern cryptography, while being on
everyone's favourite feature list, is hardly found explained in papers
with the due clarity it requires. More accurately, most articles give
you a rough idea of how certain cryptographic schemes work and,
succeed in completely failing to give you enough insight (or tools) to
implement an actually secure, working algorithm. In particular, I am
referring to Elliptic Curve Cryptography (ECC), which I will be
revolving around for the rest of the night.
If everyone has such a compulsive thirst for knowledge - why the utter
lack of enlightenment? Pretty simple. Just like the less sophisticated
asymmetric schemes (like RSA), the basic idea of ECC can be wrapped up
in a couple of fancy sketches and some first semester algebra. However,
once you delve deeper into the topic, you will find that it is
headblowingly complicated and not nearly as easy as they told you.
Don't be intimidated, though. I will work around most problems of ECC
with dirty tricks and hardcoded parameters. What we will get, though,
is a working and secure ECDLP key exchange algorithm implemented in C
for x86 machines. Mac users, rejoice.
-[ iii ]-----------------------------------[ Mathematical Background ]-----
"Well, just because you don't understand it, does not
mean it's complicated."
- My Algebra Professor
(1) Fields. A field is a mathematical structure F which behaves similar to
the real or the rational numbers. For in time, I recommend you think of
these structures as we analyze them in a more abstract fashion. Elements
of a field can be added and multiplied. Each element a has a uniquely
defined additive inverse (-a) such that a+(-a)=0. Also, each element a!=0
has a multiplicative inverse (1/a) such that a * (1/a) = 1. Also, you have
a*b=b*a, a*(b*c)=(a*b)*c and (a+b)*c = ac+bc. All this is obvious for real
or rational numbers - but there are much more interesting structures that
obey the same laws. As you might have already concluded, each field
contains the Elements 0 and 1, the neutral Elements of Addition and
Multiplication respectively. (Neutral because a+0=a, a*1=a.)
Well, since it seems to be the smallest possible one, how about the field
F2 = { 0, 1 } with
Addition: Multiplication:
+ | 0 | 1 * | 0 | 1
---+---+--- ---+---+---
0 | 0 | 1 0 | 0 | 0
---+---+--- ---+---+---
1 | 1 | 0 1 | 0 | 1
' ' ' '
The addition is a simple XOR operation, the multiplication works just
as though 1 and 0 were integer numbers. This means, you use + and * on
0 and 1 as though they were integer numbers, but the result of your
operation is the parity of the calculated number. Since there aren't that
many scenarios, you can easily convince yourself that you can operate
inside F2 just like you can operate on real and rational numbers.
While it's an interesting and entertaining example, let's try to
generalize the concept. For any prime p, let F(p) = { 0, 1, ..., p-1 } be
the field where every operation is performed modulo p. This means, for
any a,b in F(p), you have
add(a,b) := (a + b) % p
mul(a,b) := (a * b) % p
It can be shown that the resulting structure is a field. Why primes? I
will not go into detail here, but p being a prime is necessary for the
existence of a multiplicative inverse for every a != 0. For instance, if
we'd try p = 6, you had mul(2,3) = (2*3)%6 = 0, and consequentially,
neither 2 or 3 would have a multiplicative inverse. However, for p=7:
mul(1,1) = 1 % 7 = 1
mul(2,4) = (2 * 4) % 7 = 8 % 7 = 1
mul(3,5) = (3 * 5) % 7 = 15 % 7 = 1
mul(6,6) = (6 * 6) % 7 = 36 % 7 = 1
These fields are known as prime fields. It can be shown that every finite
field (a field with finitely many elements) has p to the power of m
elements, where p is a prime and m a positive integer number. We have
covered the fields with p elements already, what do the others look like?
Consider polynomials with coefficients in F(p). As you might know,
polynomials can be added and multiplied. Most axioms of a field are
already given, but we are lacking a multiplicative inverse for every
polynomial of degree 1 or more. In fact, the polynomials over F(p)
are, as a structure, very similar to the integer numbers - there's a
polynomial division with rest, just like you have a division with rest for
integers. Accordingly, you will also find prime polynomials. You will
have to believe me when I tell you the following two things (or read a big
book on algebra):
- For any m, there exists a prime polynomial of degree m over F(p).
- Performing every operation in F(p) modulo a prime polynomial of
degree m yields a field structure with p to the power of m
elements.
We are particularly excited about polynomials over F2, because they can
easily be represented as a bitstring (the coefficients of the
polynomial are either 1 or 0). We will refer to this structure as F(2,m)
where (m+1) is the degree of the prime polynomial we use for reduction.
I cannot venture too deep into the maths here. For now, I just want
you to know and trust me that F(2,m) works just like real or rational
numbers - except for nasty things like a + a = (1 + 1) * a = 0 * a = 0.
(2) Elliptic curves. What is an elliptic curve? An elliptic curve, for
us, will be the set of all points (x,y) from any field which satisfy an
equation of the form
y² + xy = x³ + ax² + b (*)
Note that this is not the general definition of an elliptic curve. Indeed,
this is a "non-supersingular" curve. However, frankly, that's all we are
interested in.
We will write, for given coefficients a and b:
E(F) = { (x,y) : x in F, y in F, y² + xy = x³ + ax² + b } + { @ }
In words: E(F) is the set of all tuples (x,y) with x and y in F, such that
x and y satisfy the curve equation (*). We also say that x=infinity and
y=infinity satisfy the curve equation and call the point @ the "point at
infinity".
The sketch below shows how these kinds of curves look over the real
numbers:
| .
|
| .
| .
| .
|.
. . . .|
. . .. . |
. |
. |
----------------------+-----------------
. |
. |
. . .. . |
. . . .|
|.
| .
| .
| .
|
| .
There is something particularly interesting about these elliptic curves.
Provided that P and Q are points on the curve, draw a line through P and
Q. It will intersect the curve in a third point R'. The sum of
P and Q is then defined as the negative of R', which is simply the
reflection of R' about the x-axis. To add a point P to itself, you draw
the tangent instead.
| /
| .
| /. R'
| /
| / .
| /
| / .
|/
/ .
/| .
/ | .
/ |.
. . . / .|
. . .. . |
. / Q |
. / |
--------------/-------+-----------------
. / |
. / |
. / . .. . |
. . . .|
/ P |.
/ | .
/ | .
/ | .
/ |
/ | .
/ |
/ | .
/ |
/ | .
/ | X -R' = (P + Q)
To be precise, for P = (x,y) we set -P = (x,-y). Set P + (-P) := @.
As an interesting result, E(F) together with this addition law is an
abelian group (that means, adding these points technically works like
like adding integer numbers) with @ serving as the neutral element
(the "zero").
How can we put this structure to cryptographic use? First of all, we
consider elliptic curves over finite fields. Also, define a function
f(n,P) = nP = P + P + ... + P
<-- n times -->
For very large n, this function is still easy to calculate for a given
point P. However - given Q and P, it is very hard to find an integer
Number n such that Q = f(n,P). This is called the Elliptic Curve Discrete
Logarithm Problem. Now, due to the abelian structure, you have
f(n*m,P) = f(n,f(m,P)) = f(m,f(n,P))
for any two integers n and m. Let me show you how we can exchange keys
with this little trick.
(1) Alice and Bob agree on a Point P in E(F).
(2) Alice secretly chooses a big random number n and calculates nP.
Bob also chooses big random number m and calculates mP.
(3) Alice sends Bob (nP), Bob sends Alice (mP).
(4) Alice calculates n(mP) = nmP =: Q.
Bob also calculates m(nP) = nmP = Q.
(5) Bob and Alice use Q as the key for encrypting traffic.
What does Eve know? She knows P, (nP) and (mP). To get Q, she'd need the
product nm. However, this would only be possible by solving the ECDLP for
(nP) and (mP) (or solving a similarly hard problem).
And that's that, folks. Let's get down to business!
-[ iv ]-----------------------------[ Implementation Issues - Fields ]-----
"If I recally correctly, the compiler does issue optimizer notes for
this, it's just that they don't particularly stand out from the other
notes so you might not realize that these particular optimization
notes are OPTIMIZATION PROBLEMS WHICH WILL BURN ALL YOUR CPU CYCLES
IN THE DEEPEST PITS OF COMPUTATIONAL HELL."
- William Newman on run-time casting with SBCL (#lisp 2003/03/29)
First of all, the ECDLP is hard only if the two following rather strong
conditions are met:
- The number of elements in your finite Field is larger than at least
2 to the power of 128. 256 bits is better, 512 scares the NSA.
- The number of points on your elliptic curve is almost prime, which
means that it is of the form |E(F)| = h * p, where p is prime and
h very small.
This leads to the following problems:
(A) Implement sufficiently fast finite field arithmetics for fields
with approximately 1.34e+154 elements (that's around 512 bits).
(B) Find a curve which has appropriate order.
Problem (A) is our biggest problem, as (B) has been solved by government
agencies and smart mathematicians around the world. So we're left with
the task of implementing F(2,m) for m ~= 512.
Polynomials will be implemented as bitstrings of length m, stored in a
little endian array of T = (m/32) machine words (32 bits).
(1) Shifting / Multiplication with x. Because we store polynomials as
bitstrings, shifting is a cheap linear operation. Only consider shifts
of l<32 (for shifts larger than our wordsize of 32, we can shift
word-wise first in linear time). We now have to shift and rotate through
the carry bit at most 31*(m/32) ~ m times.
(2) Addition. Adding two polynomials in F(2,m) is trivially linear.
It is implemented as T XOR operations.
(3) Multiplication. Multiplication might seem more complicated, but is
easily implemented in O(m²). It works like this:
--------------------------------------------------------------
Bitwise Comb Polynomial Multiplication Algorithm
--------------------------------------------------------------
Input: Polynomials p(x), q(x) as bitstrings
Output: c(x) = p(x) * q(x)
SET c(x) = 0
FOR i=0 TO m DO:
IF bit i of p(x) is set:
c(x) := c(x) + (q(x) << i)
RETURN c(x)
--------------------------------------------------------------
You can achieve better complexity with a divide-and-conquer approach as
explained in [K1]. However - for m as small as approximately 512 bit,
benchmarks for bitwise comb turn out to be faster than benchmarks for
divide&conquer, on several systems. Thus, for our purposes, the
algorithm is completely sufficient, if not optimal.
(4) Squaring. Calculating p(x)² can be implemented as a very fast
linear operation due to the following observation which only
applies to fields where 1+1=0:
(a + b + c)² = (a + b + c) (a + b + c)
= a² + ab + ac + ba + b² + bc + ca + cb + c²
= a² + b² + c² + (1+1)ab + (1+1)ac + (1+1)bc
= a² + b² + c²
Thus, given a polynomial p(x) as bitstring, we can calculate p(x)² by
simply inserting zeros between all bits. For instance,
(1100111)² = (1010000010101)
With a wee bit of precomputation, this is easily done in 4T = m/8
steps.
(5) Reduction. The algorithms for squaring and multiplication yield
polynomials of degree > m. These polynomials need to be reduced modulo
a prime polynomial f(x) of degree m+1. For now, assume we have such an
algorithm.
(6) Inversion. Last, but definitely not least, we need a "division"
algorithm. This means, for a given polynomial p(x), find the (uniquely
defined) multiplicative inverse polynomial q(x), such that:
p(x) * q(x) = 1 mod f(x)
where f(x) is our reduction polynomial. To find the Inverse, we use
the polynomial version of euclid's extended algorithm:
--------------------------------------------------------------
Euclid's Extended Algorithm (Polynomial)
--------------------------------------------------------------
Input: Polynomials p(x) and f(x)
Output: Polynomials a(x),b(x) and d(x) such that
a(x) * p(x) + b(x) * f(x) = d(x)
SET a(x) = 1, a´(x) = 0,
b(x) = 0, b´(x) = 1
SET u(x) = p(x),
d(x) = f(x)
WHILE u(x) != 0 DO:
SET j := deg(u(x)) - deg(d(x))
IF j < 0:
u <-> d
a <-> a´
b <-> b´
j := -j
u(x) := u(x) + ( d(x)<<j)
a(x) := a(x) + (a´(x)<<j)
b(x) := b(x) + (b´(x)<<j)
RETURN a(x),b(x),d(x)
--------------------------------------------------------------
where "a <-> b" means "exchange a and b".
In this particular case, we always have d(x) = 1 because f(x) is prime,
and we don't care about b(x).
The only thing that is left to be done is reduction. This is the only
algorithm that depends on the form of the reduction polynomial f(x).
This is also the only algorithm that requires a lot of work and
hand-tuned optimization. You can skip the rest of this chapter if you
are not interested in this kind of stuff.
We will now try to find a reduction polynomial of degree > 512 which
is particularly suited for reduction.
On the other hand, why try to find one if one has already been found?
The [F1] standard suggests reduction modulo the pentanomial
--------------------------------------------------------------
Pentanomial f571(x)
--------------------------------------------------------------
f(x) = x**571 + x**10 + x**5 + x**2 + 1
~ (1<<571) + (1<<10) + (1<<5) + (1<<2) + 1
= 0x 08000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000425
--------------------------------------------------------------
Write f(x) = h(x) + r(x). where h(x)= (1<<571). Also assume we have a
polynomial p(x) of degree at most 2m = 1140. Write
p(x) = b(x)*h(x) + a(x).
with polynomials b(x),a(x) of degree m. Because
b(x)h(x) = b(x)f(x) + b(x)r(x)
We get that b(x)h(x) = b(x)r(x) mod f(x), thus we have
p(x) = b(x)r(x) + a(x) mod f(x).
Now we're getting down to the hard part. We need to devise a fast
reduction algorithm. Assume that the bitstring of P is stored as an array
of unsigned little endian machine words
p(x) = P[2T], P[2T-1], ..., P[T], P[T-1], ..., P[0]
In this case, our polynomial b(x) starts at bitoffset 26 of P[T-1],
because we assume a(x) to be a 570 bit polynomial which does not need to
be reduced. Consider:
<---- P[2T] ---> <--- P[T] ---> <------------ P[T-1] --------------->
{1151} .. {1120} ... {608} .. {576} {575}{574}{573}{572}{571} {570} .....
<---------------------------- B ----------------------------> <--- A --->
This means we have to understand B[i] == (P[i+T]<<5) ^ (P[i+T-1]>>27).
Also, obviously A[i] == P[i].
Now, let's forget this for a while and think about how b(x)r(x) is
calculated. We have r(x) = (1<<10)+(1<<5)+(1<<2)+1 = 0x425. We get
B[i]*R = B[i] ^ (B[i] << 10) ^ (B[i] << 5) ^ (B[i] << 2)
and a (maximum 10-bit) carry:
C(B,i) := (B[i] >> 22) ^ (B[i] >> 27) ^ (B[i] >> 30)
Thus, in one step of our algorithm, we will have to do
-------------------------------------------------------------------------
FOR i = 1, ..., T-1 DO
P[i] := A[i] ^ B[i] ^ (B[i] << 10) ^ (B[i] << 5) ^ (B[i] << 2)
P[i+1] := A[i+1] ^ C(B,i)
-------------------------------------------------------------------------
Consider what we had before and we get
P[i]
= ((P[i+T]<<5) ^ (P[i+T-1]>>27)) ^
( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) << 5 ) ^
( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) << 2 ) ^
( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) << 10 ) ^ P[i]
= P[i] ^ (P[i+T]<<5) ^ (P[i+T]<<7) ^ (P[i+T]<<10) ^ (P[i+T]<<15)
^ (P[i+T-1]>>27) ^ (P[i+T-1]>>27)<<2
^ (P[i+T-1]>>27)<<5 ^ (P[i+T-1]>>27)<<10 (*)
P[i+1]
=
( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) >> 22) ^
( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) >> 27) ^
( ((P[i+T]<<5) ^ (P[i+T-1]>>27)) >> 30) ^ P[i+1]
= P[i+1] ^ (P[i+T]<<5)>>22 ^ (P[i+T]<<5)>>27 ^ (P[i+T]<<5)>>30
however, one step later, P[i+1] will be XOR'd with (*). We have,
if we look at a bitwise representation of X:
Bits, numbered from A to 6: Variable: Label:
ABCDEFGHIJKLMNOPQRSTUVWXYZ123456 (X)
^ 0000000000000000000000ABCDE00000 (X>>27)<<5 (A1)
^ 000000000000000000000000000FGHIJ (X<<5)>>27 (A2)
^ 0000000000000000000000FGHIJKLMNO (X<<5)>>22 (B1)
^ 000000000000000000000000000000FG (X<<5)>>30 (C1)
^ 000000000000000000000000000ABCDE (X>>27) (D)
^ 0000000000000000000000000ABCDE00 (X>>27)<<2 (C2)
^ 00000000000000000ABCDE0000000000 (X>>27)<<10 (B2)
/\ /\ /\
|| || ||
\/ \/ \/
ABCDEFGHIJKLMNOPQRSTUVWXYZ123456 (X)
^ 0000000000000000000000ABCDEFGHIJ (X>>22) (A) = (A1)^(A2)
^ 000000000000000000000000000ABCDE (X>>27) (D)
^ 0000000000000000000000000ABCDEFG (X>>25) (C) = (C1)^(C2)
^ 00000000000000000ABCDEFGHIJKLMNO (X>>17) (B) = (B1)^(B2)
Hence, we can rewrite our Algorithm to:
-------------------------------------------------------------------------
FOR i = T-1, ..., 0 DO
SET X := P[i+T]
P[i] := P[i] ^ (X<<5) ^ (X<<7) ^ (X<<10) ^ (X<<15)
P[i+1] := P[i+1] ^ (X>>17) ^ (X>>22) ^ (X>>25) ^ (X>>27)
-------------------------------------------------------------------------
The Polynomial we get after reduction will, of course, still have degree
> 570. However, since deg(C(B,T)) <= 10 and deg(r(x)) <= 10, we have
deg(C(B,T)r(x)) <= 20, which fits into one single machine word. Hence, we
should be able to complete the reduction with one more operation (and a
few shifts). Without proof I will tell you that we can complete the
reduction algorithm to:
-------------------------------------------------------------------------
Polynomial Reduction Algorithm Modulo f571
-------------------------------------------------------------------------
Input: Polynomial p(x) of degree 1140 or less, stored as
an array of 2T machinewords.
Output: p(x) mod f571(x)
FOR i = T-1, ..., 0 DO
SET X := P[i+T]
P[i] := P[i] ^ (X<<5) ^ (X<<7) ^ (X<<10) ^ (X<<15)
P[i+1] := P[i+1] ^ (X>>17) ^ (X>>22) ^ (X>>25) ^ (X>>27)
SET X := P[T-1] >> 27
P[0] := P[0] ^ X ^ (X<<2) ^ (X<<5) ^ (X<<10)
P[T-1] := P[T-1] & 0x07ffffff
RETURN P[T-1],...,P[0]
-------------------------------------------------------------------------
We're done! We have each and every function to implement a fast 570 bit
binary field.
-[ iv ]-----------------------------[ Implementation Issues - Curves ]-----
"Premature Optimization is the root of all evil."
- Tony Hoare
Now that we have a field to work with, we need to sort out some final
issues with the curve implementation. As I told you before, the ECDLP is
hard only if the curve does not have a smooth order. At this point we
shortcut a huge problem of elliptic curve cryptography by choosing a
hardcoded (Koblitz) curve over F(2,570) which has almost prime order.
That curve is
E0: y² + xy = x³ + 1
This is a non-supersingular Koblitz curve over F2. Fancy words, huh? The
laws for point addition work as follows:
1. P + @ = @ + P = P for all P in E(F(2,m))
2. The negative of P = (x,y) is -P = (x, x + y). Also, -@ = @ and
of course, P + (-P) = @.
3. Point addition: Let P=(x1,y1) and Q=(x2,y2), where P is neither Q nor
-Q. Then we have P + Q = (x3,y3), where
x3 = l² + l + x1 + x2 (y1 + y2)
with l := ---------
y3 = l(x1+x3) + x3 + y1 (x1 + x2)
4. Point doubling. Let P = (x, y) and P is not -P. Then we will write
P+P = 2P = (a, b), where
a = l² + l = x² + (1/x²) (x + y)
with l := -------
b = l(x+a) + a + y x
But wait! You said in (iii.) that -(x,y) = (x,-y)! Yea. I lied. That law
in (iii) is not the law for non-supersingular curves over F2, but for
curves over the real numbers. You see, in F(2,570), many things are very
different from other fields because we have 1+1=0. Simple things like
dividing by 2 turn out to be real explosive. Hence, most algorithms and
laws have to be tweaked a bit.
Know that Koblitz curves have special traits which gave birth to a lot of
literally insane optimization algorithms. Also know that I did not
implement any of these. It might be a topic for later papers. Instead,
the above laws have been implemented naively. What requires a wee bit of
attention is the point multiplication algorithm. First of all, point
multiplication is defined as the n-fold addition of a point P. Here, n
is any integer number. We will use a standard square & multiply (double
& add) algorithm to implement the point multiplication using point
addition. Hence, we will view the integer number as a bitstring and
consequentially, we can use the same data type for 576-bit polynomials and
576-bit integer numbers.
---------------------------------------------------------
Double & Add Point Multiplication Algorithm
---------------------------------------------------------
Input: Point P, Polynomial k(x) over Z
Output: k(2)*P
SET Q := P
FOR i = deg(k), ..., 0 DO:
Q := Q+Q
IF bit i of k(x) is set THEN:
Q := Q+P
RETURN Q
---------------------------------------------------------
Our work here is almost done. The only remaining problem is - how do we
find a point on the curve? (1,1) and (1,0) are trivial solutions, but
they are of such small order that they are not really suitable for key
exchange via ECDLP. On the other hand, |E(F)| ~= q, where q = |F| (there
are approximately as many points on the curve as there are elements in
the underlying field). Since we have q² possible point combinations, we
have a probability of 1/q ~= 2.59e-172 that a random point actually
satisfies the curve equation.
We will work around this problem the same way we worked around most
problems so far - we won't solve it. The following point is, according to
[F1], a generator point of the koblitz curve we chose in little Endian
notation:
PGEN = {
0xA01C8972,0xE2945283,0x4DCA88C7,0x988B4717,0x494776FB,0xBBD1BA39,
0xB4CEB08C,0x47DA304D,0x93B205E6,0x43709584,0x01841CA4,0x60248048,
0x0012D5D4,0xAC9CA297,0xF8103FE4,0x82189631,0x59923FBC,0x026EB7A8,
0x3EF1C7A3,0x01CD4C14,0x591984F6,0x320430C8,0x7BA7AF1B,0xB620B01A,
0xF772AEDC,0x4FBEBBB9,0xAC44AEA7,0x9D4979C0,0x006D8A2C,0xFFC61EFC,
0x9F307A54,0x4DD58CEC,0x3BCA9531,0x4F4AEADE,0x7F4FBF37,0x0349DC80 }
What is a generator point? A point P of E(F) is called a generator point
if and only if for every Q in E(F), there is an integer number n such
that nP = Q. In plain English, this point generates the entire curve.
In step (1) of the key exchange, explained in the last paragraph of
chapter (iii), we will always choose the point PGEN. Now, having cheated
our way around the more cumbersome problems of ECC, we shall have some
fun.
-[ vi ]---------------------------------------------[ Implementation ]-----
"I considered life, I implemented life, now, life is boring."
- thE AWKz
In the Appendix, you will find the following files:
binfields.h curve.h
binfields.c curve.c
The binfields module implements polynomial extension fields modulo the
prime polynomial presented in (iv). All operations are implemented as
described in (iv) and provide sufficiently fast arithmetics. The curve
module implements the koblitz curve E0 as described in (iv).
Note that the curve module has not been optimized for speed. I have
tested the Module, and for a full-blown key exchange, the algorithm takes
around 3-4 seconds, which means an effort of max. 2 seconds for each
party. This is reasonably fast and definitely fast enough for any
real life application. On a slow box, this might grow to time spans
of 10 seconds, but I still think it is within acceptable range.
Now, how does the Module work? Basically, include curve.h and then,
all you need are the following functions:
word* pointMul(word* R, word* k, word* P);
word* pointGen();
word* polyRand();
polyRand() generates a random polynomial in F(2,570). For the sake of
point Multiplication, it can also be interpreted as a 570-bit integer
number. pointGen() gives you a hardcoded generator point of E0.
For key exchange, each side performs the following operation (after
initializing the random number generator):
--------------------------------------------------------------
word* n = polyRand();
word* P = pointGen();
pointMul(P,n,P); // P := nP
sendToBob(P); // send (nP) to bob
recvFromBob(P); // get (mP) from bob
pointMul(P,n,P); // calculate Q = nmP = n(mP)
free(n);
/* use the 36-word array P or any substring of it
for symmetric encryption. */
--------------------------------------------------------------
And voila, you're good to go.
-[ vii ]------------------------------------------------[ Conclusion ]-----
"I know that you believe you understand what you
think I said, but I'm not sure you realize that
what you heard is not what I meant."
- Robert McCloskey
We have implemented exactly what we wanted - a quick and dirty ECDLP key
exchange algorithm. It is far from being a comprehensive and complete
ECC library - it is limited. On the other hand, it is small and easily
incorporated with applications you might already have. Any kind of remote
networking software can now be equipped with full-blown assymetric
encryption. I can imagine a whole lot of ways to put this option to good
use.
And now, enjoy the code.
-[ viii ]-----------------------------------------------[ References ]-----
[K1] Karatsuba - Ofman Multiplication using Divide & Conquer
http://mathworld.wolfram.com/KaratsubaMultiplication.html
[F1] Federal Information Processing Standards Publication
FIPS-186-2 - Digital Signature Standard
http://csrc.nist.gov/publications/fips/fips186-2/fips186-2-change1.pdf
[=========================================================================]
[------------------------------[ APPENDIX ]-------------------------------]
[=========================================================================]
All files have been compiled using Microsoft Visual Studio 6 on Windows.
Since there is a lot of inline Assembler, you might have to hand-tweak
this code for other platforms.
-------------------------------<binfields.h>-------------------------------
#ifndef _BINFIELDS__H
#define _BINFIELDS__H
typedef unsigned char byte;
typedef unsigned long word;
#define SIZE_BITS 0x0000023a
#define SIZE_WORDS 0x00000012
#define SIZE_BYTES 0x00000048
#define EMPTY_MASK 0x07ffffff
#define SIZE_BITS2 0x00000474
#define SIZE_WORDS2 0x00000024
#define SIZE_BYTES2 0x00000090
void __fastcall lshift(word* a, word n);
void __fastcall rshift(word* a, word n);
int deg(word *a);
int __fastcall polyCmp(word *a, word *b);
int __fastcall polyIsZero(word *a);
word* __fastcall polyAdd(word* c, word* a, word* b);
word* __fastcall polyAddTo(word* a, word* b);
word* polySqr(word* c, word* a);
word* polyMul(word* c, word *a, word *b);
word* polyDiv(word* c, word *a, word *b);
word* polyInv(word *c, word *a);
word* polyGen();
word* polyRand();
word* polyDup(word* a);
#endif
-------------------------------<binfields.c>-------------------------------
#include <memory.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include "binfields.h"
// Precomputed Array. For each byte a1,a2,...,a8, this
// array contains the halfword a1,0,a2,0,a3,0,...,0,a8.
// This is used for fast squaring.
const static unsigned short POW[0x100] = {
0x0000,0x0001,0x0004,0x0005,0x0010,0x0011,0x0014,0x0015,
0x0040,0x0041,0x0044,0x0045,0x0050,0x0051,0x0054,0x0055,
0x0100,0x0101,0x0104,0x0105,0x0110,0x0111,0x0114,0x0115,
0x0140,0x0141,0x0144,0x0145,0x0150,0x0151,0x0154,0x0155,
0x0400,0x0401,0x0404,0x0405,0x0410,0x0411,0x0414,0x0415,
0x0440,0x0441,0x0444,0x0445,0x0450,0x0451,0x0454,0x0455,
0x0500,0x0501,0x0504,0x0505,0x0510,0x0511,0x0514,0x0515,
0x0540,0x0541,0x0544,0x0545,0x0550,0x0551,0x0554,0x0555,
0x1000,0x1001,0x1004,0x1005,0x1010,0x1011,0x1014,0x1015,
0x1040,0x1041,0x1044,0x1045,0x1050,0x1051,0x1054,0x1055,
0x1100,0x1101,0x1104,0x1105,0x1110,0x1111,0x1114,0x1115,
0x1140,0x1141,0x1144,0x1145,0x1150,0x1151,0x1154,0x1155,
0x1400,0x1401,0x1404,0x1405,0x1410,0x1411,0x1414,0x1415,
0x1440,0x1441,0x1444,0x1445,0x1450,0x1451,0x1454,0x1455,
0x1500,0x1501,0x1504,0x1505,0x1510,0x1511,0x1514,0x1515,
0x1540,0x1541,0x1544,0x1545,0x1550,0x1551,0x1554,0x1555,
0x4000,0x4001,0x4004,0x4005,0x4010,0x4011,0x4014,0x4015,
0x4040,0x4041,0x4044,0x4045,0x4050,0x4051,0x4054,0x4055,
0x4100,0x4101,0x4104,0x4105,0x4110,0x4111,0x4114,0x4115,
0x4140,0x4141,0x4144,0x4145,0x4150,0x4151,0x4154,0x4155,
0x4400,0x4401,0x4404,0x4405,0x4410,0x4411,0x4414,0x4415,
0x4440,0x4441,0x4444,0x4445,0x4450,0x4451,0x4454,0x4455,
0x4500,0x4501,0x4504,0x4505,0x4510,0x4511,0x4514,0x4515,
0x4540,0x4541,0x4544,0x4545,0x4550,0x4551,0x4554,0x4555,
0x5000,0x5001,0x5004,0x5005,0x5010,0x5011,0x5014,0x5015,
0x5040,0x5041,0x5044,0x5045,0x5050,0x5051,0x5054,0x5055,
0x5100,0x5101,0x5104,0x5105,0x5110,0x5111,0x5114,0x5115,
0x5140,0x5141,0x5144,0x5145,0x5150,0x5151,0x5154,0x5155,
0x5400,0x5401,0x5404,0x5405,0x5410,0x5411,0x5414,0x5415,
0x5440,0x5441,0x5444,0x5445,0x5450,0x5451,0x5454,0x5455,
0x5500,0x5501,0x5504,0x5505,0x5510,0x5511,0x5514,0x5515,
0x5540,0x5541,0x5544,0x5545,0x5550,0x5551,0x5554,0x5555 };
void __fastcall lshift2(word* a, word n);
word* __fastcall __reduce(word* c, word* temp) {
register int i;
register word T;
if (!deg(&c[SIZE_WORDS]) && deg(c)<=SIZE_BITS)
goto __reduce_done;
for (i=SIZE_WORDS2-1;i>=SIZE_WORDS;i--) {
T = c[i];
c[i-SIZE_WORDS] ^= (T<<5) ^ (T<<7) ^ (T<<10) ^ (T<<15);
c[i-SIZE_WORDS+1] ^= (T>>27) ^ (T>>25) ^ (T>>22) ^ (T>>17);
}
T = c[SIZE_WORDS-1] >> 27;
c[0] = c[0] ^ T ^ (T<<2) ^ (T<<5) ^ (T<<10);
c[SIZE_WORDS-1] = c[SIZE_WORDS-1] & EMPTY_MASK;
__reduce_done:
return memcpy(temp, c, SIZE_BYTES);
}
word* polyMul(word* r, word* a, word* b) {
register int k,j;
word c[SIZE_WORDS2] = {0};
for (k=31;k>=0;k--) {
for (j=0;j<SIZE_WORDS;j++)
if ((a[j]>>k)&1)
polyAddTo(&c[j],b);
if (k) lshift2(c,1);
}
return __reduce(c,r);
}
word* polySqr(word* r, word *a) {
register word i;
word c[SIZE_WORDS2] = {0};
for (i=0;i<SIZE_WORDS;i++) {
c[2*i] = POW[(a[i]>>0x00) & 0xFF];
c[2*i] += POW[(a[i]>>0x08) & 0xFF] << 0x10;
c[2*i+1] = POW[(a[i]>>0x10) & 0xFF];
c[2*i+1] += POW[(a[i]>>0x18) & 0xFF] << 0x10;
}
return __reduce(c,r);
}
word* polyDiv(word *c, word *a, word *b) {
word t[SIZE_WORDS];
return polyMul(c,a,polyInv(t,b));
}
word* polyInv(word *r, word* a) {
register word t;
register int j;
word x[5*SIZE_WORDS] = {0},
*v = &x[1*SIZE_WORDS],
*u = &x[2*SIZE_WORDS],
*g = &x[3*SIZE_WORDS],
*f = &x[4*SIZE_WORDS];
memcpy(u,a,SIZE_BYTES);
v[0] = 0x425;
v[SIZE_WORDS-1] = 0x08000000;
g[0] = 1;
inv_loop:
if (u[0]==1 || u[0]==0) {
for (j=1;j<SIZE_WORDS;j++)
if (u[j]) goto inv_run;
goto inv_done;
}
inv_run:
if ((j = deg(u)-deg(v)) < 0) {
t = (word) v; v = u; u = (word*) t; // v <-> u
t = (word) g; g = f; f = (word*) t; // g <-> f
j = -j; }
memcpy(x,v,SIZE_BYTES);
lshift(x,j);
polyAddTo(u,x); // u = u + v>>j
memcpy(x,f,SIZE_BYTES);
lshift(x,j);
polyAddTo(g,x); // g = g + f>>j
goto inv_loop;
inv_done:
memcpy(r, g,SIZE_BYTES);
return r;
}
__declspec(naked) word* __fastcall polyAddTo(word* a, word* b) {
__asm {
mov edi, ecx
mov ecx, SIZE_WORDS
_add_to_loop:
mov eax, [edx+4*ecx-4]
xor [edi+4*ecx-4], eax
loop _add_to_loop
mov ecx, edi
ret
} }
word* __fastcall polyAdd(word* c, word* a, word* b) {
__asm {
mov ebx,ecx
mov esi,b
mov ecx, SIZE_WORDS
_add_loop:
mov eax, [edx+4*ecx-4]
xor eax, [esi+4*ecx-4]
mov [ebx+4*ecx-4], eax
loop _add_loop
} }
_declspec(naked) void __fastcall lshift(word* a, word n) {
__asm {
test edx,edx
jnz _lshift_nonzeroshift
ret
_lshift_nonzeroshift:
mov edi,ecx
mov esi,edi
mov eax,edx
shr eax,3
test eax,eax
je _lshift_go
mov ecx, SIZE_BYTES
sub ecx, eax
js _lshift_zero
add edi, eax
_lshift_preloop:
dec ecx
mov bl, byte ptr[esi+ecx]
mov byte ptr [edi+ecx], bl
test ecx, ecx
jne _lshift_preloop
mov ecx, eax
_lshift_zloop:
mov byte ptr [esi+ecx-1],0
loop _lshift_zloop
;; add esi, eax
shl eax, 3
sub edx, eax
jz _lshift_done
_lshift_go:
mov ecx,SIZE_WORDS
mov edi,esi
clc
_lshift_subloop:
rcl dword ptr [edi], 1
inc edi
inc edi
inc edi
inc edi
loop _lshift_subloop
dec edx
jne _lshift_go
_lshift_done:
ret
_lshift_zero:
mov ecx,SIZE_BYTES
_lshift_zloop2:
mov byte ptr [esi+eax],0
loop _lshift_zloop2
ret
} }
_declspec(naked) void __fastcall lshift2(word* a, word n) {
__asm {
test edx,edx
jnz _lshift2_nonzeroshift
ret
_lshift2_nonzeroshift:
mov edi,ecx
mov esi,edi
mov eax,edx
shr eax,3
test eax,eax
je _lshift2_go
mov ecx, SIZE_BYTES2
sub ecx, eax
js _lshift2_zero
add edi, eax
_lshift2_preloop:
dec ecx
mov bl, byte ptr[esi+ecx]
mov byte ptr [edi+ecx], bl
test ecx, ecx
jne _lshift2_preloop
mov ecx, eax
_lshift2_zloop:
mov byte ptr [esi+ecx-1],0
loop _lshift2_zloop
;; add esi, eax
shl eax, 3
sub edx, eax
jz _lshift2_done
_lshift2_go:
mov edi, esi
mov ecx, SIZE_WORDS2
clc
_lshift2_subloop:
rcl dword ptr [edi],1
inc edi
inc edi
inc edi
inc edi
loop _lshift2_subloop
dec edx
jne _lshift2_go
_lshift2_done:
ret
_lshift2_zero:
mov ecx,SIZE_BYTES2
_lshift2_zloop2:
mov byte ptr [esi+eax],0
loop _lshift2_zloop2
ret
} }
_declspec(naked) void __fastcall rshift(word* a, word n) {
__asm {
mov edi,ecx
mov esi,edi
mov eax,edx
shr eax,3
test eax,eax
je _rshift_go
push eax
mov ebx, SIZE_BYTES
sub ebx, eax
js _rshift_zero
xor ecx, ecx
add edi, eax
_rshift_preloop:
mov al, byte ptr[edi+ecx]
mov byte ptr [esi+ecx], al
inc ecx
cmp ecx, ebx
jl _rshift_preloop
mov edi,esi
add edi,ebx
pop ecx
mov eax,ecx
_rshift_zloop:
mov byte ptr [edi+ecx-1],0
loop _rshift_zloop
shl eax, 3
sub edx, eax
jz _rshift_done
_rshift_go:
mov ecx,SIZE_WORDS
mov edi,esi
sub edi,4
shr dword ptr [edi+4*ecx],1
dec ecx
_rshift_subloop:
rcr dword ptr [edi+4*ecx], 1
loop _rshift_subloop
dec edx
jne _rshift_go
_rshift_done:
ret
_rshift_zero:
mov ecx,SIZE_BYTES
_rshift_zloop2:
mov byte ptr [esi+eax],0
loop _rshift_zloop2
ret
} }
word* polyGen() {
return calloc(SIZE_WORDS,sizeof(word));
}
word* polyRand() {
register int i;
register word* p = calloc(SIZE_WORDS,sizeof(word));
for (i=SIZE_WORDS-1;i>=0;i--)
p[i] = (rand()<<16) | rand();
p[SIZE_WORDS-1] &= EMPTY_MASK;
return p;
}
int deg(word *a) {
__asm {
mov edx, a
mov ecx, SIZE_WORDS
_deg_loop:
mov ebx, [edx+4*ecx-4]
test ebx,ebx
jnz _deg_done
loop _deg_loop
xor eax,eax
jmp _deg_zero
_deg_done:
finit
push 1
fild dword ptr [esp]
mov dword ptr [esp],0
push ebx
fild qword ptr [esp]
fyl2x
fstcw word ptr [esp] ; round down
fstcw word ptr [esp+2]
and word ptr[esp],0xF3FF
or word ptr[esp], 0x400
fldcw [esp]
frndint
fldcw [esp+2]
fistp qword ptr [esp]
pop ebx
pop eax
mov eax,32
dec ecx
mul ecx
add eax,ebx
inc eax
_deg_zero:
}
}
__declspec(naked) int __fastcall polyCmp(word* a, word* b) {
__asm {
mov edi, ecx ; edi=a/edx=b
xor eax, eax
mov ecx, SIZE_WORDS
__loop_cmp:
mov ebx, [edi+ecx*4-4] ; ebx = a[n]
mov esi, [edx+ecx*4-4] ; esi = b[n]
test ebx,ebx
jz __a_n_zero
test esi,esi
jnz __b_n_notzero
inc eax
ret
__b_n_notzero:
cmp ebx, esi
jz __next_cmp
jl __lesser_cmp
inc eax
ret
__lesser_cmp:
dec eax
ret
__next_cmp:
loop __loop_cmp
ret
__a_n_zero:
test esi,esi
jz __next_cmp
dec eax
ret
} }
__declspec(naked) int __fastcall polyIsZero(word* a) {
__asm {
mov edi, ecx ; edi=a/edx=b
xor eax, eax
mov ecx, SIZE_WORDS
__iszero_loop:
cmp [edi+ecx*4-4],0
jne __is_not_zero
loop __iszero_loop
inc eax
__is_not_zero:
ret
} }
word* dup(word* a) {
register word* b = polyGen();
memcpy(b,a,SIZE_BYTES);
return b;
}
---------------------------------<curve.h>---------------------------------
#ifndef _CURVE_H
#define _CURVE_H
#include "binfields.h"
// We will use, by default, the Koblitz Curve
// y2 + xy = x3 + 1
word* pointMul(word* R, word* k, word* P);
word* pointAdd(word* R, word* P, word* Q);
word* pointDbl(word* R, word* P);
word* pointNeg(word* P);
word* pointZero(word* P);
int isZero(word* P);
int isValidPoint(word* P);
int pointCmp(word* P, word* Q);
word* pointDup(word* P);
word* pointGen();
word* pointGenZero();
#endif
---------------------------------<curve.c>---------------------------------
#include <stdlib.h>
#include <memory.h>
#include "curve.h"
#define Y(__P) (&__P[SIZE_WORDS])
#define X(__P) __P
word PGEN[]= {
0xA01C8972,0xE2945283,0x4DCA88C7,0x988B4717,0x494776FB,0xBBD1BA39,
0xB4CEB08C,0x47DA304D,0x93B205E6,0x43709584,0x01841CA4,0x60248048,
0x0012D5D4,0xAC9CA297,0xF8103FE4,0x82189631,0x59923FBC,0x026EB7A8,
0x3EF1C7A3,0x01CD4C14,0x591984F6,0x320430C8,0x7BA7AF1B,0xB620B01A,
0xF772AEDC,0x4FBEBBB9,0xAC44AEA7,0x9D4979C0,0x006D8A2C,0xFFC61EFC,
0x9F307A54,0x4DD58CEC,0x3BCA9531,0x4F4AEADE,0x7F4FBF37,0x0349DC80
};
int isValidPoint(word* P) {
// y2+xy=x3+1
word t[SIZE_WORDS] = {0},
l[SIZE_WORDS] = {0};
polyAdd(t,X(P),Y(P));
polyMul(t,t,Y(P)); // left side
polySqr(l,X(P));
polyMul(l,l,X(P));
l[0] ^= 1; // right side
return !polyCmp(t,l);
}
word* pointDup(word *P) {
return memcpy(malloc(SIZE_WORDS2*sizeof(word)),P,SIZE_BYTES2);
}
int pointCmp(word* P, word* Q) {
switch (polyCmp(X(P),X(Q))) {
case 1: return 1;
case -1: return -1;
default: return polyCmp(Y(P),Y(Q));
} }
word* pointGen() {
word* P = malloc(SIZE_WORDS2 * sizeof(word));
return memcpy(P,PGEN,SIZE_BYTES2);
}
word* pointGenZero() {
word* pt = calloc(SIZE_WORDS2, sizeof(word));
pt[SIZE_WORDS-1] = 1 << 31;
return pt;
}
int isZero(word* P) {
return (P[SIZE_WORDS-1]>>31);
}
word* pointZero(word* P) {
memset(P,0,SIZE_BYTES2);
P[SIZE_WORDS-1] = 1 << 31;
return P;
}
word* pointDbl(word* R, word* P) {
word t1[SIZE_WORDS] = {0};
word t2[SIZE_WORDS];
if (isZero(P))
return (R==P)?R:memcpy(R,P,SIZE_BYTES2);
else if (!polyCmp(t1,P)) return pointZero(R);
polyDiv(Y(R),Y(P),X(P));
polyAddTo(Y(R),X(P));
R[SIZE_WORDS] ^= 1;
polySqr(t2,X(P)); // t2 = x2
polyInv(t1,t2);
polyAddTo(t1,t2); // t1 = x'
polyMul(Y(R),Y(R),t1);
polyAddTo(Y(R),t2);
return memcpy(R,t1,SIZE_BYTES);
}
word* pointNeg(word* P) {
if (isZero(P)) return P;
polyAddTo(Y(P),X(P));
return P;
}
word* pointAdd(word* R, word* P, word* Q) {
word l[SIZE_WORDS],
x[SIZE_WORDS],
y[SIZE_WORDS];
if (isZero(Q)) {
if (R==P) return R;
else return memcpy(R,P,SIZE_BYTES2);
} else if (isZero(P)) {
if (R==Q) return R;
else return memcpy(R,Q,SIZE_BYTES2);
} else {
if (!pointCmp(P,Q)) {
return pointDbl(R,P);
} else {
word* N = pointNeg(pointDup(Q));
if (!pointCmp(P,N)) {
pointZero(R);
free(N);
return R;
} else free(N);
polyAdd(l, Y(P), Y(Q));
polyAdd(x, X(P), X(Q));
polyDiv(l, l, x);
polyAddTo(x,polySqr(y,l));
polyAddTo(x,l);
polyAdd(y, x, X(P));
polyMul(y, y, l);
polyAddTo(y, x);
polyAdd(Y(R), y, Y(P));
return memcpy(X(R),x,SIZE_BYTES);
}
}
}
word* pointSub(word* R, word* P, word* Q) {
word* N = pointNeg(pointDup(Q));
pointAdd(R,P,N);
free(N);
return R;
}
word* pointMul(word* R, word* k, word* P) {
register int i;
word* Q = pointDup(P);
for (i=deg(k)-1;i>=0;i--) {
pointDbl(Q,Q);
if ((k[i/32]>>(i%32))&1)
pointAdd(Q, Q,P);
}
memcpy(R,Q,SIZE_BYTES2);
free(Q);
return R;
}
--------------------------------[ EOF ]--------------------------------
|