For this assignment, you are to write a procedure `findCollision` that searches for collisions in a hash function: distinct values that produce the same result. The declaration of `findCollision` is given in the file `findCollision.h`:

The first argument `f` gives the function to test. The remaining arguments `x1` and `x2` are for returning the collision values. To enforce that collisions exist, only the bottom 31 bits of the output of `f` are used. A collision is thus a pair of distinct values `x1` and `x2` such that `f(x1) & 0x7fffffff` and `f(x2) & 0x7fffffff` are equal.

The fact that `f` takes 32-bit inputs and we only care about 31 bits of the output mean that there are going to be a lot of collisions: on average, two input values map to each output value. But finding them efficiently may be tricky. A direct approach of feeding `f` all 2^{32} possible inputs may go through 2^{31}+1 of them before finding the first duplicate output. This is not only going to require a lot of memory to store the past values, it is also going to take a lot of time if `f` is slow to compute. Instead, you may find it helpful to use a randomized birthday attack.

# 1. Exploiting the Birthday Paradox

The Birthday Paradox is the surprising fact that, even though there are 366 possible birthdays, a group of 23 people has a 50% chance of having two people with the same birthday, and the probability of two people with the same birthday rises rapidly as the group gets larger. The reason for this is that even though each pair of persons has a probability of only 1/366 or thereabouts^{1} of sharing a birthday, the number of pairs of people in a group of n people grows as O(n^{2}). So the number of people needed to get a least one pair with the same birthday on average is roughly the square root of the number of possible birthdays.

For 2^{31} possible function values, if we generate inputs at random, we expect to find our first collision after something in the neighborhood of 2^{16} inputs. This is a small enough number that it is feasible both to make 2^{16} calls to `f`, even if `f` is moderately expensive, and to store the results in a data structure that lets us detect collisions easily.

# 2. What you need to write

Your job is to supply the `findCollision` procedure in a file `findCollision.c`. The declaration for `findCollision` is given in findCollision.h above; you should include this file in your `findCollision.c` file to ensure consistency. We have also provided a simple test program testCollision.c and a Makefile. These can be found in `/c/cs223/Hwk8/template` on the Zoo if you don't want to download them from here.

Your `findCollision.c` file should not export any global variables except `findCollision`. If you need to write any other functions, make them static.

You should not assume that `f` returns a value in the range `0`..`0x7fffffff`; you may need to mask off the top bit of `f`'s value yourself.

It may be that `f` is very expensive. You should try to avoid calling it more than you have to.

# 3. Submitting your assignment

Submit your file `findCollision.c` using `/c/cs223/bin/submit 8 findCollision.c` as usual. You can run `/c/cs223/bin/testit 8 public` to check your submitted file using the public test script in `/c/cs223/Hwk8/test.public`, which may or may not resemble the final test script.

# 4. Sample solution

```
1 #include <stdlib.h>
2 #include <assert.h>
3
4 #include "findCollision.h"
5
6 /* a convenient prime */
7 #define HASH_SIZE ((1<<17)+29)
8
9 /* return a rand 32-bit int */
10 /* calls rand() twice to get around RAND_MAX limit */
11 static unsigned int
12 urand(void)
13 {
14 return (rand() << 16) ^ rand();
15 }
16
17 /* output of f is masked by this */
18 #define MASK (0x7fffffff)
19
20 /* finds values x1 != x2 such that f(x1) & MASK == f(x2) & MASK */
21 void
22 findCollision(unsigned int (*f)(unsigned int), unsigned int *x1, unsigned int *x2)
23 {
24 unsigned int input;
25 unsigned int output;
26 unsigned int hash;
27 unsigned int multiplier; /* for hash function */
28
29 /* make this static so we don't blow out the stack */
30 static struct {
31 int full; /* this slot is used */
32 unsigned int input;
33 unsigned int output;
34 } table[HASH_SIZE];
35
36 int i;
37
38 multiplier = 2*rand()+1;
39
40 /* clear the table */
41 for(i = 0; i < HASH_SIZE; i++) {
42 table[i].full = 0;
43 }
44
45 /* keep going until we win */
46 for(;;) {
47 input = urand();
48 output = f(input) & MASK;
49
50 hash = (output * multiplier) % HASH_SIZE;
51
52 if(table[hash].full && table[hash].input != input && table[hash].output == output) {
53 *x1 = input;
54 *x2 = table[hash].input;
55
56 /* let's be safe about what we are claiming */
57 assert(*x1 != *x2);
58 assert((f(*x1) & MASK) == (f(*x2) & MASK));
59
60 return;
61 }
62
63 /* else */
64 table[hash].full = 1;
65 table[hash].input = input;
66 table[hash].output = output;
67 }
68 }
```

The probability is higher than this, both because most years have only 365 days, and because some birthdays are more likely than others. (1)