[FrontPage] [TitleIndex] [WordIndex

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

1. Debugging in general

Basic method of all debugging:

  1. Know what your program is supposed to do.
  2. Detect when it doesn't.
  3. Fix it.

A tempting mistake is to skip step 1, and just try randomly tweaking things until the program works. Better is to see what the program is doing internally, so you can see exactly where and when it is going wrong. A second temptation is to attempt to intuit where things are going wrong by staring at the code or the program's output. Avoid this temptation as well: let the computer tell you what it is really doing inside your program instead of guessing.

2. Assertions

Every non-trivial C program should include <assert.h>, which gives you the assert macro (see KernighanRitchie Appendix B6). The assert macro tests if a condition is true and halts your program with an error message if it isn't:

   1 #include <assert.h>
   2 
   3 int
   4 main(int argc, char **argv)
   5 {
   6     assert(2+2 == 5);
   7     return 0;
   8 }
no.c

$ gcc -o no no.c
$ ./no
no: no.c:6: main: Assertion `2+2 == 5' failed.

Line numbers and everything, even if you compile with the optimizer turned on. Much nicer than a mere segmentation fault, and if you run it under the debugger, the debugger will stop exactly on the line where the assert failed so you can poke around and see why.

3. gdb

The standard debugger on Linux is called gdb. This lets you run your program under remote control, so that you can stop it and see what is going on inside.

Let's look at a contrived example. Suppose you have the following program bogus.c:

   1 #include <stdio.h>
   2 
   3 /* Print the sum of the integers from 1 to 1000 */
   4 int
   5 main(int argc, char **argv)
   6 {
   7     int i;
   8     int sum;
   9 
  10     sum = 0;
  11     for(i = 0; i -= 1000; i++) {
  12         sum += i;
  13     }
  14     printf("%d\n", sum);
  15     return 0;
  16 }
bogus.c

Let's compile and run it and see what happens:

$ gcc -g3 -o bogus bogus.c
$ ./bogus
-34394132
$

That doesn't look like the sum of 1 to 1000. So what went wrong? If we were clever, we might notice that the test in the for loop is using the mysterious -= operator instead of the <= operator that we probably want. But let's suppose we're not so clever right now—it's four in the morning, we've been working on bogus.c for twenty-nine straight hours, and there's a -= up there because in our befuddled condition we know in our bones that it's the right operator to use. We need somebody else to tell us that we are deluding ourselves, but nobody is around this time of night. So we'll have to see what we can get the computer to tell us.

The first thing to do is fire up gdb, the debugger. This runs our program in stop-motion, letting us step through it a piece at a time and watch what it is actually doing. In the example below gdb is run from the command line. You can also run it directly from Emacs with M-x gdb, which lets Emacs track and show you where your program is in the source file with a little arrow, or (if you are logged in directly on a Zoo machine) by running ddd, which wraps gdb in a graphical user interface.

$ gdb bogus
GNU gdb 4.17.0.4 with Linux/x86 hardware watchpoint and FPU support
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
(gdb) run
Starting program: /home/accts/aspnes/tmp/bogus 
-34394132

Program exited normally.

So far we haven't learned anything. To see our program in action, we need to slow it down a bit. We'll stop it as soon as it enters main, and step through it one line at a time while having it print out the values of the variables.

(gdb) break main
Breakpoint 1 at 0x8048476: file bogus.c, line 9.
(gdb) run
Starting program: /home/accts/aspnes/tmp/bogus 

Breakpoint 1, main (argc=1, argv=0xbffff9ac) at bogus.c:9
9           sum = 0;
(gdb) display sum
1: sum = 1
(gdb) n
10          for(i = 0; i -= 1000; i++)
1: sum = 0
(gdb) display i
2: i = 0
(gdb) n
11              sum += i;
2: i = -1000
1: sum = 0
(gdb) n
10          for(i = 0; i -= 1000; i++)
2: i = -1000
1: sum = -1000
(gdb) n
11              sum += i;
2: i = -1999
1: sum = -1000
(gdb) n
10          for(i = 0; i -= 1000; i++)
2: i = -1999
1: sum = -2999
(gdb) quit
The program is running.  Exit anyway? (y or n) y
$

Here we are using break main to tell the program to stop as soon as it enters main, display to tell it to show us the value of the variables i and sum whenever it stops, and n (short for next) to execute the program one line at a time.

When stepping through a program, gdb displays the line it will execute next as well as any variables you've told it to display. This means that any changes you see in the variables are the result of the previous displayed line. Bearing this in mind, we see that i drops from 0 to -1000 the very first time we hit the top of the for loop and drops to -1999 the next time. So something bad is happening in the top of that for loop, and if we squint at it a while we might begin to suspect that i -= 1000 is not the nice simple test we might have hoped it was.

3.1. My favorite gdb commands

help
Get a description of gdb's commands.
run
Runs your program. You can give it arguments that get passed in to your program just as if you had typed them to the shell. Also used to restart your program from the beginning if it is already running.
quit
Leave gdb, killing your program if necessary.
break
Set a breakpoint, which is a place where gdb will automatically stop your program. Some examples:
  • break somefunction stops before executing the first line somefunction.

  • break 117 stops before executing line number 117.

list
Show part of your source file with line numbers (handy for figuring out where to put breakpoints). Examples:
  • list somefunc lists all lines of somefunc.

  • list 117-123 lists lines 117 through 123.

next
Execute the next line of the program, including completing any procedure calls in that line.
step
Execute the next step of the program, which is either the next line if it contains no procedure calls, or the entry into the called procedure.
finish
Continue until you get out of the current procedure (or hit a breakpoint). Useful for getting out of something you stepped into that you didn't want to step into.
cont

(Or continue). Continue until (a) the end of the program, (b) a fatal error like a Segmentation Fault or Bus Error, or (c) a breakpoint. If you give it a numeric argument (e.g., cont 1000) it will skip over that many breakpoints before stopping.

print

Print the value of some expression, e.g. print i.

display

Like print, but runs automatically every time the program stops. Useful for watching values that change often.

3.2. Debugging strategies

In general, the idea behind debugging is that a bad program starts out sane, but after executing for a while it goes bananas. If you can find the exact moment in its execution where it first starts acting up, you can see exactly what piece of code is causing the problem and have a reasonably good chance of being able to fix it. So a typical debugging strategy is to put in a breakpoint (using break) somewhere before the insanity hits, "instrument" the program (using display) so that you can watch it going insane, and step through it (using next, step, or breakpoints and cont) until you find the point of failure. Sometimes this process requires restarting the program (using run) if you skip over this point without noticing it immediately.

For large or long-running programs, it often makes sense to do binary search to find the point of failure. Put in a breakpoint somewhere (say, on a function that is called many times or at the top of a major loop) and see what the state of the program is after going through the breakpoint 1000 times (using something like cont 1000). If it hasn't gone bonkers yet, try restarting and going through 2000 times. Eventually you bracket the error as occurring (for example) somewhere between the 4000th and 8000th occurrence of the breakpoint. Now try stepping through 6000 times; if the program is looking good, you know the error occurs somewhere between the 6000th and 8000th breakpoint. A dozen or so more experiments should be enough isolate the bug to a specific line of code.

The key to all debugging is knowing what your code is supposed to do. If you don't know this, you can't tell the lunatic who thinks he's Napoleon from lunatic who really is Napoleon. If you're confused about what your code is supposed to be doing, you need to figure out what exactly you want it to do. If you can figure that out, often it will be obvious what is going wrong. If it isn't obvious, you can always go back to gdb.

4. Valgrind

The valgrind program can be used to detect some (but not all) common errors in C programs that use pointers and dynamic storage allocation. On the Zoo, you can run valgrind on your program by putting valgrind at the start of the command line:

valgrind ./my-program arg1 arg2 < test-input

This will run your program and produce a report of any allocations and de-allocations it did. It will also warn you about common errors like using unitialized memory, dereferencing pointers to strange places, writing off the end of blocks allocated using malloc, or failing to free blocks.

You can suppress all of the output except errors using the -q option, like this:

valgrind -q ./my-program arg1 arg2 < test-input

You can also turn on more tests, e.g.

valgrind -q --tool=memcheck --leak-check=yes ./my-program arg1 arg2 < test-input

See valgrind --help for more information about the (many) options, or look at the documentation at http://valgrind.org/ for detailed information about what the output means. For some common valgrind messages, see the examples section below.

4.1. Compilation flags

You can run valgrind on any program (try valgrind ls); it does not require special compilation. However, the output of valgrind will be more informative if you compile your program with debugging information turned on using the -g or -g3 flags (this is also useful if you plan to watch your program running using gdb). See HowToUseTheComputingFacilities for more information about debugging and debugging flags.

4.2. Automated testing

Unless otherwise specified, automated testing of your program will be done using the script in /c/cs223/bin/vg; this runs /c/cs223/bin/valgrind with the --tool=memcheck, --leak-check=yes, and -q options, throws away your program's output, and replaces it with valgrind's output. If you have a program named ./prog, running /c/cs223/bin/vg ./prog should produce no output.

4.3. Examples of some common valgrind errors

Here are some examples of valgrind output. In each case the example program is compiled with -g3 so that valgrind can report line numbers from the source code.

4.3.1. Uninitialized values

Consider this unfortunate program, which attempts to compare two strings, one of which we forgot to ensure was null-terminated:

   1 #include <stdio.h>
   2 
   3 int
   4 main(int argc, char **argv)
   5 {
   6     char a[2];
   7 
   8     a[0] = 'a';
   9 
  10     if(!strcmp(a, "a")) {
  11         puts("a is \"a\"");
  12     }
  13 
  14     return 0;
  15 }
uninitialized.c

Run without valgrind, we see no errors, because we got lucky and it turned out our hand-built string was null-terminated anyway:

$ ./uninitialized 
a is "a"

But valgrind is not fooled:

$ valgrind -q ./uninitialized 
==4745== Conditional jump or move depends on uninitialised value(s)
==4745==    at 0x4026663: strcmp (mc_replace_strmem.c:426)
==4745==    by 0x8048435: main (uninitialized.c:10)
==4745== 
==4745== Conditional jump or move depends on uninitialised value(s)
==4745==    at 0x402666C: strcmp (mc_replace_strmem.c:426)
==4745==    by 0x8048435: main (uninitialized.c:10)
==4745== 
==4745== Conditional jump or move depends on uninitialised value(s)
==4745==    at 0x8048438: main (uninitialized.c:10)
==4745== 

Here we get a lot of errors, but they are all complaining about the same call to strcmp. Since it's unlikely that strcmp itself is buggy, we have to assume that we passed some uninitialized location into it that it is looking at. The fix is to add an assignment a[1] = '\0' so that no such location exists.

4.3.2. Bytes definitely lost

Here is a program that calls malloc but not free:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 
   4 int
   5 main(int argc, char **argv)
   6 {
   7     char *s;
   8 
   9     s = malloc(26);
  10 
  11     return 0;
  12 }
missing_free.c

With no extra arguments, valgrind will not look for this error. But if we turn on --leak-check=yes, it will complain:

$ valgrind -q --leak-check=yes ./missing_free
==4776== 26 bytes in 1 blocks are definitely lost in loss record 1 of 1
==4776==    at 0x4024F20: malloc (vg_replace_malloc.c:236)
==4776==    by 0x80483F8: main (missing_free.c:9)
==4776== 

Here the stack trace in the output shows where the bad block was allocated: inside malloc (specifically the paranoid replacement malloc supplied by valgrind), which was in turn called by main in line 9 of missing_free.c. This lets us go back and look at what block was allocated in that line and try to trace forward to see why it wasn't freed. Sometimes this is as simple as forgetting to include a free statement anywhere, but in more complicated cases it may be because I somehow lose the pointer to the block by overwriting the last variable that points to it or by embedding it in some larger structure whose components I forget to free individually.

4.3.3. Invalid write or read operations

These are usually operations that you do off the end of a block from malloc or on a block that has already been freed.

An example of the first case:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 
   5 int
   6 main(int argc, char **argv)
   7 {
   8     char *s;
   9 
  10     s = malloc(1);
  11     s[0] = 'a';
  12     s[1] = '\0';
  13 
  14     puts(s);
  15 
  16     return 0;
  17 }
invalid_operations.c

==7141== Invalid write of size 1
==7141==    at 0x804843B: main (invalid_operations.c:12)
==7141==  Address 0x419a029 is 0 bytes after a block of size 1 alloc'd
==7141==    at 0x4024F20: malloc (vg_replace_malloc.c:236)
==7141==    by 0x8048428: main (invalid_operations.c:10)
==7141== 
==7141== Invalid read of size 1
==7141==    at 0x4026063: __GI_strlen (mc_replace_strmem.c:284)
==7141==    by 0x409BCE4: puts (ioputs.c:37)
==7141==    by 0x8048449: main (invalid_operations.c:14)
==7141==  Address 0x419a029 is 0 bytes after a block of size 1 alloc'd
==7141==    at 0x4024F20: malloc (vg_replace_malloc.c:236)
==7141==    by 0x8048428: main (invalid_operations.c:10)
==7141== 

An example of the second:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 
   5 int
   6 main(int argc, char **argv)
   7 {
   8     char *s;
   9 
  10     s = malloc(2);
  11     free(s);
  12 
  13     s[0] = 'a';
  14     s[1] = '\0';
  15 
  16     puts(s);
  17 
  18     return 0;
  19 }
freed_block.c

==7144== Invalid write of size 1
==7144==    at 0x804846D: main (freed_block.c:13)
==7144==  Address 0x419a028 is 0 bytes inside a block of size 2 free'd
==7144==    at 0x4024B3A: free (vg_replace_malloc.c:366)
==7144==    by 0x8048468: main (freed_block.c:11)
==7144== 
==7144== Invalid write of size 1
==7144==    at 0x8048477: main (freed_block.c:14)
==7144==  Address 0x419a029 is 1 bytes inside a block of size 2 free'd
==7144==    at 0x4024B3A: free (vg_replace_malloc.c:366)
==7144==    by 0x8048468: main (freed_block.c:11)
==7144== 
==7144== Invalid read of size 1
==7144==    at 0x4026058: __GI_strlen (mc_replace_strmem.c:284)
==7144==    by 0x409BCE4: puts (ioputs.c:37)
==7144==    by 0x8048485: main (freed_block.c:16)
[... more lines of errors deleted ...]

In both cases the problem is that we are operating on memory that is not guaranteed to be allocated to us. For short programs like these, we may get lucky and have the program work anyway. But we still want to avoid bugs like this because we might not get lucky.

How do we know which case is which? If I write off the end of an existing block, I'll see something like Address 0x419a029 is 0 bytes after a block of size 1 alloc'd, telling me that I am working on an address after a block that is still allocated. When I try to write to a freed block, the message changes to Address 0x419a029 is 1 bytes inside a block of size 2 free'd, where the free'd part tells me I freed something I probably shouldn't have. Fixing the first class of bugs is usually just a matter of allocating a bigger block (but don't just do this without figuring out why you need a bigger block, or you'll just be introducing random mutations into your code that may cause other problems elsewhere). Fixing the second class of bugs usually involves figuring out why you freed this block prematurely. In some cases you may need to re-order what you are doing so that you don't free a block until you are completely done with it.

5. Not recommended: debugging output

A tempting but usually bad approach to debugging is to put lots of printf statements in your code to show what is going on. The problem with this compared to using assert is that there is no built-in test to see if the output is actually what you'd expect. The problem compared to gdb is that it's not flexible: you can't change your mind about what is getting printed out without editing the code. A third problem is that the output can be misleading: in particular, printf output is usually buffered, which means that if your program dies suddenly there may be output still in the buffer that is never flushed to stdout. This can be very confusing, and can lead you to believe that your program fails earlier than it actually does.

If you really need to use printf or something like it for debugging output, here are a few rules of thumb to follow to mitigate the worst effects:

  1. Use fprintf(stderr, ...) instead of printf(...); this allows you to redirect your program's regular output somewhere that keeps it separate from the debugging output (but beware of misleading interleaving of the two streams—buffering may mean that output to stdout and stderr appears to arrive out of order). It also helps that output to stderr is usually unbuffered, avoiding the problem of lost output.

  2. If you must output to stdout, put fflush(stdout) after any output operation you suspect is getting lost in the buffer. The fflush function forces any buffered output to be emitted immediately.

  3. Keep all arguments passed to printf as simple as possible and beware of faults in your debugging code itself. If you write printf("a[key] == %d\n", a[key]) and key is some bizarre value, you will never see the result of this printf because your program will segfault while evaluating a[key]. Naturally, this is more likely to occur if the argument is a[key]->size[LEFTOVERS].cleanupFunction(a[key]) than if it's just a[key], and if it happens it will be harder to figure out where in this complex chain of array indexing and pointer dereferencing the disaster happened. Better is to wait for your program to break in gdb, and use the print statement on increasingly large fragments of the offending expression to see where the bogus array index or surprising null pointer is hiding.


CategoryProgrammingNotes


2014-06-17 11:57