[FrontPage] [TitleIndex] [WordIndex

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

1. String processing in general

Processing strings of characters is one of the oldest application of mechanical computers, arguably predating numerical computation by at least fifty years. Assuming you've already solved the problem of how to represent characters in memory (e.g. as the C char type encoded in ASCII), there are two standard ways to represent strings:

2. C strings

Because delimited strings are more lightweight, C went for delimited strings. A string is a sequence of characters terminated by a null character '\0'. Note that the null character is not the same as a null pointer, although both appear to have the value 0 when used in integer contexts. A string is represented by a variable of type char *, which points to the zeroth character of the string. The programmer is responsible for allocating and managing space to store strings, except for explicit string constants, which are stored in a special non-writable string space by the compiler.

If you want to use counted strings instead, you can build your own using a struct (see C/Structs). Most scripting languages written in C (e.g. Perl, Python_programming_language, PHP, etc.) use this approach internally. (Tcl is an exception, which is one of many good reasons not to use Tcl).

3. String constants

A string constant in C is represented by a sequence of characters within double quotes. Standard C character escape sequences like \n (newline), \r (carriage return), \a (bell), \0x17 (character with hexadecimal code 0x17), \\ (backslash), and \" (double quote) can all be used inside string constants. The value of a string constant has type const char *, and can be assigned to variables and passed as function arguments or return values of this type.

Two string constants separated only by whitespace will be concatenated by the compiler as a single constant: "foo" "bar" is the same as "foobar". This feature is not used in normal code much, but shows up sometimes in macros (see C/Macros).

4. String buffers

The problem with string constants is that you can't modify them. If you want to build strings on the fly, you will need to allocate space for them. The traditional approach is to use a buffer, an array of chars. Here is a particularly painful hello-world program that builds a string by hand:

   1 #include <stdio.h>
   2 
   3 int
   4 main(int argc, char **argv)
   5 {
   6     char hi[3];
   7 
   8     hi[0] = 'h';
   9     hi[1] = 'i';
  10     hi[2] = '\0';
  11 
  12     puts(hi);
  13 
  14     return 0;
  15 }
hi.c

Note that the buffer needs to have size at least 3 in order to hold all three characters. A common error in programming with C strings is to forget to leave space for the null at the end (or to forget to add the null, which can have comical results depending on what you are using your surprisingly long string for).

5. Operations on strings

Unlike many programming languages, C provides only a rudimentary string-processing library. The reason is that many common string-processing tasks in C can be done very quickly by hand.

For example, suppose we want to copy a string from one buffer to another. The library function strcpy declared in string.h will do this for us (and is usually the right thing to use), but if it didn't exist we could write something very close to it using a famous C idiom.

   1 void
   2 strcpy2(char *dest, const char *src)
   3 {
   4     /* This line copies characters one at a time from *src to *dest. */
   5     /* The postincrements increment the pointers (++ binds tighter than *) */
   6     /*  to get to the next locations on the next iteration through the loop. */
   7     /* The loop terminates when *src == '\0' == 0. */
   8     /* There is no loop body because there is nothing to do there. */
   9     while(*dest++ = *src++);
  10 }

The externally visible difference between strcpy2 and the original strcpy is that strcpy returns a char * equal to its first argument. It is also likely that any implementation of strcpy found in a recent C library takes advantage of the width of the memory data path to copy more than one character at a time.

Most C programmers will recognize the while(*dest++ = *src++); from having seen it before, although experienced C programmers will generally be able to figure out what such highly abbreviated constructions mean. Exposure to such constructions is arguably a form of hazing.

Because C pointers act exactly like array names, you can also write strcpy2 using explicit array indices. The result is longer but may be more readable if you aren't a C fanatic.

   1 char *
   2 strcpy2a(char *dest, const char *src)
   3 {
   4     int ;
   5 
   6     i = 0;
   7     for(i = 0; src[i] != '\0'; i++) {
   8         dest[i] = src[i];
   9     }
  10 
  11     /* note that the final null in src is not copied by the loop */
  12     dest[i] = '\0';
  13 
  14     return dest;
  15 }

An advantage of using a separate index in strcpy2a is that we don't trash dest, so we can return it just like strcpy does. (In fairness, strcpy2 could have saved a copy of the original location of dest and done the same thing.)

Note that nothing in strcpy2, strcpy2a, or the original strcpy will save you if dest points to a region of memory that isn't big enough to hold the string at src, or if somebody forget to tack a null on the end of src (in which case strcpy will just keep going until it finds a null character somewhere). As elsewhere, it's your job as a programming to make sure there is enough room. Since the compiler has no idea what dest points to, this means that you have to remember how much room is available there yourself.

If you are worried about overrunning dest, you could use strncpy instead. The strncpy function takes a third argument that gives the maximum number of characters to copy; however, if src doesn't contain a null character in this range, the resulting string in dest won't either. Usually the only practical application to strncpy is to extract the first k characters of a string, as in

   1 /* copy the substring of src consisting of characters at positions
   2     start..end-1 (inclusive) into dest */
   3 /* If end-1 is past the end of src, copies only as many characters as 
   4     available. */
   5 /* If start is past the end of src, the results are unpredictable. */
   6 /* Returns a pointer to dest */
   7 char *
   8 copySubstring(char *dest, const char *src, int start, int end)
   9 {
  10     /* copy the substring */
  11     strncpy(dest, src + start, end - start);
  12 
  13     /* add null since strncpy probably didn't */
  14     dest[end - start] = '\0';
  15 
  16     return dest;
  17 }

Another quick and dirty way to extract a substring of a string you don't care about (and can write to) is to just drop a null character in the middle of the sacrificial string. This is generally a bad idea unless you are certain you aren't going to need the original string again, but it's a surprisingly common practice among C programmers of a certain age.

A similar operation to strcpy is strcat. The difference is that strcat concatenates src on to the end of dest; so that if dest previous pointed to "abc" and src to "def", dest will now point to "abcdef". Like strcpy, strcat returns its first argument. A no-return-value version of strcat is given below.

   1 void
   2 strcat2(char *dest, const char *src)
   3 {
   4     while(*dest) dest++;
   5     while(*dest++ = *src++);
   6 }

Decoding this abomination is left as an exercise for the reader. There is also a function strncat which has the same relationship to strcat that strncpy has to strcpy.

As with strcpy, the actual implementation of strcat may be much more subtle, and is likely to be faster than rolling your own.

6. Finding the length of a string

Because the length of a string is of fundamental importance in C (e.g., when deciding if you can safely copy it somewhere else), the standard C library provides a function strlen that counts the number of non-null characters in a string. Here's a possible implementation:

   1 int
   2 strlen(const char *s)
   3 {
   4     int i;
   5 
   6     for(i = 0; *s; i++, s++);
   7 
   8     return i;
   9 }

Note the use of the comma operator in the increment step. The comma operator applied to two expressions evaluates both of them and discards the value of the first; it is usually used only in for loops where you want to initialize or advance more than one variable at once.

Like the other string routines, using strlen requires including string.h.

6.1. The strlen tarpit

A common mistake is to put a call to strlen in the header of a loop; for example:

   1 /* like strcpy, but only copies characters at indices 0, 2, 4, ...
   2    from src to dest */
   3 char *
   4 copyEvenCharactersBadVersion(char *dest, const char *src)
   5 {
   6     int i;
   7     int j;
   8 
   9     /* BAD: Calls strlen on every pass through the loop */
  10     for(i = 0, j = 0; i < strlen(src); i += 2, j++) {
  11         dest[j] = src[i];
  12     }
  13 
  14     dest[j] = '\0';
  15 
  16     return dest;
  17 }

The problem is that strlen has to scan all of src every time the test is done, which adds time proportional to the length of src to each iteration of the loop. So copyEvenCharactersBadVersion takes time proportional to the square of the length of src.

Here's a faster version:

   1 /* like strcpy, but only copies characters at indices 0, 2, 4, ...
   2    from src to dest */
   3 char *
   4 copyEvenCharacters(char *dest, const char *src)
   5 {
   6     int i;
   7     int j;
   8     int len;    /* length of src */
   9 
  10     len = strlen(src);
  11 
  12     /* GOOD: uses cached value of strlen(src) */
  13     for(i = 0, j = 0; i < len; i += 2, j++) {
  14         dest[j] = src[i];
  15     }
  16 
  17     dest[j] = '\0';
  18 
  19     return dest;
  20 }

Because it doesn't call strlen all the time, this version of copyEvenCharacters will run much faster than the original even on small strings, and several million times faster if src is a megabyte long.

7. Comparing strings

If you want to test if strings s1 and s2 contain the same characters, writing s1 == s2 won't work, since this tests instead whether s1 and s2 point to the same address. Instead, you should use strcmp, declared in string.h. The strcmp function walks along both of its arguments until it either hits a null on both and returns 0, or hits two different characters, and returns a positive integer if the first string's character is bigger and a negative integer if the second string's character is bigger (a typical implementation will just subtract the two characters). A possible but slow implementation might look like this:

   1 int
   2 strcmp(const char *s1, const char *s2)
   3 {
   4     while(*s1 && *s2 && *s1 == *s2) {
   5         s1++;
   6         s2++;
   7     }
   8 
   9     return *s1 - *s2;
  10 }

(The reason this implementation is slow on modern hardware is that it only compares the strings one character at a time; it is almost always faster to compare four characters at once on a 32-bit architecture, although doing so requires no end of trickiness to detect the end of the strings. It is also likely that whatever C library you are using contains even faster hand-coded assembly language versions of strcmp and the other string routines for most of the CPU architectures you are likely to use. Under some circumstances, the compiler when running with the optimizer turned on may even omit a function call entirely and just patch the appropriate assembly-language code directly into whatever routine calls strcmp, strlen etc. As a programmer, you should not be able to detect that any of these optimizations are happening, but they are another reason to use standard C language or library features when you can.)

To use strcmp to test equality, test if the return value is 0:

   1     if(strcmp(s1, s2) == 0) {
   2         /* strings are equal */
   3         ...
   4     }

You may sometimes see this idiom instead:

   1     if(!strcmp(s1, s2)) {
   2         /* strings are equal */
   3         ...
   4     }

My own feeling is that the first version is more clear, since !strcmp always suggested to me that you were testing for not some property (e.g. not equal). But if you think of strcmp as telling you when two strings are different rather than when they are equal, this may not be so confusing.

8. Formatted output to strings

You can write formatted output to a string buffer with sprintf just like you can write it to stdout with printf or to a file with fprintf. Make sure when you do so that there is enough room in the buffer you are writing to, or the usual bad things will happen.

9. Dynamic allocation of strings

When allocating space for a copy of a string s using malloc, the required space is strlen(s)+1. Don't forget the +1, or bad things may happen.1 Because allocating space for a copy of a string is such a common operation, many C libraries provide a strdup function that does exactly this. If you don't have one (it's not required by the C standard), you can write your own like this:

   1 /* return a freshly-malloc'd copy of s */
   2 /* or 0 if malloc fails */
   3 /* It is the caller's responsibility to free the returned string when done. */
   4 char *
   5 strdup(const char *s)
   6 {
   7     char *s2;
   8 
   9     s2 = malloc(strlen(s)+1);
  10 
  11     if(s2 != 0) {
  12         strcpy(s2, s);
  13     }
  14 
  15     return s2;
  16 }

Exercise: Write a function strcat_alloc that returns a freshly-malloc'd string that concatenates its two arguments. Exactly how many bytes do you need to allocate?

10. argc and argv

Now that we know about strings, we can finally do something with argv. Recall that argv in main is declared as char **; this means that it is a pointer to a pointer to a char, or in this case the base address of an array of pointers to char, where each such pointer references a string. These strings correspond to the command-line arguments to your program, with the program name itself appearing in argv[0]2 The count argc counts all arguments including argv[0]; it is 1 if your program is called with no arguments and larger otherwise.

Here is a program that prints its arguments. If you get confused about what argc and argv do, feel free to compile this and play with it:

   1 #include <stdio.h>
   2 
   3 int
   4 main(int argc, char **argv)
   5 {
   6     int i;
   7 
   8     printf("argc = %d\n\n", argc);
   9 
  10     for(i = 0; i < argc; i++) {
  11 	printf("argv[%d] = %s\n", i, argv[i]);
  12     }
  13 
  14     return 0;
  15 }
print_args.c

Like strings, C terminates argv with a null: the value of argv[argc] is always 0 (a null pointer to char). In principle this allows you to recover argc if you lose it.


CategoryProgrammingNotes

  1. In this case you will get lucky most of the time, since the odds are that malloc will give you a block that is slightly bigger than strlen(s) anyway. But bugs that only manifest themselves occasionally are even worse than bugs that kill your program every time, because they are much harder to track down. (1)

  2. Some programs (e.g. /c/cs223/bin/submit) will use this to change their behavior depending on what name you call them with. (2)


2014-06-17 11:57