One thing's for sure. If you use a lot of switch statements in your code (to "case out" user-selected actions, for example), you're sitting on a great opportunity to improve the reliability and maintainability of your code; just follow along and see.
What Is an Associative Array?
An associative array is a collection of data values in which access to any particular value is obtained via a (single) key. Generally speaking, it's a one-to-one mapping of data entities ("values") to keys, such that no two values map to the same key (although two different keys could, coincidentally, contain the same value). For example, in an array of TV listings, you might very well find that two different channels are showing the same episode of Gilligan's Island at 4:00 a.m. Tuesday, but you would never find Channel 2 listed twice, showing two different shows in the same time slot. In a larger sense, associative-array mapping is an application of set theory, and languages that have a robust collections package will invariably implement some form of associative-array mapping.
For a quick example of an associative array, consider the hypothetical entity in Figure 1.
Figure 1. A hypothetical associative array.
For the array shown in Figure 1 to be useful, we need to be able to hold a reference to the array in a variable, and we need to have a syntax for obtaining the value of each member, given the value of the corresponding key. It would also be nice if we could find out how many items are in the array (since it might be arbitrarily large) and to know how to iterate through the members. With an ordered array, you can step through the members in sequence. But in an array like this one, where the keys aren't integers, how do you enumerate the array's contents?
Associative Arrays in JavaScript
var myObject = new Object(); myObject.itemType = 'auto'; myObject.subtype = '4WD'; myObject.make = 'Jeep'; myObject.year = '1998'; myObject.color = 'green';In JavaScript, this is an entirely acceptable way to implement the object suggested by Figure 1. An equivalent syntax that uses literal notation is:
var myObject = { 'itemType' : 'auto', 'subtype' : '4WD', 'make' : 'Jeep', 'year' : '1998', 'color' : 'green' };But are we justified in calling this an associative array? Yes, we are, because it turns out we can also use array notation (square brackets) in JavaScript to refer to object properties:
var myObject = new Object(); myObject['itemType'] = 'auto'; myObject['subtype'] = '4WD'; myObject['make'] = 'Jeep'; myObject['year'] = '1998'; myObject['color'] = 'green'; if (myObject['itemType'] == 'auto' && myObject['make'] == 'Jeep') // true myObject['subtype'] = '4WD'; // assign '4WD' to myObject.subtypeIn JavaScript, square brackets can substitute for dot notation when referring to object properties. The only proviso is, the enclosed property name must be supplied as a string literal (as shown above). This turns out to be an extremely handy syntactical construct, because it means we are not limited to one-word property names. We can now consider constructions like:
myObject['extended warranty'] = true; // legal myObject['original delivery date'] = '1/1/98'; // legal myObject.'original delivery date' = '1/1/98'; // ERROR! Iterating through the contents of an associative array (or the properties of an object) would seem to present a problem, since the contents aren't necessarily ordered and can't be "indexed into" numerically. In JavaScript, however, there is a very handy loop syntax for doing this:
var bigstring = new String(); for (var i in myObject) bigstring += i + "\n";The above code builds a list of all the property names in myObject (separated by newlines), concatenating them into a single big string.
Perl
%myObject = ('itemType' , 'auto', 'subtype' , '4WD', 'make' , 'Jeep', 'year' , '1998', 'color' , 'green'); Notice that commas are used throughout the pairs list, rather than the more readable JavaScript technique of putting colons between keys and values (using commas to separate complete tuples).
To access a hash value in Perl, you use a notation that looks like:
$myObject{ 'color' } = 'red';Notice the seemingly inconsistent use of the dollar sign (instead of the percent sign) and curly braces (instead of parentheses) for accessing individual values of a hash. This is a standard Perl idiom. The percent sign refers only to complete hashes.
Java
With the advent of the Java 2 platform , the java.util package now has a powerful Collections Framework (not to be confused with JGL, the Generic Collections Library for Java by ObjectSpace, which was first-to-market and was widely distributed in IDEs before Sun came out with the J2SE Collections Framework), encompassing sorted and unsorted sets, maps, and lists, plus iterators for same, in thread-safe and non-thread-safe versions. Prior to Java 2, the language had Vector and HashTable classes to accommodate this need, but those classes have been relegated to legacy status. Today, if you need an associative array, you call on the HashMap class, which implements the Map interface: Map map = new HashMap(); map.put("itemType", "auto"); map.put("subtype" , "4WD"); map.put("make" , "Jeep"); map.put("year" , "1998"); map.put("color" , "green");For sorted maps, you can create a TreeMap (using a Map as input, if necessary), which stores its elements in an always-balanced tree.
C and C++
In C++, the Standard Template Library provides rich support for collections idioms of all kinds. The map is an associative container that accomplishes the kind of functionality we've been talking about: map<string,string> myMap; myMap["itemType"] = "auto"; months["subtype"] = "4WD"; // etc. etc.To use it, be sure to #include <map>. See any good reference on STL for further details.
ANSI C, on the other hand, has no built-in support for associative arrays. You can craft your own support, of course, with a little effort. But how? As you might guess from the repeated use of the strange term "hash" in this context, implementing an associative array from scratch requires the use of hash techniques.
What, exactly, is a hash? A hash is basically a mathematical function in the canonical sense: a "black box" that maps domain values to range values, with the requirement that any one domain value can map to no more than one range value. This is exactly the mapping behavior we're looking for, of course.
But what does a hash function really look like? The answer is, it can look like whatever you want it to look like, or (more commonly) whatever you need it to look like. If that sounds a bit vague, that's because hash techniques fall into a curious category of computational heuristics with a long background in computer-science lore. In the old days of limited machine resources, you'd use hash techniques to maximize performance while minimizing storage requirements. Good hash functions weren't often published, and if they were, their workings were rarely self-evident.
A Hash-Code Example
A typical answer to this challenge would be to insert exception words into a linked list, and do a lookup by traversing the list. With 5K of storage and an average word length of five characters, you could store as many as 1,000 words in the list (more than adequate for most exception dictionaries). But lookups will be time-proportional to the number of words in the list, and there will be significant overhead in terms of list management, because each time a new word is entered in the list, you have to be sure it doesn't already exist.
This approach gets a D for ingenuity.
A bright young programmer in your department comes to you at this point and says "Wait! I have a better idea." He explains that if you keep the linked list sorted alphabetically, you can speed lookups because they will occur in time-proportion to the log of the number of entries. List-maintenance overhead is reduced as well, although storage requirements haven't changed.
This represents a significant improvement. Let's give it a C+ (or maybe even a C++) for effort.
At this point, a consultant walks in the door and offers to solve the problem for you in such a way that up to 40,000 words can be stored in your exception dictionary and lookups are instantaneous, with zero table maintenance. You say to him "Thank you, but that's clearly impossible. Don't let the doorknob hit you in the butt on the way out." He goes straight to your competitor, implements the solution, and eventually the competitor puts you out of business. You end up face-down in the gutter, next to an empty bottle of Ripple.
How did the consultant do it? First of all, since you are only concerned with whether a given word has an entry (true) in the dictionary or not (false), you're really looking for a binary (Boolean) answer to a question; hence your 5K dictionary can really store 40K lookups (assuming 8-bit bytes, of course). The dictionary needs to be implemented as a bit table.
To implement a direct lookup of words, we resort to the following heuristic. To store a word in the dictionary, we submit the word string to a hash function, which in turn converts the word from a string to a number in the range of zero to 5K-minus-one. We then index into the bit table at the offset thus obtained, and store the word by flipping the bit "on." (A zero bit means the word doesn't exist in the table.) To look up a word and see if it exists in the exception dictionary, we simple hash the word and do a direct lookup. If the entry in the table at that lookup is true, the word exists. If not, it doesn't. Problem solved.
Finally, an A+ solution!
Hash Functions
But we still haven't resolved the question of what a hash function looks like. Earlier we merely said that it could look like whatever it needs to look like. Let's take the exception-dictionary example we just discussed. How many bits' worth of information does a 5-character string really hold? If the string represents an English word, each character can have one of 26 possible values (except that the first character can be upper or lower case; hence 52 possible values). Storing 26 possibilities requires 4.7 bits; 52 possibilities requires 5.7 bits. The average word therefore requires 5.7 + 4 * 4.7 == just under 25 bits of bandwidth. Every 5-character English word can therefore be converted (mapped directly and unambiguously) to a 25-bit number. Which is to say, a number between zero and 33,554,431. Clearly, if we had 32 million bits (4 megabytes) of storage available, and words could never be longer than five characters, you could convert words directly to binary numbers and use the numbers as offsets into a bit table.
Unfortunately, in the real world, words are not constrained to five characters in length and four megabytes of table storage are not always available. In fact, our example calls for no more than 5 Kbytes (40Kbits) of storage.
Why not take the bottom 15 bits of each word (discarding the rest) and use that as an index into a 40Kbit table? The answer should be obvious. Words often have the same endings. Taking the last 15 bits of 'eating' and 'drinking' would result in both words contending for the same spot in the exception dictionary. Totally unsatisfactory.
The crux of understanding how a hash solution works in this kind of situation is to think long and carefully about the problem. If you think about it, you will realize that the ASCII values in words are not at all randomly chosen. The composition of word strings in English is constrained by spelling rules, which in turn are determined, in part, by constraints on the speakability of syllables (i.e., the geometry of the human voice apparatus). The average English word may consume only 25 bits of bandwidth, but there most certainly are not 32 million different 5-letter words in the English language! The actual number of different 5-letter words you might encounter might only be in the low thousands. Certainly, in a spellchecker context, 5-letter "exception" words (words that are not commonly found in English) would likely be numbered in the low hundreds, if that many.
Careful analysis of the problem space should convince you that what we're really dealing with here is a sparse data set. We will be mapping a few hundred exceptions (or at most, a few thousand) into a 40,000-bit table. All we need is a way to convert words to numbers that map well across the available domain.
Suppose we design a hash function in C as shown in Listing 1 (where ch is a pointer to the word string).
Listing 1: hash()
hash() unsigned int hash(ch) { unsigned int hashvalue = 0; if (!*ch) return 0; // sanity do { hashvalue += *ch++; hashvalue *= 13; } while (*ch); return (hashvalue % TABLESIZE); }This hash function adds the ASCII value of each character in the string, one by one, to a temporary variable. But in between each addition, the variable is multiplied by the magic number 13, which (in binary terms) is equivalent to a left-shift by log(13) bits. (Prove to yourself that this is so.) What's the significance of 13? Nothing. It simply works. Might some other number work better? Yes! That's the magic of hash functions. Often, you can "tune" them to make them work better.
What does "work better" mean? It means you have fewer situations where two different character strings hash to the same numeric value. When two data objects hash to the same value, it's called a hash collision. Collisions are usually undesirable. In our exception-dictionary example, it could mean that a nonsense word maps to the same dictionary location as a perfectly valid word!
Dealing with Collisions
There are various ways of handling collisions. The most common way (not for bit tables, but for tables where actual string information might be stored) is to fork a linked-list node at the point of the collision. Another method is to keep parallel tables, but with different hash functions for each. An "entry" in the dictionary would actually consist of an entry in each table. A lookup would verify that Hash 1 produces a positive result from Table 1 and Hash 2 yields a positive result from Table 2. Prove to yourself that (given suitably designed hash functions) if a collision happens in the first hash table, it's extremely unlikely that a collision also occurs, simultaneously, in the other hash table; therefore, the overall collision rate is the product of two small numbers (the individual failure rates of the hashes). In some situations, collisions can safely be ignored; the "collider" overwrites the old table entry, and that's that. This can actually be desirable in instances where data entities are expected to regularly go out of scope. (In this case, you have what is commonly known as a "weak" associative array). In the spellchecker example, it's debatable whether collisions are important. No spellchecker is ever perfect. If the odds of a misspelled word hashing to the same spot as a legitimate word are less than, say, one in five thousand, would the user really notice? Would he care?
Hash Tuning
The fascinating thing about hash functions like the one above is that they can be tuned for better collision performance (lower collision rates). What makes this fascinating is the fact that, because the problem is so poorly defined (mathematically), an analytical solution is usually out of the question. Therefore, tuning must be done empirically: try a tweak, measure the result. Try another tweak; measure the result. Repeat until happy. You might want to write a short program that takes a text stream (a document) as input and maps the "vocabulary" to a hash table, measuring collision rates in the process. Find out how collision rate varies with load factor (table consumption) for a given hash function. Then try changing the function in some way, and see how the collision performance changes.
The example hash function given in Listing 1 is actually a surprisingly good hash function for most purposes. With a driver program (such as HashMule; see further below), you can prove to yourself, empirically, that this modest-looking function works much better (i.e., produces many fewer collisions) than the same function with the number 12 substituted for the number 13. (This is why I called 13 a magic number.) In fact, when I ran a quick test of Listing 1 using the Declaration of Independence as the source text (hashed into a table with room for 1024 words), I found that changing the magic number 13 to 12 caused collisions to more than double.
In general, even numbers produce poor results, and non-prime odd numbers (while better) are not as good as primes.
When writing a hash function, you will always need to constrain the final output to a certain range (in our example, 5K). Be aware that, for tuning purposes, it can matter a great deal what the "modulo" constraint is. Again, it turns out that a prime number here will improve performance. So make your lookup table an odd size! (But don't take my word for it. Write a test program and prove it to yourself.)
A Common Hash Misconception
A common misconception is that hash functions are designed to distribute values randomly over a given range. You should convince yourself that this is not true. After all, if hashing were this easy, every hash function would simply return a random number (converted to integer form and scaled to fit the desired range). The ideal hash function is usually one that distributes keys evenly across a range, with no collisions. Random numbers tend to cause collisions, at a rate equivalent to the hashtable load factor. (That is, when the lookup table gets half full, each new entry into the table has a 50:50 chance of generating a collision.) A good hash function can do much better, because hash functions can be constructed to remap data non-randomly. And in fact, this is exactly what you are trying to achieve. You are trying to map a sparse, non-random data set to non-random locations in a table: locations that spread out evenly and don't overlap. If you know a thing or two about the data before you begin, you can make intelligent decisions about how to craft the hash function. Consider a list of 500 arbitrarily chosen unique words selected from Webster's dictionary at random. Imagine that you are trying to map these words into a hash table that has enough space for all 500 words, but only enough space. (But you don't necessarily know in advance how many words there are.) Can it be done? Of course. You just need to develop a hash function that maps table positions to the words' alphabetical ranking. If you can analyze the word list in advance, this is a piece of cake. If you don't have advance access to the words, but your hash object is free to reorder table data on the fly, again it's not a hard problem, because you can insert words in alphabetical order and reorder as necessary. Even if you know nothing about the words (except that they are words!), you can easily come up with a hash function that does a better job of mapping them to a table (i.e., with fewer collisions) than mere random dart-throwing.
There's no magic procedure for constructing a good hash function. Each situation is different. What might be a great function for one set of conditions could very well be terrible under another set of conditions. The trick is to understand your data set (and what's unique about it) and understand what it is you are trying to do: map members of the set onto intervals that don't overlap.
Putting Associative Arrays to Work
With a hash function approach in C, we haven't quite achieved the table['string'] lookup ease of other languages, but we've approached it quite closely, because we can instead do: lookup = table[ hash('myItem') ];But the greater question, at this point is: How can you put associative arrays to work? One example is well-known: In compilers, symbol tables are usually accessed via a hash function that computes a likely place to begin a serial search. Since symbol-table lookup is a very frequent occurrence in compilers, the performance of the hash function (both in raw execution speed and in collision rate) can have a dramatic effect on overall compiler performance.
But what if you're not a compiler writer? In that case, you might want to consider a possible application that applies broadly across a variety of programming tasks in a variety of languages. It involves 'case' statements.
The Case Against Case Statements
Most high-level languages support a switch construct that allows options to be associated with desired actions. In C: switch( option ) { case OPTION_A : do_something(); break; case OPTION_B : case OPTION_C : do_something_else(); break; // . . . more options processing default : do_whatever(); }Usually, the cases must refer to hard-coded values. Fall-through occurs by default, which means that if you leave a break statement out (such as with OPTION_B above), execution continues with the next case. The original intent of the switch construct was to keep programmers from having to resort to long chains of if/else actions (while leaving code readable). But hindsight has shown switch/case syntax to be a notably poor syntax design in a number of respects. First of all, it doesn't save any typing (over if/else actions). In fact, switch blocks are usually quite long, because in instances where there are only a couple of cases, programmers tend to collapse everything to a few if/elses. The opportunity for typos is high in long code blocks. Also, the default fall-through behavior of switches is known to be a bad design decision, because it's not the default behavior most programmers are looking for, most of the time. (Some years ago, Sun analyzed its C compiler source code to see how often the default fall-through path was actually used. It turned out the Sun ANSI C compiler's front end had 244 switch statements, averaging seven cases each. Fall-through occurred in only 3% of cases.)
There is also a performance issue, in that the compiler must (because of the possibility of fall-through) evaluate each case in turn, up to the first match. But there's an even bigger problem, architecturally: Switch statements require you to intermix code and static data. This leads to poor maintainability and a host of other ills. It would be better, obviously, to segregate code from static data so that when the data needs to be revised periodically, it can be edited without changing any code.
The answer should be obvious by now: Convert switch blocks to associative arrays. Each case can be used, after all, as a key to find the associated action. (If the associated action is a block of code, then the array need only store a pointer to the appropriate function or macro.) In this way, you can collapse the entire switch block down to one line of code, which executes instantly.
Consider the common situation, in CGI/forms programming, of retrieving a user selection from a drop-down menu and taking action on it. Usually, you have something like this:
var userSelection = element.options[i].value; switch(userSelection) { case 'Kansas' : // JavaScript allows string cases do_this(); break; case 'Ohio' : do_that(); break; // etc. default : whatever(); break; }With an associative array, you can separate data from code and reduce all actions to a single expression. In JavaScript:
// build a lookup table as an Object: var lookupTable = { 'Kansas' : do_this, 'Ohio' : do_that, // etc. } // execute action based on lookup: lookupTable[ userSelection ]( );An entire series of if/else statements (or a big, long switch statement) is thus collapsed to one expression, which does one lookup, then vectors to an action procedure. Code is segregated from data, making maintenance easier and less likely to introduce bugs. Performance is improved, many lines of code are reduced to one, and readability doesn't suffer a bit. The best of all worlds. (Just don't tell your competition.)
Sample App: HashMule
For this article, I wrote an educational utility for creating, studying, and comparing the collision rates of hash functions. The app, called HashMule, is actually a PDF form incorporating about 150 lines of JavaScript. It is designed to run under Acrobat Reader version 4.0 or 4.05. You can download HashMule from http://www.acroforms.com/archive/HashMule.pdf, or by FTP from the Mactech website. One of the things HashMule allows you to do is enter hash functions, then execute them against text in real time. A "Hash It" button on the form will hash every word of a given text block into a hash table, then report statistics such as table load factor and hash collision rate. By making small changes in the hash function and hitting the "Hash It" button, you can see how the collision rate changes as you change "magic number" values or hash-function logic. The text of the Declaration of Independence is included in the form as a sample text block.
No comments:
Post a Comment