Nancy Street

stringhash

Click to see the Site Map
HomeInfoMusicGalleryPetsGeoHobbiesGeo
Site MapWhat's NewRecent ChangesContactsServer StatisticsSite Information Home « Computers « stringhash

Overview

For many years I've been wondering how the standard string class GetHashCode method behaved in the .NET Framework. I wondered how well distributed the hash code integers were and if they were collision resistant. Collision resistance is highly desirable when the integers are used as keys in collections.

I spent a few mornings over the 2005 Christmas holidays writing a small forms application that hashes strings using the standard string GetHashCode and some other algorithms and compares their distribution and collisions. The first tab in the window hashes strings one at a time and shows the resulting hash values, or you can click Run to start a timer loop that hashes hundreds of automatically generated strings per second. The integer hash values are scaled down and plotted inside a pair of 256x256 grids to visually show their distribution. The second tab runs a collision test against all 2-character strings of letters and digits.

Framework 1.1

Shown below is the C code for the Framework versions 1.0 and 1.1 string GetHashCode method. This code is implemented in native code and is clearly designed to be fast rather than robust. Next to the code is a thumbnail of my old Framework 1.1 version of stringhash that shows the hash distribution (on the left) for random strings of length 5 to 6. You can see that the distribution is far from random and looks seriously flawed. My initial tests show that the standard string hash has a dreadfully poor distribution for strings less than 6 characters in length, and it has many collisions for 2-character strings.

// Framework 1.x string hash code
inline ULONG HashStringA(LPCSTR szStr)
{
  ULONG hash = 5381;
  int c;
  while ((c = *szStr) != 0)
  {
    hash = ((hash << 5) + hash) ^ c;
    ++szStr;
  }
  return hash;
}

Framework 1.1 Hash
Click to popup an
enlargement

Framework 2.0

After the arrival of Framework 2.0 I recompiled stringhash to see if and how String GetHashCode had changed. It turns out that the code is no longer native code and can be displayed using Reflector.

[ReliabilityContract(Consistency.WillNotCorruptState,Cer.MayFail)]
public override unsafe int GetHashCode()
{
  fixed (char* text = ((char*) this))
  {
    char* chPtr = text;
    int num = 0x15051505;
    int num2 = num;
    int* numPtr = (int*) chPtr;
    for (int i = this.Length; i > 0; i -= 4)
    {
      num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
      if (i <= 2)
      {
        break;
      }
      num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
      numPtr += 2;
    }
    return (num + (num2 * 0x5d588b65));
  }
}

Framework 2.0 Hash
Click to popup an
enlargement

You can see that roughly the same number of points from the Framework 2.0 String GetHashCode have a much better distribution that they did in Framework 1.x. There are no 2-character collisions any more.

Download

stringhash_src.zip (201KB) - Zip of the full Visual Studio 2005 solution and source code.

stringhash_exe.zip (22KB) - Zip of the executable files. Place them in any folder and run stringhash.exe. No installer is needed and no files are created at runtime.

Links

An Illustrated Guide to Cryptographic Hashes


Contact Information | PGP Keys | Site Map | What's New | Visitor Book
Last Updated: 06-Aug-2007 21:13
Copyright © 1999-2007 Orthogonal Programming