Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  133 Posts | 8 Stories | 368 Comments | 162 Trackbacks

News



Article Categories

Archives

Post Categories

Image Galleries

Programming

Come on that is easy:

  • bool Equals(object other) compares all member variables of another object instance to the current instance.
  • bool Equals(T other) does the same thing but expects an object of the same type than our self. This method is defined by the IEquatable<T> interface.
  • int GetHashCode() calculates of the internal state of the current object a mostly unique identifier. Mostly because if you have more than 2^32 object states some values must collide.

Most of the time you do not need to bother about these things except if you want to use your object as key in a hash table or if you want to use e. g. the List<T>.Contains(T value) method and you want to check for more than reference equality.

Although it looks like a very easy task to implement object equality and hash methods there are some pitfalls. The most important rule for both methods is that they most not throw any exceptions. Otherwise you make your type unusable in object collections (e.g. Hashtable, ArrayList, List<object>, Dictionary<object, …>, …) where objects of different types are compared against each other.

Nice rule lets violate it:

        public override bool Equals(object obj)

    {

            SimpleType other = (SimpleType)obj;

            ....

Ups this will cause an InvalidCastException if we ever try to search in an List<object> with SimpleTypes for e.g. some string. Ok many people can live with this limitation but it is so easy to make it work with the as operator:

    class SimpleType : IEquatable<SimpleType>

    {

 

        public override bool Equals(object obj)

        {

            return Equals(obj as SimpleType);

        }

 

        public bool Equals(SimpleType other)

        {

            // Reference Equality

            if (object.ReferenceEquals(this, other))

                return true;

 

            // this cannot be null if other is null we must return false

            if (object.ReferenceEquals(other,null))

                return false;

 

            return myb == other.myb && mya == other.mya;

        }

 

        int mya;

        string myb;

       …

The comment “// this cannot be null if other is null we must return false” is not really true. If you look at the string.Equals method

public bool Equals(string value)
{
    if ((value == null) && (this != null))

you find a null check for the this pointer. The reason is that you can call on any class member methods without getting a NullReferenceException until you try to dereference the this pointer the first time. A C# specific thing is that member methods are not called with the call IL instruction but the callvirt instruction which will throw a NullReferenceException when you try to call a method on a null reference. That is interesting for languages like F# which prefers direct calls whenever possible. If speed is a concern it costs you only one instruction

cmp         dword ptr [ecx],ecx

This little guy is responsible for you nice NullReferenceException which does nothing else than to compare the this pointer (= address) of your object which is always in the ecx register (at least for .NET) against the memory location it is supposed to point to. For a null pointer we end up trying to read from memory location 0 which is certainly not a valid object address. The memory page at address 0 is not readable which will cause a Windows SEHException (0xC0005 try with Windbg) which is translated by the CLR into our well known NullReferencException.

I had to tell you this because I tend to forget these little details so I can later look it up here. But back to our main topic. Common sources of errors in Equals and GetHashcode. I found a very interesting pattern to implement GetHashCode.

        public override int GetHashCode()

        {

            StringBuilder sb = new StringBuilder();

            sb.Append(mya);

            sb.Append(myb);

            return sb.ToString().GetHashCode();

Ok this does work and it produces a meaningful hash value for your object. It has only the little problem that this method does allocate strings like crazy which will be garbage collected very soon. If you have a Memory Profiler attached to your application and you see % Time in GC nearly 100% then it could be that somebody tried to use our SimpleType within a Dictionary<SimpleType,xxx> which can cause billions of GetHashCode calls within seconds. Your application will then be a nice unit test how fast the .NET garbage collector can be but there will be nearly no CPU cycles left for your application logic.

Another pitfall is to mix up operators for custom types. The == operator is not automatically routed to your instance once you override your Equals method. Instead the default behavior  for reference equality does kick in.

    class OtherType

    { }

 

    class SimpleType : IEquatable<SimpleType>

    {

        int mya;

        string myb;

        OtherType myOther;

 

        public bool Equals(SimpleType other)

        {

            // Reference Equality

            if (object.ReferenceEquals(this, other))

                return true;

 

            // this cannot be null if other is null we must return false

            if (object.ReferenceEquals(other,null))

                return false;

 

            return myb == other.myb &&

                   mya == other.mya &&

                   myOther == other.myOther; // Wrong this tests for reference equality only!!

        }

The code looks like it does what you meant but it does not. Before using an operator you need always to check if there are any custom operators defined. For strings for example you can compare two null references without getting any exception. But things are different when you try to call the member method Equals. It is also easy to add recursion to your operators by e.g. using the == operator for the null check. If your == operator calls Equals again you need an infinite stack. But these things are not subtle you will find them with unit testing.

So how does working sample look like? Below you find the SimpleType reference implementation which shows how you can add a meaningful hash function for more than one member (Hash codes of other types are combined by the xor operaton ^ which is ok for most cases but there are other approaches possible which make it possible to get a different hash code when different fields are null.

    class SimpleType : IEquatable<SimpleType>

    {

        int mya;

        string myb;

 

        public bool Equals(SimpleType other)

        {

            // Reference Equality

            if (object.ReferenceEquals(this, other))

                return true;

 

            // this cannot be null if other is null we must return false

            if (object.ReferenceEquals(other,null))

                return false;

 

            return myb == other.myb &&

                   mya == other.mya;

        }

 

        public override bool Equals(object obj)

        {

            return Equals(obj as SimpleType);

        }

 

        public override int GetHashCode()

        {

            int ret = mya;

            if (myb != null)

                ret ^= myb.GetHashCode();

 

            return ret;

        }

 

        public static bool operator ==(SimpleType a, SimpleType b)

        {

            // Enable a == b for null references to return the right value

            if (Object.ReferenceEquals(a, b))

                return true;

            // If one is null and the other not. Remember a==null will lead to Stackoverflow!

            if (Object.ReferenceEquals(a,null))

                return false;

 

            return a.Equals((object) b);

        }

 

        public static bool operator !=(SimpleType a, SimpleType b)

        {

            return !(a==b);

        }

 

        public SimpleType(int a, string b)

        {

            mya = a;

            myb = b;

        }

    }

Most programmers understand that GetHashCode and hash tables are somehow related. But only few do really understand how they achieve the ultra fast lookup times of O(1). That means that I can check for the existence of a key in constant time regardless how many objects I have stored in my Hashtable. The precondition for this magic is that GetHashCode is fast and produces a uniform distribution of hash values across the full value range of 0 – 2^32. A typical hash table implementation uses an array to to store its objects. When an object needs to be inserted it calculates the hash code and transforms this value to an index to the array. A widely used formula is

idx = hash % length

Where idx is the array index where our object is stored, hash is the hash value of the object to be stored and length is the length of the  internal storage array of our hash table. The modulo operation will magically give you an array index which is between 0 and length-1. Then you can store the object at the given index. The picture below illustrates this.

Hashtable

Now it is easy to answer the question why the size of the array of a hash table should be a prime number. Lets suppose it is not a prime number e.g. 16 then the remainder of the modulo operation would return 0 for all hash values which are divisible by 16 hence causing collisions which would force the hash table to switch on the collision resolution logic which is more complex and involves calling Equals to all previously stored objects with the same hash value to find out if the same (=Equals) object was already stored.

The lookup of a hash table takes basically 5 steps:

  1. Calculate hash of input key e.g. hash “cd” = 5
  2. Calculate from hash the corresponding array index 5 % 10 = 5
  3. Calculate the hash value from object at array Index 5: hash “cd” = 5
  4. If the value does match call Equals to check if it really the same object
  5. If not we have a hash collision. That can mean we check all further objects in the array with the same hash value until we find a match or stop when we have found an object with a different hash value.

Step 1-3 are the application of our magic formula to calculate the storage index inside our array. The other steps are there to confirm that we did not get only an object with a matching hash code (which you cannot trust to be unique) but we need to compare the key values one by one which have the same hash code until we found it.

There is still much more to say about hash tables such as custom IEqualityComparers to get more speed or a different behavior of your dictionary. It is e.g. quite easy to create a file name dictionary which will throw when you try to add the same file more than once to it. This results in a somewhat imperformant hashing function but for small numbers it can be acceptable.

With a good hashing function you can create many interesting data structures. With .NET 3.5 we did get e.g. the HashSet<T> class which enables a fast check if something was already added to the set or not. It is possible to squeeze more memory out by using a probabilistic data structure called Bloom Filter which has the unique property of constant check time regardless how many items are stored inside it. That coolness comes at the cost that false positives are possible (the bloom filter tells you it has the value stored but it did not) which can be configured during creation of the filter. An implementation in C# can be found at CodePlex.

Bloom Filters are useful for cache managers which need a first fast check if the element is in the cache. If yes we can lookup the value in the cache if no it goes directly to disk. When a false positive is reported we look in the cache in vain but that does not matter since we still can get our data from disk.  Googles Big Table for example uses bloom filters for this very purpose very successfully.

posted on Sunday, February 28, 2010 3:24 PM

Feedback

# re: Pitfalls Of Equals/GetHashCode – How Does A Hash Table Work? 3/15/2010 2:40 AM Dave Black
Hi Alois,

Thanks for writing this article. Here is how I usually implement Equals() and GetHashCode(). Please provide your comments on these implementations and their correctness.

public override bool Equals( object right )
{
//TODO Uncomment the following code IF and ONLY IF the base class overrides Equals()
//// let the base type compare its fields
//if ( !base.Equals( right ) )
//{
// return false;
//}

// check for null
if ( null == right )
{
return false;
}

// check to see if the object we're comparing is the exact same instance as this object
if ( object.ReferenceEquals( this, right ) )
{
return true;
}

// if the types are not the same, they cannot be equal
// Note that even if we're comparing a Base type to a Derived type, they cannot be equal!
if ( this.GetType() != right.GetType() )
{
return false;
}

YourCustomObjectHere rightAsYourCustomObjectHere = right as YourCustomObjectHere;

//TODO: write your custom implementation of Equals() here comparing member value types and reference types
throw new NotImplementedException();
}

/// <remarks>This implementation should ONLY use immutable fields (i.e. cannot be changed) to calculate the Hash Code.
/// Otherwise, you will lose this object in the hash map!</remarks>
public override int GetHashCode()
{
// TODO: write your implementation of GetHashCode() here
throw new NotImplementedException();

// some IMMUTABLE PRIVATE types here
int hashCode = immutableObject1 ^ immutableObject2 ^ immutableObject3....

return hashCode;
}


# re: Pitfalls Of Equals/GetHashCode – How Does A Hash Table Work? 3/15/2010 9:51 AM Alois Kraus
Hi Dave,

except for the Exceptions your code looks ok. It depends on your requirements and goals what is a suitable equals/gethashcode implementation. I would also implement IEquatable since it is easy to do it so the generic collections can use it without boxing (if performance is critical).

Yours,
Alois Kraus


# re: Pitfalls Of Equals/GetHashCode – How Does A Hash Table Work? 3/15/2010 11:24 AM Dave Black
Isn't there a bug in your example of GetHashCode()? From what I understand, it should only calculate the values of the hash code on immutable fields...

# re: Pitfalls Of Equals/GetHashCode – How Does A Hash Table Work? 3/15/2010 11:55 PM Alois Kraus
The fields contain immutable values which are never changed after construction. I have not shown the ctor there. The recommendation to never change the state of an object is a good advice but no more than that.
If you change the hash code because the internal object state has changed you need to rehash your object in all hash tables. It is a dangerous thing to do but in some cases desireable.
If you break a rule you have to know what consequences arise from that. You can deviate from the general rules in specific scenarios to get more performance at the cost of increased complexity.

Yours,
Alois Kraus


# re: Pitfalls Of Equals/GetHashCode – How Does A Hash Table Work? 8/16/2010 12:11 PM Frank Hileman
Hi Alois,

A hash table using modulo prime is a warning sign: the hash function is bad. Good hash functions have no need for the modulo prime; randomness is equally distributed for any bit flip on the input. Modulo prime is like a band aid, and it can fail, if you have ever dealt with bad hash functions that produce multiples of prime numbers.

A good hash function is the google murmer hash.

With a good hash function, the number of buckets is always a power of two, and we simply truncate bits. This works very well.

Regards,
Frank Hileman

# re: Pitfalls Of Equals/GetHashCode – How Does A Hash Table Work? 8/16/2010 12:17 PM Frank Hileman
Hi Alois,

I can send you my Mummer hash implementation in C# if you wish. You have to use it every time you combine hash codes. The fundamental flaw in the .net GetHashcode design: we must use a class that computes hash codes, combines them, and does the finalization (avalanche). The .net design currently leaves the combine operation up to developers, and most do it incorrectly, using xor or some other poor technique. And .net has no finalization step at all.

Regards,
Frank Hileman

# re: Pitfalls Of Equals/GetHashCode – How Does A Hash Table Work? 8/21/2010 1:58 AM Alois Kraus
Hi Frank,

this looks very interesting. If you send me a C# sample I could check how it performs better than the C# Dictionary in terms of memory consumption and or lookup performance.

Yours,
Alois Kraus


Post A Comment
Title:
Name:
Email:
Comment:
Verification: