Geeks With Blogs
Christian Maslen

Many language and API developers have their example of the “Hello World” program, used either to illustrate the simplest possible program that runs, or to highlight a key concept in the language or API. For example, how many Fibonacci examples have you seen when being introduced to a functional language?

Calculating PI and Parallel.For

In the parallel programming world, it looks like Hello World is the geometric approximation of PI. I took a look at the very interesting hands-on lab on the subject. It outlines how to calculate PI using geometry and the Monte Carlo Method, shows how to progress from a single-threaded version to a version using Parallel.For, and explains how to analyse the code using the Visual Studio 2010 tools.
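For readers who haven’t seen it, the Monte Carlo idea here is simple: scatter random points over a square and count how many fall inside the inscribed quarter circle – that fraction approaches PI/4. Below is a minimal single-threaded sketch of the idea (the method name and constants are mine, not the lab’s):

```csharp
using System;

public static class MonteCarloPI
{
    // Scatter numPoints random points over a radius x radius square and count
    // how many land within the quarter circle of that radius. The fraction
    // inside approaches PI / 4, so multiplying by 4 approximates PI.
    public static double Estimate(int numPoints, int radius, int seed)
    {
        var random = new Random(seed);
        int pointsInCircle = 0;

        for (int i = 0; i < numPoints; i++)
        {
            double x = random.Next(radius);
            double y = random.Next(radius);

            // Inside the quarter circle when the distance from the origin
            // is at most the radius.
            if (Math.Sqrt(x * x + y * y) <= radius)
                pointsInCircle++;
        }

        return 4.0 * pointsInCircle / numPoints;
    }
}
```

With a million points the estimate lands within a couple of decimal places of 3.14159; accuracy improves roughly with the square root of the point count, which is why the lab cranks NUMPOINTS up so high.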

I tried this on a dual-core PC and found the thread pool version and the Parallel.For version ran in roughly the same time. On my i7 with 8 hardware threads, I expected a bigger difference, but not like this…

Version                   Rough Execution Time
Single Threaded Version   37.1 seconds
Thread Pool Version       16.4 seconds
Parallel.For Version      74.5 seconds

I was disappointed, to say the least. I can only conclude that the extra hardware threads did more harm than good because of the need to synchronise access to the two collections in the sample.
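To see why, consider the shape of the naive parallel loop – this is a simplified sketch of the locking pattern, not the lab’s exact code: every matching iteration takes the same lock to bump a shared counter, so with 8 hardware threads the synchronisation can swamp the useful work.

```csharp
using System;
using System.Threading.Tasks;

public static class NaiveParallelCount
{
    // Counts the values at or below the threshold using a shared counter
    // guarded by a single lock. The result is correct, but every matching
    // iteration on every thread queues on the same lock object.
    public static int Count(double[] values, double threshold)
    {
        int count = 0;
        object sync = new object();

        Parallel.For(0, values.Length, i =>
        {
            if (values[i] <= threshold)
            {
                lock (sync) // contention point: all threads serialise here
                {
                    count++;
                }
            }
        });

        return count;
    }
}
```

The more cores you add, the more threads fight over `sync` – which is exactly how extra hardware threads can make a loop slower.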

Below is the improved version. This time we pre-calculate the random numbers so they don’t require the lock, and we use a different overload of the Parallel.For method – one that keeps a running total per thread, so instead of locking the total variable on every loop iteration we only update it once when each thread is done.

using System;
using System.Diagnostics;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

namespace CalculatePI
{
    class Program
    {
        // Tuning constants:
        // If you have lots of memory, increase NUMPOINTS to improve the accuracy
        private const int NUMPOINTS = 10000000;
        private const int RADIUS = 10000;

        // Value to seed the random number generator for each calculation.
        // Using the same seed value ensures that the same results should be generated each time
        private const int SEED = 269222;

        // If you have a very fast processor, increase SPINWAITS to show the effects of parallelization
        private const int SPINWAITS = 1000;

        static double ParallelTasksPI()
        {
            Random random = new Random(SEED);
            var points = new CartesianPoint[NUMPOINTS];
            for (var i = 0; i < NUMPOINTS; i++)
            {
                points[i] = new CartesianPoint(random.Next(RADIUS), random.Next(RADIUS));
            }
            int numPointsInCircle = 0;
            Stopwatch timer = new Stopwatch();
            timer.Start();

            try
            {
                Parallel.For(0, NUMPOINTS, () => 0, (i, loopState, threadPointsInCircle) =>
                    {
                        double distanceFromOrigin =
                            Math.Sqrt(points[i].XCoord * points[i].XCoord + points[i].YCoord * points[i].YCoord);

                        if (distanceFromOrigin <= RADIUS)
                            threadPointsInCircle++;

                        DoAdditionalProcessing();
                        return threadPointsInCircle;
                    },
                    (threadPointsInCircle) => Interlocked.Add(ref numPointsInCircle, threadPointsInCircle)
                );

                double pi = 4.0 * numPointsInCircle / NUMPOINTS;
                return pi;
            }
            finally
            {
                long milliseconds = timer.ElapsedMilliseconds;
                Console.WriteLine("ParallelTasksPI complete: Duration: {0} ms", milliseconds);
                Console.WriteLine("Points in pointsList: {0}. Points within circle: {1}", NUMPOINTS, numPointsInCircle);
            }
        }

        private static void DoAdditionalProcessing()
        {
            Thread.SpinWait(SPINWAITS);
        }

        static void Main(string[] args)
        {
            Console.WriteLine("Geometric approximation of PI calculated in parallel with TPL");
            double pi = ParallelTasksPI();
            Console.WriteLine("PI = {0}", pi);
            Console.ReadLine();
        }
    }

    public class CartesianPoint
    {
        public CartesianPoint(int xCoord, int yCoord)
        {
            XCoord = xCoord;
            YCoord = yCoord;
        }

        public int XCoord { get; private set; }
        public int YCoord { get; private set; }
    }
}

The run time now is 5.7 seconds. Much better. So, as the functional guys say, if you want your code to scale over multiple CPUs, don’t share state.

Scaling to the GPU

While on the topic of PI calculations, I thought I’d show the Accelerator version – one of a few APIs that offer a managed wrapper over GPU calls without being tied to any one GPU family. Accelerator targets DirectX, so you don’t need an NVIDIA adapter as you would for CUDA. Unfortunately it’s still an incubator project with no immediate prospect of graduating, but the sample is worth showing to illustrate how different GPU code is.

Here I’ve tried to preserve the calculation method from the lab, but since GPU APIs don’t offer any way to make a thread wait (why would they?), we don’t simulate the extra work as before. As you can see, things are done very differently indeed…

   1: using System;
   2: using System.Diagnostics;
   3: using System.Collections.Generic;
   4: using System.Threading;
   5: using System.Threading.Tasks;
   6:  
   7: using Microsoft.ParallelArrays;
   8: using A = Microsoft.ParallelArrays.ParallelArrays;
   9: using DPA = Microsoft.ParallelArrays.DoubleParallelArray;
  10: using IPA = Microsoft.ParallelArrays.IntParallelArray;
  11:  
  12: namespace CalculatePI
  13: {
  14:     class Program
  15:     {
  16:         // Tuning constants:
  17:         // If you have lots of memory, increase NUMPOINTS to improve the accuracy
  18:         private const int NUMPOINTS = 10000000;
  19:         private const int RADIUS = 10000;
  20:  
  21:         // Value to seed the random number generator for each calculation.
  22:         // Using the same seed value ensures that the same results should be generated each time
  23:         private const int SEED = 269222;
  24:  
  25:         static double AcceleratorPI()
  26:         {
  27:             Random random = new Random(SEED);
  28:             var xPoints = new double[NUMPOINTS];
  29:             var yPoints = new double[NUMPOINTS];
  30:             
  31:             Stopwatch timer = new Stopwatch();
  32:             timer.Start();
  33:  
  34:             for (var i = 0; i < NUMPOINTS; i++)
  35:             {
  36:                 xPoints[i] = random.Next(RADIUS);
  37:                 yPoints[i] = random.Next(RADIUS);
  38:             }
  39:             double numPointsInCircle = 0D;
  40:  
  41:             try
  42:             {
  43:                 var gpuXs = new DPA(xPoints);
  44:                 var gpuXsSquared = A.Multiply(gpuXs, gpuXs);
  45:  
  46:                 var gpuYs = new DPA(yPoints);
  47:                 var gpuYsSquared = A.Multiply(gpuYs, gpuYs);
  48:                 
  49:                 var gpuDistances = A.Sqrt(A.Add(gpuXsSquared, gpuYsSquared));
  50:                 var gpuCompares = A.Subtract(RADIUS, gpuDistances);
  51:                 
  52:                 var gpuZeros = new DPA(0D, gpuXs.Shape);
  53:                 var gpuOnes = new DPA(1D, gpuXs.Shape);
  54:                 
  55:                 var gpuPointsInCircle = A.Select(gpuCompares, gpuOnes, gpuZeros);
  56:                 var gpuNumPoints = A.Sum(gpuPointsInCircle);
  57:  
  58:                 // var targetDX = new DX9Target();
  59:                 // Not sure why DX9 throws a not supported exception on my PC
  60:                 var targetDX = new X64MulticoreTarget();
  61:                 numPointsInCircle = targetDX.ToArray1D(gpuNumPoints)[0];
  62:                 double pi = 4.0 * numPointsInCircle / NUMPOINTS;
  63:                 return pi;
  64:             }
  65:             finally
  66:             {
  67:                 long milliseconds = timer.ElapsedMilliseconds;
  68:                 Console.WriteLine("Accelerator PI complete: Duration: {0} ms", milliseconds);
  69:                 Console.WriteLine("Points in pointsList: {0}. Points within circle: {1}", 
  70:                     NUMPOINTS, numPointsInCircle);
  71:             }
  72:         }
  73:  
  74:         static void Main(string[] args)
  75:         {
  76:             Console.WriteLine("Geometric approximation of PI calculated in parallel with Accelerator");
  77:             double pi = AcceleratorPI();
  78:             Console.WriteLine("PI = {0}", pi);
  79:             Console.ReadLine();
  80:         }
  81:     }
  82: }

The code ran in about 0.7 seconds, though as I mentioned I wasn’t able to put a wait loop in it. Data is loaded to the GPU using specialised arrays, which I’ve aliased to A and DPA for ParallelArrays and DoubleParallelArray respectively. Operations can only be applied to the entire array, and the arrays are immutable. I like to think of the parallel array operations in the same way as the functions in SQL SELECT statements: you can either apply a SQL function to a column to transform every row of an existing table, or apply an aggregate to reduce the table rows to a group. This is why every part of the calculation is applied one step at a time. For example, I start with gpuXs initialised with my random points on line 43 and then square them on the next line to produce a new array of the squared points. For a complex algorithm this can get tedious.
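If the whole-array style feels foreign, here is a rough CPU-side analogue of the same pipeline written with ordinary LINQ (my own illustration, not Accelerator code): each operation consumes a whole sequence and yields a new one, just as each parallel-array call yields a new immutable array.

```csharp
using System;
using System.Linq;

public static class WholeArrayStyle
{
    // LINQ analogue of the Accelerator pipeline: square and add the
    // coordinates, take the square root, compare against the radius,
    // select 1 or 0 for each point, then sum.
    public static double EstimatePI(double[] xs, double[] ys, double radius)
    {
        double pointsInCircle =
            xs.Zip(ys, (x, y) => Math.Sqrt(x * x + y * y)) // A.Sqrt(A.Add(xsSquared, ysSquared))
              .Select(d => d <= radius ? 1D : 0D)          // A.Select(compares, ones, zeros)
              .Sum();                                      // A.Sum

        return 4.0 * pointsInCircle / xs.Length;
    }
}
```

The difference, of course, is that each LINQ operator runs on the CPU one element at a time, while each Accelerator operation is compiled into a single data-parallel step over the whole array.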

Conclusion

So, to wrap things up: if you want your code to scale across many cores, minimise the shared state. And if you want to take advantage of the extra power in most desktop PCs, be prepared to make a lot of changes.

Posted on Sunday, July 17, 2011 7:13 PM

