Laplace Benchmark

  The core of the Laplace test code is a simple two-dimensional central
  differencing scheme. Laplace was developed as a vehicle for motivating
  an introductory MPI course and as such, little attention was paid to
  performance-related issues.

  A rectangular grid is defined at run time by the user. A single layer
  of ghost cells surrounds each subdomain. Boundary conditions are set
  on the ghost region surrounding the global domain. Boundary values on
  the global domain are never updated (read only).

  The problem solved by the Laplace benchmark is trivial. The solution is
  a plane located at z=1. Boundary values are set to 1. Interior ghost
  cells and the subdomains themselves are initialized to 0. Because the
  exact solution is known, a simple estimated error norm can be computed
  without additional storage. The feature was added for validation purposes.
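
  The setup described above can be sketched in a few lines of Python. This
  is only an illustration, not the benchmark's Fortran: the function names
  are hypothetical, and the maximum norm is an assumption, since the text
  does not say which norm the code uses.

```python
def make_grid(rows, cols):
    """Build a (rows+2) x (cols+2) grid: ghost boundary = 1.0, interior = 0.0."""
    grid = [[0.0] * (cols + 2) for _ in range(rows + 2)]
    for j in range(cols + 2):
        grid[0][j] = 1.0          # top ghost row
        grid[rows + 1][j] = 1.0   # bottom ghost row
    for i in range(rows + 2):
        grid[i][0] = 1.0          # left ghost column
        grid[i][cols + 1] = 1.0   # right ghost column
    return grid


def max_error(grid):
    """Largest deviation of any interior value from the exact solution z = 1.

    No reference array is needed because the exact solution is a constant.
    (Maximum norm chosen for illustration only.)
    """
    return max(abs(1.0 - v) for row in grid[1:-1] for v in row[1:-1])


grid = make_grid(4, 6)
initial_error = max_error(grid)  # every interior cell starts at 0
```

  The error starts at 1 and should shrink toward 0 as the iteration
  converges, which is what makes it usable as a validation check.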

  Laplace consists of a few hundred lines of Fortran code. The differencing
  pattern is a 4-point star, so under an explicit update each iteration
  carries boundary information only one cell further into the domain. The
  number of iterations required to propagate boundary values to the center
  of the domain is therefore a function of the mesh dimension along the axis
  that features the fewest node points.
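
  A small Python experiment can illustrate this dependence. It assumes an
  explicit (two-array) update on a tiny grid with boundary 1 and interior 0;
  the function name and the grid sizes are hypothetical and chosen only to
  keep the run short.

```python
def explicit_sweeps_to_center(rows, cols):
    """Count explicit sweeps until the center cell first becomes nonzero."""
    # Ghost boundary = 1.0, interior = 0.0.
    old = [[1.0] * (cols + 2) for _ in range(rows + 2)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            old[i][j] = 0.0

    ci, cj = (rows + 1) // 2, (cols + 1) // 2  # center cell (1-based interior)
    sweeps = 0
    while old[ci][cj] == 0.0:
        new = [row[:] for row in old]
        for i in range(1, rows + 1):
            for j in range(1, cols + 1):
                # 4-point star: average of the four nearest neighbors,
                # all read from the previous sweep.
                new[i][j] = 0.25 * (old[i - 1][j] + old[i + 1][j]
                                    + old[i][j - 1] + old[i][j + 1])
        old = new
        sweeps += 1
    return sweeps


shapes = [(5, 9), (9, 5), (5, 41), (7, 7)]
counts = {shape: explicit_sweeps_to_center(*shape) for shape in shapes}
```

  In this sketch the count depends only on the shorter axis: stretching a
  5 x 9 grid to 5 x 41 leaves it unchanged, while a 7 x 7 grid needs one
  more sweep.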

  The original benchmark code used an explicit scheme. In version 4, the
  computational kernel was revised to use a Gauss-Seidel variant of the
  central differencing scheme, for two reasons: it accelerates the
  propagation of boundary conditions, and it requires the minimum amount
  of storage.
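
  Both properties can be seen in a minimal Python sketch of an in-place
  Gauss-Seidel sweep. This is not the benchmark's kernel: the row-by-row
  sweep order and the function names are assumptions made for illustration.

```python
def make_grid(rows, cols):
    """Ghost boundary = 1.0, interior = 0.0 (the benchmark's test problem)."""
    grid = [[1.0] * (cols + 2) for _ in range(rows + 2)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            grid[i][j] = 0.0
    return grid


def gauss_seidel_sweep(grid):
    """Update the interior in place; no second copy of the grid is needed."""
    rows, cols = len(grid) - 2, len(grid[0]) - 2
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # grid[i-1][j] and grid[i][j-1] were already updated in this
            # sweep, so boundary information travels across the whole grid.
            grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                 + grid[i][j - 1] + grid[i][j + 1])


grid = make_grid(9, 9)
gauss_seidel_sweep(grid)
center_after_one_sweep = grid[5][5]
all_interior_nonzero = all(v > 0.0 for row in grid[1:-1] for v in row[1:-1])
```

  With this sweep order every interior cell, including the center, is
  nonzero after a single sweep, and only one array is ever allocated.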

  All floating point data is explicitly declared as REAL*8. Integer sizes
  were not specified; this was not necessary because no integers are
  communicated. Default compiler settings were used (32 bits on all test
  platforms).

  Working from the Tru64/PBS version, we created AIX/LoadLeveler, Linux/IA-32,
  and Linux/IA-64 ports. A timer library based on the gettimeofday() and the
  getrusage() functions was also moved to each platform to support internal
  measurements including the floating point rate of the computational kernel
  and the transfer rate for the communications steps.


  NCSA Benchmarks

  Dimensioning criteria and iteration counts for the Laplace code are specified
  at run time. For the NCSA tests, an effort was made to fill at least 80% of
  the memory of the smallest single node to be tested (IA-32 at ~1.25 GB available).
  This case was dubbed Small, while Medium and Large cases were defined by doubling
  and quadrupling this amount across 2 and 4 nodes respectively. In other words,
  the Medium and Large cases were sized to fill 80% of multiple IA-32 node memories.

  Iteration counts were varied by platform to cause each version of the code to
  complete a "Small" simulation in approximately 1 hour.

  The row count was fixed across all cases at 8000. Other run-time parameters that
  were used are as follows.

    Number of columns:

    Processors   1     2     4     8    16    21    32    41    64   128
    --------------------------------------------------------------------
    Small     8000  4000  2000  1000   500   381   250   195   125    63
    Medium      -   8000  4000  2000  1000   762   500   390   250   125
    Large       -     -   8000  4000  2000  1524  1000   780   500   250

    Notes
    -----
    Cases involving 21 processors were only tested on Copper.
    The extent of Copper testing was limited to 32 processors (node size).
    Cases involving 41 processors were used to test the Linux clusters.
    Column counts are per-process.
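
    The per-process column counts in the table can be reproduced with a
    short Python sketch. The total column counts (8000, 16000, 32000) and
    the round-half-up rule are inferred from the table itself; the source
    does not state how non-integer quotients were rounded, so treat both
    as assumptions.

```python
PROCESSORS = (1, 2, 4, 8, 16, 21, 32, 41, 64, 128)

# Total columns per case, inferred from the first populated table entry
# of each row (e.g. Medium: 2 processes x 8000 columns).
TOTAL_COLUMNS = {"Small": 8000, "Medium": 16000, "Large": 32000}


def columns_per_process(total_columns, processes):
    """Integer-only round-half-up of total_columns / processes."""
    return (2 * total_columns + processes) // (2 * processes)


small = [columns_per_process(TOTAL_COLUMNS["Small"], p) for p in PROCESSORS]
medium = [columns_per_process(TOTAL_COLUMNS["Medium"], p) for p in PROCESSORS[1:]]
large = [columns_per_process(TOTAL_COLUMNS["Large"], p) for p in PROCESSORS[2:]]
```

    Integer arithmetic is used so that 8000 / 128 = 62.5 rounds up to 63, as
    in the table; Python's built-in round() would give 62.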

    Number of iterations (fixed across all process counts):

    Platform  System    OS     Scheduler    Iterations
    --------------------------------------------------
    p690      Copper    AIX    LoadLeveler        2500
    IA-32     Platinum  Linux  PBS                 500
    IA-64     Titan     Linux  PBS                2000


  All Copper tests were performed on a dedicated node. All Linux cluster tests were
  performed with the process-per-node count set to 1. See the original Word document
  for additional software and hardware details.


  Tables and Charts

  o  Parameters and Tables (Word)
  o  Charts (Excel)


  Analysis of Results

  The Laplace code can be configured to test numerous machine characteristics. The
  configuration used for the NCSA Benchmarks might be considered 'typical.'

  By reducing the column count and increasing the row count, timings can be skewed
  toward message passing rates. Conversely, increasing the column count and
  reducing the row count skews the measurements toward floating point
  performance.

  Since the problem size was fixed for these experiments, the per-process column
  count decreases as the process count grows. This creates the potential for
  increases in performance due to caching effects. It also means that network
  timings will likely increase.
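
  To give a feel for the sizes involved, the following Python sketch
  estimates the per-process subdomain footprint for the Small case. It
  assumes a single REAL*8 array of 8000 rows per process, ignores ghost
  cells, and uses 1 MB = 10^6 bytes; the actual number of arrays held by
  the code is not stated here.

```python
ROWS = 8000
BYTES_PER_VALUE = 8  # REAL*8

# Per-process column counts for the Small case, copied from the table.
SMALL_COLUMNS = {1: 8000, 2: 4000, 4: 2000, 8: 1000, 16: 500,
                 21: 381, 32: 250, 41: 195, 64: 125, 128: 63}


def subdomain_megabytes(columns):
    """Size of one ROWS x columns REAL*8 array in MB (ghost cells ignored)."""
    return ROWS * columns * BYTES_PER_VALUE / 1e6


footprint = {p: subdomain_megabytes(c) for p, c in SMALL_COLUMNS.items()}
message_bytes = ROWS * BYTES_PER_VALUE  # one ghost column; same for every p
```

  Under these assumptions the footprint falls from 512 MB on one process to
  about 4 MB on 128 processes, while the message exchanged with each
  neighbor stays at 64,000 bytes.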

  The communications pattern amounts to simple shifts on the subdomain ghost cells.
  Since the row count was held fixed, the message size also remains fixed. The
  messages are small, at 8,000 x 8 bytes = 64 Kbytes each. For a sustainable
  bandwidth in the neighborhood of 100 MBytes/s, transmission times will be very
  small and so latency is expected to dominate communications timings.
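
  The arithmetic behind the message size and the transmission time can be
  checked with a few lines of Python. The sketch assumes 1 MByte = 10^6
  bytes and takes the 100 MBytes/s figure at face value; latency is not
  modelled because no value is given.

```python
ROWS = 8000
BYTES_PER_VALUE = 8            # REAL*8
BANDWIDTH_BYTES_PER_S = 100e6  # "neighborhood of 100 MBytes/s" (assumed decimal)

message_bytes = ROWS * BYTES_PER_VALUE                     # one ghost column
transfer_seconds = message_bytes / BANDWIDTH_BYTES_PER_S   # bandwidth term only

print(f"message size : {message_bytes} bytes")
print(f"transfer time: {transfer_seconds * 1e3:.2f} ms per message")
```

  The bandwidth term comes to 0.64 ms per message; the measured cost per
  exchange adds the per-message latency of the interconnect on top of this.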

  Floating point performance rates were measured externally with hpmcount on Copper
  and with the psrun utility on Platinum and Titan. Internal timers report elapsed
  time and resource usage, from which a floating point rate is computed for the
  computational kernel and a transfer rate for the communications steps.

  Since Laplace is a streaming memory code, we expect floating point rates and
  efficiencies to be consistent with the STREAM benchmark.