On Scalability of the ANSYS Distributed Domain Solver

 

David O'Neal

National Center for Supercomputing Applications

University of Illinois at Urbana-Champaign

 

Sam Murgie

Technical Fellow

ANSYS, Inc.

Introduction

The focus of this article is the ANSYS Distributed Domain Solver (DDS), a prominent feature of the Parallel Performance for ANSYS (PPFA) product. DDS is based on the method of finite element tearing and interconnecting (FETI) developed by Farhat and Roux in the early ‘90s. The FETI algorithm is characterized by fewer inter-processor data communications than traditional decomposition methods, and a global rate of convergence that is independent of the subdomain count.

 

Scalability of the research implementation was demonstrated by Farhat, Mandel, and Roux in 1994. Here we examine requirements of the industrial application running in the field. Measurements were taken with three relatively common cluster computers. Support for the project was provided by ANSYS in collaboration with the National Center for Supercomputing Applications and Pittsburgh Supercomputing Center.

Execution Model

The Distributed Domain Solver is a stand-alone Message Passing Interface (MPI) executable, distinct from the finite element analysis (FEA) host application. Selection is made with the DDS option to the EQSLV command. Controls are provided by the DDSOPT command.

 

Execution Schematic

Before launching the solver, the FEA host must complete a sequence of related tasks including the decomposition for parallel execution. After all information required by DDS has been written to disk, the host transfers control to the solver. Input data is read and distributed, a parallel preconditioner is applied to the interface problem, and an iterative solution process is initiated. If the specified convergence criteria are met, a solution file (scratch.u) is created. DDS closes the execution log (file.DCS) and terminates, thus returning control to the host. If a valid solution file is found, post-processing is allowed to continue. Otherwise, the host application is aborted.

 

Note that the host and solver processes may be run separately, but this arrangement is only suitable for testing purposes. It is currently not possible to resume analyses from the point at which the host regains control from the solver. Since we were primarily interested in the contents of the DDS log file, a split-job configuration was used. Efficient allocation of resources was paramount to post-processing of solution data.

 

Test Cases


An extensible model called carrier was designed to facilitate the application of different mesh densities to a single problem. The carrier part was synthesized specifically for DDS testing and has no physical significance. The following images include element edges and subdomain coloring for carrier case c06.

Carrier Model

A tetrahedral mesh was applied in all cases with the volume meshing (VMESH) command. Element size was controlled through manipulation of the SIZLVL argument to the SMRTSIZE automatic mesh generator option. Ten test cases were created. With the exception of the SIZLVL settings, all of the input files were identical. Dimensioning data is summarized below. Hardware sketches are presented in the next section.

 

Dimensioning Details

Note that the consistent increases in the counts for the AlphaServer SC installation indicate a change in the automatic mesh generator between the 5.7 and 6.0 releases. Testing of the NT Supercluster and Origin 2000 was completed last fall with the earlier version. AlphaServer data was collected in April 2002.

Experimental Platforms

A small rack of Windows/NT workstations, a pair of large SGI/Cray Origin 2000 systems, and a cluster of Compaq AlphaServer ES45 nodes were used to measure cumulative memory usage and elapsed time for each of the test cases described in the previous section. The range of performance levels represented in these platforms was ideally suited to this study.

 

NT Supercluster

The Windows/NT test platform consisted of eight Hewlett-Packard Kayak XU Workstations carved from the 128-node NT Supercluster system at NCSA. Each node featured twin 550-MHz Intel Pentium III processors (512 MB full-speed Level 2 cache) and one gigabyte of memory. Myrinet device drivers for MPI/Pro hadn’t been developed yet and so all tests were performed with the standard Ethernet hardware and drivers (200 ms; 10 Mbyte/s). A separate login node was used to run the host application. All systems were running NT4 with Service Pack 6, and MPI/Pro 1.6.

 

SGI/Cray Origin 2000

The NCSA Origin 2000 cluster currently features 12 machines and 1520 processors. Only two were used in testing. The targeted systems were populated with 250 MHz MIPS R10000 processors (32KB Level 1 and 4 MB Level 2 caches) and 1 gigabyte of memory per node (dual-processor boards). The SGI scalable node architecture is based on proprietary NUMAlink hardware (<1 ms; 600 Mbyte/s). Both machines were running IRIX 6.5 at the time of the experiments. Jobs were executed out of a dedicated queue in order to avoid time-sharing effects on elapsed times.

 

AlphaServer SC

The Terascale Computing System (TCS) at the Pittsburgh Supercomputing Center consists of 750 AlphaServer ES45 Model 2 nodes featuring four 1000 MHz EV6.8 processors (8 MB Level 1 cache) and 4 gigabytes of memory each. The interconnecting network is based on a dual-rail Quadrics QsNet design (5 ms; 340 Mbyte/s per rail). All tests were performed with a single rail enabled. All nodes were running Tru64 UNIX 5.1A.

Line graphs summarizing elapsed times and memory requirements across all carrier cases with respect to partition size are presented in the next three sections. Axes are scaled logarithmically and so the ideal curves for all cases are straight lines (linearly decreasing elapsed times and constant memory usage).

NT Supercluster Results

Elapsed times scaled to 16 processors, the size of the test platform, but efficiencies were less than perfect from the start. Sharper reductions are apparent at 16 processes. These runs involved pairs of processes sharing a single PCI bus and network card. Dual-processor results for smaller partitions were also less efficient, but not to the same degree. This suggests the presence of both networking and memory contention problems.

 

   

Scalability Charts for the NT Supercluster

Memory replication levels were near constant at around 250 Mbytes for all but the largest test case, which exceeded 500 Mbytes. Cumulative memory usage was essentially flat throughout the range of partition sizes, giving indication of declining per-processor memory requirements.

 

We also observed that the node running process 0 always consumed more than an equal share of the total memory. Tests performed by ANSYS indicate that an additional 10 to 20 Mbytes is required to accommodate MPI communication buffers allocated by process 0 during the data distribution phase.

Origin 2000 Results

Up to 128 processes were used in testing of the Origin platform. Data replication levels were modest (~500M bytes) throughout the entire range of partition sizes. Scaling was generally excellent.

 

Speedups were noticeably better when 32 or fewer processes were used regardless of problem size. A quick review of Silicon Graphics literature confirmed that the per-node bandwidth of a 64-processor system is half that of a 32-processor system, while the maximum number of router hops almost doubles.

 

Large variations in elapsed time were also observed when the DDS process count was allowed to approach the physical machine size. This symptom was absent when the same tests were run on a larger system.

 

   

Scalability Charts for the Origin 2000

AlphaServer SC Results

Curves for the AlphaServer SC machine are remarkably similar to those for the Origin 2000. Elapsed times track the same way, and cumulative memory usages are nearly identical.

   

Scalability Charts for the AlphaServer SC

AlphaServer performance was about four times better than the Origin. The ratio of system clock times is also a factor of four, but per-processor bandwidths and networking latencies favor the Origin, at least up to 32 processors (both measures are approximately three times better than the AlphaServer). The implication is that DDS scalability requires a certain balance between networking and single-processor performance, beyond which little benefit is derived.

Conclusions

Specification of large workspaces (-m and -db) reduces virtual memory activities associated with the host program, leading to significant reductions in elapsed times (the effect on CPU time is marginal). Although memory allocation has presumably been automated in an optimal fashion, this approach doesn’t work anywhere near as well as the initial specification of a suitably sized workspace.

 

DDS input files can be very large. Test cases consumed about one kilobyte per degree-of-freedom, and so there is a significant file system requirement associated with DDS. The primary data transfer file is usually many times larger than the corresponding database. Serial processing of such large files also represents a performance bottleneck.

 

Pre- and post-processing times can dominate job turnaround and therefore, the ability to split DDS runs away from the serial phases of execution is a highly desirable feature. Allocation of resources required by DDS for the entire duration of an analysis is impractical.

 

Effective hiding of communication cycles behind computations requires a certain balance. If network transactions cannot keep pace with single-processor performance, scalability will quickly fade. Results for the NT Supercluster are indicative of this behavior.

 

Availability of the PPFA product may conjure up images of previously intractable industrial models running on commodity clusters of desktop computers, but its effects are further reaching than this. Improvements in time to solution promote development of increasingly sophisticated models at every level. Better accuracy yields better designs, which implies a competitive edge. However, the most significant aspect of PPFA is that it embraces grid computing. While we don’t expect large co-located resource centers to vanish any time soon, advancements in network performance will continue to erode their relevance.