Fine-Grain MPI for Extreme-Scale Programming

Leader: Alan Wagner <wagner@cs.ubc.ca>
Estimated time: 2 hours

Background

Scalable computing is needed to meet the challenge of computing on ever-increasing amounts of data. Multicore and many-core processors add more processing cores but still cannot keep pace with the demand for more computation. There is a need for technologies that scale not only to multiple cores within one chip, but also across multiple chips in one machine and across machines in a data center.

Process-oriented programming is the design of programs as a collection of processes that communicate using message-passing. Process-oriented programs execute concurrently through either (a) logical concurrency: sharing one processor core and taking turns executing, or (b) physical concurrency: executing in parallel on separate cores in one or more machines. Extreme-scale programming is the term we use for the design of programs that use thousands or millions of such processes.

Discussion

Why extreme-scale programs? You cannot take advantage of thousands or millions of cores if you cannot express thousands or millions of independently executable processes. Message-passing is the only communication mechanism that is able to scale on one machine, on many machines inside a data center, or across networks including the Internet.

The only systems in common use today that may possibly scale to extreme size are SPMD programs, where multiple copies of the same program are started in parallel, and Hadoop-like processing, where there is a fixed framework. These two approaches represent only a small subset of the extreme-scale programs possible and address only a subset of the more general problems that could be tackled at extreme scale. The ability to design extreme-scale programs is essential to meet the challenge of data-intensive computing. As the volume and velocity of data increase, so too does the cost of computing on that data, and there is a corresponding need for programs to scale to use more and more processors.

How do we execute extreme-scale programs with thousands or millions of processes? It is challenging to design programs to execute on more than a few processors, and challenging to muster and provision that many resources. Fine-Grain MPI (FG-MPI) is a modified version of MPI created to support extreme-scale programming. FG-MPI is derived from MPICH2, a widely used implementation of MPI, and makes it possible to express massive amounts of concurrency by running multiple MPI processes inside a single OS-process, in concert with running many of these OS-processes on multiple cores across multiple machines. This adds an extra degree of freedom: the execution of a program can be adjusted to combine some degree of logical concurrency with physical concurrency. The ability to add logical concurrency decouples the number of processes from the hardware, making it possible to execute many more processes than there are physical cores.

For example, we recently demonstrated that FG-MPI can execute over 100 million MPI processes (an exascale number of processes) using 6,500 processing cores in a compute cluster. In this way FG-MPI gives an extra scaling factor: in the small, programmers can develop programs using thousands of processes on a workstation, and then take those same programs and scale them to what we believe can be billions of processes on a large cluster. Today, because MPI binds processes to OS-processes, one cannot design an extreme-scale program without a correspondingly extreme number of machines. The concurrency enabled by FG-MPI allows for in excess of a thousand-way magnifier in scaling the number of processes. An FG-MPI program with 1000 processes can run inside one OS-process on a single core, and the same program can also run in parallel with each of the 1000 processes on its own core.
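The scaling arithmetic in the paragraph above can be sketched in a few lines. This is only an illustration: the function and variable names are hypothetical and are not part of FG-MPI, and the per-core figure assumes one OS-process per core with processes spread evenly, which the text does not state. The only numbers taken from the text are the 100 million processes, the 6,500 cores, and the 1000-process example.

```python
def total_processes(per_os_process: int, os_processes: int) -> int:
    """Total MPI processes = co-located processes per OS-process x number of OS-processes."""
    return per_os_process * os_processes


# The same 1000-process program at the two extremes described in the text.
all_logical = total_processes(per_os_process=1000, os_processes=1)   # one core, taking turns
all_physical = total_processes(per_os_process=1, os_processes=1000)  # one process per core

# Rough number of MPI processes hosted per core in the demonstration
# (assumption: even spread; "over 100 million" makes this a lower bound).
demo_per_core = 100_000_000 / 6_500

print(all_logical, all_physical, round(demo_per_core))
```

Under these assumptions each core in the demonstration hosted on the order of fifteen thousand logically concurrent processes, which is where the "in excess of a thousand-way" magnifier comes from.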

Extreme-scale programming enables a new approach to programming large systems. Once programs can have millions or billions of processes, one can begin to imagine programs where each process models the activity of real entities: whether these be traders in a stock exchange, neurons in neural-computing applications, individuals in a social network, or devices in a sensor net. We can begin to model the real world, explore the properties that emerge, synchronize it with real data or explore new domains.

One cannot develop these types of programs without the ability to create enough processes to test the program with a realistic number of entities. FG-MPI makes these types of programs possible. As the name fine-grain implies, the motivation for developing FG-MPI was not simply the ability to have lots of processes; it was also the ability to express fine-grain concurrency. The usual coarse-grain approach, where one identifies relatively large chunks of computation that can be done in parallel, limits the ability to scale, since the parallelism cannot exceed the number of large chunks of computation. Scaling further can require a complete redesign of the program.

In FG-MPI, since we know there is more concurrency at the finer grain (functions) than at the coarse grain (programs), the programmer can design the program from the start with a large number of fine-grain processes. Using FG-MPI we can then flexibly combine processes to execute in one OS-process (logically concurrent) to coarsen the parallelism of the application. The main difference is that, rather than starting with coarse-grain parallelism and having to redesign the program as we scale, we start with many smaller processes and combine them as necessary as we scale or port to different machines. This allows the programmer to expose a massive amount of concurrency in a way that can be flexibly mapped onto many types and sizes of clusters.
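As a minimal sketch of what combining processes to coarsen parallelism means, the following groups fine-grain process ranks into OS-processes in contiguous blocks. The name block_map and the block policy are assumptions made for illustration; the text does not say how FG-MPI itself assigns processes to OS-processes.

```python
def block_map(num_processes: int, num_os_processes: int) -> list[list[int]]:
    """Assign ranks 0..num_processes-1 to OS-processes in contiguous blocks.

    Hypothetical helper: any leftover ranks are given to the first OS-processes,
    so block sizes differ by at most one.
    """
    base, extra = divmod(num_processes, num_os_processes)
    groups, start = [], 0
    for i in range(num_os_processes):
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


# The same 8-process design, mapped two ways without changing the program.
fine = block_map(8, 8)    # fully parallel: one process per OS-process
coarse = block_map(8, 2)  # coarsened: four logically concurrent processes each
print(fine)
print(coarse)
```

The program is written once against 8 ranks; only the mapping changes when moving between a workstation and a cluster.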

The approach of combining smaller processes into a single OS-process is not a new idea. What is new, and what makes this approach practical, is the idea of using non-preemptive threads to implement processes and integrating them directly into the middleware. Previous approaches, which use other types of threads or layer light-weight threads on top of the middleware, do not achieve the scalability and performance possible with FG-MPI. In FG-MPI we are able to support thousands of logically concurrent processes, and we have carefully integrated them into the MPICH2 middleware to reduce the overhead of supporting multiple processes per OS-process. For example, we have shown that even for the well-known NAS benchmarks we can improve performance over existing MPI by adding some logical concurrency. Adjusting the logical concurrency (the size of the process) makes it possible to better fit the data accessed by processes to the cache, and smaller processes lead to more frequent, smaller messages and more fluidity. In most of the NAS benchmarks these improvements outweighed the cost of the added concurrency. In conclusion, FG-MPI not only enables new types of programs but can also improve existing MPI programs.
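To make non-preemptive (cooperative) scheduling of message-passing processes concrete, here is a toy sketch that uses Python generators as stand-ins for logically concurrent processes. It is not how FG-MPI is implemented: the mailboxes, the round-robin ready queue, and the names ring_process and run are all assumptions chosen for illustration. Each process passes a token around a ring and gives up the core voluntarily when it has nothing to receive.

```python
from collections import deque


def ring_process(rank, size, mailboxes, log):
    """One logical process in a ring: wait for a token, record it, pass it on."""
    if rank == 0:
        mailboxes[1 % size].append(0)      # rank 0 injects the token
    while not mailboxes[rank]:
        yield                              # nothing to receive: yield the core voluntarily
    token = mailboxes[rank].popleft()
    log.append((rank, token))
    if rank != 0:
        mailboxes[(rank + 1) % size].append(token + 1)


def run(size):
    """Round-robin, non-preemptive scheduler: a process runs until it yields or finishes."""
    mailboxes = [deque() for _ in range(size)]
    log = []
    ready = deque(ring_process(r, size, mailboxes, log) for r in range(size))
    while ready:
        proc = ready.popleft()
        try:
            next(proc)
            ready.append(proc)             # it yielded, so queue it to run again later
        except StopIteration:
            pass                           # it finished, so drop it
    return log


print(run(4))
```

Because a process only loses the core at a point where it chooses to yield, no locking is needed around the mailboxes in this sketch, which is one reason cooperative scheduling can be cheap.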

FG-MPI provides a unique opportunity for developing extreme-scale programs. There is the opportunity to make it available for widespread use and to create an open-source community around the technology. Because FG-MPI is backwards compatible with existing MPI programs, there is already a large community that can adopt this technology. In addition, the close association of the FG-MPI group and the MPICH2 group at Argonne provides a clear path for incorporating FG-MPI into MPICH2. MPICH2 is a very successful implementation of MPI that is widely used as the basis of many commercially available MPIs, including Intel's MPI.

Compose-Map-Configure

The workshop will also consider another aspect of the engineering. Parallel programming languages available today focus on the programmability offered by the language in order to transition developers to thinking parallel, but neglect the deployment and placement of those parallel tasks.

Two major factors in performance are the communication overhead and the idle-processing overhead. With SPMD programs, the work-stealing approach employed by most languages is a natural way to maximize processing efficiency. However, with MPMD programs, the amount of time each process spends on computation varies more, and processes should not all be treated equally. These processes may be seen as services with varying amounts of up-time, and they benefit from being placed statically to reduce unnecessary process migration. Furthermore, simply placing a process based on its computational requirements neglects the communication overhead, which plays a large role in performance for many applications.

With FG-MPI, we can exploit locality by statically placing processes on the available computational resources, as well as by having processes run concurrently, to reduce communication and idle-processing overheads. However, with this new flexibility comes a need to tweak and specify these options quickly. We propose a Compose-Map-Configure (CMC) tool that employs a four-stage approach to software specification and creates a separation of concerns from design to deployment. This allows decisions about architecture interactions, software size, resource capacity, and function to be defined separately. The main ideas include the encapsulation of parallel components using hierarchical process compositions, and corresponding support for channel communication. There are opportunities to simplify the composition of separately developed services and applications, and to integrate with optimization tools that explore and adapt computation to the available resources.
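A small sketch shows why placement should account for communication and not only computational load. Everything here is hypothetical and is not the CMC tool: the traffic figures, the OS-process labels "A" and "B", and the function remote_messages are invented for illustration. Both placements below give each OS-process two processes, so they look identical from a load-balancing view, yet they differ threefold in messages that must cross an OS-process boundary.

```python
def remote_messages(traffic, placement):
    """Count messages whose sender and receiver are placed in different OS-processes."""
    return sum(count
               for (src, dst), count in traffic.items()
               if placement[src] != placement[dst])


# Hypothetical traffic for a 4-stage pipeline: messages sent along each edge.
traffic = {(0, 1): 100, (1, 2): 100, (2, 3): 100}

# Two static placements with the same number of processes per OS-process.
contiguous = {0: "A", 1: "A", 2: "B", 3: "B"}   # neighbouring stages co-located
interleaved = {0: "A", 1: "B", 2: "A", 3: "B"}  # neighbouring stages split apart

print(remote_messages(traffic, contiguous), remote_messages(traffic, interleaved))
```

A mapping stage that can evaluate a cost like this before deployment is one way the separation between composing processes and placing them could pay off.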

An example of the CMC process will be presented in the workshop, showing how we go from processes plus specification to execution. Feedback on the ideas behind the CMC tool will be sought and very welcome.

Action

The purpose of the workshop will be a practical hands-on introduction to FG-MPI: how to create FG-MPI programs and run them using "mpiexec". I will discuss the added flexibility in executing programs and its limitations. I will also discuss applications and tools we have started to develop, and potential extensions.

Fine-Grain MPI (FG-MPI) extends the MPICH2 runtime to support execution of multiple MPI processes inside an OS-process. FG-MPI supports fine-grain, function-level, concurrent execution of MPI processes by minimizing the messaging and scheduling overheads of processes co-located inside the same OS-process. FG-MPI makes it possible to have thousands of MPI processes executing on one machine, or millions of processes across multiple machines.

For the practical part of this workshop, it will help to have downloaded the latest FG-MPI release, together with the code examples on the download page. These are available from the "Downloads" tab on the home page. However, the work can still be followed with just pencil and paper.