
Programming Models

Languages and tools for developing multi-scale applications.

As explained on the ASPA proxy page, our target multi-scale application is composed of various pieces, or components, that must interact with each other in a dynamic and adaptive fashion as the code runs (see figure below). These components include a coarse-scale computation (e.g. LULESH) that adaptively launches a dynamically varying number of fine-scale computations (e.g. VPFFT or CoMD) as the simulation progresses through time. In addition, the application may cache previously computed results in a database to reduce or eliminate duplicate computation. Unfortunately, it seems clear that no single programming language will emerge that supports all of this required functionality; multiple languages and systems must be used in concert to build the final application.

Figure: Schematic of scale-bridging application

This realization has focused our research efforts on selecting and developing the best combination of programming models and systemware that will enable us to efficiently build and run these multi-scale computations. Our selection of this subset, tested and validated using proxy apps, will be driven by co-design iterations with domain scientists [ref Applications and Algorithms] and our analysis, modeling and simulation capabilities [ref Hardware-Interfacing Tools].

Our research targets three specific areas of functionality that address different aspects of our applications. First, we investigate component- or node-level programming models. Second, we are exploring domain-specific languages as a way to insulate the domain scientist from the complexities and diversity of multiple programming models. Third, we look at programming in the large—how we couple the individual components of our applications and execute them using runtime systems and system services. In the following three sections we describe our efforts and results in each of these three areas.

Node- and Component-Level Programming Models

Each individual, single-scale component of our target application must execute efficiently on the available hardware. Using this requirement as a driver, we have used our proxy apps to evaluate and assess various programming models and languages. Our goals are twofold: to identify programming models best suited to particular algorithms and hardware, and to provide feedback to vendors and standards bodies based on our results and experience. In year 2 of the project we have examined a wide variety of programming approaches including CUDA, OpenCL, OpenMP, X10, Chapel, and the Open Community Runtime (OCR).

Based on our experiences with this variety of models, it is clear that no single programming model has emerged as a universal solution to exascale challenges, and we believe that programmers will continue to use a variety of programming models. It is also clear that practically all programming models still require significant expertise in hand tuning, often machine-specific, in order to obtain high performance. It is critical that programming models continue to evolve to provide a more favorable balance between programmer productivity, code maintainability, and application performance. Features that provide more flexible and effective ways to express concurrency, as well as more powerful mechanisms for expressing data locality, will be critical. Support for legacy code, composability with other models, and the ability to control execution scheduling and express data dependencies are also important ingredients for any programming model to be successful at exascale.

Domain Specific Languages

To insulate application scientists and programmers from the complexity of this polyglot language environment, we have embarked on a high-risk, high-payoff project of developing a domain-specific language (DSL) infrastructure. DSLs, sometimes called “little languages”, are concise programming languages designed to naturally express the semantics of a narrow problem domain (if you've used SQL, LaTeX, or the Unix shell, you've used a DSL). So, instead of programming in multiple low-level languages, a scientist will program in a language that represents the science she is studying (e.g. molecular dynamics). The use of DSLs can provide significant gains in the productivity and creativity of application scientists, in application portability, and in application performance.

Our DSL effort consists of three related tasks:

  • We develop an infrastructure to specify the syntax and semantics of many different DSLs and then compile them to execute efficiently on various platforms (e.g. multicore, GPU, cluster). The environment must enable rapid development of a multitude of DSLs in order to amortize the cost of building it; supporting only a few DSLs would not justify the up-front expense.
  • As a first step, we compile a single DSL: Liszt. The “domain” of the Liszt DSL is unstructured meshes. Liszt provides language-level primitives for traversing and operating on the elements of a mesh (e.g. faces, vertices, edges, etc.). Our compiler will generate parallel code, but the Liszt user does not need to explicitly describe the parallelism; it is implicit in the semantics of the language (e.g. for (e<-edges(mesh)) loops over all the edges in the mesh).
  • Given the Liszt DSL, we can implement an actual application using it. In our case, this application is LULESH (a shock hydrodynamics code on an unstructured mesh).

In year 1 of the project we implemented this LULESH application in a version of Liszt that used an earlier compiler framework based on Scala, a multi-paradigm language built on top of the Java Virtual Machine. As the implementation progressed we identified a number of issues with the design of Liszt that made the implementation of LULESH difficult. In particular, the ordering of mesh elements was inconsistent in Liszt, while LULESH required ordered vertex access. This semantic mismatch is understandable, as Liszt was designed prior to the formation of ExMatEx and hence didn't reflect any of the requirements of LULESH. However, this issue highlights a critical point: DSLs must be designed, as part of a language co-design process, in direct collaboration with the domain scientists who will use them.

The Liszt version of LULESH used half the lines of code of the serial implementation of LULESH, but initially ran only half as fast. The speed issue was fixed (to within a factor of 1.25) using a workaround. This result, together with the productivity improvements and the fact that the same Liszt source code compiled to multiple compute platforms (serial, multi-core, and GPU), convinced us to proceed with the DSL component of the project.

However, in addition to the language design issues, we identified a number of shortcomings in the existing Scala-based compiler infrastructure:

  1. We need to interoperate with legacy software written in C, C++ and Fortran. The previous version of the compiler was written in Scala and used the Java Virtual Machine, which made it hard to link to C-like libraries.
  2. We need to make it easier to build embedded DSLs. The new system will use a dynamically typed language for creating DSLs.
  3. We need to be more adaptive, both to processor and node architecture, and to internal data structures. In particular, the previous version worked only on static meshes and was a source-to-source compiler, which made it difficult to implement just-in-time compilation.
  4. As part of the ExMatEx Center, we also want to support multiple, interoperating DSLs, in particular for meshes and particles.
Terra: Y2 Accomplishments

For these reasons, we decided to re-implement the compiler infrastructure and have built Terra, a new low-level system programming language for building DSLs. Terra is designed to interoperate seamlessly with Lua [link], a high-level dynamically-typed programming language:


    -- This top-level code is plain Lua code.
    function printhello()
        -- This is a plain Lua function
        print("Hello, Lua!")
    end
    printhello()

    -- Terra is backwards compatible with C
    -- we'll use C's io library in our example.
    C = terralib.includec("stdio.h")
    
    -- The keyword 'terra' introduces
    -- a new Terra function.
    terra hello(argc : int, argv : &rawstring)
        -- Here we call a C function from Terra
        C.printf("Hello, Terra!\n")
        return 0
    end
    
    -- You can call Terra functions directly from Lua
    hello(0,nil)

    -- Or, you can save them to disk as executables or .o
    -- files and link them into existing programs
    terralib.saveobj("helloterra",{ main = hello })

Like C, Terra is a simple, statically-typed, compiled language with manual memory management. But unlike C, it is designed from the beginning to interoperate with Lua. Terra functions are first-class Lua values created using the terra keyword. When needed, they are JIT-compiled to machine code.

We designed Terra to make it easy to write JIT-compiled domain-specific languages (DSLs), and integrate them with existing applications. We use techniques from multi-stage programming [1] to make it possible to meta-program Terra using Lua. Terra expressions, types, and functions are all first-class Lua values, making it possible to generate arbitrary programs at runtime. This allows you to compile DSLs written in Lua into high-performance Terra code.
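
As a concrete illustration, here is a hedged sketch of our own (not code from the Terra paper or documentation; names such as makeadder and compilepoly are hypothetical) showing the two basic staging idioms: ordinary Lua values captured as compile-time constants, and quotes/escapes used to generate Terra expressions from Lua.

    -- A minimal sketch of staging Terra code from Lua (hypothetical example).

    -- 1. Specialization: N is a plain Lua value captured as a constant
    --    when the Terra function is compiled, so each call to makeadder
    --    produces a distinct, JIT-compiled function.
    function makeadder(N)
        return terra(x : int)
            return x + N
        end
    end
    local add3 = makeadder(3)
    print(add3(4))    -- prints 7

    -- 2. Code generation: quotes (`) build Terra expressions and escapes
    --    ([ ]) splice Lua values into them; here we unroll Horner's rule
    --    for a polynomial with fixed coefficients.
    function compilepoly(coeffs)
        local x = symbol(double, "x")
        local expr = coeffs[#coeffs]                -- leading coefficient
        for i = #coeffs - 1, 1, -1 do
            expr = `[expr] * [x] + [ coeffs[i] ]    -- grow the expression
        end
        return terra([x]) return [expr] end         -- splice into a new function
    end
    local p = compilepoly({1.0, 0.0, 2.0})          -- p(x) = 1 + 2*x^2
    print(p(3.0))                                   -- prints 19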

Generated Terra code is fast. Terra programs use the same LLVM backend that Apple uses for its C compilers. This means that Terra code performs similarly to equivalent C code. For instance, our translations of the nbody and fannkuch programs from the programming language shootout [2] perform within 5% of the speed of their C equivalents when compiled with Clang, LLVM's C frontend. Terra also includes built-in support for SIMD operations, and other low-level features like non-temporal writes and prefetches. You can use Lua to organize and configure your application, and then call into Terra code when you need controllable performance.

The Lua-Terra ecosystem is easy to integrate with existing applications. Lua was designed to be embedded in C programs [3]. We leverage this design to make it easy to embed Lua-Terra programs in other software as a library. This design allows you to add a JIT-compiler to your existing software, or integrate a DSL with your already existing high-performance code. The code that Terra generates is ABI compatible with C, so it should work with any code written in C, C++, or FORTRAN.

We also released an open source version of the software at terralang.org, which has received nearly 40,000 external views and has nearly 500 “stars” on GitHub. We've also used it to implement several domain-specific languages. In a paper presented at PLDI [4], we described our design of Terra and several example applications. Our auto-tuner for matrix multiply performs within 20% of the state-of-the-art ATLAS auto-tuner, and at over 60% of the peak floating point performance of the core. We also presented Orion, a DSL for image processing that can fuse, vectorize, and parallelize stencil computations. We've further developed Orion, providing an optimizer based on a performance model that can quickly autotune image processing programs.

We are also currently working on a probabilistic programming language, similar to Church [5], that is embedded in Terra. By using a low-level compiled language rather than an interpreter, we are able to run probabilistic programs two orders of magnitude faster than Church and five times faster than other known implementations of probabilistic programming.

We continued work on the object system in Terra, which makes it easier to create the high-performance objects, called “Exotypes”, that form the basis of optimizing domain-specific languages. Both the behavior and the physical layout of these types are defined programmatically in Lua, which gives the programmer a large degree of expressiveness while retaining careful control of performance. More information about this API is in our API documentation.
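
As a rough illustration of the underlying idea (a minimal sketch using Terra's ordinary struct facilities and hypothetical names, not the Exotype API itself), a type's layout can be assembled from plain Lua data before any Terra code that uses it is compiled:

    -- Hypothetical sketch: struct layout driven by Lua data.
    local struct Point { x : double }            -- start with one field
    for _, name in ipairs({"y", "z"}) do         -- append fields from a Lua list
        table.insert(Point.entries, { field = name, type = double })
    end

    -- Terra code is compiled against the generated layout.
    terra norm2() : double
        var p : Point
        p.x = 1.0
        p.y = 2.0
        p.z = 3.0
        return p.x*p.x + p.y*p.y + p.z*p.z
    end
    print(norm2())    -- prints 14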

We tested this object system by implementing libraries for object serialization and dynamic assembly. Using Exotypes we can serialize object trees at 7.4 Gb/s, an order of magnitude faster than out-of-the-box solutions such as Java serialization using Kryo (https://code.google.com/p/kryo/), which runs at 190 Mb/s, or Google protocol buffers (https://code.google.com/p/protobuf/), which run at 150 Mb/s. Furthermore, since these types are generated programmatically, we can create optimized objects based on compressed representations. To demonstrate this, we implemented a templating dynamic assembler library. Dynamic assemblers form the basis of JIT compilers and need to be fast to minimize compilation time. Using our table-driven approach, we were able to create a dynamic assembler for x86 code based on LuaJIT's DynASM library (http://luajit.org/dynasm.html). This assembler required only ~2100 lines of Lua-Terra code to define; in contrast, Google's assembler used in Chrome requires over 4900 lines of C++. The flexibility of Terra's types also allowed us to optimize the assembly code so that it can assemble x86 code templates at over 11 Gb/s, 3 to 17 times faster than Google's assembler on a range of templated code.

Finally, we have continued to test Terra's language embedding API (http://terralang.org/api.html#embedded_language_api) by beginning to build more embedded DSLs in Terra, including the Orion stencil language, a Terra version of Liszt, and an embedded language for probabilistic programming.

LULESH in Liszt Reimplementation

We are in the process of converting the Liszt language to an implementation based on Terra. Using Terra will allow us to optimize Liszt kernels based on dynamically determined information, removing many of the previous restrictions in the language. Mesh topology will be able to change over the course of the program; kernels will be able to call existing C, C++, or FORTRAN code; and features such as particles or custom mesh formats will be easier to add.

We are also updating the Liszt runtime to use a generic relational representation both for mesh topological relations and for per-element data used for computations. The runtime API is being updated to allow programs to only load the topological relations that are used, which will allow Liszt programmers to use meshes that have been stored in more generic relational formats as well as the already-supported 3D manifold-specific formats. Generalizing our runtime data structures also allows us to add desired language features such as particles, higher order elements, and implicit methods without adding new internal representations to the runtime.
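
To give the flavor of a relational representation (a hypothetical sketch in plain Lua, not the actual Liszt runtime API), topology can be stored as flat index arrays and per-element data as columns keyed by element id, so new element kinds become just additional relations and columns:

    -- Hypothetical sketch of a relational mesh representation.
    -- "edge" relation: edge id -> (head vertex id, tail vertex id)
    local edge_head = { 1, 2, 3 }
    local edge_tail = { 2, 3, 1 }

    -- per-vertex field: a column of data keyed by vertex id
    local temperature = { 10.0, 20.0, 30.0 }

    -- a kernel over the edge relation: average temperature across each edge
    for e = 1, #edge_head do
        local t = 0.5 * (temperature[edge_head[e]] + temperature[edge_tail[e]])
        print(("edge %d: %.1f"):format(e, t))
    end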

The front-end of the Terra port of Liszt is now compiling programs and generating code for the single-core runtime that reads mesh files, runs iterative computations over per-element mesh values, and prints results. In order to support the LULESH application, we are implementing language features that allow programmers to invoke functions from parallel Liszt code, and we are updating the code generation to use the updated runtime. Our near-term goal is to reinvigorate the collaboration between Liszt and LULESH developers and re-implement LULESH in the new Terra infrastructure.

Programming Models Publications

Papers

Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, and J. Vitek, “Terra: a multi-stage language for high-performance computing”, 34th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2013), Seattle, WA, June 16-22, 2013 [external link]

Software

“Terra: A low-level counterpart to Lua”, [open source at GitHub, see also terralang.org]

“Liszt: A DSL for solving mesh-based PDEs”, [older version of Liszt at liszt.stanford.edu]

Most of our code is released as open source. Visit the ExMatEx GitHub site.