GASNet gemini-conduit and aries-conduit documentation
Larry Stewart <stewart@serissa.com> 

User Information:
-----------------

This documentation covers both gemini-conduit (for the Cray XE and XK
series) and aries-conduit (for the Cray XC).  The terms "Gemini" and
"gemini-conduit" are used throughout this document when both gemini
and aries are meant.

Because these Cray systems have non-trivial differences between the login
nodes and the compute nodes, one must use a cross-configure script to build
either of these conduits.  See other/contrib/cross-configure-crayxe-linux
for Gemini and other/contrib/cross-configure-crayxc-linux for Aries.

Where this conduit runs:
-----------------------

The gemini-conduit runs on systems with Cray's "Gemini" interconnect,
including their XE and XK series machines.  There is no known minimum version
required of any hardware or software component, but it is likely that
sufficiently old releases of the system software are not supported correctly.

The aries-conduit runs on systems with Cray's "Aries" interconnect, including
their XC series machines.  To the best of our knowledge, there are no known
compatibility issues with any release of the XC series system software.

Configuring job launch:
----------------------

Both ALPS (srun) and native SLURM (srun) are supported for job-launch by the
provided gasnetrun_gemini and gasnetrun_aries scripts.  The value of the
environment variable MPIRUN_CMD can be used to choose between these two.  The
value when configure is run establishes a default, which a value set at
job-launch time can override.  Our recommended settings are
 + For ALPS (default):   MPIRUN_CMD="aprun -n %N %C"
 + For native SLURM:     MPIRUN_CMD="srun -K0 -W60 %V -mblock -n %N %C"

For SLURM-based systems, one can simply set APRUN="srun -K0 -W60 %V -mblock"
in ones environment when running the appropriate cross-configure script.

If you desire additional customization, please see mpi-conduit's README for
documentation regarding the MPIRUN_CMD environment variable.  Note, that it
is NOT required that one compile mpi-conduit to use gemini or aries conduits.

Recognized environment variables:
---------------------------------

* All the standard GASNet environment variables (see top-level README)

GASNET_BARRIER - barrier algorithm selection
  In addition to the algorithms in the top-level README, there is an
  implementation specific to Gemini and Aries:
    GNIDISSEM - like RDMADISSEM, but implemented using lower-level
                GNI operations for lower latency.
  Currently GNIDISSEM is the default on both Gemini and Aries.

GASNET_NETWORKDEPTH_SPACE - volume of AM Request queue
  This determines the maximum volume of AM Requests (in bytes) that can
  be oustanding between a given pair of peers before flow-control.  This
  differs from most other conduits in which the depth is given by a
  count.  Values smaller than 2 maximum size Medium AMs, or larger than
  64 maximum sized Medium AMs, will be silenly adjusted to that range.
  The default value is 12K.

GASNET_NETWORKDEPTH_TOTAL - depth of AM Reply queue
  This determines the maximum number of AM Requests that can be
  oustanding from one peer to all others before flow-control.
  The default value is 64, and the minimum is 1.

GASNET_GNI_NUM_PD - number of post descriptors
  AM and RDMA operations all requires use of a control block called a
  Post Descriptor. This variable controls how many of these are
  allocated, and therefore limit the total number of operations which
  can be in-flight at any time.
  The default value is 512, and the minimum is 32.

GASNET_GNI_GET_FMA_RDMA_CUTOVER - default 4096 max 16384
GASNET_GNI_PUT_FMA_RDMA_CUTOVER - default 4096 max 16384
  These set the cutover from Fast Memory Access (RDMA commanded in user
  mode) to RDMA (commanded through the kernel).  Transfers up to and
  including these sizes will be performed using FMA.  Use of FMA will
  typically result in lower latency, but at the expense of greater
  involvement by the CPU (and thus less potential, for instance, for
  overlap of non-blocking communication with computation).

GASNET_GNI_GET_BOUNCE_REGISTER_CUTOVER  - default 4096, max 32768
GASNET_GNI_PUT_BOUNCE_REGISTER_CUTOVER  - default 4096, max 32768
  Every Post Descriptor has a small bounce buffer (128 bytes) for
  handling out-of-segment transfers (normally FMAs) with the least
  overhead.  Beyond that size out-of-segment can be performed using
  either bounce buffers or dynamic registration.  Such transfers up
  to and including these sizes will be conducted using bounce buffers.
  Beyond these sizes, dynamic memory registration is used.
  The minimum allowed is GASNET_PAGESIZE (normally 4K) and lower values
  will be silently replaced by this value.
  
GASNET_GNI_BOUNCE_SIZE      - default 256K, space for bounce buffers
  There is a pool of bounce buffers, each of a fixed size determined by the
  larger of GASNET_GNI_{PUT,GET}_BOUNCE_REGISTER_CUTOVER (see above).  The
  value of GASNET_GNI_BOUNCE_SIZE determines the size of the entire pool of
  bounce buffers.  Thus the number of bounce buffers available is
     GASNET_GNI_BOUNCE_SIZE / MAX(GASNET_GNI_PUT_BOUNCE_REGISTER_CUTOVER,
                                  GASNET_GNI_GET_BOUNCE_REGISTER_CUTOVER,
                                  (gasnet_AMMaxMedium() + 64))
  Since all three terms in the MAX default to 4K, the default 256K gives
  64 bounce buffers.

  If an operation must allocate a bounce buffer and the pool is empty,
  it will poll the completion queue until one is returned.  Therefore, this
  setting may limit the number of operations requiring use of a bounce
  buffer which can be in-flight at any time.

GASNET_GNI_MEM_CONSISTENCY
  This control the use of the flag GNI_MEM_STRICT_PI_ORDERING,
  GNI_MEM_RELAXED_PI_ORDERING, or neither when registering memory
  for RDMA.
  The legal values are the strings "strict", "relaxed" or "default".
  In this release the default is "relaxed".

GASNET_GNI_MEMREG
  This controls the maximum number of concurrent temporary memory
  registrations, used for RDMA transfers to and from addresses outside of
  the GASNet segment.
  The default value is 16 on Gemini, and is 3072 on Aries.

  If this vale is too large (possibly due to competition with MPI or
  another runtime library) you will see a fatal error at startup.  If
  GASNet detects the resource exhaustion, the message includes the
  following text:
      UDREG_CacheCreate() failed with rc=2
  Alternatively, if the resource exhaustion is detected by the MPI runtime
  you may see a message including the following text:
      UDREG_CacheCreate unknown error return 2
  Other runtimes may include similar failure messages.
  If you encounter these failures you should reduce the value of this
  parameter, or in the extreme you may wish to set GASNET_USE_UDREG=0 to
  disable use of UDREG by GASNet.

GASNET_PHYSMEM_PINNABLE_RATIO - default 0.80
GASNET_PHYSMEM_MAX - default <size of physmem> * <pinnable ratio>
GASNET_PHYSMEM_NOPROBE - default 0
  The logic for setting the maximum size of the segment is copied from
  portals-conduit, which uses the following setting to control the
  maximum size of the segment.  If you set PINNABLE_RATIO 1 (or higher)
  and PHYSMEM_MAX, and PHYSMEM_NOPROBE you will bypass all the checking.
  Of course mmap might fail, or you might be swapping the segment...

GASNET_USE_UDREG
  AVAILABLE UNLESS CONFIGURED WITH --disable-gni-udreg.
  This gives a boolean: "0" to disable or "1" to enable the use of the
  UDREG library for caching of local memory registrations, and is
  intended primarily for debugging purposes.
  The default is enabled.

GASNET_USE_FIREHOSE
  AVAILABLE ONLY IF CONFIGURED WITH --enable-gni-firehose.
  This gives a boolean: "0" to disable or "1" to enable the use
  of the firehose dynamic pinning library for local memory
  registration, and is intended primarily for debugging purposes.
  The default is enabled.

GASNET_DOMAIN_COUNT
GASNET_GNI_PTHREADS_PER_DOMAIN
GASNET_AM_DOMAIN_POLL_MASK
  AVAILABLE ONLY IF CONFIGURED WITH --enable-gni-multi-domain.
  For their use, please see the multi-domain documentation under the
  "Optional configure-time settings" section, below.

PMI_GNI_LOC_ADDR   - NIC address on local node
PMI_GNI_COOKIE     - Access code shared by all nodes in job
PMI_GNI_PTAG       - Protection tag shared by all nodes in job
PMI_GNI_DEV_ID     - Device ID (Selects which NIC is in use)
  Provided by Cray runtime system.  Don't mess with these.

Optional configure-time settings:
------------------------------

--with-gni-max-medium=[value]
   By default gasnet_AMMaxMedium() is 4032: a 4096 byte buffer minus
   64 bytes reserved for up to 16 handler arguments.  This configure
   option allows control over the value of gasnet_AMMaxMedium().
   The value must be a multiple of 64, and cannot be less than 512.
   The default value is 4032.

--enable-gni-multi-domain
   This configure option enables EXPERMIENTAL support for improved
   performance of RDMA operations in a PAR (multi-threaded) build of
   Gemini or Aries conduit by allocating multiple GNI Communication
   Domains and distributing the clients threads over these domains.
   Since operations on a given domain must be serialized (we use a
   pthread mutex), use of multiple domains can significantly reduce
   lock contention.  The trade-offs are that with multiple domains
   the maximum size of the GASNet segment may be reduced, and progress
   on Active Messages may be slowed.

   The following three environment variables are honored when
   multi-domain support is enabled:

   GASNET_DOMAIN_COUNT
   This is the number of Communication Domains to create per process.
   The multi-domain support has no benefits unless this is set to a
   value larger than 1.
   The default value is 1 (a single domain with no benefits).

   GASNET_GNI_PTHREADS_PER_DOMAIN
   This is the number of threads to assign to the first domain before
   assigning any to the second.  When the last domain has been assigned
   this many threads, then assignment resumes at the first domain.
   The default value is 1 (cyclic/round-robin assignment).

   GASNET_AM_DOMAIN_POLL_MASK
   This value controls how often threads in domains other than the
   first poll for arrival of Active Messages (threads in the first 
   domain poll on every call to gasnet_AMPoll).  Since all Active
   Message traffic arrives via the first domain, polling by more than
   one thread will increase lock contention, but a total lack of
   polling by threads in other domains can lead to deadlock in some
   circumstances.
   A value of 0 will result in polling for Active Messages on every
   call (explicit or implict) to gasnet_AMPoll().
   Values of 1, 3, 7, 15, etc. will result in progressively lower
   average frequencies (1 in (n+1)).
   The default is 255 (1 in every 256 polls by threads outside the
   first domain will poll for AM arrivals).

--enable-gni-firehose
   This configure option enables EXPERMIENTAL support for the firehose
   dynamic memory registration library for caching of dynamic memory
   registration.  This is only caching of local registration, and NOT
   the "full" firehose which provides cached remote registration (as
   for GASNET_SEGMENT_EVERYTHING support).
   NOTE: This option is largely superseded by the (ON by default) use
   of Cray UDREG, which does not suffer from firehose's vulnerability
   when memory is unmapped.


Optional compile-time settings:
------------------------------

* All the compile-time settings from extended-ref (see the extended-ref README)

in gemini-conduit/gasnet_gemini.h:

GASNETC_USE_SPINLOCK	- 0 user gasneti_mutex_t (default)
			  1 use gasneti_spinlock_t

The _DEFAULT values for all of the GASNET_GNI_ environment variables.

in gemini-conduit/gasnet_gemini.c:

int gasnetc_poll_burst = 10;   /* number of rdma completions to get at once */

Sets the number of events for arriving active messages taken at once
#define AM_BURST 20

Known problems:
---------------

* See the Berkeley UPC Bugzilla server for details on known bugs.

Future work:
------------

 * Caching of memory registrations (Firehose?)

gemini-conduit uses a single fixed memory registration for the GASNet segment. It uses another for the AM receive buffers, and transiently uses additional registrations for large RDMA transfers outside the segment.  There are only a modest number of mapping registers available per rank, so large numbers of registrations are inappropriate, but small numbers could be cached rather than per-rdma transient.

 * Inline

It should be possible to inline the special cases of short blocking GET and short blocking PUT

 * Alternative designs

The O(N) AM buffers queues could be used only for wakeup, leaving AM transport to RDMA, more along the lines of the shmem-conduit.  This would reduce buffer requirements for large jobs.

==============================================================================

Design Overview:
----------------

The gemini conduit provides the gasnet core and extended APIs on Cray systems
based on the gemini interconnect.  Initial work done on the XE6.

References:

Cray documentation "Using the GNI and DMAPP APIs" document S-2446-3102

http://docs.cray.com/cgi-bin/craydoc.cgi?mode=SideMap;f=xe_sitemap

APIs:

PMI-2

/opt/cray/pmi/default/include

GNI

/opt/cray/gni-headers/default/include/gni_pub.h

==Background==

The Gemini interconnect was introduced with the Cray XE6 system.  It
may be thought of as an evolutionary development of the Seastar
interconnect used on the Red Storm system and the Cray XT series.
Physically Gemini is a 3D torus system using network interfaces that
connect directly to the Hyperchannel IO bus of AMD processors.

Logically Gemini provides short messages, remote memory access, remote
atomic operations, and remote block transfers. Memory must be
registered with GNI before it can be used for communications.  Job
launch services are provided by the Cray Application Level Placement
Scheduler (ALPS), with API provided through implementations of the PMI
(process management interface) API.

During initialization, a parallel application creates a Communications
Domain, which is basically a binding to a particular network interface.
Data necessary to get permit ranks of a parallel application to
communicate are provided via environment variables and PMI interfaces.

In gemini, the address information for a remote rank is stored in a
software structure called an Endpoint.  Generally a separate endpoint
is needed for each rank of the job.  Each endpoint is associated with
a Completion Queue (CQ) which signals when locally originated
operations to a particular endpoint have finished.  Cqs may be used by
multiple endpoints.

NOTE: The next paragraph is no longer accurate - we no longer use SMSG

To use the short message system (SMSG), the application allocates
buffer space, and assigns a completion queue to be used to announce
the arrival of messages from remote ranks.  Short message system
messages can be of variable length, but each message uses a fixed
length message buffer from a pool.  The shared message queue (MSGQ)
system is similar, except that resources are shared at the node level,
rather than each rank having separate resources. The shared message
queue system has better scalability, scaling as O( (N/24)**2 ) rather
than O( N**2 ). However, MSGQ has a maximum message length of 128,
while the limit for SMSG is large enough to handle a complete GASNet
Medium Active Message.  In ether SMSG or MSGQ, the buffer memory must
be registered using the GNI API.

The Gemini interconnect provides two flavors of RDMA.  FMA, or Fast
Memory Access, provides remote atomic operations, and remote get and
put. These operations are commanded by user-mode applications directly
to the hardware, (so called OS Bypass).  For larger messages, Cray
provides an RDMA facility based on a hardware Byte Transfer Engine,
but its use is mediated by the kernel mode device driver.  It isn't
clear when one should use RDMA rather than FMA, but a few hundred
bytes to a few thousand bytes is likely.

==Overview==

The gemini conduit creates an endpoint for every rank, and then binds
each endpoint to a particular rank using the address information
exchanged through PMI.  Each endpoint in gemini is also associated
with a common completion queue, which is a structure used for
announcing the completion of locally originated messages.

Next the gemini conduit allocates and registers storage for buffers
for the gemini short message system.  The setup allocates buffers
large enough to hold a complete AMMedium message.  Arriving short
messages are signaled on a second completion queue.  Using defaults,
about 32K per rank are reserved on each rank.

==Active Message design==

The gemini conduit supports PSHM for same-node communications. This
isn't strictly necessary when using the Cray Short Message System, but
it would be necessary when using the Cray Shared Message Queue system
because MSGQ cannot send to the same node.

Flow control and deadlock avoidance is done by a fixed credit
system. The environment variable GASNET_NETWORKDEPTH is the number
of AM buffers for each peer.  The default value is 12.  Each peer
has an am_credit variable. When a rank wishes to send an active
message, it atomic decrements the peer's credit counter and calls the
gemini conduit's Poll routine until it succeeds.  At the receiver,
every AM Request generates a "hidden" credit return message if and
only if the handler did not send a Reply.

NOTE: The Gemini short message system has some sort of internal credit
accounting, but we don't rely on it because of lack of features
(allocating space for replies) and lack of visibility (the only way to
find out if there is credit is to try to send a message).  The
gemini-conduit credit design is such that the underlying GNI_SmsgSend
should never fail.

Short and Medium active messages take one message each, so there is no
need for a reassembly queue for arriving fragments.

Long active messages are more complex.  If the data of a long AM will
fit in a maximum sized message, then it is sent as immediate data
and copied at the destination.  If the data is too large for the AM
packet, then it is sent by RDMA, and when the RDMA completion event
occurs, the Active message is sent.  For blocking AMs, the originator
polls for completion of the RDMA by calling gasnetc_poll_local_queue, which
checks for RDMA completions but does NOT check for arriving active
messages.  In the case of a non-blocking long active message, the
active message packet is constructed in memory attached to the RDMA
post descriptor, and when the RDMA completion event occurs, the AM is
sent by the completion routine.

==RDMA==

Within the Gemini API (GNI), RDMA is commanded by creating a Post
Descriptor and passing it to GNI. Both the source and destination
memory of an RDMA transfer must be registered.  The knowledge of RDMA
is encapsulated in the functions named gasnetc_rdma_{put,get}*().

GASNet requires that the remote address of a put or get will be within
the GASNet memory segment, which is pre-registered by gemini-conduit.
GASNet does not require that the local address of a put or get be
registered.  There are a number of cases.

As mentioned above, Gemini provides two kinds of RDMA: FMA and BTE
(called "RDMA").  FMA is Fast Memory Access and is used for short
transfers and is commanded entirely in user mode.  The Byte Transfer
Engine is intended for larger transfers and is kernel mediated.

Gemini-conduit allocates a pool of RDMA Post Descriptors, called
gasnetc_post_descriptor.  These are kept on a LIFO queue in registered
memory, because each one contains an "immediate" bounce buffer to
be used in the event the local address of a short xfer is not
registered.

Gemini-conduit also allocates a pool of medium sized bounce buffers,
which are used in the event the local address of a transfer is not
registered and is too large for the immediate buffer.  For transfers
to or from memory outside the GASNet segment that are too large for
the medium sized bounce buffer, the local buffer is registered for
the duration of the transfer, then unregistered.

The post descriptor also contains a completion field, that defines
what to do when the RDMA is finished, and a set of flags that control
completion behavior.

RDMA Completion event processing is done by gasnetc_poll_local_queue.
A number of types of follow-on activities are implemented:

 * For a GET, potentially copy data from a bounce buffer to the actual
   target

 * Potentially De-register specially registered memory

 * Potentially free a dedicated bounce buffer

 * Potentially send an Active Message that was waiting for the RDMA
   completion

 * Perform completion processing (signal other software waiting for
   RDMA completion


The GASNet Core API only uses RDMA PUT. The extended API uses both GET
and PUT.  The logic for GET is similar to that for PUT.

==Extended API==

Gemini conduit uses the reference implementation of gasnet_op_t for
completions.

Due to occasional failures of memory registration of large blocks, if
an RDMA transfer is larger than one megabyte, it is broken into
smaller segments and all but the last are sent by blocking protocol.

NOTE: The breakup of large blocks should probably be done for Long AMs
as well, to allow increasing gasnet_AMMaxLong{Request,Reply}().

==Locking==

In order to support GASNET_PAR mode, a variety of locks are used.

GASNETC_LOCK_GNI, GASNETC_UNLOCK_GNI

The Cray gni API is not thread-safe, so the gemini conduit uses a
single big lock around calls to gni.  Certain cases might be thread
safe, but without knowledge of the internals of gni it is hard to tell.

Queue Locks

A singly linked lists are used for the smsg_work_queue, and has its
own lock.  The are enqueue and dequeue functions that work on this
queue.

The bounce buffer pool and post descriptor pool use the
gasneti_lifo_push and gasneti_lifo_pop functions.
These are lock-free on the x86-64 architecture.


==Event Processing==

The gemini conduit uses two gni completion queues: bound_cq_handle for
completions of locally originated RDMA, and smsg_cq_handle, for
arriving short messages.

===bound_cq_handle  (RDMA completion)===

On a call to gasnetc_poll_local_queue, up to gasnetc_poll_burst (default 10)
completions are processed.  If the completion queue is empty, the
function exits early.  For each completed post descriptor, the flags
are checked and post-completion activities performed:

 GC_POST_COPY - copy from bounce buffer to target
 GC_POST_SEND - send an AM that was waiting for PUT completion (AMLong)
 GC_POST_UNREGISTER - deregister one-time memory registration
 GC_POST_UNBOUNCE - free an allocated bounce buffer
 GC_POST_COMPLETION_FLAG - atomic set a variable to signal a waiting thread
 GC_POST_COMPLETION_OP - gasnete_op_markdone 

===smsg_cq_handle===

NOTE: This section no longer accurate - we no longer use SMSG

When a short message arrives, an event is sent to the completion queue
associated with the endpoint of the message source.  The receiver
looks at the event to figure out which endpoint signaled, then calls
GNI_SmsgGetNext(endpoint) to get the message.  However, events are
generated whenever a message arrives, but they may be out of order,
and SmsgGetNext only delivers messages in-order. Consequently, when
you get an event, there may appear to be no message waiting, or there
may be several messages waiting.  It is necessary to keep calling
SmsgGetNext until you get all the waiting messages. At the same time,
it is necessary to read the completion queue at least once for every
message received or the completion queue will fill up.  If the
completion queue overflows, some events may be lost and it will be
necessary to drain all endpoints to make sure no messages are left
waiting.

The work starts in gasnetc_poll_smsg_queue.

First a call is make to gasnetc_poll_smsg_completion_queue, which reads up
to AM_BURST events from the completion queue and quickly adds the
relevant endpoints to a temporary array.  Then, after releasing the
GNI big lock, the array is scanned, and any peers which are not
already on the smsg_work_queue are added to it.  The smsg_work_queue
is a list of endpoints which may have arriving traffic.  Finally,
gasnetc_poll_smsg_completion_queue checks to see if the smsg completion
queue overflowed, in which case the completion queue is drained and
ALL endpoints are added to the smsg_work_queue

Then for up to SMSG_PEER_BURST ranks, gasnetc_process_smsg_q(source rank)
is called to deal with any arriving messages.

In gasnetc_process_smsg_q, arriving short messages are read and a dispatch
is made on the message type.  In the case of active messages, the
message is copied into a local variable, the SMSG buffer is released,
and then the AM handler is called.  This is done in order to prevent
stack growth caused by recursive calls to process arriving
messages. Each time around the message processing loop,
gasnetc_poll_smsg_completion_queue is called, which may result in
additional ranks being queued on the smsg_work_queue, but does not
cause recursive calls to gasnetc_process_smsg_q.  This is done so that the
average rate of retiring completion events is equal to the rate of
processing messages, and prevents completion queue overflow.

==Exit protocol==

NOTE: This section no longer accurate - we do more than is described here.

The exit algorithm is borrowed from the portals-conduit.  The Cray
runtime system ALPS (Application Layout and Placement System) is
strict about exiting without calling PMI_Finalize. It will interpret
that as an error exit and kill all the other ranks.


==Initialization==

The logic to determine the maximum size pinnable segment is borrowed
from the portals-conduit.

The gasnetc_bootstrap* functions containsthe interface to PMI.  It also includes
the implementation of AllGather for GASNet bootstrapExchange and GNI
initialization.  It uses PMI_AllGather.  For some reason,
PMI_AllGather does not return data in rank order, so it is sorted into
the usual AllGather order internally.

gasnet_core.c routine gasnetc_init also contains code to determine
which ranks have shared memory. This is done using the
PMI_Get_pes_on_smp() call, translated into the right format for
gasnet_nodemap[].

The gasnet auxseg machinery is used to allocate registered memory for
bounce buffers and for the rdma post descriptor pool. This is done to
reduce the number of separately registered segments, because
registrations are a scarce resource.


== Build notes ==