README file for ibv-conduit
===========================

Paul H. Hargrove

@ TOC: @
@ Section: Overview @
@ Section: Where this conduit runs @
@ Section: Job Spawning @
@ Section: Multi-rail Support @
@ Section: Build-time Configuration @
@ Section: Runtime Configuration @
@ Section: HCA Configuration @
@ Section: Platform-specific Notes @
@ Section: Core API @
@ Section: Extended API @
@ Section: Graceful exits @
@ Section: References @

@ Section: Overview @

ibv-conduit implements GASNet over InfiniBand networks using the Open
Fabrics Verbs API (www.openfabrics.org).  The name "ibv" comes from a
time when this API was known as "InfiniBand Verbs" (its library is
still named libibverbs).

@ Section: Where this conduit runs @

ibv-conduit runs over networks using the Open Fabrics Verbs API, such
as InfiniBand.  Platforms known to support this API include Linux,
Solaris, AIX, Windows and FreeBSD.  However, at this time only Linux
and Solaris have been confirmed to run GASNet's ibv-conduit.  While
Open Fabrics Verbs also covers iWARP, ibv-conduit does not currently
support it.  While ibv-conduit may work on Intel Omni-Path (OPA) HCAs,
we instead recommend use of psm-conduit on such systems.

Nearly all InfiniBand cards (known as Host Channel Adapters, or HCAs)
have support for Open Fabrics Verbs, available from www.openfabrics.org
or from the HCA or OS vendor.  We have tested numerous InfiniBand HCAs
from Mellanox, InfiniPath HCAs from Pathscale/QLogic, and TrueScale
HCAs from Intel.

In addition to the Open Fabrics Verbs, vendor-specific APIs also exist
and some have GASNet conduits:
  + mxm-conduit for the MXM API from Mellanox
  + ofi-conduit for the PSM (version 1) API on Intel TrueScale HCAs
  + ofi- or psm-conduit for the PSM2 API on Intel Omni-Path HCAs
The performance of ibv-conduit relative to others on the same hardware
will depend on your application's communication patterns, the number of
nodes you run on, and other parameters.  Therefore, we do not make any
specific recommendation as to which to use.
We recommend that you benchmark your own workload if you are concerned
with the very best possible performance.

@ Section: Job Spawning @

If MPI support was NOT enabled when GASNet was configured, then only
SSH-based spawning will be supported and the following paragraph may be
ignored.

If MPI support was enabled when GASNet was configured, then there are
two options for spawning a GASNet ibv-conduit application: MPI or SSH.
The default can be set at configure time with --with-ibv-spawner=ssh,
--with-ibv-spawner=mpi or --with-ibv-spawner=pmi, where mpi is the
default.

Additionally, if using UPC or Titanium, the language-specific commands
should be used to launch applications.  Otherwise, applications can be
launched using the gasnetrun_ibv utility:
  + usage summary:
      gasnetrun_ibv -n <num> [options] [--] prog [program args]
    options:
      -n            number of processes to run (required)
      -N            number of nodes to run on (not supported by all MPIs)
      -E            list of environment vars to propagate
      -v            be verbose about what is happening
      -t            test only, don't execute anything (implies -v)
      -k            keep any temporary files created (implies -v)
      -spawner=(ssh|mpi)  force use of MPI or SSH for spawning
                          (if available)

At runtime the environment variable GASNET_IBV_SPAWNER (set to "mpi" or
"ssh") can override the value set at configure time, but the command
line option will override the environment variable.

If configured for PMI as the default, then you should use your
PMI-based launcher, such as srun or yod, to launch ibv-conduit
applications directly.

If spawning using MPI, then the following apply:
  + In order to bootstrap ibv-conduit, a working MPI must be installed
    and configured on your system.  See mpi-conduit/README for
    information on configuring GASNet for a particular MPI.  Note that
    you must compile mpi-conduit as well (even if you never plan to use
    it).
  + MPI is only used in gasnet_init(), gasnet_attach() and
    gasnet_exit(), and not for any GASNet calls between attach and
    exit.
    Therefore it is acceptable to use a TCP/IP-based MPI such as MPICH
    or LAM/MPI.
  + The environment variable MPIRUN_CMD can be used to configure how to
    invoke mpirun.  See mpi-conduit/README (or README-mpi) for more
    information.
  + Since an InfiniBand-based MPI typically allocates a non-trivial
    amount of memory for InfiniBand communication buffers, it may be
    desirable in memory-constrained situations to use a non-IB MPI, or
    to disable the use of IB by a multi-transport MPI.  The
    documentation for your chosen MPI is the authoritative source for
    information, but here are the settings required to force use of
    TCP/IP in several widely-used MPIs:
    - LAM/MPI
        Pass "-ssi rpi tcp" to mpirun, OR
        Set environment variable LAM_MPI_SSI_rpi to "tcp".
    - Open MPI
        Pass "--mca btl tcp,self" to mpirun, OR
        Set environment variable OMPI_MCA_btl to "tcp,self".
    - Intel MPI
        Set environment variable I_MPI_DEVICE to "sock".
    - HP-MPI
        Set environment variable MPI_IC_ORDER to "tcp".
    - MPICH or MPICH2
        These don't use IB, so no special action is required.
    - MVAPICH or MVAPICH2
        These don't use TCP/IP unless explicitly configured for
        IP-over-IB.  So, these are not recommended if one is concerned
        with memory use.
    In all of the settings above, omit the quotes from the value.

If spawning using SSH, the following apply:
  + The -E option is not necessary, as the full environment is always
    propagated to the application processes.
  + A list of hosts is specified using one of the GASNET_SSH_NODEFILE,
    GASNET_SSH_SERVERS, or GASNET_NODEFILE environment variables (in
    order from highest precedence to lowest).  If set, the *_NODEFILE
    variables specify a file with one hostname per line.  Blank lines
    and comment lines (using '#') are ignored.  If set, the variable
    GASNET_SSH_SERVERS itself contains a list of hostnames, delimited
    by commas or whitespace.
    The following environment variables set by supported batch systems
    are also recognized if the GASNET_* variables are not set:
      PBS:   PBS_NODEFILE
      LSF:   LSB_HOSTS
      SGE:   PE_HOSTFILE
      SLURM: Use `scontrol show hostname` if SLURM_JOB_ID is set
  + The environment variable GASNET_SSH_CMD can be set to specify a
    specific remote shell (perhaps rsh).  The default is "ssh", and a
    search of $PATH resolves the full path.
  + The environment variable GASNET_SSH_OPTIONS can be set to specify
    options that will precede the hostname in the commands used to
    spawn jobs.  One example, for OpenSSH, would be
      GASNET_SSH_OPTIONS="-o 'StrictHostKeyChecking no'"
  + For the following, the term "compute node" means one of the hosts
    given in GASNET_SSH_NODEFILE (or another environment variable
    described above) which will run an application process.  The term
    "master node" means the node from which the job was spawned.  The
    master node may be one of the compute nodes, but is not required to
    be.
  + The ssh (or rsh) at your site must be configured to allow logins
    from the master node to the compute nodes, and among the compute
    nodes.  These logins must work without interaction (such as
    entering a password or accepting new host keys).
  + Any firewall or port filtering must allow the ssh/rsh connections
    described above, plus TCP connections on untrusted ports (those
    with numbers over 1024) from a compute node to the master node and
    among compute nodes.
  + Resolution of all given hostnames must be possible from both the
    master node and the compute nodes.

It has been noted that some InfiniBand driver implementations may not
allow multiple open()s of the adapter.  In this case, spawning via MPI
is not possible because the MPI and GASNet implementations cannot share
the adapter.  If your GASNet jobs fail to spawn via MPI, but spawn
correctly with ssh, this may be the reason.  Our recommended response
to this situation is to completely disable MPI support in GASNet by
configuring with --without-mpi-cc.
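The host-list formats described above can be sketched as follows.  This
is an illustrative Python sketch of the documented formats only, not
GASNet code, and the function names are hypothetical:

```python
import re

def parse_nodefile(text):
    """GASNET_SSH_NODEFILE / GASNET_NODEFILE format: one hostname per
    line; blank lines and '#' comment lines are ignored."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            hosts.append(line)
    return hosts

def parse_servers(value):
    """GASNET_SSH_SERVERS format: hostnames delimited by commas or
    whitespace."""
    return [h for h in re.split(r'[,\s]+', value) if h]
```

For example, a nodefile containing "n1", a comment line and "n2" yields
the host list ["n1", "n2"], and GASNET_SSH_SERVERS="n1, n2 n3" yields
["n1", "n2", "n3"].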
@ Section: Multi-rail Support @

Multi-rail support is ON in GASNet ibv-conduit by default: ibv-conduit
will open up to two InfiniBand Host Channel Adapters (HCAs) per node,
and will stripe communications over one active port on each adapter.
See the sections "Build-time Configuration" and "Runtime Configuration"
for information on how to open more or fewer HCAs/ports, or to control
which HCAs/ports are used.

To first order, the use of multiple ports or multiple adapters will
yield increases in both bandwidth (good) and software overhead (bad).
How the resulting trade-off works out for a given application may be
hard to predict.  If one is concerned with obtaining the maximum
possible performance for a given application, then experiment with the
GASNET_NUM_QPS and/or GASNET_IBV_PORTS environment variables
(documented in "Runtime Configuration") to determine how a given
application runs best.

@ Section: Build-time Configuration @

ibv-conduit ensures good network attentiveness (timely processing of
incoming AMs) by spawning an extra thread that remains blocked until
the arrival of an Active Message.  One can disable this thread by
configuring GASNet with the flag '--disable-ibv-rcv-thread'.  It is
recommended that one NOT use this option, but instead disable the
thread at runtime (see the Runtime Configuration section).  If the
extra thread will never be needed, disabling it at build time will
yield a small reduction in latencies by allowing some locking
operations to compile away.

By default, ibv-conduit will open at most two Host Channel Adapters
(HCAs) on a node.  To utilize more than two HCAs in a host, specify
'--with-ibv-max-hcas=N' at configure time.  However, if you have only a
single HCA per host, then you may be able to get a small performance
improvement by disabling multi-rail support with
'--disable-ibv-multirail' at configure time.
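When more queue pairs are requested than there are usable ports (see
GASNET_NUM_QPS under "Runtime Configuration"), QPs are mapped
round-robin to the ports.  A sketch of that mapping, for illustration
only (not GASNet code; the function name and data layout are
hypothetical):

```python
def map_qps_to_ports(num_qps, ports):
    """Round-robin assignment: QP i is placed on ports[i % len(ports)].
    Here 'ports' is a list of (hca_id, port_number) pairs."""
    return [ports[i % len(ports)] for i in range(num_qps)]
```

With two ports and four QPs, each port carries two QPs, alternating.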
When using dynamic connections (see the GASNET_CONNECT_DYNAMIC
environment variable, below) there is an extra thread spawned to block
for the arrival of connection requests.  If needed, this can be
disabled at configure time using '--disable-ibv-conn-thread'.

@ Section: Runtime Configuration @

There are a number of parameters in ibv-conduit which can be tuned at
runtime via environment variables.

General settings:

ibv-conduit supports all of the standard GASNet environment variables
and the optional GASNET_EXITTIMEOUT and GASNET_THREAD_STACK families of
environment variables.  See GASNet's top-level README for
documentation.

+ GASNET_BARRIER
  In addition to the barrier algorithms in the top-level README, there
  is an implementation specific to IBV:
    IBDISSEM - like RDMADISSEM, but implemented using lower-level
               operations for lower latency.
  Currently IBDISSEM is the default on IBV.

Connection settings:

Under normal conditions, Host Channel Adapters and Ports will be
located and configured automatically.  However, in the event you have
multiple adapters or multiple active ports on a single adapter, you may
wish to set environment variables to identify the correct HCAs and
Ports.  Or, you may wish to use non-default values for configuring
connections.  These parameters may legally take different values on
each node.

See "Build-time Configuration", above, for information on enabling use
of multiple HCAs in GASNet ibv-conduit.

+ GASNET_HCA_ID
+ GASNET_PORT_NUM
  ** UNSUPPORTED **
  These environment variables, used in older releases, are no longer
  supported.  Setting them to anything but the empty string will result
  in a run-time warning.

+ GASNET_NUM_QPS
  This variable gives the number of IB Queue Pairs (QPs) over which to
  stripe traffic between each pair of peers.  This can yield an
  increase in throughput and bandwidth when multiple physical ports are
  used on one or more adapters.
  If the number of QPs exceeds the number of available physical ports,
  then multiple QPs will be mapped round-robin to the ports.  Be aware
  that mapping multiple QPs per port may yield either a performance
  improvement or a degradation, depending on the traffic pattern.
  The default is 0, which means one QP per HCA/port used.

+ GASNET_IBV_PORTS
  By default, GASNet will open and use one active IB port on each HCA
  used, which will be all HCAs (when GASNET_NUM_QPS is zero), or the
  first GASNET_NUM_QPS HCAs found (when GASNET_NUM_QPS is non-zero).
  Setting GASNET_IBV_PORTS will specify a filter for which ports will
  be used.  This can be used, for instance, to cause multiple physical
  ports to be used per HCA, or to specify specific ports and/or HCAs to
  be considered (up to GASNET_NUM_QPS if it is non-zero).
  This variable is a string of one or more HCA/port specifications,
  separated by '+' characters.  Each such specification gives an HCA
  identifier and an optional comma-separated list of port numbers.  The
  list of port numbers, if provided, is separated from the HCA id by a
  ':'.  If a list of ports is given, only those ports may be used.
  Otherwise the first active port on the given HCA may be used.  The
  following example allows the first active port on HCA mlx4_0, and
  only port 2 on mlx4_1:
    GASNET_IBV_PORTS="mlx4_0+mlx4_1:2"
  Note that this list is a *filter*, which means:
  + Duplicate entries do not cause multiple opens of a port or HCA
  + Entries describing non-existent HCAs are silently ignored
  + Entries describing inactive ports are silently ignored
  + Order is not significant.  In particular, if GASNET_NUM_QPS is less
    than the number of entries in GASNET_IBV_PORTS, ports are opened in
    the order detected, regardless of their order in GASNET_IBV_PORTS.
  Note that in most IBV distributions the 'ibv_devinfo' utility will
  list the available HCAs and the status of their ports.
  The default is no filter.

+ GASNET_QP_TIMEOUT
  This sets the timeout value used to configure InfiniBand QueuePairs.
  The IB specification uses (4.096us * 2^qp_timeout) as the length of
  time an HCA waits to receive an ACK from its peer before attempting
  retransmission.
  The default is currently 18 (roughly 1 second).

+ GASNET_QP_RETRY_COUNT
  This sets the maximum number of retransmissions due to ACK timeout
  before the HCA signals a fatal error.
  The default is currently 7 (the max supported by early Mellanox
  HCAs).

+ GASNET_QP_RD_ATOM
  This sets the number of per-connection resources allocated by the HCA
  for responding to RDMA Reads (and atomics, which GASNet does not
  currently use).  Lower values use slightly less memory, but may
  reduce the throughput of Get-intensive communication patterns.
  The default value is '0', which means to use the maximum supported
  value reported by the HCA.  Other valid settings are typically in the
  range from 1 to 4.

+ GASNET_MAX_MTU
  This sets the maximum MTU to be used, and has the following valid
  values: -1, 0, 256, 512, 1024, 2048 or 4096.  If the value is 0,
  GASNet will automatically select the MTU size.  If the value is -1,
  GASNet will use the HCA port's active value.  Otherwise the lesser of
  this setting or the port's active value will be used.
  The default is 0: automatic MTU selection.

+ GASNET_CONNECT_DYNAMIC
  This boolean setting determines if connections can be established on
  demand.  The default value is TRUE.
  When GASNET_CONNECT_DYNAMIC is enabled, a node will connect on demand
  to any peer not previously connected at startup.  However, if a node
  is fully connected to all peers at startup, then dynamic connections
  are automatically disabled on that node.  Therefore, unless
  GASNET_CONNECT_STATIC or GASNET_CONNECTFILE_IN is set to a
  non-default value, this variable has no effect.

+ GASNET_CONNECT_STATIC
  This setting determines if connections are established at startup.
  When GASNET_CONNECT_STATIC is enabled, a node will connect at startup
  to all peers indicated by the GASNET_CONNECTFILE_IN setting (see
  below), or to ALL peers if that variable is unset.
  The value is a boolean with a default of TRUE.

+ GASNET_CONNECTFILE_IN
  This setting provides a filename used to limit the connections
  established at startup, and is ignored if GASNET_CONNECT_STATIC is
  FALSE.  Any '%' character in the value is replaced with the node
  number to allow (but not require) separate per-node files.
  The format of a connect file is a series of lines of the form
    node: peer1 peer2 ...
  without leading whitespace.  For example, to request that node 7
  connect to nodes 0, 4 and 6:
    7: 0 4 6
  Line lengths are not limited, but the same node number may appear to
  the left of the colon on multiple lines, to limit line lengths.  So,
  the following is equivalent to the previous example:
    7:0 4
    7:6
  Ranges are supported.  So, the following connects node 6 with nodes
  9, 10, 11 and 12:
    6:9-12
  Order is not significant (except in ranges), so neither lines nor
  peer numbers need to be sorted.  Connections are bidirectional, so
  the following:
    1:0
    0:1
  describes only one connection between nodes 0 and 1, and only one of
  these two lines is required to establish it (though there is no error
  in specifying both).  This is true regardless of whether using a
  single file or per-node files.
  An optional line
    size: N
  indicates the number of nodes in the job, and if present is validated
  against the size of the current job.
  An optional line
    base: N
  specifies a numeric base for interpretation of all node numbers on
  lines that follow.  The default is 10 (decimal), and legal values
  range from 2 (binary) to 36 (uses digits '0'-'9' and 'a'-'z').  If
  present, the 'base' line only affects node numbers read from later
  lines, and therefore should appear at the start of the file.  Values
  on the 'size' and 'base' lines are always read as decimal.
  The default is unset/empty (no limit on which nodes are connected at
  startup).

+ GASNET_CONNECTFILE_OUT
  This setting specifies a filename in which to generate connection
  information suitable for later use as GASNET_CONNECTFILE_IN.
  Any '%' character in the value is replaced with the node number to
  allow separate per-node files.  Use of per-node files is strongly
  recommended, and on some file systems (notably NFS) is REQUIRED for
  correct operation.  If desired, the separate files may be
  concatenated together after the run completes, to produce a single
  file suitable for use as GASNET_CONNECTFILE_IN.  Alternatively, the
  following perl one-liner will concatenate the files while removing
  all but the first instance of the 'base' and 'size' lines:
    perl -ne 'print unless (/(base|size)/ && $X{$_}++);' -- [FILES]
  where [FILES] denotes the list of per-node connection files and the
  combined file is generated on stdout.
  The connection information produced in the output file(s) lists only
  those connections actually used in the current run.  Therefore a
  common use case is to set GASNET_CONNECTFILE_OUT on a fully-connected
  run, and then use the generated file(s) to limit static connections
  in subsequent runs.
  The default is to use base-36 for node numbers, which results in more
  compact files but is difficult for a human to read.  See
  GASNET_CONNECTFILE_BASE, below, for how to change this.
  The default is unset/empty (no output files are generated).

+ GASNET_CONNECTFILE_BASE
  This setting controls the numeric base used for node numbers in
  GASNET_CONNECTFILE_OUT files.  Valid values range from 2 (binary) to
  36 (uses digits '0'-'9' and 'a'-'z').  The value of the setting is
  always parsed as base-10.
  The default value is 36.

+ GASNET_CONNECT_SNDS
+ GASNET_CONNECT_RCVS
  These two settings control the number of small buffers allocated to
  send and to receive dynamic connection requests, and are ignored if
  GASNET_CONNECT_DYNAMIC is FALSE, or on any node that is already fully
  connected at startup.
  Because the buffers are small and allocation is page granular, there
  is seldom any benefit to reducing the default values.
  However, there are conditions under which increasing one or both may
  help reduce the latency of dynamic connections:
  + Dynamic connection setup is blocking at the initiator, but if using
    pthreads it is possible that one node may have dynamic connection
    requests in-progress to multiple nodes.  So, if an application is
    highly threaded, it may be beneficial to increase
    GASNET_CONNECT_SNDS for greater concurrency of sends.
  + If a given node receives many simultaneous connection requests, any
    requests in excess of the allocated buffers will be dropped, and
    the connection will be delayed until the requester retransmits.
    So, the average connection setup time in the presence of "bursty"
    requests may be reduced by increasing GASNET_CONNECT_RCVS.
  The default value of GASNET_CONNECT_SNDS is 4.
  The default value of GASNET_CONNECT_RCVS is
    MAX(6, 4 + 2*ceil(log_2(N_remote)))
  where "ceil()" denotes rounding up to an integer, "log_2()" is the
  base-2 logarithm and "N_remote" is the number of GASNet nodes minus
  "self" and any nodes reachable through shared memory (PSHM).

+ GASNET_CONNECT_RETRANS_MIN
+ GASNET_CONNECT_RETRANS_MAX
  These two settings control the minimum and maximum intervals between
  retransmission of messages used in establishing dynamic connections,
  and are ignored if GASNET_CONNECT_DYNAMIC is FALSE, or on any node
  that is already fully connected at startup.  Values are in units of
  microseconds (10^-6 sec).
  The value of GASNET_CONNECT_RETRANS_MIN is the interval between
  sending an initial request and the first retransmission.  Each
  retransmission doubles the interval before the next, up to the
  maximum value given by GASNET_CONNECT_RETRANS_MAX, after which the
  connection setup fails.
  Adjustment of these settings may help resolve timeouts on networks
  with high rates of UD packet loss.  However, this is not recommended
  without consulting the author, and the defaults are therefore not
  documented here.
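The connect-file format accepted by GASNET_CONNECTFILE_IN (described
above) can be sketched with a small parser.  This is an illustrative
Python sketch of the documented format only, not GASNet's actual
parser, and the function name is hypothetical:

```python
def parse_connect_file(text):
    """Return the set of unordered (lo, hi) node pairs named by a
    connect file.  Connections are bidirectional, so 1:0 and 0:1
    collapse to the single pair (0, 1)."""
    base = 10      # a 'base:' line changes this for later lines
    pairs = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, rest = line.partition(':')
        key = key.strip()
        if key == 'size':           # job size; always read as decimal
            continue                # (real code validates it)
        if key == 'base':           # numeric base; always read as decimal
            base = int(rest, 10)
            continue
        node = int(key, base)
        for tok in rest.split():
            if '-' in tok:          # a range such as 9-12
                lo, hi = (int(t, base) for t in tok.split('-'))
                peers = range(lo, hi + 1)
            else:
                peers = [int(tok, base)]
            for p in peers:
                pairs.add((min(node, p), max(node, p)))
    return pairs
```

For example, "7: 0 4 6" yields the pairs {(0,7), (4,7), (6,7)}, and a
file containing both "1:0" and "0:1" yields only the single pair (0,1).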
Software configuration settings:

There are some optional behaviors in ibv-conduit that can be turned ON
or OFF.  These parameters may legally take different values on each
node, but doing so may not be useful.

+ GASNET_RCV_THREAD
  This gives a boolean: "0" to disable, or "1" to enable, the use of an
  extra thread that blocks waiting for an Active Message Request or
  Reply to arrive.  This allows ibv-conduit to remain attentive to
  incoming AM traffic even while the application is not making any
  calls to GASNet.  The downside is that when this thread wakes, it
  must contend for CPU resources and for locks.  Therefore, for an
  application that is calling GASNet sufficiently often, use of this
  thread may significantly INCREASE running time.  However, on an SMP
  where an otherwise idle processor is available, the use of this
  thread can REDUCE running time by relieving the application thread of
  the burden of servicing incoming AM Requests and Replies.
  Note that if '--disable-ibv-rcv-thread' was specified at build time,
  then the extra thread is unavailable and this environment variable is
  ignored.
  Currently the default is disabled (0), but this is subject to change.
  NOTE: In releases prior to GASNet 1.18.2 the AM receive thread was
  unavailable for ibv-conduit, but that is no longer the case.

+ GASNET_RCV_THREAD_RATE
  If GASNET_RCV_THREAD is enabled, then this setting can be used to
  impose a limit on how frequently the AM receive thread may wake.
  This may be used to limit interference between the AM receive thread
  and the main application thread(s), while providing some network
  attentiveness when the application is not making GASNet calls.
  A non-zero value gives the maximum rate in wake-ups per second.
  The default value is 0, which means no limit is imposed.
  NOTE: A future release may implement GASNET_RCV_THREAD_LOAD to impose
  a limit on the *fraction* of time the thread spends awake.
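The effect of GASNET_RCV_THREAD_RATE can be pictured as a simple
minimum-interval rate limiter: a maximum of R wake-ups per second means
at least 1/R seconds must elapse between wake-ups.  A hypothetical
illustration of that rule (not GASNet code):

```python
class WakeRateLimiter:
    """Allow at most 'rate' wake-ups per second; rate 0 means no limit
    (mirroring the GASNET_RCV_THREAD_RATE default)."""
    def __init__(self, rate):
        self.min_interval = 1.0 / rate if rate else 0.0
        self.last_wake = None

    def may_wake(self, now):
        # 'now' is a timestamp in seconds (e.g. from time.monotonic())
        if self.last_wake is None or now - self.last_wake >= self.min_interval:
            self.last_wake = now
            return True
        return False
```

With a rate of 10, a wake-up attempt 50 ms after the previous one is
suppressed, while one 100 ms later is allowed.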
Pinnable memory probe configuration:

In normal operation of ibv-conduit it is necessary to know how much
memory may be registered (aka pinned) with the InfiniBand HCA(s).  This
is limited by multiple factors and thus cannot be determined by a
simple query.  Therefore, the default behavior is to attempt to mmap
and register as much memory as possible at startup, and then release
all the memory.  When there are multiple GASNet processes on a shared
memory node, one representative process will perform this probe.

There are at least two well-known reasons why one may desire to limit
or eliminate this probe.  The first is the time spent performing the
probe.  The second is the possibility that the O/S or a batch execution
environment may terminate a process that exceeds some limit on the
virtual memory size of a process, and/or may terminate the process with
the largest size when memory is exhausted.  Use of the following
parameters allows one to bound, or to eliminate, this probe.

These parameters must be equal across all nodes, and the behavior
otherwise is undefined.

+ GASNET_PHYSMEM_MAX
  If non-zero, this parameter tells ibv-conduit the maximum amount of
  physical memory to pin.  The suffixes "M" and "G" are interpreted as
  Megabytes and Gigabytes respectively.
  The current default is zero, which means to probe the limits imposed
  by the O/S and HCA.  Setting a non-zero value will limit how large
  the GASNet segment can be, and how much memory is available for
  firehose (see below), but may speed startup by bounding the probe.
  Note, however, that setting only this variable bounds the probe, but
  does not eliminate it.
  Also be aware that with multiple processes per shared memory node,
  the value given by this variable will be divided by the number of
  processes per shared memory node to determine the memory available.

+ GASNET_PHYSMEM_NOPROBE
  This gives a boolean: "0" to disable or "1" to enable the use of
  GASNET_PHYSMEM_MAX without probing.
  If GASNET_PHYSMEM_MAX is zero or unset, this variable is ignored.
  Enabling this setting may greatly speed startup, but can lead to
  unexpected runtime failures if GASNET_PHYSMEM_MAX exceeds the limits
  imposed by the O/S and HCA.
  The default is OFF.

Protocol configuration:

The following environment variables control the selection of protocols
for performing certain transfers.

These parameters must be equal across all nodes, and the behavior
otherwise is undefined.

+ GASNET_INLINESEND_LIMIT
  IBV includes an "inline send" operation that transfers the data to
  the HCA at the same time it transfers the request.  This normally
  provides a measurable performance improvement, but is only available
  up to a hardware- and firmware-dependent maximum size.
  A value of 0 disables use of inline sends, and a value of -1 causes
  use of the maximum value reported by the HCA.
  For ibv-conduit the default is -1.

+ GASNET_PACKEDLONG_LIMIT
  To perform an AMLong or AMLongAsync with a non-empty payload,
  ibv-conduit must transfer both the payload and the header.  For
  sufficiently small payloads, it is more efficient (in terms of both
  CPU overhead and network latency) to pack the header and payload
  together, and copy the payload into place on the target before
  running the handler.  Thus, for payloads up to and including this
  size, this packing is used.
  The default value is the maximum that fits into a 4KB buffer together
  with the maximum sized header (currently 4012).  A value of zero
  ensures the payload and header always travel separately.

+ GASNET_NONBULKPUT_BOUNCE_LIMIT
  To perform a non-bulk PUT with nbytes > GASNET_INLINESEND_LIMIT, or
  to transfer the payload of an AMLong (but not AMLongAsync) with
  nbytes > MAX(GASNET_INLINESEND_LIMIT, GASNET_PACKEDLONG_LIMIT),
  ibv-conduit must either copy the data into bounce buffers, or block
  until remote completion is signaled by the HCA.
  Such transfers up to and including size GASNET_NONBULKPUT_BOUNCE_LIMIT
  are performed using bounce buffers, while larger transfers are
  transferred using blocking PUTs.
  The default value is 64KB.  A value of zero disables use of bounce
  buffers.

+ GASNET_PUTINMOVE_LIMIT  (only for GASNET_SEGMENT_{LARGE,EVERYTHING})
  When the firehose algorithm (see below) is in use for managing the
  pinning of remote memory, a PUT that misses in the firehose cache may
  be accelerated by piggybacking data on the AMMedium that is used to
  obtain a remote pinning.  The value of GASNET_PUTINMOVE_LIMIT is the
  maximum number of bytes to send in this way.  The value is bounded by
  the maximum value set at compile time, and it is an error to request
  a larger value.  Note that in a GASNET_SEGMENT_FAST configuration,
  the remote segment is pinned statically and this optimization is
  never applicable.
  The default value is 3KB (the current maximum value).  A value of
  zero disables this optimization.

+ GASNET_USE_SRQ
  This controls whether IBV Shared Receive Queue (SRQ) support is used,
  but is ignored if GASNet was configured with --disable-ibv-srq.
  This setting defaults to -1, which means that SRQ will be used only
  if doing so would reduce memory usage (as determined from the value
  of the GASNET_RBUF_COUNT setting, described below).  If set to a
  non-negative value, this setting gives the minimum GASNet node count
  at which SRQ will be used, regardless of whether memory usage would
  increase or decrease.  A value of zero will disable SRQ.
  Examples:
  - GASNET_USE_SRQ unset or explicitly set to -1:
      SRQ is used ONLY if GASNET_RBUF_COUNT is less than the number of
      receive buffers required for the non-SRQ case.
  - GASNET_USE_SRQ <= gasnet_nodes()  [includes GASNET_USE_SRQ == 1]:
      SRQ is used and GASNET_RBUF_COUNT is enforced as a maximum.
  - GASNET_USE_SRQ > gasnet_nodes():
      SRQ is NOT used and GASNET_RBUF_COUNT is ignored.
  Note that the interpretation of the values 0 and 1 allows one to use
  this setting as a simple boolean if desired.

+ GASNET_USE_XRC
  This controls whether IBV eXtended Reliable Connection (XRC) support
  is used.  However, it is ignored if GASNet was configured with
  --disable-ibv-xrc, if XRC support was not found at configure time, or
  if SRQ support is not used (regardless of why).
  This setting defaults to 1 if SRQ support was enabled at configure
  time.  As a result, XRC will be used any time SRQ is used.

Resource usage parameters:

The following environment variables control how much memory is
preallocated at startup time to serve various functions.  Because these
resource pools do not grow dynamically, it is important that these
parameters be sufficiently large, or performance degradation may
result.  The default settings should be sufficient for most conditions.
You may need to lower some values if you have insufficient memory.

These parameters must be equal across all nodes, and the behavior
otherwise is undefined.

+ GASNET_NETWORKDEPTH_PP
  This gives the maximum number of ops (RDMA + AMs) which can be
  in-flight simultaneously from a node to each of its peers.  Here
  "in-flight" means queued to the send work queue and not yet reaped
  from the send completion queue.  This value is the depth of each send
  work queue.
  This limit is on the number of ibv-level ops in-flight; the number of
  GASNet-level operations may be less (for example, when the remote
  range of a PUT or GET covers more than one pinned region, due to
  GASNET_PIN_MAXSZ, or because an AM Long uses separate ops for the
  payload and header).
  The default value is 24.  Reducing this parameter may limit small
  message throughput.
  If you believe your small message throughput is too low, you may try
  increasing this value.

+ GASNET_NETWORKDEPTH_TOTAL
  This gives the maximum number of ops (RDMA + AMs) which can be
  in-flight simultaneously from each node (with "in-flight" defined as
  in GASNET_NETWORKDEPTH_PP).  The depth of the send completion queue
  is min(GASNET_NETWORKDEPTH_TOTAL, GASNET_NETWORKDEPTH_PP*(N-1)).
  If set to zero, the value is set to the maximum usable value computed
  from GASNET_NETWORKDEPTH_PP and the HCA's reported capabilities.
  The default value is 255.  Reducing this parameter may limit small
  message throughput.  If you believe your small message throughput is
  too low, you may try increasing this value (or setting it to zero),
  at a cost in additional memory consumption.

+ GASNET_AM_CREDITS_PP
  This gives the maximum number of AM Requests which can be in-flight
  simultaneously from a node to each of its peers.  Here "in-flight"
  means the Request is queued to the send work queue, but the matching
  Reply has not yet been processed for AM flow control (described in
  another section of this README).  This is the number of buffers which
  must be preposted to each receive work queue for AM Requests.
  The default value is 12 (48KB*(N-1) allocated for Request buffers).
  Reducing this parameter may limit Active Message throughput.  If you
  believe your Active Message throughput is too low, you may try
  increasing this value.

+ GASNET_AM_CREDITS_TOTAL
  This gives the integer number of AM Requests which can be in-flight
  simultaneously from each node, with "in-flight" defined as in
  GASNET_AM_CREDITS_PP.  This is the number of receive buffers which
  will be allocated for posting to an endpoint for the AM Reply which
  follows each AM Request.
  If set to zero, the value is set to the maximum usable value computed
  from GASNET_AM_CREDITS_PP and the HCA's reported capabilities.
  The default value is MIN(256, nodes*GASNET_AM_CREDITS_PP).
  Reducing this parameter may limit Active Message throughput.
  If you believe your Active Message throughput is too low, you may try
  increasing this value (or setting it to zero), at a cost in additional
  pinned memory.

+ GASNET_AM_CREDITS_SLACK
  This gives the maximum number of flow-control credits that can be
  delayed at the responder. If a Request handler does not produce a
  Reply, a credit may be "banked" to be piggy-backed on the next Request
  or Reply headed to the requesting node. The value of
  GASNET_AM_CREDITS_SLACK gives the maximum number of credits that can be
  banked before a hidden Reply is generated to convey credits back to the
  requester. The default value is 1. GASNET_AM_CREDITS_SLACK will be
  silently reduced if needed to ensure deadlock will not occur, and is
  ignored when SRQ is used. Reducing this parameter to zero or setting it
  too high may increase the latency of Active Message traffic.

+ GASNET_RBUF_COUNT
  If SRQ support is unavailable or disabled, this parameter is ignored.
  See the GASNET_USE_SRQ documentation for details of when SRQ is
  enabled. When SRQ is enabled this gives the maximum number of AM
  receive buffers allocated on each node. These buffers are needed for
  reception of AM headers and the payload of mediums, but are not used
  for RDMA. The actual number of buffers allocated is the lesser of
  GASNET_RBUF_COUNT and a value computed from the GASNET_AM_* and
  GASNET_NETWORKDEPTH_* parameters described above. If set to zero, the
  value is limited only by the HCA's capabilities. The default value is
  1024 (up to 4MB for buffers). Reducing this parameter may limit Active
  Message throughput. If you believe your Active Message throughput is
  too low, you may try increasing this value (or setting it to zero), at
  a cost in additional pinned memory.

+ GASNET_BBUF_COUNT
  This gives the maximum number of pre-pinned buffers allocated on each
  node. These buffers are needed for assembly of AM headers and the
  payload of mediums, and for some PUTs (see
  GASNET_NONBULKPUT_BOUNCE_LIMIT).
  The actual number of buffers allocated is the lesser of the values of
  GASNET_BBUF_COUNT and GASNET_NETWORKDEPTH_TOTAL, since the total
  network depth bounds the number of in-flight operations that might need
  these buffers. If set to zero, the value is set to
  GASNET_NETWORKDEPTH_TOTAL. The default value is 1024 (up to 4MB for
  buffers). Reducing this parameter limits the number of in-flight
  operations which consume bounce buffers. This includes AMs too large
  for an inline send and PUTs subject to the
  GASNET_NONBULKPUT_BOUNCE_LIMIT. If you believe that throughput of these
  operations is too small, you may try increasing this value (or setting
  it to zero), at a cost in additional pinned memory.

AMRDMA configuration:

The following environment variables control the AM-over-RDMA code in
ibv-conduit. These parameters must be equal across all nodes, and the
behavior otherwise is undefined.

As an optimization, ibv-conduit can send Active Message (AM) traffic
using RDMA PUT operations to buffers reserved for this purpose. This
leads to a reduction in the end-to-end latency of such AM traffic, but at
the cost of increased memory usage and an increased cost of polling for
AMs (as done explicitly in gasnet_AMPoll(), and implicitly in many GASNet
calls).

+ GASNET_AMRDMA_LIMIT
  This environment variable controls the maximum size of an AM that may
  be sent via the AMRDMA path. The software overheads on both the sender
  and receiver grow with the message size. So, as the message size grows,
  the use of AMRDMA eventually costs more than the send/recv alternative.
  A value of zero will disable AMRDMA. The current default is 4K minus
  some overheads.

+ GASNET_AMRDMA_DEPTH
  This environment variable controls the number of buffers (currently 4K
  each) that are allocated per peer for receiving AMRDMA traffic. Larger
  values may potentially allow for more AMs in-flight before flow control
  throttles the sender, but at the cost of increased memory consumption.
  The value must be a power of 2 and the maximum is 32. The current
  default is 16.

+ GASNET_AMRDMA_MAX_PEERS
  This environment variable limits the number of peers from which AMs may
  be received over RDMA. Larger values may potentially allow for
  lower-cost AM traffic, but at the cost of increased memory consumption
  and greater overhead each time that ibv-conduit polls for AM traffic.
  A value of zero will disable AMRDMA. The current default is 32.

+ GASNET_AMRDMA_CYCLE
  This environment variable, which must be zero or a power of two, sets
  the period between (re)assignments of AMRDMA peers, as measured by the
  number of AMs received of size <= GASNET_AMRDMA_LIMIT. The current
  algorithm for selecting AMRDMA peers is very simplistic and will never
  revoke AMRDMA status once granted to a given peer. An overly large
  value will result in a long period of time before any peers may be
  assigned AMRDMA status, while an overly short value may select peers
  based on initialization activities which don't reflect the
  communication pattern of the remainder of the run. A value of zero will
  disable AMRDMA. A value of one will result in selecting the first
  GASNET_AMRDMA_MAX_PEERS peers from which AMs are received. The current
  default is 1024.

Firehose configuration:

These parameters must be equal across all nodes, and the behavior
otherwise is undefined.

The following environment variables control the resources used by the
"firehose" [ref 1] dynamic registration library. By default firehose will
use as much pinned memory as the HCA and O/S will permit. Resource use is
divided into two pools. The main pool is for managing the pinning of the
GASNet segment on remote nodes, while the "victim" pool is used to manage
pinnings for local use. By default in a GASNET_SEGMENT_LARGE or
GASNET_SEGMENT_EVERYTHING configuration, 75% of the pinnable memory will
go in the main pool and 25% into the victim pool.
In a GASNET_SEGMENT_FAST configuration, firehose is not needed for
management of the statically pinned GASNet segment, and by default only a
small fraction of the available memory is placed in the main pool while
the majority is placed in the victim pool.

+ GASNET_USE_FIREHOSE
  This environment variable is only available in a DEBUG build of GASNet
  (one configured with --enable-debug). This gives a boolean: "0" to
  disable or "1" to enable the use of the firehose dynamic pinning
  library in a GASNET_SEGMENT_FAST configuration. In a
  GASNET_SEGMENT_FAST configuration, the GASNet segment is registered
  (pinned) with the HCA at initialization time, because pinning is
  required for RDMA. However, GASNet allows local addresses (the source
  of a PUT or destination of a GET) to lie outside of the GASNet segment.
  So, to perform RDMA GETs and PUTs, ibv-conduit must either copy
  out-of-segment transfers through preregistered bounce buffers, or
  dynamically register memory. By default firehose is used to manage
  registration of out-of-segment memory (default is ON). Setting this
  environment variable to "0" (or "no") will disable use of firehose,
  forcing the use of bounce buffers for out-of-segment transfers. This
  will result in a significantly lower peak bandwidth for large PUTs and
  GETs, with little or no effect on small message latency. It is
  available only for debugging purposes. In a GASNET_SEGMENT_LARGE or
  GASNET_SEGMENT_EVERYTHING configuration, the GASNet segment is not
  preregistered and use of firehose is required. Thus it is an error to
  disable firehose in such a configuration.

+ GASNET_FIREHOSE_M
  This gives the amount of memory to place in the main pool. The
  suffixes "K", "M" and "G" are interpreted as Kilobytes, Megabytes and
  Gigabytes respectively, with "M" assumed if no suffix is given. When
  GASNET_FIREHOSE_MAXVICTIM_M is set, the default is the maximum pinnable
  memory minus GASNET_FIREHOSE_MAXVICTIM_M.
  Otherwise the default is 75% of the maximum pinnable memory (in a
  GASNET_SEGMENT_LARGE or GASNET_SEGMENT_EVERYTHING configuration), or
  the size of the prepinned bounce buffer pool (in a GASNET_SEGMENT_FAST
  configuration).

+ GASNET_FIREHOSE_MAXVICTIM_M
  This gives the amount of memory to place in the victim (local) pool.
  The suffixes "K", "M" and "G" are interpreted as Kilobytes, Megabytes
  and Gigabytes respectively, with "M" assumed if no suffix is given.
  The default is the maximum pinnable memory minus GASNET_FIREHOSE_M.

+ GASNET_FIREHOSE_MAXREGION_SIZE
  This gives the maximum size of a single dynamically pinned region. It
  should be a multiple of the page size, and preferably a power of two.
  The suffixes "K", "M" and "G" are interpreted as Kilobytes, Megabytes
  and Gigabytes respectively, with "M" assumed if no suffix is given.
  The current default is 128K. Larger values have been known to trigger
  a performance anomaly in some HCAs.

+ GASNET_FIREHOSE_R
  This gives the number of pinned regions to allocate for the management
  of the main pool. Values will be truncated if larger than the default
  of (GASNET_FIREHOSE_M / GASNET_FIREHOSE_MAXREGION_SIZE).

+ GASNET_FIREHOSE_MAXVICTIM_R
  This gives the number of pinned regions to allocate for the management
  of the victim (local) pool. Values will be truncated if larger than
  the default of
  (GASNET_FIREHOSE_MAXVICTIM_M / GASNET_FIREHOSE_MAXREGION_SIZE).

+ GASNET_FIREHOSE_VERBOSE
  This gives a boolean: "0" to disable or "1" to enable the output of
  internal information of use to the developers. You may be asked to run
  with this environment variable set if you report a bug that appears
  related to the firehose algorithm.

@ Section: HCA Configuration @

Normal correct operation of GASNet over IB Verbs should *not* require
any specialized configuration of your HCAs. However, this section
documents configuration changes that *may* help improve performance.
We recommend you backup your configuration data prior to attempting any modification, and that you confirm that any changes made produce a measurable benefit before deciding to keep them. If trying a suggestion here results in no measurable improvement, then we recommend that you return the modified parameter(s) to their previous value(s). WE DISCLAIM ALL RESPONSIBILITY IF FOLLOWING ANY SUGGESTION HERE RESULTS IN AN UNSTABLE OR UNUSABLE SYSTEM. Please consult the documentation provided with your HCA drivers, and/or your vendor or system integrator for information on how to query or change your HCA's configuration parameters. + The HCA configuration parameter MAX_QP_OUS_RD_ATOM controls the number of simultaneous RDMA Reads for which a QP may act as Responder. Our testing on one system with a default value of 8, showed that increasing the value to 16 yielded approximately a 30% bandwidth improvement in an RDMA-GET benchmark. @ Section: Platform-specific Notes @ + Crashes have been seen using QLogic's InfiniPath HCAs with ibv-conduit with default parameters. If you see crashes with a message containing FATAL ERROR: aborting on reap of failed send then we recommend setting the following two environment variables GASNET_NETWORKDEPTH_PP=8 GASNET_QP_RD_ATOM=1 In our testing this resulted in about a 2% reduction in peak bandwidth, but eliminated all instances of "aborting on reap of failed send". @ Section: Core API @ + Flow-control for AMs. The AMs in ibv-conduit are just implemented as send/recv traffic. Therefore a send without a corresponding recv buffer preposted at the peer will be stalled by the RNR (receiver-not-ready) flow control in IB. However there are two reasons why we want to avoid this situation. The first is that if such a send is blocked by flow control, then the ordering semantics of IB tell us that all the gets and puts that we've initiated after the AM was sent are also stalled. 
  Rather than let that happen, we should manually delay those which are
  dependent on the AM. The second reason is that under some conditions
  the RNR flow control performs very poorly. The problem is that once
  the intended receiver sends an RNR NAK to indicate no available recv
  buffers, IB has the SENDER's hardware/firmware poll for the receiver to
  become ready again! That leaves us with a choice between configuring a
  small polling interval and consuming a lot of bandwidth for this
  polling, or a large interval which leads to performance which is
  degraded more than necessary when IB flow control is asserted.

  For these reasons we implement some flow control at the AM level. The
  basic idea is that every REQUEST consumes one credit on the sending
  endpoint and every REPLY grants one credit on the receiving endpoint.
  Thus if M is the initial number of credits on each endpoint and every
  REQUEST has exactly one matching REPLY, then M becomes a limit on the
  number of un-acknowledged REQUESTS in flight on an endpoint. If we
  want to avoid RNR conditions, then we should start with M credits and
  M preposted recv buffers on each endpoint. This allows for the receipt
  of only M REQUESTS. In addition, a recv buffer will be posted on
  demand for a REPLY just before sending each REQUEST. It is a simple
  matter to count the credits when a REPLY is received and to poll for
  credits when needed to send a REQUEST. It is also simple to ensure the
  exactly-one-reply property. We already ensure that at most one reply
  is sent by the request handler. Additionally, we must check upon
  handler return for the case that the request handler sent no reply,
  and send one implicitly. We just use a special "system category"
  handler, gasnetc_SYS_ack, which doesn't even run a handler. To avoid
  using up half our bandwidth in the event of a REQUEST-REQUEST
  ping-pong, we perform some coalescing to avoid sending too many
  SYS_ack REPLIES.
  We keep up to GASNET_AM_CREDITS_SLACK credits "banked" on the
  responding node, sending the SYS_ack REPLY only if the number banked
  exceeds this limit. Credits which are banked get piggybacked on the
  next REQUEST or REPLY headed back to the original requester.

  To avoid a window of time between when we send a REPLY (credit) and
  when we post the recv buffer, we must post the replacement recv buffer
  BEFORE running an AM REQUEST handler. To do this we keep a pool of
  unposted recv buffers (also used for the on-demand posting of buffers
  needed for REPLIES). So, when we recv an AM REQUEST, we grab a free
  recv buffer from the pool and post it to the endpoint, and only then
  run the handler. We send an implicit reply if a REQUEST handler
  didn't send any REPLY. Finally, we take the recv buffer containing
  the just-processed AM and we return it to the unposted pool.

  There is a corner case we must deal with when there are no spares
  left in the unposted pool. In this case we will copy the received
  REQUEST into a temporary (non-pinned) buffer before processing it.
  This allows us to repost the recv buffer immediately. Since the
  temporary buffer is not pinned, it cannot be used for receives.
  Therefore, we free the temporary buffer when the handler is done,
  rather than placing it in the unposted pool.

  If we reap multiple AMs in a single Poll, then we reuse the previous
  buffer as the "spare" for the next one, in place of grabbing one from
  the unposted pool each time. Thus, we touch the unposted pool at most
  twice per Poll: once for the first AM we receive and once at the end
  to put the recv buffer of the final AM back in the unposted pool. For
  the dedicated receive thread we can do even better, never touching
  the unposted pool at all, by always keeping a single thread-local
  "spare", initially acquired at startup.

  Note that when SRQ is used, this AM-level flow control is not used.
@ Section: Extended API @

Notes for myself for extended API:

+ The send completion facility consists of two pointers to counters,
  associated with each sbuf. If these pointers are non-NULL then the
  counter is decremented atomically when the send is complete. One
  counter is for awaiting reuse of local memory and is only used for
  sbufs which are doing zero copy. This counter provides the mechanism
  for Longs and non-bulk puts to block before they return, and should be
  allocated as an automatic variable. The second counter is for request
  completion and should be non-NULL for every sbuf for which request
  completion will be checked (all gets & puts, but not the Longs). For
  nb and nbi the counter is waited on at sync time. Therefore the
  explicit handle is a struct containing the counter.

+ Similar to the reference implementation's cut-off between Mediums
  (which typically do a source-side copy) and Longs (which may not), we
  have a cut-off size, below which the RDMA-put operation will do
  source-side copies _iff_ local completion is desired (Long, put_nb,
  and put_nbi).

+ The gets are done w/ RDMA-reads, and use the sbuf bounce buffers if
  the local memory is not in the segment (or otherwise registered). The
  value gets also pass through the bounce buffers. Clearly there is no
  bulk/non-bulk distinction in terms of local memory reuse, just the
  alignment and optimal size distinctions. So, only the outstanding
  request counter on the sbuf is needed for syncs of all types of gets.
+ Table of when synchronization is needed:

                       Local   Remote
    Operation          Sync    Sync
    ---------------------------------
    LongAsync           X       X
    Long                I       X
    put_nb              I       S
    put_nbi             I       S
    put_nb_bulk         X       S
    put_nbi_bulk        X       S
    put_nb_val          X       S
    put_nbi_val         X       S
    put                 X       I
    put_bulk            X       I
    put_val             X       I
    get_nb              X       S
    get_nbi             X       S
    get_nb_bulk         X       S
    get_nbi_bulk        X       S
    get_nb_val          X       S
    get_nbi_val        (DOES NOT EXIST)
    get                 X       I
    get_bulk            X       I
    get_val             X       I

    X = Not needed at all (or not even applicable with _val forms)
    I = Needed before (I)nitiating function returns
    S = Needed before (S)ynchronizing function returns

+ Some minor tweaks are used to avoid allocation of counters in some
  cases.
  - For all the functions which require waiting on a counter in the
    initiating function, the counter can be allocated on the stack (as
    an automatic variable).
  - For the implicit-handle forms the request counter is in the
    thread-specific data, possibly in an access-region.
  - For the explicit-handle forms the request counter must be allocated
    from some pool, requiring some memory management work. This is done
    with a modification to the code from the reference implementation,
    and uses thread-local data to avoid locks.

+ The memsets can be more efficiently implemented as a _local_ memset
  followed by a PUT, for small enough sizes. The cutoff is presently
  the size of one bounce buffer, but has not been tuned. This was
  disabled when GASNET_PIN_MAXSZ was introduced. Therefore, all memsets
  are currently done by Active Messages.

@ Section: Graceful exits @

As of June 24, 2003, ibv-conduit passes all 9 (I added two recently) of
the cases in testexit. By "pass" I mean that the entire GASNet job
(tested up to 8-way across my 4 dual-processor machines) terminates
with no orphans, and with tracing properly finalized (if tracing is
enabled). On August 11, 2003 the graceful exit code was revised to send
O(N) network traffic in the worst case, as opposed to the O(N^2)
required in all cases by the first implementation.
Additionally, the exit code is properly propagated through the
bootstrap, to yield a correct exit code for the parallel job as a whole.
If using MPI for bootstrapping, the actual exit code will depend on what
is supported by a given MPI implementation (some ignore the exit code of
the individual processes).

This code is heavily commented, but for the curious, here is a
description of the code.

There are three paths by which an exit request can begin. The first is
through gasnetc_exit(), which may be called by the user, by the conduit
in certain error cases, and by the default signal handler for
"termination signals". The second is via a remote exit request, passed
between nodes to ensure full-job termination from non-collective exits.
The third is via an atexit/on_exit handler, registered by
gasnetc_init(), used to catch returns from main() and user calls to
exit(). There are slight variations among the code in these three
cases, but most of the work is common, and is performed by three
functions: gasnetc_exit_head(), gasnetc_exit_body() and
gasnetc_exit_tail().

The first of these, _head, is used to determine the "first" exit and
store its exit code for later use. This is important because even a
collective exit will involve receiving remote exit requests. Only if a
remote exit request is received before any local call to gasnetc_exit()
should the request handler initiate the exit. Note that even in the
case of a collective exit it is possible for the first remote request to
arrive before the local gasnetc_exit() call. However, that is made very
unlikely by the timing, and is nearly harmless since the only difference
is the raising of SIGQUIT in response to a remote exit request, which is
not done for locally-initiated ones.

The second common function, _body(), is used to perform the "meat" of
the shutdown.
It begins by ignoring SIGQUIT to avoid re-entrance, and then blocks all
but the first caller in a polling loop to prevent multiple threads from
executing the shutdown code. Because strange things can happen if we
are trying to shut down from a signal context, a signal handler is
installed for all the "abort signals". This signal handler just calls
_exit() with the exit code stored by _head(). Because we may have
problems shutting down if certain locks were held when a signal arrived,
we also install the signal handler for SIGALRM, and use the alarm()
function to bound the time spent blocked in the shutdown code. While
there is the risk that this alarm might go off "too soon" if the
shutdown has lots of work to do, we can be certain that the correct exit
code is still generated.

Once the signal handlers are established, _body closes down the tracing
and stats gathering and flushes stdout and stderr. Then _body calls
gasnetc_exit_reduce() to try to perform a collective reduce-to-all over
the exit codes. If this completes within a given timeout then we know
the exit is collective and skip over the master/slave logic described in
the next two paragraphs.

If the reduction does not complete within the timeout, then _body next
calls gasnetc_get_exit_role() to "elect" a master node for the exit.
This is done with an alarm() timer in force. The use of an "election"
with a timeout ensures that we will exit, even if node 0 is wedged. The
election of a master proceeds by sending a system-category AM request to
node 0, and spinning to wait for a corresponding reply, which will
indicate whether the local node is the "master" or a "slave" in the
coordination of the graceful exit. The logic on node 0 ensures that the
first "candidate" is always made the master, not waiting for multiple
AMs to arrive. Additionally, the slave nodes may, under circumstances
described below, know before entering gasnetc_get_exit_role() that they
are slaves, and will not bother to send an AMRequest to node 0.
In either case gasnetc_get_exit_role() indicates to _body which role the
local node is to assume.

From _body, the single master node will enter gasnetc_exit_master() and
will begin sending a remote exit request (a system-category AM, so this
will all work between _init and _attach) to each peer. Then the master
waits (with timeout, of course) for a reply from each peer. This
request conveys the desired exit code to each node. It also will wake
them out of a spin-loop, barrier, or other case where they were not yet
aware of the need to exit. In the handler for the exit request, a node
will send a reply back to the master, so it knows all the nodes are
reachable. It will set its role to "slave" and, if no exit is
in-progress, it will start the exit procedure, as described later. From
_body, the slave nodes all call gasnetc_exit_slave(), which simply spins
until the remote exit request has arrived from the master.

Regardless of whether exit coordination (the reduction, or exit requests
and replies) completed within its timeout, _body proceeds to flush
stdout and stderr one last time and closes stdin, stdout and stderr.
Finally, _body shuts down its bootstrap support. If either coordination
was completed within the timeout, then the gasnetc_bootstrapFini()
routine is called, indicating that we'll not be making any more calls to
the bootstrap code and expect to exit shortly. However, if both
coordinations failed we call gasnetc_bootstrapAbort(exitcode). This
call is meant to request that the bootstrap terminate our job "with
prejudice" since we failed to coordinate a graceful shutdown on our own.
We do this to try to avoid orphans, but risk lots of unsightly error
messages and possible loss of our exit code. Assuming we did not call
_bootstrapAbort (which does not return), we finish _body by canceling
our alarm timer and returning to our caller.

The final common routine is gasnetc_exit_tail(). This function just
does the last bit of work to terminate the job.
It is not included in _body because we let the atexit/on_exit() case
terminate "normally" after _body returns. However, in the case of exits
initiated via gasnet_exit() or a remote exit request we call _tail to
complete the exit. In _tail we set an atomic variable to wake any
threads which were stuck polling in _body due to being other than the
first thread to enter. Those threads should eventually wake and also
call _tail to terminate. Next, we call gasneti_killmyprocess() to do
any platform-specific magic required to get the entire multithreaded
application to exit. Finally we call _exit() with the saved exit code.

Given the routines gasnetc_exit_{head,body,tail}(), the code for the
three types of exit is pretty trivial. In particular, gasnetc_exit()
just calls _head, _body and _tail with no additional logic.

In the request handler for the exit request AM, we look at the return
from _head to determine if this exit request is the first we've seen
(inclusive of local calls to gasnet_exit() and our atexit/on_exit
handler). If it IS the first exit request, then we raise a SIGQUIT, as
required by the GASNet spec, to allow the user's handler to perform its
cleanup. However, to get the most robust exit code we don't want to run
the _body code from a signal handler context if we can avoid it.
Therefore we inspect the signal handler and skip the raise() call if the
handler is the gasnet default handler, SIG_DFL or SIG_IGN. After the
raise() returns, or is skipped altogether, we are certain that the
user's handler, if any, has executed and has NOT called gasnet_exit().
If a user handler had called gasnet_exit(), then raise() would not have
returned. So, if we reach the code after the possible raise(), we
proceed to call gasnetc_exit_body() and _tail to complete the
(hopefully) graceful exit of the gasnet job.

It is important to note that if we get a remote exit request that
initiates an exit, then we will never return from the handler.
However, the design of the AM code in ibv-conduit ensures that this will
actually work without deadlock. For one, we never run handlers from
signal context or with locks held. Thus we can expect a "clean" set of
locks. Furthermore, we don't expect to do anything useful with the
network once the request handler calls _body anyway.

The atexit handler just calls _head and _body before returning to allow
the exit to complete. In this case we have a little problem with the
lack of access to the return code. Therefore we just pass 0 to _head,
which _body then sends in the remote exit requests. Experience has
shown that, at least with LAM/MPI for bootstrap, when all but one task
exits with zero, the single non-zero exit code becomes the exit code for
the parallel job. Therefore, using zero here gives the specified exit
code from the parallel job for both collective and non-collective
returns from main. If support for on_exit() is detected at configure
time, then it is used rather than atexit(), and the problem of the
missing return code vanishes.

In the normal case of a collective exit, the reduce-to-all-with-timeout
is performed in 3 steps. The first is an intra-supernode reduction.
The second is a reduce-to-all over supernodes using the same
communication pattern as the dissemination barrier, requiring
ceil(log_2(SN)) rounds in which each supernode sends and receives one AM
(where "SN" is the number of supernodes). The third step is a
supernode-scoped broadcast. For non-PSHM builds, only a
dissemination-based reduce-to-all is performed (steps 1 and 3 are
eliminated and "supernode" is replaced by "node" in the description of
step 2).

For the non-collective exits, there is both a "best case" and a "worst
case" to consider:

Best case: one node is way ahead of the others and can win the master
election and send remote exit requests before the others attempt the
election.
In this case the coordinated shutdown needs 1 round-trip for the
election, followed by (N-1) round-trips for the remote exit
request/reply, for a total of 2*N AMs sent (not counting those from the
failed reduction).

Worst case: all nodes attempt the election at roughly the same time and
a full N round-trips take place for the election, followed by (N-1)
round-trips for the remote exit request/reply, for a total of 4*N-2 AMs
sent (plus those from the failed reduction).

The average case for non-collective exits is somewhere between those
two.

@ Section: References @

[1] Bell, Bonachea. "A New DMA Registration Strategy for Pinning-Based
    High Performance Networks", Workshop on Communication Architecture
    for Clusters (CAC'03), 2003.
    Also at http://upc.lbl.gov/publications/