Legion Overview
The notes below are a summarization of Legion’s manual. Legion is an asynchronous, task-based distributed execution engine based on the following concepts:
Tasks
Tasks are the basic units of the Legion parallel computation, their execution is non-preemptible and they can have multiple variants for each kind of processor (CPU vs GPU) that can be used, as well as for each memory layout of the arguments. Every task should be assigned a task ID and registerd with the runtime before the runtime starts. A particular task is designated as the top-level task, and it will be invoked by the start method. Tasks can call any C++ function, including those allocating/deallocating memory, but they cannot use packages other than Legion to implement parallelism or concurrency.
Task Registration: To register a task with the runtime, we need to provide the name of the task, the task ID, the kind of processor that can run the task (or other constraints), and pass two booleans indicting whether the task is launchable as a single task and/or as part of an index launch
Task arguments and signature: A task’s signature contains an object representing the task, a verctor of physical region instances (used to pass data between tasks), a context object, with the task’s metadata, and a pointer to the runtime. Arguments should not contain C++ pointers, and tasks should not refer global variables
Subtasks
Subtasks are tasks called by other tasks. To execute a subtask, the
parent creates a TaskLauncher object, which takes two arguments: the
ID of the task to be launched and TaskArgument object holding any
arguments (and their size), which will be passed by value. TaskArgument
objects should only be used for passing small amounts of data and cannot
contain pointers nor futures.
Executing a task: to execute a task, the parent calls
runtime->execute_task, with the TaskLauncher as an argument.
This method is not blocking, meaning that all tasks are launched
asynchronously by their parent. The parent task will, however, wait
until all children return before terminating.
Futures
Futures are placeholders for the results returned by tasks, since tasks are executed asynchronously. If we read the value of a future, the future will be immediately evaluated, with the parent blocking until the value is returned.
Points, Rectangles, Domains
Legion defines the points, rectangles and domains types for launching sets of tasks, as well as other related applications. Various operations are defined on these types.
Point: a point is a tuple of \(n\) integers
Rectangle: a rectangle is a set of points. It is defined using two points A and B, and includes all points that are greater or equal to A and less or equal to B.
Domains: a domain is equivalent to a rectangle, but with less type checking.
Index Launches
An index launch is a mechanism to launch multiple tasks at once, with an
improved efficiency. In general, index launches are only meant to
execute multiple instances of the same task in parallel. Index launches
make use of a IndexLauncher (counterpart of the TaskLauncher)
and are executed with runtime->execute_index_space (instead of
runtime->execute_task). To pass arguments to tasks launched as part
of an index launch, Legion uses an ArgumentMap, mapping each point
in the task index space to an argument for the corresponding task.
Similarly, the exeutor does not return a single future but a
FutureMap, mapping each point in the index space to the future
returned by the corresponding task.
Regions
Legion uses regions to handle large quantities of data (smaller amounts can be handled using arguments and futures).
Logical region: a logical region is a table whose rows are defined by an index space an whose columns are defined by a field space. The number of indexes in an index space is defined by the user, whereas the total number of fields in a field space is bounded. Logical regions do not actually hold data.
Physical instance: each physical instance of a logical region holds a copy of the data to be stored by that region. Multiple copies can be placed on multiple machines for increased performance.
Creating a physical instance
The most common way to instantiate a logical region is to pass it as a
requirement to a substask. Legion will then automatically create a
physical instance for the logical region when launching the subtask. We
can add a region requirement to a subtask through the
add_region_requirement method of the TaskLauncher. A
RegionRequirement consists of, in addition to the logical region to
be instantiated, a privilege (WRITE_DISCARD, READ_ONLY,
READ_WRITE, WRITE, REDUCE), a coherence mode
(ATOMIC, SIMULTANEOUS, RELAXED) and the parent region (or
the region itself if the region has no parent).
Partitioning
Regions can be partitioned in multiple ways, and to arbitrary levels of recursion. The partitioning hierarchy defines a region tree. Partitioning is done dynamically at runtime; applications can create and destroy partitions. Partitioning can be costly, so it has to be done judiciously. Partitions only name subsets of data, they don’t allocate any storage themselves.
Aliased/disjoint: a partition is called aliased if its subregions overlap. A partition whose subregions do not share elements is called disjoint
Complete: a partition where every element of the partitioned region is at least in one subregion is called complete
Ghost regions/stencils: overlapping boundaries of blocks?
Projection functor: we can pass subregions to tasks as part of an index launch through the use of two arguments. The first argument is the logical partition to be used, and the second one is the identifier of a projection functor, which determines how subregions are assigned to tasks. Users can provide their own projection functor.
Equal partitions
Equal partitions are the simplest type; a region is simply split into subregions of approximately the same size. An index space is used to determine the number of subregions. The points in the index space corresponding to the subregions are called colors. The partitioning is applied to the index space, rather than the logical region. In this way, logical regions sharing the same index space can reuse the same partitioning.
Equal partitions are both complete and disjoint
By default, equal partitioning of a multi-dimensional region only partitions the first dimension.
Partition by Field
Partitioning by field consists of specifying for each element of a region which subregion it should be in.
Partition by Restriction
A blocked partition is one where a region is divided into blocks of the same size. In applications with stencil, it is useful to include ghost cells asjacent to the block.
A partition by restriction takes a transform and an extent as inputs. The extent \(E\) is a rectangle with the desired size of each subregion, whereas the transform \(T\) is a \(n \times m\) matrix, where \(n\) is the number of dimensions of the color space, and \(m\) is the number of dimensions of the region. Each point \(p\) in the color space corresponds to the points defined by the rectangle \(Tp + E\) in the corresponding subregion.
Set-based Partitions
Set-based partitions allow us to defined partitions that are dependent on other ones. For example, we can define a difference partition, obtained by the set difference (by color) between two index space partitions. Similary, we can define a union partition or intersection partition.
Image and Pre-Image Partitions
Image partitions allow us to create new partitions that are compatible with already computed partitions from another region, e.g. if we want to create a partition for the nodes of a graph and a partition for the graph’s edges. The image partition and pre-image partition allow us to do that by defining a pointer relationship betwene two regions, which in turn induces a partition of one region given the already existing partitioning in another one.
Control Replication
In Legion, root tasks are responsible for launching subtasks to utilize each parallel machine, but this design choice can cause bottlenecks whenever a root task needs to launch ~16-32 tasks or more. To solve the problem, Legion uses control replication, which consists in executing multiple copies of the parent task in parallel, and each copy launches a subset of the child tasks.
The most common case is to control-replicate only the top-level task. To use control replication, the programmer only needs to mark the task as replicable. It is the mapper’s role to decide whether to replicate it and how to shard the work of analyzing the subtasks across the instances of the replicated tasks.
Coherence
The coherence settings control how sibling tasks can do concurrently with a region. There are four coherence modes: exclusive where each task has exclusive access to a region for the entirety of the task’s execution time, atomic, where sibling tasks must execute atomically with respect to a region, simultaneous, which implements a shared memory abstraction where race conditions are allowed, relaxed, where no restriction is in place, and the application can organize the data however it wants.
Simultaneous coherence
In the simultaneous coherence scenario, a copy restriction is added to the underlying region, so that there can only be one physical instance, and all tasks need to use it. Because of this restriction, the region cannot be copied to a GPUs. Two important concepts, when it comes to coherence, are are phase barriers and the acquire/release directives.
Phase barriers have an arrival count, and when an operation arrives at the barrier, the counter is incremented. The operation arriving at the barrier does not stop its execution. Operations can wait at a phase barrier, in the sense that their execution does not start until the phase barrier is triggered. Phase barriers have up to \(2^{32}\) generations; each operation arrives or waits at a specific generation. When a phase barrier triggers, it is advanced to the new generation, and the waiters at the next generation can start their execution.
The acquire directive removes the copy restriction on the underlying region, which means the region can be copied anywhere, including on GPU accelerators. It should be used by tasks that are certain to be the only ones using a region. On the other hand, the release directive flushes all updates, and reintroduces the copy restriction.
In the most common scenario, phase barriers and the acquire/release directives are used together: phase barriers are used to enforce the proper synchronization of acquire/release.
Mapping
Mappers expose to the application those decisions that affect its performance. Legion provides a default mapper, and a few other pre-designed ones, but the user is encourages to write their own application-specific mapper for better performance
A mapper is an abstract C++ class defining a set of pure virtual functions invoked by the Legion runtime as callback. The invokations can happen in a unpredictable order, and the degree of concurrency between mapper callbacks is controlld by the synchronization mode that is selected. Mappers can store arbitrary state.
Mapper registration consists of registering the mapper object with
the runtime. This must be done before the application begins, but after
the runtime is created. We can register a CustomMapper using
Runtime::add_registration_callback, which takes a function pointer
as an arguemnt. The function itself should take as arguments a
Machine object, a Runtime pointer, and a reference to a STL set
of Processor objects. We can register multiple callbacks by calling
Runtime::add_registration_callback multiple times. The runtime will
invoke the callback functions in the order they were added, once for
each instance of the Legion runtime. In multiprocess jobs, there will be
one copy of the Legion runtime per process, and thus one invocation of
each callback per process.
Synchronization models allow the user to control how a runtime instance’s multiple threads should interact and whether they can invoke mapper calls for the same mapper concurrently. The synchronization model can only be set once and cannot be changed. There are three available synchronization models:
serialized non rereentrant, where calls to the mapper objext are serialized and executed atomically. If mapper calls to the runtime are preempted, no other mapper call can be invoked by the runtime
serialized reentrant, where at most one mapper call executes at a time, but if a mapper call invokes a runtime method that preempts it, the runtime can execute another mapper call or resume the blocked call
concurrent, where mapper calls to the same mapper object can proceed concurrently. Users can use
lock_mapperandunlock_mapperto perform synchronization
Machine interface: it allows mappers to access information about the
machine on which the application is executing. This is done through a
Machine object, defined by Realm, Legion’s low-level portability
layer (file realm/machine.h). The older methods for accessing the
machine object are get_all_processors and get_all_memories,
which populate STL data structures with the names of processors and
memories on the current machine. However, these methods are not
scalable. The more modern and efficient interface uses queries:
ProcessorQuery and MemoryQuery. The mapper applies filters
to reduce the set of processors and memories of interest, thus making
the search more efficient. For example, local_address_space
specializes the query to the local node, only_kind specializes to a
specific kind of processor, has_affinity requires the search to
focus on processors/memories that have a certain affinity to other
processors/memories, etc. Queries can still be expensive, so it’s
important to memoize the results whenever possible.
Mapping tasks
Mapping tasks is the most important role of the mapper. When a task is launched, it goes through a pipeline of mapping callbacks. The most important ones are:
select_task_options: is called exactly once and sets the fields of theTaskOptions& outputobject. The callback comes first because the fields of the output object determine the rest of the mapping process for the task. More specifically:output.initial_procdetermines the processor that will execute the task. The default is the current processor. A remote processor can be used as well. The processor can also change in subsequent stagesoutput.inline_taskdetermines whether to inline the task in the parent, and use the parent task’s regions.output.stealablecontrols whether the task can be stolen by other mappers for load balancing. If set totrue, other mappers can steal until the task is picked byselect_tasks_to_mapoutput.map_locally(by defaultfalse) controls whether the task mapping (map_task) stage of the pipeline should be run by the current mappervalid_instancescontrols whether the task should receive a list of the currently calid instances of regions when it callsrequest_valid_instances. Setting this option tofalsesaves overhead if the taks will never use a currently valid region instancereplicate_defaultcontrols whether to run a task separately on every Legion process participating in the replication of the parent task, when we are in a control-replication context. This might be advantageous if the cost of broadcasting the results is larger than the cost of simply recomputing them on each replica.output.parent_priority: changes the parent priority, where allowed
select_sharding_functor: selects the functors to use for sharding index task launches in control-replicated contexts. The callback is run once per replicated task index launch in each replicated context.slice_taskis used in index launches to slice the index space of tasks into smaller sets of tasks, so that each set can be sent to a destination mapper as a single object rather than sending multiple individual tasks individuallyselect_tasks_to_mapprocesses tasks from theready_taskslist and selects those that should be mapped and those that should be sent to other processor. At this point in the pipeline, all index tasks have been expanded into single tasks, andselect_tasks_to_mapis called by the mapper at the destination process, unlessmap_locallyis set totrue. For each task in the list,select_tasks_to_mapcan either move the task to themap_tasksset, or add the task to therelocate_tasksmap if the task needs to be transferred. Alternatively,select_tasks_to_mapcan also leave the task in theready_taskslist. Ifselect_tasks_to_mapdoes not move any task, it will need to provide aMapperEventin thedeferral_eventfield, so that the new call toselect_tasks_to_mapcan be deferrred until the event is triggeredmap_taskmaps the tasks’s region arguments and selects the variant to use, after which the task will actually run on one the selected processes. It takes ans input a vector of vectors of valid instances, each containing alida data for the corresponding region requirement. Thepre_mappedregion holds the indices of region requirements that are already satisfied, and don’t need to be remapped. The callback will need to fill in:targer_procs, the processes that can execute the task. The runtime will execute the task on the first one to become availablechosen_variant: the variant of the task to usechosen_instances: the instances to use fro thr task. The values returned should be chosen from the ones invalid_instances, or the mapper can create new physical instances by callingcreate_physical_instanceand fill the new instances with data from existing instances usingselect_task_sources.output_targets: if the region is strictly an output regionfutures_locations: the memory for holding the futures produced by the taskuntracked_valid_regions: a set of indices of read-only regions that should be marked for garbage collection after the task is readypostmap_task
Virtual mappings
Virtual mappings are used for logical region arguments that are not used by a task, but only passed as an argument to a subtask
Profiling
Most profiling is implemented by the Realm runtime, and can be
turned on by passing the fla -lg:prof to the Legion runtime. More
fine-grained profiling can be enabled throguh the type
ProfileRequest and its public method add_measurement()