StarPU Handbook - StarPU Extensions
2. Advanced Tasks In StarPU

2.1 Task Dependencies

2.1.1 Sequential Consistency

By default, StarPU infers task dependencies from data dependencies (sequential consistency). The application can however disable sequential consistency for some data, and dependencies can also be expressed explicitly.

Setting (or unsetting) sequential consistency can be done at the data level by calling starpu_data_set_sequential_consistency_flag() for a specific data (an example is in the file examples/dependency/task_end_dep.c) or starpu_data_set_default_sequential_consistency_flag() for all data (an example is in the file tests/main/subgraph_repeat.c).
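
For instance, the flag can be cleared for a handle the application manages explicitly (a minimal sketch; scratch_handle is assumed to be an already registered data handle):

/* Assume scratch_handle is a registered temporary buffer: do not infer
 * implicit dependencies from accesses to it. */
starpu_data_set_sequential_consistency_flag(scratch_handle, 0);

/* Or change the default for all data registered from now on. */
starpu_data_set_default_sequential_consistency_flag(0);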

The sequential consistency mode can be retrieved by calling starpu_data_get_sequential_consistency_flag() for a specific data handle, and the default mode by calling starpu_data_get_default_sequential_consistency_flag().

Setting (or unsetting) sequential consistency can also be done at task level by setting the field starpu_task::sequential_consistency to 0 (an example is in the file tests/main/deploop.c).
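
At the task level, this could look as follows (a minimal sketch, assuming the codelet cl and the handle are set up as usual):

struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
/* Do not infer implicit dependencies from the data accessed by this task. */
task->sequential_consistency = 0;
starpu_task_submit(task);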

Sequential consistency can also be set (or unset) for each handle of a specific task through the field starpu_task::handles_sequential_consistency. When set, it should point to an array with as many elements as the task has handles, the i-th element giving the sequential consistency setting for the i-th handle. The field can easily be set when calling starpu_task_insert() with the flag STARPU_HANDLES_SEQUENTIAL_CONSISTENCY:

char *seq_consistency = malloc(cl.nbuffers * sizeof(char));
seq_consistency[0] = 1;
seq_consistency[1] = 1;
seq_consistency[2] = 0;
int ret = starpu_task_insert(&cl,
                             STARPU_HANDLES_SEQUENTIAL_CONSISTENCY, seq_consistency,
                             STARPU_RW, handleA, STARPU_RW, handleB, STARPU_RW, handleC,
                             0);
free(seq_consistency);

A full code example is available in the file examples/dependency/sequential_consistency.c.

The internal algorithm used by StarPU to set up implicit dependency is as follows:

if (sequential_consistency(task) == 1)
    for(i=0 ; i<STARPU_TASK_GET_NBUFFERS(task) ; i++)
        if (sequential_consistency(i-th data, task) == 1)
            if (sequential_consistency(i-th data) == 1)
                create_implicit_dependency(...)

2.1.2 Tasks And Tags Dependencies

One can explicitly set dependencies between tasks using starpu_task_declare_deps() or starpu_task_declare_deps_array(). Dependencies can also be expressed through tags: a tag is associated with a task through the field starpu_task::tag_id, and dependencies between tags are declared with starpu_tag_declare_deps() or starpu_tag_declare_deps_array(). The example tests/main/tag_task_data_deps.c shows how to set dependencies between tasks with the different functions.
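
For instance, a dependency of one task on two others can be declared either directly or through tags (a minimal sketch, assuming taskA, taskB and taskC were created with starpu_task_create()):

/* taskC will not start before taskA and taskB have terminated. */
struct starpu_task *deps[2] = { taskA, taskB };
starpu_task_declare_deps_array(taskC, 2, deps);

/* The same dependency expressed with tags. */
taskA->use_tag = 1; taskA->tag_id = 40;
taskB->use_tag = 1; taskB->tag_id = 41;
taskC->use_tag = 1; taskC->tag_id = 42;
starpu_tag_declare_deps((starpu_tag_t) 42, 2, (starpu_tag_t) 40, (starpu_tag_t) 41);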

The termination of a task can be delayed through the function starpu_task_end_dep_add() which specifies the number of calls to the function starpu_task_end_dep_release() needed to trigger the task termination. One can also use starpu_task_declare_end_deps() or starpu_task_declare_end_deps_array() to delay the termination of a task until the termination of other tasks. A simple example is available in the file tests/main/task_end_dep.c.
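
For instance (a minimal sketch; the task is assumed to be created as usual, and the two releases would typically be performed from the callbacks of other tasks):

/* The task will not terminate before starpu_task_end_dep_release()
 * has been called twice on it. */
starpu_task_end_dep_add(task, 2);
starpu_task_submit(task);
/* ... later, e.g. from the callbacks of two other tasks ... */
starpu_task_end_dep_release(task);
starpu_task_end_dep_release(task);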

starpu_tag_notify_from_apps() can be used to explicitly unlock a specific tag, but if it is called several times on the same tag, notification will be done only on first call. However, one can call starpu_tag_restart() to clear the already notified status of a tag which is not associated with a task, and then calling starpu_tag_notify_from_apps() again will notify the successors. Alternatively, starpu_tag_notify_restart_from_apps() can be used to atomically call both starpu_tag_notify_from_apps() and starpu_tag_restart() on a specific tag.
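
For instance (a minimal sketch, using an arbitrary application-chosen tag value):

starpu_tag_t tag = 0x42;
/* The first notification unlocks the successors of the tag. */
starpu_tag_notify_from_apps(tag);
/* To be able to notify the same tag again, restart it first... */
starpu_tag_restart(tag);
starpu_tag_notify_from_apps(tag);
/* ...or perform both operations atomically. */
starpu_tag_notify_restart_from_apps(tag);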

To get the task associated with a specific tag, one can call starpu_tag_get_task(). Once the corresponding task has been executed and no other tag depends on this tag anymore, one can call starpu_tag_remove() to release the resources associated with the tag.

2.2 Waiting For Tasks

StarPU provides several advanced functions to wait for termination of tasks. One can wait for some explicit tasks, or for some tag attached to some tasks, or for some data results.

starpu_task_wait_array() waits for the completion of an array of tasks. starpu_task_wait_for_all_in_ctx() waits for the completion of all tasks submitted in a given scheduling context. starpu_task_wait_for_n_submitted_in_ctx() waits until the number of submitted but not yet terminated tasks in a given context falls below a given threshold. starpu_task_wait_for_no_ready() waits until no task is ready any more, i.e. all remaining tasks have either completed or are blocked on a data dependency. To be able to wait for specific tasks with these functions, starpu_task::detach should be set to 0 before task submission.
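
For instance (a minimal sketch, assuming two tasks created with starpu_task_create()):

struct starpu_task *tasks[2] = { task1, task2 };
/* Tasks to be waited for explicitly must not be detached. */
task1->detach = 0;
task2->detach = 0;
starpu_task_submit(task1);
starpu_task_submit(task2);
/* Block until both tasks have terminated. */
starpu_task_wait_array(tasks, 2);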

The function starpu_task_nready() returns the number of tasks that are ready to execute, which means that all their data dependencies are satisfied and they are waiting to be scheduled, while the function starpu_task_nsubmitted() returns the number of tasks that have been submitted and not completed yet.
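
These counters can for instance be used to report progress from the application (a minimal sketch, assuming <stdio.h> is included):

fprintf(stderr, "%d tasks submitted, %d ready\n",
        starpu_task_nsubmitted(), starpu_task_nready());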

The function starpu_task_finished() can be used to determine whether a specific task has completed its execution.

starpu_tag_wait() and starpu_tag_wait_array() are two blocking functions that wait for the tasks associated with specific tags to complete. The former waits for a single tag, while the latter waits for a group of tags.

When using e.g. starpu_task_insert(), it may be more convenient to wait for the result of a task rather than for a given task explicitly. This can be done thanks to starpu_data_acquire() or starpu_data_acquire_cb(), which wait for the result to be available in the home node of the data, and thus for all the tasks that lead to that result. One can also use starpu_data_acquire_on_node() and give it STARPU_ACQUIRE_NO_NODE to just wait for the tasks to complete, without waiting for the data to be available in the home node. One can also use starpu_data_acquire_try() or starpu_data_acquire_on_node_try() to just test for termination.
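
For instance (a minimal sketch; handle is assumed to be the handle holding the result):

/* Wait for all tasks working on handle and fetch the result to the home node. */
starpu_data_acquire(handle, STARPU_R);
/* ... read the result from the application ... */
starpu_data_release(handle);

/* Or only wait for the tasks, without transferring the data. */
starpu_data_acquire_on_node(handle, STARPU_ACQUIRE_NO_NODE, STARPU_R);
starpu_data_release_on_node(handle, STARPU_ACQUIRE_NO_NODE);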

If a task is created using starpu_task_create() or starpu_task_insert(), the field starpu_task::destroy is set to 1 by default, which means that the task structure will be automatically freed after termination. If the task is instead initialized using starpu_task_init(), starpu_task::destroy is 0 by default, and the task structure will not be freed until starpu_task_destroy() is called explicitly. One can also manually set starpu_task::destroy to 1 before submission, or call starpu_task_set_destroy() after submission, to enable the automatic freeing of the task structure.
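
For instance (a minimal sketch illustrating the non-automatic case, assuming the codelet cl and the handle are set up as usual):

/* Keep the task structure alive after termination to inspect it. */
struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
task->destroy = 0;
task->detach = 0;
starpu_task_submit(task);
starpu_task_wait(task);
/* ... look at task->profiling_info, task->status, etc. ... */
starpu_task_destroy(task);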

2.3 Using Multiple Implementations Of A Codelet

One may want to write multiple implementations of a codelet for a single type of device and let StarPU choose which one to run. As an example, we will show how to use SSE to scale a vector. The codelet can be written as follows:

#include <xmmintrin.h>
void scal_sse_func(void *buffers[], void *cl_arg)
{
    float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
    unsigned int n_iterations = n/4;
    if (n % 4 != 0)
        n_iterations++;

    __m128 *VECTOR = (__m128*) vector;
    __m128 factor __attribute__((aligned(16)));
    factor = _mm_set1_ps(*(float *) cl_arg);

    unsigned int i;
    for (i = 0; i < n_iterations; i++)
        VECTOR[i] = _mm_mul_ps(factor, VECTOR[i]);
}
struct starpu_codelet cl =
{
.cpu_funcs = { scal_cpu_func, scal_sse_func },
.cpu_funcs_name = { "scal_cpu_func", "scal_sse_func" },
.nbuffers = 1,
.modes = { STARPU_RW }
};

The full code of this example is available in the file examples/basic_examples/vector_scal.c.

Schedulers which are multi-implementation aware (only dmda and pheft for now) will use the performance models of all the provided implementations, and pick the one which seems to be the fastest.

2.4 Enabling Implementation According To Capabilities

Some implementations may not run on some devices. For instance, some CUDA devices do not support double floating point precision, and thus the kernel execution would just fail; or the device may not have enough shared memory for the implementation being used. The field starpu_codelet::can_execute allows expressing this. For instance:

static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
    const struct cudaDeviceProp *props;
    if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
        return 1;
    /* CUDA device */
    props = starpu_cuda_get_device_properties(workerid);
    if (props->major >= 2 || props->minor >= 3)
        /* At least compute capability 1.3, supports doubles */
        return 1;
    /* Old card, does not support doubles */
    return 0;
}
struct starpu_codelet cl =
{
.cpu_funcs = { cpu_func },
.cpu_funcs_name = { "cpu_func" },
.cuda_funcs = { gpu_func },
.nbuffers = 1,
.modes = { STARPU_RW }
};

A full example is available in the file examples/reductions/dot_product.c.

This can be essential e.g. when running on a machine which mixes various models of CUDA devices, to benefit from the new models without crashing on the old ones.

Note: the function starpu_codelet::can_execute is called by the scheduler each time it tries to match a task with a worker, and should thus be very fast. The function starpu_cuda_get_device_properties() provides quick access to CUDA properties of CUDA devices to achieve such efficiency.

Another example is to compile CUDA code for various compute capabilities, resulting in two CUDA functions, e.g. scal_gpu_13 for compute capability 1.3, and scal_gpu_20 for compute capability 2.0. Both functions can be provided to StarPU by using starpu_codelet::cuda_funcs, and starpu_codelet::can_execute can then be used to rule out the scal_gpu_20 variant on a CUDA device which will not be able to execute it:

static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
    const struct cudaDeviceProp *props;
    if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
        return 1;
    /* CUDA device */
    if (nimpl == 0)
        /* Trying to execute the 1.3 capability variant, we assume it is ok in all cases. */
        return 1;
    /* Trying to execute the 2.0 capability variant, check that the card can do it. */
    props = starpu_cuda_get_device_properties(workerid);
    if (props->major >= 2)
        /* At least compute capability 2.0, can run it */
        return 1;
    /* Old card, does not support 2.0, will not be able to execute the 2.0 variant. */
    return 0;
}
struct starpu_codelet cl =
{
.cpu_funcs = { cpu_func },
.cpu_funcs_name = { "cpu_func" },
.cuda_funcs = { scal_gpu_13, scal_gpu_20 },
.nbuffers = 1,
.modes = { STARPU_RW }
};

Another example is to provide specialized implementations for some common sizes; for instance, here we have a specialized implementation for 1024x1024 matrices:

static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
    if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
        return 1;
    /* CUDA device */
    switch (nimpl)
    {
    case 0:
        /* Trying to execute the generic variant. */
        return 1;
    case 1:
    {
        /* Trying to execute the size == 1024 specific variant. */
        struct starpu_matrix_interface *interface = starpu_data_get_interface_on_node(task->handles[0], STARPU_MAIN_RAM);
        return STARPU_MATRIX_GET_NX(interface) == 1024 && STARPU_MATRIX_GET_NY(interface) == 1024;
    }
    }
    return 0;
}
struct starpu_codelet cl =
{
.cpu_funcs = { cpu_func },
.cpu_funcs_name = { "cpu_func" },
.cuda_funcs = { potrf_gpu_generic, potrf_gpu_1024 },
.nbuffers = 1,
.modes = { STARPU_RW }
};

Note that the most generic variant should be provided first, as some schedulers are not able to try the different variants.

2.5 Getting Task Children

It may be interesting to get the list of tasks which depend on a given task, notably when using implicit dependencies, since this list is computed by StarPU. starpu_task_get_task_succs() or starpu_task_get_task_scheduled_succs() provides it. For instance:

struct starpu_task *tasks[4];
int ret = starpu_task_get_task_succs(task, sizeof(tasks)/sizeof(*tasks), tasks);

A full example of getting task children is available in the file tests/main/get_children_tasks.c.

2.6 Parallel Tasks

StarPU can leverage existing parallel computation libraries by means of parallel tasks. A parallel task is a task which is run by a set of CPUs (called a parallel or combined worker) at the same time, using an existing parallel CPU implementation of the computation to be achieved. This can also be useful to improve the load balance between slow CPUs and fast GPUs: since CPUs work collectively on a single task, the completion time of tasks on CPUs becomes comparable to the completion time on GPUs, thus relieving the application from granularity discrepancy concerns. hwloc support needs to be enabled to get good performance, otherwise StarPU will not know how to group cores appropriately.

Two modes of execution exist to accommodate existing usages.

2.6.1 Fork-mode Parallel Tasks

In the Fork mode, StarPU will call the codelet function on one of the CPUs of the combined worker. The codelet function can use starpu_combined_worker_get_size() to get the number of threads it is allowed to start to achieve the computation. The CPU binding mask for the whole set of CPUs is already enforced, so that threads created by the function will inherit the mask, and thus execute where StarPU expected, the OS being in charge of choosing how to schedule threads on the corresponding CPUs. The application can also choose to bind threads by hand, using e.g. sched_getaffinity to know the CPU binding mask that StarPU chose.

For instance, using OpenMP (full source is available in examples/openmp/vector_scal.c):

void scal_cpu_func(void *buffers[], void *_args)
{
    unsigned i;
    float *factor = _args;
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

#pragma omp parallel for num_threads(starpu_combined_worker_get_size())
    for (i = 0; i < n; i++)
        val[i] *= *factor;
}
static struct starpu_codelet cl =
{
.modes = { STARPU_RW },
.where = STARPU_CPU,
.type = STARPU_FORKJOIN,
.max_parallelism = INT_MAX,
.cpu_funcs = {scal_cpu_func},
.cpu_funcs_name = {"scal_cpu_func"},
.nbuffers = 1,
};

Other examples include for instance calling a BLAS parallel CPU implementation (see examples/mult/xgemm.c).

2.6.2 SPMD-mode Parallel Tasks

In the SPMD mode, StarPU will call the codelet function on each CPU of the combined worker. The codelet function can use starpu_combined_worker_get_size() to get the total number of CPUs involved in the combined worker, and thus the number of calls that are made in parallel to the function, and starpu_combined_worker_get_rank() to get the rank of the current CPU within the combined worker. For instance:

static void func(void *buffers[], void *_args)
{
    unsigned i;
    float *factor = _args;
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

    unsigned m = starpu_combined_worker_get_size();
    unsigned j = starpu_combined_worker_get_rank();

    /* Compute slice to compute */
    unsigned slice = (n+m-1)/m;
    for (i = j * slice; i < (j+1) * slice && i < n; i++)
        val[i] *= *factor;
}
static struct starpu_codelet cl =
{
.modes = { STARPU_RW },
.type = STARPU_SPMD,
.max_parallelism = INT_MAX,
.cpu_funcs = { func },
.cpu_funcs_name = { "func" },
.nbuffers = 1,
};

A full example is available in examples/spmd/vector_scal_spmd.c.

Of course, this trivial example will not really benefit from parallel task execution, and was only meant to be simple to understand. The benefit comes when the computation to be done is such that threads have to e.g. exchange intermediate results, or write to the same buffer in a complex but safe way.

2.6.3 Parallel Tasks Performance

To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to be used. When exposed to codelets with the flag STARPU_FORKJOIN or STARPU_SPMD, the schedulers pheft (parallel-heft) and peager (parallel eager) will indeed also try to execute tasks with several CPUs. They will automatically try the various available combined worker sizes (making several measurements for each worker size) and thus be able to avoid choosing a large combined worker if the codelet does not actually scale that well. Examples using a parallel-task-aware StarPU scheduler are available in tests/parallel_tasks/parallel_kernels.c and tests/parallel_tasks/parallel_kernels_spmd.c.

This is however only a proof of concept for now, and has not really been optimized yet.

2.6.4 Combined Workers

By default, StarPU creates combined workers according to the architecture structure as detected by hwloc: for each object of the hwloc topology (NUMA node, socket, cache, ...) a combined worker is created. If some nodes of the hierarchy have a large arity (e.g. many cores in a socket without a hierarchy of shared caches), StarPU will create combined workers of intermediate sizes. The environment variable STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER permits tuning the maximum arity between levels of combined workers.

The combined workers actually produced can be seen in the output of the tool starpu_machine_display (the environment variable STARPU_SCHED has to be set to a combined worker-aware scheduler such as pheft or peager).

2.6.5 Concurrent Parallel Tasks

Unfortunately, many environments and libraries do not support concurrent calls.

For instance, most OpenMP implementations (including the main ones) do not support concurrent pragma omp parallel statements without nesting them in another pragma omp parallel statement, but StarPU does not yet support creating its CPU workers by using such a pragma.

Other parallel libraries are also not safe when being invoked concurrently from different threads, due to the use of global variables in their sequential sections, for instance.

The solution is then to use only one combined worker at a time. This can be done by setting the field starpu_conf::single_combined_worker to 1, or setting the environment variable STARPU_SINGLE_COMBINED_WORKER to 1. StarPU will then run only one parallel task at a time (but other CPU and GPU tasks are not affected and can be run concurrently). The parallel task scheduler will however still try varying combined worker sizes to look for the most efficient ones. A full example is available in examples/spmd/vector_scal_spmd.c.
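
The field can for instance be set programmatically at initialization time (a minimal sketch):

struct starpu_conf conf;
starpu_conf_init(&conf);
/* Run at most one parallel (combined-worker) task at a time. */
conf.single_combined_worker = 1;
starpu_init(&conf);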

2.7 Synchronization Tasks

For the application's convenience, it may be useful to define tasks which do not actually perform any computation, but which for instance bear dependencies between other tasks or tags, or are to be submitted in callbacks, etc.

The obvious way is of course to make the kernel functions empty, but such a task will then still have to wait for a worker to become ready, transfer data, etc.

A much lighter way to define a synchronization task is to set its field starpu_task::cl to NULL. The task will thus be a mere synchronization point, without any data access or execution content: as soon as its dependencies become available, it will terminate, call the callbacks, and release dependencies.
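
For instance (a minimal sketch; taskA and taskB are assumed to be tasks created elsewhere, and my_callback a hypothetical application callback):

struct starpu_task *sync_task = starpu_task_create();
/* No codelet: the task is a pure synchronization point. */
sync_task->cl = NULL;
/* It only terminates once taskA and taskB have terminated. */
struct starpu_task *deps[2] = { taskA, taskB };
starpu_task_declare_deps_array(sync_task, 2, deps);
sync_task->callback_func = my_callback;
starpu_task_submit(sync_task);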

An intermediate solution is to define a codelet with its field starpu_codelet::where set to STARPU_NOWHERE, for instance:

struct starpu_codelet cl =
{
    .where = STARPU_NOWHERE,
    .nbuffers = 1,
    .modes = { STARPU_R },
};

struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
starpu_task_submit(task);

will create a task which simply waits for the value of handle to be available for read. This task can then be depended on, etc. A full example is available in examples/filters/fmultiple_manual.c.

StarPU provides starpu_task_create_sync() to create a new synchronization task, similar to the previous example but without submitting it. The function starpu_create_sync_task() creates and submits a synchronization task which waits for specific tags and calls a given callback function when it terminates. The function starpu_create_callback_task() creates and submits a synchronization task which completes immediately and calls the specified callback function right after.
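
For instance, starpu_create_callback_task() could be used as follows (a minimal sketch; done_cb is a hypothetical application callback):

static void done_cb(void *arg)
{
    /* Called right after the synchronization task completes. */
    (void) arg;
}

/* From application code: create and submit a synchronization task
 * calling done_cb on completion. */
starpu_create_callback_task(done_cb, NULL);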