StarPU Handbook
27. Advanced Data Management

27.1 Data Interface with Variable Size

Besides the data interfaces already available in StarPU, mentioned in Data Interface, tasks are actually allowed to change the size of data interfaces.

The simplest case is just changing the amount of data actually used within the allocated buffer. This is for instance implemented for the matrix interface: one can set the new NX/NY values with STARPU_MATRIX_SET_NX(), STARPU_MATRIX_SET_NY(), and STARPU_MATRIX_SET_LD() at the end of the task implementation. Data transfers achieved by StarPU will then use these values instead of the whole allocated size. The new values of course need to fit within the original allocation. To reserve room for increasing the NX/NY values, one can use starpu_matrix_data_register_allocsize() instead of starpu_matrix_data_register(), to specify the allocation size to be used instead of the default NX*NY*ELEMSIZE. The same is available for vectors: starpu_vector_data_register_allocsize() specifies the allocation size to be used instead of the default NX*ELEMSIZE.

To support this, the data interface has to implement the functions starpu_data_interface_ops::alloc_footprint, starpu_data_interface_ops::alloc_compare, and starpu_data_interface_ops::reuse_data_on_node for proper StarPU allocation management. It might also be useful to implement starpu_data_interface_ops::cache_data_on_node, otherwise StarPU will just call memcpy().
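To make the NX/NY/LD semantics concrete, here is a plain-C sketch (toy helper names, no StarPU API) of how a kernel addresses an nx x ny view stored inside a larger allocation whose leading dimension is ld; shrinking nx/ny merely restricts which part of the allocated buffer computations and transfers touch:

```c
#include <assert.h>
#include <stddef.h>

/* Toy view of the matrix interface layout: element (i, j) of an
 * nx x ny view with leading dimension ld lives at buf[j * ld + i].
 * The allocation holds ld * ny elements; only nx <= ld elements of
 * each column stride are considered part of the data. */
static float get_elem(const float *buf, size_t ld, size_t i, size_t j)
{
    return buf[j * ld + i];
}

/* Sum only the nx x ny view, ignoring the padding elements. */
static float sum_view(const float *buf, size_t ld, size_t nx, size_t ny)
{
    float s = 0.0f;
    for (size_t j = 0; j < ny; j++)
        for (size_t i = 0; i < nx; i++)
            s += get_elem(buf, ld, i, j);
    return s;
}
```

With a 4x4 allocation, a 2x2 view only ever touches four of the sixteen elements, which is what StarPU's computations and transfers do once NX/NY are reduced.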

A more involved case is changing the amount of allocated data. The task implementation can just reallocate the buffer during its execution, and set the proper new values in the interface structure, e.g. nx, ny, ld, etc. so that the StarPU core knows the new data layout. The structure starpu_data_interface_ops however then needs to have the field starpu_data_interface_ops::dontcache set to 1, to prevent StarPU from trying to perform any cached allocation, since the allocated size will vary. An example is available in tests/datawizard/variable_size.c. The example uses its own data interface to contain some simulation information for data growth, but the principle can be applied for any data interface.

The principle is to use starpu_malloc_on_node_flags() to make the new allocation, and use starpu_free_on_node_flags() to release any previous allocation. The flags have to be precisely like in the example:

unsigned workerid = starpu_worker_get_id_check();
unsigned dst_node = starpu_worker_get_memory_node(workerid);
uintptr_t new_ptr = starpu_malloc_on_node_flags(dst_node, interface->size + increase,
                        STARPU_MALLOC_PINNED | STARPU_MALLOC_COUNT | STARPU_MEMORY_OVERFLOW);
starpu_free_on_node_flags(dst_node, interface->ptr, interface->size,
                        STARPU_MALLOC_PINNED | STARPU_MALLOC_COUNT | STARPU_MEMORY_OVERFLOW);
interface->ptr = new_ptr;
interface->size += increase;

so that the allocated area has the expected properties and the allocation is properly accounted for.

Depending on the interface (vector, CSR, etc.) you may have to fix several fields of the data interface: e.g. both nx and allocsize for vectors, and store the pointer both in ptr and dev_handle.

Some interfaces make a distinction between the actual number of elements stored in the data and the actually allocated buffer. For instance, the vector interface uses the nx field for the former, and the allocsize for the latter. This allows for lazy reallocation to avoid reallocating the buffer every time to exactly match the actual number of elements. Computations and data transfers will use the field nx, while allocation functions will use the field allocsize. One just has to make sure that allocsize is always bigger or equal to nx.
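The invariant can be sketched in plain C (toy structure, not StarPU's actual vector interface): growth reallocates only when nx would exceed the capacity, and grows geometrically so repeated enlargements stay cheap, while shrinking never reallocates:

```c
#include <assert.h>
#include <stdlib.h>

/* Plain-C sketch of the nx/allocsize distinction: nx is the element
 * count seen by computations and transfers, allocsize the capacity of
 * the underlying buffer in bytes. */
struct toy_vector
{
    float *ptr;
    size_t nx;        /* elements in use */
    size_t allocsize; /* bytes actually allocated */
};

/* Change nx, reallocating lazily and geometrically when needed.
 * Returns 0 on success, -1 on allocation failure. */
static int toy_vector_set_nx(struct toy_vector *v, size_t new_nx)
{
    size_t needed = new_nx * sizeof(float);
    if (needed > v->allocsize)
    {
        size_t new_alloc = v->allocsize ? v->allocsize : sizeof(float);
        while (new_alloc < needed)
            new_alloc *= 2;
        float *p = realloc(v->ptr, new_alloc);
        if (!p)
            return -1;
        v->ptr = p;
        v->allocsize = new_alloc;
    }
    v->nx = new_nx; /* invariant: nx * sizeof(float) <= allocsize */
    return 0;
}
```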

Important note: one cannot change the size of a partitioned piece of data.

27.2 Data Management Allocation

When the application allocates data, whenever possible it should use starpu_malloc(), which asks CUDA or OpenCL to perform the allocation itself and pins the allocated memory (a basic example is in examples/basic_examples/block.c). Alternatively, memory allocated by other means, such as local arrays, can be pinned with starpu_memory_pin() (a basic example is in examples/basic_examples/vector_scal.c). Pinning is needed to permit asynchronous data transfers, i.e. to let data transfers overlap with computation. Otherwise, the trace will show the state DriverCopyAsync taking a lot of time: CUDA or OpenCL then reverts to synchronous transfers. Before shutting down StarPU, the application should deallocate any memory previously allocated with starpu_malloc() by calling starpu_free() or, preferably, starpu_free_noflag(). If the application has pinned memory using starpu_memory_pin(), it should unpin it using starpu_memory_unpin() before freeing it.

If an application requires a specific alignment constraint for memory allocations made with starpu_malloc(), it can use the starpu_malloc_set_align() function to set the alignment requirement.

The application can provide its own allocation function by calling starpu_malloc_set_hooks(). StarPU will then use them for all data handle allocations in the main memory. An example is in examples/basic_examples/hooks.c.

StarPU provides several functions to monitor memory usage and availability on the system. The application can use starpu_memory_get_used() to monitor its own memory usage on a given node, starpu_memory_get_total_all_nodes() to get the total amount of memory across all memory nodes, starpu_memory_get_available_all_nodes() to get the amount of available memory across all memory nodes, and starpu_memory_get_used_all_nodes() to get the amount of used memory across all memory nodes.

By default, StarPU leaves replicates of data wherever they were used, in case they will be re-used by other tasks, thus saving the data transfer time. When some task modifies some data, all the other replicates are invalidated, and only the processing unit which ran this task will have a valid replicate of the data. If the application knows that this data will not be re-used by further tasks, it should advise StarPU to immediately replicate it to a desired list of memory nodes (given through a bitmask). This can be understood like the write-through mode of CPU caches.

starpu_data_set_wt_mask(img_handle, 1<<0);

will for instance request to always automatically transfer a replicate into the main memory (node 0), as bit 0 of the write-through bitmask is being set. An example is available in examples/pi/pi.c.

starpu_data_set_wt_mask(img_handle, ~0U);

will request to always automatically broadcast the updated data to all memory nodes. An example is available in tests/datawizard/wt_broadcast.c.

Setting the write-through mask to ~0U can also be useful to make sure all memory nodes always have a copy of the data, so that it is never evicted when memory gets scarce.
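The bitmask convention itself is ordinary bit manipulation; as a plain-C sketch (node numbers are illustrative), bit n of the mask selects memory node n:

```c
#include <assert.h>
#include <stdint.h>

/* Bit n set in the write-through mask means "push every update of the
 * data to memory node n". 1u << 0 selects node 0 only; ~0u selects
 * every node. */
static int wt_mask_includes_node(uint32_t wt_mask, unsigned node)
{
    return (wt_mask >> node) & 1u;
}
```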

Implicit data dependency computation can become expensive if a lot of tasks access the same piece of data. If no dependency is required on some piece of data (e.g. because it is only accessed in read-only mode, or because write accesses are actually commutative), use the function starpu_data_set_sequential_consistency_flag() to disable implicit dependencies on this data.

In the same vein, accumulation of results in the same piece of data can become a bottleneck. Using the access mode STARPU_REDUX makes it possible to optimize such accumulation (see Data Reduction). To a lesser extent, the flag STARPU_COMMUTE does not remove the bottleneck (see Commute Data Access), but at least allows the accumulation to happen in any order.

Applications often need a piece of data just for temporary results. In such a case, registration can be made without an initial value; for instance, this produces a vector data:

starpu_vector_data_register(&handle, -1, 0, n, sizeof(float));

StarPU will then allocate the actual buffer only when it is actually needed, e.g. directly on the GPU without allocating in main memory.

In the same vein, once the temporary results are not useful anymore, the data should be thrown away. If the handle is not to be reused, it can be unregistered:

starpu_data_unregister_submit(handle);

The actual unregistration will then be done after all tasks working on the handle have terminated.

One can also unregister the data handle by calling:

starpu_data_unregister_no_coherency(handle);

Unlike starpu_data_unregister(), this does not bring a valid copy of the data back into the buffer that was initially registered on the home node.

If the handle is to be reused, instead of unregistering it, it can simply be invalidated:

starpu_data_invalidate(handle);

or if the data transfer is asynchronous:

starpu_data_invalidate_submit(handle);

The buffers containing the current value will then be freed, and reallocated only when another task writes some value to the handle. A basic example is available in the file tests/datawizard/data_invalidation.c.

27.3 Data Access

To access registered data outside tasks, the application can call starpu_data_acquire(). The access mode can be read-only (STARPU_R), write-only (STARPU_W), or read-write (STARPU_RW). This provides an up-to-date copy of the handle in the memory node where the data was originally registered. The application can instead call starpu_data_acquire_try(), which does not block: it fails (returns a non-zero value) if previously-submitted tasks working on the data have not yet completed. starpu_data_release() must be called once the application no longer needs to access the piece of data, or starpu_data_release_to() to only partially release it. Registered data can also be accessed from a given memory node by calling starpu_data_acquire_on_node(), or its non-blocking variant starpu_data_acquire_on_node_try(). Correspondingly, starpu_data_release_on_node() must be called once the application no longer needs to access the piece of data, with a node parameter exactly matching the one of the corresponding starpu_data_acquire_on_node() call; starpu_data_release_to_on_node() can be used to only partially release it.

The application may also access the requested data asynchronously: starpu_data_acquire_cb() acquires the data and executes a callback once it is available, and starpu_data_acquire_cb_sequential_consistency() additionally allows enabling or disabling data dependencies. The callback function must call starpu_data_release() once the application no longer needs to access the piece of data, or starpu_data_release_to() to only partially release it. Similarly, registered data can be accessed from a given memory node instead of main memory with starpu_data_acquire_on_node_cb() and starpu_data_acquire_on_node_cb_sequential_consistency(); in that case starpu_data_release_on_node() must be called once the application no longer needs the data, or starpu_data_release_to_on_node() to only partially release it.

27.4 Data Prefetch

The scheduling policies heft, dmda and pheft perform data prefetch (see STARPU_PREFETCH): as soon as a scheduling decision is taken for a task, requests are issued to transfer its required data to the target processing unit, if needed, so that when the processing unit actually starts the task, its data will hopefully be already available, and it will not have to wait for the transfer to finish.

The application may want to perform some manual prefetching, for several reasons such as excluding initial data transfers from performance measurements, or setting up an initial statically-computed data distribution on the machine before submitting tasks, which will thus guide StarPU toward an initial task distribution (since StarPU will try to avoid further transfers).

This can be achieved by giving the function starpu_data_prefetch_on_node() the handle and the desired target memory node. An example is available in the file tests/microbenchs/prefetch_data_on_node.c. The variant starpu_data_idle_prefetch_on_node() can be used to issue the transfer only when the bus is idle. One can also call starpu_data_request_allocation() to request the allocation of a piece of data on a given memory node, check whether the allocation has been done there with starpu_data_test_if_allocated_on_node(), and check whether the data has been mapped there with starpu_data_test_if_mapped_on_node().

To request with higher priority that data be replicated to a given node as soon as possible, so that it is available there for tasks, one can call starpu_data_fetch_on_node(). The variants starpu_data_prefetch_on_node_prio() and starpu_data_idle_prefetch_on_node_prio() behave like starpu_data_prefetch_on_node() and starpu_data_idle_prefetch_on_node() respectively, but additionally take a priority for the transfer.

Conversely, one can advise StarPU that some data will not be useful in the near future by calling starpu_data_wont_use(). StarPU will then write its value back to its home node, and evict it from GPUs when room is needed. An example is available in the file tests/datawizard/partition_wontuse.c. One can also ask StarPU to evict data from a memory node directly by calling starpu_data_evict_from_node(), but this may fail if e.g. some tasks are still working on that memory node; starpu_data_can_evict() can be used beforehand to check whether the data can be evicted. In general, starpu_data_wont_use() is the recommended approach.

One can query the status of a handle on a given memory node by calling starpu_data_query_status2() or starpu_data_query_status(). starpu_memchunk_tidy() can be called periodically to tidy up the available memory on a given memory node.

27.5 Manual Partitioning

Besides the partitioning functions described in Partitioning Data and Asynchronous Partitioning, one can also handle partitioning by hand, by registering several views on the same piece of data. The idea is then to manage the coherency of the various views through the common buffer in the main memory. examples/filters/fmultiple_manual.c is a complete example using this technique.

In short, we first register the same matrix several times:

starpu_matrix_data_register(&handle, STARPU_MAIN_RAM, (uintptr_t)matrix, NX, NX, NY, sizeof(matrix[0][0]));
for (i = 0; i < PARTS; i++)
        starpu_matrix_data_register(&vert_handle[i], STARPU_MAIN_RAM, (uintptr_t)&matrix[0][i*(NX/PARTS)], NX, NX/PARTS, NY, sizeof(matrix[0][0]));

Since StarPU is not aware that the two handles are actually pointing to the same data, we have a danger of inadvertently submitting tasks to both views, which will bring a mess since StarPU will not guarantee any coherency between the two views. To make sure we don't do this, we invalidate the view that we will not use:

for (i = 0; i < PARTS; i++)
starpu_data_invalidate(vert_handle[i]);

Then we can safely work on handle.

When we want to switch to the vertical slice view, all we need to do is bring coherency between them by running an empty task on the home node of the data:

struct starpu_codelet cl_switch =
{
        .where = STARPU_NOWHERE,
        .nbuffers = 3,
        .specific_nodes = 1,
        .nodes = { STARPU_MAIN_RAM, STARPU_MAIN_RAM, STARPU_MAIN_RAM },
};

ret = starpu_task_insert(&cl_switch, STARPU_RW, handle,
                         STARPU_W, vert_handle[0],
                         STARPU_W, vert_handle[1],
                         0);

The execution of the task switch will get back the matrix data into the main memory, and thus the vertical slices will get the updated value there.

Again, we prefer to make sure that we don't accidentally access the matrix through the whole-matrix handle:

starpu_data_invalidate_submit(handle);

Note: when enabling a set of handles in this way, the set must not have any overlap, i.e. the handles of the set must not have any part of the data in common, otherwise StarPU will not properly handle concurrent accesses between them.

And now we can start using vertical slices, etc.

27.6 Data handles helpers

Functions starpu_data_set_user_data() and starpu_data_get_user_data() are used to associate user-defined data with a specific data handle. One can set or retrieve the field user_data of the data handle by calling these two functions respectively. Similarly, functions starpu_data_set_sched_data() and starpu_data_get_sched_data() are used to associate scheduling-related data with a specific data handle. One can set or retrieve the field sched_data of the data handle by calling these two functions respectively. One can set a name for a data handle by calling starpu_data_set_name().

One can call starpu_data_register_same() to register a new piece of data into a data handle with the same interface as the specified data handle. If necessary, one can register a void interface by using starpu_void_data_register(). There is no data really associated to this interface, but it may be used as a synchronization mechanism.

One can call starpu_data_cpy() or starpu_data_cpy_priority() to copy data from one memory location to another; the latter additionally allows the application to specify a priority value for the copy operation: the higher the priority value, the sooner the copy will be scheduled and executed. One can also call starpu_data_dup_ro() to duplicate data; this function creates a new read-only data block that is an exact copy of the original one, and which can be used independently of the original for read-only access.

starpu_data_pack_node() and starpu_data_pack() pack a piece of data into a binary buffer, on a given node or on the local memory node respectively. starpu_data_peek_node() and starpu_data_peek() read, into the handle's replicate on a given node or on the local node respectively, the data located at the given pointer. starpu_data_unpack_node() and starpu_data_unpack() unpack a piece of data from a binary buffer, on a given node or on the local memory node respectively.

StarPU provides several functions for querying the size and memory allocation of variable-size data:

  • starpu_data_get_size() returns the size in bytes of the data associated with a handle, i.e. the size of the actual data stored in memory.
  • starpu_data_get_alloc_size() returns the amount of memory that has been allocated in anticipation for the data associated with a handle. This may be larger than the actual data size, due to alignment requirements or other implementation details.
  • starpu_data_get_max_size() returns the maximum size of a handle's data that can be allocated by StarPU.

One can call starpu_data_get_home_node() to retrieve the identifier of the node on which the data handle is originally stored. One can call starpu_data_print() to print basic information about the data handle and the node to the specified file.

27.7 Handles data buffer pointers

A simple way to understand a StarPU handle is as a collection of buffers on the memory nodes of the machine, all containing the same data. The picture is however made more complex by OpenCL support and by partitioning.

When partitioning a handle, the data buffers of the subhandles will indeed be inside the data buffers of the main handle (to save transferring data back and forth between the main handle and the subhandles). But in OpenCL, a cl_mem is not a pointer, but an opaque value on which pointer arithmetic can not be used. That is why data interfaces contain three fields: dev_handle, offset, and ptr.

  • The field dev_handle is what the allocation function returned, and one can not do arithmetic on it.
  • The field offset is the offset inside the allocated area, most often it will be 0 because data start at the beginning of the allocated area, but when the handle is partitioned, the subhandles will have varying offset values, for each subpiece.
  • The field ptr, in the non-OpenCL case, i.e. when pointer arithmetic can be used on dev_handle, is just the sum of dev_handle and offset, provided for convenience.

This means that:

  • computation kernels can use ptr in non-OpenCL implementations.
  • computation kernels have to use dev_handle and offset in the OpenCL implementation.
  • allocation methods of data interfaces have to store the value returned by starpu_malloc_on_node() in dev_handle and ptr, and set offset to 0.
  • partitioning filters have to copy over dev_handle without modifying it, set in the child different values of offset, and set ptr accordingly as the sum of dev_handle and offset.
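The rules above can be sketched in plain C (toy structure and helper names, not StarPU's actual filter API) for a filter cutting a vector into equal blocks:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy analog of the dev_handle/offset/ptr convention. dev_handle is
 * treated as opaque (think cl_mem); only offset carries the position
 * of a child inside the parent allocation. */
struct toy_iface
{
    uintptr_t dev_handle; /* what the allocation function returned */
    size_t offset;        /* byte offset inside the allocation */
    uintptr_t ptr;        /* dev_handle + offset, in the non-OpenCL case */
    size_t nx;            /* number of elements */
};

/* Fill child i of nparts equal blocks (assumes nx divisible by nparts):
 * copy dev_handle untouched, set the child's offset, recompute ptr. */
static void toy_filter_block(const struct toy_iface *parent,
                             struct toy_iface *child,
                             unsigned i, unsigned nparts, size_t elemsize)
{
    size_t child_nx = parent->nx / nparts;
    child->dev_handle = parent->dev_handle;
    child->offset = parent->offset + (size_t)i * child_nx * elemsize;
    child->ptr = child->dev_handle + child->offset;
    child->nx = child_nx;
}
```

The OpenCL case is the one that forbids shortcuts: only dev_handle and offset are meaningful there, so a filter must never fold the offset into dev_handle.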

We can call starpu_data_handle_to_pointer() to get ptr associated with the data handle, or call starpu_data_get_local_ptr() to get the local pointer associated with the data handle.

Examples in the directory examples/interface/complex_dev_handle/ show how to generate and implement an interface supporting OpenCL.

To better notice the difference between simple ptr and dev_handle + offset, one can compare examples/interface/complex_interface.c vs examples/interface/complex_dev_handle/complex_dev_handle_interface.c and examples/interface/complex_filters.c vs examples/interface/complex_dev_handle/complex_dev_handle_filters.c.

27.8 Defining A New Data Filter

StarPU provides a series of predefined filters in Data Partition, but additional filters can be defined by the application. The principle is that the filter function just fills the memory location of the i-th subpart of a data. Examples are provided in src/datawizard/interfaces/*_filters.c, check starpu_data_filter::filter_func for further details. The helper function starpu_filter_nparts_compute_chunk_size_and_offset() can be used to compute the division of pieces of data.
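As an illustration of the division such a helper performs (this is a sketch of the idea under the usual convention of spreading the remainder over the first chunks, not StarPU's exact implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Split n elements into nparts chunks, giving one extra element to
 * each of the first (n % nparts) chunks, and compute the element
 * count and byte offset of chunk id. */
static void toy_compute_chunk(size_t n, unsigned nparts, size_t elemsize,
                              unsigned id, size_t *chunk_size, size_t *offset)
{
    size_t base = n / nparts;
    size_t rem = n % nparts;
    *chunk_size = base + (id < rem ? 1 : 0);
    /* chunks before id: each has base elements, and min(id, rem) of
     * them got one extra element */
    size_t before = (size_t)id * base + (id < rem ? id : rem);
    *offset = before * elemsize;
}
```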

27.9 Defining A New Data Interface

This section shows an example of how to define your own data interface, when the StarPU-provided interfaces do not fit your needs. We take the simple example of an array of complex numbers represented by two arrays of double values. The full source code is in examples/interface/complex_interface.c and examples/interface/complex_interface.h.

Let's thus define a new data interface to manage arrays of complex numbers:

/* interface for complex numbers */
struct starpu_complex_interface
{
double *real;
double *imaginary;
int nx;
};

This structure stores enough information to describe one buffer of this kind of data. One instance is used for the buffer stored in main memory, another for the buffer stored in a GPU, and so on. A data handle is thus a collection of such structures, describing each buffer on each memory node.

Note: one should not make pointers that point into such structures, because StarPU needs to be able to copy over the content of it to various places, for instance to efficiently migrate a data buffer from one data handle to another data handle, so the actual address of the structure may vary.

27.9.1 Data registration

Registering such a data to StarPU is easily done using the function starpu_data_register(). The last parameter of the function, interface_complex_ops, will be described below.

void starpu_complex_data_register(starpu_data_handle_t *handleptr,
unsigned home_node, double *real, double *imaginary, int nx)
{
struct starpu_complex_interface complex =
{
.real = real,
.imaginary = imaginary,
.nx = nx
};
starpu_data_register(handleptr, home_node, &complex, &interface_complex_ops);
}

The struct starpu_complex_interface complex is here used just to store the parameters provided by the user to starpu_complex_data_register(). starpu_data_register() will first allocate the handle, and then pass the structure starpu_complex_interface to the method starpu_data_interface_ops::register_data_handle, which records them within the data handle; this method is called once by starpu_data_register() and itself iterates over the memory nodes:

static void complex_register_data_handle(starpu_data_handle_t handle, int home_node, void *data_interface)
{
        struct starpu_complex_interface *complex_interface = (struct starpu_complex_interface *) data_interface;
        unsigned node;
        for (node = 0; node < STARPU_MAXNODES; node++)
        {
                struct starpu_complex_interface *local_interface = (struct starpu_complex_interface *)
                        starpu_data_get_interface_on_node(handle, node);
                local_interface->nx = complex_interface->nx;
                if (node == home_node)
                {
                        local_interface->real = complex_interface->real;
                        local_interface->imaginary = complex_interface->imaginary;
                }
                else
                {
                        local_interface->real = NULL;
                        local_interface->imaginary = NULL;
                }
        }
}

If the application provided a home node, the corresponding pointers are recorded for that node; the other nodes have no buffer allocated yet. If the interface needs some dynamic allocation (e.g. to store an array of dimensions whose size can vary), the corresponding deallocation should then be done in starpu_data_interface_ops::unregister_data_handle.

Different operations need to be defined for a data interface through the type starpu_data_interface_ops. We only define here the basic operations needed to run simple applications. The source code for the different functions can be found in the file examples/interface/complex_interface.c, and the details of the hooks to be provided are documented in starpu_data_interface_ops.

static struct starpu_data_interface_ops interface_complex_ops =
{
        .register_data_handle = complex_register_data_handle,
        .allocate_data_on_node = complex_allocate_data_on_node,
        .copy_methods = &complex_copy_methods,
        .get_size = complex_get_size,
        .footprint = complex_footprint,
        .interfaceid = STARPU_UNKNOWN_INTERFACE_ID,
        .interface_size = sizeof(struct starpu_complex_interface),
};

The field starpu_data_interface_ops::interfaceid should be defined to STARPU_UNKNOWN_INTERFACE_ID when defining the interface, its value will be updated the first time a data is registered through the new data interface.

Convenience functions can be defined to access the different fields of the complex interface from a StarPU data handle after a call to starpu_data_acquire():

double *starpu_complex_get_real(starpu_data_handle_t handle)
{
struct starpu_complex_interface *complex_interface =
(struct starpu_complex_interface *) starpu_data_get_interface_on_node(handle, STARPU_MAIN_RAM);
return complex_interface->real;
}
double *starpu_complex_get_imaginary(starpu_data_handle_t handle);
int starpu_complex_get_nx(starpu_data_handle_t handle);

Similar functions need to be defined to access the different fields of the complex interface from a void * pointer to be used within codelet implementations.

#define STARPU_COMPLEX_GET_REAL(interface) (((struct starpu_complex_interface *)(interface))->real)
#define STARPU_COMPLEX_GET_IMAGINARY(interface) (((struct starpu_complex_interface *)(interface))->imaginary)
#define STARPU_COMPLEX_GET_NX(interface) (((struct starpu_complex_interface *)(interface))->nx)

Complex data interfaces can then be registered to StarPU.

double real = 45.0;
double imaginary = 12.0;
starpu_complex_data_register(&handle1, STARPU_MAIN_RAM, &real, &imaginary, 1);
starpu_task_insert(&cl_display, STARPU_R, handle1, 0);

and used by codelets.

void display_complex_codelet(void *descr[], void *_args)
{
int nx = STARPU_COMPLEX_GET_NX(descr[0]);
double *real = STARPU_COMPLEX_GET_REAL(descr[0]);
double *imaginary = STARPU_COMPLEX_GET_IMAGINARY(descr[0]);
int i;
for(i=0 ; i<nx ; i++)
{
fprintf(stderr, "Complex[%d] = %3.2f + %3.2f i\n", i, real[i], imaginary[i]);
}
}

The whole code for this complex data interface is available in the directory examples/interface/.

27.9.2 Data footprint

A custom footprint function needs to be provided through the field starpu_data_interface_ops::footprint; it computes a hash summarizing the data layout (sizes, dimensions, etc.). StarPU provides several helper functions for this: starpu_hash_crc32c_be_n() computes the CRC of a byte buffer, starpu_hash_crc32c_be_ptr() the CRC of a pointer value, starpu_hash_crc32c_be() the CRC of a 32-bit number, and starpu_hash_crc32c_string() the CRC of a string.
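The idea can be sketched in plain C with a generic hash-combine (FNV-1a here purely for illustration; a real StarPU interface would use the starpu_hash_crc32c_be* helpers): the footprint folds the layout-determining fields, such as nx, into a single 32-bit value, so two handles with the same layout get the same footprint:

```c
#include <assert.h>
#include <stdint.h>

/* FNV-1a step: fold one 32-bit value into the running hash. */
static uint32_t toy_hash_combine(uint32_t h, uint32_t value)
{
    h ^= value;
    h *= 16777619u; /* FNV prime */
    return h;
}

/* Toy footprint for the complex interface: only nx determines the
 * layout, so only nx goes into the hash. */
static uint32_t toy_complex_footprint(uint32_t nx)
{
    return toy_hash_combine(2166136261u /* FNV offset basis */, nx);
}
```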

27.9.3 Data allocation

To be able to run tasks on GPUs etc. StarPU needs to know how to allocate a buffer for the interface. In our example, two allocations are needed in the allocation method complex_allocate_data_on_node(): one for the real part and one for the imaginary part.

static starpu_ssize_t complex_allocate_data_on_node(void *data_interface, unsigned node)
{
        struct starpu_complex_interface *complex_interface = (struct starpu_complex_interface *) data_interface;
        double *addr_real = NULL;
        double *addr_imaginary = NULL;
        starpu_ssize_t requested_memory = complex_interface->nx * sizeof(complex_interface->real[0]);

        addr_real = (double*) starpu_malloc_on_node(node, requested_memory);
        if (!addr_real)
                goto fail_real;
        addr_imaginary = (double*) starpu_malloc_on_node(node, requested_memory);
        if (!addr_imaginary)
                goto fail_imaginary;

        /* update the data properly in consequence */
        complex_interface->real = addr_real;
        complex_interface->imaginary = addr_imaginary;

        return 2*requested_memory;

fail_imaginary:
        starpu_free_on_node(node, (uintptr_t) addr_real, requested_memory);
fail_real:
        return -ENOMEM;
}

Here we try to allocate the two parts. If either allocation fails, we return -ENOMEM. If they succeed, we record the obtained pointers and return the total amount of allocated memory (for memory usage accounting).
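The same error-handling shape can be reproduced in plain C with malloc()/free() (toy function, no StarPU): allocate both parts, and if the second allocation fails, release the first before reporting -ENOMEM:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Two-stage allocation with goto cleanup, as in
 * complex_allocate_data_on_node(). Returns the number of bytes
 * allocated (for accounting), or -ENOMEM on failure. */
static long toy_allocate_complex(size_t nx, double **real, double **imaginary)
{
    double *r = malloc(nx * sizeof(double));
    if (!r)
        goto fail_real;
    double *im = malloc(nx * sizeof(double));
    if (!im)
        goto fail_imaginary;
    *real = r;
    *imaginary = im;
    return 2 * (long)(nx * sizeof(double));
fail_imaginary:
    free(r); /* undo the first allocation before reporting failure */
fail_real:
    return -ENOMEM;
}
```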

Conversely, complex_free_data_on_node() frees the two parts:

static void complex_free_data_on_node(void *data_interface, unsigned node)
{
        struct starpu_complex_interface *complex_interface = (struct starpu_complex_interface *) data_interface;
        starpu_ssize_t requested_memory = complex_interface->nx * sizeof(complex_interface->real[0]);

        starpu_free_on_node(node, (uintptr_t) complex_interface->real, requested_memory);
        starpu_free_on_node(node, (uintptr_t) complex_interface->imaginary, requested_memory);
}

We can call starpu_opencl_allocate_memory() to allocate memory on an OpenCL device.

We have not done anything particular for GPUs or the like: it is starpu_malloc_on_node() which knows how to actually perform the allocation on the requested memory node and return the resulting pointer, be it in main memory, in GPU memory, etc.

27.9.4 Data copy

Now that StarPU knows how to allocate/free a buffer, it needs to be able to copy over data into/from it. Defining a method copy_any_to_any() allows StarPU to perform direct transfers between main memory and GPU memory.

static int copy_any_to_any(void *src_interface, unsigned src_node,
                           void *dst_interface, unsigned dst_node,
                           void *async_data)
{
    struct starpu_complex_interface *src_complex = src_interface;
    struct starpu_complex_interface *dst_complex = dst_interface;
    int ret = 0;

    if (starpu_interface_copy((uintptr_t) src_complex->real, 0, src_node,
                              (uintptr_t) dst_complex->real, 0, dst_node,
                              src_complex->nx*sizeof(src_complex->real[0]),
                              async_data))
        ret = -EAGAIN;
    if (starpu_interface_copy((uintptr_t) src_complex->imaginary, 0, src_node,
                              (uintptr_t) dst_complex->imaginary, 0, dst_node,
                              src_complex->nx*sizeof(src_complex->imaginary[0]),
                              async_data))
        ret = -EAGAIN;
    return ret;
}

Here again we do not need to know what is main memory and what is GPU memory, or even whether the copy is synchronous or asynchronous: we just call starpu_interface_copy() with the pointers from the interface, and check whether it returned -EAGAIN, which means the copy is asynchronous; StarPU will then wait for its completion appropriately, thanks to the async_data pointer. Variants of this copy method are also available for 2D matrices (starpu_interface_copy2d()), 3D matrices (starpu_interface_copy3d()), 4D matrices (starpu_interface_copy4d()) and N-dim matrices (starpu_interface_copynd()).

starpu_interface_copy() will also manage copies between other devices such as CUDA devices, OpenCL devices, etc. But if necessary, we may manage these copies ourselves. StarPU provides three functions, starpu_cuda_copy_async_sync(), starpu_cuda_copy2d_async_sync() and starpu_cuda_copy3d_async_sync(), to copy 1D, 2D or 3D data between main memory and CUDA device memory. They first try to copy the data asynchronously; if that fails, or if the stream is NULL, they copy the data synchronously. StarPU also provides several functions to transfer data between RAM and OpenCL devices: starpu_opencl_copy_ram_to_opencl() copies data from RAM to an OpenCL device, starpu_opencl_copy_opencl_to_ram() copies data from an OpenCL device to RAM, and starpu_opencl_copy_opencl_to_opencl() copies data between two OpenCL devices. starpu_opencl_copy_async_sync() copies data between two devices: if event is NULL the copy is synchronous, otherwise it is asynchronous and the return value is -EAGAIN.

This copy method is referenced in a structure starpu_data_copy_methods:

static const struct starpu_data_copy_methods complex_copy_methods =
{
    .any_to_any = copy_any_to_any
};

which was referenced in the structure starpu_data_interface_ops above.

Other fields of starpu_data_copy_methods allow providing optimized variants, notably for the case of 2D or 3D matrix tiles with non-trivial ld.

We can call starpu_interface_data_copy() to record the copy in offline execution traces.

When an asynchronous implementation of the data transfer is implemented, we can call starpu_interface_start_driver_copy_async() and starpu_interface_end_driver_copy_async() to initiate and complete asynchronous data transfers between main memory and GPU memory.

27.9.5 Data pack/peek/unpack

The copy methods allow for RAM/GPU transfers, but they are not enough for e.g. transferring over MPI. That requires defining the pack/peek/unpack methods. The principle is that the method starpu_data_interface_ops::pack_data concatenates the buffer data into a newly-allocated contiguous bytes array; conversely, starpu_data_interface_ops::peek_data extracts from a bytes array into the buffer data, and starpu_data_interface_ops::unpack_data does the same as starpu_data_interface_ops::peek_data but also frees the bytes array.

static int complex_pack_data(starpu_data_handle_t handle, unsigned node, void **ptr, starpu_ssize_t *count)
{
    struct starpu_complex_interface *complex_interface = (struct starpu_complex_interface *)
        starpu_data_get_interface_on_node(handle, node);

    *count = complex_get_size(handle);
    if (ptr != NULL)
    {
        char *data;
        data = (char*) starpu_malloc_on_node_flags(node, *count, 0);
        *ptr = data;
        memcpy(data, complex_interface->real, complex_interface->nx*sizeof(double));
        memcpy(data+complex_interface->nx*sizeof(double), complex_interface->imaginary, complex_interface->nx*sizeof(double));
    }
    return 0;
}

complex_pack_data() first computes the size to be allocated, then allocates it, and copies the content of the two real and imaginary arrays into it.

static int complex_peek_data(starpu_data_handle_t handle, unsigned node, void *ptr, size_t count)
{
    char *data = ptr;
    struct starpu_complex_interface *complex_interface = (struct starpu_complex_interface *)
        starpu_data_get_interface_on_node(handle, node);

    STARPU_ASSERT(count == 2 * complex_interface->nx * sizeof(double));
    memcpy(complex_interface->real, data, complex_interface->nx*sizeof(double));
    memcpy(complex_interface->imaginary, data+complex_interface->nx*sizeof(double), complex_interface->nx*sizeof(double));
    return 0;
}

complex_peek_data() simply uses memcpy() to copy over from the bytes array into the data buffer.

static int complex_unpack_data(starpu_data_handle_t handle, unsigned node, void *ptr, size_t count)
{
    complex_peek_data(handle, node, ptr, count);
    starpu_free_on_node_flags(node, (uintptr_t) ptr, count, 0);
    return 0;
}

And complex_unpack_data() just calls complex_peek_data() and releases the bytes array.

27.9.6 Pointers inside the data interface

In the example described above, the two pointers stored in the data interface are data buffers, which may point into main memory, GPU memory, etc. One may also want to store pointers to meta-data for the interface, for instance the list of dimensions sizes for the n-dimension matrix interface, but such pointers are to be handled completely differently. More examples are provided in src/datawizard/interfaces/*_interface.c

More precisely, there are two types of pointers:

  • Data pointers, which point to the actual data in RAM/GPU/etc. memory. They may be NULL when the data is not allocated (yet). StarPU will automatically call starpu_data_interface_ops::allocate_data_on_node to allocate the data pointers whenever needed, and call starpu_data_interface_ops::free_data_on_node when memory gets scarce. For instance, for the n-dimension matrix interface the pointers to the actual data (ptr, dev_handle, offset) are data pointers.

  • Meta-data pointers, which always point to RAM memory. They are usually always allocated so that they can always be used. For instance, for the n-dimension matrix interface the array of dimension sizes and the array of ld are meta-data pointers.

This distinction has several consequences:

Note: for compressed matrices such as CSR, BCSR, COO, the colind and rowptr arrays are not meta-data pointers, but data pointers like nzval, because they need to be available in GPU memory for the GPU kernels.

Note: when the interface does not contain meta-data pointers, starpu_data_interface_ops::reuse_data_on_node does not need to be implemented, StarPU will just use a memcpy. Otherwise, either starpu_data_interface_ops::reuse_data_on_node must be used to transfer only the data pointers and not the meta-data pointers, or the allocation cache should be disabled by setting starpu_data_interface_ops::dontcache to 1.

Note: It should be noted that because of the allocation cache, starpu_data_interface_ops::free_data_on_node may be called on an interface which is not attached to a handle anymore. This means that the meta-data pointers will have been deallocated by starpu_data_interface_ops::unregister_data_handle, and cannot be used by starpu_data_interface_ops::free_data_on_node to e.g. compute the size to be deallocated. For instance, the n-dimension matrix interface uses an additional scalar allocsize field to store the allocation size, thus still available even when the interface is in the allocation cache.

Note: if starpu_data_interface_ops::unregister_data_handle is implemented and checks that pointers are NULL, starpu_data_interface_ops::cache_data_on_node needs to be implemented to clear the pointers when caching the allocation.

27.9.7 Helpers

We can get the unique identifier of the interface associated with the data handle by calling starpu_data_get_interface_id(), and get the next available identifier for a newly created data interface by calling starpu_data_interface_get_next_id().

27.10 The Multiformat Interface

It may be interesting to represent the same piece of data using two different data structures: one only used on CPUs, and one only used on GPUs. This can be done by using the multiformat interface. StarPU will be able to convert data from one data structure to the other when needed. Note that the scheduler dmda is the only one optimized for this interface. Users must provide StarPU with conversion codelets:

#define NX 1024
struct point array_of_structs[NX];
starpu_data_handle_t handle;

/*
 * The conversion of a piece of data is itself a task, though it is created,
 * submitted and destroyed by StarPU internals and not by the user. Therefore,
 * we have to define two codelets.
 * Note that for now the conversion from the CPU format to the GPU format has to
 * be executed on the GPU, and the conversion from the GPU to the CPU has to be
 * executed on the CPU.
 */
#ifdef STARPU_USE_OPENCL
void cpu_to_opencl_opencl_func(void *buffers[], void *args);
struct starpu_codelet cpu_to_opencl_cl =
{
    .opencl_funcs = { cpu_to_opencl_opencl_func },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};

void opencl_to_cpu_func(void *buffers[], void *args);
struct starpu_codelet opencl_to_cpu_cl =
{
    .cpu_funcs = { opencl_to_cpu_func },
    .cpu_funcs_name = { "opencl_to_cpu_func" },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
#endif

struct starpu_multiformat_data_interface_ops format_ops =
{
#ifdef STARPU_USE_OPENCL
    .opencl_elemsize = 2 * sizeof(float),
    .cpu_to_opencl_cl = &cpu_to_opencl_cl,
    .opencl_to_cpu_cl = &opencl_to_cpu_cl,
#endif
    .cpu_elemsize = 2 * sizeof(float),
    ...
};
starpu_multiformat_data_register(&handle, STARPU_MAIN_RAM, &array_of_structs, NX, &format_ops);

Kernels can be written almost as for any other interface. Note that STARPU_MULTIFORMAT_GET_CPU_PTR shall only be used for CPU kernels. CUDA kernels must use STARPU_MULTIFORMAT_GET_CUDA_PTR, and OpenCL kernels must use STARPU_MULTIFORMAT_GET_OPENCL_PTR. STARPU_MULTIFORMAT_GET_NX may be used in any kind of kernel.

static void
multiformat_scal_cpu_func(void *buffers[], void *args)
{
    struct point *aos;
    unsigned int n;

    aos = STARPU_MULTIFORMAT_GET_CPU_PTR(buffers[0]);
    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
    ...
}

extern "C" void multiformat_scal_cuda_func(void *buffers[], void *_args)
{
    unsigned int n;
    struct struct_of_arrays *soa;

    soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
    ...
}

A full example may be found in examples/basic_examples/multiformat.c.

27.11 Specifying A Target Node For Task Data

When executing a task on a GPU, for instance, StarPU would normally copy all the data needed by the task to the embedded memory of the GPU. It may however happen that the task kernel would rather have some of the data kept in main memory instead of copied to the GPU, a pivoting vector for instance. This can be achieved by setting the flag starpu_codelet::specific_nodes to 1, and then filling the array starpu_codelet::nodes (or starpu_codelet::dyn_nodes when starpu_codelet::nbuffers is greater than STARPU_NMAXBUFS) with the node numbers where data should be copied to, or STARPU_SPECIFIC_NODE_LOCAL to let StarPU copy it to the memory node where the task will be executed.

The function starpu_task_get_current_data_node() can be used to retrieve the memory node associated with the current task being executed.

STARPU_SPECIFIC_NODE_CPU can also be used to request data to be put in CPU-accessible memory (and let StarPU choose the NUMA node). STARPU_SPECIFIC_NODE_FAST and STARPU_SPECIFIC_NODE_SLOW can also be used, to request data to be put in fast (but probably size-limited) or slow (but large) memory, respectively.

For instance, with the following codelet:

struct starpu_codelet cl =
{
    .cuda_funcs = { kernel },
    .nbuffers = 2,
    .modes = {STARPU_RW, STARPU_RW},
    .specific_nodes = 1,
    .nodes = {STARPU_SPECIFIC_NODE_CPU, STARPU_SPECIFIC_NODE_LOCAL},
};

the first data of the task will be kept in the CPU memory, while the second data will be copied to the CUDA GPU as usual. A working example is available in tests/datawizard/specific_node.c.

With the following codelet:

struct starpu_codelet cl =
{
    .cuda_funcs = { kernel },
    .nbuffers = 2,
    .modes = {STARPU_RW, STARPU_RW},
    .specific_nodes = 1,
    .nodes = {STARPU_SPECIFIC_NODE_FAST, STARPU_SPECIFIC_NODE_SLOW},
};

The first data will be copied into fast (but probably size-limited) local memory, while the second data will be left in slow (but large) memory. This makes sense when the kernel does not make many accesses to the second data, so that accessing it remotely, e.g. over a PCI bus, is not a performance problem, and it avoids filling the fast local memory with data which does not need its performance.

In cases where the kernel is fine with some data being either local or in the main memory, STARPU_SPECIFIC_NODE_LOCAL_OR_CPU can be used. StarPU will then be free to leave the data in the main memory and let the kernel access it from accelerators, or to move it to the accelerator before starting the kernel, for instance:

struct starpu_codelet cl =
{
    .cuda_funcs = { kernel },
    .nbuffers = 2,
    .modes = {STARPU_RW, STARPU_R},
    .specific_nodes = 1,
    .nodes = {STARPU_SPECIFIC_NODE_LOCAL, STARPU_SPECIFIC_NODE_LOCAL_OR_CPU},
};

An example for specifying target node is available in tests/datawizard/specific_node.c.