Friday, July 29, 2011

Making your application code multicore ready

Many silicon vendors rely on multicore architectures to improve performance. However, some engineers might say that those vendors have not been able to deliver compilation tools that have kept pace with the architecture improvements. The tools that are available require both a good understanding of the application and a deep understanding of the target platform.

An alternative approach is possible, however. This article highlights a way to parallelize complex applications in a short time span, without requiring a deep understanding of either the application or the target platform.

This can be achieved with an interactive mapping and partitioning design flow. The flow visualizes the application's behavior and lets the user interactively explore feasible multithreaded versions. The program semantics are guaranteed to be preserved under the proposed transformations.

In many cases the design of an embedded system starts with a software collection that has not yet been partitioned to match the multicore structure of the target hardware.

As a result, the software does not meet its performance requirements and hardware resources are left idle. To resolve this, an expert (or a team of experts) comes in to change the software so that it fits the target multicore structure.

Current multicore design flow practice



Figure 1 Traditional multicore design practice involves an iterative process of analysis, partitioning, loop parallelization, incorporation of semaphores, analysis, testing and retuning of code.


A typical approach, illustrated in Figure 1 above, would include the following steps:

1. Analyze the application. Find the bottlenecks.

2. Partition the software over the available cores. This requires a good understanding of the data access patterns and data communication inside the application, so that they can be matched with the cache architecture and the available bandwidth of the buses and channels on the target platform. Optimize some of the software kernels for the target instruction set (e.g. Intel SSE, ARM NEON); a small kernel-tuning sketch follows this list.

3. Identify the loops that can be parallelized. This requires a good understanding of the application: find the true data dependencies, the anti- and output dependencies, and the shared variables. The dependencies can be hidden very deeply, and finding them often requires complex pointer analysis. (The two loop examples after this list illustrate the difference between independent and dependent iterations.)

4. Predict the speedup. Estimate the synchronization overhead and the cost of creating and joining threads. Predict the impact of the additional cache overhead introduced by distributing the workload over multiple CPUs. If parallelizing a loop still seems worth it, go to the next step. (A back-of-the-envelope estimate based on Amdahl's law is sketched after this list.)

5. Change the software to introduce semaphores, FIFOs and other communication and synchronization mechanisms. Add calls to create and join threads. This requires a good understanding of the APIs available on the target platform. In this stage subtle bugs are often introduced, related to data races, deadlock or livelock, that may only manifest themselves much later, e.g. after the product has been shipped to the customer. (A threaded-summation sketch follows this list.)

6. Test. Does it seem to function correctly? Measure. Does the system achieve the required performance level? If not, observe and probe the system. Tooling exists to observe the system, but the experts need to interpret these low-level observations in the context of their expert knowledge of the system and then draw conclusions.

7. Try again to improve performance or handle data races and deadlocks. This involves repeating the above from Step 1.
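To make step 2's kernel tuning concrete, here is a minimal sketch of a vector-add kernel rewritten with Intel SSE intrinsics. The function name and the unaligned loads are illustrative choices, not taken from the article; a real port would also consider data alignment and the platform's cache line size.

    #include <xmmintrin.h>   /* Intel SSE intrinsics */

    /* Hypothetical kernel: c[i] = a[i] + b[i], four floats per step. */
    void vec_add_sse(const float *a, const float *b, float *c, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats, unaligned */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
        }
        for (; i < n; i++)                     /* scalar tail */
            c[i] = a[i] + b[i];
    }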
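The dependency analysis of step 3 boils down to questions like the following. These two loops are illustrative examples only:

    /* Independent iterations: each c[i] depends only on a[i] and b[i],
       so the iterations can be distributed over several cores. */
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];

    /* Loop-carried (true) dependency: each s[i] needs s[i-1], so the
       iterations must run in order unless the recurrence is rewritten,
       e.g. as a parallel prefix sum. */
    for (i = 1; i < n; i++)
        s[i] = s[i - 1] + x[i];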
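For step 4, a first-order speedup prediction can be made with Amdahl's law extended with an overhead term. The fractions and the 2% overhead below are made-up numbers for illustration, not measurements of any particular platform:

    #include <stdio.h>

    /* speedup = 1 / ((1 - p) + p / n + overhead), where p is the
       parallelizable fraction and n is the number of cores. */
    static double predicted_speedup(double p, int n, double overhead)
    {
        return 1.0 / ((1.0 - p) + p / n + overhead);
    }

    int main(void)
    {
        /* 80% parallelizable, 4 cores, ~2% thread/synchronization cost:
           roughly 2.4x rather than the ideal 4x. */
        printf("predicted speedup: %.2fx\n", predicted_speedup(0.80, 4, 0.02));
        return 0;
    }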
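Step 5 is where the subtle bugs creep in. The sketch below uses POSIX threads to split a summation over two worker threads; the mutex around the shared accumulator is exactly the kind of detail that, if forgotten, produces a data race the tests of step 6 may never catch. The names and the fixed two-thread split are illustrative assumptions, not part of the article's flow.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000

    static double data[N];
    static double total = 0.0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    struct range { int begin, end; };

    static void *partial_sum(void *arg)
    {
        struct range *r = arg;
        double local = 0.0;
        for (int i = r->begin; i < r->end; i++)
            local += data[i];
        pthread_mutex_lock(&lock);    /* omit this and the update races */
        total += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        struct range r1 = { 0, N / 2 }, r2 = { N / 2, N };
        for (int i = 0; i < N; i++) data[i] = 1.0;
        pthread_create(&t1, NULL, partial_sum, &r1);
        pthread_create(&t2, NULL, partial_sum, &r2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("total = %f\n", total);
        return 0;
    }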

Close analysis of Figure 1 clearly shows that there are many problems with this design flow. Experts who can successfully complete this flow are a rare breed. Even if you can find them, at the start of a project it is hard to predict how many improvement and bug-fix iterations the experts will need to go through before the system stabilizes.

Multicore platforms are quickly becoming a very attractive option in terms of their cost-performance ratio. But they also become more complex every year, making it harder for developers to exploit their benefits. Therefore we need a new design flow that enables any software developer to program multicore platforms. This flow is depicted in Figure 2 below.



Figure 2 Multicore design flow that enables any software developer

In this alternative flow, a tool analyzes the program before it is partitioned. It finds the loops that can be parallelized and devises a synchronization strategy for these loops. The tool also has detailed knowledge of the target platform and it can estimate the cost of different partitioning and synchronization strategies.
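The article does not name the tool or describe its output format, but conceptually the result resembles what a developer would write by hand with a directive-based approach. The OpenMP fragment below is only an analogy for the kind of transformation such a tool might propose; the tool would additionally choose the number of threads and the schedule to fit the target platform.

    /* Compile with -fopenmp. Analogy only: a directive-based
       parallelization of the independent loop shown earlier. */
    void multiply_arrays(const float *a, const float *b, float *c, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }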

