In an article submitted to the arXiv* server, researchers proposed a unified, end-to-end, programmable graph representation learning (PGL) framework that enables heterogeneous computing systems to optimize program execution autonomously. As algorithm complexity rapidly increases across applications like autonomous vehicles and machine vision, flexible and efficient programming strategies are critical to fully utilizing heterogeneous hardware accelerators such as GPUs and achieving the necessary computational gains.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
The Need for Intelligent Mapping
Many real-world applications now rely on the parallel processing strengths of heterogeneous systems combining central processing units (CPUs) and graphics processing units (GPUs). However, conventional programming models often fail to fully exploit the hardware innovations in these platforms.
Novel machine learning techniques promise to overcome these limitations by learning cost models tailored to dynamic, heterogeneous systems. Such data-driven approaches can implicitly model hardware intricacies and adapt mappings accordingly without requiring extensive engineering knowledge. The proposed PGL framework focuses explicitly on accurately mapping software computations onto both CPUs and GPUs within a single application. For instance, parallel visualization tasks in autonomous driving are well suited to GPUs, while sequential decision-making code benefits from faster CPU clock speeds. Carefully orchestrating CPU/GPU cooperation can thus provide further speed and efficiency gains.
To solve this challenging optimization problem, the research team proposes representing real-world software computations as dependency graphs, where nodes are instructions and edges show information flow. Recently, machine learning models using graph representations of code have shown a superior ability to capture structure and semantics compared to natural language processing approaches relying on more straightforward token sequences. However, prior graph representations still have some fundamental limitations:
- Difficulty identifying memory dependencies: unhandled memory dependencies can significantly increase data communication overheads and reduce application performance.
- Inability to statically determine loop iteration counts: the workload balance between CPUs and GPUs depends greatly on predicting sequential versus parallel execution requirements.
Constructing Dynamic Execution Graphs
To address these drawbacks, PGL constructs more refined dynamic execution graphs using compiler traces. Nodes represent low-level intermediate instructions with different computation types (add, subtract, etc.). Edges delineate control, data, and memory dependencies between instructions, with weights indicating communication costs. Loop iteration counts and memory aliasing can thus be modeled, preserving runtime execution order.
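As a rough illustration of what such a dynamic execution graph might look like in code (a sketch of the idea, not the authors' implementation), the snippet below builds a tiny trace-derived graph with networkx; the instruction names, dependency labels, and communication costs are all hypothetical.

```python
# Illustrative sketch only: a toy dynamic execution graph in the spirit of PGL,
# with hypothetical instructions, dependency kinds, and communication costs.
import networkx as nx

g = nx.DiGraph()

# Nodes: low-level intermediate instructions, tagged with their computation type.
g.add_node("i0", op="load")    # load a[i]
g.add_node("i1", op="load")    # load b[i]
g.add_node("i2", op="add")     # a[i] + b[i]
g.add_node("i3", op="store")   # store c[i]
g.add_node("i4", op="branch")  # loop back-edge

# Edges: control, data, and memory dependencies, weighted by communication cost.
g.add_edge("i0", "i2", dep="data",    cost=4)
g.add_edge("i1", "i2", dep="data",    cost=4)
g.add_edge("i2", "i3", dep="data",    cost=4)
g.add_edge("i3", "i4", dep="control", cost=1)
g.add_edge("i3", "i0", dep="memory",  cost=8)  # possible aliasing across iterations

# Because the graph is built from an execution trace, a loop that runs N times
# simply contributes N copies of its body, preserving runtime execution order.
total_cost = sum(d["cost"] for _, _, d in g.edges(data=True))
print(f"{g.number_of_nodes()} instructions, total dependency cost {total_cost}")
```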
Common recurring patterns are visible even in zoomed-in graph segments, like star shapes indicating array usage within loops. Different fundamental shapes point to iteration, recursion, and sequential code building blocks. To quantitatively characterize graphs, PGL performs multifractal analysis, mathematically assessing topological complexity through concepts like generalized fractal dimensions. This analysis quantifies self-similar recurring patterns at different scales, validating PGL's ability to recognize meaningful structural differences.
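For readers unfamiliar with the terminology, the generalized (Rényi) fractal dimensions invoked here are usually defined as follows; this is the standard textbook form, stated for box-covering probabilities p_i(ℓ) at scale ℓ, rather than a formula quoted from the paper.

```latex
% Standard definition of the generalized (Renyi) dimensions of a measure
% covered by boxes of size \ell, where p_i(\ell) is the mass in box i.
D_q = \lim_{\ell \to 0} \frac{1}{q-1}\,
      \frac{\log \sum_i p_i(\ell)^q}{\log \ell},
\qquad q \neq 1 .
```

The width of the associated multifractal spectrum f(α), that is, the range of Lipschitz-Hölder exponents α it covers, is the "spectral width" referred to in the results below.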
Tracing the execution history does increase workflow complexity. However, the finer granularity provides software/hardware mapping flexibility lacking in prior dataflow representations, without necessarily increasing inter-cluster communication costs, since partitioning schemes can constrain most dependencies within common clusters. PGL's representations thus help expose optimization opportunities that stem from this finer-grained analysis.
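To make the inter-cluster communication argument concrete, one minimal way to score a candidate partition is to sum the weights of edges whose endpoints fall in different clusters; the sketch below, with a hypothetical helper (inter_cluster_cost) and made-up costs, illustrates this bookkeeping rather than PGL's actual refinement objective.

```python
# Illustrative sketch: score a partition of an execution graph by the total
# weight of edges whose endpoints land in different clusters (the traffic that
# would become inter-kernel communication). Hypothetical helper, not PGL code.
import networkx as nx

def inter_cluster_cost(g: nx.DiGraph, cluster_of: dict) -> float:
    """Sum of 'cost' weights on edges that cross cluster boundaries."""
    return sum(
        d.get("cost", 1.0)
        for u, v, d in g.edges(data=True)
        if cluster_of[u] != cluster_of[v]
    )

# Example: keep the tight data dependencies of a loop body in one cluster so
# only the cheaper control edge crosses the boundary.
g = nx.DiGraph()
g.add_edge("i0", "i1", cost=4)
g.add_edge("i1", "i2", cost=4)
g.add_edge("i2", "i3", cost=1)
print(inter_cluster_cost(g, {"i0": 0, "i1": 0, "i2": 0, "i3": 1}))  # -> 1
```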
Discovering Universal Optimization Principles
By assessing over 130 real-world software graphs, PGL uncovers an intriguing universal relationship tying multifractal properties to system-level performance metrics. A simple power law emerges linking software parallelization degree and communication overheads to metrics like spectral width and dominant Lipschitz exponents. This mathematical signature suggests common optimizing principles underlie efficient software-hardware mapping, regardless of domain specifics.
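The exact exponents belong to the paper's measurements, but the generic recipe for testing such a power law, y ≈ a·x^b, is a straight-line fit in log-log space; the sketch below uses synthetic data purely to show the procedure, with the variable names chosen for illustration.

```python
# Illustrative sketch: fit a power law y = a * x**b by linear regression in
# log-log space. The data here are synthetic; the paper's actual fits relate
# multifractal metrics (e.g. spectral width) to parallelization/communication.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.1, 2.0, 40)                      # e.g. spectral width per program
y = 3.0 * x**1.5 * rng.lognormal(0, 0.1, x.size)   # e.g. measured overhead

b, log_a = np.polyfit(np.log(x), np.log(y), 1)     # slope = exponent b
a = np.exp(log_a)
print(f"fitted power law: y ~ {a:.2f} * x^{b:.2f}")
```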
Notably, rare recurring motifs — indicative of complex dependencies — strongly predict ultimate parallelization capabilities. Similarly, spectrum heterogeneity correlates with communication overheads in line with intuition about coordinating more diverse code structures. This universality implies that coherent optimizing strategies transcend intricate application details.
These power laws give programmers wrestling with complex mappings a predictive tool: judging software by easily obtained fractal metrics offers a shortcut for setting reasonable expectations on achievable parallelization or communication costs. Moreover, software architects could use these universal scaling laws to guide future exascale system designs.
New Framework
Leveraging its multifractal insights, PGL introduces unsupervised and supervised learning components to address the ultimate mapping challenge: accurately predicting which code segments execute most efficiently on CPUs versus GPUs.
First, an autoencoder using graph embeddings and spectral clustering partitions large dynamic execution graphs into software kernels. Communication-minimizing refinements ensure efficient inter-kernel dependency management. Next, supervised graph neural networks (GNNs) like gated graph neural networks (GGNN) classify kernel embeddings, predicting optimal CPU vs GPU deployment. GGNN currently provides the best accuracy, given its inherent support for dynamic computational graphs.
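As a rough analogue of the first, unsupervised stage (again a sketch under simplifying assumptions, not the authors' pipeline), spectral clustering of a weighted graph can be run with scikit-learn on a symmetrized adjacency matrix; PGL instead clusters learned graph embeddings and then refines the partition to minimize communication before a GNN assigns each kernel to the CPU or GPU.

```python
# Illustrative sketch: partition a weighted graph into candidate "kernels"
# with spectral clustering on the (symmetrized) adjacency matrix. PGL
# additionally learns node embeddings with an autoencoder and refines the
# clusters to minimize communication; this toy version skips both steps.
import networkx as nx
import numpy as np
from sklearn.cluster import SpectralClustering

g = nx.les_miserables_graph()                # any weighted stand-in graph
adj = nx.to_numpy_array(g, weight="weight")  # dense adjacency matrix
adj = (adj + adj.T) / 2                      # spectral clustering expects symmetry

labels = SpectralClustering(
    n_clusters=4, affinity="precomputed", assign_labels="kmeans", random_state=0
).fit_predict(adj)

# Each label now marks one candidate kernel; a downstream GNN classifier would
# predict whether that kernel should run on the CPU or the GPU.
for k in range(4):
    print(f"kernel {k}: {int((labels == k).sum())} nodes")
```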
Evaluated on OpenCL test kernels and real-world benchmarks, PGL delivers up to a 6x speedup over baseline parallel code and exceeds state-of-the-art frameworks by more than 2x. The integrated machine learning approach thus offers a breakthrough in autonomous, heterogeneous platform optimization.
PGL does involve tracing runtime execution history, which entails time and space overheads. However, minimizing communication between the partitioned clusters offsets much of this cost, though small, simple programs may benefit less because the upfront costs are not amortized. Nonetheless, the approach shows significant gains over conventional software-to-hardware mapping techniques.
Future Outlook
This research reveals an exciting avenue to self-optimizing, flexible computing systems that can keep pace with algorithmic application innovations. The lessons extend beyond today's CPU/GPU domains, offering guidance to incorporate new accelerator technologies through appropriately extracted optimization principles. With further investment, intelligent mapping systems could greatly simplify software design on future exascale platforms.
Journal reference:
- Preliminary scientific report. Xiao, Y., Ma, G., Ahmed, N. K., Capotă, M., Willke, T. L., Nazarian, S., & Bogdan, P. (2023). End-to-end programmable computing systems. Communications Engineering, 2(1), 1–15. https://doi.org/10.1038/s44172-023-00127-7