IARPA AGILE Program Kickoff Meeting
October 18-20, 2022
Lawrence Berkeley National Laboratory
Berkeley, California
Projects
Advanced Micro Devices, Inc. – PANDO
Processing of extremely large graph-based data structures is critical to many application areas spanning commercial, scientific, social, public policy, and intelligence interests. Current technologies and solutions do not provide the computational power and throughput needed to process the types and sizes of graphs described in the AGILE BAA to provide the desired timely and actionable intelligence. The objective of this proposal is to research and develop new computational concepts that deliver the multi-orders-of-magnitude improvements sought by IARPA for graph computations.
Given the semiconductor industry’s post-Moore’s Law challenges, conventional computing cannot be relied on to bridge this performance gap. Therefore, this proposal envisions a coordinated set of novel methods that collectively are expected to achieve game-changing levels of performance improvement for critical graph-based workloads. At the heart of the proposal is the PANDO architecture.
An important design philosophy of PANDO is to avoid data movement whenever possible, because data movement is one of the major performance limiters in large graph problems. PANDO aims to minimize data movement by allowing compute to occur “in place” where data resides through the deployment of accelerators at different levels of the system, including the memory system, nearby Data Warehouses/Data Lakes, and even the point of data ingestion. To keep the design and programmability of PANDO tractable, a general PANDO Graph Accelerator (PX) architecture is proposed, which provides specialized compute for common graph operations. This single general accelerator architecture can, however, be instantiated at different points in the system to provide varying levels of capability appropriate to the tasks commonly executed at each part of the overall system. A unified PX architecture means that every PX functionally supports the same fundamental set of operations (albeit at potentially different performance levels), which can greatly simplify the programming model and runtime systems as well as allow PANDO-based deployments to scale from datacenter-sized clusters down to desktop workstations.
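The unified-PX idea lends itself to a small illustration. The sketch below is hypothetical (PXUnit, its operation names, and the tier labels are invented stand-ins, not the proposal’s API): every instance exposes the same operation set, so only performance, never functionality, varies by placement.

```python
# Hypothetical sketch of the unified-PX principle: the same operation set
# everywhere, with placement affecting only performance. All names here are
# illustrative assumptions, not the actual PANDO interface.

from dataclasses import dataclass

@dataclass
class PXUnit:
    """One PX instance; 'tier' changes performance, not functionality."""
    tier: str              # e.g. "memory-side", "ingest", "data-lake"
    relative_speed: float  # illustrative throughput multiplier

    def neighbor_gather(self, adjacency, vertex):
        # Common graph primitive: fetch a vertex's neighbor list "in place".
        return adjacency.get(vertex, [])

    def filtered_count(self, adjacency, vertex, predicate):
        # Same primitive set on every tier: count neighbors passing a predicate.
        return sum(1 for v in self.neighbor_gather(adjacency, vertex) if predicate(v))

# The runtime can schedule the same call on any tier without code changes.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
memory_px = PXUnit(tier="memory-side", relative_speed=8.0)
ingest_px = PXUnit(tier="ingest", relative_speed=1.0)

# Functionally identical results regardless of where the PX sits.
assert memory_px.neighbor_gather(adj, 0) == ingest_px.neighbor_gather(adj, 0)
```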
Where data movement cannot be avoided, the envisioned PANDO system utilizes integrated silicon photonic technologies designed to allow very low-latency, high-bandwidth, and energy-efficient communication across the entire PANDO system with a flat interconnect organization. The photonic-based system organization, along with proposed security features, also allows a PANDO system to be partitioned and reconfigured into smaller virtual clusters to support multiple concurrent users running different tasks.
Another tenet of PANDO is the ability to support and exploit parallelism everywhere. At the hardware level, PANDO provides a large number of PX units, each of which internally can support massive levels of fine-grain parallelism. At the software and runtime level, PANDO builds on the team’s many years of work in highly-concurrent programming models; large-scale, parallel, distributed systems; latency-tolerant graph runtimes; and libraries, compilers, and runtimes for heterogeneous systems.
The significance of the proposed PANDO solution is a system architecture and general approach to large-scale graph computation problems that delivers the targeted multiple orders-of-magnitude performance improvement for the stated performance metrics. The PANDO approach exploits established (CPUs still play a role) and emerging (silicon photonics are rapidly maturing) technologies when possible, but PANDO also delivers aggressive innovations across multiple areas where graph-specific techniques and optimizations provide the desired improvements.
Georgia Institute of Technology – FORZA
This project proposes Flow-Optimized Reconfigurable Zones of Acceleration (FORZA), an innovative, energy-efficient, secure, and reliable computer architecture for large-scale data-analytic applications and other classes of large irregular problems. On AGILE workloads, FORZA is estimated to exceed the performance of state-of-the-art systems by two orders of magnitude due to our architectural and runtime innovations, which increase the efficiency and performance of memory-intensive applications. Through its revolutionary approach to computation, FORZA will support data analysis problem sizes far beyond the reach of current systems. The AGILE workloads are highly sensitive to latency, especially for remote memory accesses. FORZA addresses this challenge by bringing compute to the data with near-memory and in-network processing, and by eliminating deep cache hierarchies (along with their concomitant energy and silicon area overheads). This approach eliminates the lengthy, energy-intensive round trips to remote memory experienced on current systems. The FORZA runtime uses zone-based data layouts coupled with lightweight actor-based messages to give applications access to the novel hardware features in the system while also providing efficient atomicity. FORZA supports asynchronous, energy-efficient operation with three hardware-supported acceleration modes for memory accesses: reconfigurable atomic operations, active messages, and thread migration. Massive multithreading generates a high degree of concurrent memory accesses. The FORZA system also enhances security for multi-tenant applications by embedding the logical-to-physical mapping of memory accesses in the near-memory hardware and by incorporating Trusted Execution Environment features at the management level. Together, these techniques are estimated to outperform the state of the art by over two orders of magnitude on AGILE workloads while providing high security and energy efficiency.
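The active-message mode can be illustrated with a toy sketch. Everything here (the Zone class, the message format, the zone size) is an invented stand-in for exposition, not FORZA’s actual runtime interface; the point is that the update travels to the data’s home zone as one small message instead of incurring a read-modify-write round trip.

```python
# Toy sketch of actor-style active messages with zone-based data layout.
# All names and the zone layout are illustrative assumptions for exposition,
# not the FORZA runtime's actual interface.

class Zone:
    """Owns a shard of a global array; applies queued messages atomically."""
    def __init__(self, base, size):
        self.base = base
        self.data = [0] * size
        self.inbox = []

    def send(self, addr, op, value):
        # One small message carries the operation to the data; no round trip.
        self.inbox.append((addr, op, value))

    def drain(self):
        # The owning zone applies updates in order, giving cheap atomicity.
        for addr, op, value in self.inbox:
            if op == "add":           # stands in for a reconfigurable atomic op
                self.data[addr - self.base] += value
        self.inbox.clear()

def home_zone(zones, addr, zone_size):
    # Stand-in for the logical-to-physical mapping done in near-memory hardware.
    return zones[addr // zone_size]

zones = [Zone(i * 4, 4) for i in range(2)]
for addr in (0, 5, 5, 7):                       # concurrent remote updates
    home_zone(zones, addr, 4).send(addr, "add", 1)
for z in zones:
    z.drain()
```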
FORZA achieves these gains through dramatic overhead reductions from the combination of zone computation, an actor-based runtime, massive multithreading, inexpensive atomics, and the elimination of expensive memory hierarchies and their cache-coherence overheads. Small yet efficient cores increase compute density relative to the state of the art. All of this is achieved without exotic packaging requirements or reliance on overly aggressive technology predictions. The FORZA system is envisioned to support a programming environment that enables analysts to develop compact high-level programs for co-designed AGILE workloads, with automatic lowering and translation to the FORZA runtime environment. Sparse matrices and semiring algebras underlie the FORZA framework's support for streaming, path-oriented graph algorithms. Programs are expressed in high-level descriptions, building on past work on the Intrepydd Python-based environment, while achieving C++-like efficiency.
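The sparse-matrix/semiring formulation mentioned above can be made concrete with a small sketch: breadth-first search is a repeated sparse matrix-vector product over the Boolean (OR, AND) semiring. The plain-Python version below mimics that formulation with sets rather than an actual matrix library, purely for illustration; it is not FORZA code.

```python
# Illustrative sketch: BFS as repeated semiring SpMV. A vertex joins the next
# frontier if ANY frontier vertex has an edge to it -- exactly y = A^T x over
# the Boolean (OR, AND) semiring. Sets stand in for sparse vectors here.

def bfs_levels(adj, source, n):
    """Return the BFS level of each vertex (-1 if unreachable)."""
    frontier = {source}
    visited = {source}
    level = [-1] * n
    level[source] = 0
    depth = 0
    while frontier:
        depth += 1
        # Semiring SpMV step: OR over frontier vertices, AND with adjacency.
        nxt = {v for u in frontier for v in adj.get(u, []) if v not in visited}
        for v in nxt:
            level[v] = depth
        visited |= nxt
        frontier = nxt
    return level

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
```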
To evaluate design trade-offs in the FORZA system, a comprehensive, multi-scale simulation environment will be created. In Phase 1, the Architectural Design environment will be built with the A-SST framework. The modeling and simulation effort in Phase 2 will transition to a Detailed Design Model and will include multi-resolution models composed of A-SST software models combined with RTL implementations of FORZA system components. FPGA emulation techniques, including FireSim, will be employed as appropriate for scaling. The FORZA team includes a cadre of field-leading researchers and highly experienced HPC system designers. Georgia Institute of Technology is the lead institution, with collaborators from Cornelis Networks, Lucata Corporation, Tactical Computing Laboratories (TCL), the University of California Santa Cruz (UCSC), and the University of Notre Dame (ND).
Intel Federal LLC – TIGRE
The AGILE program seeks to achieve breakthrough scalability in mixed-mode analytics that spans sparse/graph compute and dense/AI compute by 2030. Using a set of analytic workflows that measure critical metrics including Graph Neural Net training times, dynamic and adaptive object ingestion and processing, and the prediction of graph information, the objective of AGILE is to deliver a system architecture capable of tackling today’s incredibly hard problems by 2030. The AGILE program further increases complexity by requiring one architecture that can scale all the way from the desktop to datacenter installations, while at the same time enabling maximum programmability, energy efficiency, and dynamic adaptation to changing operating environments.
Intel recognizes several fundamental challenges in addressing the AGILE program’s design goals. The proposed solutions will be derived from AGILE workflows and delivered as an unclassified hardware/software co-designed system. Optimizing sparse compute resources at scale requires an extremely efficient, high-bandwidth network that can scale to hundreds of thousands of nodes to meet the stringent AGILE requirements. This network must also offer high bandwidth efficiency on small-message operations to enable performant sparse/graph operations. Additionally, the network must be dynamic, resilient, and scalable both on- and off-die, and must support multiple operation modes depending on the application. This network infrastructure must be paired with a memory architecture and hierarchy that is equally adaptive, delivering performance at both AGILE workload extremes: irregular, unpredictable access patterns for sparse/graph operations as well as streaming, high-locality patterns in AI/dense operations. To achieve full scalability, the compute infrastructure (CPU, GPU, etc.) must be capable of transitioning between these extremes (sparse and dense) without excessive data movement or migration that reduces performance and increases power. These hardware components would be orchestrated by a resilient distributed runtime framework, which exposes a set of hardware abstractions to software developers for ease of programming and performance scalability.
Intel’s name for this proposed project reflects this amalgamation of critical features: TIGRE - Transactional Integrated Global-memory system with dynamic Routing and End-to-end flow control. Bounds analysis of the AGILE workflows would drive the TIGRE system’s performance requirements, while running early software frameworks on behavioral and emulation models would allow for a rapid focus on the critical software interface designs as well as the hardware complexities and potential performance limiters. Leveraging Intel’s long history of delivering novel co-designed hardware/software solutions, the goal would be to enable TIGRE to leap at least 10x past the performance of any other technology by the year 2030.
Indiana University – Pilot-CRIS
The revolution of future Machine Intelligence (MI) and ultra-large-scale computing challenges demands the creation of an innovative class of parallel computers that process large, irregular, and time-varying graph data structures as quickly and effectively as today’s supercomputers operate on numeric vectors and matrices. The Pilot-CRIS project, being undertaken by Indiana University and its industry partner, Simultac LLC, on behalf of the pioneering IARPA AGILE program, will conduct critical ground-breaking research for the development, modeling, testing, and evaluation of a revolutionary genus of parallel architecture for efficient and scalable graph data structure processing. Its exploration of innovative concepts in parallel computing architecture and runtime methods will overcome the barriers of current technology limitations and obsolete conventional practices. In so doing, Pilot-CRIS will prove transformative in its impact on the future of machine intelligence, economic drivers, and societal values. The Pilot-CRIS research project will invent a new generation of parallel computer architecture for scalable processing of the irregular, time-varying graphs necessary for MI in general and the AGILE workflows in particular. Pilot-CRIS will develop a high-level model specification for all computer element modules to be integrated in an end-to-end system that addresses the critical challenges for future applications over the next decade. These challenges include sufficient capacity to store and efficiently access entire large-scale graphs; the performance capability for graph manipulation; the discovery and exploitation of parallelism intrinsic to the changing graph structures; hardware mechanisms that sharply reduce the overheads of data management; and asynchronous operation with message-driven control and communication to hide latency and reduce contention.
The Pilot-CRIS system architecture integrates six classes of processing modules, each addressing a complementary critical challenge and derived through a co-design methodology for optimal interoperability. These include the Graph Object Store, the Active Memory Architecture, the Secure Scalable System Area Network, the rapid-rate streaming Data-Ingest Engine, the runtime system adjunct platform, and the user host system. Multiple modules of each type may be employed in a diversity of configurations, spanning a broad range of scales from desk-side graph accelerators to multi-rack graph HPC systems at Exascale and beyond. Co-design uses cross-cutting frameworks for the global name space, the message-driven protocol, and the runtime system software, driven by government-specified benchmarks and workflows. Custom hardware will be designed in RTL for detailed modeling, evaluation, and testing.
University of Chicago - UpDown Computer: Intelligent Data Movement and Encoding for Graph Computing
The evolution of computing systems, at the highest levels of performance, has been strongly focused on dense computations with extensive reuse of values, enabling high computational rates at low cost. Unfortunately, more than a decade of experience shows that these computing systems are not well suited to graph analytics, failing to utilize system resources efficiently, and unable to deliver scalable, high performance. Key challenges include ill-suited memory hierarchies, inflexible data representations, and high-overhead communication and scheduling.
The UpDown computer reflects a new approach: efficiently integrating a programmable engine, the UpDown accelerator, into the memory system. The accelerator provides intelligent data movement, data encoding, and efficient hardware messages, all at performance levels that are orders of magnitude greater than those possible on traditional CPU cores. Together, these features enable the UpDown computer to achieve dramatically higher performance on memory sweeps, encoded memory, and a range of graph analysis computations requiring adaptive data movement and efficient orchestration. The UpDown computer also incorporates traditional CPU cores, enabling efficient computation with hardware-managed hierarchies, but with the ability to enhance their capability with programmed data movement and data recoding.
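Data encoding is one of the mechanisms named above. A minimal sketch of the idea, under the assumption of a simple delta encoding (chosen here for illustration, not UpDown’s actual format), shows how a memory-side engine could decode, and even filter, a compressed neighbor list before any data crosses to the host.

```python
# Hedged illustration: the delta-encoding scheme below is an assumption made
# for exposition, not UpDown's format. Storing sorted neighbor lists as
# differences shrinks memory footprint and bandwidth; a programmable engine
# in the memory system can decode (and filter) in place so the host only
# receives the values it needs.

def delta_encode(neighbors):
    """Sorted neighbor list -> first value plus successive differences."""
    out, prev = [], 0
    for v in neighbors:
        out.append(v - prev)
        prev = v
    return out

def decode_and_filter(deltas, predicate):
    """What a memory-side engine might do: decode and filter near the data,
    returning only the values the host actually asked for."""
    out, cur = [], 0
    for d in deltas:
        cur += d
        if predicate(cur):
            out.append(cur)
    return out

enc = delta_encode([3, 7, 8, 15])   # small deltas compress well in practice
```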
Initial studies show the promise of UpDown, documenting performance and data-movement benefits on key graph kernels such as breadth-first search (BFS), triangle counting (TC), and JACCARDSim, as well as fundamental linear algebra kernels such as sparse matrix-vector product (SpMV) and sparse matrix-matrix multiply (SpGEMM). These studies illustrate the use of UpDown mechanisms and estimate potential node-level performance improvements of as much as 100x, based on the breadth-first-search and triangle-counting kernels. Initial studies suggest 10- to 100-fold potential improvements for sparse matrix, ETL, and data analytics workloads as well.
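For readers unfamiliar with the kernels listed, minimal reference versions of two of them are sketched below (plain Python, purely to pin down the computations being accelerated; this is not UpDown code):

```python
# Reference definitions of two kernels named above, written for clarity
# rather than performance.

def spmv(rows, x):
    """Sparse matrix-vector product; rows maps i -> [(j, value), ...] (CSR-like)."""
    return {i: sum(val * x[j] for j, val in cols) for i, cols in rows.items()}

def triangle_count(adj):
    """Count triangles by intersecting neighbor sets; each triangle (u, v, w)
    with u < v < w is counted exactly once."""
    count = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if v > u:
                # common neighbors w > v close a triangle u-v-w
                count += sum(1 for w in set(adj[u]) & set(adj[v]) if w > v)
    return count

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
# this small graph contains two triangles: (0, 1, 2) and (0, 2, 3)
```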
Beyond performance, UpDown also has potential to improve power efficiency for graph analytics, implementing memory sweeps and data movement with extreme efficiency, and deploying encoded memory and other application-specific encodings that both increase memory capacity and bandwidth productivity.
We propose a systematic agenda to explore the remaining challenges for UpDown, including important parametric design questions such as balancing UpDown accelerator capability with TOP and with DRAM performance, sizing scratchpad memories, and balancing stack-, node-, and system-level bandwidth using emerging optical communication technologies (e.g., TERAPHY). Additional challenges include refining the accelerator ISA and microarchitecture, as well as key challenges in protection/isolation and self-aware computing.
The agenda includes creating a rigorous, quantitative evaluation of UpDown’s benefits as well as developing a broad understanding of how the AGILE kernels and workflows can best exploit and scale on the UpDown system. Our current system design and evidence suggest ample headroom to exceed the AGILE targets.
Qualcomm - Honeycomb
The goal of the Honeycomb system is to advance the state of the art of data analytics computing performance by orders of magnitude. The system architecture is derived from a top-down analysis of the data analytics workloads, which shows that performance and energy are primarily dictated by data movement, not compute. The memory subsystem is hierarchical, with high-capacity but slower memory at the top for long-term data storage, and lower-capacity but faster memory close to the compute node. It uses conventional memory technology, such as DRAM, because a viable replacement is not in sight; the architecture is technology agnostic, however, and can adopt any emerging technology. The innovative memory subsystem architecture allows users to configure memory for various abstractions such as scratchpads, streams, processing-in-memory, and compute-near-memory, with a virtualized address space for security. The heterogeneous compute node is targeted for 3 nm CMOS and beyond. It employs industry-standard cores for management, leveraging their ecosystem of system and programming software. The accelerators are optimized for data analytics and provide performance and energy efficiency. Large on-die memories/caches provide local data stores for the compute cores and accelerators, and an on-chip network connects them all together. The node implements optical interconnects to provide an external network for scalability. A cabinet houses 64 nodes, and several cabinets comprise the entire system, connected in a provably optimal two-dimensional dragonfly network that provides orders-of-magnitude performance improvements over a conventional system. The node architecture will be captured in a functional simulator using the A-SST framework. The architecture will be simulated executing the workloads and optimized to meet the AGILE goals.
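The abstract does not give the parameters of the two-dimensional dragonfly variant, so as a purely illustrative back-of-envelope, the standard balanced dragonfly sizing rule from the networking literature (routers per group a = 2p = 2h, with one global link between every pair of groups) shows how quickly the topology scales:

```python
# Back-of-envelope sketch using the classic balanced dragonfly sizing rule
# (a = 2p = 2h). Honeycomb's "two-dimensional" variant's actual parameters
# are not stated in the abstract, so these numbers are illustrative only.

def dragonfly_size(p):
    """Terminals supported by a balanced dragonfly with p hosts per router."""
    a, h = 2 * p, p          # routers per group, global links per router
    groups = a * h + 1       # each group reaches every other group directly
    return groups * a * p    # total terminals

# Modest per-router fan-out already reaches thousands of nodes with a
# small, fixed hop count -- the property that motivates the topology choice.
small = dragonfly_size(4)    # 33 groups of 8 routers with 4 hosts each
```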
The RTL implementation of the architecture in SystemVerilog will be simulated, cross-validated against the functional simulations, and synthesized for emulation on an FPGA-enabled platform to demonstrate the AGILE program goals.