Research

Background and Overview

Broadly speaking, my research belongs to the area of computer and system architecture: designing and interconnecting hardware components to create computers that meet functional, performance, power, and cost goals. It is an exciting time to conduct research in this area because VLSI technology continues to provide more transistors and higher clock speeds, allowing computer architects to build ever more powerful microprocessors and computer systems. Moreover, as software technologies evolve, new applications and programming paradigms constantly emerge to challenge traditional hardware designs. At the same time, the high design complexity driven by the quest for greater performance has created critical problems, such as longer verification times, higher power dissipation, and reduced scalability.

My primary technical interests span microprocessor and memory architecture, operating systems, security, and the modeling and evaluation of future computer systems. A common thread in my research is improving evaluation methods in order to better understand how emerging applications, system software, and programming models challenge architecture designs. In my previous research, I developed simulation platforms, mathematical models, and automatic workload synthesizers to increase the breadth and scope of evaluation.

Previous Work and Contributions

(1) Modeling, Simulation and Evaluation of Future Hardware Transactional Memory

As chip multiprocessors (CMPs) make their way into homes and the mainstream computing market, programmers need to change the way they write code to take advantage of the resources available on CMPs, specifically an ever-increasing number of processing elements. However, exploiting the available data and task parallelism in a program is often challenging and time-consuming, requiring significant effort to extract performance and guarantee correctness. Transactional memory (TM) may alleviate some of the difficulty of writing concurrent programs by shifting the burden of synchronization from the programmer to the TM system. The transactional memory programming model substitutes transactions for locks. These transactional regions are guaranteed to execute atomically: either all of the instructions complete or all are discarded. In addition to being atomic, transactions are isolated: all of the updates become visible at the same time, and no partial results are visible to the rest of the system. These two attributes are inherent in the model, so the programmer does not need to reason about them explicitly. This makes TM very attractive from a design point of view because it can accelerate development and increase productivity. TM systems and supporting mechanisms can be implemented in hardware, in software, or as a hybrid of the two. My research focuses on Hardware Transactional Memory (HTM) systems.
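The atomicity and isolation guarantees described above can be illustrated with a minimal, single-threaded Python sketch. It is only a toy model of the programming-model semantics, not an HTM design: the `ToyTM` class, the `Abort` exception, and the `transfer` example are all hypothetical names introduced here for illustration. Writes are held in a private buffer and published together at commit, so an aborted transaction leaves no partial result behind.

```python
class Abort(Exception):
    """Raised inside a transaction body to discard all of its updates."""


class ToyTM:
    """Toy single-threaded model of transactional semantics (not an HTM design)."""

    def __init__(self, memory):
        self.memory = dict(memory)  # committed, globally visible state

    def atomic(self, body):
        """Run body(read, write); publish all writes at once or none at all."""
        write_buffer = {}  # speculative updates, invisible outside the transaction

        def read(addr):
            # A transaction sees its own pending writes first.
            return write_buffer.get(addr, self.memory[addr])

        def write(addr, value):
            write_buffer[addr] = value

        try:
            body(read, write)
        except Abort:
            return False  # discard: committed memory is untouched
        self.memory.update(write_buffer)  # commit: all updates appear together
        return True


def transfer(amount):
    """Build a transaction body that moves `amount` from "a" to "b"."""
    def body(read, write):
        write("a", read("a") - amount)
        if read("a") < 0:
            raise Abort()  # the partial update to "a" must never become visible
        write("b", read("b") + amount)
    return body


tm = ToyTM({"a": 100, "b": 0})
committed = tm.atomic(transfer(30))   # commits both updates
aborted = tm.atomic(transfer(500))    # aborts and leaves memory unchanged
```

After both calls, `tm.memory` holds only the effects of the first transfer; the second transaction's intermediate write to `"a"` was never observable.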

(1.1) SuperTrans: The First Detailed, Cycle Accurate and Multiple Issue Model of Hardware Transactional Memory [1]

While there has been extensive research into the design of HTM implementations in recent years, all of it has been conducted on frameworks that are in-order, single-issue, not cycle-accurate, restricted to a single implementation, or some combination of these. While these frameworks were crucial in providing a proof of concept to both academia and industry, they could not provide direct comparisons of vital design choices such as conflict detection and version management. Moreover, they ignored the important role of instruction-level parallelism and microarchitecture interaction.

To address these concerns and to offer the research community a more adaptable framework for experimenting with various design implementations, I developed the SuperTrans framework. SuperTrans was the first HTM framework capable of multiple-issue, cycle-accurate simulation of all of the commonly implemented combinations of conflict detection and version management (Eager/Eager, Eager/Lazy, Lazy/Lazy). Additionally, SuperTrans provides parameterized abstraction of key latencies, which allows researchers to quickly experiment with various designs without focusing on detailed implementation. This framework provides an excellent test bed for designing and optimizing HTM systems.

(1.2) Using Analytical Models to Efficiently Explore Hardware Transactional Memory and Multi-core Co-design [2]

In an HTM-based multi-core system, the TM design dimensions and core configurations interact with each other. A multi-core processor relies on a number of microarchitectural optimizations that exploit dynamic properties of code and data to execute instructions in parallel on each core, whereas the HTM system orchestrates how the transactionalized threads execute concurrently across multiple cores. The overall system performance is therefore determined jointly by the TM design decisions and the underlying multi-core architecture parameters, and the two should not be considered in isolation. Understanding their interactions and the corresponding performance impact is critical for the efficient co-design of TM-based multi-core systems. However, the intrinsic interaction between TM mechanisms, core architecture configuration, and program transactions is largely unknown.

In this study, I built analytical models to capture these interactions. Using the SuperTrans framework and linear regression modeling techniques, I identified the design points that have a first-order effect on the performance of transactional memory workloads. I also quantified the significance of the complex interactions between TM and core architecture parameters. Next, I built computationally efficient non-linear predictive models to accurately forecast TM workload performance at arbitrary HTM/core configuration combinations. By analyzing the regression trees generated from my neural network model building methods, I was able to reveal further heterogeneous interaction between TM workloads, core microarchitectures, and TM mechanisms.
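To make the idea of ranking first-order effects and interactions concrete, the following Python sketch fits a linear model with pairwise interaction terms to a full two-level factorial design. It is an illustration under stated assumptions, not the modeling flow used in the study: the factor names (`issue_width`, `lazy_conflict`, `l1_size`) and the `toy_cycles` response function are invented stand-ins for real simulator runs. Because the design columns are coded as -1/+1 and are mutually orthogonal, each least-squares coefficient reduces to a simple average.

```python
from itertools import product


def fit_factorial(factors, measure):
    """Least-squares fit of y = b0 + sum(b_i x_i) + sum(b_ij x_i x_j) on a
    full two-level factorial design with levels coded as -1/+1.

    The coded columns are mutually orthogonal, so each coefficient is just
    the average of (column value * response) over all runs."""
    runs = list(product((-1, 1), repeat=len(factors)))
    ys = [measure(dict(zip(factors, run))) for run in runs]
    n = len(runs)
    coeffs = {"mean": sum(ys) / n}
    # Main (first-order) effects.
    for i, name in enumerate(factors):
        coeffs[name] = sum(run[i] * y for run, y in zip(runs, ys)) / n
    # Pairwise interaction effects.
    for i in range(len(factors)):
        for j in range(i + 1, len(factors)):
            key = factors[i] + "*" + factors[j]
            coeffs[key] = sum(run[i] * run[j] * y for run, y in zip(runs, ys)) / n
    return coeffs


def toy_cycles(cfg):
    # Made-up response surface standing in for a simulator measurement.
    return (100.0
            - 8.0 * cfg["issue_width"]
            - 3.0 * cfg["lazy_conflict"]
            + 2.0 * cfg["issue_width"] * cfg["lazy_conflict"]
            + 0.5 * cfg["l1_size"])


factors = ["issue_width", "lazy_conflict", "l1_size"]  # hypothetical parameters
coeffs = fit_factorial(factors, toy_cycles)

# Rank terms by magnitude: the largest coefficients are the first-order effects.
ranked = sorted((k for k in coeffs if k != "mean"), key=lambda k: -abs(coeffs[k]))
```

In this toy setup the fit recovers the coefficients of `toy_cycles` exactly, and `ranked` orders the terms by how strongly they influence the response, including the interaction between the core parameter and the TM parameter.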

(1.3) Architecture Independent Workload Characterization for Transactional Memory Benchmarks [3,4]

Transactional memory offers an attractive alternative to traditional concurrent programming but implementations emerged before the programming model, leaving a gap in the design process. In previous research, transactional microbenchmarks have been used to evaluate designs or lock-based multithreaded workloads have been manually converted into their transactional equivalents; others have even created dedicated transactional benchmarks. Yet, throughout all of the investigations, transactional memory researchers have not settled on a way to describe the runtime characteristics that these programs exhibit; nor has there been any attempt to unify the way transactional memory implementations are evaluated. In addition, the similarity (or redundancy) of these workloads is largely unknown. Evaluating transactional memory designs using workloads that exhibit similar characteristics will unnecessarily increase the number of simulations without contributing new insight. On the other hand, arbitrarily choosing a subset of transactional memory workloads for evaluation can miss important features and lead to biased or incorrect conclusions.

In this research, I developed a set of architecture-independent, transaction-oriented workload characteristics that can accurately capture the behavior of transactional code. I applied principal component analysis and clustering algorithms to analyze the proposed workload characteristics collected from a set of SPLASH, STAMP, and PARSEC transactional memory programs. I showed that using transactional characteristics to cluster the chosen benchmarks can reduce the number of required simulations by almost half. I also showed that these same methods can be used to identify specific feature subsets.
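The general shape of such an analysis (standardize the characteristics, project them onto a few principal components, cluster, and keep one representative per cluster) can be sketched in plain Python as follows. Everything specific here is an assumption made for illustration: the benchmark names, the four feature columns, and their values are invented, power iteration stands in for a full eigensolver, and a simple centroid-linkage agglomerative scheme stands in for whichever clustering algorithm was actually used.

```python
import math


def standardize(rows):
    """Z-score each feature column (a constant column is left at zero)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def top_components(data, k, iters=200):
    """Top-k principal directions via power iteration with deflation."""
    n, d = len(data), len(data[0])
    cov = [[sum(r[i] * r[j] for r in data) / n for j in range(d)] for i in range(d)]
    comps = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d  # fixed start; assumed not orthogonal to the answer
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        comps.append(v)
        # Deflate: remove the variance explained by the component just found.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return comps


def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]


def agglomerate(points, k):
    """Merge the two clusters with the closest centroids until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = math.dist(centroid([points[i] for i in clusters[a]]),
                                 centroid([points[i] for i in clusters[b]]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        merged = clusters.pop(b)
        clusters[a].extend(merged)
    return clusters


# Invented feature vectors: [tx length, read-set size, write-set size, tx fraction].
features = {
    "bench_a": [10, 4, 2, 0.10], "bench_b": [12, 5, 2, 0.12],
    "bench_c": [200, 60, 30, 0.80], "bench_d": [210, 64, 28, 0.85],
    "bench_e": [50, 40, 1, 0.40], "bench_f": [55, 42, 1, 0.45],
}
names = list(features)
z = standardize([features[n] for n in names])
pcs = top_components(z, k=2)
projected = [[sum(x * p for x, p in zip(row, pc)) for pc in pcs] for row in z]
clusters = agglomerate(projected, k=3)

# One representative per cluster: the member closest to the cluster centroid.
reps = []
for members in clusters:
    c = centroid([projected[i] for i in members])
    reps.append(names[min(members, key=lambda i: math.dist(projected[i], c))])
grouped = sorted(sorted(names[i] for i in members) for members in clusters)
```

Simulating only the programs in `reps` instead of every entry in `features` is the kind of reduction the clustering is meant to enable; how much is saved on real suites depends on how redundant the workloads actually are.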

(1.4) TransPlant: A Parameterized Methodology for Generating Transactional Memory Workloads [5]

Using the methods I developed for characterizing transactional workloads, I discovered that the majority of the benchmarks being used for evaluation in this area were redundant and not comprehensive. Currently, for performance evaluations, most researchers rely on manually converted lock-based multithreaded workloads or the small group of programs written explicitly for transactional memory. Using converted benchmarks is problematic because they have been tuned so well that they may not be representative of how a programmer will actually use transactional memory. Hand coding stressor benchmarks is unattractive because it is tedious and time consuming. A new parameterized methodology that could automatically generate a program based on the desired high-level program characteristics would greatly benefit the transactional memory community.

To meet this need, I developed TransPlant, a framework for generating executable binaries designed to satisfy the statistical requirements of an input description file. This allows architects to quickly generate workloads that can be used to evaluate the performance of hardware modifications either within or outside the bounds of the limited set of available transactional memory benchmarks. Using principal component analysis, clustering, and raw transactional performance metrics, I showed that TransPlant can generate benchmarks with features that lie outside the boundary occupied by these traditional benchmarks. I also showed that TransPlant can mimic the behavior of SPLASH-2 and STAMP transactional memory workloads. The program generation methods I developed help transactional memory architects select a robust set of programs for quick design evaluations.
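The idea of driving workload generation from a statistical description can be sketched in a few lines of Python. This is a simplified illustration only: TransPlant itself produces executable binaries, whereas this sketch emits an abstract trace, and the description fields used here (`num_transactions`, `tx_size_histogram`, `write_fraction`, `shared_addresses`) are hypothetical and do not reflect TransPlant's actual input format.

```python
import random


def synthesize(description, seed=0):
    """Generate a list of transactions, each a list of (op, address) pairs,
    whose sizes and read/write mix follow the given description."""
    rng = random.Random(seed)  # seeded so the same description gives the same trace
    sizes, weights = zip(*description["tx_size_histogram"].items())
    trace = []
    for _ in range(description["num_transactions"]):
        size = rng.choices(sizes, weights=weights)[0]
        ops = []
        for _ in range(size):
            op = "W" if rng.random() < description["write_fraction"] else "R"
            addr = rng.randrange(description["shared_addresses"])
            ops.append((op, addr))
        trace.append(ops)
    return trace


# Hypothetical description: mostly short transactions, one write in four accesses.
description = {
    "num_transactions": 1000,
    "tx_size_histogram": {4: 0.6, 16: 0.3, 64: 0.1},
    "write_fraction": 0.25,
    "shared_addresses": 256,
}
trace = synthesize(description)

# Check that the synthesized trace reflects the requested statistics.
total_ops = sum(len(tx) for tx in trace)
measured_write_fraction = sum(op == "W" for tx in trace for op, _ in tx) / total_ops
```

Changing a single field, for example shrinking `shared_addresses` to raise contention, yields a new workload without any hand coding, which is the property that makes a parameterized generator useful for stressing a design outside the range of existing benchmarks.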

(1.5) Understanding the Behavior of Hardware Transactional Memory Workloads: Observations, Implications and Design Recommendations [6]

While there has been extensive research into the design of hardware and software transactional memory systems, there has been very little investigation of transactional memory program behavior. Understanding the behavior of transactional memory programs is essential for making efficient choices about hardware, compiler optimizations, and even the choice of versioning and conflict resolution mechanisms. Because these decisions often remain fixed throughout the life of a system, it is important that architects are able to make informed choices.

In this research, I investigated how transaction granularity, stride, and thread count affect program performance using both array- and object-based memory accesses. My work revealed that conventional concurrent programming practices may lead to poor performance when applied to transactional memory. In particular, it may not be wise to spend development time shrinking critical sections; transaction count and thread count are more important. There are also vast differences in the performance characteristics of eager conflict detection/eager version management (EE) systems when switching between array and object accesses. Lazy/lazy (LL) systems are largely immune to these effects, which may make a lazy system more attractive to developers since performance remains consistent even in the presence of increased conflicts. The results from these experiments provided valuable insight into the response characteristics of EE and LL systems.

(2) BASS: A Benchmarking Suite for evaluating Architectural Security Systems [7]

As software vulnerabilities continue to be exposed on a daily basis and adversaries grow more motivated to compromise valuable computer assets, novel methods must be developed to ensure security. Recently there has been growing interest within the computer architecture research community in designing architectural and hardware mechanisms to improve security. Unfortunately, there is currently no representative set of benchmarks for evaluating the security features of proposed hardware modifications. As a result, great effort is often spent searching for vulnerable programs, and evaluations suffer from a lack of diversity. To address this problem, I developed BASS, a benchmark suite for evaluating the security features of proposed architectural solutions under various malicious attack scenarios. BASS v1.0 currently consists of seven benchmarks chosen to cover a diverse range of architectural attack characteristics. To facilitate the use of these benchmarks in architectural security research, I developed both vulnerable programs and scripts that automatically generate exploits targeting those programs on both 32-bit x86 and 64-bit Alpha platforms.

Future Directions

(1) Improving Transaction Workload Performance Through Runtime Adaptable Hardware Transactional Models

One of the key discoveries in my analysis of the behavior of hardware transactional models [6] was that the performance of transactional memory systems depends on the composition of the transactional workload, the choice of transactional model, and the interaction between the two. Some memory access patterns perform better with Eager approaches, while others perform better with Lazy approaches. Moreover, these performance patterns are predictable. I believe the performance of transactional memory systems can be substantially improved by taking advantage of the interaction between transactional workloads and the underlying transactional model. My plan of attack for improving performance is three-fold. First, I believe that in many cases runtime systems and just-in-time compilation techniques can be used to re-size transactions appropriately, reducing overall contention based on memory access patterns and the underlying transactional model. Second, I believe adaptable systems can be constructed to assign transactional workloads to hybrid transactional models based on these workload characteristics. Finally, I believe that in cases where conflict is inevitable, conflict avoidance systems can be designed to throttle and align transactions so as to reduce the overall contention of the system.

(2) Modeling of Heterogeneous Workloads and Systems

One of the unique advantages of the TransPlant framework that I developed is its ability to natively generate workloads that are heterogeneous in nature. Such workloads are very difficult to produce by hand, have yet to surface in the wild, and are vital to the evaluation of future heterogeneous platforms. Future hardware platforms and software are both expected to be heavily heterogeneous. I believe this provides a distinct opportunity to explore and evaluate these future heterogeneous systems, essentially circumventing the chicken-and-egg problem of lacking adequate software for evaluation because the hardware does not yet exist.

My research will involve a good mix of forward-looking and present-day work. I plan to pursue an interdisciplinary research program that includes close collaborations with researchers and faculty in related fields. Similarly, I intend to collaborate with industry in understanding and developing solutions for practical problems. I believe my past experience with collaborative research, as well as my record of working with industry, will help me achieve this. I am excited at the prospect of learning, contributing to, and making an impact in this exciting and challenging field.

 

References

[1] James Poe, IDEAL Research: SuperTrans, http://www.ideal.ece.ufl.edu/supertrans, October 2009

[2] James Poe, Chang-Burm Cho and Tao Li, Using Analytical Models to Efficiently Explore Hardware Transactional Memory and Multi-core Co-design, International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), October 2008

[3] James Poe, Clay Hughes and Tao Li, TransMetric: Architecture Independent Workload Characterization for Transactional Memory Benchmarks, International Conference on Supercomputing (ICS), June 2009

[4] James Poe, Clay Hughes and Tao Li, On the (Dis)similarity of Transactional Memory Workloads, International Symposium on Workload Characterization (IISWC), October 2009

[5] James Poe, Clay Hughes and Tao Li, TransPlant: A Parameterized Methodology For Generating Transactional Memory Workloads, International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), September 2009

[6] James Poe, Clay Hughes and Tao Li, Understanding the Behavior of Hardware Transactional Memory Workloads: Observations, Implications and Design Recommendations, submitted

[7] James Poe and Tao Li, BASS: A Benchmarking Suite for evaluating Architectural Security Systems, Computer Architecture News (CAN), Volume 34, Issue 4, pages 26-33, December 2006