With the surge of big data applications and the worsening of the memory-wall problem, the memory system, rather than the computing unit, has become the commonly recognized major concern of computing. However, this “memory-centric” understanding had a humble beginning. More than three decades ago, the memory-bounded speedup model was the first model to recognize memory as the bound of computing, and it provided a general bound on speedup and a computing-memory trade-off formulation. The model was well received even then. It was quickly introduced in several advanced computer architecture and parallel computing textbooks in the 1990s as a must-know for scalable computing. These include Prof. Kai Hwang’s book “Scalable Parallel Computing”, in which he introduced the memory-bounded speedup model as Sun-Ni’s Law, alongside Amdahl’s Law and Gustafson’s Law. Over the years, the impact of this model has grown far beyond parallel processing and into the fundamentals of computing. In this article, we revisit the memory-bounded speedup model and discuss its progress and impact in depth, to make a unique contribution to this special issue, to stimulate new solutions for big data applications, and to promote data-centric thinking and rethinking.
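As a rough illustration of how the three laws named in this abstract relate, the sketch below uses the commonly cited textbook form of the memory-bounded speedup. The function names, the sequential fraction `f_seq`, and the workload-growth function `g` are illustrative choices, not notation taken from the article. Here `g(n)` is assumed to describe how much the parallel workload grows when memory capacity scales n-fold.

```python
from typing import Callable


def memory_bounded_speedup(f_seq: float, n: int, g: Callable[[int], float]) -> float:
    """Commonly cited form of the memory-bounded (Sun-Ni) speedup.

    f_seq: fraction of the original workload that is sequential (assumed name).
    n:     number of processors, each contributing its own memory.
    g:     growth of the parallel workload when memory scales n-fold.
    """
    if not 0.0 <= f_seq <= 1.0:
        raise ValueError("f_seq must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    scaled_parallel = (1.0 - f_seq) * g(n)
    # Scaled work on one processor divided by scaled work on n processors.
    return (f_seq + scaled_parallel) / (f_seq + scaled_parallel / n)


def amdahl(f_seq: float, n: int) -> float:
    # Fixed problem size: the workload does not grow, so g(n) = 1.
    return memory_bounded_speedup(f_seq, n, lambda _: 1.0)


def gustafson(f_seq: float, n: int) -> float:
    # Fixed execution time: the parallel workload grows linearly, so g(n) = n.
    return memory_bounded_speedup(f_seq, n, lambda k: float(k))


if __name__ == "__main__":
    f_seq, n = 0.1, 10
    print("Amdahl   :", amdahl(f_seq, n))
    print("Gustafson:", gustafson(f_seq, n))
    # Example growth for dense matrix multiplication (memory ~ N^2, work ~ N^3),
    # which gives g(n) = n ** 1.5 under this model.
    print("Sun-Ni   :", memory_bounded_speedup(f_seq, n, lambda k: k ** 1.5))
```

With `f_seq = 0.1` and `n = 10`, the Amdahl case evaluates to about 5.26 and the Gustafson case to 9.1. Any `g(n)` that grows faster than `n` yields a higher speedup than the Gustafson case, which is the computing-memory trade-off the model captures.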

CARE: A Concurrency-Aware Enhanced Lightweight Cache Management Framework

Published in The 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA 2023), 2023

Background

CHROME: Concurrency-Aware Holistic Cache Management Framework with Online Reinforcement Learning

Published in The 30th IEEE International Symposium on High-Performance Computer Architecture (HPCA 2024), 2024

Background

ACES: Accelerating Sparse Matrix Multiplication with Adaptive Execution Flow and Concurrency-Aware Cache Optimizations

Published in The 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2024), 2024

Background

AceMiner: Accelerating Graph Pattern Matching using PIM with Optimized Cache System

Published in The 2024 IEEE 42nd International Conference on Computer Design (ICCD 2024), 2024

Background

Pyramid: Accelerating LLM Inference with Cross-Level Processing-in-Memory

Published in IEEE Computer Architecture Letters (CAL), 2025

Abstract

Integrating processing-in-memory (PIM) with GPUs accelerates large language model (LLM) inference, but existing GPU-PIM systems encounter several challenges. While GPUs excel in large general matrix-matrix multiplications (GEMM), they struggle with small-scale operations better suited for PIM, which currently cannot handle them independently. Additionally, the computational demands of activation operations exceed the capabilities of current PIM technologies, leading to excessive data movement between the GPU and memory. PIM’s potential for general matrix-vector multiplications (GEMV) is also limited by insufficient support for fine-grained parallelism. To address these issues, we propose Pyramid, a novel GPU-PIM system that optimizes PIM for LLM inference by strategically allocating cross-level computational resources within PIM to meet diverse needs and leveraging the strengths of both technologies. Evaluation results demonstrate that Pyramid outperforms existing systems like NeuPIM, AiM, and AttAcc by factors of 2.31×, 1.91×, and 1.72×, respectively.
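One common way to see why GEMV is a natural fit for PIM while large GEMM suits GPUs is to compare arithmetic intensity (floating-point operations per byte moved). The sketch below is a back-of-the-envelope estimate under stated assumptions, not the analysis used in the Pyramid paper: each operand is assumed to be read or written exactly once, elements are 2 bytes (FP16), and the function name and matrix sizes are hypothetical.

```python
def arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes ideal reuse: A, B, and C each cross the memory interface once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    dim = 4096  # hypothetical hidden dimension
    # GEMV is the n = 1 case: a weight matrix times a single activation vector.
    print("GEMV FLOPs/byte:", arithmetic_intensity(dim, 1, dim))
    # A large square GEMM reuses each loaded element many times.
    print("GEMM FLOPs/byte:", arithmetic_intensity(dim, dim, dim))
```

Under these assumptions GEMV performs roughly one operation per byte, so it is limited by memory bandwidth, whereas the square GEMM performs over a thousand operations per byte and is limited by compute.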

Concurrency-Aware Cache Miss Cost Prediction with Perceptron Learning

Published in The 35th Great Lakes Symposium on VLSI (GLSVLSI 2025), 2025

Background

Xiaoyang Lu

Posts by Collection

publications

APAC: An Accurate and Adaptive Prefetch Framework with Concurrent Memory Access Analysis

Background

CoPIM: A Concurrency-aware PIM Workload Offloading Architecture for Graph Applications

Background

Premier: A Concurrency-Aware Pseudo-Partitioning Framework for Shared Last-Level Cache

Background

A generalized model for modern hierarchical memory system

Abstract

The Memory-Bounded Speedup Model and its Impacts in Computing

Abstract
