Last updated: 2025-12-04

😄 About Me


I have been a Ph.D. candidate at the School of Integrated Circuits, Shanghai Jiao Tong University (SJTU), since September 2022, under the supervision of Prof. Lei He (IEEE Fellow). I am also a research intern at the Eastern Institute of Technology (EIT), Ningbo. I received my B.S. and M.S. degrees in integrated circuit design from Xidian University.

📄 Download CV

🎯 Research Interests

  • Digital IC Design for AI Chips
  • Hardware/Software Co-Optimization
  • AI Compiler Development
  • FPGA-Based Accelerator
  • AI Models: Llama, dLLM, Mamba, VLA

🎓 Education

  • Shanghai Jiao Tong University (2022.09 – 2026.06)
    Ph.D. in Integrated Circuits Engineering
  • Xidian University (2019.09 – 2022.06)
    Master’s in Integrated Circuit Design
  • Xidian University (2015.09 – 2019.06)
    Bachelor’s in Integrated Circuit Design and Systems


🏆 Awards



📚 Publications


  • [1] CCF-A ASPLOS 2026 DFVG: A Heterogeneous Architecture for Speculative Decoding with Draft-on-FPGA and Verify-on-GPU. Shaoqiang Lu*, Yangbo Wei*, Junhong Qian, Dongge Qin, Shiji Gao, Yizhi Ding, Qifan Wang, Chen Wu, Xiao Shi, Lei He.
  • [2] CCF-A DAC 2025 MambaOPU: An FPGA Overlay Processor for State-space-duality-based Mamba Models. Shaoqiang Lu*, Xuliang Yu*, Tiandong Zhao, Siyuan Miao, Xinsong Sheng, Chen Wu, Liang Zhao, Ting-Jung Lin, Lei He.
  • [3] CCF-B ICCAD 2025 MoE-OPU: An FPGA Overlay Processor Leveraging Expert Parallelism for MoE-based Large Language Models. Shaoqiang Lu*, Yangbo Wei*, Junhong Qian, Chen Wu, Xiao Shi, Lei He.
  • [4] CCF-B TRETS (journal) 2025, ISEDA 2024 MCoreOPU: An FPGA-based Multi-Core Overlay Processor for Transformer-based Models. Shaoqiang Lu*, Tiandong Zhao*, Ting-Jung Lin, Rumin Zhang, Chen Wu, Lei He.
  • [5] CCF-B ICCAD 2024 ChatOPU: An FPGA-based Overlay Processor for Large Language Models with Unstructured Sparsity. Tiandong Zhao, Shaoqiang Lu, Chen Wu, Lei He.
  • [6] CCF-C ASAP 2025 (Best Paper Nomination) METAL: A Memory-Efficient Transformer Architecture for Long-Context Inference on FPGA. Zicheng He, Shaoqiang Lu, Tiandong Zhao, Chen Wu, Lei He.
  • [7] CCF-C ASP-DAC 2026 dLLM-OPU: An FPGA Overlay Processor for Accelerated Diffusion Large Language Models. Yangbo Wei*, Shaoqiang Lu*, Junhong Qian, Chen Wu, Xiao Shi, Lei He.
  • [8] CCF-C FPT 2026 FlightOPU: An FPGA Overlay Processor for LLM with HBM-Aware Multi-Die Architecture. Chen Wu, Shaoqiang Lu, Yangbo Wei, Junhong Qian, Jinlong Yan, Zhanfei Chen, Rumin Zhang, Xiao Shi, Lei He.
  • [9] CCF-A AAAI 2026 Mixture-of-Trees: Learning to Select and Weigh Reasoning Paths for Efficient LLM Inference. Yangbo Wei, Zhen Huang, Shaoqiang Lu, Junhong Qian, Dongge Qin, Ting-Jung Lin, Wei Xing, Chen Wu, Lei He.
  • [10] CCF-C FPL 2023 Token Packing for Transformers with Variable-Length Inputs. Tiandong Zhao, Siyuan Miao, Shaoqiang Lu, Jialin Cao, Jun Qiu, Xiao Shi, Kun Wang, Lei He.
  • [11] CCF-C FCCM 2025 C2OPU: Hybrid Compute-in-Memory and Coarse-Grained Reconfigurable Architecture for Overlay Processing of Transformers. Siyuan Miao, Lingkang Zhu, Chen Wu, Shaoqiang Lu, Jinming Lyu, Lei He.
  • [12] CCF-A Science China: Information Sciences 2025 FPGA Overlay Processor for AI Computing. Lei He, Kun Wang, Chen Wu, Zhuofu Tao, Xiao Shi, Siyuan Miao, Shaoqiang Lu.


🧰 Projects


Project 1: An FPGA-based Overlay Processor Unit for Accelerating AI Models


Description

  • Designed an FPGA-based Overlay Processor Unit (OPU) to accelerate inference of diverse AI deep learning models. Optimized data flow and operator execution through hardware–software co-design. Successfully deployed in real-time edge scenarios.

Responsibilities

  • Designed the OPU including the instruction set, compiler, and hardware microarchitecture.
  • Built a scalable computing engine for parallel execution of key operations (e.g., convolution, matrix multiplication).
  • Developed specialized functional units for nonlinear operations.
  • Implemented a real-time hardware–software runtime, spanning CPU-side model compilation through FPGA inference.
  • Enabled PCIe-based transfer of weights and control instructions.
  • Designed for scalability across various sizes and types of neural networks.
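As a toy illustration of the instruction-set side of this hardware–software co-design, the sketch below packs and unpacks a 32-bit overlay instruction word. The field names and bit widths are invented for the example, not the OPU's actual ISA.

```python
# Hypothetical 32-bit overlay instruction word: opcode | src | dst | len.
# Field widths are illustrative only (6 + 9 + 9 + 8 = 32 bits).
OPCODE_BITS, SRC_BITS, DST_BITS, LEN_BITS = 6, 9, 9, 8

def encode(opcode, src, dst, length):
    """Pack the four fields into one 32-bit instruction word."""
    assert opcode < 2**OPCODE_BITS and src < 2**SRC_BITS
    assert dst < 2**DST_BITS and length < 2**LEN_BITS
    word = opcode
    word = (word << SRC_BITS) | src
    word = (word << DST_BITS) | dst
    word = (word << LEN_BITS) | length
    return word

def decode(word):
    """Unpack an instruction word back into (opcode, src, dst, length)."""
    length = word & (2**LEN_BITS - 1); word >>= LEN_BITS
    dst = word & (2**DST_BITS - 1);    word >>= DST_BITS
    src = word & (2**SRC_BITS - 1);    word >>= SRC_BITS
    return word, src, dst, length

assert decode(encode(3, 17, 42, 200)) == (3, 17, 42, 200)
```

Keeping the field widths in one shared definition is what lets a compiler-side encoder and an RTL decoder stay in sync.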

Experimental Setup

  • Hardware: Xilinx Alveo U200 @ 300 MHz (PEs @ 600 MHz)
  • Implementation: 4-core OPU + 64 GB DDR4
  • Quantization: models run in INT8 via bitsandbytes
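A symmetric per-tensor INT8 scheme can be sketched as below. This illustrates the general idea only, not the bitsandbytes implementation itself (which uses finer-grained scaling and outlier handling).

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: one scale maps [-max|x|, max|x|]
    # onto the signed 8-bit range [-127, 127].
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original floating-point values.
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.27, 0.0, 1.0], dtype=np.float32)
q, scale = quantize_int8(x)
print(q.tolist())   # 50, -127, 0, 100
```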

Evaluation

  • Resource usage (left) vs. model results (right). *We report first-token latency.

    Resource   LUT      FF        BRAM   DSP
    Used       947684   1806396   1004   4364
    Util (%)   80.1     76.3      46.5   63.8

    Model               BERT   ViT    GPT2    LLaMA7B
    Latency* (ms)       3.41   6.96   59.49   149.57
    Throughput (TOP/s)  6.08   4.66   7.42    7.99

Tools

  • Vivado, FPGA (U200), Verilog, ModelSim, PCIe, DDR, XDMA, C++, Python, PyTorch



Project 2: An Edge SoC with an ARM Cortex-M3 and a Hardware Co-Accelerator for Face Detection


Description

  • Developed an edge System-on-Chip (SoC) integrating an ARM Cortex-M3 processor with a dedicated hardware co-accelerator for real-time face detection. Deployed a decision-tree-based PICO (Pixel-Intensity Comparison-based Object Detection) model.

Responsibilities

  • Built a complete image acquisition, storage, and display pipeline with software access to bus peripherals.
  • Implemented the face detection algorithm on Cortex-M3 for standalone execution.
  • Designed and integrated a dedicated hardware accelerator to boost detection performance.
  • Ran on a 100 MHz AX7103 FPGA with an OV5640 5 MP camera for image capture.
  • Stored frames in DDR3, displayed them in real time over HDMI, and reported status over UART.
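The PICO detector walks shallow decision trees whose internal nodes are pairwise pixel-intensity comparisons inside the detection window. A minimal sketch of one tree traversal follows; helper names and the heap-style tree layout are assumptions for illustration (the real PICO cascade packs offsets compactly and sums many trees per stage).

```python
import numpy as np

def pico_test(window, p1, p2):
    """One PICO binary test: compare intensities at two window-relative
    positions, each given as (row, col) fractions in [0, 1]."""
    h, w = window.shape
    y1, x1 = int(p1[0] * (h - 1)), int(p1[1] * (w - 1))
    y2, x2 = int(p2[0] * (h - 1)), int(p2[1] * (w - 1))
    return 1 if window[y1, x1] <= window[y2, x2] else 0

def traverse(window, tests, leaf_scores, depth):
    """Walk one depth-`depth` tree stored as an implicit heap:
    tests[0 .. 2**depth - 2] are internal nodes, leaf_scores has 2**depth
    entries holding the stage score at each leaf."""
    idx = 0
    for _ in range(depth):
        bit = pico_test(window, *tests[idx])
        idx = 2 * idx + 1 + bit      # left child on 0, right child on 1
    return leaf_scores[idx - (2**depth - 1)]

window = np.array([[10, 20], [30, 40]], dtype=np.uint8)
tests = [((0, 0), (0, 1)), ((1, 0), (1, 1)), ((1, 0), (0, 0))]
score = traverse(window, tests, [-1.0, -0.5, 0.5, 1.0], depth=2)  # -> 0.5
```

Because each node only reads two pixels and adds one index update, this structure maps cleanly to a small hardware accelerator next to the Cortex-M3.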

Experimental Setup & Results

  • Camera: OV5640, RGB565, 5 MP; Display: HDMI, RGB888, 640×480 @ 60 Hz; Memory: DDR3 ×2 (8 Gb each).

    Resource   Used    Utilization
    LUT        36583   57.70%
    FF         36130   28.50%
    BRAM       101     75.00%
    DSP        27      11.00%

    SMIC55 report
    ASIC area           61801 μm²
    PT power            361.5 μW
    NAND2 area          1.12 μm²/gate
    Gates / MOS count   55180 / ≈220k

    Platform    Implementation   Latency
    CPU         OpenCV           33 ms
    FPGA        RTL              42 ms
    Cortex-M3   C                2,700 ms
    Speedup (Cortex-M3 / FPGA)   ≈ 60×

Evaluation

  • Achieved 41 ms/face processing time with hardware acceleration, compared to 2700 ms/face on pure Cortex-M3.
  • Delivered a 65.81× performance speedup over the baseline software-only implementation.

Tools & Technologies

  • C (embedded, Cortex-M3), AHB/APB/AXI, Keil MDK, AX7103, OV5640, HDMI



Project 3: Digital IC Frontend Design and Implementation of a PE Array


Description

  • Designed a Processing Element (PE) array to accelerate irregular sparse AI workloads. Implemented the digital IC frontend design flow and compared resource utilization between ASIC and FPGA implementations.

Responsibilities

  • Designed sparse computing architecture for the PE array, including triangular-fed data flow, PE unit logic, and weight bitmask decoding.
  • Verified functionality via simulation before setting timing constraints.
  • Completed both FPGA and ASIC flows, eliminating timing violations through iterative analysis.
  • Integrated the PE array into a complete accelerator system.
  • Analyzed synthesis and power reports to guide RTL refinement for improved performance.

Evaluation

  • Logic synthesis with Synopsys Design Compiler (DC) at the TSMC 28 nm process node.
  • Power analysis via PrimeTime (PT), achieving:
    • 1 GHz operating frequency
    • 1.06 mm² core area
    • 1.32 W power consumption
  • Balanced performance, area, and power through RTL optimizations.

Tools & Technologies

  • Verilog, Synopsys Design Compiler (DC), PrimeTime (PT), VCS, TSMC 28nm, SDC, Shell, Tcl



Project 4: Expert-Aware Quantization and Sparsity for MoE-Based Models


Description

  • Designed and implemented expert-aware quantization and sparsity optimization techniques for Mixture-of-Experts (MoE) models to reduce memory footprint.

Responsibilities

  • Introduced N:M sparsity patterns (1:4 / 2:4 / 4:8 / 6:8 / 8:8) in MLP layers.
  • Applied mixed-precision quantization (BF16 / FP8 / INT4) guided by expert activation frequency, covering both expert and shared layers.
  • Integrated sparsity and quantization pipelines into the training workflow of the DeepSeek-V2-Lite model on the GSM8K dataset.
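The N:M patterns above keep N non-zero weights in every contiguous group of M. A minimal magnitude-based sketch follows; the largest-magnitude selection criterion is an assumed simplification, and the project's expert-aware method is more involved.

```python
import numpy as np

def prune_n_m(w, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m
    (grouped along the flattened last axis); zero the rest."""
    w = np.asarray(w, dtype=float)
    assert w.size % m == 0, "weight count must be divisible by m"
    groups = w.reshape(-1, m)                 # each row is one group of m weights
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.array([0.9, -0.1, 0.5, 0.05, -0.7, 0.2, 0.1, 0.8])
print(prune_n_m(w).tolist())   # exactly 2 non-zeros per group of 4
```

The fixed N-per-M structure is what makes these patterns hardware-friendly: the decoder knows every group holds exactly N values plus a small bitmask.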

Evaluation

  • Reduced parameter size by up to 2.76× while maintaining accuracy.
  • Only 1.53% average accuracy drop after fine-tuning.
  • Achieved 2–3× speedup and 40–60% memory savings.

Tools & Technologies

  • PyTorch, DeepSeek-V2-Lite, GSM8K dataset, CUDA, NVIDIA A100, NVIDIA RTX 4090, Python
