Research portfolio
My research spans efficient machine learning and systems, from model pretraining
quality to algorithm-system co-design for LLM training, inference, and agent infrastructure.
Projects are organized below across five long-running themes. For paper details and authors,
please refer to the Selected Publications section on the home page or my Google Scholar profile.
Efficient ML Algorithm
- CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning (Efficient training · NeurIPS 2024)
- Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping (Efficient inference · ICML 2025)
- CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention (Efficient inference · ICLR 2026)
- Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient (Efficient training)
- Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time (Efficient inference)
- I-DLM: Introspective Diffusion Language Models (Efficient inference)
- Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution (Efficient inference)
- VocabPrune (Efficient inference)
- Diffusion Router (Efficient inference)
- MixOfSpeculator: Mix-Architecture Speculator Design (Efficient inference)
- Phoenix Speculator (Efficient inference)
- Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation (Efficient training)
- Bio-Inspired LLM-Based Multiagent Systems (Efficient inference)
- Tail Likelihood Reinforcement Learning (Efficient training)
- Scaling Law of Speculative Decoding (Efficient inference)
Efficient ML System
- DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales (RL training system · DeepSpeed)
- Aurora: When RL Meets Adaptive Speculative Training — A Unified Training-Serving System (Speculator training system · ICML 2026)
- Pre-Expedite: Hierarchical Structure Space for Improving Small File Access in Parallel File Systems (ML file system)
- HybridShare: Universal Resource Scheduling for Hybrid Jobs (ML scheduling system)
- MAEM: Multiple-Application co-Execution Time Estimation (ML scheduling system)
- EmReal: A Digital Twin Framework of Emulated and Real Components for Robots with Reinforcement Learning (RL training system)
- XoRL (RL training system)
- Hierarchical Performance Isolation for Distributed LLM (Agent system)
- AgentGO (Agent system)
- Smart KV (Agent system)
- Universal KV System (Agent system)
Quantization
- Flash-LLM: Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity (VLDB 2024)
- Quant-LLM: Accelerating Large Language Model Serving via FP6-Centric Algorithm-System Co-Design on Modern GPUs (USENIX ATC 2024)
- KITTY: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost (MLSys 2026)
- SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
- OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization (2026)
Modeling
- DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
- CoderForge-Preview
- Loop Diffusion
Survey
- RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Models (IEEE TPAMI)
- Survey of LLM Agents
Looking to collaborate?
Feel free to reach out at zhongzhu.zhou@sydney.edu.au if your interests align with
efficient ML systems, LLM training/serving infrastructure, quantization, or coding-agent
research. For a complete role-by-role breakdown of my contributions (motivation and specific
contributions), see the Experience page.