Experience | Zhongzhu (Charlie) Zhou

Experience

Each role lists the projects I led or contributed to, with the motivation that drove them and the contributions I made. Project links use real URLs where available; the rest are placeholders to be filled in as releases land.

Professional Experience

Turbo Team, Together.AI

Mar 2024 - Present

Senior Research Scientist Remote & San Francisco, United States

Advisors: Ben Athiwaratkun (Senior Research Scientist, Together.AI) · Shuaiwen Song (Vice President of Research, Together.AI)

Efficient ML Algorithms

Ladder-Residual

Paper Code

Motivation. Large-model inference under tensor parallelism often suffers from communication stalls and weak overlap between communication and computation; we sought an architecture-runtime co-design that improves throughput without sacrificing model quality.

Contributions.

Co-conceived the parallelism-aware residual design and helped shape the paper's system and evaluation story.
Implemented and optimized the gpt-fast inference path with CUDA Graphs and torch.compile (mode="reduce-overhead") for large-model serving.
Benchmarked performance across model scales (1B-405B) and TP world sizes (1, 2, 4, 8, 16), validating up to 30% end-to-end throughput improvement on 70B models with P2P enabled and up to 60% with P2P disabled.

CREST (Turbo-reasoning)

Motivation. Reasoning models often under-think or over-think at test time, wasting tokens or missing correct solutions; we sought a training-free intervention that could be deployed in mainstream serving stacks.

Contributions.

Co-developed the core idea of a training-free test-time steering method that identifies and modulates cognitive attention heads, improving accuracy by up to 17.5% and reducing token usage by 37.6% across reasoning benchmarks.
Designed deployment paths for integrating CREST into vLLM and SGLang.

CARE

Paper Code Project

Motivation. MLA-style attention can improve serving efficiency, but most pretrained checkpoints use GQA/MHA and cannot directly benefit; we sought a practical conversion path that preserves quality while lowering inference cost.

Contributions.

Developed the core idea and empirical framing for upgrading pretrained attention into MLA-compatible forms.
Proposed a conversion pipeline that upgrades pretrained attention (e.g. GQA) into multi-head latent attention (MLA) for faster inference without increasing KV-cache size.
Ran the full experimental suite and carried out vLLM integration and theoretical analysis.

Squeeze Evolve

Motivation. Recursive self-aggregation improves reasoning quality, but uniform compute allocation across generation and aggregation wastes cost on easy subsets and under-allocates recovery on hard subsets.

Contributions.

Helped develop a multi-model orchestration view of recursive self-aggregation, routing generation and aggregation between large and small models based on cross-model confidence.
Owned coding-benchmark execution and evaluation pipelines, especially for LiveCodeBench V6, and supported ablations on routing thresholds and aggregation behavior across AIME 2025 and HMMT 2025.
Demonstrated 30-40% compute reduction at matched accuracy or 5-7 point accuracy gains at equivalent compute.
Paper accepted by COLM 2026.

Agent Evolve

Motivation. Current LLM-based multiagent systems are largely static after deployment and lack mechanisms for continual adaptation across agents, skills, and populations.

Contributions.

Built a bio-inspired LLM multiagent framework with pheromone-style memory, evolutionary division of labor, and skill inheritance for open-ended population adaptation.
Studied population-level adaptation through competition, selection, and cross-generation strategy transfer.

Explored integration of LEXICO compression techniques.
Prototyped vocabulary-pruned speculators and Mix-Architecture Speculator designs.
Explored diffusion LLMs that interleave self-verification with token generation.
Investigated diffusion-style MoE routers for smoother expert selection.
Investigated diffusion-style speculator design.

Efficient ML Systems

Training System — XoRL (RL Training System), Axolotl (SFT Training System)

Motivation. Building an RL and SFT training stack for coding and reasoning agents required more than model fine-tuning: it needed an end-to-end system that coupled sandboxed environments, distributed rollout workers, and multi-node training plus serving infrastructure while staying stable under long-context, MoE, and rapidly changing model variants.

Contributions.

Built much of the training-side RL framework, including agent PPO trainers, asynchronous rollout and pipeline-training paths, and the execution flow that converts multi-turn agent-environment interaction into PPO and GRPO training batches.
Owned the training pipeline that ingests rollout trajectories, computes advantage, and performs policy updates plus rollout-model weight synchronization for coding-agent post-training.
Implemented asynchronous rollout, replay-queue mini-batching, and router-assisted batching between rollout and training workers to overlap trajectory generation with policy optimization.
Developed trajectory/data transforms, token-level loss masks, stepwise-vs-trajectory advantage handling, rejection sampling, and batch balancing to improve GRPO signal quality and training stability.
Scaled long-context training recipes to 16K-32K contexts using Ulysses sequence parallelism, remove-padding, chunked prefill, and per-GPU token-budget tuning for DeepCoder and DeepScaleR-style runs.
Implemented sequence-parallel (SP) compatibility across the training stack so long-context post-training paths worked correctly with distributed attention, packed sequences, and rollout-to-training data flow.
Built SP-compatible MoE-LoRA kernel paths to support efficient distributed post-training for expert models without breaking sequence-parallel execution.
Integrated QuACK fused kernels into XoRL to improve kernel efficiency and support higher-throughput post-training recipes.
Added Qwen3.5 support and completed model bring-up across configs, training paths, and distributed recipes for reliable experimentation.
Diagnosed and fixed multi-node training failures (position_ids, cu_seqlens, attention-mask, and MoE dispatch issues) that destabilized distributed recipes across evolving model families.
Integrated long-context attention (Ulysses, Ring Attention) into Axolotl and supported SFT data flow from successful trajectories to extend supervised post-training to larger context windows.
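As a rough illustration of the group-relative advantage computation and token-level loss masking mentioned in the list above, the following minimal sketch normalizes rewards within one prompt's rollout group and then masks out tokens the policy did not generate. It is not XoRL code: the function names, the `eps` constant, and the toy rewards and mask are assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_token_advantages(advantage, loss_mask):
    """Broadcast a trajectory-level advantage to tokens, zeroing masked tokens
    (for example, environment observations the policy did not generate)."""
    return [advantage * m for m in loss_mask]


# Hypothetical toy group: four rollouts for the same prompt, reward 1.0 = success.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Mask: 1 for tokens produced by the policy, 0 for tool or environment output.
token_adv = masked_token_advantages(advantages[0], [1, 1, 0, 0, 1])
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, while masked positions contribute nothing to the policy loss.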

Inference System — Pulsar & SGLang

Motivation. High-throughput serving requires lower KV overhead and more stable speculative decoding across cache-hit patterns, batch sizes, and multi-node deployments.

Contributions.

Applied a Swift-KV caching strategy to accelerate prefill by reducing KV memory overhead and improving end-to-end latency.
Designed and implemented KV-cache prompt caching for the Phoenix speculator in Pulsar, stabilizing acceptance rates and reducing end-to-end latency.
Resolved tokenizer chat-template issues and Docker deployment bugs for reliable multi-node operation, then benchmarked cache behavior across batch sizes and cache-hit scenarios to explain acceptance-rate variability and optimize cache-hit logic.
Implemented and integrated sliding window attention support for Llama 4.

AgentGo

Motivation. Tool-using agents alternate between long-context reasoning and external actions, but request-centric runtimes either evict useful KV state too early or waste memory by pinning it too long.

Contributions.

Co-developed the core idea of treating multi-turn agent workflows as first-class programs rather than isolated requests.
Helped build the staged system path from telemetry and shadow prediction to offline replay, observability, and config-gated runtime integration for prediction-aware scheduling.

Hierarchical Performance Isolation for Distributed LLM

Motivation. Multi-tenant LLM serving needs hierarchical fairness and performance isolation across shared instances and clusters without sacrificing throughput.

Contributions.

Contributed to design discussions around hierarchical fairness, vruntime-style accounting, and weight partitioning across distributed serving instances.
Participated in experiments evaluating performance isolation and fairness under multi-tenant LLM serving workloads.
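For readers unfamiliar with vruntime-style accounting, here is a minimal sketch of the general idea, borrowed from weighted fair scheduling: each tenant accumulates virtual time in proportion to the work it consumes divided by its weight, and the scheduler always serves the tenant with the smallest virtual time. The class name, tenant names, and weights are hypothetical, and the sketch deliberately ignores the hierarchical and distributed aspects of the project.

```python
import heapq


class VRuntimeScheduler:
    """Toy weighted-fair scheduler: always serve the tenant with the lowest vruntime."""

    def __init__(self, weights):
        # weights: tenant name -> positive share weight (hypothetical values).
        self.weights = dict(weights)
        # Heap of (vruntime, tenant); ties break on tenant name.
        self.heap = [(0.0, tenant) for tenant in sorted(self.weights)]
        heapq.heapify(self.heap)

    def serve(self, tokens):
        """Pick the next tenant and charge it `tokens` of work, scaled by its weight."""
        vruntime, tenant = heapq.heappop(self.heap)
        vruntime += tokens / self.weights[tenant]
        heapq.heappush(self.heap, (vruntime, tenant))
        return tenant


sched = VRuntimeScheduler({"tenant_a": 2.0, "tenant_b": 1.0})
served = [sched.serve(tokens=100) for _ in range(6)]
counts = {tenant: served.count(tenant) for tenant in sched.weights}
```

With weights of 2:1 and equal-sized requests, `tenant_a` is served twice as often as `tenant_b` over the six scheduling decisions.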

Modeling

CoderForge

Blog Code

Motivation. High-quality coding agents require strong trajectory data, stable post-training pipelines, and task-aligned optimization objectives for code generation.

Contributions.

Led the training pipeline for OpenHands R2E-Gym & SWE-Bench-scale data: curated high-signal SWE-smith / Rebench examples and fixed attention-mask plus position-ID issues in XoRL.
Distilled Qwen3-480B trajectories into a 30B coding model via supervised fine-tuning and activation distillation, then initiated MoE / RL scaling for Qwen3-30B to improve SWE-Bench solve rates.
Designed per-token loss formulations for coding-trajectory distillation and model-quality improvement.

Dolby

Mar 2024 - Sep 2024

Research Intern Sydney, Australia

Advisors: Yucheng Liu (Research Scientist, Dolby) · Shuaiwen Song (Vice President of Research, Together.AI)

Extreme Efficient Video Coding System

Motivation. Traditional codecs (H.264/H.265/AV1) lack content adaptivity and incur high compute/memory costs. Existing neural compressors are too heavy for real-time GPU and mobile streaming. We needed a low-latency, domain-aware solution that tailors compression to video content.

Contributions.

Invented and spearheaded E²ND-VC (Extreme Efficient Neural Domain Video Compression), a neural video compression framework that leverages content-aware quantization for low-latency, high-quality streaming on standard GPUs and mobile devices.
Designed Optimal Brain Stride-wise Quantization (OBSQ), a domain-specific quantization methodology that selectively compresses neural-network weights based on content type (video conferencing, gaming), enabling real-time 1080p performance with minimal quality loss.
Engineered a multi-kernel, sensitivity-based quantization pipeline with mixed-bit precision assignments, dynamically allocating bit depths across convolutional kernels to preserve critical visual features while maximizing compression ratios.
Collaborated closely with cross-functional teams to implement PoC streaming pipelines, demonstrating significant reductions in power consumption and bandwidth usage without compromising visual fidelity.

DeepSpeed Team, Microsoft

Mar 2023 - Feb 2024

Research Intern Sydney, Australia

Advisors: Xiaoxia Wu (Research Scientist, Microsoft) · Zhewei Yao (Senior Researcher, Microsoft) · Shuaiwen Song (Senior Principal Scientist, Microsoft)

DeepSpeed4Science

Paper Project

Motivation. To build unique capabilities through AI system technology innovations that help domain experts unlock today's biggest science mysteries.

Contributions.

Developed DeepSpeed4Science's blog website through Azure MySQL, WordPress, Virtual Server Hosting, JavaScript / HTML / CSS / AJAX, and Azure Migration.
Revised the blog content, font size, technical research architecture, and code related to GenSLMs — Megatron-DeepSpeed for Large-Scale AI4Science Model Training.

DeepSpeed Chat: Easy, Fast, and Affordable RLHF Training of ChatGPT-like Models at All Scales

Paper Code

Motivation. ChatGPT-like models have revolutionized the AI world, but an accessible end-to-end RLHF pipeline for training powerful ChatGPT-like models is still lacking within the AI community.

Contributions.

Applied INT4 and INT8 quantization to the RLHF pipeline, increased the batch size, and improved the speed of training and generation phases of RLHF without significantly compromising accuracy.
Investigated ColossalAI's pipeline, studied its ZeRO-2, ZeRO-3, and GeminiDDP strategies, and adapted them for our RLHF algorithm.
Ran 400+ benchmark experiments for DeepSpeed Chat, ColossalAI, and HuggingFace powered by native PyTorch. Summarized the results and conclusions in the DeepSpeed blog.
Revised the DeepSpeed GitHub landing page and the DeepSpeed Chat blog, and produced the DeepSpeed Chat video.

Future System Architecture (FSA) Lab, The University of Sydney (USYD)

Mar 2022 - Present

Visiting Scholar, Ph.D. Student Sydney, Australia

Advisors: Shuaiwen Song (Associate Professor, USYD) · Chang Xu (Associate Professor, USYD) · Yibo Yang (Research Scientist, JD Explore Academy)

RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Models

Paper

Motivation. Text-to-image synthesis has become increasingly popular in the AI and computer graphics world (AIGC). However, there is no comprehensive survey paper that systematically introduces the frameworks and ideas behind text-to-image techniques.

Contributions.

Read over 100 papers, providing a literature review for each.
Collaborated with lab classmates to write the comprehensive survey paper.

Optimization of Diffusion Model Denoising Process

Motivation. Diffusion models currently require a large number of denoising steps. One reason for the lengthy process is the lack of a clear relationship between the noise and the trained image. Our goal is to explore additional methods to establish a connection between noise and the denoised image, beyond guidance techniques (such as incorporating text embeddings into the raw noise).

Contributions.

Developed innovative ideas, implemented them, and conducted comparative experiments to evaluate their performance.

Exploring Neural Collapse Phenomenon in Reinforcement Learning

Motivation. In reinforcement learning, agents may exhibit biased action selection in the environment due to incomplete understanding of the state and action distribution spaces. This research investigates whether the neural-collapse phenomenon occurs in policy-gradient networks and examines its implications for balancing action selection.

Contributions.

Conducted experiments applying ETF classifiers to 5+ neural networks in 10+ discrete-action RL environments (e.g. Atari, Gym Classic).
Derived and proved the formula and geometric properties of the policy-gradient loss function.
Authored paper drafts and submitted the work to NeurIPS.

Sparse Kernel Design in GPU TensorCore

Paper Code

Motivation. As pruning methods are applied, neural-network weight matrices become increasingly sparse, but no sparse-kernel implementation targets GPU Tensor Cores.

Contributions.

Conducted comparative experiments between our sparse kernel and Google's Sputnik.
Summarized experiment results and figures in the paper.

DeepSpeed I/O Framework Support for AI4Science

Paper Project

Motivation. AI4Science models have revolutionized the AI world. DeepSpeed can support AI4Science models deployed across multiple nodes but lacks an I/O management framework for handling large amounts of training data efficiently.

Contributions.

Investigated DeepSpeed I/O support in supercomputers (Argonne HDF5 Lustre System), analyzed data shuffling and fetching patterns for AI4Science models, and implemented algorithms to accelerate I/O.
Implemented a ViT model for weather prediction.

CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning

Paper Code

Motivation. Existing low-rank fine-tuning methods (e.g. LoRA) adapt LLMs without understanding which layers encode core knowledge vs. task-specific behavior, causing forgetting. We wanted a parameter-efficient method that adapts to the new task while preserving what the model already knows.

Contributions.

Designed the experimental methodology for evaluating task-aware parameter-efficient fine-tuning (dataset selection, baselines, metrics across math/code/instruction-following).
Implemented and executed large-scale experiments to compare CorDA against PEFT baselines, and helped collect and analyze empirical results used in the paper.

School of Computer Science and Engineering, SYSU

Sep 2018 - Mar 2022

Research Associate Guangzhou, China

Advisors: Dan Huang (Associate Professor, SYSU) · Yunfei Du · Yutong Lu (Professor, SYSU)

Pre-Expedite — Hierarchical Structure Space for Improving Small File Access in Parallel File Systems

Code

Motivation. Reduce clients' I/O communication with the metadata server, leveraging minimal additional client-side resources. Ensure high usability without modifying POSIX standards.

Contributions.

Investigated the I/O bottleneck in parallel/distributed file systems for Big Data and AI applications, identifying intensive metadata communication as a primary issue.
Utilized POSIX to create ZERO file blocks (Loop Device). Established a VFS within the ZERO file blocks, allowing each user to store small files in their designated blocks.

HybridShare: Universal Resource Scheduling for Hybrid Jobs

Code

Motivation. CPU- and GPU-centric applications allocate resources exclusively, leading to inefficient utilization of heterogeneous resources.

Contributions.

Analyzed the possibility of co-locating modern workflow applications on the same physical machine to share resources.
Proposed HybridShare algorithms that co-locate jobs with different resource preferences (GPU-centric, CPU-centric, memory-intensive) on the same node so that they share hardware resources, through Slurm, Mesos, and Kubernetes.

MAEM — Multiple Applications co-Execution Time Estimation

Motivation. Few works accurately estimate the slowdown of CPU/GPU applications based on application characteristics and hardware architecture.

Contributions.

Conducted a literature review on application profiling, interference and slowdown estimation, and interference-aware scheduling.
Gathered resource-consumption data for various benchmarks and analyzed their behavior.

Institute of Advanced Networks and Computing Systems, SYSU

Oct 2018 - Mar 2019

Research Intern Guangzhou, China

Advisor: Hejun Wu (Associate Professor, SYSU)

EmReal: A Digital Twin Framework of Emulated and Real Components for Robots with Reinforcement Learning

Code

Motivation. Pioneered a digital-twin framework for robots utilizing reinforcement learning (RL), bridging the gap between simulations and real-world deployments. Developed solutions to transition RL algorithms from simulators to actual robots.

Contributions.

Conducted a survey on robotics simulator systems and reinforcement-learning algorithms.
Designed and implemented a one-legged robot, integrating real and emulated components using XML, Python, ROS, and Arduino C.
Created a digital-twin framework for robotic systems, blending emulation, pre-training, connectivity, and hardware adaptation using ROS and PyBullet.

Co-authored a book on deep learning in reinforcement learning, awaiting publication.

Weixin Group, Tencent Holdings Ltd. & Dept. of CS, UIUC

Jul 2018 - Jul 2020

Research Intern · Testing, Technical-Architecture Department Champaign, IL, US & Guangzhou, China

Advisors: Tao Xie (Professor and Willett Faculty Scholar, UIUC) · Yuetang Deng (Director, Tencent)

JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games

Talk Paper

Motivation. In cases of plagiarism in mini-games, deeply obfuscated code cloned from the original often embodies malicious code and copyright infringements, posing great challenges for existing plagiarism-detection tools. We designed and implemented JSidentify, a hybrid framework for detecting plagiarism in online mini-games.

Contributions.

Focused on intermediate-representation analysis in V8 / Node.js Interpreter under the guidance of Prof. Tao Xie.
Conducted a literature review on code-plagiarism detection methods and clone-detection tools.
Developed an edit-distance estimation and network-flow algorithm to measure similarity in bytecode produced by V8's Ignition (interpreter) / TurboFan (optimizing compiler) pipeline.
Designed a priority-queue-based framework to consolidate multiple plagiarism-detection algorithms.
Co-authored the paper "JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games."
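The edit-distance component mentioned above can be illustrated with a standard Levenshtein distance over opcode sequences. This is a generic sketch rather than JSidentify's actual algorithm: the opcode tokens are illustrative placeholders, and the similarity normalization is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences, using O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map distance to a [0, 1] score; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Illustrative opcode streams: the suspect inserts one padding instruction.
original = ["LdaSmi", "Star", "Add", "Return"]
suspect = ["LdaSmi", "Star", "Nop", "Add", "Return"]
score = similarity(original, suspect)
```

Comparing token sequences rather than raw source text is what makes this kind of measure less sensitive to renaming and formatting changes, although heavier obfuscation would need more than a plain edit distance.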

Microsoft (China) Co., Ltd., Guangzhou Branch

Sep 2018 - Feb 2019

Project Assistant to Senior Cloud Architect Guangzhou, China

Advisor: Zhen Guan (Sr. Partner Technology Strategist, Microsoft)

Textile-Focused Q&A System

Motivation. The textile industry in China lacked an accessible domain-specific intelligent Q&A service, while relevant information was scattered across heterogeneous web sources. We aimed to build a practical NLP system that organized textile knowledge and provided question-answering through a cloud-deployed service.

Contributions.

Learned Azure cloud architecture and model-serving workflows to support production-oriented deployment of ML systems.
Collected textile-domain Q&A data by crawling major industry websites and constructed a cleaned, serialized, and tokenized corpus.
Implemented a pre-trained BERT model for the Q&A system and adapted it to the domain-specific dataset.
Deployed the BERT-based Q&A model on Azure as an online service for demonstration and practical use.

SYSU-CMU Joint Institute of Engineering (JIE)

Feb 2017 - Aug 2017

Research & Software Engineer Intern Guangzhou, China

Advisor: Xiaoyin Tang (Professor, Southern University of Science and Technology)

Created a front-end website integrated with a back-end deep-learning model for efficient analysis of large numbers of fundus photographs.
Enabled detection of diabetic retinopathy (DR) and diabetic macular edema (DME) through seamless collaboration between front-end and back-end systems.

Computational Medical Imaging Laboratory, SYSU

Jul 2016 - Aug 2017

Research Intern Guangzhou, China

Advisor: Yao Lu (Professor, SYSU)

OHIF Viewer Web Project — Intelligent Medical Media Platform

Project

Motivation. Medical-imaging workflows often rely on fragmented tooling and cumbersome access to image data, making it difficult for clinicians and researchers to browse, manage, and analyze large collections of breast-cancer images. We aimed to build a web-based medical media platform that streamlined image visualization for clinical research.

Contributions.

Collected and organized breast-cancer data through web crawling with Scrapy to support platform development and evaluation.
Developed an OHIF-based web viewer for medical-image browsing, visualization, and interactive review, and helped deploy the project online.
Contributed to the associated SIT (College Students' Innovative Entrepreneurial Training Plan), ID: 201502059.
Implemented traditional image-processing algorithms on mobile platforms to extend accessibility of medical-image analysis workflows.

Travel Globe

Cities I have visited — drag to orbit · scroll to zoom · hover for name.

64 stops

18 countries / regions

Other Projects

LeetCode Record

Jun 2017 - Present

Honing programming skills daily

Code

Solve LeetCode algorithm questions across C, C++, Python3, Java, and Go.
Maintain a repository containing code and insights for each problem.

System-Related Conference Papers Crawler

Jun 2021 - Present

Web scraper and timeline for top-tier systems conferences

Code

Used Python, BeautifulSoup4, and Requests to scrape papers and crucial deadlines for major systems conferences.
Used Pandas and Matplotlib to create a timeline of significant submission deadlines.

DDLs

Dec 2017 - May 2018

Course project: Android personal-deadline manager

Guangzhou, China

Built DDLs, an Android application for personal deadline management, with a Java + Android Studio (MVC) front end and a Node.js + Express RESTful back-end API.
Features: deadline CRUD, SQLite local storage, WebSocket server notifications, timeline-screenshot sharing via Android native sharing, and JWT-based user authentication.

ChainLoveHelp

May 2018

South China Microsoft Hackathon Competition

Guangzhou, China

Peer-to-peer platform for university task posting and processing based on blockchain technology.
Chain-end: Ethereum-based Parity consortium blockchain operating two nodes for transactions, accounting, and consensus.
Front-end / back-end: PHP server-side scripting, Apache, MySQL.

GuangTu

Apr 2017 - May 2017

South China Microsoft Hackathon Competition

Guangzhou, China

Windows-based map-planning software using gesture recognition for enhanced user interaction.
Built with Python, Leap Motion for gesture recognition, PyQt5 for GUI, and Django for the web/backend.

Seven Seconds

Apr 2017 - May 2017

SYSU Student Software Creative Design and Innovation Development Competition

Guangzhou, China

Code

Android app to organize and record memories, published on the 360 Mobile App Market.
Architecture includes sidebar, homepage, memory management, secure login/registration, RESTful APIs, and a Node.js backend.

PVmedtech

Jul 2016 - Aug 2017

Advisor: Yao Lu (Professor, SYSU)

Guangzhou, China

Collected breast-cancer data through web crawling with Scrapy.
Developed an OHIF Viewer web project; hosted SIT (College Students' Innovative Entrepreneurial Training Plan), ID: 201502059.
Implemented traditional image-processing algorithms on mobile platforms.