Turbo Team, Together.AI Mar. 2024 - Present
Senior Research Scientist Remote & San Francisco, United States
Advisor: Ben Athiwaratkun (Senior Research Scientist, Together.AI), Shuaiwen Song (Vice President of Research, Together.AI)
Efficient ML Algorithms
Ladder-Residual
Motivation: Large-model inference under tensor parallelism often suffers from communication stalls and weak overlap between communication and computation; we sought an architecture-runtime
co-design that improves throughput without sacrificing model quality.
Contributions:
Co-conceived the parallelism-aware residual design and helped shape the paper’s system and evaluation story.
Implemented and optimized the gpt-fast inference path with CUDA Graphs and PyTorch compile (“reduce-overhead”) for large-model serving.
Benchmarked performance across model scales (1B–405B) and TP world sizes (1, 2, 4, 8, 16), validating up to 30% end-to-end throughput improvement on 70B models with P2P enabled and up to 60% with P2P disabled.
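The dataflow change behind Ladder-Residual can be illustrated with a minimal pure-Python sketch; the placeholder `blocks` and `comm` callables stand in for transformer blocks and the TP all-reduce, so this is a schematic of the reordering, not the paper's implementation:

```python
def standard_blocks(x, blocks, comm):
    # vanilla residual stream: each block must wait for the all-reduce
    # (comm) of its own output before the next block can start
    for f in blocks:
        x = x + comm(f(x))
    return x

def ladder_blocks(x, blocks, comm):
    # ladder residual: block i+1 reads the stream *before* block i's
    # communicated output has arrived, so comm(pending) can overlap
    # with f(x) in a real tensor-parallel runtime
    pending = 0.0
    for f in blocks:
        out = f(x)              # compute, independent of comm(pending)
        x = x + comm(pending)   # previous block's output rejoins late
        pending = out
    return x + comm(pending)
```

Note that with the same weights the two variants produce different values: ladder residual is an architecture change that must be trained as such, not a numerically identical rewrite, which is why model quality had to be validated alongside throughput.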
CREST (Turbo-reasoning)
Motivation: Reasoning models often under-think or over-think at test time, wasting tokens or missing correct solutions; we sought a training-free intervention that could be deployed in
mainstream serving stacks.
Contributions:
Co-developed the core idea of a training-free test-time steering method that identifies and modulates cognitive attention heads, improving accuracy by up to 17.5% and reducing token usage by 37.6% across reasoning benchmarks.
Designed deployment paths for integrating CREST into vLLM and SGLang.
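The head-modulation step can be sketched in a few lines; the function and argument names here are hypothetical stand-ins, and the real method identifies cognitive heads offline and applies the scaling inside the serving engine's attention path:

```python
def steer_heads(head_outputs, cognitive_heads, scale):
    # scale only the identified "cognitive" attention heads at inference
    # time; all other heads pass through unchanged (training-free)
    return [h * scale if i in cognitive_heads else h
            for i, h in enumerate(head_outputs)]
```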
CARE
Motivation: MLA-style attention can improve serving efficiency, but most pretrained checkpoints use GQA/MHA and cannot directly benefit; we sought a practical conversion path that preserves quality while lowering inference cost.
Contributions:
Developed the core idea and empirical framing for upgrading pretrained attention into MLA-compatible forms.
Proposed a conversion pipeline that upgrades pretrained attention (e.g., GQA) into multi-head latent attention (MLA) for faster inference without increasing KV-cache size.
Ran the full experimental suite and carried out vLLM integration and theoretical analysis.
SQUEEZE THINK
Motivation: Recursive self-aggregation improves reasoning quality, but uniform compute allocation across generation and aggregation wastes cost on easy subsets and under-allocates recovery on hard subsets.
Contributions:
Helped develop a multi-model orchestration view of recursive self-aggregation, routing generation and aggregation between large and small models based on cross-model confidence.
Owned coding-benchmark execution and evaluation pipelines, especially for LiveCodeBench V6, and supported ablations on routing thresholds and aggregation behavior across AIME 2025 and HMMT 2025.
Demonstrated 30–40% compute reduction at matched accuracy or 5–7 point accuracy gains at equivalent compute.
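The routing idea reduces to an escalation policy between a cheap and an expensive model; a minimal sketch, with all callables as hypothetical stand-ins for the actual generators and the cross-model confidence score:

```python
def route_generation(problem, small_model, large_model, confidence, threshold=0.8):
    # try the cheap model first; keep its answer only when the
    # confidence score clears the threshold, otherwise escalate
    draft = small_model(problem)
    if confidence(problem, draft) >= threshold:
        return draft, "small"
    return large_model(problem), "large"
```

In the actual system the same decision is applied to both generation and aggregation steps of recursive self-aggregation, which is where the compute savings on easy subsets come from.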
Agent Evolve
Motivation: Current LLM-based multiagent systems are largely static after deployment and lack mechanisms for continual adaptation across agents, skills, and populations.
Contributions:
Built a bio-inspired LLM multiagent framework with pheromone-style memory, evolutionary division of labor, and skill inheritance for open-ended population adaptation.
Studied population-level adaptation through competition, selection, and cross-generation strategies.
Explored integration of LEXICO compression techniques.
Prototyped vocabulary-pruned speculators and mixed-architecture speculator designs.
Explored diffusion LLMs that interleave self-verification with token generation.
Investigated diffusion-style MoE routers for smoother expert selection.
Investigated diffusion-style speculator design
Efficient ML Systems
Training System: XoRL (RL Training System), Axolotl (SFT Training System)
Motivation: Building an RL and SFT training stack for coding and reasoning agents required more than model fine-tuning: it needed an end-to-end system that coupled sandboxed environments, distributed rollout workers, and multi-node training plus serving infrastructure while staying stable under long-context, MoE, and rapidly changing model variants.
Contributions:
Built much of the training-side RL framework, including agent PPO trainers, asynchronous rollout and pipeline-training paths, and the execution flow that converts multi-turn agent-environment interaction into PPO and GRPO training batches.
Owned the training pipeline that ingests rollout trajectories, computes advantage, and performs policy updates plus rollout-model weight synchronization for coding-agent post-training.
Implemented asynchronous rollout, replay-queue mini-batching, and router-assisted batching between rollout and training workers to overlap trajectory generation with policy optimization and improve distributed training throughput.
Developed trajectory/data transforms, token-level loss masks, stepwise-vs-trajectory advantage handling, rejection sampling, and batch balancing to improve GRPO signal quality and training stability.
Scaled long-context training recipes to 16K–32K contexts using Ulysses sequence parallelism, remove-padding, chunked prefill, and per-GPU token-budget tuning for DeepCoder and DeepScaleR-style runs.
Implemented sequence-parallel (SP) compatibility across the training stack so long-context post-training paths worked correctly with distributed attention, packed sequences, and rollout-to-training data flow.
Built SP-compatible MoE-LoRA kernel paths to support efficient distributed post-training for expert models without breaking sequence-parallel execution.
Integrated QuACK fused kernels into XoRL to improve kernel efficiency and support higher-throughput post-training recipes.
Added Qwen3.5 support and completed model bring-up across configs, training paths, and distributed recipes for reliable experimentation.
Diagnosed and fixed multi-node training failures, including position_ids, cu_seqlens, attention-mask, and MoE dispatch issues that destabilized distributed recipes across evolving model families.
Integrated long-context attention (Ulysses, Ring Attention) into Axolotl and supported SFT data flow from successful trajectories to extend supervised post-training to larger context windows.
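The token-level loss masking used for multi-turn agent trajectories can be sketched as follows; this is an illustrative reduction over one flattened trajectory, not XoRL's actual trainer code:

```python
def masked_policy_loss(token_logps, advantages, loss_mask):
    # token-level loss mask for agent rollouts: only tokens the policy
    # generated (mask=1) contribute to the PPO/GRPO objective;
    # environment and tool-output tokens (mask=0) are excluded so they
    # neither receive gradient nor dilute the advantage signal
    num = sum(-lp * a * m for lp, a, m in zip(token_logps, advantages, loss_mask))
    den = max(sum(loss_mask), 1)   # normalize by unmasked tokens only
    return num / den
```

Normalizing by the unmasked token count (rather than sequence length) keeps loss magnitudes comparable across trajectories whose tool-output spans differ in length, which matters for batch balancing.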
Inference System: Pulsar & SGLang
Motivation: High-throughput serving requires lower KV overhead and more stable speculative decoding across cache-hit patterns, batch sizes, and multi-node deployments.
Contributions:
Applied a SwiftKV-style caching strategy to accelerate prefill by reducing KV memory overhead and improving end-to-end latency.
Designed and implemented KV-cache prompt caching for the Phoenix speculator in Pulsar, stabilizing acceptance rates and reducing end-to-end latency.
Resolved tokenizer chat-template issues and Docker deployment bugs for reliable multi-node operation, then benchmarked cache behavior across batch sizes and cache-hit scenarios to explain acceptance-rate variability and optimize cache-hit logic.
Implemented Llama 4 support for sliding-window attention.
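Prompt caching for a speculator hinges on how much of a new prompt's KV prefix matches what is already cached; a schematic sketch of that check, assuming a paged-KV layout where only whole blocks can be shared (block size and names are illustrative, not Pulsar's internals):

```python
def reusable_prefix_len(cached_tokens, prompt_tokens, block_size=16):
    # longest common token prefix between the cached entry and the new
    # prompt, rounded down to the paged-KV block size so only complete
    # KV blocks are reused; partial blocks must be recomputed
    n = 0
    for a, b in zip(cached_tokens, prompt_tokens):
        if a != b:
            break
        n += 1
    return (n // block_size) * block_size
```

Acceptance-rate variability across cache-hit scenarios largely traces back to how much of the prompt lands on this reusable-prefix path versus fresh prefill.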
AgentGo
Motivation: Tool-using agents alternate between long-context reasoning and external actions, but request-centric runtimes either evict useful KV state too early or waste memory by pinning it too
long.
Contributions:
Co-developed the core idea of treating multi-turn agent workflows as first-class programs rather than isolated requests.
Helped build the staged system path from telemetry and shadow prediction to offline replay, observability, and config-gated runtime integration for prediction-aware scheduling.
LCFS
Motivation: Multi-tenant LLM serving needs hierarchical fairness and performance isolation across shared instances and clusters without sacrificing throughput.
Contributions:
Contributed to design discussions around hierarchical fairness, vruntime-style accounting, and weight partitioning across distributed serving instances.
Participated in experiments evaluating performance isolation and fairness under multi-tenant LLM serving workloads.
Model Related
Ladder-Residual
Motivation: High-quality coding agents require strong trajectory data, stable post-training pipelines, and task-aligned optimization objectives for code generation.
Contributions:
Led the training pipeline for OpenHands R2E-Gym & SWE-Bench-scale data: curated high-signal SWE-smith / Rebench examples and fixed attention-mask plus position-ID issues in XoRL.
Distilled Qwen3-480B trajectories into a 30B coding model via supervised fine-tuning and activation distillation, then initiated MoE / RL scaling for Qwen3-30B to improve SWE-Bench solve rates.
Designed per-token loss formulations for coding-trajectory distillation and model-quality improvement.
Dolby Mar. 2024 - Sep. 2024
Research Intern Sydney, Australia
Advisor: Yucheng Liu (Research Scientist, Dolby), Shuaiwen Song (Vice President of Research, Together.AI)
Extreme Efficient Video Coding System
Motivation: Traditional codecs (H.264/H.265/AV1) lack content adaptivity and incur high compute/memory costs, while existing neural compressors are too heavy for real-time GPU and mobile streaming. We needed a low-latency, domain-aware solution that tailors compression to video content.
Contributions:
Invented and spearheaded E^2ND-VC (Extreme Efficient Neural Domain Video Compression), a pioneering neural video compression framework that leverages content-aware quantization to deliver low-latency, high-quality streaming on both standard GPUs and mobile devices.
Designed Optimal Brain Stride-wise Quantization (OBSQ), a domain-specific quantization methodology that selectively compresses neural network weights based on content type (e.g., video conferencing, gaming), enabling real-time 1080p performance with minimal quality loss.
Engineered a multi-kernel, sensitivity-based quantization pipeline with mixed-bit precision assignments, dynamically allocating bit depths across convolutional kernels to preserve critical visual features while maximizing compression ratios.
Collaborated closely with cross-functional teams to implement PoC streaming pipelines, demonstrating significant reductions in power consumption and bandwidth usage without compromising visual fidelity.
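The sensitivity-driven mixed-bit assignment can be sketched as a greedy allocation under a total bit budget; this is an illustrative simplification, not the OBSQ algorithm itself, and all names are hypothetical:

```python
def assign_bits(sensitivities, budget_bits, choices=(8, 4, 2)):
    # greedy sensitivity-based mixed precision: every kernel starts at
    # the narrowest width, then kernels are widened in descending order
    # of sensitivity while the total bit budget allows
    bits = [choices[-1]] * len(sensitivities)
    remaining = budget_bits - sum(bits)
    for i in sorted(range(len(sensitivities)), key=lambda i: -sensitivities[i]):
        for b in choices:          # try widest format first
            extra = b - bits[i]
            if extra <= remaining:
                remaining -= extra
                bits[i] = b
                break
    return bits
```

The effect is that kernels whose quantization most degrades visual quality keep high precision, while insensitive kernels absorb the aggressive compression.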
DeepSpeed Team, Microsoft Mar. 2023 - Feb. 2024
Research Intern Sydney, Australia
Advisor: Xiaoxia Wu (Research Scientist, Microsoft), Zhewei Yao (Senior Researcher, Microsoft), Shuaiwen Song (Senior Principal Scientist, Microsoft)
DeepSpeed4Science
Motivation: To build unique capabilities through AI system technology innovations that help domain experts unlock today’s biggest science mysteries.
Contributions:
Developed deepspeed4science’s blog website using Azure MySQL, WordPress, virtual server hosting, JavaScript, HTML, CSS, AJAX, and Azure migration. Website Link
Revised the blog content, font size, technical research architecture, and code related to GenSLMs "Megatron-DeepSpeed for Large-Scale AI4Science Model Training".
DeepSpeed Chat: Easy, Fast, and Affordable RLHF Training of ChatGPT-like Models at All Scales
Motivation: ChatGPT-like models have revolutionized the AI world, but an accessible end-to-end RLHF pipeline for training powerful ChatGPT-like models is still lacking within the AI community.
Contributions:
Applied INT4 and INT8 quantization to the RLHF pipeline, increasing batch size and improving the speed of the training and generation phases of RLHF without significantly compromising accuracy.
Investigated ColossalAI’s pipeline, learned to use its ZeRO-2, ZeRO-3, and GeminiDDP strategies, and adapted them for our RLHF algorithm.
Ran 400+ benchmark experiments for DeepSpeed Chat, ColossalAI, and HuggingFace powered by native PyTorch. Summarized the results and conclusions in the DeepSpeed blog.
Revised DeepSpeed GitHub Landing Page, DeepSpeed Chat Blog, and produced DeepSpeed Chat video.
Visiting Scholar, Ph.D. student Sydney, Australia
Advisor: Shuaiwen Song (Associate Professor, USYD), Chang Xu (Associate Professor, USYD), Yibo Yang (Research Scientist in JD Explore Academy)
Research Projects:
RenAIssance: A survey into AI text to image generation in the era of large models
Motivation: Text-to-image synthesis has become increasingly popular in the AI and computer graphics world (AIGC). However, there is no comprehensive survey paper that systematically introduces the frameworks and ideas behind text-to-image techniques. We aim to fill this gap in the literature.
Contributions:
Read over 100 papers, providing a literature review for each.
Collaborated with lab classmates to write the comprehensive survey paper.
Optimization of Diffusion Model Denoising Process
Motivation: Diffusion models currently require a large number of denoising steps, which we aim to reduce. One reason for the lengthy process is the lack of a clear relationship between the noise and the trained image. Our goal is to explore additional methods to establish a connection between noise and the denoised image, beyond guidance techniques, such as incorporating text embeddings into the raw noise.
Contributions:
Developed innovative ideas, implemented them, and conducted comparative experiments to evaluate their performance.
Exploring Neural Collapse Phenomenon in Reinforcement Learning
Motivation: In reinforcement learning, agents may exhibit biased action selection in the environment due to incomplete understanding of the state and action distribution spaces. This research investigates whether the neural collapse phenomenon occurs in policy gradient networks as agents train with sufficient examples and examines its implications for balancing action selection in reinforcement learning agents.
Contributions:
Conducted experiments applying ETF classifiers to 5+ neural networks in 10+ discrete-action reinforcement learning environments (e.g., Atari, Gym classic control).
Derived and proved the formula and geometric properties of the policy-gradient loss function.
Authored paper drafts and submitted to the NeurIPS conference
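For reference, the objective whose loss geometry the derivation concerns is the standard policy-gradient form (advantage-weighted log-likelihood over trajectories):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
      A^{\pi_\theta}(s_t, a_t)
    \right]
```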
Sparse Kernel Design in GPU TensorCore
Motivation: With the application of pruning methods, neural network weight matrices become increasingly sparse, but there is no implementation for sparse kernels in GPU TensorCore.
Contributions:
Conducted comparative experiments between our sparse kernel and Google’s Sputnik.
Summarized experiment results and figures in the paper.
DeepSpeed I/O Framework Support for AI4Science
Motivation: AI4Science models have revolutionized the AI world. DeepSpeed can support AI4Science models deployed across multiple nodes but lacks an I/O management framework for handling large amounts of training data efficiently.
Contributions:
Investigated DeepSpeed I/O support on supercomputers (Argonne’s HDF5/Lustre system), analyzed data shuffling and fetching patterns for AI4Science models powered by DeepSpeed, and implemented algorithms to accelerate I/O.
Implemented a ViT model for weather prediction.
CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning
Motivation: Existing low-rank fine-tuning methods (e.g., LoRA) adapt LLMs without understanding which layers encode core knowledge vs. task-specific behavior, causing forgetting; we want a parameter-efficient method that adapts to the new task while preserving what the model already knows.
Contributions:
Designed the experimental methodology for evaluating task-aware parameter-efficient fine-tuning (dataset selection, baselines, and metrics across math/code/instruction-following).
Implemented and executed large-scale experiments to compare CorDA against PEFT baselines, and helped collect and analyze empirical results used in the paper.
Research Associate Guangzhou, China
Advisor: Dan Huang (Associate Professor, SYSU), Yunfei Du, Yutong Lu (Professor, SYSU)
Research Projects:
Pre-Expedite: Use Hierarchical Structure Space for Improving the Performance of Accessing Small Files in Parallel File System - Undergraduate Thesis
Motivation: Accessing small files in parallel file systems incurs intensive I/O communication between clients and the MDS; we aimed to reduce this communication using minimal additional client-side resources while ensuring high usability without modifying POSIX standards.
Contributions:
Investigated the I/O bottleneck in parallel/distributed file systems for Big Data and Artificial Intelligence applications, identifying intensive metadata communication with the metadata server as a primary issue.
Utilized POSIX to create ZERO file blocks (Loop Device). Established a VFS within the ZERO file blocks, allowing each user to store small files in their designated ZERO file blocks.
HybridShare: Universal Resource Scheduling for Hybrid Jobs
Motivation: CPU- and GPU-centric applications allocate resources exclusively, leading to inefficient utilization of heterogeneous resources.
Contributions:
Analyzed the feasibility of co-locating modern workflow applications on the same physical machine to share resources.
Proposed HybridShare algorithms that co-locate jobs with different resource preferences (e.g., GPU-centric, CPU-centric, memory-intensive) on the same node to share hardware resources through Slurm, Mesos, and Kubernetes.
MAEM - Multiple Applications co-Execution time Estimation
Motivation: Few existing works accurately estimate the slowdown of co-executing CPU/GPU applications based on application characteristics and hardware architecture.
Contributions:
Conducted a literature review on application profiling, interference and slowdown estimation, and interference-aware scheduling.
Gathered resource consumption data for various benchmarks and analyzed their behavior.
Institute of Advanced Networks and Computing Systems, SYSU Oct. 2018 - Mar. 2019
Research Intern Guangzhou, China
Advisor: Hejun Wu (Associate Professor, SYSU)
Research Projects:
EmReal: A Digital Twin Framework of Emulated and Real Components for Robots with Reinforcement Learning
Motivation: Reinforcement-learning (RL) policies trained in simulators rarely transfer cleanly to real robots, and the field of digital twins for RL-driven robotics was still in its nascent stage. We aimed to build a framework that bridges simulation and real-world deployment by combining emulated and real components.
Contributions:
Conducted a survey on robotics simulator systems and reinforcement learning algorithms.
Designed and implemented a one-legged robot, integrating real and emulated components using XML, Python, ROS, and Arduino C programming.
Created a digital twin framework for robotic systems, employing reinforcement learning (RL) and seamlessly blending emulation, pre-training, connectivity, and hardware adaptation using ROS and PyBullet.
Co-authored a book on deep learning in reinforcement learning, awaiting publication.
Tencent Holdings Ltd. Weixin Group & Dep. of CS UIUC Jul. 2018 - Jul. 2020
Research Intern, Testing, Technical-Architecture Department Champaign, IL, US & Guangzhou, China
Advisor: Tao Xie (Professor and Willett Faculty Scholar, UIUC), Yuetang Deng (Director)
Industry Projects:
JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games
Motivation: In cases of plagiarism for mini-games, deeply obfuscated code cloned from the original code often embeds malicious code segments and copyright infringements, posing great challenges for existing plagiarism detection tools. To address these challenges, we designed and implemented JSidentify, a hybrid framework to detect plagiarism among online mini games.
Contributions:
Worked under the guidance of Prof. Tao Xie, focusing on intermediate-representation analysis in the V8 interpreter used by Node.js.
Conducted literature review on code plagiarism detection methods and evaluations of clone detection tools.
Developed an edit-distance estimation and network-flow algorithm to measure similarity in bytecode generated by V8’s Ignition interpreter and TurboFan pipeline.
Designed a priority-queue-based framework to consolidate multiple plagiarism detection algorithms.
Co-authored a paper titled "JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games."
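The similarity primitive underlying the bytecode comparison is edit distance over opcode sequences; a standard single-row Levenshtein DP illustrates it (this is the textbook algorithm, not the exact JSidentify estimator):

```python
def edit_distance(a, b):
    # single-row Levenshtein DP over two opcode sequences: dp[j] holds
    # the distance between the processed prefix of a and b[:j]
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i          # prev = diagonal cell
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,              # deletion
                dp[j - 1] + 1,          # insertion
                prev + (x != y),        # substitution / match
            )
    return dp[len(b)]
```

Exact edit distance is quadratic, which is why the project paired an estimation variant with a priority-queue framework to keep detection tractable at scale.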
Microsoft (China) Co., Ltd., Guangzhou Branch Sep. 2018 - Feb. 2019
Project Assistant to Senior Cloud Architect Guangzhou, China
Advisor: Zhen Guan (Sr. Partner Technology Strategist, Microsoft)
Textile-Focused Q&A System
Motivation: The textile industry in China lacked an accessible domain-specific intelligent Q&A service, while relevant information was scattered across heterogeneous web sources and difficult for users to query efficiently. We aimed to build a practical NLP system that could organize textile knowledge and
provide question-answering support through a cloud-deployed service.
Contributions:
Learned Azure cloud architecture and model-serving workflows to support production-oriented deployment of machine-learning systems.
Collected textile-domain Q&A data by crawling major industry websites and constructed a cleaned, serialized, and tokenized corpus.
Implemented a pre-trained BERT model for the Q&A system and adapted it to the domain-specific dataset.
Deployed the BERT-based Q&A model on Azure as an online service for demonstration and practical use.
SYSU-CMU Joint Institute of Engineering (JIE) Feb. 2017 - Aug. 2017
Research & Software Engineer Intern Guangzhou, China
Advisor: Xiaoyin Tang (Professor, Southern University of Science and Technology)
Created a front-end website to integrate with a back-end deep learning model for efficient analysis of numerous fundus photographs.
Enabled detection of diabetic retinopathy (DR) and diabetic macular edema (DME) through seamless collaboration between the front-end and back-end systems.
Computational Medical Imaging Laboratory, SYSU Jul. 2016 - Aug. 2017
Research Intern Guangzhou, China
Advisor: Yao Lu (Professor, SYSU)
OHIF Viewer Web Project - Intelligent Medical Media Platform
Motivation: Medical imaging workflows often rely on fragmented tooling and cumbersome access to image data, making it difficult for clinicians and researchers to browse, manage, and analyze large collections of breast-cancer images efficiently. We aimed to build a web-based medical media platform that streamlined image visualization and supported practical clinical research usage.
Contributions:
Collected and organized breast-cancer data through web crawling with Scrapy to support platform development and evaluation.
Developed an OHIF-based web viewer for medical-image browsing, visualization, and interactive review, and helped deploy the project online.
Contributed to the associated SIT (College Students’ Innovative Entrepreneurial Training Plan), ID: 201502059, helping integrate project components into a usable prototype.
Implemented traditional image-processing algorithms on mobile platforms to extend accessibility of medical-image analysis workflows.
LeetCode Record Jun. 2017 - Present
Honing Programming Skills Daily
Utilized languages such as C, C++, Python 3, Java, and Go to solve LeetCode algorithm questions based on my preference.
Maintained a repository containing my code and insights for each LeetCode problem.
System Related Conference Papers Crawler Jun. 2021 - Present
Web Scraper and Timeline for Top-tier Systems Conference
Leveraged Python, BeautifulSoup4, and Requests to scrape papers and key deadlines for major computer-systems conferences.
Employed Pandas and Matplotlib to create a timeline representing significant computer system paper submission deadlines.
DDLs Dec. 2017 - May. 2018
Course Project: Design and Development of Android Applications Guangzhou, China
BACKEND CODE LINK FRONTEND CODE LINK
Developed DDLs, an Android application for personal deadline management, using Java and Android Studio for the front-end, incorporating MVC architecture, and NodeJS with Express.js for the back-end RESTful API.
Implemented features including deadline administration with CRUD operations (adding, completing, and deleting deadlines in a timeline backed by SQLite local storage), marking completed deadlines as unfinished, server notifications over WebSocket, timeline-screenshot sharing via Android’s native sharing capabilities, and user authentication (registration and login) with JSON Web Tokens (JWT).
ChainLoveHelp May. 2018 - May. 2018
South China Microsoft Hackathon Competition Guangzhou, China
ChainLoveHelp is dedicated to providing a peer-to-peer platform for university task posting and processing based on blockchain technology.
For the chain end, employed the Ethereum-based Parity client to construct a consortium blockchain, operating two nodes for transaction processing, accounting, and consensus.
For the web end, implemented a technology stack using PHP for server-side scripting, Apache as the web server, and MySQL for database management.
GuangTu Apr. 2017 - May. 2017
South China Microsoft Hackathon Competition Guangzhou, China
Guangtu is a Windows-based map planning software that utilizes gesture recognition technology for enhanced user interaction.
The application was developed using Python for programming, Leap Motion for gesture recognition, PyQt5 for creating the graphical user interface, and Django for building the web framework and backend functionality.
Seven Seconds Apr. 2017 - May. 2017
SYSU Student Software Creative Design and Innovation Development Competition Guangzhou, China
Designed and developed an Android App to organize and record memories, leveraging the capabilities of Android Studio and Java. Successfully published the app on the 360 Mobile App Market.
Implemented a robust mobile App architecture, encompassing a user-friendly sidebar, homepage, memory management, as well as secure login and registration modules. Employed advanced data handling techniques, RESTful APIs, and seamless integration with a Node.js backend for efficient data processing and storage.
PVmedtech Jul. 2016 - Aug. 2017
Advisor: Yao Lu (Professor, SYSU) Guangzhou, China
Collected breast cancer data through web crawling Scrapy.
Developed an OHIF Viewer web project, available at LINK.
Hosted a SIT (College Students’ Innovative Entrepreneurial Training Plan), ID: 201502059.
Implemented traditional image processing algorithms on mobile platforms.