Turbo Team, Together.AI Mar. 2024 - Sep. 2024
Research Intern Hybrid & San Francisco, United States
Advisors: Shuaiwen Song (Vice President of Research, Together.AI), Ben Athiwaratkun (Senior Research Scientist, Together.AI)
Industry Projects:
Turbo Projects
Motivation: Driven by a deep understanding of training & inference efficiency and effectiveness, develop practical AI modeling technologies that deliver low-latency, high-throughput performance across diverse deployment environments.
Contributions:
Integrated long-context attention / sequence parallelism (feifeibear/long-context-attention) into the training engines Axolotl and Pulsar to support extended context windows.
Authored and published ladder-residual, documenting novel architectural improvements. Designed and executed inference experiments using “gpt-fast” with CUDA Graph and PyTorch compile (“reduce-overhead” mode), achieving up to 30% end-to-end throughput gains on 70B-scale models.
Benchmarked performance across model scales (1B–405B) and TP world sizes (1, 2, 4, 8, 16) for ladder-residual, validating up to 30% end-to-end throughput improvement on 70B models with P2P enabled and up to 60% with P2P disabled.
Designed and implemented KV-cache prompt caching for the Phoenix speculator in Pulsar, stabilizing acceptance rates and reducing end-to-end latency. Resolved tokenizer chat-template issues and Docker deployment bugs for reliable multi-node operation. Benchmarked caching behavior across batch sizes and cache-hit scenarios, identified acceptance-rate variability, and optimized the cache-hit logic for consistent performance.
Explored integration of LEXICO compression techniques into Pulsar as a next-step speculative-caching enhancement.
Explored context parallelism techniques for extremely long-context inference, enabling efficient distributed attention computation across multiple devices; applied a Swift KV caching strategy to accelerate the model’s prefill phase by reducing KV memory overhead and improving end-to-end latency.
Proposed Turbo-reasoning (CREST), a training-free test-time steering method that identifies and modulates “cognitive” attention heads to curb under/over-thinking in LLM CoT, improving accuracy by up to 17.5% and cutting token usage by 37.6% across reasoning benchmarks.
Led together-coder training (OpenHands R2E-Gym & SWE-Bench pipeline): curated high-signal SWE-smith / Rebench datasets, added attention-mask + position ID fixes for Axolotl & Veomni, distilled Qwen3-480B trajectories into a 30B model via supervised fine-tuning and activation distillation, and began MoE / RL scaling for Qwen3-30B to drive higher SWE-Bench solve rates.
Drove early product/design work for a Reinforcement Learning fine-tuning service for enterprise agents: authored a multi-scenario infra plan (privacy-preserving RL loops, colocated training+inference with RDMA/InfiniBand, and fully managed end-to-end RL), assessed compute/memory/latency trade-offs, and scoped business impact (who owns the agent framework, who owns the reward loop, and how to deliver updated weights safely at scale).
Designed CARE, a conversion pipeline that upgrades pretrained attention (e.g. GQA) into multi-head latent attention (MLA) for faster inference without increasing KV-cache size.
Dolby Mar. 2024 - Sep. 2024
Research Intern Sydney, Australia
Advisors: Shuaiwen Song (Vice President of Research, Together.AI), Yucheng Liu (Research Scientist, Dolby)
Industry Projects:
Extreme Efficient Video Coding System
Motivation: Traditional codecs (H.264/H.265/AV1) lack content adaptivity and incur high compute/memory costs. Existing neural compressors are too heavy for real-time GPU and mobile streaming. There is a need for a low-latency, domain-aware solution that tailors compression to video content.
Contributions:
Invented and spearheaded E^2ND-VC (Extreme Efficient Neural Domain Video Compression), a pioneering neural video compression framework that leverages content-aware quantization to deliver low-latency, high-quality streaming on both standard GPUs and mobile devices.
Designed Optimal Brain Stride-wise Quantization (OBSQ), a domain-specific quantization methodology that selectively compresses neural network weights based on content type (e.g., video conferencing, gaming), enabling real-time 1080p performance with minimal quality loss.
Engineered a multi-kernel, sensitivity-based quantization pipeline with mixed-bit precision assignments, dynamically allocating bit depths across convolutional kernels to preserve critical visual features while maximizing compression ratios.
Collaborated closely with cross-functional teams to implement PoC streaming pipelines, demonstrating significant reductions in power consumption and bandwidth usage without compromising visual fidelity.
DeepSpeed Team, Microsoft Mar. 2023 - Feb. 2024
Research Intern Sydney, Australia
Advisors: Shuaiwen Song (Senior Principal Scientist, Microsoft), Xiaoxia Wu (Research Scientist, Microsoft), Zhewei Yao (Senior Researcher, Microsoft)
Industry Projects:
DeepSpeed4Science
Motivation: To build unique capabilities through AI system technology innovations to help domain experts to unlock today’s biggest science mysteries.
Contributions:
Developed the DeepSpeed4Science blog website using Azure MySQL, WordPress, virtual server hosting, JavaScript, HTML, CSS, AJAX, and Azure Migration. Website Link
Revised the blog content, font size, technical research architecture, and code related to GenSLMs-‘Megatron-DeepSpeed for Large-Scale AI4Science Model Training’.
DeepSpeed Chat: Easy, Fast, and Affordable RLHF Training of ChatGPT-like Models at All Scales
Motivation: ChatGPT-like models have revolutionized the AI world, but an accessible end-to-end RLHF pipeline for training powerful ChatGPT-like models is still lacking within the AI community.
Contributions:
Applied INT4 and INT8 quantization to the RLHF pipeline, increasing the batch size and speeding up the training and generation phases of RLHF without significantly compromising accuracy.
Investigated ColossalAI’s pipeline, learned how to use ColossalAI’s ZeRO-2, ZeRO-3, and GeminiDDP, and adapted them for our RLHF algorithm.
Ran 400+ benchmark experiments for DeepSpeed Chat, ColossalAI, and HuggingFace powered by native PyTorch. Summarized the results and conclusions in the DeepSpeed blog.
Revised the DeepSpeed GitHub landing page and the DeepSpeed Chat blog, and produced the DeepSpeed Chat video.
Visiting Scholar, Ph.D. student Sydney, Australia
Advisors: Shuaiwen Song (Associate Professor, USYD), Chang Xu (Associate Professor, USYD), Yibo Yang (Research Scientist, JD Explore Academy)
Research Projects:
RenAIssance: A survey into AI text to image generation in the era of large models
Motivation: Text-to-image synthesis has become increasingly popular in the AI and computer graphics world (AIGC). However, there is no comprehensive survey paper that systematically introduces the frameworks and ideas behind text-to-image techniques. We aim to fill this gap in the literature.
Contributions:
Read over 100 papers, providing a literature review for each.
Collaborated with lab classmates to write the comprehensive survey paper.
Optimization of Diffusion Model Denoising Process
Motivation: Diffusion models currently require a large number of denoising steps, which we aim to reduce. One reason for the lengthy process is the lack of a clear relationship between the noise and the trained image. Our goal is to explore additional methods to establish a connection between noise and the denoised image, beyond guidance techniques, such as incorporating text embeddings into the raw noise.
Contributions:
Developed new ideas, implemented them, and conducted comparative experiments to evaluate their performance.
Exploring Neural Collapse Phenomenon in Reinforcement Learning
Motivation: In reinforcement learning, agents may exhibit biased action selection in the environment due to incomplete understanding of the state and action distribution spaces. This research investigates whether the neural collapse phenomenon occurs in policy gradient networks as agents train with sufficient examples and examines its implications for balancing action selection in reinforcement learning agents.
Contributions:
Conducted experiments applying ETF classifiers to 5+ neural networks in 10+ discrete-action reinforcement learning environments (e.g., Atari, Gym Classic).
Derived and proved the formula and geometric properties of the policy gradient loss function.
Authored paper drafts and submitted them to the NeurIPS conference.
Sparse Kernel Design in GPU TensorCore
Motivation: With the application of pruning methods, neural network weight matrices become increasingly sparse, but there is no implementation for sparse kernels in GPU TensorCore.
Contributions:
Conducted comparative experiments between our sparse kernel and Google’s Sputnik.
Summarized experiment results and figures in the paper.
DeepSpeed I/O Framework Support for AI4Science
Motivation: AI4Science models have revolutionized the AI world. DeepSpeed can support AI4Science models deployed across multiple nodes but lacks an I/O management framework for handling large amounts of training data efficiently.
Contributions:
Investigated DeepSpeed I/O support on supercomputers (Argonne HDF5 Lustre system), analyzed data shuffling and fetching patterns for AI4Science models powered by DeepSpeed, and implemented algorithms to accelerate I/O.
Implemented a ViT model for weather prediction.
CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning
Motivation: Existing low-rank fine-tuning methods (e.g., LoRA) adapt LLMs without understanding which layers encode core knowledge vs. task-specific behavior, causing forgetting; we want a parameter-efficient method that adapts to the new task while preserving what the model already knows.
Contributions:
Designed the experimental methodology for evaluating task-aware parameter-efficient fine-tuning (dataset selection, baselines, and metrics across math/code/instruction-following).
Implemented and executed large-scale experiments to compare CorDA against PEFT baselines, and helped collect and analyze empirical results used in the paper.
Research Associate Guangzhou, China
Advisors: Dan Huang (Associate Professor, SYSU), Yunfei Du, Yutong Lu (Professor, SYSU)
Research Projects:
Pre-Expedite: Use Hierarchical Structure Space for Improving the Performance of Accessing Small Files in Parallel File System - Undergraduate Thesis
Motivation: Implemented an approach to reduce clients’ I/O communication with MDS, leveraging minimal additional client-side resources. Ensured high usability without modifying POSIX standards.
Contributions:
Investigated the I/O bottleneck in parallel/distributed file systems for Big Data and Artificial Intelligence applications, identifying intensive metadata communication with the metadata server as a primary issue.
Utilized POSIX to create ZERO file blocks (Loop Device). Established a VFS within the ZERO file blocks, allowing each user to store small files in their designated ZERO file blocks.
HybridShare: Universal Resource Scheduling for Hybrid Jobs
Motivation: CPU- and GPU-centric applications allocate resources exclusively, leading to inefficient utilization of heterogeneous resources.
Contributions:
Analyzed the feasibility of co-locating modern workflows and applications on the same physical machine to share resources.
Proposed HybridShare algorithms that enable jobs with different resource preferences (e.g., GPU-centric, CPU-centric, memory-intensive) to be co-located on the same node and share hardware resources through Slurm, Mesos, and Kubernetes.
MAEM - Multiple Applications co-Execution time Estimation
Motivation: Few works accurately estimate the slowdown of CPU/GPU applications based on the characteristics of the applications and the hardware architecture.
Contributions:
Conducted a literature review on application profiling, interference and slowdown estimation, and interference-aware scheduling.
Gathered resource consumption data for various benchmarks and analyzed their behavior.
Institute of Advanced Networks and Computing Systems, SYSU Oct. 2018 - Mar. 2019
Research Intern Guangzhou, China
Advisor: Hejun Wu (Associate Professor, SYSU)
Research Projects:
EmReal: A Digital Twin Framework of Emulated and Real Components for Robots with Reinforcement Learning
Motivation: Pioneered a digital twin framework for robots utilizing reinforcement learning (RL), bridging the gap between simulations and real-world deployments. Developed solutions to effectively transition RL algorithms from simulators to actual robots, advancing the field beyond its nascent stage.
Contributions:
Conducted a survey on robotics simulator systems and reinforcement learning algorithms.
Designed and implemented a one-legged robot, integrating real and emulated components using XML, Python, ROS, and Arduino C programming.
Created a digital twin framework for robotic systems, employing reinforcement learning (RL) and seamlessly blending emulation, pre-training, connectivity, and hardware adaptation using ROS and PyBullet.
Co-authored a book on deep learning in reinforcement learning, awaiting publication.
Tencent Holdings Ltd. Weixin Group & Dep. of CS UIUC Jul. 2018 - Jul. 2020
Research Intern, Testing, Technical-Architecture Department Champaign, IL, US & Guangzhou, China
Advisors: Tao Xie (Professor and Willett Faculty Scholar, UIUC), Yuetang Deng (Director)
Industry Projects:
JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games
Motivation: In cases of plagiarism for mini-games, deeply obfuscated code cloned from the original code often embodies malicious code segments and copyright infringements, posing great challenges for existing plagiarism detection tools. To address these challenges, we design and implement JSidentify, a hybrid framework to detect plagiarism among online mini games.
Contributions:
Worked under the guidance of Prof. Tao Xie, focusing on intermediate-representation analysis in the interpreter of Node.js’s V8 engine.
Conducted a literature review on code plagiarism detection methods and evaluations of clone detection tools.
Developed an edit distance estimation and network flow algorithm to measure similarity in bytecode generated by V8’s Ignition interpreter and TurboFan compiler.
Designed a priority-queue-based framework to consolidate multiple plagiarism detection algorithms.
Co-authored a paper titled “JSidentify: A Hybrid Framework for Detecting Plagiarism Among JavaScript Code in Online Mini Games.”
Microsoft (China) Co., Ltd. Guangzhou Branch Sep. 2018 - Feb. 2019
Project Assistant to Senior Cloud Architect Guangzhou, China
Advisor: Zhen Guan (Sr. Partner Technology Strategist, Microsoft)
Gained proficiency in Azure’s architecture and utilized Azure for training multiple machine learning models.
Developed a textile-focused Q&A system to address a market gap in China:
Collected Q&A data by crawling prominent domestic textile websites.
Preprocessed data through cleaning, serializing, and tokenizing text into a corpus.
Implemented a pre-trained BERT model for the Q&A system.
Deployed the BERT model on Azure as a service.
SYSU-CMU Joint Institute of Engineering (JIE) Feb. 2017 - Aug. 2017
Research & Software Engineer Intern Guangzhou, China
Advisor: Xiaoyin Tang (Professor, Southern University of Science and Technology)
Created a front-end website to integrate with a back-end deep learning model for efficient analysis of numerous fundus photographs.
Enabled detection of diabetic retinopathy (DR) and diabetic macular edema (DME) through seamless collaboration between the front-end and back-end systems.
Computational Medical Imaging Laboratory, SYSU Jul. 2016 - Aug. 2017
Research Intern Guangzhou, China
Advisor: Yao Lu (Professor, SYSU)
Collected breast cancer data through web crawling with Scrapy.
Developed an OHIF Viewer web project, available at LINK.
Hosted a SIT (College Students’ Innovative Entrepreneurial Training Plan), ID: 201502059.
Implemented traditional image processing algorithms on mobile platforms.
LeetCode Record Jun. 2017 - Present
Honing Programming Skills Daily
Used C, C++, Python 3, Java, and Go to solve LeetCode algorithm questions of my choosing.
Maintained a repository containing my code and insights for each LeetCode problem.
System Related Conference Papers Crawler Jun. 2021 - Present
Web Scraper and Timeline for Top-tier Systems Conference
Leveraged Python, BeautifulSoup4, and Requests to scrape papers and key deadlines for major computer systems conferences.
Employed Pandas and Matplotlib to create a timeline representing significant computer system paper submission deadlines.
DDLs Dec. 2017 - May 2018
Course Project: Design and Development of Android Applications Guangzhou, China
BACKEND CODE LINK FRONTEND CODE LINK
Developed DDLs, an Android application for personal deadline management, using Java and Android Studio with an MVC architecture for the front-end, and Node.js with Express.js for the back-end RESTful API.
Implemented deadline management in a timeline view (adding, completing, reopening, and deleting deadlines) with SQLite for local storage, server notifications over WebSocket, timeline screenshot sharing through Android’s native sharing capabilities, and user registration and login with JSON Web Tokens (JWT).
ChainLoveHelp May 2018
South China Microsoft Hackathon Competition Guangzhou, China
ChainLoveHelp is dedicated to providing a peer-to-peer platform for university task posting and processing based on blockchain technology.
For the chain-end, employed Ethereum-based Parity to construct a consortium blockchain, operating two nodes on the chain for transaction processing, accounting, and consensus.
For the front-end, implemented a robust technology stack using PHP for server-side scripting, Apache as the web server, and MySQL for database management.
GuangTu Apr. 2017 - May 2017
South China Microsoft Hackathon Competition Guangzhou, China
GuangTu is a Windows-based map planning software that utilizes gesture recognition technology for enhanced user interaction.
The application was developed using Python for programming, Leap Motion for gesture recognition, PyQt5 for creating the graphical user interface, and Django for building the web framework and backend functionality.
Seven Seconds Apr. 2017 - May 2017
SYSU Student Software Creative Design and Innovation Development Competition Guangzhou, China
Designed and developed an Android App to organize and record memories, leveraging the capabilities of Android Studio and Java. Successfully published the app on the 360 Mobile App Market.
Implemented a robust mobile App architecture, encompassing a user-friendly sidebar, homepage, memory management, as well as secure login and registration modules. Employed advanced data handling techniques, RESTful APIs, and seamless integration with a Node.js backend for efficient data processing and storage.
PVmedtech Jul. 2016 - Aug. 2017
Advisor: Yao Lu (Professor, SYSU) Guangzhou, China
Collected breast cancer data through web crawling with Scrapy.
Developed an OHIF Viewer web project, available at LINK.
Hosted a SIT (College Students’ Innovative Entrepreneurial Training Plan), ID: 201502059.
Implemented traditional image processing algorithms on mobile platforms.