Experience & Projects

Professional Experience & Industry Research Projects

Turbo Team, Together.AI Mar. 2024 - Sep. 2024

Research Intern Hybrid & San Francisco, United States

Advisors: Shuaiwen Song (Vice President of Research, Together.AI), Ben Athiwaratkun (Senior Research Scientist, Together.AI)

Industry Projects:

Turbo Projects

Motivation: Driven by a deep understanding of training & inference efficiency and effectiveness, develop practical AI modeling technologies that deliver low-latency, high-throughput performance across diverse deployment environments.

Contributions:

Integrated long-context attention / sequence parallelism (feifeibear/long-context-attention) into the training engine Axolotl & Pulsar to support extended context windows.
Authored and published ladder-residual, documenting novel architectural improvements. Designed and executed inference experiments using “gpt-fast” with CUDA Graph and PyTorch compile (“reduce-overhead” mode), achieving up to 30% end-to-end throughput gains on 70B-scale models.
Benchmarked performance across model scales (1B–405B) and TP world sizes (1, 2, 4, 8, 16) for ladder-residual, validating up to 30% end-to-end throughput improvement on 70B models with P2P enabled and up to 60% with P2P disabled.
Designed and implemented KV-cache prompt caching for the Phoenix speculator in Pulsar, stabilizing acceptance rates and reducing end-to-end latency. Resolved tokenizer chat-template issues and Docker deployment bugs for reliable multi-node operation. Benchmarked caching behavior across batch sizes and cache-hit scenarios, identified acceptance-rate variability, and optimized the cache-hit logic for consistent performance.
Explored integration of LEXICO compression techniques into Pulsar as a next-step speculative-caching enhancement.
Explored context parallelism techniques for extremely long-context inference, enabling efficient distributed attention computation across multiple devices; applied a Swift KV caching strategy to accelerate the model’s prefill phase by reducing KV memory overhead and improving end-to-end latency.
Proposed Turbo-reasoning (CREST), a training-free test-time steering method that identifies and modulates “cognitive” attention heads to curb under/over-thinking in LLM CoT, improving accuracy by up to 17.5% and cutting token usage by 37.6% across reasoning benchmarks.
Led together-coder training (OpenHands R2E-Gym & SWE-Bench pipeline): curated high-signal SWE-smith / Rebench datasets, added attention-mask + position ID fixes for Axolotl & Veomni, distilled Qwen3-480B trajectories into a 30B model via supervised fine-tuning and activation distillation, and began MoE / RL scaling for Qwen3-30B to drive higher SWE-Bench solve rates.
Drove early product/design work for a Reinforcement Learning fine-tuning service for enterprise agents: authored multi-scenario infra plan (privacy-preserving RL loops, colocated training+inference with RDMA/InfiniBand, and fully managed end-to-end RL), assessed compute/memory/latency trade-offs, and scoped business impact (who owns agent framework, who owns reward loop, how we deliver updated weights safely at scale).
Designed CARE, a conversion pipeline that upgrades pretrained attention (e.g., GQA) into multi-head latent attention (MLA) for faster inference without increasing KV-cache size.

Dolby Mar. 2024 - Sep. 2024

Research Intern Sydney, Australia

Advisors: Shuaiwen Song (Vice President of Research, Together.AI), Yucheng Liu (Research Scientist, Dolby)

Industry Projects:

Extreme Efficient Video Coding System

Motivation: Traditional codecs (H.264/H.265/AV1) lack content adaptivity and incur high compute/memory costs, while existing neural compressors are too heavy for real-time GPU and mobile streaming. This creates a need for a low-latency, domain-aware solution that tailors compression to video content.

Contributions:

Invented and spearheaded E^2ND-VC (Extreme Efficient Neural Domain Video Compression), a pioneering neural video compression framework that leverages content-aware quantization to deliver low-latency, high-quality streaming on both standard GPUs and mobile devices.
Designed Optimal Brain Stride-wise Quantization (OBSQ), a domain-specific quantization methodology that selectively compresses neural network weights based on content type (e.g., video conferencing, gaming), enabling real-time 1080p performance with minimal quality loss.
Engineered a multi-kernel, sensitivity-based quantization pipeline with mixed-bit precision assignments, dynamically allocating bit depths across convolutional kernels to preserve critical visual features while maximizing compression ratios.
Collaborated closely with cross-functional teams to implement PoC streaming pipelines, demonstrating significant reductions in power consumption and bandwidth usage without compromising visual fidelity.

DeepSpeed Team, Microsoft Mar. 2023 - Feb. 2024

Research Intern Sydney, Australia

Advisors: Shuaiwen Song (Senior Principal Scientist, Microsoft), Xiaoxia Wu (Research Scientist, Microsoft), Zhewei Yao (Senior Researcher, Microsoft)

Industry Projects:

DeepSpeed4Science

RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Models

Future System Architecture (FSA) Lab, The University of Sydney (USYD) Mar. 2022 - Present

School of Computer Science and Engineering, SYSU Sep. 2018 - Mar. 2022

Other Projects