About Me 🧑‍💻

AI software engineer specializing in LLM inference optimization ⚡ with over 2 years of experience at Intel 🏢. Core focus areas: 1) architecting model compression algorithms (sparsity, pruning, quantization) to overcome compute and memory bottlenecks, and 2) designing efficient heterogeneous (GPU-CPU) inference frameworks and KV-cache management mechanisms for agentic AI workloads. Published 5+ papers at AI conferences (EMNLP, NAACL, NeurIPS) with 150+ citations; holds 1 US patent. Dedicated to delivering lightweight, efficient algorithmic solutions for research and deployment, bridging the gap between high-level algorithmic research and low-level system efficiency.

Interests 🚩

Natural language processing (NLP), large language models (LLMs).
Model inference acceleration, model compression (sparsity/pruning/quantization), neural architecture search (NAS).
LLM fine-tuning, parameter-efficient fine-tuning (PEFT), LoRA.
LLM serving engine, heterogeneous inference (GPU-CPU), KV-cache management and scheduling, agent memory management, agentic AI.
Retrieval-augmented generation (RAG), test-time compute.

I am always open to discussions and potential collaborations. Please feel free to reach out! 🤝

Experience 💼

System Software Development Engineer @ Intel
Jul. 2023 - Mar. 2026 (2 yrs 9 mos) · Beijing, China
System Software Development Engineer (Intern) @ Intel
May 2022 - Jun. 2023 (1 yr 2 mos) · Beijing, China
M.S. in Computer Science and Technology @ Guangdong University of Technology
Sep. 2020 - Jun. 2023 · Guangzhou, China
Recommended for Admission without Examination
Prof. Ruichu Cai @ DMIR Lab
B.S. in Software Engineering @ Guangdong University of Technology
Sep. 2016 - Jun. 2020 · Guangzhou, China

Selected Publications 📚

For a full list, please refer to my Google Scholar profile.

RTTC: Reward-Guided Collaborative Test-Time Compute
J. Pablo Muñoz*, Jinjie Yuan* (Co-first author)
EMNLP 2025 Findings · Paper
Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
J. Pablo Muñoz*, Jinjie Yuan* (Co-first author), Nilesh Jain
NAACL 2025 (Oral) · Paper · Code
Low-Rank Adapters Meet Neural Architecture Search for LLM Compression
J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
AAAI 2025 Workshop · Paper
SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models
J. Pablo Muñoz*, Jinjie Yuan* (Co-first author), Nilesh Jain
EMNLP 2024 Findings · Paper · Code
Shears: Unstructured Sparsity with Neural Low-rank Adapter Search
J. Pablo Muñoz*, Jinjie Yuan* (Co-first author), Nilesh Jain
NAACL 2024 · Paper · Code
LoNAS: Elastic Low-rank Adapters for Efficient Large Language Models
J. Pablo Muñoz*, Jinjie Yuan* (Co-first author), Yi Zheng, Nilesh Jain
LREC-COLING 2024 · Paper · Code
SADGA: Structure-Aware Dual Graph Aggregation Network for Text-to-SQL
Ruichu Cai, Jinjie Yuan (First student author), Boyan Xu, Zhifeng Hao
NeurIPS 2021 · Paper · Code

Patents 📝

Methods and apparatus for enabling efficient fine-tuning on unstructured sparse and low-precision large pre-trained foundation models
J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
US Patent App. 18/935,223 (2025)

Jinjie Yuan
