About Me 🧑‍💻

I am a System Software Development Engineer at Intel 🏢 in Beijing, with over 2.5 years of experience. My research and work primarily focus on LLM inference optimization and efficient AI 🤖, including model compression (pruning, sparsity, and quantization), inference acceleration ⚡, and efficient hybrid GPU-CPU engines for Agentic AI. My work bridges the gap between high-level algorithmic research and low-level system efficiency.

Research Interests 🚩

  • Natural language processing (NLP), large language models (LLMs).
  • LLM inference optimization, model compression (pruning, sparsity, and quantization), neural architecture search (NAS).
  • LLM fine-tuning, parameter-efficient fine-tuning (PEFT), LoRA.
  • LLM serving engine, hybrid GPU-CPU inference.
  • Agentic AI, KV-cache management, agent memory management, retrieval-augmented generation (RAG), test-time compute, agent skills.

I am always open to discussions and potential collaborations. Please feel free to reach out! 🤝

Experience 💼

  • System Software Development Engineer @ Intel
    Jul. 2023 - Present · Beijing, China

  • System Software Development Engineer (Intern) @ Intel
    May 2022 - Jun. 2023 · Beijing, China

  • M.S. in Computer Science and Technology @ Guangdong University of Technology
    Sep. 2020 - Jun. 2023 · Guangzhou, China
    Supervisor: Prof. Ruichu Cai @ DMIR Lab
    Research: NLP, Text-to-SQL
    Recommended for Admission without Examination

  • B.S. in Software Engineering @ Guangdong University of Technology
    Sep. 2016 - Jun. 2020 · Guangzhou, China

Selected Publications 📚

For a full list, please refer to my Google Scholar.

  • RTTC: Reward-Guided Collaborative Test-Time Compute
    J. Pablo Muñoz*, Jinjie Yuan* (Co-first Author)
    EMNLP 2025 Findings · Paper

  • Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
    J. Pablo Muñoz*, Jinjie Yuan* (Co-first Author), Nilesh Jain
    NAACL 2025 (Oral) · Paper · Code

  • Low-Rank Adapters Meet Neural Architecture Search for LLM Compression
    J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
    AAAI 2025 Workshop on Connecting Low-rank Representations in AI · Paper

  • SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models
    J. Pablo Muñoz*, Jinjie Yuan* (Co-first Author), Nilesh Jain
    EMNLP 2024 Findings · Paper · Code

  • Shears: Unstructured Sparsity with Neural Low-rank Adapter Search
    J. Pablo Muñoz*, Jinjie Yuan* (Co-first Author), Nilesh Jain
    NAACL 2024 · Paper · Code

  • LoNAS: Elastic Low-rank Adapters for Efficient Large Language Models
    J. Pablo Muñoz*, Jinjie Yuan* (Co-first Author), Yi Zheng, Nilesh Jain
    LREC-COLING 2024 · Paper · Code

  • SADGA: Structure-Aware Dual Graph Aggregation Network for Text-to-SQL
    Ruichu Cai, Jinjie Yuan (advisor first author, student second author), Boyan Xu, Zhifeng Hao
    NeurIPS 2021 · Paper · Code

Patents 📝

  • Methods and apparatus for enabling efficient fine-tuning on unstructured sparse and low-precision large pre-trained foundation models
    J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
    US Patent App. 18/935,223 (2025)