About Me 🧑‍💻

I am an engineer and researcher at Intel 🏢. I joined Intel as an intern in 2022 and became a full-time employee in 2023. My research and work primarily focus on AI model optimization 🤖, including model compression (pruning, sparsity, quantization), inference acceleration ⚡, and lightweight deployment. I am dedicated to making large language models (LLMs) smaller, faster, and more efficient for deployment in resource-constrained environments such as edge devices 📱.

🎓 I received my M.S. degree in Computer Science and Technology from Guangdong University of Technology in 2023, supervised by Prof. Ruichu Cai. Prior to that, I obtained my B.S. degree in Software Engineering from the same university in 2020, and was recommended for direct admission to the master’s program.

Current Work 🚀

My recent work focuses on Hybrid Inference for Agentic AI 🧠. I am exploring the integration of prefix caching mechanisms with CPU offloading architectures to optimize agentic workloads, which often involve long instructions, few-shot examples, and iterative reasoning. By leveraging CPU resources for both computation and KV cache storage, we aim to significantly improve the throughput and efficiency of LLM inference.
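The core idea can be illustrated with a minimal sketch (all names are hypothetical and the attention key/value tensors are replaced by placeholder strings): shared prompt prefixes are hashed and their KV state is kept in a CPU-resident store, so that iterative agent turns reusing the same long instruction prefix can skip redundant prefill.

```python
from hashlib import sha256

class PrefixKVCache:
    """Toy prefix cache: maps a hash of the shared prompt prefix to its
    cached KV state. In a real system the values would be attention
    key/value tensors offloaded to CPU memory; here they are strings."""

    def __init__(self):
        self._store = {}  # prefix hash -> CPU-resident "KV" payload

    def _key(self, prefix_tokens):
        return sha256(" ".join(prefix_tokens).encode()).hexdigest()

    def lookup(self, prefix_tokens):
        return self._store.get(self._key(prefix_tokens))

    def insert(self, prefix_tokens, kv_payload):
        self._store[self._key(prefix_tokens)] = kv_payload

def run_agent_step(cache, system_prefix, user_turn):
    """One agent turn: reuse the prefix's cached KV state if present,
    otherwise 'prefill' it once and store the result on the CPU side."""
    kv = cache.lookup(system_prefix)
    cache_hit = kv is not None
    if not cache_hit:
        kv = f"KV({len(system_prefix)} prefix tokens)"  # stand-in for prefill
        cache.insert(system_prefix, kv)
    return cache_hit, f"decode({user_turn!r} with {kv})"
```

In an agentic loop, only the first turn pays the prefill cost for the long shared prefix; every subsequent turn hits the CPU-side cache and proceeds directly to decoding.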

Research Interests 🚩

  • Natural language processing (NLP) & large language models (LLMs)
  • Agentic AI & efficient inference/deployment
  • Model compression & optimization (pruning, sparsity, quantization)
  • Low-rank & adapter methods

💬 I am always open to discussions and potential collaborations. Please feel free to reach out to me!

Selected Publications 📚

For a full list, please refer to my Google Scholar.

  • RTTC: Reward-Guided Collaborative Test-Time Compute
    J. Pablo Muñoz*, Jinjie Yuan* (Co-first Author)
    EMNLP 2025 Findings · Paper

  • Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
    J. Pablo Muñoz*, Jinjie Yuan* (Co-first Author), Nilesh Jain
    NAACL 2025 (Oral) · Paper · Code

  • Low-Rank Adapters Meet Neural Architecture Search for LLM Compression
    J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
    AAAI 2025 Workshop on Connecting Low-rank Representations in AI · Paper

  • SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models
    J. Pablo Muñoz*, Jinjie Yuan* (Co-first Author), Nilesh Jain
    EMNLP 2024 Findings · Paper · Code

  • Shears: Unstructured Sparsity with Neural Low-rank Adapter Search
    J. Pablo Muñoz*, Jinjie Yuan* (Co-first Author), Nilesh Jain
    NAACL 2024 · Paper · Code

  • LoNAS: Elastic Low-rank Adapters for Efficient Large Language Models
    J. Pablo Muñoz*, Jinjie Yuan* (Co-first Author), Yi Zheng, Nilesh Jain
    LREC-COLING 2024 · Paper · Code

  • SADGA: Structure-Aware Dual Graph Aggregation Network for Text-to-SQL
    Ruichu Cai (Advisor, 1st), Jinjie Yuan (Student, 2nd), et al.
    NeurIPS 2021 · Paper · Code

Patents 📝

  • Methods and apparatus for enabling efficient fine-tuning on unstructured sparse and low-precision large pre-trained foundation models
    J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
    US Patent App. 18/935,223 (2025)