Llama's Paradox: Hardcore Inference Attack on Llama.cpp

This is my debut: a one-hour (and twenty-minute) presentation at ZeroCon 25, given when I was 15 years old (11 April 2025), on the Llama.cpp RPC memory exploitation research I did. (My 10k-word writeup: Llama's Paradox.)

I made these 75 pages of slides in two weeks, working every day after school in the library (my roommates thought I was sneaking out every night, since I went missing from 3 p.m. to 11 p.m.). I landed in Seoul alone, did the presentation, then flew back to the States. (I did meet very nice people and tried some of the best Korean barbecue I've ever had.)

I was pretty anxious on the trip from New York City to Seoul: I never had the chance to actually run through the presentation at school because of how long it was, and I was still finishing up the slides at the airport. But the good thing is that I had my speaker notes with me, and here they are. I hope they tell a good story :)

Ruikai: Thank you so much for the introduction (referring to the host). Before I start, I just want to say thank you to everyone for coming! I am so honored to be here, and wow, there are a lot of people!

Before we start, I want to play a little hand-raising game.

If you've ever done research on binary exploitation, please raise your hand.
If you know of Llama.cpp, please raise your hand.

Just as the title tells you, this is "Hardcore Inferencing Attack: Unraveling Llama.cpp's RPC Heap Puzzle." I'll tell you about my whimsical experience turning a heap overflow, caused by a tensor misoperation, into RCE on a Llama.cpp distributed inference server.

And speaking of this whimsical experience: whimsical as it was, I did stumble.

Within the first week, deep diving into Llama.cpp RPC's memory internals, I found exactly nothing.

I was lucky enough to discover a constrained heap overflow in the tensor operation implementation in the third week. However, because of the project's unique memory layout, custom heap logic, and multi-layered runtime security checks, it was extremely difficult to escalate it into anything meaningful.

In this talk, you will see how I used a novel exploitation methodology, one otherwise only used in CTFs, and how it was able to turn the situation around a bit. Later, you will see how it came with more setbacks and more complex situations.

Fortunately, after six straight hours of GDB’ing, we were able to construct an exploit despite countless setbacks, obstacles, snags, and a tangled paradox of memory state and object-free. The exploit found its own way out of the heap maze through nothing but Llama.cpp RPC’s weird behaviors, unpredictable object layouts, a whimsical exploitation approach, and working with what you've got.

At a certain point in this talk, you will experience a moment of awe, where one single line of code was able to solve two major “paradoxes” in the exploitation journey! Let's get started.

Ruikai: A little bit about who I am. I'm Ruikai Peng; people know me as retr0reg. I am a 15-year-old freshman in high school, and also a boring security researcher, mostly in AI/ML.

I have over 20 CVEs and have found RCEs in ML frameworks like Transformer, TensorFlow, and Llama.cpp. My work ranges from an Evernote Electron client RCE to ROP'ing Tenda routers, plus other small, interesting bugs that I can’t really talk about, as governmental… I've also been working on security automation. You can see me in Hugging Face’s security scanners.