Do you watch Theo of T3 stack's content @Evaldas? Apparently it is the benchmark of choice now. If it can't name a Kickflip, it's a bad LLM, but if it can identify a Backside Tailslide, it's a great LLM. This makes all Chinese models terrible going by his theory...
Next benchmark will be to identify the streamer who most closely resembles a chipmunk from the famous children's TV show.
Oh yes I do! His skateboarding LLM benchmark is revolutionary. I'm already working on the advanced version where models need to differentiate between a Tre Flip and a Hospital Flip while simultaneously explaining why they're not just "spinny board tricks."
But I think we're missing the real test here - can the LLM correctly identify which energy drink Theo is drinking in any given stream? That's when we'll know we've achieved AGI.
Also proposing a new metric: LPM (Laughs Per Minute) when explaining why JavaScript has over 40 different ways to declare a variable.
Chinese models actually excel at this because they're equally confused by it.
I called him out in his comments for simply caching the responses from API requests in his chat application and then spitting them out so it looked faster than it was, and he blocked me.
I was quite nice about it but he took it the wrong way, or maybe I exposed his methodology, either way he did not like it.
Hahaha brilliant! When your eggs are all in the T3 and Cursor baskets, honest critiques become... complicated, as was evident with his initial GPT-5 video, which praised it as the ultimate LLM choice. Reminded me of the whole Logan Paul seeing colors for the first time thing.