Explore how next-gen LLMs perform on mathematical reasoning benchmarks. While scores on GSM8k and MATH are high, perturbation tests reveal deep flaws in generalization and proof generation.