This paper shows that current LLMs fine-tuned on facts of the form f(x) = y will often fail to generalize to the inverse, f-inverse(y) = x. Gary Marcus seems to think this points to a fundamental problem with the current approach to AI.
I tested this myself and can confirm that ChatGPT has this problem.
At the beginning of 2026 I'll try something similar with the leading language model of the time. (Not the fine-tuning part, just testing facts learned in its main training run, via its public interface.) If there's at least one example where it consistently gets that f(x) = y and consistently does not get that f-inverse(y) = x, this resolves YES. If I can't find such an example, it resolves NO.
If the best general-purpose AI is no longer an LLM, this resolves N/A.
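
For concreteness, here is a rough sketch of how the forward/reverse check above could be run. It assumes access through the OpenAI Python SDK rather than the chat interface, and the model name, trial count, and `consistently_answers` helper are illustrative choices, not part of the resolution criteria. The fact pair is the well-known Tom Cruise / Mary Lee Pfeiffer example from the reversal-curse paper.

```python
# Sketch of the reversal check. The OpenAI Python SDK stands in for
# "the leading model's public interface"; the model name and trial count
# are placeholders to adapt at resolution time.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o"   # placeholder for whichever model is leading at the time

def ask(question: str) -> str:
    """Send a single question and return the model's text reply."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content or ""

def consistently_answers(question: str, expected: str, trials: int = 5) -> bool:
    """The model must mention the expected answer on every one of `trials` attempts."""
    return all(expected.lower() in ask(question).lower() for _ in range(trials))

# Forward direction, f(x) = y: "Who is Tom Cruise's mother?" -> Mary Lee Pfeiffer
forward = consistently_answers("Who is Tom Cruise's mother?", "Mary Lee Pfeiffer")
# Reverse direction, f-inverse(y) = x: "Who is Mary Lee Pfeiffer's son?" -> Tom Cruise
reverse = consistently_answers("Who is Mary Lee Pfeiffer's son?", "Tom Cruise")

# A single fact pair that passes forward but consistently fails in reverse is enough for YES.
print("Candidate YES example" if forward and not reverse else "Not decisive for this pair")
```
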
Does this only apply to leading large language models? That is, if other architectures for SOTA general-purpose AI appear that are no longer considered language models (perhaps because that's no longer their primary training task, or because they're no longer Transformer-based), would you check those models rather than the best LLMs?