Comments (31)
Our chemists were split: some argued it was an artifact, others dug deep and provided some reasoning as to why the generations were sound. Keep in mind, that was a non-reasoning, very early stage model with simple feedback mechanisms for structure and molecular properties.
In the wet lab, the model turned out to be right. That was five years ago. My point is, the same moment that arrived for our chemists will be arriving soon for theoreticians.
For instance, you can put a thousand temperature sensors in a room, which give you 1000 temperature readouts. But all these temperature sensors are correlated, and if you project them down to latent space (using PCA or PLS if linear, projection to manifolds if nonlinear) you’ll create maybe 4 new latent variables (which are usually linear combinations of all other variables) that describe all the sensor readings (it’s a kind of compression). All you have to do then is control those 4 variables, not 1000.
In the chemical space, there are thousands of possible combinations of process conditions and mixtures that produce certain characteristics, but when you project them down to latent variables, there are usually less than 10 variables that give you the properties you want. So if you want to create a new chemical, all you have to do is target those few variables. You want a new product with particular characteristics? Figure out how to get < 10 variables (not 1000s) to their targets, and you have a new product.
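To make the latent-variable idea concrete, here's a minimal sketch (assuming Python with numpy/scikit-learn and a synthetic stand-in for the sensor data) showing PCA recovering a handful of latent components from many correlated readouts:

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic stand-in: 1000 correlated "sensor" readings secretly
    # driven by only 4 underlying (latent) factors.
    rng = np.random.default_rng(0)
    n_samples, n_sensors, n_latent = 500, 1000, 4
    latent = rng.normal(size=(n_samples, n_latent))    # the true drivers
    mixing = rng.normal(size=(n_latent, n_sensors))    # how each sensor combines them
    readings = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_sensors))

    pca = PCA(n_components=10).fit(readings)
    print(pca.explained_variance_ratio_.round(3))
    # The first ~4 components carry essentially all the variance; those are
    # the few latent variables you'd monitor/control instead of 1000 readouts.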
https://www.pnas.org/doi/10.1073/pnas.1611138113
You summarized it very well!
There are also nonlinear techniques. I’ve used UMAP and it’s excellent (particularly if your data approximately lies on a manifold).
https://umap-learn.readthedocs.io/en/latest/
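For anyone curious, basic umap-learn usage is just a few lines (a minimal sketch; the data here is random and the parameter values are only illustrative):

    import numpy as np
    import umap  # pip install umap-learn

    X = np.random.rand(1000, 50)          # stand-in for real high-dimensional data
    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
    embedding = reducer.fit_transform(X)  # shape (1000, 2)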
The most general purpose deep learning dimensionality reduction technique is of course the autoencoder (easy to code in PyTorch). Unlike the above, it makes very few assumptions, but this also means you need a ton more data to train it.
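Something like this, as a rough sketch (layer sizes and the training loop are illustrative, not a recipe):

    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        # Compress 1000-dim inputs to a 4-dim latent code and reconstruct them.
        def __init__(self, n_in=1000, n_latent=4):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(),
                                         nn.Linear(128, n_latent))
            self.decoder = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(),
                                         nn.Linear(128, n_in))

        def forward(self, x):
            z = self.encoder(x)           # the latent variables
            return self.decoder(z)

    model = AutoEncoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(256, 1000)            # stand-in for real data
    for _ in range(100):                  # bare-bones reconstruction training loop
        loss = nn.functional.mse_loss(model(x), x)
        opt.zero_grad(); loss.backward(); opt.step()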
Do you mean it makes the *strongest* assumptions? "your data is (locally) linear and more or less Gaussian" seems like a fairly strong assumption. Sorry for the newb question as I'm not very familiar with this space.
However I meant it colloquially in that those assumptions are trivially satisfied by many generating processes in the physical and engineering world, and there aren’t a whole lot of other requirements that need to be met.
There's a newer thing called PaCMAP which handles different cases better. It's not as robustly tested as UMAP, but that could be said of any new thing, and I'm a little wary that it might be overfitted to common test cases. To my mind, PaCMAP feels like a partial solution on the way to a better approach.
The three-stage process of PaCMAP is asking either to be developed into a continuous system or for an analytical reason/way to carry out each phase change.
T-SNE is good for visualization and for seeing class separation, but in my experience, I haven’t found it to work for me for dimensionality reduction per se (maybe I’m missing something). For me, it’s more of a visualization tool.
On that note, there’s a new algorithm that improves on T-SNE called PaCMAP which preserves local and global structures better. https://github.com/YingfanWang/PaCMAP
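Usage mirrors UMAP's fit_transform interface (a sketch based on the project's README; parameter defaults may differ across versions):

    import numpy as np
    import pacmap  # pip install pacmap

    X = np.random.rand(1000, 50)          # stand-in for real data
    reducer = pacmap.PaCMAP(n_components=2, n_neighbors=None,
                            MN_ratio=0.5, FP_ratio=2.0)
    embedding = reducer.fit_transform(X, init="pca")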
https://www.biorxiv.org/content/10.1101/2025.05.08.652944v1....
Embeddings are a form of latent variables.
Attention query/key/value vectors are latent variables.
More generally, a latent variable is any internal, not-directly-observed representation that compresses or restructures information from inputs into a form useful for producing outputs.
They usually capture some underlying behavior in either lower dimensional or otherwise compressed space.
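A tiny illustration of the "embeddings are latent variables" point (PyTorch, with made-up sizes and token ids): the vectors below never appear in the model's input or output text; they're internal representations the model learns and reuses.

    import torch
    import torch.nn as nn

    embed = nn.Embedding(num_embeddings=50_000, embedding_dim=768)
    token_ids = torch.tensor([[101, 2054, 2003]])   # arbitrary ids
    latent_vectors = embed(token_ids)               # shape (1, 3, 768), never observed directly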
I remember 3-FPM, that was what I imagined stimulants should be doing. It did everything just right. I got it back when it was legal. Any other stimulants come nowhere as close, maybe similar ones, but 4FA or whatever is for example, mostly euphoric, which is not what I want.
No clue about IBM's part in it.
https://www.economist.com/science-and-technology/2025/07/02/...
My understanding is that iterating on possible sequences (of codons, base pairs, etc.) is exactly what LLMs, these feedback-looped predictor machines, are especially great at. The newest models, those that "reason about" (check) their own output, are even better at it.
Similarly for physicists: there's a very confusing/unconventional antenna called the "evolved antenna" which was used on a NASA spacecraft. Its design came out of genetic programming. Why the antenna's bends at different points produce increased gain is still not well understood today.
This all boils down to empirical reasoning, which underlies the vast majority of science (or science adjacent fields like software engineering, social sciences etc).
The question, I guess, is: do LLMs, "AI", and ML give us better hypotheses or tests to run to support empirical, evidence-based science breakthroughs? The answer is yes.
Will these be substantial, meaningful or create significant improvements on today’s approaches?
I can’t wait to find out!
Wouldn't that mean the fall of US pharmaceutical conglomerates, based on current laws about copyright and AI content?
Is achieving the same result using different engines the same as designing a combustion engine in different ways?
How does the public domain translate to that?
I really hope it kills any ways to claim patents on anything.
AI does not create anything, anymore than Word writes documents or Photoshop creates photos.
You never quite know.
Right now, it's mostly the former. I fully expect the latter to become more and more common as the performance of AI systems improves.
GPT-5 (and other LLMs) are by definition language models, and though they will happily spew tokens about whatever you ask, they don't necessarily have the training data to properly encode the latent space of, e.g., drug interactions.
Confusing these two concepts could be deadly.
To improve next token prediction performance on these datasets and generalize requires a much richer latent space. I think it could theoretically lead to better results from cross-domain connections (ex: being fluent in a specific area of advanced mathematics, quantum mechanics, and materials engineering is key to a particular breakthrough)
A few things to consider:
1. This is one example. How many other attempts did the person make that failed to be useful, accurate, or coherent? The author is an OpenAI employee IIUC, which makes the question all the more pressing. Sora's demos were amazing until you tried it yourself and realized it took 50 attempts to get a usable clip.
2. The author noted that humans had updated their own research in April 2025 with an improved solution. For cases where we detect signs of superior behavior, we need to start publishing the thought process (reasoning steps, inference cycles, tools used, etc.). Otherwise it's impossible to know whether this used a specialty model, had access to the more recent paper, or in other ways got lucky. Without detailed proof it's becoming harder to separate legitimate findings from marketing posts (not suggesting this specific case was a pure marketing post)
3. Points 1 and 2 would help with reproducibility, which is important for scientific rigor. If we give Claude the same tools and inputs, will it perform just as well? This would help the community understand if GPT-5 is novel, or if the novelty is in how the user is prompting it
I should know, I've been using LLM thinking models to help brainstorm ideas for stickier proofs. It's been more successful at discovering esoteric entry points than I would like to admit.
If you could combine this with automated theorem proving, it wouldn't matter if it was right only 1 out of 1,000 times; a rough sketch of that loop is below.
(Theory building is quite hard in math; the computation side is only hard after a point).
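Roughly what such a generate-and-verify loop could look like (llm_generate_proof is a hypothetical model call, and verification assumes a standalone Lean file that the `lean` CLI can check without external dependencies):

    import subprocess, tempfile, pathlib

    def find_verified_proof(statement: str, attempts: int = 1000):
        for _ in range(attempts):
            candidate = llm_generate_proof(statement)   # hypothetical LLM call
            with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
                f.write(candidate)
                path = f.name
            ok = subprocess.run(["lean", path], capture_output=True).returncode == 0
            pathlib.Path(path).unlink()
            if ok:
                return candidate                        # keep only machine-checked attempts
        return None

The point is that the proof checker, not the model, is the arbiter of correctness, so a low hit rate is tolerable.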
High chance given that this is the same guy that came up with SVG unicorn (sparks of AGI) which raises the same question even more obviously.
The entire field of math is fractal-like. There are many, many low hanging fruits everywhere. Much of it is rote and not life changing. A big part of doing “interesting” math is picking what to work on.
A more important test is to give an AI access to the entire history of math and have it _decide_ what to work on, and then judge it for both picking an interesting problem and finding a novel solution.
https://mathstodon.xyz/@tao/114881418225852441
https://mashable.com/article/openai-claims-gold-medal-perfor...
Note that no one expressed skepticism about what Google said when they claimed they achieved a gold medal. But no one is willing to believe OpenAI.
1. There's this huge misconception that LLMs are literally just memorizing stuff and repeating patterns from their training data.
2. People glamorize math and feel like advancements in it would "be AGI".
They don't realize that having it generate "new math" is not much harder than having it generate "new programs." Instead of writing something in Python, it's writing something in Lean.
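To illustrate "writing something in Lean" (a trivial, made-up example, not anything from the thread): a proof is just another artifact a model can emit and a checker can verify.

    -- Lean 4: commutativity of addition on Nat, discharged by a library lemma.
    theorem add_comm' (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b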
So then, what are they doing?
I'm seeing people creating full apps with GPT-5-pro, but nothing is novel.
Just discussed the "impressiveness" of it creating a gameboy emulator from scratch.
(There are over 3,500 Game Boy emulators on GitHub. I would be surprised if it failed to produce a solution with that much training data).
Where are the novel breakthroughs?
As it stands today, I'm sure it can produce a new ssl implementation or whatever it has been trained on, but to what benefit???
For a lay person, what are they actually doing instead?
Or if you ask it, "what is the capital of the state that has the city Dallas?", it understands the relations and can internally reason through the two step process of Dallas is in Texas -> the capital of Texas is Austin. A simple n-gram model may occasionally get questions like that right by a lucky guess (though usually not) while we can see experimentally the LLM is actually applying the proper reasoning to the question.
You can say this is all just advanced applications of memorizing and predicting patterns, but you would have to use a broad definition of "predicting patterns" that would likely include human learning. People who declare LLMs are just glorified auto-complete are usually trying to imply they are unable to "truly" reason at all.
Kant has an argument in the Critique of Pure Reason that reason cannot be reducible to the application of rules, because in order to apply rule A to a situation, you would need a rule B to follow for applying rule A, and a rule C for applying rule B, and this is an infinite regress. I think the same is true here: any reasonable characterization of "applying a pattern" that would succeed at reducing what LLMs do to something mechanical is vulnerable to the regress argument.
In short: even if you want to say it's pattern matching, retrieving a pattern and applying it requires something a lot closer to intelligence than the phrase makes it sound.
Second: Generative AI is about approximating an unknown data distribution. Every dataset - text, images, video - is treated as a sample from such a distribution. Success depends entirely on the model's ability to generalize outside the training set. For example, "This Person Does Not Exist" (https://this-person-does-not-exist.com/en) was trained on a data set of 1024x1024 RGB images. Each image can be thought of as a vector in a 1024x1024x3 = 3145728-dimensional space, and since all coefficients are in [0,1], these vectors are all in the interior of a 3145728-dimensional hypercube. But almost all points in that hypercube are going to be random noise that doesn't look like a person. The ones that do will be on a lower-dimensional manifold embedded in the hypercube. The goal of these models is to infer this manifold from the training data and generate a random point on it.
Third: Models do what they're trained to do. Next-token prediction is one of those things, but not the whole story. A model that literally did just memorize exact fragments would not be able to zero-shot new code examples at all. That is, the transformer architecture would have learned some nonlinear transformation that is only good at repeating exact fragments. Instead, they spend a ton of time training it to get good at generalizing to new things, and it learns whatever other nonlinear transformation makes it good at doing that instead.
Or, at its core: if you give it a question it's never seen, what's the most likely reply you might get? It will give you that. But that doesn't mean there is an internal world model or anything; it ultimately comes down to whether you think language is sufficient to model reality, which I think it probably isn't. It would obviously be very convincing, but not necessarily correct.
> techniques like vector tokenization
(I assume you're talking about the input embedding.) This is really not an important part of what gives LLMs their power. The core is that you have a large scale artificial neural net. This is very different than an n-gram model and is probably capable of figuring out anything a human can figure out given sufficient scale and the right weights. We don't have that yet in practice, but it's not due to a theoretical limitation of ANNs.
> probability distribution of the most likely next token given a preceding text.
What you're talking about is an autoregressive model. That's more of an implementation detail. There are other kinds of LLMs.
I think talking about how it's just predicting the next token is misleading. It's implying it's not reasoning, not world-modeling, or is somehow limited. Reasoning is predicting, and predicting well requires world-modeling.
What separates transformers from LSTMs is their ability to process the entire corpus in parallel rather than in sequence, and the inclusion of the more efficient "attention" mechanism that allows them to pick up long-range dependencies across a language. We don't actually understand the full nature of the latter, but I suspect it is the basis of the more "intelligent" behavior of LLMs. There's quite a general range of problems that long-range dependencies can encompass, but that's still ultimately limited by language itself.
But if you're talking about this being a fundamentally a probability distribution model, I stand by that, because that's literally the mathematical model (softmax for the encoder and decoder) that's being used in transformers here. It very much is generating a probability distribution over the vocabulary and just picking the highest probability (or beam search) as your next output.
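Concretely, the last step looks something like this (a PyTorch sketch with made-up sizes; real models also use sampling with temperature or beam search rather than plain argmax):

    import torch
    import torch.nn.functional as F

    vocab_size, d_model = 50_000, 768
    lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)

    hidden_state = torch.randn(1, d_model)    # hidden state for the last position
    logits = lm_head(hidden_state)            # (1, vocab_size)
    probs = F.softmax(logits, dim=-1)         # probability distribution over the vocabulary
    next_token = torch.argmax(probs, dim=-1)  # greedy pick of the next token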
>The LLMs absolutely world model and researchers have shown this many times on smaller language models.
We don't have a formal semantic definition of a "world model". I would take a lot of what these researchers are writing with a grain of salt, because something like that crosses more into philosophy (especially the limits of language and logic) than the hard engineering these researchers are trained in.
Zoom out and look at its trajectory over those 100,000 steps and ask again.
The answer is something alien. Probabilistically it is certain the description of its behavior is not going to exist in a space we as humans can understand. Maybe if we were god beings we could say 'No no, you see the behavior of the double pendulum isn't seemingly random, you just have to look at it like this'. Encryption is a decent analogy here.
We're fooled into thinking we can understand these systems because we forced them to speak English. Under the hood is a different story.
2) That's not even the point. The point is being trained on stolen data without permission, pretending that the resulting model of the training data is not a derived work of the training data and that the output of the model plus a prompt is not derived work of the training data.
Point 1 is just an extreme edge case which is a symptom of point 2 and yet people still have trouble accepting it.
GPL was about user freedom and now if derived work no longer applies as long as you run code through a sufficiently complex plagiarism automator, plagiarism is unprovable and GPL is broken. Great, we lost another freedom.
[0]: I recall a study or court document with 100 examples of plagiarising multiple whole paragraphs from the New York Times, don't have time to look for it now
Convenient. Well then, I recall two studies that said the opposite. Unfortunately pressed for time as well.
You didn't have to be rudely dismissive and lie, you chose to.
I would happily respond politely to a polite request.
Please be mindful of your behavior next time.
---
Link for everyone else: https://nytco-assets.nytimes.com/2023/12/Lawsuit-Document-dk...
My sympathies to academic publishers ;)
I don't necessarily disagree about the copyleft stuff.
Transformers do sometimes overfit to exact token sequences from training data, but that isn't really what the architecture does in general.
The same applies to valid new programs.
The issue I have with this is pretending that the word "new" is sufficient justification for giving all the credit/attribution and subsequent reward (reputational, financial, etc.) to the person who wrote the prompt instead of distributing it to the people in the whole chain of work according to how much work and what quality of work they did.
How many man-hours did it take to create the training data? How many to create the LLM training algorithm and the electricity to run it? How many to write the prompts?
The most work by many, many orders of magnitude was put in by the first group. They often did it with altruistic goals in mind and released their work under permissive or copyleft licenses.
And now somebody found a way to monetize this effort without giving them anything in return. In fact, they will have to pay to access the LLMs which are based on their own work.
Copyright or plagiarism are perhaps the wrong terms to use when talking about it. I think copyright should absolutely apply but it was designed to protect creative works, not code in the first place.
Either way it's a form of industrialized exploitation and we should use all available tools to defend against it.
I mean, sure. But so am I (in what is likely a far more advanced manner, but still). I also find it somewhat funny that I am also partially trained on stolen data without permission. I also jaywalk occasionally (perhaps I am trivializing the topic too much, but show me a researcher who hasn't _once_ downloaded a paper they really needed, in less than perfectly legal ways).
Human rights are valuable. LLMs allow laundering GPL code (removing both attribution and users' rights to inspect and modify the code). Free software cannot compete against proprietary in a world where making a copy is trivial but proving it's a copy is nearly impossible.
If LLMs were already a breakthrough in proving theorems, even for obscure minor theorems, there would be a massive increase in published papers due to publish or perish academic incentives.
I'm absolutely confident that AI/LLMs can solve things, but you have to sift through a lot of crap to get there. Even further, it seems AI/LLMs tend to solve novel problems in very unconventional ways. It can be very hard to know if an attempt is doomed, or just one step away from magic.
But similarly to how a computer plays chess, using heuristics to narrow down a vast search space into tractable options, LLMs have the potential to be a smarter way to narrow that search space to find proofs. The big question is whether these heuristics are useful enough, and the proofs they can find valuable enough, to make it worth the effort.
This is why the computer-assisted proof of the four-color theorem was such a talking point in math/CS circles: how do you "really" know what was proven? This is slightly different from, say, an advisor who trains their students: you can often sketch out a proof, even though the details require quite a bit of work.
But it's a separate question of whether this is a good example of that. I think there is a certain dishonesty in the tagline. "I asked a computer to improve on the state-of-the-art and it did!". With a buried footnote that the benchmark wasn't actually state-of-the-art, and that an improved solution was already known (albeit structured a bit differently).
When you're solving already-solved problems, it's hard to avoid bias, even just in how you ask the question and otherwise nudge the model. I see it a lot in my field: researchers publish revolutionary results that, upon closer inspection, work only for their known-outcome test cases and not much else.
Another piece of info we're not getting: why this particular, seemingly obscure problem? Is there something special about it, or is it data dredging (i.e., we tried 1,000 papers and this is the only one where it worked)?
Programmers take pride in their ability to program and to reduce their own abilities into an algorithm reproducible by an LLM is both an attack on their pride and an attack on their livelihood.
It's the same reason why artists say AI art is utter crap when, in a blind test, they usually won't be able to tell the difference.
Quanta published an article about a physics lab asking ChatGPT to help come up with a way to perform an experiment, and ChatGPT _magically_ came up with an answer worth pursuing. But what actually happened was that ChatGPT was referencing papers from lesser-known labs/researchers that had basically gone unread.
It's amazing that ChatGPT can do something like that, but `referencing data` != `deriving theorems`, and the person posting this shouldn't just claim "ChatGPT derived a better bound" in a proof; they should first do a really thorough check of whether this information could have just ended up in the training data.
Which is actually huge. Reviewing and surfacing all the relevant research out there that we are just not aware of would likely have at least as much impact as some truly novel thing that it can come up with.
now let's invalidate probably 70% of all patents
If LLMs aren't being used by https://patents.stackexchange.com/ or patent troll fighters, shame on them.
On the other hand, I have a collection of unpublished results in less active fields that I’ve tested every frontier model on (publicly accessible and otherwise) and each time the models have failed to solve them. Some of these are simply reformulations of results in the literature that the models are unable to find/connect which is what leads me to formulate this as a search problem with the space not being densely populated enough in this case (in terms of activity in these subfields).
The paper in question is an arxiv preprint whose first author seems to be an undergraduate. The theorem in it which GPT improves upon is perfectly nice, there are thousands of mathematicians who could have proved it had they been inclined to. AI has already solved much harder math problems than this.
Of course, because I am a selfish person, I'd say I appreciate most his work on convex body chasing (see "Competitively chasing convex bodies" on the Wikipedia link), because it follows up on some of my work.
Objectively, you should check his conference submission record, it will be a huge number of A*/A CORE rank conferences, which means the best possible in TCS. Or the prizes section on Wikipedia.
Provocative as my question may be, the point I wanted to make is that his most highly cited paper that I already mentioned is suspiciously very in line with the OpenAI narrative. I doubt if any of his GPT research is really independent. With great salary comes great responsibility.
He is a mathematician. Unless you wanted to say "any other mathematicians..."
https://x.com/ErnestRyu/status/1958408925864403068?t=QmTqOcx...
There are a few masters-level publishable research problems that I have tried with LLMs on thinking mode, and it had produced a nearly complete proof before we had a chance to publish it. Like the problem stated here, these won't set the world on fire, but they do chip away at more meaningful things.
It often doesn't produce a completely correct proof (it's a matter of luck whether it nails a perfect proof), but it very often does enough that even a less competent student can fill in the blanks and fix up the errors. After all, the hardest part of a proof is knowing which tools to employ, especially when those tools can be esoteric.
https://xcancel.com/SebastienBubeck/status/19581986678373298...
I really don't know what to make of this. The conclusion is that a model could still do this without the paper containing the exact info on how to do it?
Context: https://x.com/GeoffLewisOrg/status/1945864963374887401
aka the Grothendieck prime!
Bad at arithmetic, promising at math: https://www.lesswrong.com/posts/qy5dF7bQcFjSKaW58/bad-at-ari...
Reference: https://arxiv.org/abs/2507.15855
Alternative: If Gemini Deep Think or GPT5-Pro people are listening, I think they should give free access to their models with potential scaffolding (ie. agentic workflow) to say some ~100 researchers to see if any of them can prove new math with their technology.
s/prove/produce/g
I'm inclined to regard an LLM as modelling a collection of fuzzy production rules which occur in a hierarchical collection of semi-formal systems; an LLM attempts to produce typographically correct theorems, the proving occurs at the level of semantics. Meaning requires a mind to erect an isomorphic mapping which the LLM is not capable of. In other words, for the LLM the math is just symbols on a page that are arranged according to the typographic rules which it has an imperfect model of. On this view, nothing about what is happening with Gen AI is particularly surprising or novel.
But yes, it's getting better and better.
https://x.com/ErnestRyu/status/1958408925864403068?t=dAKXWtt...
The same doesn't really apply to everything outside of that.
Still, you'd think that status would remain; it's not like the invention of the car removed the glory of being the world's fastest sprinter.