Comments (3)
I'm pretty interested in the intersection of code / ML. If that's your thing here are some other writing you might be interested in.
* Thinking about cuda: http://github.com/srush/gpu-puzzles
* Tensors considered harmful: https://nlp.seas.harvard.edu/NamedTensor
* Differentiating SVG: https://srush.github.io/DiffRast/
* Annotated S4: https://srush.github.io/annotated-s4/
Recently moved back to industry, so haven't had a chance to write in a while.
Once you realize that Attention is really just a re-framing of Kernel Smoothing it becomes wildly more intuitive [0]. It also allows you to view Transformers as basically learning a bunch of stacked Kernels which leaves them in a surprisingly close neighborhood to Gaussian Processes.
0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
> I'd be grateful for any pointers to an example where system developers (or someone else in a position to know) have verified the success of a prompt extraction.
You can try this yourself with any open source llm setup that lets you provide a system prompt no? Just give it a prompt, ask the model the prompt ,and see if it matches.
gpt-oss is trained to refuse so it wont share (you can provide system prompt on lmstudio)
The thing that makes transformers work is multi dimensionality in the sense that you are multiplying matricies by matricies instead of computing dot products on vectors. And because matrix multiplication is effectively sums of dot products, you can represent all of the transformer as wide single layer perceptron sequences (albeit with a lot of zeros), but mathematically they would do the same thing.
> you can represent all of the transformer as wide single layer perceptron sequences
This isn't correct, again because of attention. The classic perceptron has static weights, they are not an input. The same mathematical function can be used to compute attention however there are no static weights. You've got your attention scores on one side and the V matrix on the other side.
Indeed I wonder if it's actually possible for a bunch of perceptrons to even 'discover' the attention mechanism given they inherently have static weights and they can't directly multiply two inputs (or directly multiply two internal activations). Given an MLP is a general function approximater I guess a sufficiently large number of them could get close enough?
Sure, but the K/V matrices are pretty much arbitrary weights, and so is the Q vector since its derived from the multiplication of the input vector by a learned matrix.
The thing im trying to convey is that the nomeclature of Key/Query/Value doesn't mean anything, so when people learn about transformers, they don't need to understand that those matricies correspond to some predefined structure that maps to the data in a specific way. You can have 2 identical model initialized with random values and trained on same datasets, and end up with different KQV matricies for the same input.
>This isn't correct, again because of attention. The classic perceptron has static weights, they are not an input.
K/Q/V are all derived by multiplying the input vector by static learned weights, then those are all multiplied together in the attention calculation. Basically just a whole bunch of dot products. You would just have flattened matricies with intermediate layers acting as accumulators.
>Indeed I wonder if it's actually possible for a bunch of perceptrons to even 'discover' the attention mechanism
It is. It won't be attention in a classical sense, it would just be extra connections on a wider dimension single layer stacks. The learning process would put the right values in the correct place.