this animation really makes the difference hit home: https://x.com/_akhaliq/status/1900027075370586262
I wish every paper had a succinct image like that!
This animation is a perfect abstract!
Imagine a whole git repo of a project materializing like that.
I'm really curious to see if this kind of diffusion works with "guided generation" / grammars as well. There's a lot of "structure" in code that you could use to minimise the low-hanging-fruit errors and focus on the logic of the code instead.
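Purely as a speculative sketch of what that could look like (my own illustration, not anything from the paper): at each unmasking step you could suppress the logits of tokens the grammar disallows at that position. Getting the allowed set when tokens are filled out of order is the genuinely hard part and is just assumed here.

```python
import numpy as np

# Speculative illustration: when unmasking a position, suppress logits of tokens
# a grammar would disallow there. `allowed_token_ids` is assumed to come from
# some grammar engine, which is the hard part for out-of-order unmasking.

def constrained_unmask(logits, allowed_token_ids):
    penalty = np.full_like(logits, -np.inf)
    penalty[list(allowed_token_ids)] = 0.0
    return int(np.argmax(logits + penalty))

vocab = {0: "(", 1: ")", 2: "x", 3: "+"}
logits = np.array([0.1, 2.0, 0.5, 0.3])           # the model prefers ")"
print(vocab[constrained_unmask(logits, {0, 2})])  # grammar only allows "(" or "x" -> prints "x"
```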
Instead of vibe coding, imagine a Doctor Strange method of coding, or coding the way Mezerg plays the theremin.
Memory bandwidth is the bottleneck that limits the speed of running local models. The fact that this model is parallelizable means that even with batch-size-1 inference it will be possible to balance the memory-bandwidth bottleneck against the compute bottleneck (i.e. much more speed).
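For a rough sense of the numbers (every figure below is an illustrative assumption, not a benchmark):

```python
# Back-of-envelope only; all numbers are assumptions, not measurements.
# At batch size 1, autoregressive decoding is typically memory-bandwidth bound:
# each new token requires streaming the full set of weights once.

weights_gb = 14.0           # e.g. a ~7B-parameter model in fp16 (assumed)
mem_bandwidth_gbps = 100.0  # effective bandwidth of a local machine (assumed)

ar_tokens_per_s = mem_bandwidth_gbps / weights_gb  # ceiling for sequential decoding
print(f"autoregressive ceiling: ~{ar_tokens_per_s:.0f} tokens/s")

# A block-diffusion model denoises a whole block per weight pass. If a block of
# 8 tokens takes 4 denoising passes, each pass still streams the weights once
# but yields 2 tokens on average, trading spare compute for fewer memory passes.
block_size, denoise_steps = 8, 4
bd_tokens_per_s = ar_tokens_per_s * block_size / denoise_steps
print(f"block-diffusion ceiling: ~{bd_tokens_per_s:.0f} tokens/s")
```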
Based on the animation, I personally don't expect this to be very helpful. The main way diffusion models help is preventing answers like "No. [proceeds to explain why the answer is yes]", and since the blocks are so small, the LLM can't fully explain before it has to say yes or no.
Could you expound on this? From what I'm reading, this sounds like an issue with diffusion models that their block diffusion model is purposefully designed to mitigate, by conditioning on previous blocks and allowing for larger blocks if that conditioning still doesn't help maintain coherence.
It's an issue that you run into as long as you're forced to start with a yes/no answer. It's a problem forward-only LLMs have and diffusion models don't, and normal block diffusion is closer to forward LLMs than diffusion models.
You could increase the block size to act more like a full diffusion model, but you would lose some of the benefits of block diffusion.
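A hedged sketch of what that knob looks like in a block-diffusion sampling loop (`model.denoise` here is a stand-in of my own, not the paper's actual API): block_size = 1 degenerates to autoregressive decoding, while a block the size of the whole sequence is essentially a full diffusion model.

```python
MASK = -1  # placeholder id for a masked token

def sample(model, prompt, seq_len=128, block_size=8, denoise_steps=4):
    """Generate whole blocks until at least seq_len tokens exist."""
    tokens = list(prompt)
    while len(tokens) < seq_len:
        block = [MASK] * block_size
        for _ in range(denoise_steps):
            # The model conditions on all previously committed blocks (like an
            # AR model) plus the partially unmasked current block (like diffusion).
            block = model.denoise(context=tokens, block=block)
        tokens.extend(block)  # commit: these tokens are frozen from now on
    return tokens

class DummyModel:
    """Toy stand-in: "denoises" by filling every masked slot with token id 0."""
    def denoise(self, context, block):
        return [0 if t == MASK else t for t in block]

print(sample(DummyModel(), prompt=[1, 2, 3], seq_len=16))
```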
Interesting. Makes me want to play around with an open diffusion LM. Do you have any recommendations?
My understanding here is block size can be arbitrarily large, under similar constraints as diffusion models. Is that not the case?
https://m-arriola.com/bd3lms/
https://github.com/kuleshov-group/bd3lms
Isn't this basically the diffusion-autoregressive sampling strategy from the LLaDA paper, maybe more carefully evaluated?
the LLaDA paper is a scaled-up version of this paper; they cite it as an anonymous ICLR submission
I'm not sure if this is what you mean, but LLaDA isn't block text diffusion. This is a mix between an autoregressive model and a diffusion model, which is brand new.
It is a soft-block text diffusion. They load one super-block of fixed size and then only allow the model to unmask tokens by working through the soft-blocks in order. Since the source code is available, I was able to change it into an actual block diffusion, but because the model was trained only on super-blocks, it kept trying to generate EOS tokens at the end of each block before I extended it. I've tried a few workarounds that half worked, but I guess a very small-scale finetune is needed to resolve it fully.
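Roughly, the soft-block behaviour described above looks like this (the names and the `model` stand-in are mine, not LLaDA's): the whole fixed-size super-block stays in the model's context the entire time, but unmasking is only permitted inside the current sub-block, sweeping left to right.

```python
MASK = -1

def soft_block_sample(model, prompt, super_block=128, sub_block=32,
                      steps_per_sub_block=8, conf_threshold=0.9):
    seq = list(prompt) + [MASK] * super_block
    for start in range(len(prompt), len(seq), sub_block):
        end = min(start + sub_block, len(seq))
        for _ in range(steps_per_sub_block):
            # `model` is a stand-in returning (predicted_token, confidence) per
            # position; it attends over the full super-block, masks included.
            preds = model(seq)
            for i in range(start, end):
                token, conf = preds[i]
                if seq[i] == MASK and conf >= conf_threshold:
                    seq[i] = token  # unmask only inside the current sub-block
    return seq
```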
Ah.
This is cool but I feel like you lose the best part of language-diffusion models which is their ability to edit early tokens.
Those early tokens aren't necessarily immutable; they could still be "edited" depending on the UI. Human conversation, and even internal compositional cogitation, is full of "what I meant by that" or "on second thought" type clarifications and corrections. Sometimes these aren't verbosely disclaimed; there's body language involved. Likewise there could be occasional lookback parsing, and later blocks could convey modifications. The UI could then highlight those revisions transparently by applying strikethrough styling, coloration, a dotted underline with a tooltip on hover, etc.
Like we've seen with human interactions and media, this may be susceptible to misinterpretation by the reader or listener, especially via second-hand clips or screenshots lacking full context. But if the UX is clean and speedy it would be less likely.
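As a toy sketch of that revision-highlighting idea (purely illustrative; the rendering choices and the function are my own, not from any paper or UI):

```python
import difflib

def render_revision(old_tokens, new_tokens):
    """Mark tokens the model later revised: strikethrough for removals, <ins> for insertions."""
    out = []
    sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    for op, a0, a1, b0, b1 in sm.get_opcodes():
        if op == "equal":
            out.extend(new_tokens[b0:b1])
        else:
            if a1 > a0:
                out.append("<s>" + " ".join(old_tokens[a0:a1]) + "</s>")
            if b1 > b0:
                out.append("<ins>" + " ".join(new_tokens[b0:b1]) + "</ins>")
    return " ".join(out)

print(render_revision("No . Wait".split(), "Yes , actually".split()))
```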
I'm reminded of the Physics of Language Models[1], where they showed a standard autoregressive LLM got a lot more accurate if the model got access to the backspace key, so to speak.
[1]: https://physics.allen-zhu.com/home
To be fair, it's not "obviously" better, but it opens a new point on the tradeoff curve. For a lot of use cases full autoregression is clearly better, and for some others full diffusion will still be better.
Autoregressivity has high quality outputs but is fairly slow. Diffusion has low quality output but is quite fast.
This lets you sit in the middle: not as high quality as full autoregression and not as fast as full diffusion, but a balance between the two.
Only starts approaching PPL of AR at block size 4. May as well just use multi-token prediction with a standard AR model.
I love when someone comes up with a good idea that becomes immediately obvious as soon as it is introduced
I wonder how sliding window would work over blocks.
Excellent question
Diffusion on images is easy for me to understand: you start with noise, and the model denoises by shifting the pixels towards their final values. What is the equivalent operation for increasing or reducing noise in language here? Is the "noisy" sentence half-way through training or inference sort-of-correct but not really, and at 90% almost-correct but with slightly wrong words (semantically)? Is the noise somehow semantic at all, or is it something else?
For LLMs, the standard method is to give the foundation model a sentence with one word removed (chosen at random, but already known to the model) and ask it to fill in that word.
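A minimal sketch of that "forward noise" process, assuming the masked-token formulation used by discrete text diffusion models (the example sentence is just this thread's own question):

```python
import random

MASK = "<mask>"

def add_noise(tokens, t):
    """Forward 'noising' for masked text diffusion: each token is independently
    replaced by MASK with probability t (t=0 leaves the text clean, t=1 masks everything)."""
    return [MASK if random.random() < t else tok for tok in tokens]

sentence = "what noise is it denoising there".split()
for t in (0.25, 0.5, 0.9):
    print(t, add_noise(sentence, t))

# Training inverts this: given the partially masked sentence (and the noise
# level t), predict the original tokens at the masked positions. The "noise"
# here is missing tokens rather than Gaussian pixel noise.
```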
How is that diffusion? What noise is it denoising there?
Don't think about diffusion as denoising, but rather as learning a delta operator, or rather, the inverse of one. I don't know how diffusion language models work precisely, but if I were to hazard a guess, I would say you may think of a sentence as a matrix of values, the operator as simply filling a line with zeros, and the inverse (what the model learns) as adding it back.
This is equivalent to cutting an image into blocks and learning to generate images incrementally by inpainting the missing blocks. This inpainting, mind you, can itself happen over multiple steps, where you incrementally add more into the block.
Example, as I understand it:
(I'm not sure how the prompt should look; this is my guess.) Prompt: answer what word is missing in the text of the query. Query: What is it denoising there?
Not familiar at all with diffusion LLM but I'd guess you'd have noisy logits.