When training a transformer on positionally encoded embeddings, should the tgt output embeddings also be positionally encoded? If so, wouldn’t the predicted/decoded embeddings also be positionally encoded?

  • fish@feddit.uk

    Hey there! Great question. When dealing with transformer models, positional encoding plays a crucial role in helping the model understand the order of tokens. Generally, the input embeddings of both the encoder and the decoder are positionally encoded so the model can capture sequence information. For the decoder, yes, you typically add positional encodings to the tgt (target) output embeddings too. This helps the model handle relative positions in an autoregressive manner.
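To make that concrete, here is a minimal PyTorch sketch (the sizes, `PositionalEncoding` helper, and toy vocab are illustrative, not from any particular codebase): the same sinusoidal positional encoding is added to both the src and tgt embeddings before they enter the transformer.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding from 'Attention Is All You Need'."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> add the matching positions
        return x + self.pe[:, : x.size(1)]

d_model = 64
embed = nn.Embedding(100, d_model)       # toy vocab of 100 tokens
pos_enc = PositionalEncoding(d_model)
model = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)

src = torch.randint(0, 100, (2, 10))     # encoder input token ids
tgt = torch.randint(0, 100, (2, 7))      # decoder (tgt) input token ids

# Positional encoding is applied to BOTH src and tgt embeddings:
out = model(pos_enc(embed(src)), pos_enc(embed(tgt)))
print(out.shape)  # (batch, tgt_len, d_model)
```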

    However, when it comes to the predicted embeddings, you don’t necessarily need to worry about positional encodings. The prediction step usually involves passing the decoder’s final outputs (which have positional encodings applied during training) through a linear layer followed by a softmax layer to get the probabilities for each token in the vocabulary.
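Sketching that prediction step (again with made-up sizes; `decoder_out` stands in for the decoder's final hidden states): the head is just a linear projection to vocabulary logits followed by a softmax, and no positional encoding is applied to the outputs themselves.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 64
# Stand-in for the decoder's final output: batch of 2, tgt length 7
decoder_out = torch.randn(2, 7, d_model)

# Linear projection to vocab logits -- no positional encoding here
to_logits = nn.Linear(d_model, vocab_size)
logits = to_logits(decoder_out)        # (2, 7, vocab_size)
probs = torch.softmax(logits, dim=-1)  # per-position token probabilities
next_tokens = probs.argmax(dim=-1)     # greedy decode: token ids, (2, 7)
```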

    Think of it like this: the model learns to interpret positional information during training, but at generation time its focus shifts to predicting the next token from the learned sequences. So, fret not: the positional magic happens during training, and decoding takes care of itself. That said, it's always good to double-check the specifics against your model and dataset requirements.

    Hope this helps clarify things a bit! Would love to hear how your project is going.

    • yboutros (OP)

      Thanks for the feedback! I also asked a similar question on the AI Stack Exchange and got some helpful feedback there.

      It was a great project for brushing up on seq2seq modeling, but I decided to shelve it since someone released a polished website doing the same thing.

      The idea was that the vocabulary of music composition is chords: sequences of chords form measures, and sequences of measures form a piece, playing the role of sentences and paragraphs.

      I think it’s a great project because the limited vocab size and max sequence length are much smaller than what is typical for transformers applied to LLM tasks, like digesting novels for example. So on consumer-grade hardware (12 GB VRAM) it’s feasible to train a couple of different model architectures in tandem.

      Additionally, nothing sounds inherently bad in music composition; it’s up to the musician to find a creative way to make it sound good. So even if the model is poorly trained, as long as it doesn’t output EOS immediately after BOS and the sequences are unique enough, it’s pretty hard to get output that doesn’t work.

      It’s also fairly easy to gather data from a site like iRealPro

      The repo is still disorganized, but if you’re curious the main script is scrape.py

      https://github.com/Yanall-Boutros/pyRealFakeProducer