When training a transformer on positionally encoded embeddings, should the tgt output embeddings also be positionally encoded? If so, wouldn’t the predicted/decoded embeddings also be positionally encoded?

  • yboutrosOP
    1 month ago

    Thanks for the feedback! I also asked a similar question in an AI Stack Exchange thread and got some helpful feedback there.

    It was a great project for brushing up on seq2seq modeling, but I decided to shelve it since someone released a polished website doing the same thing.

    The idea was that chords are the vocabulary of music composition, and the sentences and paragraphs are measures: sequences of chords, or sequences of measures.
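
    A minimal sketch of the chords-as-vocabulary idea, assuming each chord symbol is one token and each progression is wrapped in BOS/EOS markers. The names here (`SPECIALS`, `build_vocab`, `encode`) and the example progressions are hypothetical and not taken from the repo:

```python
# Hypothetical special tokens; the actual project may use different ones.
SPECIALS = ["<pad>", "<bos>", "<eos>"]


def build_vocab(progressions):
    """Map every distinct chord symbol to an integer id, after the specials."""
    chords = sorted({chord for prog in progressions for chord in prog})
    return {token: idx for idx, token in enumerate(SPECIALS + chords)}


def encode(progression, vocab):
    """Wrap a chord sequence in BOS/EOS and convert it to token ids."""
    body = [vocab[chord] for chord in progression]
    return [vocab["<bos>"]] + body + [vocab["<eos>"]]


# Made-up example data: two short chord progressions.
progressions = [
    ["Dm7", "G7", "Cmaj7"],
    ["Cmaj7", "Am7", "Dm7", "G7"],
]

vocab = build_vocab(progressions)
ids = encode(progressions[0], vocab)
print(len(vocab), ids)
```

    With only a few distinct chord symbols the vocabulary stays tiny, which is the point made about model size in the next paragraph.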

    I think it’s a great project because the vocab size and max sequence length are much smaller than what is typical for transformers applied to LLM tasks, like digesting novels. So on consumer-grade hardware (12 GB VRAM) it’s feasible to train a couple of different model architectures in tandem.

    Additionally, nothing sounds bad in music composition; it’s up to the musician to find a creative way to make it sound good. So even if the model is poorly trained, as long as it doesn’t output EOS immediately after BOS and the sequences are unique enough, it’s pretty hard to end up with something that doesn’t still work.
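
    The two conditions above (no EOS right after BOS, and enough variety) could be checked on generated outputs with something like the sketch below. The function names and the token ids used in the example are assumptions for illustration, not code from the repo:

```python
def is_usable(ids, bos_id, eos_id):
    """True if at least one chord token appears between BOS and the first EOS."""
    if not ids or ids[0] != bos_id:
        return False
    body = ids[1:]
    if eos_id in body:
        # Ignore anything generated after the first EOS.
        body = body[:body.index(eos_id)]
    return len(body) > 0


def unique_fraction(sequences):
    """Fraction of generated sequences that are distinct from one another."""
    if not sequences:
        return 0.0
    return len({tuple(seq) for seq in sequences}) / len(sequences)


# Assumed ids for this example: BOS = 1, EOS = 2, chords = 3 and up.
samples = [[1, 2], [1, 5, 6, 2], [1, 5, 6, 2], [1, 4, 3, 5, 2]]
usable = [seq for seq in samples if is_usable(seq, bos_id=1, eos_id=2)]
print(len(usable), unique_fraction(usable))
```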

    It’s also fairly easy to gather data from a site like iRealPro.

    The repo is still disorganized, but if you’re curious, the main script is scrape.py.

    https://github.com/Yanall-Boutros/pyRealFakeProducer