• treadful@lemmy.zip
    9 months ago

    What do you think happens to data when it’s scraped? Copying the data is a fundamental requirement for using it in training. These models are trained in big datacenters where the original work is split up and tokenized and used over and over again.

    Tokenizing and calculating vectors or whatever is not the same thing as distributing copies of said work.

    The difference between you training a model and you reading a book (put online by its author in clear text, to avoid the obvious issue of actual piracy for human use) is that your reading it on a website is what the copyright holder intended, and you as a person have a fundamental right to remember things and be inspired.

    Copyright holders can’t say what I do with their work, nor what I do with the knowledge of their book. They can only say how I copy and distribute it. I don’t need consent to burn an author’s book, create fan art around it, or quote characters in my blog. I do need their consent to copy and distribute their works directly.

    You don’t, however, have a right to copy and use the text for other purposes, whether that’s making a t-shirt with a memorable line, printing it out to give to someone else, or tokenizing it to train a computer algorithm.

    And at some point the pieces of text are so small that they become uncopyrightable. You can’t copyright most words or short phrases.

    • Zaktor@sopuli.xyz
      9 months ago

      Tokenizing and calculating vectors or whatever is not the same thing as distributing copies of said work.

      It very much is. You can’t just run a cipher on a copyrighted work and say “it’s not the same, so I didn’t copy it”. Tokenization is reversible to the original text. And “distributing” is separate from violating copyright. It’s not distriburight, it’s copyright. Copying a work without authorization for private use is still violating copyright.
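To make the "reversible" point concrete, here is a minimal sketch of a round trip through a toy word-level tokenizer. Everything in it (`build_vocab`, `encode`, `decode`, the sample sentence) is a hypothetical illustration and not the tokenizer of any real model. Production tokenizers usually work on subword units, and some normalize text first, so an exact round trip is not guaranteed everywhere.

```python
import re

# Toy tokenizer (an assumption for illustration, not a real model's tokenizer).
# The pattern splits text into word runs, whitespace runs, and single
# punctuation characters, so every character lands in exactly one piece.
PIECE = re.compile(r"\w+|\s+|[^\w\s]")

def build_vocab(text):
    """Assign each distinct piece an integer ID in order of first appearance."""
    vocab = {}
    for piece in PIECE.findall(text):
        vocab.setdefault(piece, len(vocab))
    return vocab

def encode(text, vocab):
    """Turn text into a list of integer token IDs."""
    return [vocab[piece] for piece in PIECE.findall(text)]

def decode(ids, vocab):
    """Map token IDs back to their pieces and join them into text."""
    inverse = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(inverse[token_id] for token_id in ids)

sample = "It's not distriburight, it's copyright."
vocab = build_vocab(sample)
ids = encode(sample, vocab)
restored = decode(ids, vocab)
print(restored == sample)
```

In this sketch the list of IDs plus the vocabulary carries the same information as the original string, which is the sense in which the stored tokens are still a copy of the text.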

      • treadful@lemmy.zip
        9 months ago

        You can’t just run a cipher on a copyrighted work and say “it’s not the same, so I didn’t copy it”.

        Yes I can. I can download a Web page, encrypt it on my machine, and I’m not distributing said work.

        And “distributing” is separate from violating copyright. It’s not distriburight, it’s copyright. Copying a work without authorization for private use is still violating copyright.

        That’s just false.

        • Zaktor@sopuli.xyz
          9 months ago

          You absolutely do not know what you’re talking about. This is just trivial copyright law, but there’s a weird internet mythology that if you can access something on the net you can take it as long as you don’t share it further. The reason the mass-sharers tended to get prosecuted is that they were easier and more valuable targets, not that the people they were sharing it with weren’t also breaking the law.