Running out of text

From Simia

Many of the available text corpora have by now been used for training language models. One corpus that remains untapped so far is our private messages and emails.

How fortunate that none of the companies that train large language models have access to humongous logs of private chats and emails, often larger than any other corpus for many languages.

How fortunate that those who do have well-functioning ethics boards in place, which would make sure that such requests are evaluated.

How fortunate that we have laws in place to protect our privacy.

How fortunate that when new models are published, the corpora on which they were trained are published as well.

What? You're telling me "Open"AI is keeping the training corpus for GPT-4 secret? The company closely associated with Microsoft, which owns Skype, Office, and Hotmail? The same Microsoft that just fired an ethics team? Why would all that be worrisome?

P.S.: To be clear: I don't think that OpenAI has used private chat logs and emails as training data for GPT-4. But by not disclosing their corpora, they might be testing whether they can get away with not being transparent, so that maybe next time they might do it. No one would know, right? And no one would stop them. And hey, if it improves the metrics...

