Sam Altman-run OpenAI has introduced data partnerships, under which it will work with organisations to produce public and private datasets for training AI models.
The company aims to collaborate with organisations to help AI models understand “all subject matters, industries, cultures, and languages”, which requires as broad a training dataset as possible.
“Data Partnerships are intended to enable more organisations to help steer the future of AI and benefit from models that are more useful to them, by including content they care about,” the company said in a statement.
The ChatGPT developer said that it is interested in large-scale datasets that reflect human society and that are not already easily accessible online to the public.
“We can work with any modality, including text, images, audio, or video. We’re particularly looking for data that expresses human intention (e.g. long-form writing or conversations rather than disconnected snippets), across any language, topic, and format,” the company noted.
OpenAI said it can work with data in almost any form and can use its next-generation in-house AI technology to help people digitise and structure their data.
“For example, we have world-class optical character recognition (OCR) technology to digitise files like PDFs, and automatic speech recognition (ASR) to transcribe spoken words,” the company added.
The company is seeking partners to help it create an open-source dataset for training language models.
“This dataset would be public for anyone to use in AI model training. We would also explore using it to safely train additional open-source models ourselves. We believe open-source plays an important role in the ecosystem,” said OpenAI.
“We are also preparing private datasets for training proprietary AI models, including our foundation models and fine-tuned and custom models,” it added.