MLCommons has launched Croissant, a metadata format designed to make machine learning (ML) datasets easier to discover and use across a wide range of tools and platforms. Google AI announced the release on its blog.
The format targets a long-standing obstacle to ML progress: datasets vary so widely in format and organization that they are hard to find, understand, and reuse.
Today on the blog, we’re excited to announce the release of @MLCommons Croissant, a metadata format to make ML datasets more easily discoverable and usable across a wide array of tools and platforms. Learn more and try it today → https://t.co/fdjBDannpx #ml #datasets
— Google AI (@GoogleAI) March 6, 2024
Machine learning datasets span a vast array of content types, including text, structured data, images, audio, and video, each with unique organization and format challenges. This diversity often results in decreased productivity during the ML development cycle, as practitioners grapple with understanding and utilizing these datasets effectively.
Croissant offers a standardized way to describe and organize ML datasets without changing how the underlying data is represented.
It builds on the existing schema.org standard, which is already used by over 40 million datasets, and adds layers tailored to ML: metadata for responsible use, data resources, data organization, and default ML semantics.
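To make the layering concrete, here is a rough sketch in Python of what a small Croissant description might look like when built as a JSON-LD document. It is an illustration under assumptions rather than an excerpt from the specification: the dataset name, URLs, file, and columns are all hypothetical placeholders, the `@context` is heavily simplified compared with the one the format documentation prescribes, and the property names (`distribution`, `recordSet`, `field`, `source`, `extract`) follow our reading of the Croissant 1.0 vocabulary and should be checked against the official docs before use.

```python
import json


def build_croissant_metadata() -> dict:
    """Return a toy Croissant-style description of a hypothetical CSV dataset."""
    return {
        # Simplified context (assumption): the real specification defines a
        # longer context with many more term mappings.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
        },
        # Dataset-level metadata, inherited from schema.org.
        "@type": "sc:Dataset",
        "name": "toy_reviews",  # hypothetical dataset
        "description": "Placeholder dataset of short product reviews.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "url": "https://example.com/toy_reviews",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        # Resources layer: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "reviews.csv",
                "name": "reviews.csv",
                "contentUrl": "https://example.com/toy_reviews/reviews.csv",
                "encodingFormat": "text/csv",
            }
        ],
        # Organization layer: how records and fields map onto those files.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "reviews",
                "name": "reviews",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/text",
                        "name": "text",
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "text"},
                        },
                    },
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/rating",
                        "name": "rating",
                        "dataType": "sc:Integer",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "rating"},
                        },
                    },
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Serialize the description the way it would be stored alongside the data.
    print(json.dumps(build_croissant_metadata(), indent=2))
```

Note that the description only points at `reviews.csv` and names its columns; the data file itself is left untouched, which is the property the article highlights.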
Today’s @MLCommons Croissant metadata format release includes the format documentation, an open source library, visual editor, thanks to industry support from @HuggingFace, @GoogleAI Dataset Search, @Kaggle, and @Open_ML. #ai #ml #datasets https://t.co/vBJPCeOgI3
— MLCommons (@MLCommons) March 6, 2024
In a significant boost to its adoption, Croissant has garnered support from leading ML dataset collections and tools. Kaggle, Hugging Face, and OpenML will start incorporating the Croissant format, and ML frameworks like TensorFlow, PyTorch, and JAX are set to facilitate easy loading of Croissant datasets through the TensorFlow Datasets (TFDS) package.
This widespread support underscores the potential of Croissant to streamline the ML development process.
Google said in a blog post:
In addition, we are announcing support from major tools and repositories: Today, three widely used collections of ML datasets — Kaggle, Hugging Face, and OpenML — will begin supporting the Croissant format for the datasets they host; the Dataset Search tool lets users search for Croissant datasets across the Web; and popular ML frameworks, including TensorFlow, PyTorch, and JAX, can load Croissant datasets easily using the TensorFlow Datasets (TFDS) package.
The Croissant 1.0 release includes a comprehensive format specification, example datasets, an open-source Python library for managing Croissant metadata, and a visual editor to facilitate the intuitive creation and inspection of dataset descriptions.
Moreover, the Croissant Responsible AI (RAI) vocabulary extension introduces critical properties to address key RAI use cases, embedding principles of fairness, safety, and compliance from the outset.
By offering a unified format for ML data, Croissant aims to alleviate the data development burden, particularly for resource-constrained researchers and startups.
It promises to enhance dataset discoverability, ease the development of data processing tools, and enable more straightforward model training and testing with minimal coding.
This initiative not only boosts the value of datasets for authors but also fosters a more vibrant ecosystem for ML research and application.
👏👏👏 We’re very excited to support @MLCommons’ new Croissant format for datasets on Kaggle. Learn more about the goals and benefits of Croissant here: https://t.co/8nXISQM9vN https://t.co/dmxo8vpaF9
— Kaggle (@kaggle) March 6, 2024
As the ML community looks forward to leveraging Croissant’s capabilities, MLCommons encourages dataset creators, hosting platforms, and tool developers to adopt and support this format. By doing so, they can contribute to a more efficient and collaborative ML development landscape.
For more of the latest news, visit our AI news at allaboutai.com