Launch of Croissant by MLCommons – Google AI

  • Editor
  • Updated March 7, 2024

In a notable development for the machine learning (ML) community, MLCommons has launched Croissant, a metadata format announced by Google AI and designed to improve the discoverability and usability of ML datasets across a wide range of tools and platforms.

This initiative aims to address the long-standing challenge of dataset diversity and organization that has been a major roadblock in the path of ML progress.

Machine learning datasets span a vast array of content types, including text, structured data, images, audio, and video, each with unique organization and format challenges. This diversity often results in decreased productivity during the ML development cycle, as practitioners grapple with understanding and utilizing these datasets effectively.

Croissant offers a standardized way to describe and organize ML datasets without altering how the data itself is represented.

It builds on the existing schema.org standard, which is already used by over 40 million datasets, and adds layers tailored for ML: metadata for responsible use, data resources, data organization, and default ML semantics.
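To make the layering concrete, here is a minimal sketch of what a Croissant-style description might look like when built in Python. It is an abbreviated illustration, not a validated Croissant file: the dataset name, URL, and field names are invented, the "@context" is heavily shortened, and the namespace URL is an assumption, so the official specification remains the reference for the exact vocabulary.

```python
import json

# Abbreviated, hypothetical Croissant-style description. The names, URL, and
# shortened "@context" are assumptions made for illustration only.
dataset = {
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",  # assumed namespace URL
    },
    # Dataset-level metadata reused from schema.org.
    "@type": "sc:Dataset",
    "name": "toy_reviews",
    "description": "A toy dataset of product reviews.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Resources layer: the files that make up the dataset.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Organization layer: how raw files map to records and typed fields.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "review"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "reviews/label",
                    "dataType": "sc:Integer",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "stars"},
                    },
                },
            ],
        }
    ],
}


def field_ids(doc):
    """Return the @id of every field in every record set."""
    return [f["@id"] for rs in doc.get("recordSet", []) for f in rs.get("field", [])]


# Serialize to JSON, the form in which such a description would be published.
serialized = json.dumps(dataset, indent=2)
print(serialized)
print(field_ids(dataset))
```

The point of the sketch is that the description sits alongside the data: the CSV file is referenced, not rewritten, and the ML-specific parts are added on top of ordinary schema.org dataset metadata.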

Croissant has also gained support from leading ML dataset collections and tools. Kaggle, Hugging Face, and OpenML will begin supporting the Croissant format, and ML frameworks such as TensorFlow, PyTorch, and JAX can load Croissant datasets through the TensorFlow Datasets (TFDS) package.

This widespread support underscores the potential of Croissant to streamline the ML development process.

Google said in a blog post:

In addition, we are announcing support from major tools and repositories: Today, three widely used collections of ML datasets — Kaggle, Hugging Face, and OpenML — will begin supporting the Croissant format for the datasets they host; the Dataset Search tool lets users search for Croissant datasets across the Web; and popular ML frameworks, including TensorFlow, PyTorch, and JAX, can load Croissant datasets easily using the TensorFlow Datasets (TFDS) package

The Croissant 1.0 release includes a comprehensive format specification, example datasets, an open-source Python library for managing Croissant metadata, and a visual editor to facilitate the intuitive creation and inspection of dataset descriptions.

Moreover, the Croissant Responsible AI (RAI) vocabulary extension introduces critical properties to address key RAI use cases, embedding principles of fairness, safety, and compliance from the outset.

By offering a unified format for ML data, Croissant aims to alleviate the data development burden, particularly for resource-constrained researchers and startups.

It promises to enhance dataset discoverability, ease the development of data processing tools, and enable more straightforward model training and testing with minimal coding.
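As a rough illustration of why a shared description can reduce loading code, the sketch below reads typed records from a CSV using only a Croissant-like description and the Python standard library. It is not the official Croissant Python library or TFDS: the helper iter_records, the toy metadata, and the inline CSV are all hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical, simplified description in the spirit of Croissant.
metadata = {
    "name": "toy_reviews",
    "recordSet": [
        {
            "@id": "reviews",
            "field": [
                {
                    "name": "text",
                    "dataType": "sc:Text",
                    "source": {"extract": {"column": "review"}},
                },
                {
                    "name": "label",
                    "dataType": "sc:Integer",
                    "source": {"extract": {"column": "stars"}},
                },
            ],
        }
    ],
}

# Inline stand-in for the dataset file, so the example needs no download.
RAW_CSV = "review,stars\nGreat product,5\nBroke after a day,1\n"

# Map declared data types to Python casts (a small assumed subset).
CASTS = {"sc:Text": str, "sc:Integer": int, "sc:Float": float}


def iter_records(description, record_set_id, raw_text):
    """Yield typed records for one record set, driven only by the description."""
    record_set = next(
        rs for rs in description["recordSet"] if rs["@id"] == record_set_id
    )
    for row in csv.DictReader(io.StringIO(raw_text)):
        yield {
            f["name"]: CASTS[f["dataType"]](row[f["source"]["extract"]["column"]])
            for f in record_set["field"]
        }


records = list(iter_records(metadata, "reviews", RAW_CSV))
for record in records:
    print(record)
```

Because the column names and types come from the description and not from the loader, the same function would work for any dataset described in this simplified way.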

This initiative not only boosts the value of datasets for authors but also fosters a more vibrant ecosystem for ML research and application.

As the ML community looks forward to leveraging Croissant’s capabilities, MLCommons encourages dataset creators, hosting platforms, and tool developers to adopt and support this format. By doing so, they can contribute to a more efficient and collaborative ML development landscape.



Dave Andre

Editor

Digital marketing enthusiast by day, nature wanderer by dusk. Dave Andre blends two decades of AI and SaaS expertise into impactful strategies for SMEs. His weekends? Lost in books on tech trends and rejuvenating on scenic trails.
