The Open Source AI Definition – draft v. 0.0.7.1

version 0.0.7.1

See the latest draft

Note: This document is made of three parts: a preamble, stating the intentions of this document; the Definition of Open Source AI itself; and a checklist to evaluate legal documents.

This document follows the definition of AI system adopted by the Organisation for Economic Co-operation and Development (OECD):

An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment.

More information about definitions of AI systems is available on OSI’s blog.

Preamble

Why we need Open Source Artificial Intelligence (AI)

Open Source has demonstrated that massive benefits accrue to everyone when you remove the barriers to learning, using, sharing and improving software systems. These benefits are the result of using licenses that adhere to the Open Source Definition. The benefits can be summarized as autonomy, transparency, and collaborative improvement.

Everyone needs these benefits in AI. We need essential freedoms to enable users to build and deploy AI systems that are reliable and transparent.

Out of scope issues

The Open Source AI Definition doesn’t say how to develop and deploy an AI system that is ethical, trustworthy or responsible, although it doesn’t prevent it. The efforts to discuss the responsible development, deployment and use of AI systems, including through appropriate government regulation, are a separate conversation.

What is Open Source AI

An Open Source AI is an AI system made available under terms that grant the freedoms to:

  • Use the system for any purpose and without having to ask for permission.
  • Study how the system works and inspect its components.
  • Modify the system for any purpose, including to change its output.
  • Share the system for others to use with or without modifications, for any purpose.

A precondition to exercising these freedoms is having access to the preferred form to make modifications to the system.

This checklist is based on the paper “The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI”, published March 21, 2024.

Preferred form to make modifications to machine-learning systems

The default set of components required for a machine-learning Open Source AI is:

  • Data transparency: Sufficiently detailed information on how the system was trained. This may include the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics, how the data was obtained and selected, the labeling procedures, and the data cleaning methodologies.
  • Code: The code used for pre-processing data, the code used for training, validation and testing, the supporting libraries like tokenizers and hyperparameter search code (if used), the inference code, and the model architecture.
  • Model: The model parameters, including weights. Where applicable, these should include checkpoints from key intermediate stages of training as well as the final optimizer state.

Table of default required components

Required components | Legal frameworks
Code
– Data pre-processing | Available under OSI-compliant license
– Training, validation and testing | Available under OSI-compliant license
– Inference code | Available under OSI-compliant license
– Supporting libraries and tools | Available under OSI-compliant license
Model
– Model architecture | Available under OSI-compliant license
– Model parameters (including weights) | Available under terms compatible with Open Source principles
Data transparency
– Training methodologies and techniques | Available under OSI-compliant license
– Training data scope and characteristics | Available under OSI-compliant license
– Training data provenance (including how data was obtained and selected) | Available under OSI-compliant license
– Training data labeling procedures, if used | Available under OSI-compliant license
– Training data cleaning methodology | Available under OSI-compliant license
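
The table above can be read as a checklist. The following is a minimal sketch, not part of the Definition, of how a release might be compared against it. The component names and legal frameworks are transcribed from the table; the names REQUIRED_COMPONENTS and missing_components, and the example release, are hypothetical and introduced only for illustration. The sketch checks only whether a component is listed as provided. It does not inspect license terms, and it does not model the "if used" condition on labeling procedures.

```python
OSI_LICENSE = "Available under OSI-compliant license"
OPEN_TERMS = "Available under terms compatible with Open Source principles"

# Required components and their legal frameworks, transcribed from the table.
REQUIRED_COMPONENTS = {
    "Code": {
        "Data pre-processing": OSI_LICENSE,
        "Training, validation and testing": OSI_LICENSE,
        "Inference code": OSI_LICENSE,
        "Supporting libraries and tools": OSI_LICENSE,
    },
    "Model": {
        "Model architecture": OSI_LICENSE,
        "Model parameters (including weights)": OPEN_TERMS,
    },
    "Data transparency": {
        "Training methodologies and techniques": OSI_LICENSE,
        "Training data scope and characteristics": OSI_LICENSE,
        "Training data provenance (including how data was obtained and selected)": OSI_LICENSE,
        # Assumption: treated as required here; the "if used" condition is not modeled.
        "Training data labeling procedures, if used": OSI_LICENSE,
        "Training data cleaning methodology": OSI_LICENSE,
    },
}


def missing_components(provided):
    """Return (category, component, legal framework) for every required
    component whose name is not in the `provided` collection."""
    return [
        (category, component, framework)
        for category, components in REQUIRED_COMPONENTS.items()
        for component, framework in components.items()
        if component not in provided
    ]


# Hypothetical release that ships only inference code, architecture and weights.
release = {
    "Inference code",
    "Model architecture",
    "Model parameters (including weights)",
}
for category, component, framework in missing_components(release):
    print(f"{category}: {component} ({framework})")
```

A release like the hypothetical one above would be reported as missing the remaining code components and every data transparency component, which is the gap the default set of required components is meant to expose.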

The following components are not required, but their inclusion in public releases is appreciated.

Optional components
Code
– Code used to perform inference for benchmark tests
– Evaluation code
Data: All data sets, including:
– Training data sets
– Testing data sets
– Validation data sets
– Benchmarking data sets
– Data cards
– Evaluation metrics and results
– All other data documentation
Model: All model elements, including:
– Model card
– Sample model outputs
Other: Any other documentation or tools produced or used, including:
– Thorough research papers
– Usage documentation
– Technical report
– Supporting tools
