The Open Source AI Definition – draft v. 0.0.8

Note: This document is made of three parts: a preamble, stating the intentions of this document; the Definition of Open Source AI itself; and a checklist to evaluate legal documents.

This document follows the definition of AI system adopted by the Organisation for Economic Co-operation and Development (OECD):

An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment.

More information about definitions of AI systems is available on OSI's blog.

Preamble

Why we need Open Source Artificial Intelligence (AI)

Open Source has demonstrated that massive benefits accrue to everyone when you remove the barriers to learning, using, sharing and improving software systems. These benefits are the result of using licenses that adhere to the Open Source Definition. The benefits can be summarized as autonomy, transparency, frictionless reuse, and collaborative improvement.

Everyone needs these benefits in AI. We need essential freedoms to enable users to build and deploy AI systems that are reliable and transparent.

What is Open Source AI

An Open Source AI is an AI system made available under terms that grant the freedoms to:

  • Use the system for any purpose and without having to ask for permission.
  • Study how the system works and inspect its components.
  • Modify the system for any purpose, including to change its output.
  • Share the system for others to use with or without modifications, for any purpose.

A precondition to exercising these freedoms is access to the preferred form to make modifications to the system.

The preferred form of making modifications for a machine-learning Open Source AI must include:

  • Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.
    • For example, if used, this would include the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics, how the data was obtained and selected, the labeling procedures and data cleaning methodologies.
  • Code: The source code used to train and run the system.
    • For example, if used, this would include code used for pre-processing data, code used for training, validation and testing, supporting libraries like tokenizers and hyperparameter search code, inference code, and model architecture.
  • Model: The model parameters.
    • For example, this might include checkpoints from key intermediate stages of training as well as the final optimizer state.

This checklist is based on the paper "The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI", published Mar 21, 2024.

Table of default required components

Required components | Legal frameworks

Data information
– Training methodologies and techniques | Available under OSD-compliant license
– Training data scope and characteristics | Available under OSD-compliant license
– Training data provenance (including how data was obtained and selected) | Available under OSD-compliant license
– Training data labeling procedures, if used | Available under OSD-compliant license
– Training data cleaning methodology | Available under OSD-compliant license

Code
– Data pre-processing | Available under OSI-approved license
– Training, validation and testing | Available under OSI-approved license
– Inference | Available under OSI-approved license
– Supporting libraries and tools | Available under OSI-approved license

Model
– Model architecture | Available under OSI-approved license
– Model parameters | Available under OSD-conformant terms
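
As an illustration only, the table of required components can be read as a machine-checkable list. The sketch below is not part of the Definition: the names `Requirement`, `REQUIRED` and `check_release` are hypothetical, and comparing legal frameworks as plain strings is a simplifying assumption. Deciding whether actual terms are OSD-compliant, OSI-approved or OSD-conformant still requires reading the legal documents.

```python
from typing import Dict, List, NamedTuple, Optional

# Legal framework labels, copied from the table of required components.
OSD_LICENSE = "OSD-compliant license"
OSI_LICENSE = "OSI-approved license"
OSD_TERMS = "OSD-conformant terms"


class Requirement(NamedTuple):
    """One row of the table (hypothetical structure, for illustration)."""
    group: str
    component: str
    framework: str
    if_used: bool = False  # True for rows the table marks "if used"


REQUIRED: List[Requirement] = [
    Requirement("Data information", "Training methodologies and techniques", OSD_LICENSE),
    Requirement("Data information", "Training data scope and characteristics", OSD_LICENSE),
    Requirement("Data information", "Training data provenance", OSD_LICENSE),
    Requirement("Data information", "Training data labeling procedures", OSD_LICENSE, if_used=True),
    Requirement("Data information", "Training data cleaning methodology", OSD_LICENSE),
    Requirement("Code", "Data pre-processing", OSI_LICENSE),
    Requirement("Code", "Training, validation and testing", OSI_LICENSE),
    Requirement("Code", "Inference", OSI_LICENSE),
    Requirement("Code", "Supporting libraries and tools", OSI_LICENSE),
    Requirement("Model", "Model architecture", OSI_LICENSE),
    Requirement("Model", "Model parameters", OSD_TERMS),
]


def check_release(release: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of problems; an empty list means every row is covered.

    `release` maps a component name to the legal framework declared for it.
    A value of None means "not used", which is accepted only for rows
    marked "if used" in the table.
    """
    problems: List[str] = []
    for req in REQUIRED:
        if req.component not in release:
            problems.append(f"missing: {req.component}")
            continue
        declared = release[req.component]
        if declared is None:
            if not req.if_used:
                problems.append(f"cannot be marked unused: {req.component}")
        elif declared != req.framework:
            problems.append(f"wrong legal framework: {req.component}")
    return problems


if __name__ == "__main__":
    # A hypothetical release that omits inference code and ships the
    # parameters under terms other than those the table asks for.
    example = {req.component: req.framework for req in REQUIRED}
    del example["Inference"]
    example["Model parameters"] = "custom restrictive terms"
    for problem in check_release(example):
        print(problem)
```

A release that passes this check has only declared the expected framework for each row; the sketch does not verify the content of the components themselves.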

The following components are not required as part of the preferred form of making modifications, but their inclusion in releases is appreciated.

Optional components | Legal frameworks

Data information
All data sets, including: | Available under OSD-compliant license
– Training data sets | Available under OSD-compliant license
– Testing data sets | Available under OSD-compliant license
– Validation data sets | Available under OSD-compliant license
– Benchmarking data sets | Available under OSD-compliant license
– Data card | Available under OSD-compliant license
– Evaluation data | Available under OSD-compliant license
– Evaluation results | Available under OSD-compliant license
– Other data documentation | Available under OSD-compliant license

Code
– Code used to perform inference for benchmark tests | Available under OSI-approved license
– Evaluation code | Available under OSI-approved license

Model
All model elements, including:
– Model card | Available under OSD-compliant license
– Sample model outputs | Available under OSD-compliant license
– Model metadata | Available under OSD-compliant license

Other
Any other documentation or tools produced or used, including:
– Research papers | Available under OSD-compliant license
– Technical report | Available under OSD-compliant license
