Explaining the concept of Data information

There seems to be some confusion caused by the concept of Data information included in the draft v0.0.8 of the Open Source AI Definition. Some readers may have seen the original dataset included in the list of optional components and quickly jumped to the wrong conclusions. This post clarifies how the draft arrived at its current state, the design principles behind the Data information concept and the constraints (legal and technical) it operates under.

The objective of the Open Source AI Definition

The objective of the Open Source AI Definition is to replicate in the context of artificial intelligence (AI) the principles of autonomy, transparency, frictionless reuse, and collaborative improvement for end users and developers of AI systems. These are described in the preamble.

Following the preamble is the definition of Open Source AI, an adaptation of the definition of Free Software (also known as “the four freedoms”) to AI nomenclature. The preamble and the four freedoms have been co-designed over several meetings and public discussions, online and in-person, and have not recently received significant comments. 

The Free Software definition specifies that a precondition to the freedom to study and modify a program is to have access to the source code. Source code is defined as “the preferred form of the program for making changes in.” Draft v0.0.8 contains a description of what’s necessary to enjoy the freedoms to study and modify an AI system. This new section titled Preferred form to make modifications to machine-learning systems has generated a heated debate. 

What is the preferred form to make modifications

The concept of “preferred form to make modifications” focuses on machine learning systems because these systems require data and training to produce a working system. Other AI systems are more easily classifiable as software and don’t require a special definition. 

The system analysis phase of the co-design process revealed that studying and modifying machine learning systems requires data, code for training and inference, and model parameters. For the parameters, there's no ambiguity: an Open Source AI must make them available under terms that respect the Open Source principles (no field-of-use restrictions, no discrimination against people, etc.). For the data and code requirements, the text in the "preferred form to make modifications" section is longer and harder to parse, generating some confusion.

The intent of the code and data requirements is to ensure that end users, deployers and developers of an Open Source AI system have all the tools and instructions to recreate that system from scratch, satisfying the freedoms to study and modify it. At first glance, it seems sensible to require that training datasets be released under permissive licenses for a system to qualify as Open Source AI.

However, on closer examination, it became clear that sharing the original datasets is full of traps. It would actually put Open Source at a disadvantage compared to opaque and proprietary AI systems.

The issue with data

Data is not software: The legal landscape for data is much wider than copyright. Aggregating large datasets and distributing them internationally is an endless nightmare that includes privacy laws, copyright, sui-generis rights, patents, secrets and more. Without diving deeper into legal issues, let’s focus on practical examples to clarify why the distribution of the training dataset is not spelled out as a requirement in the concept of Data information.

  • The Pile, the open dataset used to train the very open Pythia models, was taken down after allegations of copyright infringement that are currently being litigated in the United States. However, the Pile appears to be legal to share in Japan, and it's unclear whether it can be legally shared in the European Union.
  • DOLMA, the open dataset used to train the very open OLMo models, was initially released with a restrictive license; it later switched to a permissive one. On closer inspection, DOLMA appears to suffer from the same legal uncertainties as the Pile, although the Allen Institute has not been sued yet.
  • Privacy-preserving training techniques, like federated learning, don't create datasets in the first place.

All these cases show that requiring the original datasets creates vagueness and uncertainty in applying the Open Source AI Definition:

  • If a dataset is only legal in Japan, is that AI Open Source only in Japan?
  • If a dataset is initially legally available but is later retracted, does the AI system stop being Open Source?
    • If so, what happens to the applications that use such AI?
  • If no dataset is created, then will any AI trained with such techniques ever be Open Source?

Additionally, there are reasons to believe that the proprietary systems of OpenAI, Anthropic and others have been trained on the same questionable data found in the Pile and DOLMA; proving it, however, is much harder and more expensive. This asymmetry is clearly a disincentive to be open and transparent about data sources, adding a burden to the organizations that try to do the right thing.

To answer these questions, draft v0.0.8 introduces the concept of Data information, coupled with code requirements, to obtain the expected result: that end users, developers and deployers of AI systems are able to reproduce an Open Source AI.

Understanding the concept of Data information

Data information, in the draft Open Source AI Definition, is defined as: 

Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.

Read that from the end: The intention of Data information is to allow developers to recreate a substantially equivalent system using the same or similar data. That means that an Open Source AI must disclose all the ingredients, where they’ve been bought and all the instructions to prepare the dish.  
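As a purely hypothetical illustration (every name and value below is invented, not quoted from the draft), the kind of record such a disclosure could boil down to might look like this sketch in Python:

    # Hypothetical sketch of a "Data information" disclosure record.
    # All names and values are invented for illustration.
    data_information = {
        "sources": [{
            "name": "CommonCrawl",            # where the ingredient comes from
            "snapshot": "2023-06 (example)",  # which version was used
            "selection": "English web pages, deduplicated",
        }],
        "filtering": [
            "language identification, keep English",
            "near-duplicate removal",
            "quality and toxicity filters",
        ],
        "preprocessing": "tokenization and shuffling, as scripted in the released training code",
    }

A skilled person holding such a record, together with the released training and inference code, could assemble the same or similar data and retrain the system.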

This solution came out of the co-design process, where reviewers didn't rank the training datasets as highly as they ranked the training code and the data transparency requirements.

Data information and the code requirements also address all of the questions around the legality of distributing data and datasets, or their absence.

If a dataset is only legal in Japan, or becomes illegal later, one should still be able to recreate a dataset suitable for training an equivalent system by replacing the illegal or unavailable pieces with similar ones.

AI systems trained with federated learning (where no dataset is created) can still be Open Source AI, as long as all instructions and code are released so that a new training run with different data can generate an equivalent system.
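To see why no dataset is created, here is a minimal, hypothetical sketch of federated averaging (FedAvg) on a toy linear model; it is illustrative only, not OSI's code or any particular framework's API. Both "clients" run in one process here, while in a real deployment each client's data would stay on its own device:

    import numpy as np

    def local_update(weights, X, y, lr=0.1, epochs=5):
        """One client's training pass on its own private data."""
        w = weights.copy()
        for _ in range(epochs):
            grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
            w -= lr * grad
        return w

    def federated_round(global_weights, clients):
        """The server averages model updates; it never sees the raw data."""
        updates = [local_update(global_weights, X, y) for X, y in clients]
        return np.mean(updates, axis=0)  # unweighted FedAvg

    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0, 0.5])
    clients = []
    for _ in range(3):
        X = rng.normal(size=(50, 3))
        clients.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

    w = np.zeros(3)
    for _ in range(20):
        w = federated_round(w, clients)
    print(w)  # approaches true_w; no pooled training dataset ever existed

Only model parameters ever leave a client, so there is no original dataset to release; what can be released are the training code and the Data information describing what each client's data looked like.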

The Data information concept also solves an example (raised on the forum) of an AI system trained on data licensed directly from Reddit. In this case, if the original developers released enough information to allow another AI developer to recreate a substantially equivalent system with Reddit data taken from an existing dataset, like CommonCrawl, it would be considered Open Source AI.

The proposed alternatives

While generally well received, draft v0.0.8 has been criticized by a few people on the forum for putting the training dataset in the “optional requirements”. Some suggestions and pushback we’ve received:

  • Require the use of synthetic data when the training dataset cannot be legally shared: This technique may work in some corner cases, if the technology evolves to become reliable enough, but it's expensive and untested at scale.
  • Classify as Open Source AI systems where all their components are “open source”: This approach is not rooted in the longstanding practice of the GNU project to accept system library exceptions and other compromises in exchange for more Open Source tools.
  • Datasets built by crawling the internet are the equivalent of theft and shouldn't be allowed at all, let alone in Open Source AI: This pushback ignores the reality that large data aggregators have already legally acquired the rights to accumulate that same data (through scraping and terms of use) and are trading it, exclusively capturing the economic value of what should be in the commons. Read Towards a Books Data Commons for AI Training for more details. There is no general agreement that text and data mining is equivalent to theft.

These demands and suggestions are hard to accept. We need an Open Source AI Definition that can effectively guide users and developers to make the right choice. We need one that doesn’t put developers of Open Source AI at a disadvantage compared to proprietary ones. We need a Definition that contains positive examples from the start so we can practically demonstrate positive qualities to policymakers. 

The discussion about data, and how to create incentives for building datasets that can be distributed internationally, safely and with privacy preserved, is extremely complex. It can be addressed separately from the Open Source AI Definition. In collaboration with the Open Future Foundation and others, OSI is designing a series of conferences to tackle the data governance issue. We'll make an announcement soon.

Have your say now

The concepts of Data information and code requirements are hard to grasp at first. But the preliminary results of the validation phase confirm that draft v0.0.8 works as expected: Pythia and OLMo would both be Open Source AI, while Falcon, Grok, Llama and Mistral would not (even if they used OSD-compatible licenses), because they don't share Data information. BLOOM and StarCoder would fail because of the field-of-use restrictions in their model licenses.

Data information can be improved, but it's better than the other solutions proposed so far. As we get closer to the release of the stable version of the Open Source AI Definition, we need to hear from you: if you support this concept, please comment on the forum today. If you don't, please try to propose an alternative that at least covers the practical examples of the Pile, DOLMA and federated learning described above. Help the community move the conversation forward.