Open Source AI Definition – Weekly update May 20

A week loaded with important questions.

Overarching concerns with Draft v.0.0.8 and suggested modifications

A post signed by the AWS Open Source team raised important questions, illustrating a disagreement over the concept of “Data information.”

  • A detailed post signed by the AWS Open Source team raises concerns about the draft concept of Data information in v0.0.8 and other important topics. I suggest reading their post. The major points discussed this week are:
    • The discussion on training data is not settled. The AWS Open Source team argues that for an Open Source AI Definition to be effective, the data used to train the AI system must be included, similar to the requirement for source code in Open Source software. They say the current draft marks the inclusion of datasets as optional, which undermines transparency and reproducibility.
    • Their suggestion: Use synthetic data where the inclusion of actual datasets poses legal or privacy risks.
      • Valentino Giudice takes issue with the phrase “For AI systems, data is the equivalent of source code,” stating that “equivalent” is used too liberally here: for trained models, the dataset isn’t necessary to understand the model’s operations, which are determined by its architecture and frameworks.
        • Ferraioli disagrees, stating that “A trained model cannot be considered open source without the data, processing code, and training code. Comparing a trained model to a software binary, we don’t call binaries open source without the source code being available and licensed as open source.”
      • Zacchiroli adds support for the suggestion to use “high quality equivalent synthetic datasets” when the original data cannot be released. Although “equivalent” remains undefined and could create loopholes, they argue this issue doesn’t make the OSAID any worse.
    • Other proposed modifications include:
      • Require Release of Dependent Datasets
        • Mandate the release of training, testing, validation, and benchmarking datasets under an open data license, or of high-quality synthetic data where legal restrictions apply.
        • Update the “Data Information” section to make dataset release a requirement.
      • Prevent Restrictions on Outputs
        • Prohibit restrictions on the use, modification, or distribution of outputs generated by Open Source AI systems.
      • Eliminate Optional Components
        • Remove optional components from the OSAID to maintain a high standard of openness and transparency.
      • Address Combinatorial Ambiguity
        • Ensure that any license applied to the distribution of multiple components in an Open Source AI system is OSD-approved.

Why and how to certify Open Source AI

  • The post from the AWS team contained a comment about the certification process for Open Source AI that deserves a separate thread. There are pending questions to be answered:
    • Who exactly needs a certification that an AI system is Open Source AI?
    • Who is going to use such a certification? Are any of the groups deploying open foundation models today thinking that they could use one? For what purpose?
    • Who is going to consume the information carried by the certification, why, and how?
  • Zacchiroli adds that the need for certifying AI systems as OSAID compliant arises from inherent ambiguities in the definitions, such as terms like “sufficiently” and “high quality equivalent synthetic dataset.” Disagreements on compliance will require a judging authority, akin to OSI for the OSD. While managing judgments for OSAID might be more complex due to the potential volume, the community is likely to turn to OSI for such decisions.

Can a derivative of non-open-source AI be considered Open Source AI?

  • This question was asked on the draft document and moved to the forum for higher visibility. Is it technically possible to fine-tune a model without knowing the details of its initial training? Are there examples of successfully fine-tuned AI/ML systems where the initial training data and techniques were unknown but the fine-tuning data and methods were fully disclosed?
    • Shuji Sado added that fine-tuning typically involves updating the weights of newly added layers and some layers of the pre-trained model, but not all layers, to maintain the benefits of pre-training.
    • Valentino Giudice raised concerns over this point: multiple strategies for fine-tuning exist, allowing flexibility to update the weights of any number of existing layers without necessarily adding new ones. Even updating the entire network can be beneficial, as it leverages the pre-trained model’s information and can be more efficient than training a new model from scratch. Fine-tuning can slightly adjust the model’s performance or behaviour, integrating new data effectively. (A sketch of both strategies follows below.)
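
To make these strategies concrete, here is a minimal PyTorch sketch of the two approaches discussed above: freezing the pre-trained layers while training only a newly added head, versus updating the whole network at a small learning rate. The model, layer sizes, and data here are hypothetical stand-ins, not taken from any system mentioned in this thread.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained backbone, standing in for any pre-trained model.
backbone = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)
head = nn.Linear(256, 10)  # newly added task-specific layer
model = nn.Sequential(backbone, head)

# Strategy A (per Shuji Sado's description): freeze the pre-trained
# layers and update only the newly added head.
for param in backbone.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Strategy B (per Valentino Giudice's point): update every layer,
# typically with a smaller learning rate so the information learned
# during pre-training is adjusted rather than overwritten.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One illustrative training step on random stand-in data.
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Note that under either strategy the fine-tuning data and code can be disclosed in full even when the original pre-training data and techniques are unknown, which is exactly the situation the question above asks about.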

Please share your thoughts, especially if you are knowledgeable in this field. We would love to hear more!