Open Source AI Definition – Weekly update June 3

Initial report on definition validation

  • A first draft of the report on the validation phase has been published. The validation phase is designed to review the compatibility of existing systems with the current draft definition. The systems under review are: Arctic, BLOOM, Falcon, Grok, Llama 2, Mistral, OLMo, OpenCV, Phi-2, Pythia, and T5.
  • Problems and initial findings:
    • Elusive documents: Because system creators were not involved, reviewers had to search for legal documents independently, leaving many blanks in the document list and in the subsequent analysis.
    • One component, many artifacts and documents: Some components were linked to multiple artifacts and documents, complicating the review because source code and documentation could be spread across several repositories and reports.
    • Compounded components: Components in the checklist often combined multiple artifacts, such as training and validation code, making it difficult to track down specific legal documents.
    • Compliant? Conformant? Six out of eleven required components need a legal framework that is “compliant” or “conformant” with the Open Source Definition, prompting a need for clearer guidance on reviewing non-software components.
    • Reverting to the license: Reviewers suggested simplifying the process by relying on whether a legal document is OSI-approved, conformant, or compliant to guarantee the right to use, study, modify, and share the component, eliminating the need for independent assessment (a minimal sketch of such a checklist review follows this list).
  • Next steps:
    • To fill the gaps identified above, we are calling on both system creators and independent volunteers to complete the remaining system reviews.
    • If a system you are familiar with is not on the list, contact Mer on the forum.
  • Initial questions and queries:
    • @jasonbrooks asks if the validation process should check if there’s “sufficiently detailed information about the data used to train the system so a skilled person can recreate a substantially equivalent system.” It’s unclear if this has been confirmed, and examples of skilled individuals achieving this would be helpful.
      • @stefano replies that the Preferred form lists enduring principles, while the Checklist details required components. Validation ensures components like training methodologies and data provenance are available, enabling system recreation. Mer’s report highlights the difficulty in finding these components, suggesting a need for a better method. One idea is a detailed survey for AI developers, though companies like Meta might misuse the “Open Source” label. Public pressure may eventually deter such abuses.
    • @amcasari adds insights into the process of reviewing licenses.
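
To make the checklist mechanics described above concrete, here is a minimal sketch of how a reviewer might model components, artifacts, and their legal documents. The component and artifact names, field names, and status labels are illustrative assumptions, not OSI's actual review tooling or data format:

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """One place a component lives, e.g. a repository or a report."""
    name: str
    legal_doc: str = ""       # license or terms located for this artifact
    status: str = "unknown"   # "OSI-approved", "OSD-conformant",
                              # "OSD-compliant", or "unknown"

@dataclass
class Component:
    """A checklist entry; six of the eleven required components need an
    OSD-compatible legal framework."""
    name: str
    requires_osd_framework: bool
    artifacts: list[Artifact] = field(default_factory=list)

def review(components: list[Component]) -> list[str]:
    """Flag the gaps a reviewer would otherwise chase down by hand."""
    findings = []
    for comp in components:
        for art in comp.artifacts:
            if art.status == "unknown":
                findings.append(f"{comp.name} / {art.name}: legal document not found")
            elif comp.requires_osd_framework and art.status not in (
                "OSI-approved", "OSD-conformant", "OSD-compliant"
            ):
                findings.append(f"{comp.name} / {art.name}: framework not OSD-compatible")
    return findings

# A compounded component spread across several artifacts, as described above.
training_code = Component(
    "Training and validation code",
    requires_osd_framework=True,
    artifacts=[
        Artifact("main repository", "Apache-2.0", "OSI-approved"),
        Artifact("data-prep scripts"),  # no document found: status stays "unknown"
    ],
)

for finding in review([training_code]):
    print(finding)
```

Even this toy version surfaces the two recurring findings from the report: artifacts whose legal documents could not be located, and components whose documents fall outside the OSD-compatible statuses.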

Open Source AI needs to require data to be viable 

  • This week, the conversation shifted heavily toward the possibility of a gradient approach to open licensing.
  • @Markhas shared that he is publishing a paper on open washing, the AI Act, and the case for a gradient notion of openness.
    • In line with points previously raised mostly by @danish_contactor, Mark highlights the RAIL licenses and argues that they should count toward openness too, stating: “I think providers and users of LLMs should not be free to create oil spills in our information landscape and I think RAIL provides useful guardrails for that.”
    • He also presents a visualization of the degrees of openness of different systems.
  • @stefano reiterated that the Open Source AI Definition will remain binary, just like the Open Source Definition is binary. Responding to @Markhas and @danish_contactor, he linked to Kate Downing's legal analysis of the RAIL licensing framework.

Can a derivative of non-open-source AI be considered Open Source AI? 

  • Answering @stefano's earlier questions, @mark adds that it is challenging to fine-tune a model without knowing the initial training data and techniques. Fine-tunes of Meta and Mistral models show success despite the lack of transparency about the original training data. Intel's Neural 7B and AllenAI's Tulu 70B demonstrate effective fine-tuning with detailed disclosure of the fine-tuning steps and data. However, these efforts cannot qualify as truly open AI systems, given the closed nature of the base models and potential legal liabilities.
  • @stefano closed the topic stating that, based on feedback, “Derivatives of non-Open Source AI cannot be Open Source AI.”

Why and how to certify Open Source AI

  • @amscott added that AI developers will likely self-certify compliance with the OSAID, with objective certification needed for arbitration in nuanced cases. Like the OSD, the OSAID will mature through community practice. A simple self-certification tool could promote transparency and document good practices (see the sketch after this list).
  • @mark added that the EU AI Act emphasizes “Open Source” systems, offering exemptions attractive to companies like Meta and Mistral. The Act requires disclosure templates overseen by an AI Office, which could lead to intense lobbying efforts. If Open Source organizations influence regulation and certification, transparency may strengthen the Open Source ecosystem.
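
One reading of the self-certification idea above: the tool need not be more than a script that turns a developer's declaration into a public transparency statement. This is a minimal sketch assuming a hypothetical JSON declaration format and an abbreviated, illustrative component list; the real OSAID checklist and any official tooling may differ:

```python
import json
import sys

# Abbreviated, illustrative component list; the actual OSAID checklist
# is longer and more precisely worded.
REQUIRED_COMPONENTS = [
    "training code",
    "inference code",
    "model architecture",
    "model parameters",
    "data preprocessing code",
]

def self_certify(declaration_path: str) -> None:
    """Print a transparency statement from a developer-supplied JSON file
    mapping each component to the legal document that covers it,
    e.g. {"training code": "Apache-2.0", ...}."""
    with open(declaration_path) as f:
        declared = json.load(f)
    missing = [c for c in REQUIRED_COMPONENTS if not declared.get(c)]
    for comp in REQUIRED_COMPONENTS:
        print(f"{comp}: {declared.get(comp, 'NOT DECLARED')}")
    if missing:
        print(f"\nIncomplete: {len(missing)} component(s) undeclared.")
        sys.exit(1)
    print("\nAll components declared; publish this statement alongside the system.")

# usage: python self_certify.py declaration.json
if __name__ == "__main__":
    self_certify(sys.argv[1])
```

The point of such a tool is not enforcement but transparency: a published declaration gives the community something concrete to inspect, and disputed cases can then go to the objective certification @amscott describes.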

Question regarding the 0.0.8 definition 

  • @Jennifer Ding asks why “information” is a focus for the data category but not for the code and model categories.
  • @Matt White adds that OSD-Conformant (in the checklist) should be defined somewhere.
    • He further adds (on Data Information, under the checklist) that many “open” models withhold various forms of data; if data is not a required component of the definition, he argues, it is unreasonable to expect model producers to release all the information necessary to fully replicate the data pipeline.
  • @Michael Dolan adds that “the use of OSD-compliant and OSD-conformant without any definitions of either term is difficult to parse the meaning of” and suggests some solutions.

OSAID at PyCon US

  • Missing a recap of how we got to where we are now? OSI was present at PyCon US in Pittsburgh, where we held a workshop on our current definition and spoke with many knowledgeable stakeholders. You can read about it here.