Open Source AI Definition – Weekly update June 10

Open Source AI needs to require data to be viable

  • With many different discussions happening at once, here are the main points:
    • On the issue of training data
      • @mark is concerned that openness of AI is not meaningful without a focus on the training data: “Model weights are the most inscrutable component of current generative AI, and providers that release only [the weights] should not get a free ‘openness’ pass.”
      • @stefano agrees with all of that but questions the criteria used to assign green marks in Mark’s paper, pointing out inconsistencies. He cites the example of Pythia-Chat-Base-7, which relies on a dataset from OpenDataHub with potential issues like non-versioned data and stale links, failing to meet the stringent requirements proposed by @juliaferraioli. Similar concerns apply to other models, such as OLMo 7B Instruct, which lack specific data versioning details. @stefano also highlights the case of Pythia-7B, which may once have been compliant but is now problematic because its foundational dataset, the Pile, is no longer available. This illustrates the complexity of maintaining an “open source” status over time if the stringent proposal from @juliaferraioli and the AWS team is adopted.
      • @shujisado adds that while he sympathizes with @juliaferraioli‘s request for datasets, he finds @stefano‘s arguments in support of the concept of “Data information” aligned with the OSI principles and reasonable.
      • @spotaws stresses that “data information” alone is insufficient if the data itself is too vague.
      • @juliaferraioli adds that while replicating AI systems like OLMo or Pythia may seem impractical due to their cost and statistical nature, the capability to do so is crucial for broader adoption and consistency. She finds the current definition unclear and subjective.
      • @zack recommends reviewing StarCoder2, recognizing that it would fall in the same category as BLOOM: a system with lots of transparency and a dataset made available, but released under a restrictive license.
      • @Ezequiel_Lanza joined the conversation in support of the concept of Data information, arguing on technical grounds that “sharing the dataset is not necessarily required and may not justify the potential risks associated with making it mandatory.”
      • Partially open / restrictive licenses
        • Continuing @mark’s points regarding restrictive licenses (such as ethical licenses), @stefano has added a link to an article highlighting some of the reasons why OSI is staying away from these licenses.
        • @pchestek further adds that a partially open license would create even more opportunities for open washing, as “open source AI” could have many meanings.
        • @mark clarified that rather than proposing a variety of meanings, they are seeking to highlight the dimensions of openness in their paper, exploring the broader landscape. 
        • @stefano adds that in its 26 years, OSI has contended with numerous organizations claiming varying degrees of openness as “open source.” This issue is now mirrored in AI, as companies seek the market value of being labeled Open Source. Open Source is binary: either users have full rights or they don’t, and any system that falls short is not Open Source AI, regardless of how “almost” open it is.
      • Field of use/restriction 
        • @juliaferraioli believes that OSAID should include prohibitions against field-of-use restrictions.
        • @shujisado adds that the OSAID specifies four freedoms as requirements for being considered Open Source, and that these should be understood as equivalent to a prohibition on restrictions, since “freedom” implies “non-restricted.” The 10 clauses of the OSD have been replaced by the checklist in draft v0.0.8.
        • @juliaferraioli adds that individual components may be covered by their individual licenses, but the overall system may be subject to additional terms, which is why we need this to be explicit.

Initial Report on Definition Validation

  • @Mer has shared the status of our system analysis against the current draft definition, highlighting the points that remain incomplete.
  • Mistral (Mixtral 8x7B) is considered not in alignment with the OSAID because its data pre-processing code is not released under an OSI-approved license.

Can a derivative of non-open-source AI be considered Open Source AI?

  • @tarek_ziade shares his experience fine-tuning a “small” model (200M parameters) for a Firefox feature to describe images, using a base model for image encoding and text decoding. Despite not having 100% traceability of upstream data, Tarek argues that intentional fine-tuning and transparency make the new fine-tuned model open source. Any issues arising from downstream data can be addressed by the project maintainers, maintaining the model’s open source status.

Town hall recording out

  • We held our 10th town hall meeting a week and a half ago. You can access the recording here if you missed it.
  • A new town hall meeting is scheduled for this Friday, June 14.