Open Source AI Definition – Weekly update May 27
Open Source AI needs to require data to be viable
- @juliaferraioli and the AWS team have reopened the debate about access to training data, in a new thread that mirrors concerns raised in an earlier one. They argue that to be modifiable, an AI system must ship with the original dataset used to train it, and that full transparency and reproducibility require the release of all datasets used to train, validate, test, and benchmark. Ferraioli considers data the equivalent of source code for AI systems, so its inclusion should not be optional. In a message signed by the AWS Open Source team, she proposed that original training datasets, or synthetic data with a justification for non-release, be required to meet the Open Source AI standard.
- @stefano added some reminders as we reopen this debate. These are the points to keep in mind:
- Abandon the mental map that makes you look for the source of AI (or ML) as that map has been driving us in circles. Instead, we’re looking for the “preferred form to make modifications to the system”
- In most jurisdictions around the world, copyright, privacy and other laws make it illegal to distribute certain data. It’s also unclear how the law treats datasets, and that treatment is constantly changing
- The text of draft v0.0.8 is intentionally vague about “Data information” so that it can withstand the test of time and changes in technology
- When criticizing the draft, please provide specific examples, and avoid arguing in the abstract
- @danish_contractor argues that the current draft is likely to disincentivize openness: the community would view models such as BLOOM or StarCoder, which include usage restrictions to prevent harm, less favorably than models like Mistral, despite their being more transparent, more reproducible, and thus more “open”.
- @Pam Chestek clarified that Open Source has two angles: the rights to use, study, modify and share, and the condition that those rights be unrestricted. Both are equally important.
- This debate echoes earlier ones on recognizing open components of an AI system.
The FAQ page has been updated
- The FAQ page is starting to take shape and we would appreciate more feedback. So far, we have preliminary answers to these questions:
- Why is the original training dataset not required?
- Why are the freedoms granted to users of the system?
- What are the model parameters?
- Are model parameters copyrightable?
- What does “Available under OSD-compliant license” mean?
- What does “Available under OSD-conformant terms” mean?
- Why does the Open Source AI Definition include a list of components, while the Open Source Definition for software doesn’t say anything about documentation, roadmaps and other useful things?
- Why is there no mention of safety and risk limitations in the Open Source AI Definition?
Draft v0.0.8 Review from LLM360
- @vamiller has submitted, on behalf of the LLM360 team, a review of their models. In his view, draft v0.0.8 reflects the principles of Open Source applied to AI. He asks about the ODC-By license, arguing that it is compatible with OSI’s principles but is a data-only license.
Join the next town hall meeting
- The next town hall meeting will take place on May 31st from 3:00 pm to 4:00 pm UTC. We encourage everyone who can to attend. This week we will delve deeper into the debate over access (or not) to training data.