Cailean Osborne: voices of the Open Source AI Definition
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Cailean Osborne
What’s your background related to Open Source and AI?
My interest in Open Source AI began around 2020 when I was working in AI policy at the UK Government. I was surprised that Open Source never came up in policy discussions, given its crucial role in AI R&D. Having been a regular user of libraries like scikit-learn and PyTorch in my previous studies, I followed Open Source AI trends in my own time and eventually decided to do a PhD on the topic. When I started my PhD back in 2021, Open Source AI still felt like a niche topic, so it’s been exciting to watch it become a major talking point over the years.
Beyond my PhD, I’ve been involved in the Open Source AI community as a contributor to scikit-learn and as a co-developer of the Model Openness Framework (MOF) with peers from the Generative AI Commons community. Our goal with the MOF is to provide guidance for AI researchers and developers to evaluate the completeness and openness of “Open Source” models based on open science principles. We were chuffed that the OSI team chose to use the 16 components from the MOF as the rubric for reviewing models in the co-design process.
What motivated you to join this co-design process to define Open Source AI?
The short answer is: to contribute to establishing an accurate definition for “Open Source AI” and to learn from all the other experts involved in the co-design process. The longer answer is: There’s been a lot of confusion about what is or is not “Open Source AI,” which hasn’t been helped by open-washing. “Open source” has a specific definition (i.e. the right to use, study, modify, and redistribute source code), and much of what is being promoted as “Open Source AI” deviates significantly from it. This isn’t mere pedantry: getting the definition right matters for several reasons; for example, for the “Open Source” exemptions in the EU AI Act to work (or not work), we need to know precisely what “Open Source” models actually are. Andreas Liesenfeld and Mark Dingemanse have written a great piece about the issues of open-washing and how they relate to the AI Act, which I recommend reading if you haven’t yet. So, I got involved to help develop a definition and to learn from the other experts involved. It hasn’t been easy (it’s a pretty divisive topic!), but I think we’ve made good progress.
Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?
First off, I have to give credit to Stef and Mer for maintaining momentum throughout the process. Coordinating a co-design effort with volunteers scattered around the globe, each with varying levels of availability and (strong) opinions on the matter, is no small feat. So, well done! I also enjoyed seeing how others agreed or disagreed when reviewing models. The moments of disagreement were the most interesting; for example, about whether training data should be available versus documented and if so, in how much detail… Personally, the main challenge was searching for information about the various components of models that were apparently “Open Source” and observing how little information was actually provided beyond weights, a model card, and if you’re lucky an arXiv preprint or technical report.
Why do you think AI should be Open Source?
When talking about the benefits of Open Source AI, I like to point folks to a 2007 paper in which 16 researchers highlighted “The Need for Open Source Software in Machine Learning,” owing to the near-total lack of OSS for ML/AI at the time. Fast forward to today, and AI R&D is practically unthinkable without OSS, from data tooling to the deep learning frameworks used to build LLMs. Open source, and openness in general, has many benefits for AI: it enables access to SOTA AI technologies, provides the transparency that is key for reproducibility, scrutiny, and accountability, and widens participation in the design, development, and governance of these technologies.
What do you think is the role of data in Open Source AI?
If the question is strictly about the role of data in developing open AI models, the answer is pretty simple: Data plays a crucial role because it is needed for training, testing, aligning, and auditing models. But if the question is asking “should the release of data be a condition for an open model to qualify as Open Source AI,” then the answer is obviously much more complicated.
Companies are in no rush to share training data for a handful of reasons: competitive advantage, data protection, or, frankly, the risk of being sued for copyright infringement. The copyright concern isn’t limited to companies: EleutherAI has also been sued and had to take down the Books3 dataset from The Pile. There are also many social and cultural concerns that restrict data sharing; for example, the Kōrero Kaitiakitanga license has been developed to protect the interests of indigenous communities in New Zealand. So, the data question isn’t easy, and perhaps we shouldn’t be too dogmatic about it.
Personally, I think the compromise in v. 0.0.8, which states that model developers should provide sufficiently detailed information about data if they can’t release the training dataset itself, is a reasonable halfway house. I also hope to see more open pre-training datasets like the one developed by the community-driven BigScience Project, which involved open deliberation about the design of the dataset and provides extensive documentation about data provenance and processing decisions (e.g. check out their Data Catalogue). The FineWeb dataset by Hugging Face is another good example of an open pre-training dataset, which they released with pre-processing code, evaluation results, and super detailed documentation.
Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?
To be honest, my personal definition hasn’t changed much. I am not a big fan of the use of “Open Source AI” when folks specifically mean “open models” or “open-weight models.” What we need to do is raise awareness about appropriate terminology and point out “open-washing,” as people have done, and I must say that, subjectively, I’ve seen improvements: less “Open Source models” and more “open models.” But I will say that I do find “Open Source AI” a useful umbrella term for the various communities of practice that intertwine in the development of open models, including OSS, open data, and AI researchers and developers, who all bring different perspectives and ways of working to the overarching “Open Source AI” community.
What do you think the primary benefit will be once there is a clear definition of Open Source AI?
We’ll be able to reduce confusion about what is or isn’t “Open Source AI” and more easily combat open-washing efforts. As I mentioned before, this clarity will be beneficial for compliance with regulations like the AI Act which includes exemptions for “Open Source” AI.
What do you think are the next steps for the community involved in Open Source AI?
We still have many steps to take but I’ll share three for now.
First, we urgently need to improve the auditability and therefore the safety of open models. With OSS, we know that (1) the availability of source code and (2) open development enable the distributed scrutiny of source code. Think Linus’ Law: “Given enough eyeballs, all bugs are shallow.” Yet open models are more complex than just source code, and the lack of openness of many key components, like training data, is holding back adoption because would-be adopters can’t adequately run due diligence tests on the models. If we want to realise the benefits of “Open Source AI,” we need to figure out how to increase the transparency and openness of models; we hope the Model Openness Framework can help with this.
Second, I’m really excited about grassroots initiatives that are leading community-driven approaches to developing open models and open datasets, like the BigScience project. They’re setting an example of how to do “Open Source AI” in a way that promotes open collaboration, transparency, reproducibility, and safety from the ground up. I can still count such initiatives on my fingers, but I am hopeful that we will see more community-driven efforts in the future.
Third, I hope to see the public sector and non-profit foundations get more involved in supporting public interest and grassroots initiatives. France has been a role model on this front: providing a public grant to train the BigScience project’s BLOOM model on the Jean Zay supercomputer, as well as funding the scikit-learn team to build out a data science commons.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the working groups: be part of a team to evaluate various models against the OSAID.
- Join the forum: support and comment on the drafts, and record your approval or concerns in new and existing threads.
- Comment on the latest draft: provide feedback on the latest draft document directly.
- Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
- Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.