OSI at the United Nations OSPOs for Good
Earlier this month the Open Source Initiative participated in the “OSPOs for Good” event promoted by the United Nations in NYC. Stefano Maffulli, the Executive Director of the OSI, participated in a panel moderated by Mehdi Snene about Open Source AI alongside distinguished speakers Ashley Kramer, Craig Ramlal, Sasha Luccioni, and Sergio Gago. Please find below a transcript of Stefano’s presentation.
Mehdi Snene
What is Open Source in AI? What does it mean? What are the foundational pieces? How far along is the data? There is mention of weights, and data skills. How can we truly understand what Open Source in AI is? Today, joining us, we’ll have someone who can help us understand what Open Source in AI means and where we are heading. Stefano, can you offer your insights?
Stefano Maffulli
Thanks. We have some thoughts on this. We’ve been pondering these questions since they first emerged when GPT started to appear. We asked ourselves: How do we transfer the principles of permissionless innovation and the immense value created by the Open Source ecosystem into the AI space?
After a little over two years of research and global conversations with multiple stakeholders, we identified three key elements. Firstly, permissionless innovation needs to be ported to AI, but this is complex and must be broken down into smaller components.
We realized that, as developers, users, and deployers of AI systems, we need to understand how these systems are built. This involves studying all components carefully, being able to run them for any purpose without asking for permission (a basic tenet of Open Source), and modifying them to change outputs based on the same inputs. These basic principles include being able to share these modifications with others.
To achieve this, you need data, the code used for training and cleaning the data (e.g., removing duplicates), the parameters, the weights, and a way to run inference on those weights. It’s fairly straightforward. However, the challenge lies in the legal framework.
Now, the complicated piece is how Open Source software has had a very wonderful run, based on the fact that the legal framework that governs Open Source is fairly simple and globally accepted. It’s built on copyright, a system that has worked wonderfully in both ways. It gives exclusive rights to the content creators, but also the same mechanism can be used to grant rights to anyone who receives the creation.
With data, we don’t have that mechanism. That is a very simple and dramatic realization. When we talk about data, we should pay attention to what kind of data we’re discussing. There is data as content created, and there is data as facts; like fires, speed limits, or traces of a road. Those are facts, and they have different ways of being treated. There is also private data, personal information, and various other kinds of data, each with different rules and regulations around the world.
Governments’ major role in the future will be to facilitate permissionless innovation in data by harmonizing these rules. This will level the playing field, where currently larger corporations have significantly more power than Open Source developers or those wishing to create large language models. Governments should help create datasets, remove barriers, and facilitate access for academia, smaller developers, and the global south.
Mehdi Snene
We already have open data and Open Source. Now, we need to create open AI and open models. Are we bringing these two domains together and keeping them separate, or are we creating something new from scratch when we talk about open AI?
Stefano Maffulli
This is a very interesting and powerful question. I believe that open data as a movement has been around for quite a while. However, it’s only recently that data scientists have truly realized the value they hold in their hands. Data is fungible and can be used to build new things that are completely different from their original domains.
We need to talk more about this and establish platforms for better interaction. One striking example is a popular dataset of images used for training many image generation AI tools, which contained child sexual abuse images for many years. A research paper highlighted this huge problem, but no one filed a bug report, and there was no easy way for the maintainers of this dataset to notice and remove those images.
There are things that the software world understands very well, and things that data scientists understand very well. We are starting to see the need for more space for interactions and learning from each other.
The conversation is extremely complicated. Alex and I have had long discussions about this. I don’t want to focus entirely on this, but I do want to say that Open Source has never been about pleasing companies or specific stakeholders. We need to think of it as an ecosystem where the balances of power are maintained.
While Open Source software and Open Source AI are still evolving, the necessary ingredients—data, code, and other components—are there. However, the data piece still needs to be debated and finalized. Pushing for radical openness with data has clear drawbacks and issues. It’s going to be a balance of intentions, aiming for the best outcome for the general public and the whole ecosystem.
Mehdi Snene
Thank you so much. My next question is about the future. What are your thoughts on the next big technology?
Stefano Maffulli
From the perspective of open innovation, it’s about what’s going to give society control over technology. The focus of Open Source has always been to enable developers and end-users to have sovereignty over the technology they use. Whether it’s quantum computers, AI, or future technologies, maintaining that control is crucial.
Governments need to play a role in enabling innovation and ensuring that no single power becomes too dominant. The balance between the private sector, public sector, nonprofit sector, and the often-overlooked fourth sector—which includes developers and creators who work for the public good rather than for profit—must be maintained. This balance is essential for fostering an ecosystem where all stakeholders have equal interests and influence.
If you would like to listen to the panel discussion in its entirety, you can do so here (the Open Source AI panel starts at 1:00:00 approximately).