Nick Vidal – Open Source Initiative

Deshni Govender: Voices of the Open Source AI Definition

Nick Vidal — Thu, 01 Aug 2024 18:14:15 +0000

The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.

This series features the voices of the volunteers who have helped shape and are shaping the Definition.

Meet Deshni Govender

What’s your background related to Open Source and AI?

I am the South Africa country focal point for the German Development Cooperation initiative “FAIR Forward – Artificial Intelligence for All”, a project that strives for a more open, inclusive and sustainable approach to AI on an international level. More significantly, we seek to democratize the field of AI, to enable more robust, inclusive and self-determined AI ecosystems. Having worked in the private sector and now working in international development, I have been drawn to the power imbalance between proprietary and open approaches and how it creates economic barriers for the global majority, as well as further harms and challenges for vulnerable populations and marginalized sectors, especially women. This fuelled my journey of working towards bridging the digital divide and digital gender gap through democratizing technology.

Some projects I am working on in this space include developing data governance models for African NLP (with Masakhane Foundation) and piloting new community-centered, equitable license types for voice data collection for language communities (with Mozilla).

What motivated you to join this co-design process to define Open Source AI?

I have experienced firsthand the power imbalances that exist in geo-politics, but also in the context of economics where global minority countries shape the ‘global trajectory’ of AI without global voices. The definition of open means different things to different people / ecosystems / communities, and all voices should be heard and considered. Defining open means the values and responsibilities attached to it should be considered in a diverse manner, else the context of ‘open’ is in and of itself a hypocrisy.

Why do you think AI should be Open Source?

An enabling ecosystem is one that benefits all the stakeholders and ecosystem components. Inclusive efforts must be made to explore and find tangible actions or potential avenues for reconciling the tension between openness, democracy and representation in AI training data whilst preserving community agency, diverse values and stakeholder rights. However, the misuse, colonization and misinterpretation of data continues unabated. Much of African culture and knowledge is passed down through generations by storytelling, art, dance and poetry, verbally or through different forms of documentation, and in local manners and nuances of language. It is rarely digitized and certainly not in English. Language is culture and culture is context, yet somehow we find LLMs being used as an agent for language and context. Solutions and information are provided about and for communities but not with those communities, and the lack of transparency and post-colonial manipulation of data and culture is irresponsible and should be considered a human rights violation.

Additionally, Open Source and open systems enable nations to develop inclusive AI policy processes so that policymakers from Global South countries can draw from peer experience on tackling their AI policies and AI-related challenges to find their own approaches to AI policy. This will also challenge dependence on and domination by western centric / Global North countries on AI policies to push a narrative or agenda on ‘what’ and ‘how’; i.e. Africa / Asia / LATAM must learn from us how to do X (since we hold the power, we can determine the extent and cost – exploitative). We aim for government self-determination and to empower countries, so that they may collectively have a voice on the global stage.

Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?

My personal definition has not changed but it has been refreshing to witness the diverse views on how open is defined. The idea that behavior (e.g. of tech oligopolies) could reshape the way we define an idea or concept was thought-provoking. It means therefore that as emerging technology evolves, the idea of ‘open’ could change still in the future, depending on the trajectory of emerging technology and the values that society holds and attributes.

What do you think the primary benefit will be once there is a clear definition of Open Source AI?

A clear and more inclusive definition of Open Source AI would commence a wave towards making data injustice, data invisibility, data extractivism, and data colonialism more visible and subject to repercussions. It would spur open, inclusive and responsible repositories of data, data use, and more importantly accuracy of use and interpretation. I am hoping that this would also spur innovative ways to track and monitor / evaluate use of Open Source data, so that local and small businesses are encouraged to develop in an Open Source way while still being able to track and monitor players who extract and commercialize without giving back.

Ideally it would begin the process (albeit transitional) of bridging the digital divide between source and resource countries (i.e. global majority where data is collected from versus those who receive and process data for commercial benefit).

What do you think are the next steps for the community involved in Open Source AI?

If we make everything Open Source, it encourages sharing and use in developing and deploying, and offers transparency and shared learning, but enables freeriding. However, the corollary is that closed models such as copyright prioritize proprietary information and commercialisation but can limit shared innovation, and do not uphold the concept of communal efforts, community agency and development. How do we quell this tension? I would like to see the Open Source community working to find practical and actionable ways in which we can make this work (open, responsible and innovative but enabling community benefit / remuneration).

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

Join the working groups: be part of a team to evaluate various models against the OSAID.
Join the forum: support and comment on the drafts, record your approval or concerns to new and existing threads.
Comment on the latest draft: provide feedback on the latest draft document directly.
Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.

Hailey Schoelkopf: Voices of the Open Source AI Definition

Nick Vidal — Thu, 25 Jul 2024 11:45:26 +0000

This series features the voices of the volunteers who have helped shape and are shaping the Definition.

Meet Hailey Schoelkopf

What’s your background related to Open Source and AI?

One of the main reasons I was able to get more deeply involved in AI research was through open research communities such as the BigScience Workshop and EleutherAI, where discussions and collaboration were available to outsiders. These opportunities to share knowledge and learn from others more experienced than me were crucial to learning about the field and growing as a practitioner and researcher.

I co-led the training of the Pythia language models (https://arxiv.org/abs/2304.01373), some of the first fully-documented and reproducible large-scale language models with as many related artifacts as possible released Open Source. We were happy and lucky to see these models fill a clear need, especially in the research community, where Pythia has since contributed to a large number of studies attempting to build our understanding of LLMs, including interpreting their internals, understanding the process by which these models improve over training, and disentangling some of the effects of the dataset contents on these models’ downstream behavior.

What motivated you to join this co-design process to define Open Source AI?

There has been a significant amount of confusion induced by the fact that not all ‘open-weights’ AI models are released under OSI-compliant licenses, and some impose restrictions on their usage or adaptation, so I was excited that OSI was working on reducing this confusion by producing a clear definition that could be used by the Open Source community. I more directly joined the process by helping discuss how the Open Source AI Definition could be mapped onto the Pythia language models and the accompanying artifacts we released.

Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?

Deciding what counts as sufficient transparency and modifiability to be Open Source was an interesting problem. Although public model weights are very beneficial to the Open Source community, releasing model weights without sufficient detail to understand the model and its development process to make modifications or understand reasons behind its design and resulting characteristics can hinder understanding or prevent the full benefits of a completely Open Source model from being realized.

Why do you think AI should be Open Source?

There are clear advantages to having models that are Open Source. Access to such fully-documented models can help a much, much broader group of people (trained researchers and also many others) who can use, study, and examine these models for their own purposes. While not every model should be made Open Source under all conditions, wider scrutiny and study of these models can help increase our understanding of AI systems’ behavior, raise societal preparedness and awareness of AI capabilities, and improve these models’ safety by allowing more people to understand them and explore their flaws.

With the Pythia language models, we’ve seen many researchers explore questions around the safety and biases of these models, including a breadth of questions we’d not have been able to study ourselves, or many that we could not even anticipate. These different perspectives are a crucial component in making AI systems safer and more broadly beneficial.

What do you think is the role of data in Open Source AI?

Data is a crucial component of AI systems. Transparency around (and, potentially, open release of) training datasets can enable a wide range of extended benefits to researchers, practitioners, and society at large. I think that for a model to be truly Open Source, and to derive the greatest benefits from its openness, information on training data must be shared transparently. This information also importantly allows various members of the Open Source community to avoid replicating each other’s work independently. Transparent sharing about motivations and findings with respect to dataset creation choices can improve the community’s collective understanding of system and dataset design for the future and minimize overlapping, wasted effort.

Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?

An interesting perspective that I’ve grown to appreciate is that the Open Source AI definition includes public and Open Source licensed training and inference code. Actually making one’s Open Source AI model effectively usable by the community and practitioners is a crucial step of promoting transparency, though not often enough discussed.

What do you think the primary benefit will be once there is a clear definition of Open Source AI?

Having a clear definition of Open Source AI can make it clearer where existing, currently “open” systems fall, and potentially encourage future open-weights models to be released with more transparency. Many current open-weights models are shared under bespoke licenses with terms not compliant with Open Source principles. This creates legal uncertainty and also makes it less likely that a new open-weights model release will benefit practitioners at large or contribute to better understanding of how to design better systems. I would hope that a clearer Open Source AI definition will make it easier to draw these lines and encourage those currently releasing open-weights models to do so in a way more closely fitting the Open Source AI standard.

What do you think are the next steps for the community involved in Open Source AI?

An exciting future direction for the Open Source AI research community is to explore methods for greater control over AI model behavior; attempting to explore approaches to collective modification and collaborative development of AI systems that can adapt and be “patched” over time. A stronger understanding of how to properly evaluate these systems for capabilities, robustness, and safety will also be crucial. I hope to see the community direct greater attention to evaluation in the future as well.

Better identifying conda packages with ClearlyDefined

Nick Vidal — Tue, 23 Jul 2024 23:17:18 +0000

ClearlyDefined, an Open Source project that helps organizations with supply chain compliance, now provides a new harvester implementation for conda, a popular package manager with a large collection of pre-built packages for various domains, including data science, machine learning, scientific computing and more.

Conda provides package, dependency and environment management for any language and is very popular with Python and R. It allows users to manage and control the dependencies and versions of packages specific to each project, ensuring reproducibility and avoiding conflicts between different software requirements.

ClearlyDefined crawls both the main conda package and the source code for licensing metadata. The main conda package is hosted on the conda channels themselves and contains all necessary licensing information, compilers, environment configuration scripts and dependencies that are needed to make the package work. The source code from which the conda package is created is often hosted on an external website such as GitHub.

The conda crawler uses the following coordinates:

type (required): conda or condasource
provider (required): channel on which the package will be crawled, such as conda-forge, anaconda-main or anaconda-r
namespace (optional): architecture and OS of the package to be crawled, e.g. win64 or linux-aarch64, or any if no architecture is specified
package name (required): name of the package
revision (optional): package version and optional build version

For example, the popular numpy package is represented as shown below.
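
As a rough illustration of how the five coordinates listed above fit together, the sketch below joins them into a slash-separated path in the order type, provider, namespace, name, revision. The class name, the validation rules and the numpy version used here are hypothetical and chosen only for illustration; this is not ClearlyDefined's own code, and the exact values for a real package may differ.

```python
from dataclasses import dataclass
from typing import Optional

# The two coordinate types named in the text.
VALID_TYPES = {"conda", "condasource"}


@dataclass(frozen=True)
class CondaCoordinates:
    """Hypothetical container for the coordinates described above."""
    type: str                        # required: conda or condasource
    provider: str                    # required: channel, e.g. conda-forge
    name: str                        # required: package name
    namespace: Optional[str] = None  # optional: architecture and OS
    revision: Optional[str] = None   # optional: version (and build)

    def __post_init__(self) -> None:
        if self.type not in VALID_TYPES:
            raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
        if not self.provider or not self.name:
            raise ValueError("provider and name are required")

    def to_path(self) -> str:
        # Per the text, "any" stands in when no architecture is specified.
        parts = [self.type, self.provider, self.namespace or "any", self.name]
        if self.revision:
            parts.append(self.revision)
        return "/".join(parts)


# Illustrative values only: the version below is an assumption.
numpy_coords = CondaCoordinates(
    type="conda",
    provider="conda-forge",
    name="numpy",
    namespace="linux-aarch64",
    revision="1.26.4",
)
print(numpy_coords.to_path())
```

Leaving out the namespace falls back to any, and leaving out the revision simply produces a shorter path.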

With the increased importance of data science, machine learning and scientific computing, this support for conda packages in ClearlyDefined is extremely important. It will allow organizations to better manage the licenses of their conda packages for compliance. This work was led by Basit Ayantunde from CodeThink under the stewardship of Qing Tomlinson from SAP. We would like to thank them and all those involved in the development and testing of this implementation.

We are looking for feedback. Please test this feature on dev.clearlydefined.io or dev-api.clearlydefined.io and file any issues here.

Cailean Osborne: voices of the Open Source AI Definition

Nick Vidal — Thu, 18 Jul 2024 17:09:33 +0000

This series features the voices of the volunteers who have helped shape and are shaping the Definition.

Meet Cailean Osborne

What’s your background related to Open Source and AI?

My interest in Open Source AI began around 2020 when I was working in AI policy at the UK Government. I was surprised that Open Source never came up in policy discussions, given its crucial role in AI R&D. Having been a regular user of libraries like scikit-learn and PyTorch in my previous studies, I followed Open Source AI trends in my own time and eventually I decided to do a PhD on the topic. When I started my PhD back in 2021, Open Source AI still felt like a niche topic, so it’s been exciting to watch it become a major talking point over the years.

Beyond my PhD, I’ve been involved in the Open Source AI community as a contributor to scikit-learn and as a co-developer of the Model Openness Framework (MOF) with peers from the Generative AI Commons community. Our goal with the MOF is to provide guidance for AI researchers and developers to evaluate the completeness and openness of “Open Source” models based on open science principles. We were chuffed that the OSI team chose to use the 16 components from the MOF as the rubric for reviewing models in the co-design process.

What motivated you to join this co-design process to define Open Source AI?

The short answer is: to contribute to establishing an accurate definition for “Open Source AI” and to learn from all the other experts involved in the co-design process. The longer answer is: There’s been a lot of confusion about what is or is not “Open Source AI,” which hasn’t been helped by open-washing. “Open source” has a specific definition (i.e. the right to use, study, modify, and redistribute source code) and what is being promoted as “Open Source AI” deviates significantly from this definition. Rather than being pedantic, getting the definition right matters for several reasons; for example, for the “Open Source” exemptions in the EU AI Act to work (or not work), we need to know precisely what “Open Source” models actually are. Andreas Liesenfeld and Mark Dingemanse have written a great piece about the issues of open-washing and how they relate to the AI Act, which I recommend reading if you haven’t yet. So, I got involved to help develop a definition and to learn from all the other experts involved. It hasn’t been easy (it’s a pretty divisive topic!), but I think we’ve made good progress.

Can you describe your experience participating in this process? What did you most enjoy about it and what were some of the challenges you faced?

First off, I have to give credit to Stef and Mer for maintaining momentum throughout the process. Coordinating a co-design effort with volunteers scattered around the globe, each with varying levels of availability and (strong) opinions on the matter, is no small feat. So, well done! I also enjoyed seeing how others agreed or disagreed when reviewing models. The moments of disagreement were the most interesting; for example, about whether training data should be available versus documented and if so, in how much detail… Personally, the main challenge was searching for information about the various components of models that were apparently “Open Source” and observing how little information was actually provided beyond weights, a model card, and if you’re lucky an arXiv preprint or technical report.

Why do you think AI should be Open Source?

When talking about the benefits of Open Source AI, I like to point folks to a 2007 paper, in which 16 researchers highlighted “The Need for Open Source Software in Machine Learning” due to basically the complete lack of OSS for ML/AI at the time. Fast forward to today, AI R&D is practically unthinkable without OSS, from data tooling to the deep learning frameworks used to build LLMs. Open source and openness in general have many benefits for AI, from enabling access to SOTA AI technologies and transparency which is key for reproducibility, scrutiny, and accountability to widening participation in their design, development, and governance.

What do you think is the role of data in Open Source AI?

If the question is strictly about the role of data in developing open AI models, the answer is pretty simple: Data plays a crucial role because it is needed for training, testing, aligning, and auditing models. But if the question is asking “should the release of data be a condition for an open model to qualify as Open Source AI,” then the answer is obviously much more complicated.

Companies are in no rush to share training data for a handful of reasons, be it competitive advantage, data protection, or, frankly, the risk of being sued for copyright infringement. The copyright concern isn’t limited to companies: EleutherAI has also been sued and had to take down the Books3 dataset from The Pile. There are also many social and cultural concerns that restrict data sharing; for example, the Kōrero Kaitiakitanga license has been developed to protect the interests of indigenous communities in New Zealand. So, the data question isn’t easy and perhaps we shouldn’t be too dogmatic about it.

Personally, I think the compromise in v. 0.0.8, which states that model developers should provide sufficiently detailed information about data if they can’t release the training dataset itself, is a reasonable halfway house. I also hope to see more open pre-training datasets like the one developed by the community-driven BigScience Project, which involved open deliberation about the design of the dataset and provides extensive documentation about data provenance and processing decisions (e.g. check out their Data Catalogue). The FineWeb dataset by Hugging Face is another good example of an open pre-training dataset, which they released with pre-processing code, evaluation results, and super detailed documentation.

Has your personal definition of Open Source AI changed along the way? What new perspectives or ideas did you encounter while participating in the co-design process?

To be honest, my personal definition hasn’t changed much. I am not a big fan of the use of “Open Source AI” when folks specifically mean “open models” or “open-weight models”. What we need to do is raise awareness about appropriate terminology and point out “open-washing”, as people have done, and I must say that subjectively I’ve seen improvements: less “Open Source models” and more “open models”. But I will say that I do find “Open Source AI” a useful umbrella term for the various communities of practice that intertwine in the development of open models, including OSS, open data, and AI researchers and developers, who all bring different perspectives and ways of working to the overarching “Open Source AI” community.

What do you think the primary benefit will be once there is a clear definition of Open Source AI?

We’ll be able to reduce confusion about what is or isn’t “Open Source AI” and more easily combat open-washing efforts. As I mentioned before, this clarity will be beneficial for compliance with regulations like the AI Act which includes exemptions for “Open Source” AI.

What do you think are the next steps for the community involved in Open Source AI?

We still have many steps to take but I’ll share three for now.

First, we urgently need to improve the auditability and therefore the safety of open models. With OSS, we know that (1) the availability of source code and (2) open development enable the distributed scrutiny of source code. Think Linus’ Law: “Given enough eyeballs, all bugs are shallow.” Yet open models are more complex than just source code, and the lack of openness of many key components like training data is holding back adoption because would-be adopters can’t adequately run due diligence tests on the models. If we want to realise the benefits of “Open Source AI,” we need to figure out how to increase the transparency and openness of models; we hope the Model Openness Framework can help with this.

Second, I’m really excited about grassroots initiatives that are leading community-driven approaches to developing open models and open datasets like the BigScience project. They’re setting an example of how to do “Open Source AI” in a way that promotes open collaboration, transparency, reproducibility, and safety from the ground up. I can still count such initiatives on my fingers but I am hopeful that we will see more community-driven efforts in the future.

Third, I hope to see the public sector and non-profit foundations get more involved in supporting public interest and grassroots initiatives. France has been a role model on this front: providing a public grant to train the BigScience project’s BLOOM model on the Jean Zay supercomputer, as well as funding the scikit-learn team to build out a data science commons.

Mer Joyce: voices of the Open Source AI Definition

Nick Vidal — Wed, 10 Jul 2024 13:36:16 +0000

The Open Source Initiative (OSI) is running a series of stories about a few of the people involved in the Open Source AI Definition (OSAID) co-design process. We’ll be featuring the voices of the volunteers who have helped shape and are shaping the Definition.

The OSI started researching the topic in 2022, and in 2023 began the co-design process of a new definition of Open Source that applies to AI. The OSI hired Mer Joyce, founder and principal of Do Big Good, as an independent consultant to lead the co-design process. She has worked for over a decade at the intersection of research, policy, innovation and social change.

Mer Joyce, process facilitator for the Open Source AI Definition

About co-design

Co-design, also called participatory or human-centered design, is a set of creative methods used to solve communal problems by sharing knowledge and power. The co-design methodology addresses the challenges of reaching an agreed definition within a diverse community (Costanza-Chock, 2020; Escobar, 2018; Creative Reaction Lab, 2018; Friedman et al., 2019).

As noted in MIT Technology Review’s article about the OSAID, “[t]he open-source community is a big tent… encompassing everything from hacktivists to Fortune 500 companies…. With so many competing interests to consider, finding a solution that satisfies everyone while ensuring that the biggest companies play along is no easy task.” (Gent, 2024).

The co-design method allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support from such a significant and broad group of people also creates a tension to be managed between moving swiftly enough to deliver outputs that can be used operationally and taking the time to consult widely to understand the big issues and garner community buy-in. Having Mer as facilitator of the OSAID co-design, with her in-depth experience, has been important in ensuring the integrity of the process.

The OSAID co-design process

The first step of the OSAID co-design process was to identify the freedoms needed for Open Source AI. After various online and in-person activities and discussions, including five workshops across the world, the community adopted the four freedoms for software, now adapted for AI systems:

Freedom to Use the system for any purpose and without having to ask for permission.
Freedom to Study how the system works and inspect its components.
Freedom to Modify the system for any purpose, including to change its output.
Freedom to Share the system for others to use with or without modifications, for any purpose.

The next step was the formation of four working groups to initially analyze four different AI systems and their components. To achieve better representation, special attention was given to diversity, equity and inclusion. Over 50% of the working group participants are people of color, 30% are black, 75% were born outside the US, and 25% are women, trans or nonbinary.

These working groups discussed and voted on which AI system components should be required to satisfy the four freedoms for AI. The components adopted are described in the Model Openness Framework developed by the Linux Foundation.

The vote compilation was performed based on the mean total votes per component (μ). Components that received over 2μ votes were marked as “required,” and between 1.5μ and 2μ were marked “likely required.” Components that received between 0.5μ and μ were marked as “likely not required,” and less than 0.5μ were marked “not required.”
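
The thresholds above can be sketched as a small classification function. This is only an illustration under stated assumptions, not the OSI's actual tabulation: the function name, the component names and the vote counts are hypothetical, the range boundaries are treated as inclusive, and because the text does not say how votes strictly between μ and 1.5μ were labelled, the sketch marks that band as "unspecified" rather than guessing.

```python
from typing import Dict


def classify_components(votes: Dict[str, int]) -> Dict[str, str]:
    """Label each component by comparing its votes to the mean (mu)."""
    if not votes:
        return {}
    mu = sum(votes.values()) / len(votes)  # mean total votes per component
    if mu == 0:
        # Assumption: with no votes at all, nothing is treated as required.
        return {component: "not required" for component in votes}

    labels: Dict[str, str] = {}
    for component, count in votes.items():
        if count > 2 * mu:
            labels[component] = "required"
        elif count >= 1.5 * mu:
            labels[component] = "likely required"
        elif count > mu:
            # The text gives no label for this band; left open on purpose.
            labels[component] = "unspecified"
        elif count >= 0.5 * mu:
            labels[component] = "likely not required"
        else:
            labels[component] = "not required"
    return labels


# Hypothetical vote counts; the mean here is 4.
example_votes = {
    "component_a": 9,
    "component_b": 7,
    "component_c": 5,
    "component_d": 2,
    "component_e": 1,
    "component_f": 0,
}
example_labels = classify_components(example_votes)
for component, label in example_labels.items():
    print(component, "->", label)
```

Normalizing by the mean keeps the cut-offs relative to overall participation, so the same rule applies whether a working group cast a few dozen votes or several hundred.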

After the working groups evaluated legal frameworks and legal documents for each component, each working group published a recommendation report. The end result is the OSAID with a comprehensive definition checklist encompassing a total of 17 components. More working groups are being formed to evaluate how well other AI systems align with the Definition.

OSAID multi-stakeholder co-design process: from component list to a definition checklist

Meet Mer Joyce

Video recorded by Ezequiel Lanza, Open Source AI Evangelist at Intel

I am the process facilitator for the Open Source AI Definition, the Open Source Initiative project creating a definition of Open Source AI that will be a part of the stable public infrastructure of Open Source technology that everyone can benefit from, similar to the Open Source Definition that OSI currently stewards. The co-design of the Open Source AI Definition involves consulting with global stakeholders to ensure their vast range of needs are represented while integrating and weaving together the variety of different perspectives on what Open Source AI should mean.

If you would like to participate in the process, we’re currently on version 0.0.7. We will have a release candidate in June and a stable version in October. There is a public forum at discuss.opensource.org where anyone can create an account and make comments. As different versions are created, updates about our process are released here as well. I am available, as is the executive director of the OSI, to answer questions at bi-weekly town halls that are open for anyone to attend.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

Join the working groups: be part of a team to evaluate various models against the OSAID.
Join the forum: support and comment on the drafts, record your approval or concerns in new and existing threads.
Comment on the latest draft: provide feedback on the latest draft document directly.
Follow the weekly recaps: subscribe to our newsletter and blog to be kept up-to-date.
Join the town hall meetings: participate in the online public town hall meetings to learn more and ask questions.
Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.

One of the many OSAID workshops organized by Mer Joyce around the world

Beyond SPDX: expanding licenses identified by ClearlyDefined

Nick Vidal — Tue, 09 Jul 2024 18:26:47 +0000

ClearlyDefined is an Open Source project that helps organizations with supply chain compliance. Until recently, ClearlyDefined’s tooling only supported licenses that were part of the standardized SPDX license list. Any component identified by a license that was not part of this list resulted in NOASSERTION, which introduced uncertainty about the permissible use of such components, potentially hindering collaboration and creating legal complexities and security concerns for developers.

Fortunately, ScanCode, which is an integral part of how ClearlyDefined detects and normalizes the origin, dependencies and licensing metadata of components, already supports non-SPDX licenses thanks to its use of LicenseDB. LicenseDB is the largest free and open database of software licenses, in particular Open Source software licenses, with over 2000 community-curated license texts and their metadata.

Philippe Ombredanne, the lead author of ScanCode and LicenseDB, argued for ClearlyDefined leveraging this capability already provided by ScanCode:

As one of many examples, common public domain dedications are not tracked nor supported by SPDX and are not approved as OSI licenses. Not a single lawyer I know is treating these as proprietary licenses. They are carefully cataloged and properly detected by ScanCode (at least 850+ variants of these at last count plus an infinity of variations detected approximately)…

Collecting data is not endorsing nor promoting anything in particular be it proprietary, open source, free software, source available or else. But rather, just accepting that the world of actual licenses is what it is in all its glorious messy diversity and capturing what these licenses are, without discarding valuable information detected by ScanCode. Discarding and losing data has been the problem until now and has been making ClearlyDefined data mostly harmless and useless at scale as you get better and more information out of a straight ScanCode scan.

You are welcome to use anything you like, but I think it would be better to adopt the de-facto industry standard of ScanCode license data, rather than to reinvent the wheel, especially since ClearlyDefined is kinda using ScanCode rather heavily.

We use a suffix as LicenseRef-scancode in https://scancode-licensedb.aboutcode.org/ and guarantee stability of these with the track record to prove this.

After a healthy discussion on the topic, the ClearlyDefined community agreed that supporting non-SPDX licenses was important. ScanCode already provides this functionality and offers a mapping from these non-SPDX licenses to the SPDX LicenseRef. Organizations using ClearlyDefined now have the option to decide how to handle non-SPDX licenses based on their own needs. This work to have ClearlyDefined use the latest version of ScanCode and support non-SPDX licenses was led by Lukas Spieß from GitHub with stewardship from Qing Tomlinson (from SAP) and E. Lynette Rayle (also from GitHub). We would like to thank them and all those involved in the development and testing of this implementation.
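To make the "decide how to handle non-SPDX licenses" step more concrete, here is a minimal sketch of how a consumer might sort license identifiers. The function name, the tiny sample of SPDX identifiers, and the example identifiers are assumptions for illustration only; this is not ClearlyDefined's or ScanCode's actual code. The sketch relies only on the LicenseRef-scancode naming mentioned in the quote above.

```python
# Illustrative subset only; the real SPDX license list is much longer.
SAMPLE_SPDX_IDS = {"MIT", "Apache-2.0", "GPL-3.0-only", "BSD-3-Clause"}

SCANCODE_PREFIX = "LicenseRef-scancode-"


def categorize_license(license_id, spdx_ids=SAMPLE_SPDX_IDS):
    """Return 'spdx', 'scancode-licenseref', 'no-assertion' or 'other'."""
    if license_id == "NOASSERTION":
        return "no-assertion"
    if license_id in spdx_ids:
        return "spdx"
    if license_id.startswith(SCANCODE_PREFIX):
        return "scancode-licenseref"
    return "other"


# Hypothetical identifiers as they might appear in component metadata.
found = ["MIT", "LicenseRef-scancode-public-domain", "NOASSERTION"]
categories = {license_id: categorize_license(license_id) for license_id in found}

# Each organization can then apply its own policy, for example sending
# every non-SPDX identifier to legal review instead of rejecting it.
needs_review = [lid for lid, cat in categories.items() if cat != "spdx"]
```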

We are looking for feedback. Please test this feature on dev.clearlydefined.io or dev-api.clearlydefined.io and file any issues here.

Highlights from AI_dev Paris

Nick Vidal — Wed, 03 Jul 2024 17:34:43 +0000

On June 19-20, the Linux Foundation hosted AI_dev: Open Source GenAI & ML Summit Europe 2024. This event brought together developers exploring the complex world of Open Source generative AI and Machine Learning. Central to this event was the conviction that Open Source drives innovation in AI. Please find below some highlights from AI_dev Paris and how they are aligned with OSI’s work on the Open Source AI Definition.

Keynote: Welcome & Opening Remarks

Ibrahim Haddad, Executive Director of the LF AI & Data Foundation, provided an overview of the major challenges in Open Source AI, which include:

Lack of a common understanding of openness in AI
Open Source software licenses used on non-software assets
Diverse restrictions including the use of Acceptable Use Policies
Lack of understanding of licenses and implications in the context of AI models
Incomplete release of model components

To address some of these challenges, Haddad introduced the Model Openness Framework (MOF) and announced the official launch of the Model Openness Tool (MOT) at the conference.

Introducing the Model Openness Framework: Achieving Completeness and Openness in a Confusing Generative AI Landscape

Anni Lai, Matt White, and Cailean Osborne delved into the Model Openness Framework, a comprehensive system for evaluating and classifying the completeness and openness of Machine Learning models. This framework assesses which components of the model development lifecycle are publicly released and under what licenses, ensuring an objective evaluation. Matt White, Executive Director of the PyTorch Foundation and author of the MOF white paper, went on to demonstrate the Model Openness Tool, which evaluates each model across 3 classes: Open Science (Class I), Open Tooling (Class II), and Open Model (Class III).

Model Openness Tool: launched at the Linux Foundation’s AI_dev Paris conference

The Open Source AI dilemma: Crafting a clear definition for Open Source AI

Ofer Hermoni, founder of the LF AI & Data Foundation, continued examining the Model Openness Framework and explained how this framework and its list of components serve as the basis for OSI’s Open Source AI Definition (OSAID). The OSAID evaluates each component on the four fundamental freedoms of Open Source:

To use the system for any purpose and without having to ask for permission
To study how the system works and inspect its components
To modify the system for any purpose, including to change its output
To share the system for others to use with or without modifications, for any purpose

Toward AI Democratization with Digital Public Goods

Lea Gimpel and Daniel Brumund from the Digital Public Goods Alliance (DPGA) emphasized the importance of democratizing AI through digital public goods, including Open Source software, open AI models, open data, open standards, and open content. Lea highlighted that, while open data is desirable, it is not a requirement. She supported the OSI’s Open Source AI Definition, as it helps the DPGA navigate legal uncertainties around data sharing and broadens the pool of potential solutions that can be recognized, marketed, and made available as digital public goods, thereby offering more opportunities to positively impact people’s lives.

Conclusion

It was clear throughout this conference that the work to create a standard Open Source AI Definition that upholds the fundamental freedoms of Open Source is vital for addressing some of the key challenges in AI and ML development and democratization. The OSI appreciates the Linux Foundation’s collaboration toward this goal and its commitment to host another successful event to facilitate these important discussions.

OSI at PyCon US: engaging with AI practitioners and developers as we reach OSAID’s first release candidate

Nick Vidal — Wed, 29 May 2024 12:00:33 +0000

As part of the Open Source AI Definition roadshow and as we approach the first release candidate of the draft, the Open Source Initiative (OSI) participated in PyCon US 2024, the annual gathering of the Python community. This opportunity was important because PyCon US brings together AI practitioners and developers alike, and their input on what constitutes Open Source AI is of the utmost value. The OSI organized a workshop and had a community booth there.

OSAID Workshop: compiling a FAQ to make the definition clear and easy to use

The OSI has embarked on a co-design process with multiple stakeholders to arrive at the Open Source AI Definition (OSAID). This process has been led by Mer Joyce, the co-design expert and facilitator, and Stefano Maffulli, the executive director of the OSI.

At the workshop organized at PyCon US, Mer provided an overview of the co-design process so far, summarized below.

The first step of the co-design process was to identify the freedoms needed for Open Source AI. After various online and in-person activities and discussions, including five workshops across the world, the community identified four freedoms:

To Use the system for any purpose and without having to ask for permission.
To Study how the system works and inspect its components.
To Modify the system for any purpose, including to change its output.
To Share the system for others to use with or without modifications, for any purpose.

The next step was to form four working groups to initially analyze four AI systems. To achieve better representation, special attention was given to diversity, equity and inclusion. Over 50% of the working group participants are people of color, 30% are Black, 75% were born outside the US and 25% are women, trans and nonbinary.

These working groups discussed and voted on which AI system components should be required to satisfy the four freedoms for AI. The components we adopted are described in the Model Openness Framework developed by the Linux Foundation.

The vote compilation was performed based on the mean total votes per component (μ). Components which received over 2μ votes were marked as required and between 1.5μ and 2μ were marked likely required. Components that received between 0.5μ and μ were marked likely not required and less than 0.5μ as not required.

The working groups evaluated legal frameworks and legal documents for each component. Finally, each working group published a recommendation report. The end result is the OSAID with a comprehensive definition checklist encompassing a total of 17 components. More working groups are being formed to evaluate how well other AI systems align with the definition.

OSAID multi-stakeholder process: from component list to a definition checklist

After providing an overview of the co-design process, Mer went on to organize an exercise with the participants to compile a FAQ.

The questions raised at the workshop revolved around the following topics:

End user comprehension: how and why are AI systems different from Open Source software? As an end-user, why should they care if an AI system is open?
Datasets: Why is data itself not required? Should Open Source AI datasets be required to prove copyright compliance? How can one audit these systems for bias without the data? What does data provenance and data labeling entail?
Models: How can proper attribution of model parameters be enforced? What is the ownership/attribution of model parameters which were trained by one author and then “fine-tuned” by another?
Code: Can projects that include only source code (no data info or model weights) still use a regular Open Source license (MIT, Apache, etc.)?
Governance: For a specific AI, who determines whether the information provided about the training, dataset, process, etc. is “sufficient” and how?
Adoption of the OSAID: What are incentives for people/companies to adopt this standard?
Legal weight: Is the OSAID supposed to have legal weight?

These questions and answers raised at the workshop will be important for enhancing the existing FAQ, which will be made available along with the OSAID.

OSAID workshop: a collection of post-its with questions raised by participants.

Community Booth: gathering feedback on the “Unlock the OSAID” visualization

At the community booth, the OSI held two activities to draw in participants interested in Open Source AI. The first activity was a quiz developed by Ariel Jolo, program coordinator at the OSI, to assess participants’ knowledge of Python and AI/ML. Once we had an understanding of their skills, we went on to the second and main activity, which was to gather feedback on the OSAID using a novel way to visualize how different AI systems match the current draft definition as described below.

Making it easy for different stakeholders to visualize whether or not an AI system matches the OSAID is a challenge, especially because there are so many components involved. This is where the visualization concept we named “Unlock the OSAID” came in.

The OSI keyhole is a well recognized logo that represents the source code that unlocks the freedoms to use, study, modify, and share software. With the Unlock the OSAID, we played on that same idea, but now for AI systems. We displayed three keyholes representing the three domains these 17 components fall within: code, model and data information.

Here is the image representing the “code keyhole” with the required components to unlock the OSAID:

On the inner ring we have the required components to unlock the OSAID, while on the outer ring we have optional components. The required code components are: libraries and tools; inference; training, validation and testing; data pre-processing. The optional components are: inference for benchmark; evaluation code.

To fully unlock the OSAID, an AI system must have all the required components for code, model and data information. To better understand how the “Unlock the OSAID” visualization works, let’s look at two hypothetical AI systems: example 1 and example 2.

Let’s start by looking at example 1 (in red) and see if this system unlocks the OSAID for code:

Example 1 only provides inference code, so the key (in red) doesn’t “fit” the code keyhole (in green).

Now let’s look at example 2 (in blue):

Example 2 provides all required components (and more), so the key (in blue) fits the code keyhole (in green). Therefore, example 2 unlocks the OSAID for code. For example 2 to be considered Open Source AI, it would also have to unlock the OSAID for model and data information:

We received good feedback from participants about the “Unlock the OSAID” visualization. Once participants grasped the concept of the keyholes and which components were required or optional, it was easy to identify if an AI system unlocks the OSAID or not. They could visually see if the keys fit the keyholes or not. If all keys fit, then that AI system adheres to the OSAID.
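The "key fits the keyhole" idea can also be expressed as a simple set comparison. The sketch below is only an illustration: the function names are made up for this example, and only the code keyhole is filled in, since the required components for model and data information are not listed in this post.

```python
# Required code components as listed above for the "code keyhole".
REQUIRED_CODE = frozenset({
    "libraries and tools",
    "inference",
    "training, validation and testing",
    "data pre-processing",
})


def missing_components(required, provided):
    """Return the required components that a system does not provide."""
    return sorted(set(required) - set(provided))


def key_fits(required, provided):
    """A key fits when every required component is provided; extras are fine."""
    return not missing_components(required, provided)


def unlocks_osaid(keyholes, system):
    """All keyholes (code, model, data information) must be unlocked."""
    return all(
        key_fits(required, system.get(domain, ()))
        for domain, required in keyholes.items()
    )


# Hypothetical systems mirroring example 1 and example 2 (code domain only).
example_1 = {"inference"}
example_2 = set(REQUIRED_CODE) | {"evaluation code"}

fits_1 = key_fits(REQUIRED_CODE, example_1)
fits_2 = key_fits(REQUIRED_CODE, example_2)
```

Here `fits_1` is false because three required components are missing, while `fits_2` is true even though example 2 also ships an optional component.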

Final thoughts: engaging with the community and promoting Open Source principles

For me, the highlight of PyCon US was the opportunity to finally meet members of the OSI and the Python community in person, both new and old acquaintances. I had good conversations with Deb Nicholson (Python Software Foundation), Hannah Aubry (Fastly), Ana Hevesi (Uploop), Tom “spot” Callaway (AWS), Julia Ferraioli (AWS), Tony Kipkemboi (Streamlit), Michael Winser (Alpha-Omega), Jason C. MacDonald (OWASP), Cheuk Ting Ho (CMD Limes), Kamile Demir (Adobe), Mariatta Wijaya (PSF), Loren Clary (PSF) and Miaolai Zhou (AWS). I also interacted with many folks from the following communities: Python Brazil, Python en Español, PyLadies and Black Python Devs. It was great to bump into legends like Seth Larson (PSF), Peter Wang (Anaconda) and Guido van Rossum.

I loved all the keynotes, in particular from Sumana Harihareswara about how she has improved Python Software Foundation’s infrastructure, and from Simon Willison about how we can all benefit from Open Source AI.

We also had a special dinner hosted by Stefano to celebrate this milestone of the OSAID, with Stefano, Mer and me overlooking Pittsburgh.

Overall, our participation at PyCon US was a success. We shared the work OSI has been doing toward the first release candidate of the Open Source AI Definition, and we did it in an entertaining and engaging way, with plenty of connection throughout.

Photo credits: Ana Hevesi, Mer Joyce, and Nick Vidal

Unveiling ClearlyDefined: this free SBOM service gets cleared for takeoff

Nick Vidal — Thu, 16 May 2024 13:43:00 +0000

With all the buzz around SBOMs and Open Source supply chain compliance and security, a new revolution is igniting at ClearlyDefined. This amazing project has been flying under the radar since its inception six years ago, but now this free service and open source project from the Open Source Initiative (OSI) gets cleared for takeoff with the launch of a new website focused on stellar documentation, excellent engineering, and healthy community growth.

Generating SBOMs at scale for each stage of the supply chain, for every build or release, has proven to be a real challenge for organizations. And fixing the same missing or wrongly identified licensing metadata over and over again has been a redundant pain for everyone. This is where ClearlyDefined shines, as it makes it really easy for organizations to fetch a cached copy of licensing metadata for each component through a simple API, which is always up-to-date thanks to its crowdsourced database.
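As a rough sketch of what fetching that metadata could look like, the Python below builds a definition URL from a component's coordinates and retrieves it. The helper names are invented for this example, and the URL layout (type/provider/namespace/name/revision under /definitions on api.clearlydefined.io, with "-" standing in for an empty namespace) is stated here as an assumption; please check the ClearlyDefined docs before relying on it.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the public ClearlyDefined service.
API_BASE = "https://api.clearlydefined.io"


def definition_url(ctype, provider, namespace, name, revision):
    """Build a definitions URL from component coordinates (assumed layout)."""
    parts = [ctype, provider, namespace or "-", name, revision]
    return API_BASE + "/definitions/" + "/".join(quote(p, safe="") for p in parts)


def fetch_definition(url, timeout=30):
    """Fetch a definition and return the parsed JSON (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example coordinates for an npm package without a namespace.
lodash_url = definition_url("npm", "npmjs", None, "lodash", "4.17.21")
```

Calling `fetch_definition(lodash_url)` would then return the cached definition for that component, assuming the service is reachable.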

The all-new ClearlyDefined website was completely revamped to welcome community members and foster collaboration united by a shared vision of Open Source excellence. The website is divided into three sections: Docs, Resources, and Community.

Under Docs, both new and existing community members will find several comprehensive guides and tutorials. The main guide is “Getting involved,” where members will embark on a journey to learn how to use the data, curate the data, contribute data, contribute code, add a harvest and adopt practices. The “Roles” guide provides a detailed description of how different roles can master ClearlyDefined, from data consumer and data curator to data contributor and code contributor. Other guides that will expand in the coming months include the “Curation” and “Harvest” guides. Curation is the process of fixing or identifying missing licensing metadata and sharing that with the community, while harvest is the process of fetching licensing metadata directly from the source (package managers like npm and PyPI), processing the license definitions, and making them available through an API.

Under Resources, members will find a rich collection of content: Blog, FAQ, Glossary, Providers, Architecture and Roadmap. The roadmap was created in collaboration with members of the community, who provided input into what they would like to see in 2024 and how they would be able to contribute towards these goals.

Under Community, members will find links to various channels where they can engage with others online or in-person: GitHub, Forum, Events and Meetings. They’ll also find a list of other community members with whom they can forge connections, as well as the Code of Conduct and the project Charter.

We would like to extend a heartfelt thank you to our existing community members who have been instrumental in the launch of the new website and welcome new ones who are learning about the project. Besides expanding the “Curation” and “Harvest” guides, next steps include enhancing the user experience by implementing sitewide search and adding case studies filled with rich media. Come and join the ClearlyDefined community here and get ready to take off together with us. Let’s define the future of Open Source, one definition at a time!

Compelling responses to NTIA’s AI Open Model Weights RFC

Nick Vidal — Tue, 09 Apr 2024 12:03:50 +0000

The National Telecommunications and Information Administration (NTIA) posted a request for comments on Dual Use Foundation Artificial Intelligence Models with Widely Available Model Weights, and it has received 362 comments.

In addition to joining the letter drafted by Mozilla and the Center for Democracy and Technology (CDT), the Open Source Initiative (OSI) has also sent a letter of its own, highlighting our multi-stakeholder process to create a unified, recognized definition of Open Source AI.

The following is a list of some comments from nonprofit organizations and companies.

Comments from additional nonprofit organizations

Researchers from Stanford University’s Human-centered AI (HAI) and Princeton University recommend that the federal government prioritize understanding of the marginal risk of open foundational models when compared to proprietary, creating policies based on this marginal risk. Their response also highlighted several unique benefits from open foundational models, including higher innovation, transparency, diversification, and competitiveness.
Wikimedia Foundation recommends that regulatory approaches should support and encourage the development of beneficial uses of open technologies rather than depending on more closed systems to mitigate risks. Wikimedia believes open and widely available AI models, along with the necessary infrastructure to deploy them, could be an equalizing force for many jurisdictions around the world by mitigating historical disadvantages in the ability to access, learn from, and use knowledge.
EleutherAI Institute recommends Open Source AI and warns that restrictions on open-weight models are a costly intervention with comparatively little benefit. EleutherAI believes that open models enable people close to the deployment context to have greater control over the capabilities and usage restrictions of their models, study the internal behavior of models during deployment, and examine the training process and especially training data for signs that a model is unsafe to deploy in a specific use-case. They also lower barriers of entry by making models cheaper to run and enable users whose use-cases require strict guarding of privacy (e.g., medicine, government benefits, personal financial information) to use them.
MLCommons recommends the use of standardized benchmarks, which will be a critical component for mitigating the risk of models both with and without widely available open weights. MLCommons believes models with widely available open weights allow the entire AI safety community – including auditors, regulators, civil society, users of AI systems, and developers of AI systems – to engage with the benchmark development process. Together with open data and model code, open weights enable the community to clearly and completely understand what a given safety benchmark is measuring, eliminating any confounding opacity around how a model was trained or optimized.
The AI Alliance recommends regulation shaped by independent, evidence-based research on reliable methods of assessing the marginal risks posed by open foundation models; effective risk management frameworks for the responsible development of open foundation models; and balancing regulation with the benefits that open foundation models offer for expanding access to the technology and catalyzing economic growth.
The Alliance for Trust in AI recommends that regulation should protect the many benefits of increasing access to AI models and tools. The Alliance for Trust in AI believes that openness should not be artificially restricted based on a misplaced belief that this will decrease risk.
Access Now recommends that NTIA think broadly about how developments in AI are reshaping or consolidating corporate power, especially with regard to ‘Big Tech.’ Access Now believes in the development and use of AI systems in a sustainable, resource-friendly way that considers the impact of models on marginalized communities and how those communities intersect with the Global South.
Partnership on AI (PAI) recommends that NTIA’s work be informed by the following principles: all foundation models need risk mitigations; appropriate risk mitigations will vary depending on model characteristics; risk mitigation measures, for either open or closed models, should be proportionate to risk; and voluntary frameworks are part of the solution.
R Street recommends pragmatic steps towards AI safety, relying on multistakeholder processes to address problems in a more flexible, agile, and iterative fashion. The government should not impose arbitrary limitations on the power of Open Source AI systems, which could result in a net loss of competitive advantage.
The Computer and Communications Industry Association (CCIA) recommends assessment based on the risks, highlighting that open models provide the potential for better security, less bias, and lower costs to AI developers and users alike. The CCIA acknowledged that the vast majority of Americans already use systems based on Open Source software (knowingly or unknowingly) on a daily basis.
The Information Technology Industry Council (ITI) recommends adopting a risk-based approach with respect to open foundation models, since not all models pose an equivalent degree of risk, and notes that risk management is a shared responsibility across the AI value chain.
The Center for Data Innovation recommends that U.S. policymakers defend open AI models at the international level as part of their continued embrace of the global free flow of data. It also encourages them to learn lessons from past debates about dual-use technologies, such as encryption, and refrain from imposing restrictions on foundation models because such policies would not only be ultimately ineffective at addressing risk, but they would slow innovation, reduce competition, and decrease U.S. competitiveness.
The International Center for Law & Economics recommends that AI regulation must be grounded in empirical evidence and data-driven decision making. Demanding a solid evidentiary basis as a threshold for intervention would help policymakers to avoid the pitfalls of reacting to sensationalized or unfounded AI fears.
New America’s Open Technology Institute (OTI) recommends a coordinated interagency approach designed to ensure that the vast potential benefits of a flourishing open model ecosystem serve American interests, in order to counter or at least offset the trend toward dominant closed AI systems and continued concentrations of power in the hands of a few companies.
Electronic Privacy Information Center (EPIC) recommends that NTIA grapple with the nuanced advantages, disadvantages, and regulatory hurdles that emerge within AI models along the entire gradient of openness, highlighting that AI models with weights widely available may foster more independent evaluation of AI systems and greater competition compared to closed systems.
The Software & Information Industry Association (SIIA) recommends a risk-based approach to foundation models that considers the degree and type of openness. SIIA believes openness has already proved to be a catalyst for research and innovation by essentially democratizing access to models that are cost-prohibitive for many actors in the AI ecosystem to develop on their own.
The Future Society recommends that the government should establish risk categories (i.e., designations of “high-risk” or “unacceptable-risk”), thresholds, and risk-mitigation measures that correspond to evaluation outcomes. The Future Society is concerned that overly restrictive policies could lead to market concentration, hindering competition and innovation in both industry and academia. A lack of competition in the AI market can have far-reaching knock-on consequences, including potentially stifling efforts to improve transparency, safety, and accountability in the industry. This, in turn, can impair the ability to monitor and mitigate the risks associated with dual-use foundation models and to develop evidence-based policymaking.
The Software Alliance (BSA) recommends that NTIA avoid restricting the availability of open foundation models; ground policies that address risks of open foundation models in empirical evidence; and encourage the implementation of safeguards to enhance the safety of open foundation models. BSA recognizes the substantial benefits that open foundation models provide to both consumers and businesses.
The US Chamber of Commerce recommends that NTIA make decisions based on sound science and not unsubstantiated concerns that open models pose an increased risk to society. The US Chamber of Commerce believes that Open Source technology allows developers to build, create, and innovate in various areas that will drive future economic growth.

Comments from companies

Meta recommends that NTIA establish common standards for risk assessments, benchmarks and evaluations informed by science, noting that the U.S. national interest is served by the broad availability of U.S.-developed open foundation models. Meta highlighted that Open Source democratizes access to the benefits of AI, and that these benefits are potentially profound for the U.S., and for societies around the world.
Google recommends a rigorous and holistic assessment of the technology to evaluate benefits and risks. Google believes that Open models allow users across the world, including in emerging markets, to experiment and develop new applications, lowering barriers to entry and making it easier for organizations of all sizes to compete and innovate.
IBM recommends preserving and prioritizing the critical benefits of open innovation ecosystems for AI for increasing AI safety, advancing national competitiveness, and promoting democratization and transparency of this technology.
Intel recommends accountability for responsible design and implementation to help mitigate potential individual and societal harm. This includes establishing robust security protocols and standards to identify, address, and report potential vulnerabilities. Intel believes openness not only allows for faster advancement of technology and innovation, but also faster, transparent discovery of potential harms and community remediation and address. Intel also believes that Open AI development is essential to facilitate innovation and equitable access to AI, as open innovation, open platforms, and horizontal competition help offer choice and build trust.
Stability AI recommends that regulation must support a diverse AI ecosystem – from the large firms building closed products to the everyday developers using, refining, and sharing open technology. Stability AI recognizes that Open models promote transparency, security, privacy, accessibility, competition, and grassroots innovation in AI.
Hugging Face recommends establishing standards for best practices building on existing work and prioritizing requirements of safety by design across both the AI development chain and its deployment environments. Hugging Face believes that open-weight models contribute to competition, innovation, and broad understanding of AI systems to support effective and reliable development.
GitHub recommends that regulatory risk assessment weigh empirical evidence of possible harm against the benefits of widely available model weights. GitHub believes Open Source and widely available AI models support research on AI development and safety, as well as the use of AI tools in research across disciplines. To date, researchers have credited these models with supporting work to advance the interpretability, safety, and security of AI models; to advance the efficiency of AI models enabling them to use fewer resources and run on more accessible hardware; and to advance participatory, community-based ways of building and governing AI.
Microsoft recommends cultivating a healthy and responsible open source AI ecosystem and ensuring that policies foster innovation and research. This will be achieved through direct engagement with open source communities to understand the impact of policy interventions on them and, as needed, calibrations to address risks of concern while also minimizing negative impacts on innovation and research.
Y Combinator recommends that NTIA and all stakeholders realize the immense promise of open-weight AI models while ensuring this technology develops in alignment with our values. Y Combinator believes the degree of openness of AI models is a crucial factor shaping the trajectory of this transformative technology. Highly open models, with weights accessible to a broad range of developers, offer unparalleled opportunities to democratize AI capabilities and promote innovation across domains. Y Combinator has seen firsthand the incredible progress driven by open models, with a growing number of startups harnessing these powerful tools to pioneer groundbreaking applications.
AH Capital Management, L.L.C. (a16z) recommends that NTIA be wary of generalized claims about the risks of Open Models and of calls to treat them differently from Closed Models, especially those made by AI companies seeking to insulate themselves from market competition. a16z believes Open Models promote innovation, reduce barriers to entry, protect against bias, and allow such models to leverage and benefit from the collective expertise of the broader artificial intelligence (“AI”) community.
Uber recommends promoting widely available model weights to spur innovation in the field of AI. Uber believes that, by democratizing access to foundational AI models, innovators from diverse backgrounds can build upon existing frameworks, accelerating the pace of technological advancement and increasing competition in the space. Uber also believes widely available model weights, source code, and data are necessary to foster accountability, facilitate collaboration in risk mitigation, and promote ethical and responsible AI development.
Databricks recommends that regulation of highly capable AI models focus on consumer-facing and high-risk deployments, with obligations placed on the deployer. Databricks believes that the benefits of open models substantially outweigh the marginal risks, so open weights should be allowed, even at the frontier level.