Opinions – Open Source Initiative

Explaining the concept of Data information

Stefano Maffulli — Fri, 14 Jun 2024 13:53:28 +0000

There seems to be some confusion caused by the concept of Data information included in the draft v0.0.8 of the Open Source AI Definition. Some readers may have seen the original dataset included in the list of optional components and quickly jumped to the wrong conclusions. This post clarifies how the draft arrived at its current state, the design principles behind the Data information concept and the constraints (legal and technical) it operates under.

The objective of the Open Source AI Definition

The objective of the Open Source AI Definition is to replicate in the context of artificial intelligence (AI) the principles of autonomy, transparency, frictionless reuse, and collaborative improvement for end users and developers of AI systems. These are described in the preamble.

Following the preamble is the definition of Open Source AI, an adaptation of the definition of Free Software (also known as “the four freedoms”) to AI nomenclature. The preamble and the four freedoms have been co-designed over several meetings and public discussions, online and in-person, and have not recently received significant comments.

The Free Software definition specifies that a precondition to the freedom to study and modify a program is to have access to the source code. Source code is defined as “the preferred form of the program for making changes in.” Draft v0.0.8 contains a description of what’s necessary to enjoy the freedoms to study and modify an AI system. This new section titled Preferred form to make modifications to machine-learning systems has generated a heated debate.

What is the preferred form to make modifications

The concept of “preferred form to make modifications” focuses on machine learning systems because these systems require data and training to produce a working system. Other AI systems are more easily classifiable as software and don’t require a special definition.

The system analysis phase of the co-design process revealed that studying and modifying machine learning systems requires data, code for training and inference, and model parameters. For the parameters, there’s no ambiguity: an Open Source AI must make them available under terms that respect the Open Source principles (no field-of-use restrictions, no discrimination against people, etc.). For the data and code requirements, the text in the “preferred form to make modifications” section is longer and harder to parse, generating some confusion.

The intent of the code and data requirements is to ensure that end users, deployers and developers of an Open Source AI system have all the tools and instructions to recreate that AI system from scratch, to satisfy the freedoms to study and modify the system. At a high level, it makes sense to suggest that, for a system to qualify as Open Source AI, its training datasets must be released under permissive licenses.

However, on close examination, it became clear that sharing the original datasets is full of traps. It actually puts Open Source at a disadvantage compared to opaque and proprietary AI systems.

The issue with data

Data is not software: The legal landscape for data is much wider than copyright. Aggregating large datasets and distributing them internationally is an endless nightmare that includes privacy laws, copyright, sui-generis rights, patents, secrets and more. Without diving deeper into legal issues, let’s focus on practical examples to clarify why the distribution of the training dataset is not spelled out as a requirement in the concept of Data information.

The Pile, the open dataset used to train the very open Pythia models, was taken down after an alleged copyright infringement, currently being litigated in the United States. However, the Pile appears to be legal to share in Japan. It’s also unclear whether it can be legally shared in the European Union.
DOLMA, the open dataset used to train the very open OLMo models, was initially released with a restrictive license. It later switched to a permissive one. On further inspection, DOLMA appears to suffer from the same legal uncertainties as the Pile; however, the Allen Institute has not been sued yet.
Training techniques that preserve privacy like federated learning don’t create datasets.

All these cases show that requiring the original datasets creates vagueness and uncertainty in applying the Open Source AI Definition:

If a dataset is only legal in Japan, is that AI Open Source only in Japan?
If a dataset is initially legally available but later retracted, does the AI go from being Open Source to not?
- If so, what happens to the applications that use such AI?
If no dataset is created, then will any AI trained with such techniques ever be Open Source?

Additionally, there are reasons to believe that proprietary systems from OpenAI, Anthropic and others have been trained on the same questionable data found in The Pile and DOLMA. Proving that is much harder and more expensive, though. This is clearly a disincentive to be open and transparent about data sources, adding a burden to the organizations that try to do the right thing.

As a solution to these questions, draft v0.0.8 contains the concept of Data information, coupled with code requirements, to obtain the expected result: end users, developers and deployers of AI systems should be able to reproduce an Open Source AI.

Understanding the concept of Data Information

Data information, in the draft Open Source AI Definition, is defined as:

Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.

Read that from the end: The intention of Data information is to allow developers to recreate a substantially equivalent system using the same or similar data. That means that an Open Source AI must disclose all the ingredients, where they’ve been bought and all the instructions to prepare the dish.

This is a solution that came out of the co-design process, where reviewers didn’t rank the training datasets as high as they ranked the training code and data transparency requirements.

Data information and the code requirements also address all of the questions around the legality of distributing data and datasets, or their absence.

If a dataset is only legal in Japan or becomes illegal later, one should still be able to recreate a dataset suitable to train an equivalent system replacing the illegal or unavailable pieces with similar ones.

AI systems trained with federated learning (where a dataset isn’t created) can still be Open Source AI if all instructions and code are released so that a new training with different data can generate an equivalent system.

The Data information concept also solves an example (raised on the forum) of an AI system trained on data licensed directly from Reddit. In this case, if the original developers released enough information to allow another AI developer to recreate a substantially equivalent system with Reddit data taken from an existing dataset, like CommonCrawl, it would be considered Open Source AI.

The proposed alternatives

While generally well received, draft v0.0.8 has been criticized by a few people on the forum for putting the training dataset in the “optional requirements”. Some suggestions and pushback we’ve received:

Require the use of synthetic data when the training dataset cannot be legally shared: This technique may work in some corner cases, if the technology evolves to be reliable enough. It’s expensive and untested at scale.
Classify as Open Source AI only those systems whose components are all “open source”: This approach is not rooted in the longstanding practice of the GNU project to accept system library exceptions and other compromises in exchange for more Open Source tools.
Datasets built by crawling the internet are the equivalent of theft, they shouldn’t be allowed at all, let alone allowed in Open Source AI: This pushback ignores the reality that large data aggregators have already legally acquired the rights to accumulate that same data (through scraping and terms of use) and are trading it, exclusively capturing the economic value of what should be in the commons. Read Towards a Books Data Commons for AI Training for more details. There is no general agreement that text and data mining is equivalent to theft.

These demands and suggestions are hard to accept. We need an Open Source AI Definition that can effectively guide users and developers to make the right choice. We need one that doesn’t put developers of Open Source AI at a disadvantage compared to proprietary ones. We need a Definition that contains positive examples from the start so we can practically demonstrate positive qualities to policymakers.

The discussion about data, how to generate incentives to create datasets that can be distributed internationally, safely, preserving privacy, is extremely complex. It can be addressed separately from the Open Source AI Definition. In collaboration with Open Future Foundation and others, OSI is designing a series of conferences to tackle the data governance issue. We’ll make an announcement soon.

Have your say now

The concept of Data information and code requirements is hard to grasp at first. But the preliminary results of the validation phase confirm that the draft v0.0.8 works as expected: Pythia and OLMo both would be Open Source AI, while Falcon, Grok, Llama, Mistral would not (even if they used OSD-compatible licenses) because they don’t share Data information. BLOOM and StarCoder would fail because of field-of-use restrictions in their models.
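The two failure modes named in these validation results (missing Data information and field-of-use restrictions) can be read as a simple checklist. The sketch below is a simplified illustration in Python, not the validation process itself. The class, its field names and the example releases are hypothetical and are not part of the draft.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemRelease:
    """Hypothetical summary of what the developers of an AI system released."""
    name: str
    shares_data_information: bool        # enough detail to recreate an equivalent system
    shares_code: bool                    # code for training and inference
    shares_parameters: bool              # model parameters, such as weights
    has_field_of_use_restrictions: bool  # terms limit what the system may be used for


def failed_requirements(release: SystemRelease) -> list[str]:
    """Return the requirements a release fails, in a fixed order (empty if none)."""
    failures = []
    if not release.shares_data_information:
        failures.append("Data information")
    if not release.shares_code:
        failures.append("code")
    if not release.shares_parameters:
        failures.append("parameters")
    if release.has_field_of_use_restrictions:
        failures.append("no field-of-use restrictions")
    return failures


def is_open_source_ai(release: SystemRelease) -> bool:
    """A release passes this simplified check only if nothing fails."""
    return not failed_requirements(release)


# Hypothetical releases, one per outcome described in the paragraph above.
fully_open = SystemRelease("example-open", True, True, True, False)
no_data_info = SystemRelease("example-no-data-info", False, True, True, False)
restricted = SystemRelease("example-restricted", True, True, True, True)

for release in (fully_open, no_data_info, restricted):
    print(release.name, is_open_source_ai(release), failed_requirements(release))
```

A release that shares everything but attaches field-of-use restrictions fails for a different reason than one that withholds Data information, which is why the sketch reports the failed requirements rather than a single yes or no.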

Data information can be improved but it’s better than other solutions proposed so far. As we get closer to the release of the stable version of the Open Source AI Definition, we need to hear from you: If you support this concept please comment on the forum today. If you don’t support it, please try to propose an alternative that at least covers the practical examples of Pile, DOLMA and federated learning above. Help the community move the conversation forward.

Contributions of Open Source to AI: a panel discussion at CPDP-ai conference

Stefano Maffulli — Tue, 04 Jun 2024 09:00:00 +0000

I participated as a panelist at the CPDP-ai 2024 conference in Brussels last week where we discussed the significant contributions of Open Source to AI and highlighted the specific properties that differentiate Open Source AI from proprietary solutions. Representing the Open Source Initiative (OSI), the globally recognized non-profit that defines the term Open Source, I emphasized the longstanding principle of granting users full agency and control over technology, which has been proven to deliver extensive social benefits.

Below is a glimpse at the questions and answers posed to me and my fellow panelists:

Question: Stefano, please explain what the contribution to AI from Open Source is, and if there are specific properties of Open Source AI that make a difference for the users and for the people who are confronted with its results.

Response: The Open Source Definition for software has existed for over 25 years, but it doesn’t apply directly to AI. The Open Source Definition for software provides a stable north star for all participants in the digital ecosystem, from small and large companies to citizens and governments.

The basic principle of the Open Source Definition is to grant to the users of any technology full agency and control over the technology itself. This means that users of Open Source technologies have self-sovereignty of the technical solutions.

The Open Source Definition has demonstrated that massive social benefits accrue when you remove the barriers to learning, using, sharing and improving software systems. There is ample evidence that giving users agency, control and self-sovereignty of their technical choices produces a viable ecosystem based on permissionless innovation. Multiple studies by the EU Commission and Harvard researchers have assigned significant economic value to Open Source Software, all based on that single, clear, understood and approved Definition from 26 years ago.

For AI, and especially the most recent machine learning solutions, it’s less clear how society can maintain self-sovereignty of the technology and how to achieve permissionless innovation. Despite the fact that many people talk about Open Source AI, including the AI Act, there is no shared understanding of what that means, yet!

The Open Source Initiative is concluding a global, multi-stakeholder co-design process to find an unequivocal definition of Open Source AI, and we’re heading towards the conclusion of this process with a vastly increased knowledge of the AI machine learning space. The current draft of the Open Source AI Definition recognizes that in order to study, use, share and modify AI, one needs to refer to an AI system, not a single individual component. The global process has identified the components required for society to maintain control of the technology and these are:

Detailed information about the dataset used to train the system and the code so that a skilled person can train a system with similar capabilities
All the libraries and tools used to run training and inference
The model architecture and the parameters, like weights and biases

Having unrestricted access to all these elements is what makes an AI an Open Source AI.

We’re in the final stretch of the process, starting to gather support for the current draft of the definition.

The most controversial part of the discussion is the role of data in the training. To answer your question about the power of big foreign tech companies, putting aside the hardware requirements, the data is where the fight is. There seem to be two views of the world on data when it comes to AI: One thinks that text and data mining is basically strip mining humanity and all accumulation of data without consent of the rights holders must be made illegal. Another view of the world is that text and data mining for the purpose of training Open Source AI is probably the only antidote to the superpowers of large corporations. These camps haven’t found a common position yet. Japan seems to have made up its mind already, legalizing unrestricted text and data mining. We’ll see where the lawsuits in the US will go, if they ever get to a decision in court or, as I suspect, they will be settled out of court.

In any case, data, competence and to some extent hardware, are the levers to control the development of AI.

Open Source has been leveling the playing field of technologies. We know from past experience with Open Source software that giving people unrestricted access to the means of digital production enables tremendous economic value. This worked in Europe as well as in China. We think that Open Source AI can have the same effect of generating value while leaving control of the technology in the hands of society.

Question: Big tech companies are important for the development of AI. Apart from the purely technological impacts, there is also economic importance. The European Commission has been very concerned about the Digital Single Market recently, and has initiated legislation such as DSA and DMA to improve competition and market access. Will these instruments be sufficient in view of AI roll-out, thinking also of the recently adopted AI Act? Or will additional attention need to be paid?

Response: Open is the best antidote to the concentration of power. That said, I see these laws as the sticks, which are very necessary. I’d love us to think also about carrots. We don’t want to repeat the mistakes of the past with the early years of the internet. Open Source software was equally available in the US and Europe but despite that, the few European champions of Open Source haven’t grown big enough to have a global impact. And some of the biggest EU companies aren’t exactly friendly with Open Source either.

Chinese companies have taken a different approach. But in Europe we have talent, and we have an attractive quality of life so we can attract even more. Finding money is never an issue. We need to remove the disincentives to growing our companies bigger, widen access to the internal EU market and support their international expansion, too.

For example, we need to review European Regulation 1025 on standardization to accommodate Open Source. Regulation 1025 was written at a time when Open Source was considered a “business model” and information and communication technology standards were about voltages in a wire. Today, Open Source is between 80% and 90% of all software and “digital elements” comprise some part of every modern product. Even hardware solutions are dominated by “digital elements.” As such, the approach taken by Regulation 1025 is out of date and most likely needs a root-and-branch rethink to properly apply to the world today and the world we anticipate tomorrow.

We need to make sure that the standardization rules required by the Cyber Resilience Act are written together with Open Source champions so the rules don’t exclusively favor the cartel of European patent holders who seek rent instead of innovating. Europe has all the means to be at the center of AI innovation; it embodies the right values of diversity and collaboration.

Closing remarks: We think that Open Source is the best antidote to fight market concentration in AI. Data is where the concentration of power is happening now and it’s in the hands of massive corporations: not only Google, Meta, Amazon, Reddit but also Sony, Warner, Netflix, Getty Images, Adobe … All these companies have already gained access to massive amounts of data, legally. These companies basically own our data, legally: Our pictures, the graph of our circles of friends, all the books and movies…

If we don’t write policies that allow text and data mining in exchange for a real Open Source AI (one that society can fully control), we risk leaving the most powerful AI systems in the hands of the oligopoly that can afford to trade money for access to data.

Why datasets built on public domain might not be enough for AI

Stefano Maffulli — Tue, 07 May 2024 10:00:00 +0000

There is tension between copyright laws and large datasets suitable to train large language models. Common Corpus is a dataset that only uses text from copyright-expired sources to bypass the legal issues. It’s a useful achievement, paving the path to research without immediate risk of lawsuits. I also fear that this approach may lead to bad policies, reinforcing the power of copyright holders; not the small creators but large corporations.

A dataset built on public domain sources

In March 2024 Common Corpus was released as an open access dataset for training large language models (LLMs). Announcing the release, the lead developer Pierre-Carl Langlais said, “Common Corpus shows it is possible to train fully open LLMs on sources without copyright concerns.” The dataset contains 500 billion words in multiple European languages and different cultural heritages. It is a project coordinated by the French startup Pleias and supported by organizations committed to open science such as Occiglot, EleutherAI and Nomic AI, and it is partly funded by the French government. The stated intention of Common Corpus is to democratize access to large quality datasets. It has many other positive characteristics, highlighted also by Open Future’s summary of a talk given by Langlais.

The commons needs more data

The debates sparked by the Deep Dive: AI process on the role of training data highlighted that AI practitioners encounter many obstacles assembling datasets. At the same time, we discovered that tech giants have an incredible advantage over researchers and startups. They’ve been slurping data for decades, have the financial means to go to court and can enter into bilateral agreements to license data. These strategies are inaccessible to small competitors and academics. Accepting that public domain sources are the only path to creating large open datasets suitable for training Open Source AI systems risks cementing the dominant positions of existing large corporations.

The open landscape already faces issues with big tech and their ability to influence legislation. The big corporations have lobbied to extend the duration of copyright, introduced the DMCA, are opposing the right to repair, and have the resources to continue lobbying and sue any new entrant who they deem to get too close. There are plenty of examples showing an unequal advantage in protecting what they think is theirs. The non-profit Fairly Trained certifies companies “willing to prove that they’ve trained their AI models on data that they own, have licensed, or that is in the public domain,” respecting copyright law: who’s going to benefit from this approach?

Unsuitable for public policies

Initiatives like Common Corpus and The Stack (used to train StarCoder2) are important achievements as they allow researchers to develop new AI systems while mitigating the risk of being sued. They also push the technical boundaries of what can be achieved with smaller datasets that don’t require a nuclear power plant to train new models. But I think they mask the underlying issue: AI needs data and limiting open datasets to only public domain sources will never give them a chance to match the size of the proprietary ones. The lobby for copyright maximalists is always looking for ways to expand scope and extend terms for copyright laws, and when they succeed it is a one-way ratchet. It would be a tragedy for society if legislators listened to their sophistry and made new laws doing this based on the apparent consensus that creators need protection from AI.

The role of data for training machine learning systems is a divisive topic and a complex one. Having datasets like Common Corpus is a very useful way for the science of AI to progress with better sources. For policies, we’d be better off pushing for something like the proposal advanced by Open Future and Creative Commons in their paper Towards a Books Data Commons for AI Training.

CRA standards request draft published

Simon Phipps — Thu, 02 May 2024 12:19:03 +0000

The European Commission recently published a public draft of the standards request associated with the Cyber Resilience Act (CRA). Anyone who wants to comment on it has until May 16, after which comments will be considered and a final request to the European Standards Organizations (ESOs) will be issued. This process is all governed by Regulation (EU) No 1025/2012, which will be discussed in a future post.

The publication of this draft is important for every entity that will have duties under the CRA, namely “manufacturers” and “software stewards.” Conformance with the harmonized standards that emerge from this process will allow manufacturers to CE-mark their software on the presumption it complies with the requirements of the CRA, without taking further steps.

For those who depend on incorporating or creating Open Source software, there is an encouraging new development found here. For the first time in a European standards request, there is an express requirement to respect the needs of Open Source developers and users. Recital 10 tells each standards organization the following:

“where relevant, particular account should be given to the needs of the free and open source software community”

That is made concrete in Article 2 which specifies:

“The work programme shall also include the actions to be undertaken to ensure effective participation of relevant stakeholders, such as small and medium enterprises and civil society organizations, including specifically the open source community where relevant”

Article 3 requires proof that effective participation has been facilitated. The community is going to have to step up to help the ESOs satisfy these requirements—or corporations claiming to speak for the community will do it instead.

OSI applauds the Commission’s steps to include the Open Source community and will be pleased to work with the European standards organizations towards that initial goal of effective representation and consultation. Additionally, the OSI will:

Work with our Affiliates to identify additional suitable participants with relevant skills and experience, and make connections between them and the ESOs.
Assist the Commission in validating responses to Article 3.

Our goal is to ensure that the development and use of Open Source software is at best facilitated and at worst not obstructed by any aspect of the standards development process, the resulting harmonized standards, and the access and IPR terms of those standards.

A comparative view of AI definitions as we move toward standardization

Mia Lykou Lund — Fri, 09 Feb 2024 10:54:00 +0000

Discussions of Artificial Intelligence (AI) regulation will be heating up in 2024 with a provisional agreement for the EU AI Act having been reached in December 2023. The evolution of the EU AI Act is progressing toward a technology-neutral definition for AI to be applied to future AI systems. In the coming months, multiple states will, for the very first time, agree on precise legal definitions that reflect moral considerations about the role that AI will and will not be allowed to play in Europe. Yet how to formally define AI remains an ongoing debate.

Precise definitions within a rapidly expanding field are perhaps not the first things that come to mind when asked about pressing issues concerning AI. However, as its influence grows, arriving at one seems essential when considering how to regulate it. Agreeing on what AI is–and what it is not–on a transnational level, is proving to be increasingly important. Online spaces rarely respect sovereignty, and the role of AI in public life is expected to increase rapidly.

Different countries and organizations have different definitions, though the AI Act is expected to provide some standardization, not only within the EU but also outside of it due to its influence. Beyond providing a framework for businesses to operate within in the future, it also shows what legislators anticipate about what AI will do, how and where it will act, and what it will develop towards. Let’s consider how different organizations and states are currently defining AI systems.

OECD

So far, the AI Act’s definition of AI systems is expected to follow the OECD’s current definition. This currently seems to be the most influential definition and it reads as follows:

An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment.

Notably, the OECD’s definition has undergone changes from its first draft to the current one above. The removal of “human-based inputs” and the addition of “decisions” when referring to outputs reflects a potential for vastly limiting human-centred decisions and actions. While acknowledging that different systems vary in their autonomy, this change opens up the potential for full autonomy. This can be controversial, to say the least, and can be expected to feed into the growing concerns of AI alignment. As we await the EU AI Act, if they indeed adopt the same or even a similar definition, it will be interesting to see their definition of personhood, considering the removal of “human-based” under inputs.

ISO

The International Organization for Standardization has defined AI systems as follows:

AI:

set of methods or automated entities that together build, optimize and apply a model (3.1.26) so that the system can, for a given set of predefined tasks (3.1.37), compute predictions (3.2.12), recommendations, or decisions

Note 1 to entry: AI systems are designed to operate with varying levels of automation (3.1.7).

Note 2 to entry: Predictions (3.2.12) can refer to various kinds of data analysis or production (including translating text, creating synthetic images or diagnosing a previous power failure). It does not imply anteriority.

study of theories, mechanisms, developments and applications related to artificial intelligence (3.1.2)

AI System:

engineered system featuring AI (3.1.2)

Note 1 to entry: AI systems can be designed to generate outputs such as predictions (3.2.12), recommendations and classifications for a given set of human-defined objectives.

Note 2 to entry: AI systems can be designed to operate with varying levels of automation.

Here, there is a consideration of what kind of system is considered, notably an engineered one. This is interesting as previous definitions have been somewhat ambiguous about what technologies, in fact, will fall under such legislation. There is also a focus on the cooperation of different entities, not specified as human or otherwise. Notably, they do not mention the origin or the kind of input being processed, though through “varying levels of automation” it can be inferred that it covers the balance between human and non-human inputs, thus offering varying levels of autonomy.

South Korea

South Korea also adopted its own definition of AI in its 2023 AI Act, and it reads as follows:

Article 2 (Definitions) As used in this Act, the following terms have the following meanings.

1. “Artificial intelligence” refers to the electronic implementation of human intellectual abilities such as learning, reasoning, perception, judgment, and language comprehension.

2. “Artificial intelligence technology” means hardware technology required to implement artificial intelligence, software technology that systematically supports it, or technology for utilizing it.

While not mentioning AI systems, they attribute human attributes, like perception, to an electronic entity. While not mentioning “decisions,” attributing human characteristics perhaps makes that point redundant, as it can be interpreted as an actor, acting on a similar level as humans. Further, they are expansive on what technology is considered AI, as even a cable providing power can, under their current definition, be classified as a piece of AI technology.

US Executive Order

In late 2023, the Biden administration issued an executive order that defined an AI system:

“a machine-based system that can, for a given set of human-defined objectives, make predictions, recommendations, or decisions influencing real or virtual environments. Artificial intelligence systems use machine- and human-based inputs to perceive real and virtual environments; abstract such perceptions into models through analysis in an automated manner; and use model inference to formulate options for information or action.”

Here, the Biden administration merges human and machine-based inputs, highlighting the cooperation between the two actors. And while not legally binding, it shows intent. It shows more caution and perhaps skepticism regarding AI acting autonomously than any of the other major actors. Interestingly, the distinction between virtual and “real” (assuming this means physical, though the wording of it remains problematic) environments shows a similar skepticism toward the scope and spheres that the Biden administration is interested in AI occupying. This limits the controversial issue of potential autonomy present in previous definitions, though it also limits communication between systems independently of human inputs, which can prove problematic in practice.

Answers we are excited to see

As we enter into an important legislative year for AI, we are looking forward to getting answers to the following questions regarding the legal definitions of AI systems:

What definition of personhood will accompany the AI systems definition in the AI Act? And what does this mean for the intellectual protection of something entirely made by an AI, considering that it allows for large amounts of autonomy? That is, if it indeed follows the same definition as the OECD.
What kind of technology will be considered to be AI? Will it range from Excel spreadsheets to LLMs? Are we considering “machine-based systems,” an “engineered system” or something else?
Will legislation be strong enough, or perhaps broad enough, to encompass the massive changes AI is currently undergoing? And what predictions can we infer that the EU is making on behalf of the future advancements of AI?

A historic view of the practice to delay releasing Open Source software: OSI’s report

Stefano Maffulli — Wed, 10 Jan 2024 15:00:00 +0000

The Open Source Initiative published today a new report that looks at the history of the business practice of delaying the release of code under freedom-respecting licenses. Since the early days of the Open Source movement, companies have experimented with finding a balance between granting their users the basic freedoms guaranteed by Open Source licenses and capitalizing on their investments in software development. One common approach, albeit with many different flavors, is what this report calls “Delayed Open Source Publication” (DOSP): “the practice of distributing or publicly deploying software under a proprietary license at first, then subsequently and in a planned fashion publishing that software’s source code under an Open Source license.”

The new report titled “Delayed Open Source Publication: A Survey of Historical and Current Practices” was authored by the team of Open Tech Strategies (Seth Schoen, James Vasile and Karl Fogel) based on crowdsourced interviews. Their research was made possible through a donation by Sentry and the financial contributions of OSI individual members.

Like the authors, I found that the historical survey revealed numerous surprises, and what I found even more intriguing are the new questions raised (see Section 7) that beg for more dedicated research.

I encourage you to give it a read and share it with others. We encourage feedback from the community: I hold office hours for OSI members and you can discuss this on Mastodon or LinkedIn.

Download the report.

Open Source AI: Establishing a common ground

Stefano Maffulli — Tue, 28 Nov 2023 13:00:00 +0000

The current draft v. 0.0.3 of the Open Source AI Definition borrows wording from the GNU Manifesto’s golden rule, which states:

If I like a program, I must be able to share it with others who like it.
The GNU Manifesto

The GNU Manifesto refers to “program” (not “AI system”), without the need to define it. When it was published in 1985, the definition of a program was pretty clear. Today’s scene around artificial intelligence is not as clear and there are multiple definitions for AI systems floating around.

The process of finding a shared definition of Open Source AI is only in its infancy. I’m fully aware that for many of us here this is trivial and this phase is almost boring.

But the four workshops revealed that a significant number of people in the rooms neither knew the four freedoms nor had any idea that OSI has a formal Open Source Definition. And this also happened at two Open Source-focused events!

Which definition of AI system to adopt

I don’t think the Open Source community should write its own definition of an AI system as there are too many dangers with doing that. Most importantly, adopting a vocabulary foreign to the AI world increases the risks of not being understood or accepted. It’s a lot more effective and will be more palatable to use a widely adopted definition.

The OECD definition of AI system

The Organisation for Economic Co-operation and Development (OECD) published one in 2019 and updated it in November 2023. OECD’s definition has been adopted by the United Nations and NIST, and the AI Act may use it too.

An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment
Recommendation of the Council on Artificial Intelligence Adopted on: 22/05/2019; Amended on: 08/11/2023

I discovered a 2022 OECD document with a definition slightly amended from the 2019 one. The 2022 OECD Framework for the Classification of AI systems removes the words “or decisions” from the previous definition, saying in note 5:

Experts Working Group decided [“or decisions”] should be excluded here to clarify that an AI system does not make an actual decision, which is the remit of human creators and outside the scope of the AI system
2022 OECD Framework for the Classification of AI systems

The updated definition used by the Experts WG is:

An AI system is a machine-based system that is capable of influencing the environment by producing recommendations, predictions or other outcomes for a given set of objectives. It uses machine and/or human-based inputs/data to:

perceive environments;

abstract these perceptions into models; and

use the models to formulate options for outcomes.

AI systems are designed to operate with varying levels of autonomy (OECD, 2019f[2]).
2022 OECD Framework for the Classification of AI systems

Surprisingly, the version amended in November 2023 by the OECD still uses the words “or decisions”.

The definition of AI system for the US National Institute of Standards and Technology (NIST)

The NIST AI Risk Management Framework slightly modifies the OECD definition to include the word “outputs”:

The AI RMF refers to an AI system as an engineered or machine-based system that can, for a given set of objectives, generate outputs such as predictions, recommendations, or decisions influencing real or virtual environments. AI systems are designed to operate with varying levels of autonomy (Adapted from: OECD Recommendation on AI:2019; ISO/IEC 22989:2022)
AI Risk Management Framework

The definition of AI system in Europe

To complete the picture, I also looked at the EU. In a document from 2019, in the early days of the legislative process, the expert group on AI suggested the following definition (https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence):

Artificial intelligence (AI) systems are software (and possibly also hardware) systems designed by humans that, given a complex goal, act in the physical or digital dimension by perceiving their environment through data acquisition, interpreting the collected structured or unstructured data, reasoning on the knowledge, or processing the information, derived from this data and deciding the best action(s) to take to achieve the given goal. AI systems can either use symbolic rules or learn a numeric model, and they can also adapt their behaviour by analysing how the environment is affected by their previous actions.

As a scientific discipline, AI includes several approaches and techniques, such as machine learning (of which deep learning and reinforcement learning are specific examples), machine reasoning (which includes planning, scheduling, knowledge representation and reasoning, search, and optimization), and robotics (which includes control, perception, sensors and actuators, as well as the integration of all other techniques into cyber-physical systems).
High-Level expert group on AI: Ethics guidelines for trustworthy AI

It’s worth noting that this definition is not used in the AI Act. The text of the EU Council suggests this one be used:

‘artificial intelligence system’ (AI system) means a system that

receives machine and/or human-based data and inputs,

infers how to achieve a given set of human-defined objectives using learning, reasoning or modelling implemented with the techniques and approaches listed in Annex I, and

generates outputs in the form of content (generative AI systems), predictions, recommendations or decisions, which influence the environments it interacts with;

which seems to be quite similar to the OECD text.

Why we need to adopt a definition of AI system

There is agreement that the Open Source AI Definition needs to cover all AI implementations and not be specific to machine learning, deep learning, computer vision or other branches. That requires using a generic term. For software, the word “program” covers everything, from assembly to interpreted and compiled languages. “AI system” is the equivalent in the context of artificial intelligence.

“Program” is to software as “AI system” is to artificial intelligence.

In the document What is Free Software, the GNU project describes four fundamental freedoms that the “program” must carry to its users. Draft v. 0.0.3 similarly describes four freedoms that the AI system needs to deliver to its users.

For draft v. 0.0.3 there was debate on the wording of freedom 3, the freedom to modify. For software, that’s the freedom to modify the program to better serve the user’s needs, fix bugs, etc. Draft v. 0.0.3 says:

Modify the system to change its recommendations, predictions or decisions to adapt to your needs.
Draft v.0.0.3

The intention in specifying the object of the change is to establish the principle that anyone should have the right to modify the behavior of the AI system as a whole. The words “recommendations, predictions or decisions” come from the definition of AI system: what does the “system” do and what would I want to modify?

That’s why it’s important to say what it is we expect to have the right to modify. Tying that to an agreed-upon definition of what an AI system does is a way to make sure that all readers are on the same page.

We can change the wording of that bullet point but I think the verb “modify” should refer to the whole system, not individual components.

We’re trying to adopt a definition of an AI system that is widely understood and accepted, even though it’s not strictly correct scientifically. The Open Source AI Definition should align with other policy documents because many communities (legal, policy makers and even academia) will have to align too.

The newest definition of AI system from the OECD is the best candidate, without the words “or decisions.”

Next steps

I met with the Digital Public Goods Alliance in Addis Ababa on November 14. I expected to encounter a different assortment of competences than the ones I’ve met so far, and that was true. How far we are from consensus on basic principles is something I’m contemplating before releasing draft v.0.0.4 and moving on to the next phase of public conversations. For 2024 we’re planning a regular cadence of meetings (online and in-person) and a release roadmap leading to a v. 1.0 before the end of the year. More to come.

To trust AI, it must be open and transparent. Period.

Cristin Zegers — Thu, 14 Sep 2023 15:00:00 +0000

[SPONSOR OPINION]

By Heather Meeker, OSS Capital

Machine learning has been around for a long time. But in late 2022, recent advancements in deep learning and large language models started to change the game and come into the public eye. And people started thinking, “We love Open Source software, so, let’s have Open Source AI, too.”

But what is Open Source AI? And the answer is: we don’t know yet.

Machine learning models are not software. Software is written by humans, like me. Machine learning models are trained; they learn on their own automatically, based on the input data provided by humans. When programmers want to fix a computer program, they know what they need: the source code. But if you want to fix a model, you need a lot more: software to train it, data to train it, a plan for training it, and so forth. It is much more complex. And reproducing it exactly ranges from difficult to nearly impossible.

The Open Source Definition, which was made for software, is now in its third decade, and has been a stunning success. There are standard Open Source licenses that everyone uses. Access to source code is a living, working concept that people use every day. But when we try to apply Open Source concepts to AI, we need to first go back to principles.

For something to be “Open Source” it needs to have one overarching quality: transparency. What if an AI is screening you for a job, or for a medical treatment, or deciding a prison sentence? You want to know how it works. But deep learning models right now are a black box. If you look at the output of a model, it’s impossible to tell how or why the model came up with that output. All you can do is look at the inputs to see if its training was correct. And that’s not nearly as straightforward as looking at source code.

AI has the potential to greatly benefit our world. Now is the first time in history we’ve had the information and technology to tackle our biggest problems, like climate change, poverty and war. Some people are saying AI will destroy the world, but I think it contributes to the hope of saving the world.

But first, we need to trust it. And to trust it, it needs to be open and transparent.

As a consumer you should demand that the AI you use is open. As a developer, you should know what rights you have to study and improve AI. As a voter, you should have the right to demand that AI used by the government is open and transparent.

Without transparency, AI is doomed. AI is potentially so powerful and capable that people are already frightened of it. Without transparency, AI risks going the way of crypto, a technology with great potential that gets shut down by distrust. I hope that we will figure out how to guarantee transparency before that happens, because the problems AI can help us solve are urgent, and I believe we can solve them if we work together.

—-

OSI has gathered a group of leaders who will be presenting ideas around the topic of AI and Open Source in our upcoming Deep Dive: Defining Open Source AI Webinar Series. Registration is free and allows you to attend and ask questions at any or all of the sessions taking place between September 26 and October 12, 2023. REGISTER HERE today!

Modern EU policies need the voices of the fourth sector

Simon Phipps — Tue, 11 Jul 2023 13:00:00 +0000

Translated into French.

It’s good news that the European Commission is now considering the value and needs of Open Source in its policy deliberations. What’s not as good is that it does so through the wrong lens. The Commission needs to extend its consultations, Expert Groups and other work to include and consider the fourth sector.

Post-industrial society comprises three sectors in the worldview undergirding the European Union:

The commercial sector includes industrial, extractive, service, logistic and administrative companies. They are represented by industry and trade associations, by consulting and lobbying companies and more.
The labor sector includes workers of all kinds – industrial, skilled, research, educational, managerial, entrepreneurial and more. They are represented by trade unions, professional bodies, guilds and more.
The consumer sector comprises everyone spending their personal wealth at all scales. They are represented by consumer associations, civil society organizations, religious organizations and more.

Internet changed everything

But over the last 50 years the internet has driven change, giving rise to the World Wide Web and hence the Open Source movement, which in turn have catalyzed many open culture movements related to technologies. The wave of open has produced many phenomena (good, bad and pending judgment), including the gig economy, open knowledge communities like Wikipedia and the Internet Archive, technology giants like Facebook and Google, open software stacks and supply chains and much, much more. The roles people play in this open wave do not fit comfortably into the three post-industrial sectors.

For example, an individual would be expected predominantly to fall within the consumer sector, with a section of their life represented in the labor sector. But an Open Source developer can be innovating and creating soft goods (commercial sector) which are assembled (commercial sector) or used (consumer sector) by others. A video streamer may be creating new copyrighted works of great value (commercial sector) that are widely viewed (consumer sector). An author or musician can now create their own compelling brand without becoming an employee of a publisher.

The fourth sector lacks representation

This introduces a new fourth sector. It comprises individuals, often connected and facilitated by ad-hoc or charitable communities, playing the roles of the commercial, labor and consumer sectors in varying mixes all at the same time. The fourth sector is poorly represented by the entities and roles associated with all three of the other sectors. That’s inevitable; each fourth sector role will fuse together an aspect represented and an aspect confronted by any of the entities and roles dedicated to the three traditional sectors.

This means that a consumer association won’t advocate well for Open Source developers because an aspect of their existence is classified as commercial. A streamer won’t be well represented by a trade union because they embody both consumer and commercial aspects. And so on. As a result, existing consultation mechanisms used by legislators are guaranteed to fail. When they try to deal with Open Source by expressing the understanding they have gained of proprietary software, they will keep causing collateral damage — as we have seen in the Cyber Resilience Act (CRA) and many times previously. The need will increase as regulation tries to control, account for or promote the activities of the fourth sector without consulting it.

One significant reason this has been happening for such a long time already is the lack of a term to use to raise the issue. That’s why I am proposing to call this sector of European society the “fourth sector.” It extends well beyond Open Source, covering any new, citizen-centric economic activity which is hard to have represented with only the existing commercial, labor and consumer lenses. Let’s tell the Commission and other governments that it’s time to care about the fourth sector, which is the driving force for all the changes they want to embrace — or control.

This article first appeared on Webmink in draft.

Recap/Summary of the Digital Markets Act workshop in Brussels

Carl-Lucien Schwan — Thu, 09 Mar 2023 20:00:39 +0000

This Monday, I was in Brussels to attend a stakeholder workshop for the Digital Markets Act (DMA) organized by the European Commission. For those who aren’t familiar with the DMA, it’s a new law that the European Parliament voted on recently, and one of its goals is to force interoperability between messaging services by giving small players the ability to communicate with users of the so-called gatekeepers (e.g., WhatsApp).

I attended this meeting as a representative of KDE and NeoChat. NeoChat is a client for the Matrix protocol (a decentralized and end-to-end encrypted chat protocol). I started developing it with Tobias Fella a few years ago during the COVID lockdown.

I learned about this workshop thanks to NLnet, who funded previous work on NeoChat (end-to-end encryption). They put Tobias Fella and me in contact with Jean-Luc Dorel, the program officer for NGI0 for the European Commission. I would never have imagined that my contributions to Open Source projects would land me in a conference room in Brussels.

I work on NeoChat and other KDE applications as a volunteer in my free time, so I was a minor player at the workshop, but it was quite enlightening for me. I expected a room full of lawyers and lobbyists, which was partially true. A considerable number of attendees stayed silent during the entire workshop, representing big companies and mostly taking notes.

Fortunately, a few good folks with more technical knowledge were also in the room, for example people from Element/Matrix.org, XMPP, OpenMLS, Open Source Initiative (OSI), NLnet, European Digital Rights (EDRi) and consumer protection associations.

The workshop consisted of three panels. The first was more general, and the latter two more technical.

Panel 1: The Scope, Trade-offs and Potential Challenges of Article 7 of the DMA

This panel was particularly well represented by a consumer protection organization, European Digital Rights, and a university professor, who were all in favor of the DMA and the interoperability component. Simon Phipps started a discussion about whether gatekeepers like Meta should be forced to also interop with small self-hosted XMPP or Matrix instances, or if this would only be about relatively big players. I learned that, unfortunately, while it was once part of the draft of the DMA, social networks are not required to interop. If Elon had bought Twitter earlier, this would have probably been part of the final text too.

From this panel, I particularly appreciated the remarks of Jan Penfrat from EDRi, who mentioned that this is not a technical or standardization problem, and pointed out that some possible solutions like XMPP or Matrix already exist and have for a long time. There were also some questions left unanswered, like how to force gatekeepers to cooperate, as some people in the audience feared that they would make it needlessly difficult to interoperate.

After this panel, we had a short lunch, which gave me the chance to connect a bit with the Matrix, XMPP and NLnet folks in the room.

Panel 2: End-to-End Encryption

This panel had people from both sides of the debate. Paul Rösler, a cryptography researcher, tried to explain how end-to-end encryption works for the non-technical people in the audience, which I think was done quite well. Next, we had Eric Rescorla, the CTO of Mozilla, who also gave some additional insight into end-to-end encryption.

Cisco was also there, and they presented their relative success integrating other platforms with Webex (e.g. Teams and Slack). This ‘interoperability’ between big players is definitely different from the direction of interoperability I want to see. But this is also a good example showing that when two big corporations want to integrate together, there are suddenly no technical difficulties anymore. Cisco is also working on a new messaging standard as part of the MIMI working group of the IETF (which reminds me a bit of xkcd 927), and they have already deployed it in production.

Next, it was the turn of Matrix, and Matthew Hodgson, the CEO/CTO at Element, showed a live demo of client-side bridging. This is their proposed solution to bridging end-to-end-encrypted messages across protocols without having to decrypt the content inside a third-party server. It would be a temporary solution; ideally, services would converge to an open standard protocol like Matrix, XMPP or something new. He pointed out that Apple was already doing that with iMessage and SMS. I found this particularly clever.

Last, Meta sent a lawyer to represent them. The lawyer read from a piece of paper in a flat tone. He spent the entirety of his allocated time telling the Commission that interoperability represents a very clear risk for their users, who trust Meta to keep their data safe and end-to-end encrypted. He ignored Matthew’s previous demo and told us that bridging would break their encryption. He also envisioned a clear opt-in policy for interoperability so that users are aware that it will weaken their security, and expressed a clear need for consent popups when interacting with users of other networks. This is quite ironic coming from Meta, which, in the context of the GDPR and data protection, argued against an opt-in policy and against consent. As someone in the audience pointed out, while WhatsApp is end-to-end encrypted, this isn’t the case for Messenger and Instagram conversations, which are both also Meta products. The lawyer quickly dismissed that and explained that he only represented WhatsApp here and couldn’t answer this question for other Meta products. As you might have guessed, the audience wasn’t convinced by these arguments. Still, it is worth noting that Meta at least had the courage to speak in front of the audience, unlike other big gatekeepers like Microsoft, Apple and Google, who were also in the room but didn’t participate in the debate at all.

Panel 3: Abuse Prevention, Identity Management and Discovery

With Meta on the panel again, consent was again a hot subject of discussion. Some argued that each time someone from another server joins a room, each user should consent so that this new server can read their messages. This sounds very impractical to me, but I guess the goal is to make interoperability impractical. It also reminds me very much of GDPR popups, which privacy-invading services optimize with dark patterns so that users click on the “Allow” button. In this case, users would be prompted to click on the “Don’t connect with this user coming from this untrusted and scary third party server” button.

There was some discussion about whether it should be the server’s role or the user’s role to decide whether to allow connections from a third-party server. The former would mean that big providers would only allow access to their service for other big providers and block access to small self-hosted instances. The latter would give users a choice. Another topic was the identifier. Someone from the audience pointed out that the phone numbers used by WhatsApp, Signal and Telegram are currently not perfect identifiers, as they are not unique across services and might require some standardization.

In the end, the European Commission tried to summarize all the information shared throughout the day and sounded quite happy that so many technical folks were in the room and active in the conversation.

After the last panel, I went to a bar next to the conference building with a few people from XMPP, EDRi, NLnet and OpenMLS to get beers and Belgian fries.

This article first appeared on CarlSchwan.eu

Images of Brussels Workshop by Carl Schwan