<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>Transcript &#8211; Open Source Initiative</title>
	<atom:link href="https://opensource.org/blog/category/transcript/feed" rel="self" type="application/rss+xml" />
	<link>https://opensource.org</link>
	<description>The steward of the Open Source Definition, setting the foundation for the Open Source Software ecosystem.</description>
	<lastBuildDate>Thu, 09 Feb 2023 00:00:00 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://i0.wp.com/opensource.org/wp-content/uploads/2023/01/cropped-cropped-OSI_Horizontal_Logo_0-e1674081292667.png?fit=32%2C32&#038;ssl=1</url>
	<title>Transcript &#8211; Open Source Initiative</title>
	<link>https://opensource.org</link>
	<width>32</width>
	<height>32</height>
</image> 
<atom:link rel="hub" href="https://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="https://pubsubhubbub.superfeedr.com"/><atom:link rel="hub" href="https://websubhub.com/hub"/><site xmlns="com-wordpress:feed-additions:1">210318891</site>	<item>
		<title>Episode 6: transcript</title>
		<link>https://opensource.org/blog/episode-6-transcript</link>
					<comments>https://opensource.org/blog/episode-6-transcript#respond</comments>
		
		<dc:creator><![CDATA[Ariel Jolo]]></dc:creator>
		<pubDate>Thu, 09 Feb 2023 00:00:00 +0000</pubDate>
				<category><![CDATA[Transcript]]></category>
		<guid isPermaLink="false">https://opensource.org/2023/02/09/episode-6-transcript/</guid>

					<description><![CDATA[EPISODE 6: How to secure AI systems “BD: Now we&#8217;re in this stage of, &#8216;Oh my, it works.&#8217; Defending AI was moot 20 years ago. It didn&#8217;t do anything that...]]></description>
										<content:encoded><![CDATA[<p><b>EPISODE 6: How to secure AI systems<br />
</b></p>
<p><i>“</i><b><i>BD: </i></b><i>Now we&#8217;re in this stage of, &#8216;Oh my, it works.&#8217; Defending AI was moot 20 years ago. It didn&#8217;t do anything that was worth attacking. Now that we have AI systems that really are remarkably powerful, and that are jumping from domain to domain with remarkable success. Now, we need to worry about policy. Now, we need to worry about defending them. Now, we need to worry about all the problems that success brings.”</i></p>
<p>[INTRODUCTION]</p>
<p><b>[00:00:36] SM:</b> Welcome to Deep Dive AI, a podcast from the Open Source Initiative. We&#8217;ll be exploring how artificial intelligence impacts free and open-source software, from developers to businesses, to the rest of us.</p>
<p>[SPONSOR MESSAGE]</p>
<p><b>[00:00:50] SM:</b> Deep Dive AI is supported by our sponsor, GitHub. Open-source AI frameworks and models will drive transformational impact into the next era of software, evolving every industry, democratizing knowledge and lowering barriers to becoming a developer. As this revolution continues, GitHub is excited to engage in and support the OSI&#8217;s deep dive into AI and open source, and welcomes everyone to contribute to the conversation.</p>
<p><b>[00:01:18] ANNOUNCER:</b> No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</p>
<p>[EPISODE]</p>
<p><b>[00:01:22] SM: </b>Welcome to this special episode of Deep Dive AI. Today, I&#8217;m joined by co-host, Director of Policy at the Open Source Initiative, Deb Bryant. Welcome.</p>
<p><b>[00:01:31] DB: </b>Thanks.</p>
<p><b>[00:01:32] SM:</b> This is a new experience for me having a co-host. In this episode, we&#8217;re talking to Dr. Bruce Draper. He&#8217;s a Program Manager of the Information Innovation Office at the Defense Advanced Research Projects Agency, which is hard to pronounce, but it&#8217;s also known as DARPA. Much easier to pronounce for me. He&#8217;s a professor at Colorado State University in the Department of Computer Science. He has an impressive curriculum in areas that range from knowledge-based vision, to reinforcement learning, evaluation of face recognition, unmanned ground vehicles and more. Specifically, at DARPA he is responsible for the GARD project. This is an acronym that stands for Guaranteeing AI Robustness against Deception.</p>
<p>Dr. Bruce, if I understand correctly, GARD&#8217;s objective is to develop tools to understand if a machine learning system has been tampered with. How wrong am I?</p>
<p><b>[00:02:29] BD: </b>Well, it&#8217;s designed to develop tools that will defend an AI system against opponents who are trying to defeat it. So, developing tools to make AI systems more robust, particularly against mal-intentioned adversaries, and to try to make those tools available to the larger community, so that all the AI systems that are out there will hopefully be secure.</p>
<p><b>[00:02:52] SM: </b>Wonderful, so it fits perfectly into DARPA&#8217;s defense mission. Speaking of missions, what is the mission of DARPA, for people who aren&#8217;t familiar with it? I&#8217;m sure that our listeners remember ARPANET and all the research that came out of it. But what has DARPA done for us recently?</p>
<p><b>[00:03:11] BD: </b>Well, DARPA was of course founded many years ago when Sputnik was launched. It terrified the American government, and the American government decided it did not want to be surprised again by technological change. The role of DARPA, our official mission, is to anticipate and prepare for technological change and disruption. Along the way, we&#8217;ve done a lot of things. You mentioned the old ARPANET; things like global positioning satellites were another hit of ours. If you ask the question, what have we done for you lately? I think mRNA vaccines are a recent and very powerful example of work done at DARPA.</p>
<p><b>[00:03:46] SM: </b>I didn&#8217;t know of the mRNA tie to DARPA, which kind of leads into the open-source aspect of this. So DARPA developed this technology that went into the vaccines, but then how are these technologies being monetized and shared with the rest of the world?</p>
<p><b>[00:04:05] BD: </b>Well, in the case of the mRNA vaccine, Moderna started as a spinoff from DARPA. DARPA got that technology out into the community through the pharmaceutical industry, with the hope of being able to vaccinate large numbers of people, which turned out to be really important with the onset of the COVID crisis. Similarly, what we&#8217;re trying to do here with the GARD program is sort of get technology out there that will be robust and safe against an adversary, before adversaries attack. So that as AI becomes more and more part of our everyday processes, and we&#8217;re all sitting in self-driving cars and all the rest, we can be reassured that the AI systems will work and work correctly.</p>
<p><b>[00:04:44] DB: </b>Is open-source licensing the technology, then, a strategy to get your research, or the products of that research, out into the environment more quickly?</p>
<p><b>[00:04:55] BD: </b>Absolutely. What we&#8217;re trying to do is get defensive AI, AI that will be robust against attack, out into the community as quickly as possible, before large-scale attacks start happening. We&#8217;ve already seen small-scale attacks on commercial entities and other entities. We would like to be able to sort of get out ahead, and you can&#8217;t really – inoculate is a strong word. But we&#8217;d like to go out and get defensive tools in people&#8217;s hands as quickly as possible before large-scale attacks start happening.</p>
<p><b>[00:05:22] SM: </b>How important is the open-source aspect of this? In other words, how important is it to the mission of DARPA and the mission of the GARD project to have technology that is available with licenses that basically provide no friction to use and no limitations?</p>
<p><b>[00:05:41] BD: </b>So the mission of DARPA is, of course, to protect the United States and more generally, the free world. If you think about a scenario in the relatively near future, you can imagine a city where almost all, or a significant percentage, of the cars on the roads are self-driving cars. Now, those are not military systems; they&#8217;re privately owned. That&#8217;s your car, and my car, and all that kind of stuff. But if an adversary can tie that up, if an adversary can defeat that and start causing crashes all over the city, they could tie up any major city they wanted to. Not by attacking the military structure, but by attacking the civilian infrastructure. So it&#8217;s important to us that we defend not just the military systems, but all the systems that are out there.</p>
<p><b>[00:06:25] SM: </b>This is a very fascinating scenario, because a lot of the technology that we have right now, without being machine learning systems, is extremely vulnerable anyway. So what&#8217;s the difference between a machine learning system and a general IT computer science system?</p>
<p><b>[00:06:42] BD: </b>Well, as you know, cybercrime has become a major problem and all kinds of systems out there are vulnerable. I think part of what happened is, we went out many years ago and started networking all the computers together and creating this great interconnected digital world that we now live in, before we really thought about the problem of what was going to happen with malicious people with malicious intent. The result was, the people doing cyber defense are constantly playing catch up.</p>
<p>People are constantly attacking old systems that don&#8217;t have all the latest defenses and that sort of stuff. What we&#8217;re hoping to do in the case of AI systems is to get out in front, right? AI systems are in the media a lot, you&#8217;re seeing them more and more often, but they&#8217;re not yet as widespread as the sort of more conventional systems. But they&#8217;re going to become much more widespread. They&#8217;re going to become something that everybody uses. So we&#8217;d like to go and get some defenses out into the world before the adversaries can get in front of us.</p>
<p><b>[00:07:43] SM: </b>So you&#8217;re basically seeing a future that is already painted, right? In your mind, there is absolutely no question that the world has changed with the introduction of these new, more modern AI machine learning tools, reinforcement learning and all the pieces of it. They&#8217;re absolutely going to take over. There&#8217;s not going to be another sunset or upset like with previous generations of AI tools. Do you see it differently this time?</p>
<p><b>[00:08:08] BD: </b>I think what we&#8217;ve seen before, people talked a lot about AI summers and AI winters, right? Growth periods for AI and then periods where AI dropped back. But what&#8217;s interesting is that with every summer, it sort of got a little more prevalent, and then that sort of stopped for a while. Now, what I think we&#8217;ve finally done is we&#8217;ve hit a tipping point. I mean, just think about all the publicity we&#8217;ve had in the last few years, whether it&#8217;s ChatGPT, or DALL-E, or any of these other programs that have gotten so much publicity. These systems are becoming very, very powerful. They&#8217;re becoming very, very capable. That can be a wonderful thing if they&#8217;re used well. It can be less wonderful if they&#8217;re used poorly. Part of our intent is to make sure that they&#8217;re not being attacked, and they&#8217;re not being tricked into things that they shouldn&#8217;t be doing.</p>
<p><b>[00:08:52] DB: </b>So I was wondering if you&#8217;re going to talk a bit about – before you dive into the detail on the technology itself, I know in our earlier discussions, you&#8217;ve described an interest in broadening your community or engagement. Can you describe what an ideal ecosystem would look like? What kinds of stakeholders are you looking to join the process? That can kind of help frame this for some of our listeners.</p>
<p><b>[00:09:15] BD: </b>We are interested in sort of creating two communities here. The first thing I want to say is that all the research done under GARD is open source, public, available to everyone. What we&#8217;re really trying to do is create two tool sets that are available to the broader community. One is designed for developers, based around a tool set known as ART, the Adversarial Robustness Toolbox. The idea there is to give the people who are building AI systems a set of tools: give them access to the most current up-to-date defenses, give them access to all the standard attacks so that they can test their system, and give them all the tools to build a system that will be as robust as possible. That&#8217;s one part.</p>
<p>The other part is a tool called Armory. Armory is targeting not the developers, but the T&amp;E folks. The idea behind Armory is to serve the people who are testing and evaluating AI systems, whether those systems were developed in house or purchased from another source; most large projects will have a T&amp;E group. That&#8217;s a different set of tools. We want to build tools that will let the T&amp;E group test how well defended a system is or, conversely, how vulnerable it might be. So we&#8217;ve got these two sets of tools: one based on ART, targeting developers, and one based around Armory, targeting the T&amp;E folks.</p>
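<p>As a concrete, hypothetical sketch of the developer-side workflow Bruce describes, the snippet below uses the open-source Adversarial Robustness Toolbox (ART) to wrap a model, run one standard evasion attack against it, and compare predictions on clean versus attacked inputs. The tiny model and synthetic data are stand-ins, not GARD code; the real tools and tutorials are linked from gardproject.org.</p>
<pre><code>import numpy as np
import torch
import torch.nn as nn
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod

# A tiny stand-in classifier; in practice this would be your trained model.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))

# Wrap the model so ART's attacks (and defenses) can drive it.
classifier = PyTorchClassifier(
    model=model,
    loss=nn.CrossEntropyLoss(),
    optimizer=torch.optim.SGD(model.parameters(), lr=0.01),
    input_shape=(3, 32, 32),
    nb_classes=2,
    clip_values=(0.0, 1.0),
)

# Synthetic "test" images standing in for real data.
x_test = np.random.rand(8, 3, 32, 32).astype(np.float32)
y_test = np.random.randint(0, 2, size=8)

# Generate adversarially perturbed inputs with a standard attack,
# then compare the model's predictions on clean vs. attacked inputs.
attack = FastGradientMethod(estimator=classifier, eps=0.05)
x_adv = attack.generate(x=x_test)

clean = np.argmax(classifier.predict(x_test), axis=1)
attacked = np.argmax(classifier.predict(x_adv), axis=1)
print("predictions flipped by the attack:", int((clean != attacked).sum()))
</code></pre>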
<p><b>[00:10:40] SM: </b>So T&amp;E is the testing and evaluation group?</p>
<p><b>[00:10:44] BD:</b> Yes.</p>
<p><b>[00:10:45] SM:</b> So you&#8217;re saying that all these tools are released with an open-source license and they&#8217;re publicly available? How do you deal with international collaboration? What kind of collaboration do you see happening?</p>
<p><b>[00:10:56] BD: </b>We view this as an open-source project open to everyone. In fact, one of the key developers of ART is IBM, including their team based in Ireland. Our hope is to make tools widely available and we&#8217;re not trying – this is not a uniquely American project. This is supposed to be an open-source project. We want everyone to have tools that are as safe as possible. It&#8217;s a very interconnected world, and we&#8217;re all buying software from each other, and transferring software among each other. If the US software is safe, but the, I don&#8217;t know, Canadian software has holes, what good is that when we put them all together? We have to make these tools available for everyone.</p>
<p><b>[00:11:35] SM: </b>For sure. I&#8217;ve got to say that the world was a lot simpler a couple of years ago, at least in my poor man&#8217;s mind: between Europe and the United States, everybody was friends. We were living in a global world, and that seems to have been challenged more recently. Do you see any of that friction happening in your world?</p>
<p><b>[00:11:56] BD: </b>I think that friction impacts everywhere. I think our model is – certainly my model is that what we want is as open a world as possible, one that empowers all the individual developers and all the individual people. Because in an open society, we do well if all the individuals have power. I think there are some societies now that prefer a much more closed, centralized model. The reason we&#8217;re going with an open-source model is to support a model that does well in the free world.</p>
<p><b>[00:12:27] DB: </b>I agree with the thesis that an open-source model is a great way to develop collaboration. The problems that the GARD program is addressing are ubiquitous; it is a global problem, and I think it&#8217;s a great approach. I appreciate sitting in on today&#8217;s interview. My personal history includes working as a government practitioner, and over the years, I&#8217;ve worked with a lot of federal agencies. It isn&#8217;t until most recently, I think, that the general public has become aware of how prevalent open source is in the federal government; they&#8217;ve been doing it for decades. But with the challenges we&#8217;re experiencing, especially in cybersecurity, we&#8217;ve heard directly from public agencies in hearings that it&#8217;s not just a matter of security and defense. It&#8217;s also a matter of innovation. So it&#8217;s an interesting time to see projects like this, that are critical, using this particular model. That was actually how open source captured my imagination in about 2000. I thought the model itself was really more important than the software; the way the product was produced added as much value. I could see that in the government environment, so I have a lot of respect for the project that Bruce is running today.</p>
<p><b>[00:13:37] SM: </b>I want to go back a little bit to the technology first. You said that every summer of AI has been a little longer and the winters shorter. What do you think is the key element that triggered this summer to be longer?</p>
<p><b>[00:13:52] BD: </b>Well, obviously, the onset of deep neural networks, which have since been expanded to also include things like transformer networks and diffusion models. All of this was the coming together of what had been a lot of very basic mathematical research with the GPUs and other processors that finally allowed all that work to be done at scale. Also, frankly, the internet, which made sufficient amounts of data available. This has just led to an AI summer that has impacted everything from computer vision to language, to planning, to reasoning. So many areas are being advanced now by this current AI summer.</p>
<p><b>[00:14:34] SM: </b>On one hand is the basic research from mathematicians and mathematics. Hardware is another piece, and the third element would be data. Am I getting it right?</p>
<p><b>BD: </b>That would be absolutely correct.</p>
<p><b>SM: </b>With all these changes, math is fairly available, or at least you can study it; hardware and data start to get more complicated. Can you give us a little bit of an overview of who the partners are in the GARD project? What does it take to become a contributor to a project like yours, something that is so deeply complex?</p>
<p><b>[00:15:11] BD: </b>I think we have 15 partners in GARD. We have a lot of performers. The way you get involved with this is actually quite easy. We have a website called gardproject.org. You can go to that site and it will take you, if you&#8217;re a T&amp;E person, to the Armory tools. If you&#8217;re a developer person, it will take you to the ART tools. We also have sections there because you can have all the tools in the world, but you also need the intellectual background. So we have a section of tutorials, put together by the folks at Google, on defensive AI and how to make good use of these tools, to provide background. And we have a set of datasets, provided by another one of our partners at MITRE, to make it easier to test, run and evaluate these tools as well. So we&#8217;re trying to make data and tutorials available, as well as these two large tool sets.</p>
<p>Anyone can come to gardproject.org, go to our GitHub repository, and start accessing these tools. I should tell you, in terms of where most of the developer work is: most of the algorithms and algorithmic pieces that we&#8217;re providing through the ART toolbox have been developed by university partners, mostly academic partners. We have a few companies in there, but most of those people have been academic. Most of the work being done on the testing and evaluation side, the T&amp;E side, has been done by companies like IBM, MITRE and a company called Two Six, because that tends to be something that is often more of a corporate function.</p>
<p>But again, any researcher who wants to get involved is encouraged to get involved in either of those two communities whatever their role is. Let&#8217;s get involved, let&#8217;s all work together, let&#8217;s make the most secure systems that we can.</p>
<p><b>[00:16:57] DB: </b>As a recovering university researcher, I feel obligated to ask this question. Does DARPA make available research grants for universities that might be interested in engaging with the project, or would that come from other funding streams?</p>
<p><b>[00:17:11] BD: </b>No. At DARPA, we are a funding agency. We don&#8217;t do the research in house. We fund other people to do the work. In the case of a project like GARD, the majority of that work is being done at universities, under funding from DARPA. Although most of them are American universities, international universities can and do apply as well.</p>
<p><b>[00:17:32] SM: </b>There&#8217;s one thing that strikes me as very interesting in what you&#8217;re trying to do with the GARD project, which is to talk about security and safety very early on in the development of these machine learning systems. It took me, and not just me, I&#8217;m not a software developer, but it took me a long time to get the concept of security ingrained in my head. I think that the general public also still needs to get familiar with password managers and very tiny little security-related things. But for GARD, it&#8217;s really central, and it sounds to me like it&#8217;s really ahead of the curve, like a lesson learned from the internet times, when everything was open and accessible and security was an add-on. What made the AI community so concerned that it needed to invest in this so early on?</p>
<p><b>[00:18:26] BD: </b>Well, it&#8217;s funny. The first papers on adversarial AI appeared in the academic literature around the 2015 timeframe. Then very, very quickly, it developed to the point that we were having actual AI attacks with commercial implications as early as 2017. There was a company called Cylance that made malware detection software. It was one of the first victims that I know of where people went in and were able to do an adversarial attack, because it was using an AI system to determine whether a piece of software was malicious or not. Makers of malicious software went in, did an AI attack, figured out how to fool the system, and then were able to attack its customers.</p>
<p>It went very quickly from being something found only in academic papers to something that we saw being used in practice. We decided very quickly it was going to be necessary for us to find a way to defend against it.</p>
<p><b>[00:19:29] SM: </b>What other scenarios keep you up at night?</p>
<p><b>[00:19:32] BD: </b>One of the scenarios that keeps me up at night is the self-driving car scenario. The reason that one keeps me up at night is that right now, most people, when they use an AI system, it&#8217;s not safety critical. Yes, I let an AI system on Netflix recommend what movie to watch. But if it recommends a bad movie, nobody dies. Right? And indeed, one of the reasons why I think there has not been more work on defensive adversarial AI out of Silicon Valley is because most of the things that AI systems today are being used for are not necessarily safety critical. But that&#8217;s going to change, and the self-driving cars are, I think, simply the first example of something that we&#8217;re going to see out in the public that is safety critical. If someone ruins my movie recommendation system, it&#8217;s inconvenient, but it&#8217;s not a disaster. But if someone disables the brakes on my car, that&#8217;s a completely different story.</p>
<p><b>[00:20:31] SM: </b>This means that you are dreaming of self-driving cars in the streets mixed with non-self-driving cars and humans.</p>
<p><b>[00:20:40] BD: </b>I am imagining, because we already have them, railroad trains that are pretty much completely digitally controlled. I think the reality is that AI is so good, it is so cost effective. When it&#8217;s not being attacked, it&#8217;s so safe, that we&#8217;re going to rely on AI to do more and more things for us that are important and that are safety critical. As long as we can defend them, that will be a good thing. But there is a nightmare scenario, where we all become dependent on AI systems that are vulnerable to attacks. That&#8217;s what we&#8217;re trying to guard against.</p>
<p><b>[00:21:14] DB: </b>The self-driving car is a great example. It&#8217;s also an industry that&#8217;s starting to embrace open-source software development models and operating systems. What&#8217;s the opportunity to concurrently deploy these kinds of defensive AI systems really at the ground level, where we have companies like GM and Daimler who&#8217;ve publicly committed to using open source in their strategy in the car? Some of these are less mission critical, we&#8217;re talking about the entertainment systems, but I agree, I see it coming down the pipe. How do you co-develop those things concurrently? So that by the time you get to market with a true self-driving vehicle that&#8217;s commercially available, you&#8217;ve also made it safe in that way.</p>
<p><b>[00:21:56] BD: </b>This is more on the developer side; there&#8217;s the developer and the T&amp;E side. What we want to do on the developer side is have this running ART toolkit. What we hope will happen is, there&#8217;s always a game of cat and mouse. You come up with a better defense, someone tries to come up with a better tactic to get around it. What we&#8217;re hoping is that everyone sort of adopts the ART toolkit and they&#8217;re using these tools. Then when there&#8217;s a new attack that comes out, there will be a whole community of people out there trying to develop a new defense against that attack. And because everyone&#8217;s hopefully using the ART interfaces, as soon as that defense is created, it can rapidly be promulgated across all the people who might be using it. I don&#8217;t think it&#8217;s possible to come up with one defense that will be perfect forever. That sort of silver bullet has never happened in cyber defense. I don&#8217;t really anticipate it happening in AI defense. But what we do want to do is make sure that all these commercial systems have the best-known current defenses on them, and that they&#8217;re tied into this ecosystem, so that when newer, better defenses become available, they can immediately be downloaded and incorporated.</p>
<p>[BREAK]</p>
<p><b>[00:23:07] SM:</b> Deep Dive AI is supported by our sponsor, DataStax. DataStax is the real-time data company. With DataStax, any enterprise can mobilize real-time data and quickly build the smart, highly scalable applications required to become a data-driven business and unlock the full potential of AI. With Astra DB and Astra Streaming, DataStax uniquely delivers the power of Apache Cassandra, the world&#8217;s most scalable database, with the advanced Apache Pulsar streaming technology in an open data stack available on any cloud. DataStax leads the open-source cycle of innovation every day in an emerging AI-everywhere future. Learn more at datastax.com.</p>
<p><b>[00:23:47] ANNOUNCER:</b> No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</p>
<p>[INTERVIEW CONTINUED]</p>
<p><b>[00:23:50] SM: </b>What do you see as the role of policymakers in this field? Policymakers in Europe and the United States are perfectly aware of all the risks included in AI and machine learning systems, and they&#8217;re starting to regulate them, even though there may not be agreement in the academic groups about what needs to be done. What&#8217;s your feeling, from your point of view, about these policies, the draft policies that are circulating?</p>
<p><b>[00:24:20] BD: </b>I think there are a lot of policies circulating, and I don&#8217;t know that I&#8217;m well qualified to speak to the strengths or weaknesses of particular ones. But what I do want to bring into this conversation is this notion of the T&amp;E tools we&#8217;re developing. Because one of the things that we have to know in order to set any policy reasonably is what the risks are, how well defended something is. There&#8217;s always a tradeoff. You give up a little bit of accuracy to get a system that&#8217;s more robust, right? How much are we giving up? How much robustness are we getting? You can&#8217;t begin to have a sensible policy if you don&#8217;t know what the risks are and what your ability to defend against those risks is. Part of what we&#8217;re hoping we can do with the Armory tool is give the testing and evaluation folks some way to measure how much risk they are taking: if they take this AI system, and they use it in a particular way, how vulnerable is it to attack?</p>
<p>Like I say, some systems are not mission critical. It may be fine to go out with a system that is perhaps a little less robustly defended. Other systems are safety critical and need to have the absolute state of the art. I think of government policymakers, but I also wonder about insurance companies. If you&#8217;ve got a vulnerable self-driving car, that&#8217;s a real threat to the insurance companies. They might end up having to pay out if things go badly. So I think there are a variety of players, both on the government side and on the insurance side and the large-company side, who all have a vested interest in trying to make sure that these systems do have some extensive regulation and yet don&#8217;t cripple the industry. I don&#8217;t want to get to a situation where we put so many regulations on that we can&#8217;t use AI. I don&#8217;t think that&#8217;s to our advantage either.</p>
<p>So I don&#8217;t know where the policy sweet spot is. I don&#8217;t even know if all the right players are in the game yet. But I want to create a set of testing and evaluation tools that will give them something that they can measure, that they can start to use to make sensible policy.</p>
<p><b>[00:26:22] DB: </b>I have to say, that would be a great contribution. We don&#8217;t really know where the most risk is. We don&#8217;t have a great inventory of what we own. There&#8217;s a lot of work ahead. I want to ask, though, if you have a general sense of any gap that needs to be addressed today, in addition to obviously providing information to create more informed policy. Do you see an area of vulnerability that you think might be a good subject of public discussion for regulators to address?</p>
<p><b>[00:26:52] BD: </b>I&#8217;m not sure. Let me instead answer a slightly different question from the one you asked, but one that I think I can answer in a way that is more usable.</p>
<p>There&#8217;s an intellectual or an academic hole, which is that, we&#8217;re getting better at the practice of defending these systems and that&#8217;s what we&#8217;re trying to do with these toolboxes. We&#8217;re getting better at evaluating it. That&#8217;s what we&#8217;re trying to do with Armory. We still don&#8217;t have, however, what I would call a good deep theoretical understanding of what the threats are, and what the limits of the threats are. This is really getting back to sort of our deep understanding of these networks and the theory of adversarial AI. GARD actually has a – part of the GARD program is specifically designed to try to develop and push the theory of defensive adversarial AI.</p>
<p>I talked about that less, because unless you&#8217;re a PhD researcher at a university, at this point we&#8217;re producing papers trying to advance this fundamental mathematics, and there&#8217;s still a long way to go. But right now, the practice is ahead of the theory, and it would be really nice to have something like what the encryption folks have, where you can talk about the length of the encryption key and how much security it provides. We don&#8217;t have that equivalent yet.</p>
<p><b>[00:28:05] SM: </b>There is one thing that I noticed: AI practitioners are really well aware of the dangers and the damage that unleashed AI can do to the world. Is this something that GARD is also looking at, the dangerous uses of its own technology and of machine learning models in general?</p>
<p><b>[00:28:28] BD: </b>That&#8217;s a very broad topic. In the case of OpenAI, they were not worried about what an adversary might do. They were worried about how their system could be used when it was operating correctly. That&#8217;s a very real concern: where do we want to use AI? Where do we not? What are the limits? Those are very real questions for ethics and other forums, and best addressed by regulators and policy experts. We&#8217;re really looking at the question of adversaries and their ability to defeat an AI so that it doesn&#8217;t do what it&#8217;s supposed to do. We&#8217;re not concentrating in this particular program on systems that work well. We&#8217;re trying to figure out how these systems can be broken by an adversary and how we stop that from happening, so the systems behave as advertised. There are separate policy questions as to where you want to use systems that behave as advertised.</p>
<p><b>[00:29:21] DB: </b>So what have we not touched on, Bruce, that you think would be interesting or important for any institution, or organization, even individuals interested in participating or evaluating the GARD project?</p>
<p><b>[00:29:34] BD: </b>First of all, we&#8217;ve touched on this briefly, but I want to invite people to the gardproject.org website. Depending on whether you&#8217;re doing T&amp;E or development, look at ART, look at Armory, look at the tools, get involved in this community. One of the things that we haven&#8217;t discussed is that what DARPA does is we get in, we try to make an impact in an area, we try to create something that will sustain itself, and then we get out and we do other things. The GARD project as a DARPA-funded entity will end next year; it will end in 2024. Our hope is that by then we have an active international open-source community that will carry this work on and allow it to continue, even without direct DARPA support.</p>
<p>So that&#8217;s our goal. That&#8217;s why we think building this community is so very important, that it has to be sort of self-sustaining. This is not something that we&#8217;re inflicting on the world. This is something that we&#8217;re hopefully trying to give to the world, in the hope that people will look at it, see the value, and want to build systems that behave as advertised.</p>
<p><b>[00:30:42] DB: </b>That&#8217;s very consistent with the goal of most community development, which is for a community to be kind of self-sustaining. Do you see DARPA having any other ongoing role, or will it just be complete? In other words, will you have someone at DARPA who would continue to be a liaison or a super connector? Or do you see it being wholly moved to a new community, if things go as you hope?</p>
<p><b>[00:31:07] BD: </b>We have some performers, particularly IBM for ART and Two Six for Armory, that will continue to work on the projects after 2024, and hopefully be a sort of organizing force behind the community. They&#8217;re both really expert, very good technical people; I think they will be great. I think DARPA, as always, will look and see where there are problems. Our role at DARPA is to see where there&#8217;s a problem that isn&#8217;t being addressed. The problems that this community is picking up and working on, we will let the community work on. If there&#8217;s something critical that we think is not being worked on, well, then DARPA may come in and try to address that problem.</p>
<p><b>[00:31:47] SM: </b>Regarding the types of partners that you would like to promote the project to, are there any preferences? Do you need more academic contributors, or more corporations, government players or agencies in other parts of the world?</p>
<p><b>[00:32:04] BD: </b>Well, first of all, I don&#8217;t want to discourage anybody. I want to have everybody as involved as possible. But I think there are two particular groups that we look at. We look a great deal to the academic community for the developer-level work, for coming up with new algorithms, new defenses, things like that. We look more to the corporate community for the T&amp;E-level work. That&#8217;s something that tends not to happen in academia, so we&#8217;re hoping that we can get the industrial players to step up there. And we&#8217;re also hoping we may get governments to get involved and play a role at that level.</p>
<p><b>[00:32:38] SM: </b>Are there any other projects that do something similar to what GARD is doing? Competitors, so to speak.</p>
<p><b>[00:32:45] BD: </b>There are a number of smaller projects, particularly in the academic world, but also within government. As far as I know, GARD is the largest project, and that&#8217;s why we&#8217;re trying to push the open-source side of it. There&#8217;s the old joke that if you have one standard, it&#8217;s useful. If you&#8217;ve got 20, it&#8217;s not. What we&#8217;re really hoping to do is to sort of build this around these two tools, because at the moment, they have the largest uptake. For example, ART was recently recognized by the LF AI &amp; Data Foundation as one of its graduated projects, for its degree of activity and the number of people starting to use it in their work.</p>
<p>So that&#8217;s great. We want to sort of encourage that. Again, we also work through things like the LF AI &amp; Data Foundation, right? Work through these other organizations that exist within the open-source world to make sure that we have an ongoing and viable community.</p>
<p><b>[00:33:42] SM: </b>How did you get involved in this project? What caught your interest in going into research on adversarial AI?</p>
<p><b>[00:33:52] BD: </b>Well, I came out of the computer vision and machine learning community. I&#8217;d done a lot of work on the intersection between machine learning and computer vision when I was an academic for many, many years at Colorado State University. That&#8217;s the area where these first adversarial papers started to come out from. There are these famous examples where, if you added a little bit of noise to a picture of a panda, the system all of a sudden thought it was a gibbon, or you put a sticker on a stop sign, and your self-driving car thought it was a speed limit sign instead. I happened to be working in the general area where the first attacks came out. So then, when I got to DARPA, and the question was, how do we help the nation? How do we help the free world? What needs defending? I said, &#8220;Okay. This is an area.&#8221; And it turned out that another PM had just started this project up and had to leave DARPA unexpectedly. So I stepped in shortly before the project started, and have just enjoyed working with this community tremendously.</p>
<p><b>[00:34:54] SM: </b>What&#8217;s next for you?</p>
<p><b>[00:34:56] BD: </b>Well, what is next for me for the next half year is continuing to run these programs, and also continuing to make sure that a smooth handoff happens. For those listeners who don&#8217;t know, one of the ways that DARPA stays fresh is that no one&#8217;s allowed to be a DARPA Program Manager for more than five years. So we all come in, we do a tour of duty, we try to have as big an impact as we can in a short period of time, and then we return to wherever we came from. In my case, that&#8217;s the academic world. So I will be returning to the academic world in half a year&#8217;s time or so, but there will be other DARPA PMs to come on and continue this type of work.</p>
<p><b>[00:35:35] DB: </b>If you had known 10 years ago what&#8217;s happening today, would you have expected this evolution? I was involved in supercomputing early on, and AI was just something that was interesting to talk about, but it was in its winter, I think. What do you think has been the most significant change, in both its opportunity, its promise and also its threat?</p>
<p><b>[00:36:00] BD: </b>It&#8217;s really funny, because for so many years, the question in AI was, could we make anything work? Or could we make anything work in a way that was reliable enough that you would ever let it out of the laboratory? Now, we&#8217;re in this stage of, &#8220;Oh my, it works.&#8221; That&#8217;s why issues like defending AI have suddenly become important. Defending AI was moot 20 years ago. It didn&#8217;t do anything that was worth attacking. Now, I&#8217;d say we spent the first 25 years trying to figure out how to see.</p>
<p>Now, the question is what to look at. It&#8217;s the same sort of thing. When AI wasn&#8217;t powerful, when AI could only do very niche things, we didn&#8217;t have to worry so much about defending it. Bad actors, the financial business models, there were lots of things that didn&#8217;t matter. It was just a miracle when something worked. Now that we have AI systems that really are remarkably powerful, and that are jumping from domain to domain with remarkable success. Now, we need to worry about policy. Now, we need to worry about defending them. Now, we need to worry about all the problems that success brings.</p>
<p><b>[00:37:16] DB: </b>Well, I&#8217;ll be watching the science fiction literature field with interest, because this is all the stuff that the best science fiction was made of: machines would take over the things that we do. So now we all know that they can do that. What&#8217;s next, then?</p>
<p><b>[00:37:31] BD: </b>They can also be wonderful partners. They can also be – most of the work that I do at DARPA is using AI to make people better, stronger, smarter, more capable. I think there&#8217;s an awful lot of us who are really interested in using AI not to replace people, but to improve them, and make them more capable, and to empower individuals. That&#8217;s certainly my interest. So, there are risks involved, but there are also just wonderful opportunities.</p>
<p><b>[00:37:57] SM: </b>Like many other new technologies, it&#8217;s about striking that balance between what it can achieve and what damage it can do.</p>
<p><b>[00:38:04] BD: </b>I encourage people: gardproject.org. It&#8217;s like everything else in open-source software. The more people that are involved, the more brains we get on the project and the more eyes to make sure these systems are being used properly, to know how secure they are or aren&#8217;t, so that we know whether or not we want to put them in a particular critical role. We want more people involved in that, so it isn&#8217;t just one or two people offering their &#8220;expert opinion,&#8221; but a large community of people, all with different backgrounds and all different expertise, getting involved. That&#8217;s what we&#8217;re looking for. I think that&#8217;s what will give us the most robust AI going forward.</p>
<p><b>[00:38:45] SM: </b>I&#8217;m particularly grateful for this conversation, because we&#8217;ve covered in this podcast a lot of other threats that come from AI and machine learning, like discrimination in data sets, or other damaging and improper uses of properly working systems, like you were saying. But this adds another layer of complexity, and another layer of policymaking that needs to be taken care of.</p>
<p><b>[00:39:12] DB: </b>Now, I really appreciate the insight into a very valuable topic, and exposure to your project. I wish you great success on the project. Thank you.</p>
<p><b>[00:39:21] BD: </b>I understand this is your 25th anniversary, so thank you, and congratulations. The open-source software movement is such an important thing for spreading technology across the world. So thank you for your work as well.</p>
<p><b>[00:39:37] SM: </b>Thank you.</p>
<p>[END OF INTERVIEW]</p>
<p><b>[00:39:38] SM:</b> Thanks for listening. Thanks to our sponsor, Google. Remember to subscribe on your podcast player for more episodes. Please review and share. It helps more people find us. Visit deepdive.opensource.org, where you&#8217;ll find more episodes, learn about these issues, and can donate to become a member. Members are the only reason we can do this work. If you have any feedback on this episode, or on Deep Dive AI in general, please email <a href="mailto:contact@opensource.org">contact@opensource.org</a>.</p>
<p>This podcast was produced by the Open Source Initiative, with help from Nicole Martinelli. Music by Jason Shaw of audionautix.com, under the Creative Commons Attribution 4.0 International license. Links in the episode notes.</p>
<p><b>[00:40:20] ANNOUNCER:</b> The views expressed in this podcast are the personal views of the speakers and are not the views of their employers, the organizations they are affiliated with, their clients or their customers. The information provided is not legal advice. No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</p>
<p>[END]</p>
]]></content:encoded>
					
					<wfw:commentRss>https://opensource.org/blog/episode-6-transcript/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">17102</post-id>	</item>
		<item>
		<title>Episode 5: transcript</title>
		<link>https://opensource.org/blog/episode-5-transcript</link>
					<comments>https://opensource.org/blog/episode-5-transcript#respond</comments>
		
		<dc:creator><![CDATA[Ariel Jolo]]></dc:creator>
		<pubDate>Tue, 13 Sep 2022 00:00:00 +0000</pubDate>
				<category><![CDATA[Transcript]]></category>
		<guid isPermaLink="false">https://opensource.org/2022/09/13/episode-5-transcript/</guid>

					<description><![CDATA[“MZ: In order to train your networks in reasonable time schedule, we need something like GPU and the GPU requires no free driver, no free firmware, so it will be...]]></description>
										<content:encoded><![CDATA[<p><i><span style="font-weight: 400;">“</span></i><b><i>MZ:</i></b><i><span style="font-weight: 400;"> In order to train your networks in reasonable time schedule, we need something like GPU and the GPU requires no free driver, no free firmware, so it will be a problem if Debian community wants to reproduce neural networks in our own infrastructure. If we cannot do that, then any deep learning applications integrated in Debian itself is not self-contained. This piece of software cannot be reproduced by Debian itself. This is a real problem.”</span></i></p>
<p><span style="font-weight: 400;">[INTRODUCTION]</span></p>
<p><b>[00:00:47]</b> <b>SF:</b><span style="font-weight: 400;"> Welcome to Deep Dive AI, a podcast from the Open Source Initiative. We&#8217;ll be exploring how artificial intelligence impacts free and open-source software, from developers to businesses, to the rest of us.</span></p>
<p><span style="font-weight: 400;">[SPONSOR MESSAGE]</span></p>
<p><b>[00:01:01]</b> <b>SF:</b><span style="font-weight: 400;"> Deep Dive AI is supported by our sponsor, GitHub. Open-source AI frameworks and models will drive transformational impact into the next era of software, evolving every industry, democratizing knowledge, and lowering barriers to becoming a developer. As this revolution continues, GitHub is excited to engage in and support the OSI&#8217;s deep dive into AI and open source and welcomes everyone to contribute to the conversation.</span></p>
<p><b>ANNOUNCER:</b><span style="font-weight: 400;"> No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></p>
<p><b>[00:01:33]</b> <b>SF:</b><span style="font-weight: 400;"> This is an episode with Mo Zhou, a first-year PhD student at Johns Hopkins University and an official Debian Developer since 2018. He recently proposed the Machine Learning Policy for Debian. He&#8217;s interested in deep learning and computer vision, among other things.</span></p>
<p><b>[00:01:53]</b> <b>MZ:</b><span style="font-weight: 400;"> Hello, everyone.</span></p>
<p><b>[00:01:54]</b> <b>SF:</b><span style="font-weight: 400;"> Thanks for taking the time to talk to us. I wanted to talk to you in the context of Deep Dive AI. I would like to understand a little bit better the introduction of artificial intelligence and what it means for free software and open source. What are the limitations? What new things has it been introducing? What kept you interested in volunteering and thinking about machine learning in the Debian community?</span></p>
<p><b>[00:02:21]</b> <b>MZ:</b><span style="font-weight: 400;"> Well, actually, artificial intelligence is a long-existing research topic. I think we can split your questions into small ones, so I can handle them. </span></p>
<p><b>[00:02:34]</b> <b>SF:</b><span style="font-weight: 400;"> Absolutely. </span></p>
<p><b>[00:02:36]</b> <b>MZ:</b><span style="font-weight: 400;"> Where should we start? Let&#8217;s start with a brief introduction of what artificial intelligence is. In the last century, there was already some research about artificial intelligence. You may have heard some old news that computers can play chess with human players and that human players are beaten by computers. That&#8217;s a very classical example of artificial intelligence. Back then, artificial intelligence involved many manually crafted things. If you design a computer program that can play chess with you, there is basically a searching algorithm that searches for a good play for the next step based on the current situation on the chess board. There are many manually crafted things. </span></p>
<p><span style="font-weight: 400;">Recently, there are some factors that have brought changes to the artificial intelligence research community. The two most important factors are big data and the increase in hardware capacity. There is lots of hardware that is capable of parallel computing, like GPUs and FPGAs. This hardware is very important; without it, the recent advancements in deep learning would be impossible.</span></p>
<p><b>[00:04:05]</b> <b>SF:</b><span style="font-weight: 400;"> Right. So basically, you&#8217;re saying that the old chess-playing games had a database of possible moves, and what they were doing was searching quickly among possible alternatives and evaluating the best option?</span></p>
<p><b>[00:04:19]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah. That is the classical algorithm. Nowadays, if you look at AlphaGo, that&#8217;s very different from the past algorithms.</span></p>
<p><b>[00:04:29]</b> <b>SF:</b><span style="font-weight: 400;"> Right. AlphaGo is the automatic player for Go. </span></p>
<p><b>[00:04:34]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah. </span></p>
<p><b>[00:04:35]</b> <b>SF:</b><span style="font-weight: 400;"> Which is a lot more complex than chess, from my memory.</span></p>
<p><b>[00:04:39]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah. Basically, recent algorithms can handle very, very complicated situations. I can give you a very simple example. Imagine that you&#8217;re a programmer. Now I present you two images, one with a cat and one with a dog. How do you write a program that can classify the two images and tell you which is the dog and which is the cat? Basically, recent artificial intelligence can handle such complicated scenarios, and it is much more capable than what I have said.</span></p>
<p><b>[00:05:16]</b> <b>SF:</b><span style="font-weight: 400;"> Right. Okay, so how do they do that? </span></p>
<p><b>[00:05:18]</b> <b>MZ:</b><span style="font-weight: 400;"> The recent advancements are based on two factors: big data and computational capability. Let&#8217;s start with big data. If you want to do some classification of cat and dog images, first, you have to prepare a training data set. For example, you take 100 photos of various kinds of cats, and another 100 photos of various kinds of dogs. Then you label all the images you have collected. This is called a training data set.</span></p>
<p><b>[00:05:59]</b> <b>SF:</b><span style="font-weight: 400;"> The training data set is basically the raw pictures plus some metadata describing them that a human puts on them.</span></p>
<p><b>[00:06:08]</b> <b>MZ:</b><span style="font-weight: 400;"> Exactly. Given such a data set, we then construct a neural network. This neural network is composed of many, many layers, such as convolutional layers, nonlinear activation layers, and fully connected layers. Almost all of these layers come with some learnable parameters, where the knowledge the neural network has learned is stored. Given such a neural network, you input an image into it, and it will give you a prediction; it will predict whether the image has a cat or a dog. Of course, without training, it will make wrong predictions, and that&#8217;s why we have to design a loss function to measure the discrepancy between its real output and our expectation. Then, given such a loss function, we can do back propagation and stochastic gradient descent. Through that process, the neural network will gradually learn how to tell which image is a cat, and which image is a dog.</span></p>
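<p>As a rough illustration of the pipeline MZ describes (a sketch, not his code), the snippet below uses PyTorch: a labeled image data set, a small stack of convolutional, activation, and fully connected layers with learnable parameters, a loss function, and a training loop doing back propagation and stochastic gradient descent. The data/train directory layout (one subfolder per class, e.g. cat and dog) is a hypothetical placeholder.</p>
<pre><code>import torch
import torch.nn as nn
from torchvision import datasets, transforms

# Labeled images become the training data set: data/train/cat, data/train/dog.
train_set = datasets.ImageFolder(
    "data/train",
    transform=transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ToTensor(),
    ]),
)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Convolutional, nonlinear activation, and fully connected layers,
# each carrying learnable parameters.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 2),  # two outputs: cat vs. dog
)

loss_fn = nn.CrossEntropyLoss()  # measures the discrepancy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)  # prediction vs. expectation
    loss.backward()                        # back propagation
    optimizer.step()                       # stochastic gradient descent step
</code></pre>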
<p><b>[00:07:23]</b> <b>SF:</b><span style="font-weight: 400;"> Okay. So take software that runs on your phone and tells you whether you&#8217;re snapping a picture of a dog or a cat. In the past, if we were talking about non-AI systems, when you took a picture of something, you stored it on your computer; the software was not involved in doing anything but storing the picture and retrieving it from the file system. Now, if you add a search engine inside that application that detects your pet in your collection of pictures, we&#8217;re adding a little bit of complexity. Given the neural network that has been trained to detect cats and dogs, if we wanted to distribute that piece of software inside Debian, or inside one of the few free software, mobile open-source systems, to help retrieve our pictures, what do we need?</span></p>
<p><b>[00:08:20]</b> <b>MZ:</b><span style="font-weight: 400;"> Actually, we need lots of things, especially if we are doing distribution of free software. If we create an artificial intelligence application, we will need data. We will need the code for training the neural network. We will need the inference code for actually running the neural network on your device. Without any of them, the application is not complete. None of them can be missing.</span></p>
<p><b>[00:08:52]</b> <b>SF:</b><span style="font-weight: 400;"> The definitions that we have right now for what is complete and corresponding source code, how can they be applied to an AI system, to an application like this one that detects pictures of dogs?</span></p>
<p><b>[00:09:04]</b> <b>MZ:</b><span style="font-weight: 400;"> Well, actually, the neural network is a very simple structure, if we don&#8217;t care about its internals. You can just think of it as matrix multiplication. Your input is an image; we just do lots of matrix multiplications, and it will give you an output vector. This is simply what happens in the software. Both the training code and the inference code are doing a similar thing.</span></p>
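<p>A minimal sketch of that matrix-multiplication view, using numpy with made-up sizes: the flattened image is multiplied through weight matrices (the learnable parameters), with a nonlinearity in between, yielding an output vector of class scores.</p>
<pre><code>import numpy as np

image = np.random.rand(64 * 64 * 3)      # a flattened input image
W1 = np.random.randn(128, 64 * 64 * 3)   # learnable parameters, layer 1
W2 = np.random.randn(2, 128)             # learnable parameters, layer 2

hidden = np.maximum(0, W1 @ image)       # matrix multiply + nonlinear activation
scores = W2 @ hidden                     # output vector: [cat score, dog score]
print("prediction:", "cat" if scores[0] > scores[1] else "dog")
</code></pre>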
<p><span style="font-weight: 400;">Apart from the code, the data is something that can change. For example, we can use the same training and inference code for different data sets. Say I released the code for the cat and dog classification problem, but you take the code and say, “Oh, I&#8217;m more interested in classifying flowers.” Then you can collect a new data set of different kinds of flowers and use the same code to train the neural network and do the classification by yourself.</span></p>
<p><span style="font-weight: 400;">If you want to provide a neural network that performs consistently everywhere, you also have to release the pre-trained neural network. If you are releasing free software, that also requires you to release the training data as well, because free software requires freedoms that allow you to study, to modify, and to reproduce the work. Without the training data, it is not possible to reproduce the neural network that you have downloaded. That&#8217;s a very big issue.</span></p>
<p><span style="font-weight: 400;">Nowadays, in the research community, people are basically using neural networks that are trained on non-free data sets. All of the existing models are somewhat problematic in terms of license.</span></p>
<p><b>[00:11:10]</b> <b>SF:</b><span style="font-weight: 400;"> Why is this happening? Do you know? Do you have any sense?</span></p>
<p><b>[00:11:12]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah, the reason behind this is very simple. To train a functional neural network, you have to collect a great deal of data. For example, say you want to make a face recognition application. Then you have to collect face data. Who can collect such a large-scale data set? Only big companies can do this. It is very, very difficult for any individual to do it.</span></p>
<p><b>[00:11:43]</b> <b>SF:</b><span style="font-weight: 400;"> It&#8217;s definitely not something an amateur can do in their spare time in their bedroom.</span></p>
<p><b>[00:11:51]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah. No individual can build a large-scale data set. For example, nowadays, the most popular data set in the artificial intelligence field is called ImageNet. It contains more than 1 million images across 1,000 classes. If you want to build a free software alternative, you need lots of people to do the labeling work and the image correction.</span></p>
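<p><span style="font-weight: 400;">A back-of-envelope sketch of what that labeling effort implies; every number below is an assumption for illustration, not a figure from the episode.</span></p>
<pre><code># Rough estimate of the volunteer labeling effort for an ImageNet-scale set.
images = 1_200_000            # roughly ImageNet scale
labeled_per_hour = 100        # what one volunteer might manage
hours = images / labeled_per_hour
person_years = hours / (8 * 250)  # 8-hour days, 250 working days a year
print(f"{hours:,.0f} volunteer-hours, about {person_years:.0f} person-years")
</code></pre>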
<p><b>[00:12:18]</b> <b>SF:</b><span style="font-weight: 400;"> Of course, because this ImageNet dataset, I&#8217;m assuming, is not available under a free and open-source license, or a free data, open data license.</span></p>
<p><b>[00:12:28]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah. It is not free. It is basically for academic purposes only. There are lots of pre-trained models across the Internet, and basically everyone can download them and use them. There are potential license problems behind this.</span></p>
<p><b>[00:12:48]</b> <b>SF:</b><span style="font-weight: 400;"> Because, as you&#8217;re saying, this dataset has images and labels, and it&#8217;s a time-consuming process to apply the labels and classify images this way.</span></p>
<p><b>[00:12:59]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah, it is very time-consuming and costs lots of money.</span></p>
<p><b>[00:13:03]</b> <b>SF:</b><span style="font-weight: 400;"> Of course. How about text-based data and other types of data that are not images?</span></p>
<p><b>[00:13:10]</b> <b>MZ:</b><span style="font-weight: 400;"> Well, you mentioned text. That&#8217;s another interesting topic, because recent advances in artificial intelligence have brought significant change to several research areas. The first research area is computer vision. That is about what we have discussed: you classify cat and dog images. Another field is computational linguistics, or natural language processing. It has lots of applications, such as machine translation. For example, Google Translate is based on neural networks. Now, text-based data is relatively easier to collect, because we can simply download the whole Wikipedia dump as training data, since its license is free.</span></p>
<p><b>[00:14:02]</b> <b>SF:</b><span style="font-weight: 400;"> Right. But you still need to classify it, you still need to do other passes, or?</span></p>
<p><b>[00:14:08]</b> <b>MZ:</b><span style="font-weight: 400;"> Well, it depends on what kind of task you want to deal with. For example, if you want to do machine translation, then you can simply download, say, the English version of Wikipedia and the Chinese version of Wikipedia. Then, as long as you can find the English and Chinese sentence correspondences, you have already got a usable machine translation training set.</span></p>
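<p><span style="font-weight: 400;">A minimal sketch of assembling such a training set, assuming two text files that are already sentence-aligned line by line. The file names are hypothetical; real Wikipedia dumps would need extraction and alignment first.</span></p>
<pre><code># Pair up sentence-aligned English and Chinese text into training examples.
with open("wikipedia_en.txt", encoding="utf-8") as en, \
     open("wikipedia_zh.txt", encoding="utf-8") as zh:
    pairs = [
        (en_line.strip(), zh_line.strip())
        for en_line, zh_line in zip(en, zh)
        if en_line.strip() and zh_line.strip()
    ]

# Each (English sentence, Chinese sentence) pair is one training example.
print(pairs[:3])
</code></pre>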
<p><b>[00:14:39]</b> <b>SF:</b><span style="font-weight: 400;"> So we have these datasets that are proprietary and hard to distribute, and there are trained models being distributed that depend on these original datasets. Now, one of the rights of users of free and open-source software is that they can modify software to fix bugs. If we have a model that has some difficulty distinguishing European faces from African faces, for example, in a face recognition algorithm, or some other issue with dogs and cats, what do we need to have, as recipients of the software, to fix this bug, and what knowledge do we start from?</span></p>
<p><b>[00:15:18]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah. Actually, the question you have mentioned is a very good one. For example, say we train a face recognition network, and the training data set contains only a few Asian faces; then, as expected, your network will perform badly on Asian faces. This is a notorious issue called dataset bias. It is a cutting-edge research topic, and people are working on it. I think this issue will be overcome sometime in the future, but for now the problem exists.</span></p>
<p><span style="font-weight: 400;">If we want to deal with such issue nowadays, what we can do is to collect more data. For example, your neural network behaves that, on cat data. Then you just simply collect more cat data, and train your neural network again. If you want to do this, you will find that to train the neural network yet, you need the original training set, so you can put more images into it. You also need the training code to produce a new neural network.</span></p>
<p><b>[00:16:40]</b> <b>SF:</b><span style="font-weight: 400;"> That&#8217;s a really important thing to understand. In order to modify an existing model, we need access not only to the original data set, but also to the software used to train it. We need to know how that training has been configured, and what we would need to tweak in there, for example in the input parameters of the training. If we see that black dogs are constantly misinterpreted as cats, do we know how to retrain the system to give a better answer on that front, besides just giving it more data?</span></p>
<p><b>[00:17:21]</b> <b>MZ:</b><span style="font-weight: 400;"> In the research community&#8217;s practice, apart from the training data set, we also collect a validation data set. The validation set is basically in the same setting as the training data set, but with new images and new labels. The two data sets do not overlap. If you do training on the training set, your neural network has not seen any data in the validation data set. After your training process, you can test your neural network on the validation data set. If the performance is good on both training and validation data sets, then the neural network is good enough. After you have adjusted the neural network, you will also go through the validation process again to make sure the neural network you have obtained is sensible.</span></p>
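<p><span style="font-weight: 400;">A minimal sketch of that non-overlapping split and the validation check, using random stand-in data and an 80/20 split chosen for illustration.</span></p>
<pre><code>import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in labeled data, split into two disjoint subsets.
data = TensorDataset(torch.randn(1000, 3, 64, 64), torch.randint(0, 2, (1000,)))
train_set, val_set = random_split(data, [800, 200])

def accuracy(model, dataset):
    # Evaluate on images the network has never seen during training.
    with torch.no_grad():
        correct = sum(
            (model(x.unsqueeze(0)).argmax(dim=1) == y).item()
            for x, y in dataset
        )
    return correct / len(dataset)
</code></pre>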
<p><b>[00:18:18]</b> <b>SF:</b><span style="font-weight: 400;"> How predictable is the result of retraining? If I change the input parameters and retrain on the data set, do I know that I fixed the bug? How will I know?</span></p>
<p><b>[00:18:31]</b> <b>MZ:</b><span style="font-weight: 400;"> Actually, this process requires some background knowledge and some experience. If you are an engineer in the related field, you will find it is very easy, because if you have obtained a copy of code that is known to work well, you will basically not encounter any trouble, as long as you don&#8217;t change too much of the code, or significantly change the parameters, like the learning rate or something similar.</span></p>
<p><b>[00:19:04]</b> <b>SF:</b><span style="font-weight: 400;"> Let&#8217;s assume that we have the original data set and all the elements to retrain the model. Now, let&#8217;s go down to the hardware level. You said we need storage for sure, and we need fast storage. Then, on the computation side, what else do we need?</span></p>
<p><b>[00:19:20]</b> <b>MZ:</b><span style="font-weight: 400;"> If you search for deep learning frameworks on the Internet, you will find many, many solutions that work on a variety of hardware platforms, like mobile phones, tablets, and personal computers. These frameworks are designed not to be specific to any hardware. What you gain from more powerful hardware is speed. For example, if you run the same neural network on a personal computer with a strong GPU, it may run several hundred times faster than on your mobile phone. If you are a researcher in this field, you will quickly figure out that this speed issue is critical, because if you train a neural network on a CPU, it may require several years; with a strong GPU, it only takes several hours. The difference is ridiculous.</span></p>
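<p><span style="font-weight: 400;">A sketch of how frameworks keep the code hardware-agnostic while the device changes: in PyTorch, for example, the same call runs on CPU or GPU, and only the speed differs. The layer size here is an arbitrary illustration.</span></p>
<pre><code>import torch

# Pick the device at run time; the rest of the code is identical either way.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(64, 1024, device=device)
y = model(x)  # the same call on CPU or GPU; the GPU is just much faster
</code></pre>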
<p><span style="font-weight: 400;">[MESSAGE]</span></p>
<p><b>[00:20:20]</b> <b>SF:</b><span style="font-weight: 400;"> Deep Dive AI is supported by our sponsor, DataStax. DataStax is the real-time data company. With DataStax, any enterprise can mobilize real-time data and quickly build the smart, highly scalable applications required to become a data-driven business and unlock the full potential of AI. With AstraDB and Astra Streaming, DataStax uniquely delivers the power of Apache Cassandra, the world&#8217;s most scalable database, with the advanced Apache Pulsar streaming technology in an open data stack available on any cloud. DataStax leads the open-source cycle of innovation every day in an emerging AI-everywhere future. Learn more at datastax.com.</span></p>
<p><b>[00:21:00]</b> <b>ANNOUNCER:</b><span style="font-weight: 400;"> No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></p>
<p><span style="font-weight: 400;">[INTERVIEW CONTINUED]</span></p>
<p><b>[00:21:05]</b> <b>SF:</b><span style="font-weight: 400;"> Speaking of Debian, and going back to the free software and open-source community concerns about training data sets and hardware, one of your papers from a few years back mentioned the difficulty of getting access to accelerated CPUs, GPUs, and some functions inside some of these processors that were not readily available inside Debian. Can you elaborate a little bit on that?</span></p>
<p><b>[00:21:32]</b> <b>MZ:</b><span style="font-weight: 400;"> Debian is an open-source and free software community. It is very strict in its practice of free software, because our official infrastructure is all based on free software. In order to train neural networks on a reasonable time schedule, we need something like a GPU, and the GPU requires a non-free driver and non-free firmware. It is a problem if the Debian community wants to reproduce neural networks on our own infrastructure. If we cannot do that, then any deep learning application integrated into Debian is not self-contained: that piece of software cannot be reproduced by Debian itself. This is a real problem.</span></p>
<p><b>[00:22:32]</b> <b>SF:</b><span style="font-weight: 400;"> I totally understand it. I mean, for me, Debian has always been the lighthouse that you look to if you want to know whether a package is really giving users the freedom to run, modify, copy, and distribute.</span></p>
<p><b>[00:22:47]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah. Debian is very strict in this regard.</span></p>
<p><b>[00:22:51]</b> <b>SF:</b><span style="font-weight: 400;"> Right. You&#8217;re basically adding a new element of restriction to what a fully open-source AI system can be. You&#8217;re making hardware a piece of this, because up to a point it&#8217;s fine: you have the data set, you have the training model and parameters and all that stuff, you have all the source code, and you still cannot retrain your system to fix a bug, unless you have 10 years to wait. Then you have a problem. What are the efforts to try to overcome this issue with the hardware drivers?</span></p>
<p><b>[00:23:31]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah, this is a tough topic for the open-source community. Lots of effort has been put into reverse engineering the Nvidia driver. Even now, the free driver for Nvidia GPUs is still not usable for CUDA computation, and CUDA is what we need for training neural networks.</span></p>
<p><b>[00:23:57]</b> <b>SF:</b><span style="font-weight: 400;"> Also, the recent announcement from Nvidia still does not really help on the AI training front.</span></p>
<p><b>[00:24:04]</b> <b>MZ:</b><span style="font-weight: 400;"> It just helped a little bit. Nvidia has lots of software. The open-source driver is only a tiny bit of the whole ecosystem.</span></p>
<p><b>[00:24:17]</b> <b>SF:</b><span style="font-weight: 400;"> Okay, and how about other hardware manufacturers, like the newly announced chipsets from Google and from Apple? They seem to mention the fact that they have some AI capabilities, some AI instructions in there. What do you think of those?</span></p>
<p><b>[00:24:33]</b> <b>MZ:</b><span style="font-weight: 400;"> There are lots of new hardware manufacturers. You mentioned Google, right? They have their own Tensor Processing Unit. Currently, I don&#8217;t see any such TPU available on the market. Personally, we cannot buy it. There is no way for individual free software developers to look into such a thing. You also mentioned Apple. Yeah, they have done very good advertising for their new chips, but the corresponding ecosystems are not free. This is also a tough issue if you want to port your free software onto these platforms.</span></p>
<p><span style="font-weight: 400;">Basically, I think the big companies are responsible for doing this and there is no way for individual developers to do it. Apart from Apple, there are also AMD and Intel. The two manufacturers are releasing open-sourced computing software in order to compete with Nvidia. Currently, Nvidia&#8217;s CUDA computing software is dominant in this market. AMD has released their [inaudible 00:25:48] as a competitor. Recently, Intel also came up with one API to compete with Nvidia. Nowadays, only Nvidia is providing proprietary software solution for deep learning.</span></p>
<p><span style="font-weight: 400;">I think there is still a very long way to go for AMD and Intel, because Nvidia’s product is very mature at the current stage. This role cam and Intel’s one API are still very new. Our market still need some time to verify their new product to see whether they work or not.</span></p>
<p><b>[00:26:29]</b> <b>SF:</b><span style="font-weight: 400;"> Right, right, right. It has happened in the past that smaller architectures that were more open eventually took over, just through the work of large groups like Debian and others in the open-source world. Starting to think about the future, what does the future look like to you? What would you like to see inside Debian, in an ideal scenario?</span></p>
<p><b>[00:26:51]</b> <b>MZ:</b><span style="font-weight: 400;"> I have to say, my opinion is a little bit pessimistic, because there are various obstacles if we want to provide hardware support or data center support. Those two things just require lots of money, which is difficult even for big companies. What I am expecting from the free software community is that we can continue to provide a solid system for production and for research. We can support these applications and deep neural network frameworks. We can do this very well. But if our users want to train a neural network, they may have to rely on external software, such as some random code downloaded from GitHub, or something like –</span></p>
<p><b>[00:27:47]</b> <b>SF:</b><span style="font-weight: 400;"> What kind of licensing schemes are more popular in the AI research community?</span></p>
<p><b>[00:27:54]</b> <b>MZ:</b><span style="font-weight: 400;"> Well, based on my own experience, the most popular license in this research community is Apache 2. Some projects use BSD-style or MIT-style licenses. These licenses are very popular among the research community. If you are interested in some research paper, you will usually find the corresponding code, and the code is basically open source. The problem still stems from the corresponding training data, because many useful data sets are not free. You get free software training code and inference code, but the data is not free.</span></p>
<p><b>[00:28:39]</b> <b>SF:</b><span style="font-weight: 400;"> Yeah, so we come back to the fact that there is no clear understanding here; the copyleft concept is not applied, or is not common, among AI applications.</span></p>
<p><b>[00:28:53]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah. There is not a clear understanding of this issue. Many researchers just release their neural networks, but what license should we give to a trained neural network? Basically, nobody can answer this. We know there is a problem if we don&#8217;t clearly state a license.</span></p>
<p><b>[00:29:17]</b> <b>SF:</b><span style="font-weight: 400;"> In an ideal world for you, what&#8217;s an open-source AI?</span></p>
<p><b>[00:29:21]</b> <b>MZ:</b><span style="font-weight: 400;"> Well, I&#8217;m a Debian developer, so I stick to Debian&#8217;s free software guidelines. We pursue software freedom. When I get a free software AI application, I expect that I am able to download the training data set, that I can study the training code and the inference code, and that I can reproduce and also modify the neural network. That&#8217;s what I am expecting. I know this is very hard to achieve in the foreseeable future.</span></p>
<p><b>[00:30:00]</b> <b>SF:</b><span style="font-weight: 400;"> Yeah. Okay. It&#8217;s good to set the bar high and hope for the best. At some point, we&#8217;ll get there. That also includes getting free drivers in order to run training in a reasonably short time. All right, Mo. It&#8217;s been a pleasure to talk to you. I think we have covered a lot of ground. You helped us understand what an open-source system is, what an AI system is, and what components we need to watch for, from the training data sets to the model itself, and the hardware required to run it. Thank you very much. What are your plans for the future? What are you working on?</span></p>
<p><b>[00:30:41]</b> <b>MZ:</b><span style="font-weight: 400;"> I have not completely decided yet. I love doing research. I enjoy the research process, because by doing research, you are exploring the boundary of human knowledge. I really enjoy that, as long as we can make some progress, because you&#8217;re studying something that nobody knows. You are the first one on Earth to know that new piece of knowledge. This is very exciting.</span></p>
<p><b>[00:31:10]</b> <b>SF:</b><span style="font-weight: 400;"> It is exciting. Thank you very much, Mo Zhou.</span></p>
<p><b>[00:31:13]</b> <b>MZ:</b><span style="font-weight: 400;"> Yeah. Thank you for your time.</span></p>
<p><span style="font-weight: 400;">[END OF INTERVIEW]</span></p>
<p><b>[00:31:16]</b> <b>SF:</b><span style="font-weight: 400;"> Thanks for listening. Thanks to our sponsor, Google. Remember to subscribe on your podcast player for more episodes. Please review and share. It helps more people find us. Visit deepdive.opensource.org, where you&#8217;ll find more episodes, learn about these issues, and can donate to become a member. Members are the only reason we can do this work. If you have any feedback on this episode, or on Deep Dive AI in general, please email contact@opensource.org.</span></p>
<p><span style="font-weight: 400;">This podcast was produced by the Open Source Initiative, with the help from Nicole Martinelli, music by Jason Shaw of audionautix.com, under Creative Commons Attribution 4.0 International license. Links in the episode notes.</span></p>
<p><b>[00:31:59]</b> <b>ANNOUNCER:</b><span style="font-weight: 400;"> The views expressed in this podcast are the personal views of the speakers and are not the views of their employers, the organizations they are affiliated with, their clients or their customers. The information provided is not legal advice. No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></p>
<p><span style="font-weight: 400;">[END]</span></p>
]]></content:encoded>
					
					<wfw:commentRss>https://opensource.org/blog/episode-6-transcript/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">17101</post-id>	</item>
		<item>
		<title>Episode 4: transcript</title>
		<link>https://opensource.org/blog/episode-4-transcript</link>
					<comments>https://opensource.org/blog/episode-4-transcript#respond</comments>
		
		<dc:creator><![CDATA[Ariel Jolo]]></dc:creator>
		<pubDate>Tue, 06 Sep 2022 00:00:00 +0000</pubDate>
				<category><![CDATA[Transcript]]></category>
		<guid isPermaLink="false">https://opensource.org/2022/09/06/episode-4-transcript/</guid>

					<description><![CDATA[“DGW: Some people just want to download the software and make porn with it. And if they don&#8217;t know how to program, and there is that restriction, that stops them....]]></description>
										<content:encoded><![CDATA[<p><b><i>“DGW: </i></b><i><span style="font-weight: 400;">Some people just want to download the software and make porn with it. And if they don&#8217;t know how to program, and there is that restriction, that stops them. That&#8217;s a meaningful impediment. It&#8217;s a speed bump. It doesn&#8217;t stop you going down the road, but it makes it harder, it makes it slower, and it stops some harm. </span></i></p>
<p><i><span style="font-weight: 400;">Even within the dictates of open source and free software, I think that we can think a little bit more creatively about how we can build restrictions about what users we think are immoral or unethical into our software, and not see it as black and white.”</span></i></p>
<p><span style="font-weight: 400;">[INTRODUCTION]</span></p>
<p><b>[00:00:37] SM: </b><span style="font-weight: 400;">Welcome to Deep Dive AI, a podcast from the Open Source Initiative. We&#8217;ll be exploring how artificial intelligence impacts free and open-source software, from developers to businesses, to the rest of us. </span></p>
<p><span style="font-weight: 400;">Deep Dive AI is supported by our sponsor, GitHub. Open source AI frameworks and models will drive transformational impact into the next era of software, evolving every industry, democratizing knowledge, and lowering barriers to becoming a developer. As this evolution continues, GitHub is excited to engage and support OSI’s Deep Dive Into AI and open source and welcomes everyone to contribute to the conversation.</span></p>
<p><b>[00:01:18] ANNOUNCER: </b><span style="font-weight: 400;">No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></p>
<p><span style="font-weight: 400;">[INTERVIEW]</span></p>
<p><b>[00:01:23] SM: </b><span style="font-weight: 400;">Today we&#8217;re talking with David Widder, a Ph.D. student in the School of Computer Science at Carnegie Mellon University. He’s been investigating AI from an ethical perspective, and specifically is studying the challenges that software engineers face related to trust and ethics in artificial intelligence. He’s conducted this research at Intel Labs, Microsoft, and NASA&#8217;s JPL, Jet Propulsion Lab. </span></p>
<p><span style="font-weight: 400;">David, thanks for joining us today to talk about your research, and what you&#8217;ve learned about AI and ethics from the developer&#8217;s viewpoint. Welcome, David.</span></p>
<p><b>[00:01:55] DGW: </b><span style="font-weight: 400;">Thank you so much, Stefano. I&#8217;m excited to be here. And I am grateful for the opportunity.</span></p>
<p><b>[00:02:00] SM: </b><span style="font-weight: 400;">Tell us about why you chose ethics in AI as the focus of your research. What do you find in this topic that is so compelling?</span></p>
<p><b>[00:02:08] DGW: </b><span style="font-weight: 400;">We might all agree that, for better or worse, AI is changing our world. And as we begin to think about what ethical AI means, especially as a lot of the, as I see it, ethical AI discourse is driven by powerful companies, governments, and elite universities, I think there&#8217;s a risk in the way this discourse plays out. </span></p>
<p><span style="font-weight: 400;">The problems we study are not those which affect the most marginalized, which are often left out of the decision-making of tech companies and things like that. They&#8217;re the problems that are faced by these systems of power. They’re are the problems that are most salient to these people. And the solutions we make are barely going to sort of threaten these powerful interests. There are things that sort of are important and make meaningful changes, but don&#8217;t sort of levy fundamental critique. </span></p>
<p><span style="font-weight: 400;">I like taking a step back to be a little bit more critical of the narratives around AI ethics that emerge and ask, “What are we missing? What is going unsaid?” Based on what we&#8217;re focusing on, what isn&#8217;t being focused on? And that&#8217;s how I like to drive my research.</span></p>
<p><b>[00:03:03] SM: </b><span style="font-weight: 400;">You recently presented one of your published papers, the one titled Limits and Possibilities of &#8220;Ethical AI&#8221; in Open Source, focused on deep fakes, with your coauthors Dawn Nafus, Laura Dabbish, and James Herbsleb. For those of us who need a little bit of background, what is a deep fake? And give us a few examples of how this technology is used, for good or bad reasons.</span></p>
<p><b>[00:03:29] DGW: </b><span style="font-weight: 400;">Essentially, a deep fake is a video where the likeness of one person is superimposed, or swapped, faked, onto the body of another person. So, you put the face of one person onto the face of another in a video. Sometimes this is for more innocent, fun uses: parody, political parody. You might have seen deep fake Obama or deep fake Tom Cruise. Sometimes it&#8217;s for art. There&#8217;s actually some overlap between what we think of as a deep fake and computer graphics, like the movie Avatar and things like that. These are the more interesting uses. </span></p>
<p><span style="font-weight: 400;">Now, where it gets a little bit more tricky is sort of in the middle, where we start thinking about what is political parody and what is fake news? There might be what might seem like parody for one person might actually fool other people, if you&#8217;re parodying powerful politicians or leaders. </span></p>
<p><span style="font-weight: 400;">What I think is actually understudied is the really difficult uses, the really nasty uses, the really damaging uses, which, unfortunately, constitute the vast majority of deep fakes. One study found that the vast majority of deep fakes portray women in non-voluntary pornography. So, they&#8217;re superimposing the likeness of someone you might know, like a celebrity, onto a pornographic actor. Then this can lead to anxiety, job loss, and health issues, and employment issues. </span></p>
<p><b>[00:04:43] SM: </b><span style="font-weight: 400;">It’s terrible, because if I understand correctly, also, this technology is becoming so much better that it&#8217;s hard to distinguish the original from a fake. So, gullible or inexperienced people might be tricked into believing it&#8217;s true.</span></p>
<p><b>[00:04:59] DGW: </b><span style="font-weight: 400;">Totally. When we talk about this technology getting better and like becoming more convincing, you&#8217;re totally right. It is increasingly easy to use. It&#8217;s increasingly accurate in the way it spoofs or fakes the video. And I think that raises an important question. It&#8217;s hard to think of how to make the tech better in a way that changes the way it harms because the harm is inherent in the way it&#8217;s used. </span></p>
<p><span style="font-weight: 400;">It&#8217;s not like a technical improvement to the tool is going to like reducing bias or fixing privacy leak doesn&#8217;t really make sense in this context. It&#8217;s how it&#8217;s used. When it&#8217;s getting better, it&#8217;s hard to think of how that also fixes ethical issues or how that also addresses ethical issues.</span></p>
<p><b>[00:05:33] SM: </b><span style="font-weight: 400;">Okay, let&#8217;s talk about this ethical issue. Your paper is focused on open source, but these deep fake technologies are available in both proprietary and open-source forms. And the ethical ramifications seem to play out a little bit differently depending on whether we are in a corporate, proprietary situation or within the open-source community. What do you see as some of the contrasts between these two approaches? Let&#8217;s start from the corporate scenario and focus on big tech. What responsibilities do these technology companies have when they develop or use this technology? What kind of power and control do they have?</span></p>
<p><b>[00:06:12] DGW: </b><span style="font-weight: 400;">They have a lot of power and a lot of control, even when they perhaps don&#8217;t want to acknowledge that. A good example is Google&#8217;s Project Maven, which was a contract they had with the Department of Defense to use, I believe, computer vision to help improve the targeting of warfighting drones. These big tech companies often try to present their technology as neutral: we just provide tools for people to use for good or bad. But this is a case where they knew what it was being used for. Even if it wasn&#8217;t technically being used to kill, they knew that it could help make that more efficient. And I think a good example of tech workers mobilizing and speaking up was the backlash to this project. </span></p>
<p><span style="font-weight: 400;">Now, I since learned that I think it has been reinstituted in some way. And we&#8217;re seeing this, too, in the ways that technology companies are increasingly investing in responsible AI and ethical AI research. Trying to find ways to remove bias from systems. Trying to find ways to make these systems more understandable and more interpretable. And I think that&#8217;s good. But I think we also have to pay attention to how these systems are ultimately used, rather than just how they are built or how they&#8217;re implemented. </span></p>
<p><span style="font-weight: 400;">To summarize, there&#8217;s a lot of control and a lot of power they have. And I think we have to be careful to investigate where they choose to use that control and where they choose to invest in these kinds of ethical AI questions. </span></p>
<p><b>[00:07:29] SM: </b><span style="font-weight: 400;">To clarify, Project Maven was not a deep fake technology. It was more of a general AI computer vision tool.</span></p>
<p><b>[00:07:36] DGW: </b><span style="font-weight: 400;">Yeah, yeah, absolutely. And to talk about – I mean, that&#8217;s a fair point. And to talk about the difference between corporate or centralized proprietary technology in the deep fake context, there are plenty of closed source proprietary deep fake services, running as software as a service, where you can upload a video and upload some source imagery and have a fake provided to you without being able to access the source code, or being able to access sort of the intermediate steps there. </span></p>
<p><span style="font-weight: 400;">And because it&#8217;s centralized, there actually is the opportunity for some more control in that case. They can put in filters for pornography and such in a way that isn&#8217;t felt in open source. And I think we&#8217;re going to get to that.</span></p>
<p><b>[00:08:14] SM: </b><span style="font-weight: 400;">Basically, proprietary systems and corporations have the choice to pick who they partner with, who their customers are, and also select the possibilities inside the tools themselves, like the kind of output they can prepare. </span></p>
<p><span style="font-weight: 400;">In contrast, an open source developer community, that includes a numbers of volunteer developers. Your research has uncovered a quandary here. It seems that the open source licenses limit the developer’s sense of responsibility and control of how their software is used.</span></p>
<p><b>[00:08:46] DGW: </b><span style="font-weight: 400;">There are artifacts of the licenses, and of the way people conceive of open source and free software licenses: the non-discrimination against fields of endeavor, the non-discrimination against persons and groups, and free software&#8217;s freedom zero. Licenses that follow these mandates legally dictate that you can&#8217;t discriminate, can&#8217;t control how downstream developers or downstream users use your software. This is different because, oftentimes, software companies that don&#8217;t license their software as open source have contractual control over how and who uses their software, and often even for what. </span></p>
<p><span style="font-weight: 400;">Whereas an open source that control isn&#8217;t felt by virtue of these licensing strategies common in open source software. And this can be good for a lot of things. For example, and one of the proprietary deep fake tools was found to have a crypto miner embedded in it that was stealing power and cycles from the people who used it. With open source, it’s not all bad news for this. But it illustrates an important distinction and what kinds of harms can be prevented in open source versus in a proprietary company context. </span></p>
<p><span style="font-weight: 400;">And in the open source case, I think it&#8217;s important to realize like what starts as a legal dictate in the licensing world often will then filter into the culture. And it&#8217;s not just me pointing this out. But it filters into the culture in a way that developers don&#8217;t feel like they can or should even be able to control how their software is used by virtue of the licenses that they&#8217;re used to using.</span></p>
<p><b>[00:10:11] SM: </b><span style="font-weight: 400;">Which is an argument that we often make at the Open Source Initiative: that the legal mandates inside copyright licenses are just one layer, and that they incorporate social norms and collaboration norms that are cultural inside the projects, rather than mandated legally. The licenses are just the tool. But your research also addresses two sentiments among your respondents about how they have limited agency over what they produce. The first is the notion of technical inevitability, a sentiment that, basically, the genie&#8217;s out of the bottle.</span></p>
<p><b>[00:10:49] DGW: </b><span style="font-weight: 400;">That&#8217;s a powerful one. I&#8217;ll read a quote from one of my participants: &#8220;Technology is like a steam engine. It&#8217;s just getting better, faster, and more powerful.&#8221; As if this happens naturally. As if this happens without any human making it happen. Historians and philosophers of technology have critiqued this idea, that technology naturally gets better or naturally improves, as something that removes, or at least limits, one&#8217;s sense of personal agency: technological inevitability. </span></p>
<p><span style="font-weight: 400;">In the open source case, in particular, some of my participants, when they were building this deep fake tool, were saying things like, “Deep fake software will only continue to get better.” And there are competing projects. There are other open source projects that are trying to do the same thing we are. So, even if I chose to stop working on this, even if I withheld my labor, this algorithm or kind of system will continue to improve. Deep fake realism will continue to improve. Even our project without even my labor. </span></p>
<p><span style="font-weight: 400;">And I think that&#8217;s true to an extent, but it might not improve as fast. And it might not improve as quickly. And there might not be these kinds of technological innovation. So, I think that we have to be critical of the idea of technological inevitability because of the way it can seem to limit one&#8217;s own personal agency. You know as well as I do that a lot of open source projects need more labor. They need more help. So, if they chose to take that labor and chose to take that volunteer effort and put it somewhere else, that actually can make a difference.</span></p>
<p><b>[00:12:13] SM: </b><span style="font-weight: 400;">This is a very interesting topic, because it also overlaps with my memory of the conversations around PGP, Pretty Good Privacy, algorithms and tools that were for a long time considered weapons by the United States. Even encryption was considered a weapon, right? For a long time, commerce was not safe outside the boundaries of the United States, because higher-grade encryption schemes were not available. </span></p>
<p><span style="font-weight: 400;">And as of today, there&#8217;s still that dichotomy between the people who want to have encryptions on their phones to balance it out with threats of terrorists using the same kind of technology. But the other perception that you have identified among the developers is that of technological neutrality. So, another perception that you have identified among developers is the technological neutrality, the notion that if someone paints something offensive, you can&#8217;t blame the paint manufacturer. </span></p>
<p><b>[00:13:11] DGW: </b><span style="font-weight: 400;">Guns don&#8217;t kill people; people kill people. These are real cultural issues in the United States, right? The idea of technological neutrality is an idea that looms large in the political discourse, not just in the United States. Guns don&#8217;t kill people; people kill people. We&#8217;ve heard that before. That&#8217;s a nice soundbite, but I think we need to think harder about it, because guns are designed to do a certain thing very well. So even if they don&#8217;t literally kill someone, I think they make certain things easy. </span></p>
<p><b>[00:13:35] SM: </b><span style="font-weight: 400;">Automatically, yeah. Autonomously. Right. </span></p>
<p><b>[00:13:38] DGW: </b><span style="font-weight: 400;">Yeah, exactly. Which, if that happens, gives us a whole other set of problems. But they are designed to make certain things easier and certain things harder. An example I like to give is that you can throw a gun as if it were a Frisbee, but it&#8217;s a pretty bad Frisbee, and it&#8217;s not going to be very fun. And you can kill someone with a Frisbee, but it&#8217;s not designed to do that, and it&#8217;s going to be pretty hard. </span></p>
<p><span style="font-weight: 400;">The idea of technological neutrality is, I think, in many cases false. I’m not the first to point this out. There&#8217;s been a lot of scholars of science and technology studies who asked Do Artifacts Have Politics? Langdon Winner. And what it comes down to, and at least in my view, is that there&#8217;s a connection between the way you design a thing or the way you implement a system, and what kind of uses are afforded by that system. </span></p>
<p><span style="font-weight: 400;">Affordances, the connection, the glue in the middle, the things that you design, the way you design your system to make certain things easier and certain things harder affects how something is used. And that then unpacks and challenges the idea of technological neutrality. </span></p>
<p><span style="font-weight: 400;">In the deep fake example, or in the open source example more generally, I heard people argue that even if we were to put, for example, technical pornography restrictions into our software, because it&#8217;s open source, anyone could go and then just take those out. Anyone could go and remove those. Now, that&#8217;s true. In a literal sense, open source is open. And if you have the programming knowledge that can take out restrictions that are built into code. </span></p>
<p><span style="font-weight: 400;">But as one of my participants pointed out, not everyone has that knowledge. Some people just want to download the software and make porn with it. And if they don&#8217;t know how to program, and there&#8217;s is that restriction, that stops them. That&#8217;s a meaningful impediment. It&#8217;s a speed bump. It doesn&#8217;t stop you going down the road. But it makes it harder, it makes it slower. And it stops some harm. </span></p>
<p><span style="font-weight: 400;">Even within the dictates of open source and free software, we can think a little bit more creatively about how we can build restrictions about uses we think are immoral or unethical into our software, and not see it as black and white. It&#8217;s not always going to be we stop all misuse, or we just don&#8217;t stop any. But there are shades of grey in the middle. Challenging the idea of technological neutrality is the way we begin to see those shades of grey. </span></p>
<p><b>[00:15:42] SM: </b><span style="font-weight: 400;">This is a great point, because so many times I talk to developers who have that black-and-white approach. They&#8217;re trained to think in mathematical terms: if A happens, then B is the consequence.</span></p>
<p><b>[00:15:54] DGW: </b><span style="font-weight: 400;">And I think the norms will differ in every community. Some communities will be more comfortable taking a more restrictive approach, leaning more into trying to guide people toward certain socially beneficial uses and away from certain socially concerning uses. And I think that just acknowledging the broad spectrum of gray that is there is going to be really, really important. I think you&#8217;re right to root our discussion in concrete examples of harm, or concrete examples in the world, because that makes the stakes feel appropriately high. </span></p>
<p><span style="font-weight: 400;">If we adopt the, “Well, we can&#8217;t stop all harmful use. So, we may as well just leave the ethics up to the user and not try,” I think that is concerning. Because even if we stop a few or like 10% of harmful uses, that still appreciably changes the harm that is wrought to individuals. Like, for women who have had nonvoluntary pornography made of them, that&#8217;s a big number. If that stops you from having a fake porn made about you, and that stops you from losing your job or developing anxiety, like that&#8217;s a real thing. Thinking of framings that aren&#8217;t we stop all harm, or we don&#8217;t even try, is probably a good way to develop the conversation in this area and get away from the idea of technological neutrality.</span></p>
<p><b>[00:17:08] SM: </b><span style="font-weight: 400;">Your research also reveals an interesting dichotomy: how the transparency and accountability of open source may differ between implementation and use. With respect to ethical AI, why is open source great for implementation purposes, but not so great with respect to use?</span></p>
<p><b>[00:17:26] DGW: </b><span style="font-weight: 400;">This is something we&#8217;ve all kind of known for a while, but not sort of named. I&#8217;m not going to like pretend I thought of it. But I think what we name in our paper is a spectrum between – Or a continuum, as we call it, between implementation-based harms and use-based harms.</span></p>
<p><span style="font-weight: 400;">Implementation-based harms are things you can fix by building the software differently. And a good example in an open source case is the idea of recidivism prediction algorithms. Algorithms which seek to predict whether someone who&#8217;s accused in the criminal justice procedure will recidivate, will recommit a crime. </span></p>
<p><span style="font-weight: 400;">And these systems are in many cases – Well, most cases, I&#8217;d say, biased. Whether because of the data they use, or the way they train their algorithm or the way they&#8217;re even just employed in a certain context. And if they&#8217;re the sort of implementation-based harms, if there are data issues, or implementation issues in the code or bias issues in the code, making these open source can allow more people to scrutinize them. Can allow more eyes to ask questions of this system and find implementation harms. And in wider – Not just the recidivism case. But in wider cases, perhaps find privacy leaks, or find unchecked bugs, ethical bugs so to speak, that can be fixed by increased scrutiny. </span></p>
<p><span style="font-weight: 400;">Because of the way that open source is open and freely inspectable and freely – Like, anyone can, in many cases, submit a pull request or submit a change to fix these kinds of bugs for implementation harms or harms from implementation, I think that open source does particularly well.</span></p>
<p><b>[00:18:58] SM: </b><span style="font-weight: 400;">Definitely, this is true. It has been true for a long time when we talked about software in the very classic software sense, &#8217;90s-style and non-AI, because we&#8217;ve been talking about the inspectability and the reproducibility of code: the fact that you can download the source code, recompile it on your own, and prove that it actually gives the same deterministic results after running it. </span></p>
<p><span style="font-weight: 400;">When I talk to developers of AI systems, I get that fuzzier answer to that, because the inspectability itself becomes a little bit more convoluted depending on the model or the algorithms. I totally understand your point. In general, the systems we&#8217;d like them to be – We should be able to inspect them and review and make sure that especially before we put them in charge of making decisions for us.</span></p>
<p><b>[00:19:42] DGW: </b><span style="font-weight: 400;">I agree. I think we should maybe contrast this with use-based harms, the other side of the continuum. Open source allows transparency in the source code, and therefore accountability for implementation harms. You can know who added what feature, and if there&#8217;s an issue, you can fix it. </span></p>
<p><span style="font-weight: 400;">For use-based harms, where open source is released online for anyone to use for any purpose, there’s not a lot of traceability into who uses it for what. There&#8217;s not transparency into uses. And there&#8217;s not accountability for those uses. That filters out back to the people who developed it. </span></p>
<p><span style="font-weight: 400;">An example of that would be the deep fake case, right? It&#8217;s not a problem with how the tool was built, or how it was implemented. It&#8217;s how it&#8217;s used. And because it&#8217;s open source, anyone can use it for harm. And so therefore, this is where it&#8217;s a little bit more problematic, a little bit more concerning. Open source allows some use-based harms to go unchecked without the same level of transparency into how it&#8217;s used and the level of accountability into harms arising from those uses.</span></p>
<p><b>[00:20:38] SM: </b><span style="font-weight: 400;">Then there are the questions about transparency. We don&#8217;t have norms or legal obligations around this yet. There is some research going on. The European Union has already started looking into AI a little bit more, just like the European Union has been looking at data mining and starting to regulate it. These transparency and accountability issues that float around AI and art, and that are the object of your studies too, are starting to be regulated.</span></p>
<p><b>[00:21:09] DGW: </b><span style="font-weight: 400;">I&#8217;m glad there&#8217;s more focus on these because I think that we have a very narrow and rehearsed view of what transparency and accountability might mean.</span></p>
<p><span style="font-weight: 400;">[BREAK]</span></p>
<p><b>[00:21:19] SM: </b><span style="font-weight: 400;">Deep Dive AI is supported by our sponsor, DataStax. DataStax is the real-time data company. With DataStax, any enterprise can mobilize real-time data and quickly build the smart, highly scalable applications required to become a data-driven business and unlock the full potential of AI. With AstraDB and Astra Streaming, DataStax uniquely delivers the power of Apache Cassandra, the world&#8217;s most scalable database, with the advanced Apache Pulsar streaming technology in an open data stack available on any cloud. </span></p>
<p><span style="font-weight: 400;">DataStax leads the open source cycle of innovation every day in an emerging AI everywhere future. Learn more at datastacks.com.</span></p>
<p><b>[00:21:59] ANNOUNCER: </b><span style="font-weight: 400;">No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></p>
<p><span style="font-weight: 400;">[INTERVIEW CONTINUED]</span></p>
<p><b>[00:22:03] SM: </b><span style="font-weight: 400;">You also shared with me, as a contrast to this research, another paper you wrote with NASA. Comparing the deep fake paper and the NASA paper gave you another perspective on how developers see AI and how they end up trusting it. Tell us a little bit about that.</span></p>
<p><b>[00:22:20] DGW: </b><span style="font-weight: 400;">A brief sort of highlight of the paper is I had the privilege of being at a NASA site while they were developing and beginning to use an auto coding tool, a tool to build software automatically for an upcoming space mission. I was a fly on the wall and very grateful for that privilege. </span></p>
<p><span style="font-weight: 400;">But there&#8217;s a question of trust there, right? Because if you&#8217;re trying to use this new tool that automates some programmers’ labor, does it in an automated way, do you trust it? These people, as am I, they&#8217;re space nerds, right? Like, we have pictures of like the past missions on our walls. We have, you know, statues. And sort of the idea of wanting this mission to be successful is extreme. NASA is what some literature call a high-reliability organization. The stakes are big, you know? </span></p>
<p><span style="font-weight: 400;">They were developing this framework, which by the way was open source. And then also, trying to use it for an upcoming space mission. And I think this is useful in contrast to the deep fake case, because in the NASA case, everyone can agree what a harm would be. If the spaceship blows up, if it crashes, if it stops working when it&#8217;s outside the orbit of Earth, that&#8217;s a bad thing. And that looks bad for me. That looks bad for you. That looks bad whether you&#8217;re the one creating the software or the one using it. There&#8217;s normative agreement around what is good and what is bad. It helps illustrate the deep fake case because there wasn&#8217;t normative agreement around what was good and what was bad. </span></p>
<p><span style="font-weight: 400;">The community developing the tool had strongly set norms that you&#8217;re not allowed to use this for non-voluntary pornography and other harmful uses. But there&#8217;s not normative agreement in every case between the people developing the tool and the users of that tool. They weren&#8217;t able to control or necessarily even engage with reached normative agreement with the users of the tool. </span></p>
<p><span style="font-weight: 400;">What is bad for the community, and certainly, they took steps to set these norms in a positive direction, which I think is great and a thing that the wider open source community can learn from. But they weren&#8217;t able to always reach normative agreement with the myriad users and nameless users, many users who weren&#8217;t in their organization and thus can&#8217;t be reached in the way that they could at NASA. And there was that normative agreement at NASA. </span></p>
<p><b>[00:24:33] SM: </b><span style="font-weight: 400;">The fact that they had this control, and they knew who the users were.</span></p>
<p><b>[00:24:38] DGW: </b><span style="font-weight: 400;">The fact that they were just like already on the same page about what was a good use and what was a bad use to begin with. Maybe there&#8217;s someone way out there. Maybe a different state actor might disagree that an American space mission succeeding is a good thing, and they might seek to damage it. But at least within the organization developing and using the software, there was normative agreement around what was a good use and a bad use, or a good outcome and a bad outcome, in a way that didn&#8217;t always exist between the developers and users in the deep fake case.</span></p>
<p><b>[00:25:05] SM: </b><span style="font-weight: 400;">And the fact that, while the community has just started talking about assistive coding technologies, like Copilot or CodeWhisperer, NASA already had something that was even more production-ready. It&#8217;s actually shipping code that goes into flight missions, not just a prototype.</span></p>
<p><b>[00:25:22] DGW: </b><span style="font-weight: 400;">Absolutely. I know, right? I think that&#8217;s super exciting. And that is part of the reason why they really needed to trust it. I mean, I&#8217;m sure you and I have experimented with Copilot and written a few lines, and it&#8217;s fun. But there, the stakes are high. It really needs to work. Trust in your tools is especially important in these kinds of contexts, and that&#8217;s what we were trying to study in that paper. </span></p>
<p><b>[00:25:45] SM: </b><span style="font-weight: 400;">For the record, I&#8217;m not a developer. My experiments with Copilot are pretty much just fooling around and trying to see whatever it spits out. I&#8217;m not able to judge.</span></p>
<p><b>[00:25:55] DGW: </b><span style="font-weight: 400;">And while we&#8217;re on the topic of Copilot, this is a little bit of an aside, but I think Copilot is concerning for open source. I mean, I think it&#8217;s exciting in many ways, but there are particular concerns I have. Because is it valid, is it okay, to train a proprietary system on open-source code if you don&#8217;t then license it using the terms of the data you are training it on? What I mean by that is, in many cases, Copilot will generate license text verbatim – MIT license, permissive licenses, copyleft licenses like the GPL – which shows that it can spit out licensed code verbatim, in a way that may or may not respect the licenses of that code. This is a legal gray area right now. I&#8217;ve talked to lawyers who are much smarter than me and also are actually lawyers. But that&#8217;s a much wider conversation: what does the idea of Copilot, especially in the code sense, hold for open source when it&#8217;s unclear whether it&#8217;s following the license restrictions? </span></p>
<p><b>[00:26:58] SM: </b><span style="font-weight: 400;">Those are really important questions that the community is asking about Copilot, CodeWhisperer, and other tools that I&#8217;m sure are in development that we don&#8217;t know about yet. I guess, fundamentally to me, it is a fairness issue. When developers wrote code and published it, made it available to the world, they adopted a shared agreement, saying, &#8220;I give it to the world with the promise that other users receive the same rights.&#8221; But none of these developers in the past, or anybody who wrote anything and published it on the Internet, had any understanding that their body of text, their creation, would be used to train a machine that would be doing some other thing. Whether it&#8217;s DALL-E with pictures and images, or Copilot, or GPT-3 spitting out poems and short web pages, it&#8217;s a new thing. It&#8217;s a new right. There is this new right of data mining that has been codified by the European Commission already. In the US legal system, I don&#8217;t think there is an equivalent, but probably there will be something that looks like it. Whatever we have contributed in the past, if we don&#8217;t want it to be available for corporations or anyone else to use for training data sets, we have to take an action. </span></p>
<p><span style="font-weight: 400;">And what goes with all the pictures that we have uploaded in the past on data sets like Flickr using the norms that we thought was fair? Like, with Creative Commons, we said, “Okay, Creative Commons Attribution Share-Alike. I give it this picture to you, as long as you keep it the same as it is and you share it with the same attributes with the same rights to others. And now, that picture is my face, is being used to train a system that detects myself going shopping or going to a protest in the street. Is that fair? Honestly, I don&#8217;t have an answer for that. And I think we, as a society, we need to ask ourselves, what have we done? What kind of world do we want to live in? What are the conditions to balance the power of regular citizens with those of developers and other actors?</span></p>
<p><b>[00:29:09] DGW: </b><span style="font-weight: 400;">You raise a really interesting question about the difference between laws and norms. Because oftentimes, norms are things we all feel and experience and kind of expect. And they may or may not be in line with the current legal regime, or, as you raised, they may just not be a settled matter in law.</span></p>
<p><span style="font-weight: 400;">When I uploaded – When I first got Facebook, I don&#8217;t know, I was 13 or something. And I was uploading pictures of like a birthday party or something. Am I okay with ClearView AI using that to build a facial recognition system and sell it to law enforcement and other agencies? Like, I think if you&#8217;d asked me then I would have gone, “What’s facial recognition?” But also, no. I mean, I hadn&#8217;t heard of that case. But I think it&#8217;s really scary if like the default is going to be set towards like companies and governments have the right to scrape your data and use it for whatever, unless you take a specific action not to. Because like you and I know about this, right? But I think the average person who doesn&#8217;t have the luxury of free time to talk on podcasts.</span></p>
<p><b>[00:30:13] SM: </b><span style="font-weight: 400;">But also, it&#8217;s almost impossible to exercise that right to opt out, because so many megabytes and gigabytes of pictures have already been uploaded. I&#8217;ve lost track of where I put all my pictures. And the services that used to exist now don&#8217;t exist anymore. Where did it all go? It&#8217;s an interesting world that we live in.</span></p>
<p><span style="font-weight: 400;">Going back to yourself 13 years ago, I mean, when you were 13, were you even aware that you were basically training a machine by uploading a picture and adding a tag of your friends and building a history of your faces changing over the ages too?</span></p>
<p><b>[00:30:48] DGW: </b><span style="font-weight: 400;">Well, was I explicitly aware? Probably not. But I will point out that I was quite a nerd. So, every time I was drawing a box around my face to tell Facebook I was in it, and every time I was drawing a box around my friends&#8217; faces to let them know that I put up a photo of them, there was that kind of question about, again, what kinds of solutions companies will develop and why. There&#8217;s that question of, “Yeah, this is kind of useful to me, in that I can know which pictures I&#8217;m in.” But it&#8217;s going to be useful for the people developing the feature too. It&#8217;s going to be useful for Facebook to know who my social network is, so they can sell ads to me that are more precise. So, maybe not as 13-year-old David. But there&#8217;s always that kind of, “What&#8217;s going on here?” that I think we all tend to have in our curiosity.</span></p>
<p><b>[00:31:32] SM: </b><span style="font-weight: 400;">That’s why I think it&#8217;s important to talk about ethics in AI because the responsibilities of corporations and the control that they have also means that they have power that needs to be balanced out.</span></p>
<p><b>[00:31:45] DGW: </b><span style="font-weight: 400;">I think you&#8217;re totally right. As we begin to talk about ethical AI, if we let companies only drive this conversation, and we don&#8217;t look to open source, and we don&#8217;t look to public sector organizations, then I think we&#8217;re going to get a very particular idea of what ethical AI is and what kind of problems there are, that is going to be driven by the interests of big tech.</span></p>
<p><b>[00:32:06] SM: </b><span style="font-weight: 400;">For the broader open source community, what do you think are the key takeaways? As we frame the discussion around AI and ethics, what are your thoughts about how to bring about the best future for AI and help it become more trustworthy for us?</span></p>
<p><b>[00:32:22] DGW: </b><span style="font-weight: 400;">The big question that I hope to raise with the paper is that I think we need to start this conversation. I think we don&#8217;t know yet. And I&#8217;m not going to pretend that my paper has a definitive answer to your question. But I&#8217;d be happy to start. I like to think of this as good news and bad news. And we&#8217;ll start with the bad news, because it&#8217;s nice to end on an optimistic note. I think we have to realize that putting software that can be used for harm online, and letting anyone use it for anything, can cause harm in ways that proprietary closed source software does not. And I think we need to talk about that more. I think we need to recognize that.</span></p>
<p><span style="font-weight: 400;">And the devil is in how you actually, once you recognize that, what you do about it. But I think, again, going back to our earlier discussion about the gray area, it&#8217;s not black and white. It&#8217;s not like you either make it closed source and write contracts for who can use it for what and license it that way, or you make it open source. There&#8217;re areas in the middle. As a community I studied, that you can set norms. Even if you&#8217;re completely open source, you can set norms about how you find it that your community is okay with the software being used. You can elevate socially beneficial cases and educate about harms arising from harmful cases. </span></p>
<p><span style="font-weight: 400;">There&#8217;s also the ethical source movement, which is using licenses to bar certain kinds of uses. And there&#8217;s a lot of discussion about whether this constitutes an open source – Technically an open source license or not. But I think that the higher-level takeaway I take from that movement is that you can use licensing in many ways, or you can at least use licenses to influence norms in many ways. And I think that&#8217;s just something for further discussion. I don&#8217;t think it&#8217;s something to be cast out of hand outright. </span></p>
<p><span style="font-weight: 400;">What are ways that we can find to influence, if not outright control, how the open source software release in the world can be used? And I think that&#8217;s sort of the bad news, acknowledging that there is harm from the way we release possibly harmful software that can be used for harm freely available online for everyone to use. </span></p>
<p><span style="font-weight: 400;">Now, towards the optimistic case. By virtue of focusing on open source in this conversation, I think we haven&#8217;t talked about some of the harms from big tech in this case. There&#8217;s a profit incentive. And I can cite so many studies. So much great research has shown the difficulty of doing ethical AI when you&#8217;re driven by a profit motive. When you&#8217;re working with a manager who wants you to do certain things, not others. When you don&#8217;t have the ability to address an ethical harm or change norms in a way that you think would be helpful. </span></p>
<p><span style="font-weight: 400;">I think the open source strength in this area is what it&#8217;s always been, which is the broad diversity of communities that can set their own norms, that can refashion these norms, as a way to experiment on what ethical AI might mean in a way that is not dependent on the for-profit context in private companies. This radical sense of experimentation in open source is also a promising way to think about what ethical AI mean, or what it could mean in different contexts.</span></p>
<p><b>[00:35:09] SM: </b><span style="font-weight: 400;">The experimentation, I&#8217;m all about that. I&#8217;m all in favor. And I think that we are in the early stages of new things. And if we don&#8217;t play, if we don&#8217;t play with different variations, if we don&#8217;t keep ourselves flexible, then we&#8217;re not going to be making much progress. Any closing remarks or something that you want to share? Like, what&#8217;s something you&#8217;re working on for the future?</span></p>
<p><b>[00:35:31] DGW: </b><span style="font-weight: 400;">I would love to discuss this research with anyone listening. I want this to start the conversation. I&#8217;m under no illusion that I have all the answers. I want to learn from everyone. I would also love to connect on Twitter. That&#8217;s where I discuss a lot of my research, my art, my activism. I&#8217;m Davidthewid on Twitter. And I&#8217;d love to learn from you and engage with you there.</span></p>
<p><span style="font-weight: 400;">And as I&#8217;m still doing my Ph.D., I don&#8217;t have an escape to that yet. And I&#8217;m beginning to study what AI ethics might look like in a supply chain. Acknowledging the fact that software is not developed all at once in one organization. You remix stuff. You take bits from there. You take modules from there. And that all comes together. And that means that AI ethics has to account for that reality in the fact that it&#8217;s a supply chain problem, the same way ethics has always been a supply chain problem in the physical product space. That&#8217;s what I&#8217;m working on. If anyone has thoughts on that, too, I&#8217;d love to talk.</span></p>
<p><b>[00:36:24] SM: </b><span style="font-weight: 400;">Thank you. Thank you, David.</span></p>
<p><b>[00:36:26] DGW: </b><span style="font-weight: 400;">I&#8217;ve loved this conversation. Thank you.</span></p>
<p><span style="font-weight: 400;">[OUTRO]</span></p>
<p><b>[00:36:27] SM: </b><span style="font-weight: 400;">Thanks for listening. And thanks to our sponsor, Google. Remember to subscribe on your podcast player for more episodes. Please review and share, it helps more people find us. Visit deepdive.opensource.org, where you can find more episodes and learn about these issues. And you can donate to become a member. Members are the only reason we can do this work.</span></p>
<p><span style="font-weight: 400;">If you have any feedback on this episode, or on Deep Dive AI in general, please email contact@opensource.org This podcast was produced by the Open Source Initiative, with the help from Nicole Martinelli. Music by Jason Shaw of our genetics.com under Creative Commons Attribution 4.0 International License. Links in the episode notes.</span></p>
<p><b>[00:37:09] ANNOUNCER: </b><span style="font-weight: 400;">The views expressed in this podcast are the personal views of the speakers and are not the views of their employers, the organizations they are affiliated with, their clients, or their customers. The information provided is not legal advice. No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></p>
<p><span style="font-weight: 400;">[END]</span></p>
]]></content:encoded>
					
					<wfw:commentRss>https://opensource.org/blog/episode-4-transcript/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">17100</post-id>	</item>
		<item>
		<title>Episode 3: transcript</title>
		<link>https://opensource.org/blog/episode-3-transcript</link>
					<comments>https://opensource.org/blog/episode-3-transcript#respond</comments>
		
		<dc:creator><![CDATA[Ariel Jolo]]></dc:creator>
		<pubDate>Tue, 30 Aug 2022 00:00:00 +0000</pubDate>
				<category><![CDATA[Transcript]]></category>
		<guid isPermaLink="false">https://opensource.org/2022/08/30/episode-3-transcript/</guid>

					<description><![CDATA[[00:00:01] Connor Leahy: When a human says something, there&#8217;s all these hidden assumptions. If I tell my robot to go get me coffee, the only thing the robot wants to...]]></description>
										<content:encoded><![CDATA[<p><b><i>[00:00:01]</i></b> <b><i>Connor Leahy:</i></b><i><span style="font-weight: 400;"> When a human says something, there&#8217;s all these hidden assumptions. If I tell my robot to go get me coffee, the only thing the robot wants to do is to get coffee, hypothetically. It wants to go and get the coffee as quickly as possible, so it’ll run through the wall, run over my cat, throw grandma out of the way to get to the coffee machine as fast as possible. Then, if I run up, “No, no, no. Bad robot.” I try to shut it off. What will happen? The robot will stop me from hitting the off button, not because it&#8217;s conscious or it has a will to live. No, it will simply be because the robot wants to get the coffee. If it&#8217;s shut off, it can&#8217;t get me coffee. It will resist. It will actively fight me to get me coffee, which is of course silly.</span></i></p>
<p><span style="font-weight: 400;">[INTRODUCTION]</span></p>
<p><b>[00:00:45]</b> <b>Stefano Maffulli:</b><span style="font-weight: 400;"> Welcome to Deep Dive AI, a podcast from the Open Source Initiative. We&#8217;ll be exploring how artificial intelligence impacts free and open-source software, from developers to businesses, to the rest of us.</span></p>
<p><span style="font-weight: 400;">[SPONSOR MESSAGE]</span></p>
<p><b>[00:00:59]</b> <b>SM:</b><span style="font-weight: 400;"> Deep Dive AI is supported by our sponsor, GitHub. Open-source AI frameworks and models will drive transformational impact into the next era of software, evolving every industry, democratizing knowledge, and lowering barriers to becoming a developer. As this evolution continues, GitHub is excited to engage and support OSI’s deep dive into AI and open source and welcomes everyone to contribute to the conversation.</span></p>
<p><b>[00:01:26]</b> <b>ANNOUNCER:</b><span style="font-weight: 400;"> No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></p>
<p><span style="font-weight: 400;">[INTERVIEW]</span></p>
<p><b>[00:01:31]</b> <b>SM:</b><span style="font-weight: 400;"> Welcome, Connor Leahy. Thanks for taking the time. Connor is one of the founders of EleutherAI, a collective of artificial intelligence researchers. He&#8217;s also Founder and CEO of Conjecture, a startup that is doing some interesting research on AI safety. We&#8217;ll talk more about this. Welcome, Connor.</span></p>
<p><b>[00:01:50]</b> <b>CL:</b><span style="font-weight: 400;"> Thanks so much for having me.</span></p>
<p><b>[00:01:52]</b> <b>SM:</b><span style="font-weight: 400;"> Let&#8217;s start by explaining a little bit of the history of EleutherAI. How did it come to be, and how did you come up with the idea?</span></p>
<p><b>[00:02:02]</b> <b>CL:</b><span style="font-weight: 400;"> The true story of how EleutherAI came about: during the pandemic, back in 2020, everyone&#8217;s bored to tears, stuck at home. I was hanging out on an ML Discord server. It&#8217;s a chat server. There&#8217;s some paper that got published talking about GPT-3 models, big model training or whatever. I basically said, “Hey, guys. Wouldn&#8217;t it be fun to do this?” Then someone else replied, “This, but unironically,” and the rest was history.</span></p>
<p><span style="font-weight: 400;">It very much started as just a fun hobby project of just some bored hackers. They were just hanging around, looking for something fun to do. At the time, the GPT-3 model was becoming well known. The paper was actually published a bit earlier. Now, the API was now becoming accessible to some people, so people were noticing this is really cool. We can always create things with it. It was very interesting with GPT-3 as a very specific AI model. it was just unprecedentally large. It had this huge supercomputer to build this model. That was a very interesting technical challenge.</span></p>
<p><span style="font-weight: 400;">It was also very interesting – the model, the final model, GPT-3 was really interesting for a lot of reasons. Basically, it started as more of a joke. Just like, we&#8217;re bored, let&#8217;s just mess around. We don&#8217;t have big supercomputers, so we didn&#8217;t expect to get very far. Yeah, things went a lot further than expected. We started to get more and more interest and awesome resources. We started to gather some models and stuff. Frequently, we started taking this more seriously. We thought, thinking more seriously about, what do we actually want to do? This is actually a good thing to do, etc., etc.</span></p>
<p><b>[00:03:40]</b> <b>SM:</b><span style="font-weight: 400;"> You basically put together a band of hackers and programmers, researchers from different places all around the idea of creating an alternative to OpenAI&#8217;s models?</span></p>
<p><b>[00:03:51]</b> <b>CL:</b><span style="font-weight: 400;"> That&#8217;s not how I would describe it. No, no. The name is very much tongue-in-cheek. Of course, we were all sad that we didn&#8217;t have access to GPT-3. Because GPT is cool. OpenAI is a big for-profit company with billions of dollars of research money and computers, and whatever. The goal of EleutherAI very much was always to be a group of independent researchers doing interesting work and, hopefully, useful work for the world. In particular, one of the reasons we thought this work was very promising is that I and many other people at EleutherAI think that artificial intelligence is the most important technology of our time and, as it becomes more and more powerful and it can do more and more tasks, it will become a more and more powerful and dominant force in our society. It is very important to understand this technology.</span></p>
<p><span style="font-weight: 400;">That&#8217;s one of the reasons I now have a startup, where we work on researching safety of AI systems, how to make them more reliable, how to make them safer, how to make them not do things we don&#8217;t want them to do, which is a big problem with AI and only it will become more of a big problem. With EleutherAI, we saw, basically, was an arbitrage opportunity. We saw that there is a lot of cool research to be done with large models, and also, important research trying to understand these models, how do they work internally? How do they fail? And so on. There&#8217;s a lot of these opportunities.</span></p>
<p><span style="font-weight: 400;">Building an actual model like this is extremely expensive and technically difficult. You need very specific kinds of engineering skillsets. It&#8217;s very, very expensive. But, once you have built such a model, using it for experience is much, much cheaper, like magnitudes of order. We saw this opportunity that we could pay this one-time cost, in order to make this technology more accessible for academic researchers, safety researchers, people with less resources, that might be able to do valuable research with this kind of artifact.</span></p>
<p><b>[00:05:51]</b> <b>SM:</b><span style="font-weight: 400;"> Help me understand a little bit better what&#8217;s going on. There&#8217;s always this myth that only the very large corporations, or research institutes like, I don&#8217;t know, NASA or CERN, have the processing power and the money and the data and the knowledge to train these large models. You started from a large model, right? How did you get the first model built? Then, how did you progress?</span></p>
<p><b>[00:06:17]</b> <b>CL:</b><span style="font-weight: 400;"> There are three main things that go into building a large model, which are data, engineering, and compute. Depending on what model you&#8217;re building, data may or may not be a bottleneck. For the kinds of models we were building, these are language models, data really isn&#8217;t much of a bottleneck. It&#8217;s still a pain to get the data together and whatnot, and this is why we pulled together our dataset, The Pile, which we compiled and also released. But it&#8217;s not really a bottleneck.</span></p>
<p><span style="font-weight: 400;">The engineering can be a bottleneck in the sense that it&#8217;s not trivial, especially back then. Nowadays, there’s more open-source libraries and stuff that make this training of large models easier.</span></p>
<p><b>[00:06:53]</b> <b>SM:</b><span style="font-weight: 400;"> Wait. When you say back then –</span></p>
<p><b>[00:06:54]</b> <b>CL:</b><span style="font-weight: 400;"> Two years ago.</span></p>
<p><b>[00:06:55]</b> <b>SM:</b><span style="font-weight: 400;"> Two years ago. Okay, we&#8217;re not talking 30 years ago.</span></p>
<p><b>[00:06:58]</b> <b>CL:</b><span style="font-weight: 400;"> No. Like two years ago, this was very difficult. Even one year ago, this was still more difficult than it is today.</span></p>
<p><b>[00:07:04]</b> <b>SM:</b><span style="font-weight: 400;"> What changed?</span></p>
<p><b>[00:07:05]</b> <b>CL:</b><span style="font-weight: 400;"> Companies like Nvidia and Microsoft released a lot of the code, with libraries such as DeepSpeed and Megatron that make this stuff easier. It&#8217;s still not easy. Also, Facebook released the FSDP library and Fairseq, which help with training large models. It&#8217;s still not at all easy, but the engineering is less hard than it was at the time. At the time, there were a few dozen people in the world who really knew how to make these, and they existed only in these large corporations. I think that&#8217;s still the case, that there are maybe a few hundred people who really have hands-on experience building large models. A lot of ML is like alchemy. It&#8217;s like dark magic. You have to know all the secret tricks to make things work. It&#8217;s getting better, but it&#8217;s still quite tricky.</span></p>
<p><span style="font-weight: 400;">The third component that goes into these kinds of models, compute is actually the biggest bottleneck. The amount of computation that goes into building something like GPT-3 is massive. It&#8217;s not like you can just run this on your CPUs. You need massive clusters of GPUs, all interconnected with high-end supercomputing grade hardware. You can&#8217;t do this on standard hardware. You need the supercomputer grade stuff, which is very expensive and quite tricky to use sometimes.</span></p>
<p><span style="font-weight: 400;">With EleutherAI, we had moved several, I&#8217;d say, phases. The first phase, we got our compute from what is called the TPU research cloud, which is a project from Google to give academic access to some of the TPU chips, which are specific chips for training ML models. They were quite generous with us with giving us access to pretty large amounts of these chips for doing our research. Our first models that we released, GPT Neo models were trained on this, including also, GPT-J, which was a later model that was also done by [inaudible 00:08:50] who was a AI contributor at the time.</span></p>
<p><span style="font-weight: 400;">We then, later, started working with a cloud company named CoreWeave, who are specialized GPU provider. We basically had a deal that we will help them test their hardware, debug things and stuff. In return, they&#8217;ll let us train our models on some of the hardware they were building. That resulted in the GPT Neo X model, which is the largest model we&#8217;ve released at this time. We have some potential new partnerships going on in the background right now. We&#8217;ll see if anything comes of that or not.</span></p>
<p><b>[00:09:26]</b> <b>SM:</b><span style="font-weight: 400;"> If I understand correctly, you&#8217;re saying that the engineering pieces are becoming simpler, almost commoditized, because of the releases from the big companies. They&#8217;re releasing code. Data, you mentioned, is not that much of an issue. It&#8217;s hard, but we&#8217;re talking text for these models. Yeah, we don&#8217;t get into the multiple petabytes of storage necessary. Then the third thing is the hardware piece. From the text perspective, the data that goes into the model, the training, how do you acquire it? What kind of volumes are we talking about? Where do you get the data to start from?</span></p>
<p><b>[00:10:05]</b> <b>CL:</b><span style="font-weight: 400;"> For text data in particular, I&#8217;m not too familiar with other modalities. You need truly stupendous amounts of text. Rule of thumb is, you want a terabyte of raw text, which is a truly, unimaginably large amount of text. That is billions and billions –</span></p>
<p><b>[00:10:25]</b> <b>SM:</b><span style="font-weight: 400;"> Compressed? Compressed or uncompressed?</span></p>
<p><b>[00:10:28]</b> <b>CL:</b><span style="font-weight: 400;"> This is uncompressed. This is uncompressed. A terabyte of uncompressed text is what you want to aim for, or something like that. I think The Pile is about 800 gigabytes of text uncompressed, which is enough.</span></p>
<p><b>[00:10:39]</b> <b>SM:</b><span style="font-weight: 400;"> The Pile is the starting point, the data, the raw data.</span></p>
<p><b>[00:10:43]</b> <b>CL:</b><span style="font-weight: 400;"> That is the dataset that we built for training our models, and released. If you need to get 800 gigabytes of text of various sorts, that&#8217;s a place you can get it quite easily. A lot of The Pile comes from Common Crawl, which is just a huge dump of Internet sites. I forget who made it, but it&#8217;s massive, petabytes of scraped websites and stuff, which we then post-process: we filter out spam, and then extract the text from the HTML, and stuff like that. Then, the other part is a massive amount of curated datasets. We took lots of datasets that already existed.</span></p>
<p><span style="font-weight: 400;">For example, we took data sets from Payton&#8217;s, or with the – or from various chat rooms, or whatever. I don&#8217;t remember everything&#8217;s in there. There’s a lot of medical texts, just like all PubMed. There&#8217;s a large amount of publicly available scientific documents, papers in biomedicine. Also, we took from arXiv, which is this pre-publication server, which has a huge amount of physics, math, computer science papers. We scraped all of that. Furlough that into text. The pile compared to other datasets is weighted more heavily towards scientific, technical data, less on the social media chats. There&#8217;s some of that too, but it&#8217;s much less. It&#8217;s not focused on that. A lot of it is very technical documents and such.</span></p>
<p><b>[00:12:06]</b> <b>SM:</b><span style="font-weight: 400;"> Right. You didn&#8217;t take Wikipedia, or material books from –</span></p>
<p><b>[00:12:09]</b> <b>CL:</b><span style="font-weight: 400;"> Oh, no. Wikipedia is in there, too. Yeah, there&#8217;s all kinds of stuff. You can read the paper; I think there are 20-some datasets in there from all kinds of sources. Wikipedia, I think, is a few gigabytes of text in total, maybe four or something. You need a lot.</span></p>
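<p><span style="font-weight: 400;">Here is a toy sketch of the post-processing step Connor describes: pulling readable text out of scraped HTML and dropping pages that look like spam. Real pipelines like the one behind The Pile are far more involved; the spam heuristic below is a made-up placeholder, not the actual filter.</span></p>
<pre><code># Toy version of "filter out spam, extract the text from the HTML".
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def looks_like_spam(text):
    # Placeholder heuristic: tiny pages, or pages dominated by one
    # repeated word, get discarded.
    words = text.split()
    if len(words) &lt; 20:
        return True
    return max(words.count(w) for w in set(words)) / len(words) &gt; 0.3

page = "&lt;html&gt;&lt;script&gt;var x=1;&lt;/script&gt;&lt;body&gt;&lt;p&gt;Some scraped text.&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;"
parser = TextExtractor()
parser.feed(page)
text = " ".join(parser.chunks)
print(text, "| spam:", looks_like_spam(text))
</code></pre>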
<p><b>[00:12:28]</b> <b>SM:</b><span style="font-weight: 400;"> Right. You store it somewhere on the cloud. That&#8217;s not a big deal for now.</span></p>
<p><b>[00:12:32]</b> <b>CL:</b><span style="font-weight: 400;"> Funny story about that. It is currently hosted by The Eye, which is a pirate data hosting service. There&#8217;s literally a guy called The Archivist. I don&#8217;t know what his real name is. He&#8217;s completely anonymous. I think he might be an international fugitive. I don&#8217;t know. Whenever we need to host something, we just tell him and he&#8217;s like, “Yeah, no problem,” and just hosts it for us. That&#8217;s how we host our datasets, and at least some of our models, because he just has infinite storage. It&#8217;s a fun hacker story.</span></p>
<p><b>[00:13:01]</b> <b>SM:</b><span style="font-weight: 400;"> Sort of underlining the nature of the group. You now have The Pile, you have the trained models, you have the hardware, of course, you&#8217;ve got the competence to do all of this, and you have created a bunch of models that are, to some extent, replications of and alternatives to OpenAI&#8217;s. You&#8217;re taking a completely different approach compared to them, though. They&#8217;re not releasing their models. They are keeping them behind an API for safety reasons. Or at least, that&#8217;s the story. Why are you releasing yours? Are you not afraid?</span></p>
<p><b>[00:13:36]</b> <b>CL:</b><span style="font-weight: 400;"> That is a very good question, and the answer is, of course. Of course, you should be concerned when you build a new technology that has unprecedented capabilities. Whether you use it yourself, offer it as an API, deploy it, or make it public, all of these are things that should be considered. There is this meme that sometimes exists inside the scientific community that, as a scientist, you have no obligations regarding the downstream effects of your work.</span></p>
<p><span style="font-weight: 400;">I think, that&#8217;s obviously bullshit. In my heart, I&#8217;m just like, I just want to science all the time, just build all the things and who cares? It&#8217;s fine. Let the politicians sort out how to use it. That&#8217;s just not how the world works. It&#8217;s not a good way to think about this. There&#8217;s sometimes this belief that people think EleutherAI’s stance is all things should be public, all the time, always. That is not our stance, and it&#8217;s never been our stance. I understand why people are confused about this. There are also several other groups that are vaguely associated with us, that is their opinion. I strongly disagree with them.</span></p>
<p><span style="font-weight: 400;">It&#8217;s always been the case with EleutherAI that we think there are some specific things in this specific instance, which we think it is net positive for these specific reasons to be released. I think in this specific instance of these specific things going on right now, it is more net positive for these models, various language models of various sizes, to be accessible for researchers to do certain types of research, than it would be not to be. We said from the very beginning, if I, for example, had access to a quadrillion parameter model, or something that&#8217;s completely unprecedented, we would not release it. Because who knows what that thing can do?</span></p>
<p><span style="font-weight: 400;">It does not seem a good idea to just dump something that no one knows about. There&#8217;s a very specific argument that we believe 99% of the damage done by GPT-3 was done the moment the paper was published. As the saying goes, the only secret about the atomic bomb was that it was possible. Then people are like, “Well, what if Russian disinformation agents use it?” I&#8217;m like, the paper is out there. If a few hackers in a cave can build these kinds of models, you think the Russian government can&#8217;t? Of course they can. Of course, they can just buy a supercomputer and train this stuff.</span></p>
<p><span style="font-weight: 400;">I think there are downstream effects of EleutherAI that may not have been a good idea. The way I see things and people disagree with me about this, is I think, very powerful AI is coming very, very soon. Human level AI is coming quite soon. I expect it, for various technical reasons, to have many properties in common with the models we&#8217;re seeing today. I don&#8217;t expect a full shift. I very much disagree with people who say like, “Oh, we&#8217;ve made no progress towards human level AI. These things aren’t intelligent. They won’t make any progress.” I fully disagree. I think these people are not paying attention, or are confused about what these things are actually capable of.</span></p>
<p><span style="font-weight: 400;">I expect that – studying these technologies that currently exist is very, very important. This is arbitrage opportunity. The models we released are much smaller, and much less capable than GPT-3. GPT-3 and many other groups have also had models of similar capacity internally and such. Now, there&#8217;s open-source models of GPT-3 size anyways, like the OPT models and the Blue model. It&#8217;s always been a very contingent truth is that we&#8217;re like, okay, releasing models like this will have some unknown consequences. People might use them for spam. People might use them for something I hadn&#8217;t even thought about before. Maybe someone will come up with some new use of this model that I had never thought about that was actually bad. Maybe they&#8217;ll come up with very positive uses. I don&#8217;t know.</span></p>
<p><span style="font-weight: 400;">I think, reasoning about how new technologies will affect the world is very hard ahead of time. I think there&#8217;s two conflicting parts inside for me. The one part is that, historically speaking, generally, every new technology, people are afraid of but then, when it&#8217;s actually deployed, it’s actually good. It’s in retrospect, really, I&#8217;m glad this technology was – Imagine if people tried to make electricity illegal, because, well, people could shock themselves, so we should have a license to have electricity in your home, or something like that. Obviously, that would have sucked.</span></p>
<p><b>[00:17:30]</b> <b>SM:</b><span style="font-weight: 400;"> There was that debate.</span></p>
<p><b>[00:17:30]</b> <b>CL:</b><span style="font-weight: 400;"> That debate did exist. I think we as modern people are quite happy that the optimists won that one. That&#8217;s a fully legitimate argument. That&#8217;s not a silly argument to be made. I think that&#8217;s a good argument to be made. There&#8217;s also the other one, which is the – there’s the argument to be made, hey, there&#8217;s some very specific risks we can see that are not hypothetical. Now, we can debate about how do these risks measure up against each other? For this very specific technology, right now, of these language models, I think the optimist’s side, in my opinion, has somewhat of an upper hand. That doesn&#8217;t mean that this applies to all technology.</span></p>
<p><span style="font-weight: 400;">Some people would say, “Well, okay, so you like electricity, Connor? Well, what about nukes? Those use electricity, right? You okay with those?” I&#8217;m like, whoa, whoa, whoa, whoa. Slow down. Yes, those do use electricity, but that&#8217;s a whole different class of thing. That&#8217;s what I mean when I say, it was never my or EleutherAI’s stance that everything always should be released all the time. Because who knows? Maybe tomorrow, OpenAI creates some model that has some crazy capability or some really dangerous capability that is super scary, and they shouldn&#8217;t release that.</span></p>
<p><span style="font-weight: 400;">Basically, at some point, someone will create an AI system that is truly dangerous, that&#8217;s actually dangerous, Not just spam or something, but is truly dangerous. I don&#8217;t know what that system is going to look like. I don&#8217;t know who&#8217;s going to make it, but it&#8217;s going to exist. I think, it would be great if it&#8217;s not impossible for them to not release that. I think, it would be great if we can accept that maybe some things we should be careful about. Whether or not it applies to this specific situation that we have in front of us today.</span></p>
<p><b>[00:19:14]</b> <b>SM:</b><span style="font-weight: 400;"> It&#8217;s probably not easy to imagine, but it could be something that slips out of a lab, the same way that the first Internet worms escaped by mistake and ended up creating new carriers for infections inside computers. When we chatted another time, you mentioned scary scenarios of AIs unleashed with a pile of money attached to them to maximize shareholder value.</span></p>
<p><b>[00:19:41]</b> <b>CL:</b><span style="font-weight: 400;"> Yeah. That&#8217;s one of the scenarios, for example, that I take relatively seriously. What does AI do? Take, for example, game-playing AI. Usually, this is what&#8217;s called reinforcement learning. The way this usually works is we have some function that can score, and you can train the AI to do whatever actions maximize the score. We get a high score in [inaudible 0:20:02] or whatever. Now, just straightforwardly extrapolate this to where we&#8217;re going right now. Look at the AI technologies today versus two or three years ago. Nowadays, we have AI where you can just type in a sentence and it&#8217;ll generate a full photorealistic image of anything you can imagine.</span></p>
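<p><span style="font-weight: 400;">For anyone who has not seen the setup Connor is describing, here is a toy sketch of that score-maximizing loop: tabular Q-learning on a made-up two-action game. The rewards are invented for illustration; the point is only that the agent converges on whatever scores highest, with no notion of how the score is earned.</span></p>
<pre><code># Toy score maximizer: the agent only ever sees the number go up.
import random

q = {0: 0.0, 1: 0.0}        # value estimate per action
reward = {0: 1.0, 1: 5.0}   # hypothetical scores for each action
alpha, epsilon = 0.1, 0.1   # learning rate, exploration rate

for step in range(1000):
    if random.random() &lt; epsilon:
        a = random.choice([0, 1])      # explore
    else:
        a = max(q, key=q.get)          # exploit best-known action
    q[a] += alpha * (reward[a] - q[a]) # move estimate toward the score

print(q)  # converges near {0: 1.0, 1: 5.0}; action 1 wins out
</code></pre>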
<p><span style="font-weight: 400;">You have these GPT systems that can write full stories, or chat with you like a person, or you can have – or like the Minerva system recently, which can solve incredibly difficult math problems. It&#8217;s as good as a human, or even better. This incredible sophistication that did not exist two or three years ago. Now, let&#8217;s say that just continues. Let&#8217;s just take a naive attempt. We&#8217;ll say, okay, things went this fast the last two years. Let&#8217;s just imagine the next two years go just as fast, or faster, and then the next two years after that, and the next two years after that, and the next two years after that. Something has to give.</span></p>
<p><span style="font-weight: 400;">Either progress is going to slow down for some reason, or we&#8217;re going to see some crazy systems really, really soon. Systems that can optimize for very complex goals, that we can have assistance that we tell them to do a thing, and then they can log on to the Internet and just do those tasks. Now, we imagine we have these systems, they’re more and more powerful. Now, say we have some really big corporation, Google, or OpenAI, or whatever. The biggest system of these kinds ever, something that’s so powerful, it&#8217;s smarter than humans. It runs a million times faster. It&#8217;s read all books in history. It can do perfect IMO gold medal in mathematics, etc., etc. Then you give it some goal like, okay, make maximum profit. What will such a system do?</span></p>
<p><span style="font-weight: 400;">I think, if you meditate on that question a bit, the obvious things are not always good, or even mostly not good. I mean, if you&#8217;re trying to maximize share price of your company, well, why not just hack the stock exchange? Why not just put a gun to the stock exchange CEO’s head and say, “Increase my price right now.” Why not do all kinds of crazy things, blackmail people, or manipulate people, or that create a huge propaganda campaign?</span></p>
<p><b>[00:22:05]</b> <b>SM:</b><span style="font-weight: 400;"> Right. For humans, we have set norms and laws to prevent that.</span></p>
<p><b>[00:22:09]</b> <b>CL:</b><span style="font-weight: 400;"> Those also don&#8217;t always work. We have corporations doing illegal and bad things all the time. Well, now, let&#8217;s imagine we have a corporation that&#8217;s also a thousand times smarter than any human. It&#8217;s better at hacking. It&#8217;s better at coming up with plans. It&#8217;s better at propaganda. It can generate images and videos and voices, can impersonate anyone. These are all things that AIs already do. None of this is really science fiction, except the planning. We can already imitate voices. We can already generate arbitrary images.</span></p>
<p><span style="font-weight: 400;">There&#8217;s already hacking tools that use AI. There&#8217;s already math solving AI. All this stuff is already real. Now, we just have this put the parts together in our head and extrapolate. Then, it&#8217;s pretty clear that this could pretty easily lead to some pretty scary scenarios really quickly.</span></p>
<p><b>[00:22:55]</b> <b>SM:</b><span style="font-weight: 400;"> Absolutely. Yes. Without going even, as you were saying, too far out in the future, there are already cases where we can&#8217;t really distinguish the actions of a human versus the ones of an AI.</span></p>
<p><span style="font-weight: 400;">[MESSAGE]</span></p>
<p><b>[00:23:11]</b> <b>SM:</b><span style="font-weight: 400;"> Deep Dive is supported by our sponsor, DataStax. DataStax is the real-time data company. With DataStax, any enterprise can mobilize real-time data and quickly build the smart, highly scalable applications required to become a data-driven business and unlock the full potential of AI. With AstraDB and Astra streaming, DataStax uniquely delivers the power of Apache Cassandra, the world&#8217;s most scalable database, with the advanced Apache pulsar streaming technology in an open data stack available on any cloud.</span></p>
<p><span style="font-weight: 400;">DataStax lives the open-source cycle of innovation every day, in an emerging AI everywhere future. Learn more at datastax.com.</span></p>
<p><b>[00:23:51]</b> <b>ANNOUNCER:</b><span style="font-weight: 400;"> No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></p>
<p><span style="font-weight: 400;">[INTERVIEW CONTINUED]</span></p>
<p><b>[00:23:56]</b> <b>SM:</b><span style="font-weight: 400;"> Now, I wanted to go back a little bit to the power of AI and the risks that it poses, and talk about the mitigations. What do you think we should be doing as a society to make sure that these systems don&#8217;t spin out of control, but without stopping progress? We can&#8217;t say, don&#8217;t do AI anymore.</span></p>
<p><b>[00:24:14]</b> <b>CL:</b><span style="font-weight: 400;"> Telling people not to do AI is hopeless. People can&#8217;t coordinate around stuff like that. It&#8217;s way too profitable. There&#8217;s this archetype of the scientist for whom a problem, even if it&#8217;s bad, is just too sweet not to solve. There are quite a number of people in the AI community who themselves have admitted, at several points in time, “Yeah, this might be dangerous, but I can&#8217;t help myself. It&#8217;s just too cool. I have to do it.” John von Neumann is quite known for having said stuff like that. Several modern AI people I&#8217;m not going to name have said similar things in public in the past.</span></p>
<p><span style="font-weight: 400;">Obviously, just shutting down AI or something, it&#8217;s neither feasible nor desirable. AI is also the most powerful technology of our time to improve our lives, to allow us to address tons of problems that we are currently facing. I think a massive amount, maybe even the majority of problems in our society, the bottlenecks to solving them is more intelligence. If we could just solve science faster, if we could just develop cures faster, if we could just do all these things faster and more efficiently, we could improve society immeasurably.</span></p>
<p><span style="font-weight: 400;">Imagine if our scientists just worked a 100 times faster. That would be insane. We would live in just in such an incredible world. That would make the world so much better, more than almost anything else in the history of mankind, if we could do this. Clearly, as tempting as that is, it is a double-edged sword. AI is a tool. It is not good nor evil. It’s just a tool. It&#8217;s just a technology. It is a system that can be both good and bad. I think abuse is definitely a possibility, but I&#8217;m far more concerned about the accident-type scenario.</span></p>
<p><span style="font-weight: 400;">We have totally well-intentioned people trying to build, like an AI scientist or something. They just aren&#8217;t careful, or just not aware that this could go wrong in some way, and accidentally build a system that does something totally different. Before they notice that something&#8217;s wrong, it&#8217;s already escaped onto the internet.</span></p>
<p><b>[00:26:08]</b> <b>SM:</b><span style="font-weight: 400;"> We have Skynet. No, that&#8217;s a movie.</span></p>
<p><b>[00:26:11]</b> <b>CL:</b><span style="font-weight: 400;"> The government is already – and militaries are already eyeing AI everywhere. Think of it this way. First, there is a technical, solvable problem, which is called the alignment problem, which is the problem of how do we get an AI system to actually do what we want? This sounds trivial, but it&#8217;s actually really hard. Because what a human actually wants usually isn&#8217;t exactly what they say.</span></p>
<p><span style="font-weight: 400;">There’s all these hidden assumptions. If I tell my robot to go get me coffee, the only thing the robot wants to do is to get coffee, hypothetically. It wants to go and get the coffee as quickly as possible, so it’ll run through the wall, run over my cat, throw grandma out of the way to get to the coffee machine as fast as possible. Then if I run up, “No, no, no. Bad robot,” and I try to shut it off, what will happen? The robot will stop me from hitting the off button. Not because it&#8217;s conscious or it has a will to live. No, it will simply be because the robot wants to get the coffee. If it&#8217;s shut off, it can&#8217;t get me coffee. It will resist. It will actively fight me to get the coffee, which is of course, silly. Of course, it’s silly.</span></p>
<p><span style="font-weight: 400;">You can imagine how systems that are deployed in the wild could get this property of, well, if it&#8217;s maximizing profit, well, shutting it down will not maximize profits. It better have a few backup copies running in the cloud, so it can&#8217;t be shut off. Then you have all these kinds of scary scenarios. What I personally focus on, it&#8217;s also what I do at Conjecture, is focusing on this technical problem. Okay, how could I even build a system that understands, do not do those things, that lets itself be shut off? That understands that, when I say, “Get the coffee,” I also mean, “Don&#8217;t run over grandma,” that understands that these are what I mean by that. It doesn&#8217;t do crazy, insane things whenever I ask for normal things. This is a really hard technical problem. Really hard. It sounds easy, but the deeper you go into it, the more you&#8217;re like, “Oh, shit. This is genuinely difficult and confusing as hell.” Because humans are confusing, right?</span></p>
<p><b>[00:28:28]</b> <b>SM:</b><span style="font-weight: 400;"> Right. Yeah, yeah.</span></p>
<p><b>[00:28:29]</b> <b>CL:</b><span style="font-weight: 400;"> We want all kinds of weird things. Humans are confusing, and the world is confusing. Yeah, things are complicated. The one thing I would want is just to have a few more smart people working on this problem. I&#8217;m not even saying everyone should work on this. I&#8217;m not saying everyone should drop everything to work on this. But a few top professors could consider working on this problem. It&#8217;s a pretty cool problem. It&#8217;s an important problem. It&#8217;s clearly something that top AI professors would be perfectly suited to work on, and somehow very few are working on it. There&#8217;s a very small number of people working on this problem.</span></p>
<p><b>[00:29:06]</b> <b>SM:</b><span style="font-weight: 400;"> If I understand correctly, you&#8217;re thinking of Asimov&#8217;s laws of robotics embedded inside AI?</span></p>
<p><b>[00:29:14]</b> <b>CL:</b><span style="font-weight: 400;"> Unfortunately, the three laws of robotics were selected for making interesting stories, not for actually working.</span></p>
<p><b>[00:29:20]</b> <b>SM:</b><span style="font-weight: 400;"> Absolutely.</span></p>
<p><b>[00:29:22]</b> <b>CL:</b><span style="font-weight: 400;"> Obviously, that was [inaudible 00:29:22]. Something like that that does work would be great.</span></p>
<p><b>[00:29:27]</b> <b>SM:</b><span style="font-weight: 400;"> Something that works, that can prevent – if I may try to summarize, to see if I understand correctly, you&#8217;re basically thinking of solving the problem of embedding some safeguards inside the code itself, inside the machines themselves, so that we can predict and expect that, if the shutdown lever is pulled, it actually shuts down.</span></p>
<p><b>[00:29:49]</b> <b>CL:</b><span style="font-weight: 400;"> That&#8217;s what&#8217;s known as the stop button problem. If someone found a solution to the stop button problem, I would be over the moon. I would be so happy, because it&#8217;s genuinely very hard. It&#8217;s very, very hard to make a robot that is truly indifferent to being shut off. Because usually, what happens is either they try to resist being shut down, or they become suicidal and instantly shut themselves off. It&#8217;s very hard. No one knows how to do this. No one knows how to build a robot that doesn&#8217;t care, that will let you shut it down, that will not resist you, but also won&#8217;t shut itself down on its own. No one knows how to do this currently, mathematically.</span></p>
<p><span style="font-weight: 400;">I don&#8217;t think that&#8217;s the whole problem. I think there&#8217;s more problems, like inferring human preferences, like all these unspoken things. Having the conservatism, avoiding robots doing some crazy things, whatever. To be clear, I say robots, I don&#8217;t actually expect it to be robots. Like, artificial systems that are GPT-3 programs. Just robots is more evocative. Yeah, I think there&#8217;s a bunch of problems here that we just really don&#8217;t have answers to, but it seems like we should be working on.</span></p>
<p><span style="font-weight: 400;">The stop button problem is a pretty clear problem that just – more people should be trying to solve this. I think, in the whole world, there’s maybe 200 people working on this problem in total, as far as I&#8217;m aware, which seems like, there should be a few more people take this problem seriously. If they find out there&#8217;s a simple solution to it, great, awesome. Then I&#8217;m the happiest man alive. Let&#8217;s go. Currently, the way things look, there&#8217;s a lot of these problems that we don&#8217;t have answers for, and that’s kind of scary.</span></p>
<p><b>[00:31:19]</b> <b>SM:</b><span style="font-weight: 400;"> Yeah, it&#8217;s an interesting problem, but it&#8217;s probably not as sexy as others.</span></p>
<p><b>[00:31:21]</b> <b>CL:</b><span style="font-weight: 400;"> Yeah. It&#8217;s much more fun to build the bigger system that solves all the problems, and is faster than all the other ones, and that you make a lot of money off of and raise a lot of VC money with. Of course, it&#8217;s more fun to build bigger and bigger and bigger things. I totally get that. I have been guilty of this myself in the past. Ultimately, if the thing doesn&#8217;t do what you want it to do, that&#8217;s going to be a problem.</span></p>
<p><b>[00:31:46]</b> <b>SM:</b><span style="font-weight: 400;"> Sounds a little bit like the same problem that computer software has with security. It&#8217;s always an afterthought, because it&#8217;s a net cost rather than something that&#8217;s immediately perceived as bringing value.</span></p>
<p><b>[00:31:56]</b> <b>CL:</b><span style="font-weight: 400;"> Yeah. It&#8217;s funny you bring that up. If I had one message to the wider world, who may or may not listen to this, one group of people that I wish would work on this problem, and as far as I can tell aren&#8217;t, is security hackers. People working in computer security applying their minds to AI safety is a clear fit. It&#8217;s a classic security problem: how do we get these systems to behave the way we want them to and not the way we don&#8217;t want them to? It&#8217;s a very novel, hard problem. You have to solve all kinds of new challenges here. It seems like a perfect fit for the computer security community. I would love to see more people from the computer security world trying to tackle this problem.</span></p>
<p><b>[00:32:38]</b> <b>SM:</b><span style="font-weight: 400;"> How, for example, would hackers fix a bug inside a model?</span></p>
<p><b>[00:32:42]</b> <b>CL:</b><span style="font-weight: 400;"> Well, currently, we don&#8217;t know. Someone should try.</span></p>
<p><b>[00:32:45]</b> <b>SM:</b><span style="font-weight: 400;"> Okay.</span></p>
<p><b>[00:32:46]</b> <b>CL:</b><span style="font-weight: 400;"> We&#8217;re at the point where we have these super complicated systems, GPT, used in the wild and whatever, and we just have no idea what&#8217;s going on inside of them. We have some ideas. It&#8217;s not like we can look at the code. That doesn&#8217;t tell us anything. Not really. There are all these weird things happening internally. What is the computation internally doing? It&#8217;s not currently possible for us to say, “Oh, we see a failure case in our model. Oh, that&#8217;s not good,” and then go into the model and fix it. We can&#8217;t currently do this.</span></p>
<p><span style="font-weight: 400;">I don&#8217;t think this is a fundamental problem. I think if we develop the tools and the technologies, this is a thing we could learn how to do. There&#8217;s already some very early work in this direction. David Bau’s lab at MIT, for example, has published a paper not too long ago, where they managed to edit memories of language models. They, for example, made a GPT model believe that the Eiffel Tower is in Rome, instead of Paris, which is incredibly cool. That&#8217;s incredibly cool.</span></p>
<p><span style="font-weight: 400;">This is obviously how we should develop tools. Tools like this, where we can look at the memories, or edit them, and we can see how the internals of these models work. That&#8217;s a lot of what we do with Conjecture. We work on interpretability research. We try to take the inner parts of these networks, decompose them into understandable bits, and then see how can we see where failure modes come from? How can we edit these things? How can we manipulate them? How can we test them for safety features and so on?</span></p>
<p><span style="font-weight: 400;">This is a very early work. If you&#8217;re a young career researcher looking for some low-hanging fruit that haven&#8217;t yet been plucked, there is just an orchard. There’s a massive orchard of low-hanging fruit in interpretability and AI safety. I have such a huge list of projects that I wish we could do. I just don&#8217;t have enough time and I don&#8217;t have enough engineers to do. I think, it&#8217;s incredibly promising.</span></p>
<p><b>[00:34:32]</b> <b>SM:</b><span style="font-weight: 400;"> This is pretty awesome, because you&#8217;re basically leaving us with a positive note. One of the concerns that other speakers and people I&#8217;ve talked to have highlighted is how incredibly opaque these systems are. Once you build the model, you have a hard time, unless you retrain, which could be expensive. Say the output from your GPT-3-like model is too sarcastic, or abuses commas and doesn&#8217;t know how to use punctuation correctly. How do you fix it without having to retrain the whole thing? Which is, as you were saying, expensive. You&#8217;re basically saying that there are ways, that there&#8217;s research going in the direction of looking into these artificial synapses and connections, and tweaking them in ways we can predict, to fix them.</span></p>
<p><b>[00:35:23]</b> <b>CL:</b><span style="font-weight: 400;"> Yeah. There&#8217;s this meme that&#8217;s been around in the AI community for quite a while that neural networks are complete black boxes, that it&#8217;s impossible to understand what&#8217;s going on. That is just false. That is just not true. I have overwhelming evidence that it is just completely false. There is so much structure inside of neural networks. There are so many things you can understand inside of them. That doesn&#8217;t mean it&#8217;s easy. This is a very nascent field of research. I was skeptical about this too, two or three years ago. Now that I&#8217;ve actually worked on the problem for a while, and I&#8217;ve seen other people work on it, I&#8217;m like, wow. Every time we put the effort in and try to take these things apart and look at the different parts, there&#8217;s so much low-hanging fruit. There&#8217;s so much to be found. And there are so few people working on this problem.</span></p>
<p><span style="font-weight: 400;">There&#8217;s really just a handful of groups in the whole world really saying like, “Nope. We&#8217;re just going to try. We&#8217;re going to try to take these things apart.” I expect, over the next couple years, including some of the work, hopefully, from Conjecture, will show the computational primitives and the internal structure of these things that will allow us to look much more selectively understand what is going inside them. Can we edit these things as such? Will it be perfect? No, probably not.</span></p>
<p><b>[00:36:31]</b> <b>SM:</b><span style="font-weight: 400;"> Probably not.</span></p>
<p><b>[00:36:31]</b> <b>CL:</b><span style="font-weight: 400;"> I think there&#8217;s a massive amount of promise here that we&#8217;re just starting to unearth. I think there&#8217;s real reason for optimism there. Will this solve the whole safety problem? Of course not.</span></p>
<p><b>[00:36:44]</b> <b>SM:</b><span style="font-weight: 400;"> Right. It&#8217;s a step.</span></p>
<p><b>[00:36:45]</b> <b>CL:</b><span style="font-weight: 400;"> It&#8217;s a really promising step forward. We at Conjecture work on this problem quite a lot, and we&#8217;re hoping to publish some of our results pretty soon, which I&#8217;m pretty excited about. We really just tried to look, and we found all this structure, all these pieces you can understand and take apart. There are groups out there that really are taking this seriously. I&#8217;m very optimistic that we will come to understand neural networks as white boxes, or at least far better than we do now, very soon.</span></p>
<p><b>[00:37:10]</b> <b>SM:</b><span style="font-weight: 400;"> That&#8217;s great to hear. What kind of resources do you need in order to work on that problem? We were saying that to train a model you need a lot of data, engineering capacity, and compute. To investigate inside the neural network, what do you need?</span></p>
<p><b>[00:37:26]</b> <b>CL:</b><span style="font-weight: 400;"> Creativity. You need to be creative, because it&#8217;s a new field of research. Every time you come into a new field of research, you need to be creative. You need to come up with new ways of thinking about a problem. Luckily, you need far fewer resources than you do to train these models, because you can use pre-trained models. You will need a GPU or something to do some of the research. That&#8217;s unfortunately just the nature of ML research.</span></p>
<p><span style="font-weight: 400;">There&#8217;s a ton of research you can do by using EleutherAI models, for example, so you don&#8217;t have to retrain them from scratch. You can just use the feeder, and you can study the internal parts and do lots of interesting operations. The one thing you will definitely need is you need to actually learn the math. You need to actually know linear algebra, and you have to actually look at the internals of the model.</span></p>
<p><span style="font-weight: 400;">In the ML, we&#8217;ve gotten lazy. We’ve gotten lazy. We let the PyTorch handle all the linear algebra and stuff, and then we lose a real deep tacit understanding what&#8217;s going inside the model. Actually looking inside the models, what did these numbers mean? How do they combine? What&#8217;s the linear algebra here? This is not grad school level math. This is all undergrad level math, but really understanding what is going on inside the network and what is actually going on inside. Being comfortable with undergrad level, linear algebra and stuff like that is, I think, just incredibly undervalued. If you&#8217;re a postdoc, massive IQ, statistician, or algebraist, why not just take a shot at neural networks and see what the structure is inside of them? If you&#8217;re a formal systems PhD, and you have all this knowledge about formal languages and computability and whatever, why don&#8217;t you take a look at a GPT network and see what’s a transformer’s encode internally? Where the complexity properties? All these things I expect to lead to very interesting research.</span></p>
<p><b>[00:39:01]</b> <b>SM:</b><span style="font-weight: 400;"> Wonderful. This is a call for the next generation of geeks out there: math and linear algebra are what will prevent us from getting into a Skynet situation. All right. Connor, thank you.</span></p>
<p><b>[00:39:15]</b> <b>CL:</b><span style="font-weight: 400;"> My one shout-out is: take your shot at linear algebra, interpretability, and understanding neural networks. And a shout-out to the computer security world out there: your skills are more needed than ever, and I think this is going to be an extremely valuable field. Conjecture is not currently hiring. Hopefully, in the near future, we will be. If you&#8217;re someone who&#8217;s very interested in interpretability and safety, and/or an experienced computer security expert, you might be someone we want to talk to. Please feel free to reach out to us at conjecture.dev.</span></p>
<p><span style="font-weight: 400;">[END OF INTERVIEW]</span></p>
<p><b>[00:39:49]</b> <b>SM:</b><span style="font-weight: 400;"> Thanks for listening. Thanks to our sponsor, Google. Remember to subscribe on your podcast player for more episodes. Please review and share; it helps more people find us. Visit deepdive.opensource.org, where you&#8217;ll find more episodes, learn about these issues, and can donate to become a member. Members are the only reason we can do this work. If you have any feedback on this episode, or on Deep Dive: AI in general, please email contact@opensource.org.</span></p>
<p><span style="font-weight: 400;">This podcast was produced by the Open Source Initiative, with help from Nicole Martinelli. Music by Jason Shaw of audionautix.com, under a Creative Commons Attribution 4.0 International license. Links in the episode notes.</span></p>
<p><b>[00:40:31]</b> <b>ANNOUNCER:</b><span style="font-weight: 400;"> The views expressed in this podcast are the personal views of the speakers and are not the views of their employers, the organizations they are affiliated with, their clients, or their customers. The information provided is not legal advice. No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></p>
<p><span style="font-weight: 400;">[END]</span></p>
]]></content:encoded>
					
<wfw:commentRss>https://opensource.org/blog/episode-6-transcript/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">17099</post-id>	</item>
		<item>
		<title>Episode 2: transcript</title>
		<link>https://opensource.org/blog/episode-2-transcript</link>
					<comments>https://opensource.org/blog/episode-2-transcript#respond</comments>
		
		<dc:creator><![CDATA[Ariel Jolo]]></dc:creator>
		<pubDate>Tue, 23 Aug 2022 00:00:00 +0000</pubDate>
				<category><![CDATA[Transcript]]></category>
		<guid isPermaLink="false">https://opensource.org/2022/08/23/episode-2-transcript/</guid>

					<description><![CDATA[“AT: We know that a lot of the technological stack of AI systems is open. It&#8217;s based — funded on open code. That doesn&#8217;t solve any of the problems of...]]></description>
										<content:encoded><![CDATA[<p><i><span style="font-weight: 400;">“</span></i><b><i>AT:</i></b><i><span style="font-weight: 400;"> We know that a lot of the technological stack of AI systems is open. It&#8217;s based — funded on open code. That doesn&#8217;t solve any of the problems of the black boxes we discussed of possible harms. I think we need to take the spirit of open source, of openness, but really look for some new solutions.”</span></i></p>
<p><span style="font-weight: 400;">[INTRODUCTION]</span></p>
<p><b>[00:00:22]</b> <b>SM:</b><span style="font-weight: 400;"> Welcome to Deep Dive: AI, a podcast from the Open Source Initiative. We&#8217;ll be exploring how artificial intelligence impacts free and open-source software, from developers to businesses, to the rest of us.</span></p>
<p><span style="font-weight: 400;">[SPONSOR MESSAGE]</span></p>
<p><b>[00:00:37]</b> <b>SM:</b><span style="font-weight: 400;"> Deep Dive: AI is supported by our sponsor, GitHub. Open-source AI frameworks and models will drive transformational impact into the next era of software; evolving every industry, democratizing knowledge, and lowering barriers to becoming a developer. </span></p>
<p><span style="font-weight: 400;">As this evolution continues, GitHub is excited to engage and support OSI’s deep dive into AI and open source and welcomes everyone to contribute to the conversation.</span></p>
<p><em><span style="font-weight: 400;">No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></em></p>
<p><span style="font-weight: 400;">[INTERVIEW]</span></p>
<p><b>[00:01:08]</b> <b>SM:</b><span style="font-weight: 400;"> Welcome, everyone. Today we are meeting with Alek Tarkowski, Director of Strategy at the Open Future Foundation, a European think tank for the open movement. He is a sociologist, an activist, and a strategist, long active in social movements. He is also on the board of Creative Commons. Welcome, Alek. Thank you for giving us your time.</span></p>
<p><b>[00:01:30]</b> <b>AT:</b><span style="font-weight: 400;"> Hello, Stefano. Thank you for the invitation.</span></p>
<p><b>[00:01:33]</b> <b>SM:</b><span style="font-weight: 400;"> Let&#8217;s start by talking about artificial intelligence and how it&#8217;s affecting everyday life. What applications do you see being deployed into society? How is that affecting real people?</span></p>
<p><b>[00:01:46]</b> <b>AT:</b><span style="font-weight: 400;"> We certainly live at an interesting time, a time of technological change, which has probably been the case for as long as I&#8217;ve been an adult, for 30 years, but there is a sense that there&#8217;s something new, right? These so-called AI technologies are really different from the previous waves of internet technologies.</span></p>
<p><span style="font-weight: 400;">Now, I think the interesting thing is that you said we can see them. Actually, I think the trick is that in many cases, we do not see them. There&#8217;s a lot of also confusion, what is in AI technology, and what — and there&#8217;s some also confusion between what is already happening and what we&#8217;re expecting to happen, which I think is typical of these emergent technologies. They function somewhere between fiction, prototype, deployment, and mainstream, right? That&#8217;s the curve they&#8217;re on. </span></p>
<p><b>[00:02:32]</b> <b>SM:</b><span style="font-weight: 400;"> Some magic here and there. </span></p>
<p><b>[00:02:33]</b> <b>AT:</b><span style="font-weight: 400;"> There&#8217;s a lot of magic, and a lot of people like to sprinkle on even more magic than there is. I also like a term that&#8217;s sometimes used as an alternative to artificial intelligence, which is automated decision-making. It sounds a bit technical and, of course, means something different, but I think it&#8217;s a useful term to deploy. Basically, automated decision-making says there are situations where humans no longer decide about you; some systems do that. In situations where traditionally some bureaucrat in the city would decide, say, which school your child will go to, or whether you are eligible for some social support, more and more often this will be done by automated systems, of which artificial intelligence systems are one specific category.</span></p>
<p><b>[00:03:17]</b> <b>SM:</b><span style="font-weight: 400;"> Yeah. It&#8217;s really fantastic how that term conveys very clearly what we&#8217;re talking about: an AI system that is making decisions for you. It&#8217;s really clear, rather than magic.</span></p>
<p><b>[00:03:29]</b> <b>AT:</b><span style="font-weight: 400;"> Yeah. I like it because it also connects the futuristic conversation about AI with issues we&#8217;re already aware of. There are many situations where you don&#8217;t need advanced technologies, but decisions are still made not by humans. There were cases in Poland where, for instance, decisions about providing unemployment benefits were made by what was dubbed an automated system. There was even a big investigation by the Panoptykon Foundation. They discovered that the system was actually an algorithm that could be described in a spreadsheet.</span></p>
<p><span style="font-weight: 400;">It was a really super simple system. It gathered input in the shape of 10 questions. The public official asks the person applying for the benefits and turned out some very basic functions to say yes or no. The question was, so is this okay, because there wasn&#8217;t really any AI hiding inside? There was suspicion that there is, but it wasn&#8217;t confirmed. I think the answer was, yes, it&#8217;s still an issue. More importantly, more and more often, these systems are actually AI-powered, right? There is some component of machine learning happening inside. We should probably expect that in the coming years, there&#8217;ll be more and more such systems.</span></p>
<p><b>[00:04:40]</b> <b>SM:</b><span style="font-weight: 400;"> Yeah. You introduced a very important piece there: the Polish unemployment system turned out to have a very simple algorithm that one could inspect, that was easy to understand, and maybe even fix, to spot unfairness or mistakes. With proper AI systems, that&#8217;s really not available. It&#8217;s one of those obscurities: the insides of neural networks, especially, are hard to diagnose and hard to investigate in order to predict their outcomes. In this case, we have real life being impacted by automated decision-making. How are regulators approaching this issue? Do they know? Are they noticing?</span></p>
<p><b>[00:05:23]</b> <b>AT:</b><span style="font-weight: 400;"> At least in Europe, they do. I think it&#8217;s a good time to do it, because I believe these AI systems are not yet deployed en masse. You make a valid point; that&#8217;s the big challenge with them. The issues are the same as with all automation, but the ways of addressing them are a lot harder, because there&#8217;s this, I think, beautiful symbol of the black box, right? It hides things inside. That&#8217;s exactly the case with these systems. Their complexity basically makes them much harder to analyze, to assess the impact of, and so on.</span></p>
<p><span style="font-weight: 400;">I participated in a study done by AlgorithmWatch, a German Foundation, called Automating Society, which looked at cases of the deployment of such AI systems in Europe. To be honest, there are too many, yet at least ones that are publicly known. Here, a caveat is probably needed. You don&#8217;t find them if you look, let&#8217;s say in cities, in small companies, in a national government, but obviously, then there are these huge platforms, which we know are extensively and more and more employing machine learning mechanisms, but in a way that is not really clear. We all use search engines that by now are so almost certainly AI-powered. </span></p>
<p><span style="font-weight: 400;">We use social networks that filter content, most probably with AI technologies. Again, the black box appears, there&#8217;s no certainty. It&#8217;s a bit of a weird moment, you can both say, “No, I don&#8217;t really see these technologies around me.” You can probably be just as correct in saying, “Hey, they&#8217;re everywhere. Any app you choose in your phone, they might be there.”</span></p>
<p><b>[00:06:56]</b> <b>SM:</b><span style="font-weight: 400;"> Right, exactly. I mean, from routing yourself through the streets of an unknown city to calling a cab. I&#8217;ve always wondered how many of my decisions to walk down a street or drive down a street were driven by advertisers at some point. The doubt remains, because we really don&#8217;t know from the outside. There is no label applied to these applications to say this is being influenced by this and that.</span></p>
<p><span style="font-weight: 400;">So you mentioned that the European Union seems to be extremely active on the AI front and they published the AI Act. Can you give us a little bit of an overview of what that is or what stage it is?</span></p>
<p><b>[00:07:34]</b> <b>AT:</b><span style="font-weight: 400;"> The AI Act proposal was published last year as one in a series of several regulatory measures; it&#8217;s really a big European regulatory push on digital, right? We have the recently adopted Digital Markets Act and Digital Services Act, which regulate platforms. We have a whole package of data governance mechanisms, which, by the way, I think do connect with AI conversations. Then we have the AI Act. I mentioned the Automating Society study, which showed that there aren&#8217;t that many systems deployed yet; I think that shows this is exactly the right moment to have a conversation about AI regulation.</span></p>
<p><span style="font-weight: 400;">I&#8217;m happy it&#8217;s happening right now because, as we know, policies are deployed more slowly than technology. You need to give them time. I think Europe by passing these laws is giving itself time. The question is, of course, what kind of regulation? Is it a good regulation? Let&#8217;s maybe briefly go over the document what it proposes basically, I think the key category, there is risk. Really Europe has been developing an approach they call Trustworthy AI. Then this approach, the biggest question for AI to be trustworthy is whether it is risky or not. The regulation doesn&#8217;t really cover all cases of AI. It really focuses on two issues. What kinds of AI uses or technologies are so dangerous that they should be outright banned? What technologies or context in which they are used are high risk? So are risky to an extent that you really need to regulate their use. </span></p>
<p><span style="font-weight: 400;">This is basically what this Act tries to do. In terms of the users that are banned. It&#8217;s a very short list. It includes subliminal distortion which is a bit almost science fiction sounding category, but really, and this is interesting with the regulator believes there are uses of AI that can subliminally affect humans and this should be banned. But then the more realistic ones are banned on social scoring and the mechanisms that usually are described as being deployed in China where as a citizen, you get scored on how good the person you are. These are meant to be banned in Europe.</span></p>
<p><span style="font-weight: 400;">Banned on real time biometric identification, so o basically technologies that take data from all the cameras deployed in urban spaces and that lets you identify people in the real-time. These are meant to be banned, although there are some carve-outs basically for public security. Then the last category is technologies that exploit vulnerabilities. So find some ways maybe to make elderly people do something against their wish, making use of the fact that they don&#8217;t understand technology. These really high-level risks. You can argue whether it&#8217;s a good list or not. I am happy that Europe is thinking about banning for instance, social scoring.</span></p>
<p><b>[00:10:29]</b> <b>SM:</b><span style="font-weight: 400;"> That&#8217;s very interesting. I wonder how it would be classified if you have five stars as a customer.</span></p>
<p><b>[00:10:35]</b> <b>AT:</b><span style="font-weight: 400;"> Whether getting a discount then would fall into that category, or whether it&#8217;s just scoring by the government that is prohibited. My understanding is that it&#8217;s mainly about the government, and also this idea of starting to combine sources. This is, I think, the big risk: you have some five-star rating of how well you party in the club, which I could imagine happening, then that data is shared with your employer, and further on it goes to determine your, I don&#8217;t know, health portfolio, right? All these scenarios, I think, are the dangerous ones. But really, with social scoring the problem is that someone actually gives you a score. There are reports that in China, for instance, they&#8217;re considering that the score could determine whether you get a passport, right? Some basic rights are limited.</span></p>
<p><b>[00:11:23]</b> <b>SM:</b><span style="font-weight: 400;"> Yeah. I&#8217;ve seen similar experiments also starting in some cities in Italy, where they were scoring citizens on how well they behaved in certain settings, so that they could get discounts on taxes like trash disposal and things like that. I don&#8217;t know how much AI was involved there, but in any case, a very scary proposition.</span></p>
<p><b>[00:11:45]</b> <b>AT:</b><span style="font-weight: 400;"> I think the broader category is that of high-risk situations. Some of the things they&#8217;re considering are uses of AI in the context of employment, all kinds of work-related HR decisions; uses in education and vocational training, scoring students on how well they study and trying to determine who they will become; and uses in law enforcement, such as these ideas of predictive policing. There are high-profile cases from the US where they attempted to use previous data on crimes and arrests to determine who will commit a crime again.</span></p>
<p><span style="font-weight: 400;">People seem to love this scenario because they come out of science fiction movies, but they really are rife with risks that affect basically basic rights of citizens, right? If a system deems you guilty, basically, before you commit a crime, I think that&#8217;s really serious. Another area that is high risk is migration. I live in Poland, where we just have the huge wave of refugees from Ukraine. I think this suddenly becomes very relevant. These are people, again, who are very vulnerable and their ideas how to deploy systems that can really be, basically inhuman. The last category is justice and democracy, again a very fundamental issue that you shouldn&#8217;t toy with democracy using AI technologies.</span></p>
<p><b>[00:13:07]</b> <b>SM:</b><span style="font-weight: 400;"> This is one part of the AI Act. It&#8217;s about the technologies that are too dangerous.</span></p>
<p><b>[00:13:12]</b> <b>AT:</b><span style="font-weight: 400;"> Then the whole question is: what regulation do you introduce? The proposal comes with a list of measures that, to summarize, includes basically three things. First, impact assessments and monitoring of the deployment of these systems, so they&#8217;re not left alone, so someone asks: what will happen if I introduce these technologies into a school system? Second is transparency. You mentioned labels. Am I aware that there&#8217;s an AI system there? Am I aware of its decisions? Can I maybe be told on what basis it made a decision?</span></p>
<p><span style="font-weight: 400;">The third category is human oversight. In what situations for instance, you might be able to ask, hello, I would this decision to be reviewed by a human, right? Instead of a machine. It of course, gets more complicated, but basically Europe is thinking that when a use of AI can be called high risk, these sorts of regulations start to apply. Of course, the huge debate that started immediately and has been running for the last year is whether these measures are sufficient. There&#8217;s of course, a group of people who say no, we need much harder protections of persons, basic rights. Of course, there&#8217;s another column that says these measures were curb innovation too much. There&#8217;s a now quite intense policy debate happening on this.</span></p>
<p><b>[00:14:32]</b> <b>SM:</b><span style="font-weight: 400;"> Who&#8217;s participating in these debates? What kinds of people and groups do you feel are influencing the conversation in Europe?</span></p>
<p><b>[00:14:40]</b> <b>AT:</b><span style="font-weight: 400;"> In a way, it&#8217;s almost a cliché, like in most policy debates: you have the industry and the activists. These are the two strongest forces, and I say activists on purpose, because I think the challenge with this regulation, as with many digital policies, is that it is very hard to involve everyday people. Europe actually ran a very interesting project last year called the Conference on the Future of Europe, where it really gave voice to citizens through different means. You could submit proposals online, which were then all taken into account in so-called citizens&#8217; assemblies.</span></p>
<p><span style="font-weight: 400;">These situations where they really selected random Europeans served the function a bit like members of the Parliament. Of course, they didn&#8217;t pass laws, but hey agreed on the set of recommendations that were sent to the European Commission that the Commission promised at least to look at them. What happened there, which I think is very telling is that people were given really a broad range of issues. You&#8217;d be happy to hear that there was a lot of proposals on open-source policies that was somehow very strong, but ultimately, the outcome showed that average European really gets the message about privacy. There was an ask for privacy to be respected. </span></p>
<p><span style="font-weight: 400;">Still the basic issue of providing access to internet and technology is for people important, but other than that, they didn&#8217;t mention any of the things are being regulated. They don&#8217;t mention platforms. They don&#8217;t mention AI. They don&#8217;t mention data. Why? I think it&#8217;s just too complex. This role is played by civil society activists, who are mainly digital rights activists. The policies they focus on are those that will protect basic rights, protect citizens. I think they are a very strong voice in the debate. </span></p>
<p><span style="font-weight: 400;">Obviously, industry is the other voice, which has a very, I would say, by now obvious line. Usually they say regulation is bad. I don&#8217;t think anyone in Europe now will say that no regulation of AI is good. I think the sorts of these extremely risky scenarios, there&#8217;s agreement, they should be banned. But then the industry is very quickly ready to say that some of the measures around transparency, around disclosure of how it works is just going to be challenging for innovative business in Europe.</span></p>
<p><span style="font-weight: 400;">[SPONSOR MESSAGE]</span></p>
<p><b>[00:16:54]</b> <b>SM:</b><span style="font-weight: 400;"> Deep Dive: AI is supported by our sponsor, DataStax. DataStax is the real-time data company. With DataStax, any enterprise can mobilize real-time data and quickly build the smart, highly-scalable applications required to become a data-driven business and unlock the full potential of AI. With Astra DB and Astra Streaming, DataStax uniquely delivers the power of Apache Cassandra, the world&#8217;s most scalable database, with the advanced Apache Pulsar streaming technology in an open data stack available on any cloud.</span></p>
<p><span style="font-weight: 400;">DataStax leaves the open-source cycle of innovation every day in an emerging AI everywhere future. Learn more at datastax.com.</span></p>
<p><em><span style="font-weight: 400;">No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></em></p>
<p><span style="font-weight: 400;">[INTERVIEW CONTINUED]</span></p>
<p><b>[00:17:38]</b> <b>SM:</b><span style="font-weight: 400;"> It looks like Europe is really setting different standards from the United States, where no such regulations seem to be on the horizon, at least not at this level. I&#8217;m glad to hear it, because what the Open Source Initiative is trying to do with this research is understand what the frameworks are, the same way we have done with Open Source software. There, we provided a way to identify the basic needs of developers and citizens to enjoy life in the digital space. We&#8217;d like to have something similar at least for AI, to say, “Look, we can innovate, we can do regulation, but we should really pay attention to these specific aspects.”</span></p>
<p><span style="font-weight: 400;">It seems like there are some interesting patterns that are already emerging from these early conversations we had, where we want to be able to inspect, for example, the models and understand what these AI systems are really suggesting. Why are they coming with some decisions? We&#8217;ll need to keep on having this conversation. It&#8217;s not going to be simple to solve. In fact at this point, it will be interesting to understand since these are new technologies and they&#8217;re being introduced now in the markets, and they&#8217;re being regulated. Can you make examples of past regulations that have impacted new technologies as they were coming in?</span></p>
<p><b>[00:18:56]</b> <b>AT:</b><span style="font-weight: 400;"> In Europe, of course, when you ask that, everyone immediately thinks of the GDPR, the regulation that provides data protection rules. It was adopted over five years ago, so it&#8217;s a very good time to see what happens with such a regulation. You have to be really humble about the change it causes, right? It&#8217;s not easy to implement. It requires a lot of effort to bring to life, and, as everyone is willing to admit, sometimes it backfires. Take even a very simple thing: there&#8217;s this technical term that gets thrown around in European policy debates, which is harmonization of law.</span></p>
<p><span style="font-weight: 400;">Basically, what this term says is that we have in Europe almost 30 member states, each one with different law systems and sometimes the EU passes laws that are unified for the whole of you, but sometimes the way it works is that they pass a directive, which then gets rid of adjusted to the local context. Then you can get to a point where you&#8217;re thinking you have one rule, for instance, for giving consent or for regulating AI in education, but you also and find out that it actually works completely different than Italy, Poland than the Netherlands. You get into some huge mess, that&#8217;s a challenge for any company that tries to build Pan-European business. It’s a challenge for citizens to understand. It’s a chance for policymakers. </span></p>
<p><span style="font-weight: 400;">Then you have simple lessons learned like try to harmonize it. Then try to have unified rules. But with this AI regulation, I think what is really interesting and maybe this goes back a bit to your question about, who&#8217;s present in the debate? What I&#8217;m really interested in is these rules for high-risk situations. Our rules are not just meant to protect citizens. I think there&#8217;s rightly so a lot of debate that basically as the questions, how are we going to be safe? How will our rights be protected? I think there are more questions that need to be asked and which Open Source Communities or Open Content Communities are very good at asking which is, how will we make this technology productive in a way that is at the same time sensible, reasonable, and sustainable? </span></p>
<p><span style="font-weight: 400;">I think this is the space where this question should be asked when you think of things like impact assessment or transparency, or labeling, because I think you can use the same tools, interesting ideas. Let&#8217;s say, registers of AI systems, in one approach they are just meant to limit this technology, right? The technology that is seen as risky, maybe even dangerous, but from a different perspective, it simply creates a framework in which can ask questions, how can this technology be used well? Right? Because I think this is something that can be forgotten when we only talk about risks, that there are positive uses of these technologies, there&#8217;s a huge promise that you can find ways of using data to the public benefit, but it requires smart regulation.</span></p>
<p><b>[00:21:43]</b> <b>SM:</b><span style="font-weight: 400;"> Right. We&#8217;ve heard of AI systems that are now capable of predicting how proteins fold, which are already helping researchers in the biotech industry investigate the more promising paths. The machine is not really telling them how to solve the biology problem, but it&#8217;s pointing at the parts that seem most promising. It has some risks, but there is some very helpful and very interesting technology in there. And as you were saying, what Open Source and Open Content communities have been very good at doing for many years is finding good uses and putting good technology into the hands of many developers and many content creators.</span></p>
<p><b>[00:22:31]</b> <b>AT:</b><span style="font-weight: 400;"> I think the trick here is that the goals are the same, but we probably need to think about new tools, right? I come from the open content tradition of Creative Commons, which borrows heavily from Open Source philosophy and methods and basically deployed this most basic but extremely functional tool, the open license. A licensing mechanism has really solved a lot of issues around access and around sharing of content, of intellectual property, of code.</span></p>
<p><span style="font-weight: 400;">I think, this is why this debate about AI is interesting is that at some point, it shows the limits of just saying open it up, right? We know that a lot of the technological stack of AI systems is open, it&#8217;s based funded on open code. That doesn&#8217;t solve any of the problems of the black boxes we discuss of possible harms. I think we need to take the spirit of open source of openness, but really look for some new solutions.</span></p>
<p><b>[00:23:26]</b> <b>SM:</b><span style="font-weight: 400;"> Absolutely. The very basic difference I see is that code and data are very clearly defined and very clearly separated in the context of traditional software. But when it comes to AI, data becomes an input to a model, and the software becomes something that consumes that model. They get entangled in a new way that we need to explore and understand better. We need to help regulators understand that too, because it&#8217;s early. There are very few activists who probably understand exactly what&#8217;s going on inside these systems, and the impact is also different.</span></p>
<p><b>[00:24:03]</b> <b>AT:</b><span style="font-weight: 400;"> Which brings us to the issue of capacity, which I think is important. These policy debates in Europe often focus so much on the law itself that they give participants a sense that the issues really get solved by law, by regulation alone. I don&#8217;t think that&#8217;s true, because one thing law is very good at is, for instance, protecting people and reducing harms. One thing that&#8217;s very hard to do with law is building capacity. You cannot pass a law that says: let&#8217;s have greater capacity in the public sector to understand AI, to deploy AI, to cooperate on shared systems that are open source and have machine learning inside, which I think is the scenario we&#8217;d like to see.</span></p>
<p><span style="font-weight: 400;">That&#8217;s why there&#8217;s always the second side of policies, which is about funding policies, research policies, which is a completely different field, right? I think they should be seen together, because I think the challenge, I&#8217;m sure all of the world faces, but the Europe in particular is how to add these added capacities. We know that these systems are deployed a lot faster by commercial actors, regulations like the AI Act aim to, which I haven&#8217;t mentioned, but the interesting thing, they&#8217;re targeted at those who deploy, develop and deploy these technologies. Not at the users, but basically at the creators, the companies, for instance, from which public sector will most probably lease these technologies in some public-private partnerships. </span></p>
<p><span style="font-weight: 400;">This is one side of the equation. This idea that you need to look what business does with these technologies, but then we look, let’s say, the public sector. If you think about this list of high-risk areas, if you hear about things like education, vocational training, a lot of this is public, right? Sort of, if you try to picture a school system. I really like to think about education and think about the skills of people there, not just an individual school, but let&#8217;s say the city-level school system. Maybe even in the ministry, really, their capacity to deal with complex system is low. Unless we raise it, they will basically be dependent on vendors. </span></p>
<p><span style="font-weight: 400;">Of course, you can have a scenario and this is the AI Act scenario where you then regulate the vendors to make sure they do good, but I’m really also interested in scenarios where we think about how do we build the capacity, right? How do these Open Source systems can be deployed in communities of in the case of education? Okay, it&#8217;s a bit hard to imagine that educators themselves will do it, but maybe there can be some specialized units and experts within the system, who have the capacity to understand these technologies and to work with them.</span></p>
<p><b>[00:26:37]</b> <b>SM:</b><span style="font-weight: 400;"> You mentioned that among the various regulations the European Union has deployed, there is one focusing on data. Can you talk a little bit about that? Because I think it&#8217;s connected.</span></p>
<p><b>[00:26:50]</b> <b>AT:</b><span style="font-weight: 400;"> Indeed, it&#8217;s connected, but not too many people mention it, because policymaking is such a siloed endeavor. You have the AI crowd and then you have the data crowd. Of course there&#8217;s some overlap, but it&#8217;s as if they were two different realities, when in the end, obviously, AI is fueled by data. Europe has a European strategy for data, which has several really bold elements. I find them interesting because they break with the logic that the markets will solve everything, the logic that treats data as property. There&#8217;s an act that has already been passed, called the Data Governance Act, and one currently being discussed, called the Data Act. One of the really bold ideas behind them is this concept called Common Data Spaces.</span></p>
<p><span style="font-weight: 400;">It&#8217;s still a bit vague, but basically, Europe envisions that in key areas, let&#8217;s say health or transportation industry. You will have these interoperable shared digital spaces in which data flows between different actors, probably both private, public, and civic, maybe not completely openly. This will for sure not be open data, maybe not necessarily for free or in the freemium model, but nevertheless, also not in the model where everything is licensed and under appropriate tariff control. </span></p>
<p><span style="font-weight: 400;">This seems to be really influenced by ideas like the commons. Of course, the term here is used vaguely, but basically, where there&#8217;s some governance, some management of data as a common good as a shared good. As I said, the outlines are not clear, but I think this is a really fascinating idea. We get fascinated mainly by technologies. AI is fascinating terms, but sometimes these setups proposed by policies, I think are just as fascinating.</span></p>
<p><b>[00:28:36]</b> <b>SM:</b><span style="font-weight: 400;"> Also, it looks to me like Europe is the only jurisdiction that has ratified a right to data mining. Is that in the copyright act?</span></p>
<p><b>[00:28:46]</b> <b>AT:</b><span style="font-weight: 400;"> It is. Yes, in the copyright directive.</span></p>
<p><b>[00:28:48]</b> <b>SM:</b><span style="font-weight: 400;"> Right. Can you briefly highlight the right to data mining? </span></p>
<p><b>[00:28:52]</b> <b>AT:</b><span style="font-weight: 400;"> That&#8217;s another piece of the puzzle that comes from a different silo, but in the end fits right into the AI conversation, because basically, the term data mining describes a lot of what you want to do with AI. You want to take a pile of big data and run all sorts of computational techniques that will give you new insights and new knowledge. These are increasingly methods that could be branded as AI, but are traditionally called text and data mining. This has been the big debate around the copyright directive. It has been framed as an intellectual property issue, because basically, what they tried to solve is this challenge: there&#8217;s a lot of data available today, but you&#8217;re not allowed to use it, right? You can maybe even scrape it from the internet, but someone might have copyrights to this data, or some other kinds of rights.</span></p>
<p><span style="font-weight: 400;">Then the rules that were adopted are not broad enough to my liking, mainly because they limit this to non-commercial research, activities and institutions, right? It&#8217;s good for scientific research and it&#8217;s a much-needed freedom that basically universities and other research institutes need but again, if you put together all these ideas about what we could do with data. If we ensure access to data, probably this regulation could have been broader. I think within this new European data strategy, of course, there&#8217;s always question how these rules will play together, but there are some new ideas, how data will be shared. </span></p>
<p><span style="font-weight: 400;">For instance, there&#8217;s a really strong proposal that could make producers of IoT devices, electric scooters, voice assistants be required to share the data they collect, which could open up really a huge possibility to create really new uses and would really transform this market. Text and data mining, I think it&#8217;s a term that really strongly connects basically with research, while these other approaches don&#8217;t really focus on research, but also look for other ways of using data.</span></p>
<p><b>[00:30:48]</b> <b>SM:</b><span style="font-weight: 400;"> In the end, the copyright directive allows data to be open by default for the purpose of data mining, but only for non-commercial and research purposes.</span></p>
<p><b>[00:30:58]</b> <b>AT:</b><span style="font-weight: 400;"> For research, yes. Which raises the same question as always: it&#8217;s a big shift, it&#8217;s better than nothing, but is it big enough? I think the problem with these policies, with the copyright directive that&#8217;s finished and with ongoing acts like the AI Act, is that you have a sense that these are once-in-a-generation situations, right? Basically, once every 20 years you can expect this to happen. The Digital Services Act that was just passed builds on top of the so-called Information Society Directive, which is now exactly 20 years old. If you have a chance only once in a generation, you really would like to get it right. Then there&#8217;s the bigger challenge: you&#8217;d really like it to be future-proof, because you want your law to work in a reality where technologies that can change the balance of things are deployed almost every year.</span></p>
<p><span style="font-weight: 400;">This is maybe the big question, also the big debate, it&#8217;s one thing whether you get the rules, right? But I think in almost every act, you need to put these provisions that make it future-proof. Okay, you might have a list of four scary uses of AI, but do you have a process for reviewing it? Will you in three or five or 10 years, come back and review that list or review your mechanisms for transparency? I think, if you don&#8217;t include that, there&#8217;s a high chance that basically technology will make your law obsolete.</span></p>
<p><b>[00:32:21]</b> <b>SM:</b><span style="font-weight: 400;"> Right. It&#8217;s a fine balance to maintain: future-proofing versus allowing innovation while regulating.</span></p>
<p><span style="font-weight: 400;">So from the perspective of the industry, so members of the Open Source Initiative and industry members, advocacy groups and individuals, what do you think we should be doing?</span></p>
<p><b>[00:32:38]</b> <b>AT:</b><span style="font-weight: 400;"> Well, first of all, you should engage in these policy debates. I hope that&#8217;s clear. I think some companies see these issues, but it probably requires some redefining of what it means to be focused on open source, right? It&#8217;s just one piece of the puzzle, a very important piece, but I think we all need to take responsibility for the bigger puzzle. This also applies to the community I&#8217;m more familiar with, which is the open content community. By the way, I really appreciate the work done, for instance, by the Wikimedia Foundation, which I think has exactly this broad perspective. It thinks of knowledge and content, but it also sees itself as a platform and engages in debates on platform regulation, and even more broadly, it often thinks of the whole ecosystem.</span></p>
<p><span style="font-weight: 400;">I think this is what we all need to be doing, because there&#8217;s a need to take responsibility for that. I really also appreciate what you said that the Open Source Initiative and other industry actors are really trying to understand what openness means in this new technological reality. What does it mean for a to be open? Because I think on one hand requires reinforcing some well tested recipes. Open sourcing code is a really good idea, I think. But on the other it really requires some creative reworking of what it means for things to be open. </span></p>
<p><b>[00:33:56]</b> <b>SM:</b><span style="font-weight: 400;"> Have you thought about it? What does it mean to be open for you? What would be your wish for an open AI?</span></p>
<p><b>[00:34:03]</b> <b>AT:</b><span style="font-weight: 400;"> We are actually doing a project, maybe that&#8217;s worth mentioning, where we&#8217;re looking at a very specific case, a case that is close to my heart but also really important for the open content and Creative Commons community: the case of AI training datasets. It&#8217;s a story that&#8217;s by now almost 10 years old, a story of huge numbers of photographs of people being used to build the datasets with which facial recognition models and technologies are built. There&#8217;s a famous, or infamous, dataset called MegaFace, which has 3 million photographs packaged into a tool that&#8217;s an industry standard for benchmarking and for deploying new solutions.</span></p>
<p><span style="font-weight: 400;">It&#8217;s also a system that&#8217;s quite controversial. When you dig deep inside it&#8217;s not entirely clear consent was given. A lot of people say that even though the formal rules of licensing of Creative Commons licensing were met, they see some problem. This uses are unexpected, risky. When you look at the list of users of this data set, you suddenly see military industry, surveillance industry, and people really have some disconnect between the ideas they had in mind when sharing their photos. Yes, agreeing that this will be publicly shared, but usually have a vision of some very positive internet culture. Then you find out that there are these uses that define scary, and I think it&#8217;s a case we&#8217;re investigating because this is exactly what we&#8217;d like to see that we have a discussion about law. </span></p>
<p><span style="font-weight: 400;">Also we have a discussion about social and community norms, because when you want to address these risks, I don&#8217;t think you can regulate it all. You can, of course, say what&#8217;s illegal, but beyond that, you really have to think about standards. That&#8217;s one thing I&#8217;d like to see. I’m connected with that, I think we don&#8217;t have specific solutions for a lot of issues, but there&#8217;s one way that gives a good possibility of attaining good outcomes and this is participatory decision making. </span></p>
<p><span style="font-weight: 400;">Really, a favor approach is that really tried to draw in different stakeholders, even individual users into the process in the UK. The other level is institute-organized, so-called Citizen Panels, on the use of AI, on the use of big data, on biometric technologies. We&#8217;re really people express their views on how they would like to see these systems work in their lives. They are not experts. They will not give you technical expertise, but it turns out if you explain to them the technology, they come up with pretty reasonable ideas about what kind of world they would like to live in.</span></p>
<p><b>[00:36:37]</b> <b>SM:</b><span style="font-weight: 400;"> It&#8217;s very important indeed to engage and to talk to regulators. I learned a long time ago that doing so does not mean getting your hands dirty.</span></p>
<p><b>[00:36:46]</b> <b>AT:</b><span style="font-weight: 400;"> And the other way around: it&#8217;s great for regulators to also reach out to people and really treat them as partners, not just limit the conversation to some narrow group of stakeholders.</span></p>
<p><b>[00:36:56]</b> <b>SM:</b><span style="font-weight: 400;"> Indeed. Well, thank you very much for your time, Alek. Is there anything else you would like to add that you think we haven&#8217;t covered?</span></p>
<p><b>[00:37:04]</b> <b>AT:</b><span style="font-weight: 400;"> I&#8217;m keeping my fingers crossed that Europe will develop one more piece in its fascinating array of new regulations, which some people observe with fascination and some with awe or fear. I think of policies as powerful means of world-building. We like to think that it&#8217;s technologists who create the world today, but I think policy has a lot of opportunity to shape the world as well. I just hope more and more people come to believe that and engage in policy processes, because they&#8217;re not only important; they can also be fun.</span></p>
<p><b>[00:37:39]</b> <b>SM:</b><span style="font-weight: 400;"> Wonderful. Thank you very much, Alek. </span></p>
<p><span style="font-weight: 400;">[OUTRO]</span></p>
<p><b>[00:37:42]</b> <b>SM:</b><span style="font-weight: 400;"> Thanks for listening. Thanks to our sponsor, Google. Remember to subscribe on your podcast player for more episodes. Please review and share; it helps more people find us. Visit deepdive.opensource.org, where you&#8217;ll find more episodes, learn about these issues, and can donate to become a member. Members are the only reason we can do this work. If you have any feedback on this episode, or on Deep Dive: AI in general, please email </span><a href="mailto:contact@opensource.org"><span style="font-weight: 400;">contact@opensource.org</span></a><span style="font-weight: 400;">.</span></p>
<p><span style="font-weight: 400;">This podcast was produced by the Open Source Initiative, with the help from Nicole Martinelli. Music by Jason Shaw of audionautix.com, under Creative Commons Attribution 4.0 international license. Links in the episode notes.</span></p>
<p><b>[00:38:24]</b> <b>ANNOUNCER:</b><span style="font-weight: 400;"> The views expressed in this podcast are the personal views of the speakers and are not the views of their employers, the organizations they are affiliated with, their clients, or their customers. The information provided is not legal advice. No sponsor had any right or opportunity to approve or disapprove the content of this podcast.</span></p>
<p><span style="font-weight: 400;">[END]</span></p>
]]></content:encoded>
					
					<wfw:commentRss>https://opensource.org/blog/episode-2-transcript/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">17098</post-id>	</item>
		<item>
		<title>Episode 1: transcript</title>
		<link>https://opensource.org/blog/episode-1-transcript</link>
					<comments>https://opensource.org/blog/episode-1-transcript#respond</comments>
		
		<dc:creator><![CDATA[Ariel Jolo]]></dc:creator>
		<pubDate>Tue, 16 Aug 2022 00:00:00 +0000</pubDate>
				<category><![CDATA[Transcript]]></category>
		<guid isPermaLink="false">https://opensource.org/2022/08/16/episode-1-transcript/</guid>

					<description><![CDATA[[INTRODUCTION] [00:00:00] PC: We&#8217;re getting to a point of software development, where it&#8217;s not so easy to put things in buckets anymore as to what a human wrote, or what...]]></description>
										<content:encoded><![CDATA[<p>[INTRODUCTION]</p>
<p><b>[00:00:00]</b> <b>PC:</b> We&#8217;re getting to a point in software development where it&#8217;s not so easy to put things in buckets anymore as to what a human wrote or what a machine wrote. The concept may be easy, but I think the application might get really complicated.</p>
<p><b>[00:00:20]</b> <b>SM:</b> Welcome to Deep Dive: AI, a podcast from the Open Source Initiative. We&#8217;ll be exploring how artificial intelligence impacts free and open-source software, from developers to businesses, to the rest of us.</p>
<p>[SPONSOR MESSAGE]</p>
<p><b>[00:00:34]</b> <b>SM:</b> Deep Dive: AI is supported by our sponsor, GitHub. Open-source AI frameworks and models will drive transformational impact into the next era of software; evolving every industry, democratizing knowledge and lowering barriers to becoming a developer. As this evolution continues, GitHub is excited to engage and support OSI’s deep dive into AI and open-source and welcomes everyone to contribute to the conversation.</p>
<p>[INTERVIEW]</p>
<p><b>[00:01:02]</b> <b>SM:</b> I&#8217;m Stefano Maffulli, the Executive Director of the Open Source Initiative. Today, I&#8217;m talking to Pamela Chestek, a lawyer with extensive experience in open source and a board member of the Open Source Initiative. She also practices trademark, copyright, advertising, and marketing law. Thanks for joining, Pam. Let&#8217;s jump right into it. From our virtual hallway conversations, I know that you have some very clear opinions about copyright on materials that have been created by machines. Can you share more of your thoughts on this front?</p>
<p><b>[00:01:35]</b> <b>PC:</b> I just want to start off by saying that I&#8217;m speaking from the perspective of a United States copyright lawyer under United States law. I think this is an area that may turn out to be quite different in different jurisdictions; I&#8217;m just speaking about what I know. The US has been pretty clear about what works are subject to copyright. It has been very clear for many, many years, long before computers, that copyright only exists when a work was created by a human author. This goes back quite some time. Probably the most famous example that people might be familiar with was the monkey selfie, where a photographer claimed that a monkey grabbed his camera and took this really charming photo of a monkey with a big grin.</p>
<p>Then, when he filed a copyright application to register the copyright in that photograph, the copyright office rejected it, because there had been so much publicity that this was the work of the monkey, not of the person. The story changed over time, with the person claiming to have contributed more to the work than the story as originally told. Actually, Wikipedia took him to task on it. Wikipedia did a great deal of investigation on this, and reached the conclusion that the photo was not copyrightable, because it was taken by a monkey.</p>
<p>Another example was someone who wanted to register a copyright in a work, where they said that they had not written it; instead, the Holy Spirit had channeled through them to write it. The copyright office rejected it and said, “No, I&#8217;m sorry. It&#8217;s not written by a human author. We can&#8217;t register the copyright in it.” I take it back, I&#8217;m not sure whether these were copyright office decisions or lawsuits that the copyright office has since incorporated into its guidance; don&#8217;t hold me to it if I have it backwards.</p>
<p><b>[00:03:20]</b> <b>SM:</b> To be clear, is the Bible out of copyright because of that?</p>
<p><b>[00:03:28]</b> <b>PC:</b> The Bible, because of its age. Actually, I don&#8217;t know enough about the Bible to say that. It&#8217;s certainly out of copyright because of the time that has elapsed. I don&#8217;t know how many chapters were dictated by God versus someone&#8217;s retelling of what God told them.</p>
<p><b>[00:03:42]</b> <b>SM:</b> There&#8217;s still God involved. Definitely, computers are now the gods in this case. I was also thinking about machine-made art, like a painting done by swinging a pendulum that drips paint onto the canvas. I mean, at that point, there is a person pushing the bucket.</p>
<p><b>[00:04:01]</b> <b>PC:</b> Yeah. Actually, there is also this: the standard for protection by copyright, set by the Supreme Court, requires originality and creativity. The copyright office will refuse registrations if it doesn't believe that the work has sufficient creativity and originality. I have personally experienced this when I was trying to register a copyright for a site-specific monumental sculpture. This was actually a quite famous sculpture, and the copyright office refused to register it, saying it wasn't creative enough.</p>
<p>The copyright office does find itself, as much as it claims not to, in the role of arbiter of what is an artistic work and what is not. That's another facet of your example: if I just push a pendulum and it goes on its own after that, is that creative? I could talk a long time about this, because I believe the standards are quite different depending on what the work is. Photographs are very easily considered copyrightable works, even though you just push a button; the development of the law around photographs protects those quite easily, while other works are not protected so easily.</p>
<p>You alluded to this issue of complexity when we use computers to create works of art. It can't be simply that a machine was involved; that can't be the dividing line on whether or not something is copyrightable, because I use Inkscape or GIMP to create works. The copyright office does have guidance on where the dividing line is. I'm going to read a long paragraph, and please forgive the reading and the length of the paragraph. This is actually based on a statement made by the copyright office in 1966. Think about that. This is from the copyright office's own guidance, called the Copyright Compendium, on how to do registration.</p>
<p>It says, “The office will not register works produced by a machine, or mere mechanical process that operates randomly, or automatically, without any creative input or intervention from a human author. The crucial question is, whether the work is basically one of human authorship, with the computer or other device merely being an assisting instrument, or whether the traditional elements of authorship in the work (literary, artistic, or musical expression, or elements of selection arrangement, etc.) were actually conceived and executed not by a man, but by a machine.”</p>
<p>That's the theory, right? It sounds like a bright line, maybe. But where is that line between what the human being is doing versus what the machine is doing? This is going to be the battleground for the copyrightability of works that are self-modifying based on input, based on machine learning.</p>
<p><b>[00:06:58]</b> <b>SM:</b> Got it. Something like a tool that has lately been in the news, this software from the OpenAI organization called DALL·E. Basically, you feed it text, a description of something, like a sunset on a beach, and the machine is capable of generating art based on that text, representing something that looks like a sunset on a beach. I've seen experiments where Twitter bios were represented as art by DALL·E, and they're beautiful, to the point where there was a conversation on Hacker News from a young artist who wondered, "With the output that I've seen from this machine, I'm probably going to be out of business." It's pretty clear that the art produced by DALL·E is not copyrightable, right? That's very easy.</p>
<p><b>[00:07:52]</b> <b>PC:</b> Yeah. Yeah, I think so. Yeah.</p>
<p><b>[00:07:54]</b> <b>SM:</b> Now, what's interesting for me is what happens behind the scenes, like what it would take for DALL·E, or something like DALL·E, to be considered open-source. That is a question that fascinates me. Because certainly, in order for DALL·E to be trained to read text and to generate art, it had to look at a lot of art and interpret it, look at it in the weird way a computer does. By doing that, the algorithms that end up generating the DALL·E output are themselves an output of machine learning, of training machines that learn by themselves. Is that copyrightable or not?</p>
<p><b>[00:08:35]</b> <b>PC:</b> Yeah. This is, I think, where the complexity comes in. To walk through how this all comes about, how the software develops: someone wrote a software program that takes input and then either analyzes that input, creates rules, or creates some model. Certainly, the software that a human being wrote in order to create the ultimate DALL·E system was copyrightable. To that extent, we can divide it into, "Here's the software and here's the data." Algorithms acting on data produce a result, and running data through an algorithm and producing a result, I don't think the result is going to be considered copyrightable.</p>
<p>Where I think it gets interesting is where the software itself is modified as a result of what it has learned from the data it has been given. As I understand it, and I'm not a software engineer, we're getting to a point in software development where it's not so easy to put things in buckets anymore as to what a human wrote versus what a machine wrote. The concept may be easy, but I think the application might get really complicated.</p>
<p><b>[00:09:53]</b> <b>SM:</b> Right. Yeah, that is exactly my fascination. I'm not a software engineer either. I'm a mere architect, and I've been an observer of this world for a long time. I do remember that at one point, my very little dive into using AI from a more advanced perspective was when I was putting together a mail server in the past and installing SpamAssassin. I never really thought about it, but SpamAssassin is a fairly simple machine learning system. The software itself, developed by the Apache Software Foundation, is packaged by Debian, and it's simple to install: SpamAssassin gets installed from APT.</p>
<p>Then what you do is feed it your set of good emails, the 'ham', and the bad emails, the 'spam.' Then there are some other components, but fundamentally, that's what it is. You train the model, you train SpamAssassin to distinguish the set of emails that are good from the ones that you don't want approved. Then it creates rules, and based on those rules, it will apply the filters. Fairly simple. It's in Debian.</p>
<p>Now, in that context, I do understand that after being fed the spam and the ham, the machine generates a model. That model is generated by the machine. Is that copyrightable or not? Usually, you don't package those models in Debian, because everybody has their own spam. It's fairly simple; I never thought about it. It can be simple to reproduce in any case.</p>
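<p>[A note for readers: to make the training workflow Stefano describes concrete, here is a minimal sketch in Python. It uses a naive Bayes text classifier from scikit-learn as a stand-in for SpamAssassin's Bayesian filter; the example messages and the library choice are illustrative assumptions, not anything discussed in the episode.]</p>
<pre><code># Minimal sketch of a SpamAssassin-style training workflow (hypothetical data).
# Label a handful of messages as ham (good) or spam (bad), fit a classifier,
# and note that the fitted parameters are the machine-generated "model"
# whose copyright status the conversation turns on.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

ham = ["Lunch tomorrow?", "Here are the meeting notes you asked for."]
spam = ["WIN a FREE prize now!!!", "Cheap meds, no prescription needed."]

texts = ham + spam
labels = [0] * len(ham) + [1] * len(spam)  # 0 = ham, 1 = spam

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)     # token counts, derived mechanically

model = MultinomialNB().fit(features, labels)  # the learned "model": token probabilities

new_message = vectorizer.transform(["Claim your FREE prize now"])
print(model.predict(new_message))              # [1] -> classified as spam
</code></pre>
<p>[What the sketch makes concrete: nothing in the fitted model is chosen directly by a human; the human choices live in the code and in which messages were labeled ham and spam, which is exactly where the "selection" argument Pamela makes next attaches.]</p>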
<p><b>[00:11:27]</b> <b>PC:</b> From your description, I think the models would tend to fall on the not-copyrightable side of the line, because the creative aspect of the work is in the software. You then feed that software data, and the software figures out what the model should be. You aren't making any artistic, creative, or active choices. Although, I guess, as an aggressive lawyer: there is a concept under US copyright law, which applies to databases or collections of information, that there can be copyrightability in the selection, coordination, and arrangement of information.</p>
<p>To give an analog example, if I choose to publish an anthology of a poet's works, and I want it to be a complete anthology, I do not have a copyright in the anthology, in the overarching work. I don't own the copyright in the poetry, but I also don't own a copyright in the selection, coordination, and arrangement, because there was no creative choice there. I simply identified every single work of the author and included it. Now, if instead I had said, "Well, I want to choose works of this author that all talk about, say, sadness," then I go through and select all of the poems that I think fit this creative choice that I've made.</p>
<p>Then I put them in a certain order. I don't necessarily put them in chronological order; I put them in an order from happiest to saddest, or something. That may cross a line where it is considered copyrightable, because there is creativity in that selection, coordination, and arrangement. That applies to databases, so there is some protection for databases in the United States. The reason I'm hesitating on the model is that my argument would be, "Well, I made a creative choice in selecting what ham and spam I was going to use for training." Therefore, whether the model that results from that training is copyrightable comes back to this concept of whether the work was done by the machine or by the human. I would say it's on the human side of the line, because I chose the spam and ham to use to train.</p>
<p><b>[00:13:33]</b> <b>SM:</b> Got it.</p>
<p><b>[00:13:34]</b> <b>PC:</b> That&#8217;s the argument I would make. I don&#8217;t know how successful it would be, but I make it.</p>
<p><b>[00:13:39]</b> <b>SM:</b> That makes sense. Because if it is not covered by copyright, then what happens? Is that considered completely public domain?</p>
<p><b>[00:13:48]</b> <b>PC:</b> It's just not protected by copyright. It's interesting; we've reached, I think, a place in our society where we have this copyright maximalism going on, where there is this belief of, if I created it, therefore I have some exclusive rights to it. That just isn't true. There are works that simply aren't protected by any regime at all. You may have created it, but everybody gets to use it, because for whatever reason, it's not subject to copyright protection.</p>
<p>This is where I think it also gets really interesting, and maybe counterintuitive and difficult for people to accept, and maybe it will change over time. The Supreme Court has been very clear that what they call the 'sweat of the brow' is not enough for a work to be copyrightable. It doesn't matter how hard you work on it, or how much time, money, and effort you put into it; if there is not this creativity and originality, the hallmarks of a copyrightable work, then sweat of the brow, putting a lot of time and effort into it, is not enough. That's where I think it gets really interesting, because obviously, a lot of time and energy is spent on machine learning, on tweaking the models.</p>
<p>I mean, we know from experience now that some image generation or image recognition software started its training with a very bad database, and that's causing problems. So throw into the pot, also, this concept that sweat of the brow is not enough to make it copyrightable. No matter how hard you work on it, that doesn't make it copyrightable.</p>
<p><b>[00:15:20]</b> <b>SM:</b> Got it. No, that&#8217;s very interesting. You touched on a very important point, because in the end, if something is not copyrightable, then we may not have an easy way to understand whether it&#8217;s open-source or not.</p>
<p><b>[00:15:32]</b> <b>PC:</b> If I could just jump on that, because I think that's an interesting point: it forces a real examination of what open-source is. What are our priorities? What are we trying to achieve here? The fact that something is not protected by copyright, does that get us where we wanted to be in the first place? If we think of open-source licenses, and particularly copyleft licenses, as being a hack on copyright that was necessary because maybe software shouldn't be copyrighted at all, then is it possible that not having copyright protection for these works actually is the best solution? That it's actually a great outcome for us? But then, you also don't have the license as an instrument of control for purposes of good. You've given up control.</p>
<p><b>[00:16:19]</b> <b>SM:</b> It's a very interesting conversation, because this dichotomy has been hard to explain. I remember having conversations with the early European Pirate Party members, who were completely against copyright. We had that tension: not having copyright apply to software means that copyleft becomes toothless, too. Going back to open-source for artificial intelligence, one of the things that I noticed is that, for example, there are conversations going on inside the Debian community about whether they need rules to decide what packages they can import into the Debian archives.</p>
<p>Because on one hand, it's fairly simple to say PyTorch, TensorFlow, NumPy: the basic software pieces that implement interesting algorithms for natural language processing, text processing, and computer vision. Then there are some models that are necessary in order for science to progress, and some of these models are not available under licenses that can be easily interpreted. There are conversations even about the basic feed, the big datasets that go into training models.</p>
<p>There are conversations about whether we need a definition, or whether we need some help to understand what can go into and be shipped as a Debian package. Do you have any thoughts on those?</p>
<p><b>[00:17:48]</b> <b>PC:</b> I have been confronted with these questions; they're starting to pop up more. What I'm actually finding more troubling is the datasets that are used to start with. I actually have people who say, "I don't care about their models; we're going to do our own modeling, so we don't need those models as much." Some of this data is going to be copyrightable content. The first question is, do I have permission from the copyright owner to use that data in this way?</p>
<p>As an example of copyrightable content, photographs. I don't know where all of the data is, where all of these photo sets used for training are coming from. The subject matter that is being used for training is subject to copyright. Is all of that data allowed to be used? Was it used with permission? Then the question that follows is, if not, what does that mean for the model? If I used data that I shouldn't have used, that I didn't have permission to use for modeling, does that taint my model?</p>
<p>Let's say the model was put under an MIT license or something, so that it's freely available. Is that okay? Has the model been sanitized enough that I can use it, even if I don't know the quality of the content, if I don't know the provenance of the content it was trained on? That's where my head starts to go. I can't get past the dataset.</p>
<p><b>[00:19:16]</b> <b>SM:</b> Absolutely. One of the things that I learned doing research for this series is that the European Union has introduced a new right, the right to data mining, and they have turned it on by default, which is surprising. From what I've read so far, and we'll have guests explaining this to us a little bit more, the European Commission was convinced by researchers partly because of cases like the archive of images on Flickr, which implemented the Creative Commons licenses early on. There is a very wide array of pictures with lots of metadata and tags and with freely available licenses. Only, it wasn't clear whether data mining was included in there.</p>
<p>As a human being, as a citizen, I think of myself: my face is now up there. It can be used for nefarious purposes, not just to identify a white man in a picture. There are those implications, and I think they are tied to the Open Source Initiative and the Open Source Definition somehow. Because in many aspects, even though we try as an organization to be neutral, we do have organizations that rely on open-source to set a baseline of technology that can be implemented without discrimination. Now, we have artificial intelligence systems that are capable of deciding whether someone gets out of jail or not.</p>
<p>In the past, we would have said, "You have to make that code open-source, because it's public, it's used by the public. You have to make the code open-source so that we, as the public, are able to inspect it, and we should also be able to demand fixes." Now, with AI systems, things get a little bit more nebulous.</p>
<p><b>[00:21:10]</b> <b>PC:</b> I don't know if they get more nebulous or not. One thing, when you mentioned data mining, and correct me if I'm wrong: somewhere in the back of my head is that the permission for data mining under the EU law is for non-commercial use only. The OSI is very clear that we do not discriminate between commercial and non-commercial uses. As you've just explained, there's a reason for that, which is that the line drawing gets very difficult. The good-and-evil question is insoluble as far as I'm concerned. We have to take the position of, "We're not making value judgments on how this stuff is used."</p>
<p>Knowing in particular that there are problems with models that were created from flawed databases, and how harmful that's going to be, there certainly is a big part of me that wishes we could say, "No, it is not consistent with our belief system that these should be used." But we have these non-discrimination principles. I've always been very clear that open-source software can be used for evil. Maybe I'm too rigid in my thinking, but I just don't see any way to draw a different line for models.</p>
<p><b>[00:22:22]</b> <b>SM:</b> I don't think that it's necessarily the role of the OSI to be the judge of that. We definitely have been helping people involved in policymaking and policy discussions. I'm thinking of organizations like DFF and others that want software to be available under an open-source license as a baseline for acceptance, for filing taxes, for example. Or the Free Software Foundation Europe, whose campaign has been going on for a long time: public money, public code. If it's funded by taxes, then any software developed should be free and open-source. We want to have that conversation about what an open-source AI is, or at least some groups will want to have it. Maybe at one stage, we'll have to have that conversation.</p>
<p>[SPONSOR MESSAGE]</p>
<p><b>[00:23:16]</b> <b>SM:</b> Deep Dive: AI is supported by our sponsor, DataStax. DataStax is the real-time data company. With DataStax, any enterprise can mobilize real-time data and quickly build the smart, highly-scalable applications required to become a data-driven business and unlock the full potential of AI. With Astra DB and Astra Streaming, DataStax uniquely delivers the power of Apache Cassandra, the world's most scalable database, with the advanced Apache Pulsar streaming technology in an open data stack available on any cloud.</p>
<p>DataStax leads the open-source cycle of innovation every day in an emerging AI-everywhere future. Learn more at datastax.com.</p>
<p>[INTERVIEW CONTINUED]</p>
<p><b>[00:23:56]</b> <b>SM:</b> You mentioned you have clients working on machine learning. What kind of issues are they running into?</p>
<p><b>[00:24:02]</b> <b>PC:</b> I don't want to share too much, but what I have found interesting for commercial clients is a significant one. I have a client who's a service provider doing some machine learning work for another company. There is this question of ownership, of who's going to own it, and of reuse of data. For example, the customer might say, "Well, if you're using my data, here's my dataset that I want you to evaluate and come up with some modeling from. But you can't use this dataset for anyone else." Because they're trying to get a commercial edge, right? They're trying to get a market differentiator for themselves, and they think they can do that by limiting the dataset.</p>
<p>It actually is, I think, very similar to commercial software development versus an open-source software development model. When your development is proprietary, you're going to keep doing the same thing over and over and over again, because you're not going to share your work product or use other people's work product. I think the same thing is going to happen here: okay, I'll just retrain. I'll do the same thing with somebody else's dataset, which may look really, really similar to your dataset.</p>
<p>Now, I guess, there are implications there, too. Are we better off allowing the use of more data, rather than restricting data? Are we going to come up with better models if we use more data, rather than having to reinvent the wheel every time? Yeah, that I found interesting.</p>
<p><b>[00:25:21]</b> <b>SM:</b> Yeah. It's exactly like the early conversations we had when free software and open-source software were spreading. Why are you reinventing the wheel? Why is everybody working on a different kernel, on different Unix variations and dialects? Why don't you just collaborate, put all of your energy into one, and build it faster? We're probably going to get to that point. Are you hopeful about AI being a force for good and a way to progress faster with open collaboration, or are you more on the scary front of RoboCops and Skynet?</p>
<p><b>[00:25:59]</b> <b>PC:</b> That's a really great question. I don't have an answer for it, because my level of trust would come down to who's doing it. We've seen people of goodwill who understand the problems and are cautious about them. I remember there was a Twitter bot that turned racist in about eight hours; in almost no time, it was spewing racist slurs, and they had to take it down. That, of course, gives me great pause. I think we do see that these tools are being used prematurely, being relied on in ways that are harmful to us by the police, or by the prison system.</p>
<p>These tools haven't been adequately tested, yet we think that Minority Report-style predictions can exist and that we can predict whether people are going to commit crimes before they commit them. That part of it is terrifying. But there are a lot of people who recognize that these problems exist. We're still at early stages, and we'll see what happens.</p>
<p><b>[00:26:57]</b> <b>SM:</b> I agree with you. It all depends on who's going to be able to guide this and gain the trust. So far, I'm a little bit nervous, because from what I've seen, AI at the level of the tools we have seen so far, like DALL·E and the most awe-inspiring ones, the ones that really make you go "Wow," requires an amount of data and processing power that is really not available to the Debian developer. The kind of software development that was accessible 20 years ago, enough to create a full distribution and perfectly capable Unix machines and servers, doesn't seem to be readily available in the same way with AI systems.</p>
<p>I'm also hoping that some of the conversations we're going to have in the next few weeks will reveal some hope and some path forward. I really would like to see light around academic evolution, for example, all the research. I had very interesting reads, and I recommend them if you haven't read them: the papers that have been released by the Free Software Foundation analyzing Copilot. Some of them are extremely thoughtful and eye-opening. At least, they were for me, since I'm so new to the field.</p>
<p><b>[00:28:19]</b> <b>PC:</b> Something occurred to me that I was thinking about yesterday and the day before, when I received a document for review that was written in French. This is a legal document that was written in French and had simply been put through a machine translator for the English translation. Think about the fact that we have reached this point. Because I remember, and this is within my memory, when Babel Fish at the very earliest was a website where you could translate text. That was phenomenal. It's phenomenal that you could actually get anything, as unintelligible as it was.</p>
<p>We're at a point now where we rely on machine learning; I think it's probably always the first step in any translation at this point. Then it may be reviewed by a human to make sure that it's cogent and understandable. Sometimes not; sometimes in the work that I do, it's probably close enough. There may be some syntax problems, but I get the gist. I just think of that as an example of where we may be going with machine learning. We're still at that very early stage right now of, "Yeah, I can get the gist of it. It's not great." But we will get to a point where relying on all of this machine-generated content, or decision-making by machines, is just going to be part of the ordinary fabric of our lives. It's very interesting.</p>
<p><b>[00:29:34]</b> <b>SM:</b> Like that young person was saying this morning on a forum online, "I'm a designer. I see myself going out of business soon because of DALL·E." Transcribers, people who transcribe text, and translators are basically already out of business as of today. GPT-3, the text processing from OpenAI, and other huge projects from them are capable of also summarizing text and writing very basic marketing copy. A lot of the low-level creative jobs could go away. It's a fascinating world.</p>
<p><b>[00:30:16]</b> <b>PC:</b> The Associated Press uses some machine-written content. For simple reporting on, say, a company's earnings, they use machine-generated copy.</p>
<p><b>[00:30:26]</b> <b>SM:</b> The LA Times has a bot that writes little snippets about earthquakes.</p>
<p><b>[00:30:33]</b> <b>PC:</b> I mean, I still have hope that we will always be able to tell the difference, that there is a difference between what a machine will generate and what a human will generate. Maybe there is only a slight difference, but that we'll always have the upper hand.</p>
<p><b>[00:30:45]</b> <b>SM:</b> Again, with DALL·E, in some of the reading that I've done over the weekend, people were noticing little things that would let you tell whether it was generated or an artist did it. But nothing that you can't just touch up in Photoshop to fix.</p>
<p><b>[00:31:03]</b> <b>PC:</b> Yeah. Also, to get back to the requirement that copyright subject matter be original and creative: it's sometimes referred to as the creative spark, that the artist has this creative spark. By definition, a machine is not going to have a creative spark. There's hope for us, I think.</p>
<p><b>[00:31:20]</b> <b>SM:</b> Is there anything we should be talking about, you think?</p>
<p><b>[00:31:24]</b> <b>PC:</b> I talked about this conflict: there is a huge amount of work being done that, it appears, may not be protected by copyright, or certainly there are arguments that it is not protected by copyright. Yet it's substantial. It will be a substantial value proposition to a company to have that. Where the law currently stands, I would say, they may not have exclusive rights to it.</p>
<p>What does that say about their business model? How are they going to make money on it? Having worked at Red Hat, it's a question I was asked all the time, and I still get asked, even though I haven't worked there in many years. When I say I worked at Red Hat, the first question out of everybody's mouth is, "How do they make money selling free software?" Red Hat figured out a pretty good business model and makes a fair amount of money doing it. But Microsoft and the early software companies were built on exclusivity of copyright. You have to pay them to get to use their copyrighted work. That's a business model.</p>
<p>Those of us who have been in open-source have been thinking creatively about business models, because that one isn't available. Although, there are very few purists, right? Most companies are doing a combination. They're doing open-core, where the open-source part is a loss leader, and then they'll sell you a license to proprietary widgets. A true, pure open-source play is very uncommon, very difficult, and challenging. It may be that when all this work is not protected by copyright, people will be scrambling for how to monetize it. What is my business model around this thing for which I have no exclusive rights?</p>
<p>Those of us who have been thinking about that for decades might be able to help them out with that. Maybe they&#8217;ll come up with new models. Maybe they&#8217;ll come up with stuff that we&#8217;ve never thought of, which would be really great, too. I think that&#8217;s going to be really challenging for people. How do I monetize this?</p>
<p>If you don't have copyright, the second way you do it is access. Take, for example, the museum that says you can't take pictures in its galleries. There's no copyright; you're not infringing the copyright in most works, other than the more modern ones. What they do is make it a condition of permission to access the work: a condition of your entry into the museum is that you do not take photographs. That same gatekeeping is used in open-source business models: we're not going to give you the executable until you pay us; then we'll give it to you.</p>
<p>I expect reliance on some access gate will be one way. Now with the cloud, you don't have to give people a copy of the software at all. You just give them a portal to access it, and that access is much easier to control.</p>
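<p>[A note for readers: here is a minimal sketch of the access-gating model Pamela describes, as a hypothetical HTTP service that keeps the model on the server and only answers authenticated requests. The framework (Flask), the endpoint path, the API key, and the generate() stub are all invented for the example.]</p>
<pre><code># Hypothetical access gate: the valuable artifact (a model) never leaves the server.
# Clients get a portal, not a copy, so control does not depend on copyright.
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
API_KEYS = {"customer-123-secret"}  # placeholder keys issued to paying customers

def generate(prompt: str) -> str:
    # Stand-in for the model kept behind the gate.
    return f"output for: {prompt}"

@app.route("/v1/generate", methods=["POST"])
def gated_generate():
    # The gate itself: no valid key, no access. Access control, not copyright,
    # is what keeps the work exclusive.
    if request.headers.get("X-API-Key") not in API_KEYS:
        abort(401)
    prompt = (request.get_json(silent=True) or {}).get("prompt", "")
    return jsonify({"output": generate(prompt)})

if __name__ == "__main__":
    app.run(port=8000)
</code></pre>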
<p><b>[00:34:07]</b> <b>SM:</b> That seems like the OpenAI model to me: they have built this great machine, and then they charge for API access. There is another area that has been fascinating to me: algorithms shaping things and moving stuff around. It's less tied to the open-source part and more to conversations in the general public, about the algorithms in Twitter, or Facebook, or LinkedIn that decide what items you're interested in. Recently, there have been conversations again about Twitter having this pie-in-the-sky new project called Bluesky. They want to open-source their algorithm in a very wide sense. Do you have any opinion?</p>
<p><b>[00:34:54]</b> <b>PC:</b> I think it depends on whether by open-source we mean simply visibility into what the algorithm is, versus, I mean, maybe they would be willing to let other platforms use their algorithm. That would be very interesting. Particularly when we talk about, "Do I trust Twitter to filter content correctly?" I hold a lot of skepticism. What if that model were shared publicly amongst all of the social media platforms, so that they could each tweak it, to come out with a better model? Under open-source development theory, that would be a better model, right? Because we're not relying on just one person's or one entity's judgment; we're getting consensus from a lot of people. Yeah, that is interesting, if they really, truly mean an open-source process, which is probably not what it is.</p>
<p><b>[00:35:42]</b> <b>SM:</b> Well, yeah. Exactly. I don't think there is a very good definition, or a very good understanding, of what they want to achieve. I've read some of their papers, and they seem to be well-intentioned, also thinking about distributing it so that no one entity owns access to the information, or to the algorithm itself. We'll see. We'll see if it's something worth keeping an eye on.</p>
<p><b>[00:36:05]</b> <b>PC:</b> It brings me back to the concept I was talking about: is the model copyrightable? Is it under a license? What happens is, if we don't know whether or not content is protected by copyright, but we want other people to use it, we tend to put a license on it, because that's very clear. I just throw that out there because there's a downside to it, which is that by putting licenses on everything, we reduce the pool of what we're going to assume is freely available for everyone to use, what's in the public domain with no copyright, for whatever reason.</p>
<p>By putting a license on something, you're saying: I think this is copyrightable, and you need a license to use it. There is a negative consequence to doing that. There's a reason it's done, but there's also this negative consequence. The same concept applies to the algorithm: they would put a license on it, and that would make everybody happy, but we've now made a public statement that says we believe this algorithm is copyrightable and owned by one entity.</p>
<p><b>[00:37:06]</b> <b>SM:</b> Thank you so much.</p>
<p><b>[00:37:08]</b> <b>PC:</b> It&#8217;s such a pleasure.</p>
<p>[END OF INTERVIEW]</p>
<p><b>[00:37:10]</b> <b>SM:</b> Thanks for listening. Thanks to our sponsor, Google. Remember to subscribe on your podcast player for more episodes. Please review and share; it helps more people find us. Visit deepdive.opensource.org, where you'll find more episodes, learn about these issues, and can donate to become a member. Members are the only reason we can do this work. If you have any feedback on this episode, or on Deep Dive: AI in general, please email <a href="mailto:contact@opensource.org">contact@opensource.org</a>.</p>
<p>This podcast was produced by the Open Source Initiative, with help from Nicole Martinelli. Music by Jason Shaw of audionautix.com, under a Creative Commons Attribution 4.0 International license. Links in the episode notes.</p>
<p>[END]</p>
]]></content:encoded>
					
					<wfw:commentRss>https://opensource.org/blog/episode-1-transcript/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">17097</post-id>	</item>
	</channel>
</rss>
