Exploring the business side of AI

Transcript from October 11th Deep Dive: AI Business panel

Stefano Maffulli:

Then welcome everyone officially. Okay. Here we are. Thanks everyone, and welcome to Deep Dive AI. This is an event series from the Open Source Initiative that started as a podcast series, first exploring how artificial intelligence impacts open source software, from developers to businesses and to the rest of us. Today we start the second phase of this exploration with a panel focusing on the challenges and opportunities of AI as seen from the perspective of corporations and businesses. There will be three more panel discussions: one on Thursday the 13th, and then the 18th and the 20th. And the objective of the panels is to better understand the similarities and differences between AI and what I would call probably classic software, and particularly open source software. And so, I’m Stefano Maffulli and I’m the executive director of the Open Source Initiative.

Stefano Maffulli:

And today I’m joined by – in random order – David Kanter. He’s the founder and executive director of MLCommons, which is an open engineering organization dedicated to making machine learning better for everyone. And the members of MLCommons are basically the who’s who of AI, from Google to Baidu, to hardware manufacturers like Supermicro and Dell. There are startups in it too, really an impressive list of sponsors and members. David also co-leads the development of MLPerf, which is a set of industry-standard benchmark suites for measuring machine learning performance at every scale. Lots of experience in AI and development. MLCommons also maintains two large open datasets in the speech domain. Welcome, and thank you, David, for taking the time.

David Kanter:

Thank you so much for the opportunity to be here and the fantastic introduction. Everyone should know he did a great job of taking what was probably too long of an introduction and compacting it. So excellent work. Thank you so much. Thank you.

Stefano Maffulli:

Stella. Stella Biderman is next. She’s a leading natural language processing researcher and open source AI advocate. She runs EleutherAI, which is one of my favorite grassroots research groups, best known for pioneering open source large language model alternatives to the ones that have been released in a proprietary fashion. She also worked on datasets such as the Pile and the LAION-Aesthetics captions, and on developing the VQGAN-CLIP methodology for generating images from text. She’s also a member of the BigScience research workshop, where she worked on developing open multitask and multilingual language models, and she co-leads the evaluation working group. Thanks, Stella, for being here.

Astor Nummelin Carlberg:

A pleasure.

Stefano Maffulli:

Next we have Astor Nummelin Carlberg. He’s the executive director of OpenForum Europe, which is an independent non-profit think tank based in Brussels. Astor is responsible for the overall vision and activities of the organization and its policy development. He has extensive experience in European policy making, communications and network building, and he leads conversations on Europe’s digital challenges and the role of open technology in achieving its full potential. Thanks, Astor, for being here.

Astor Nummelin Carlberg:

Or, as you say, at least some, maybe some conversations. There are many conversations ongoing.

Stefano Maffulli:

Oh, there are so many, in fact, going on, so many. Sal Kimmich, thank you. They are an engineer passionate about helping peers, ethical actors, and digital enthusiasts to fill the cracks in the open source software supply chain. They work primarily with the Open Web Application Security Project and the Open Source Security Foundation to build systemic solutions to security issues. They also lead the efforts on rewards and incentives mechanisms for cybersecurity in the US federal cybersecurity mobilization plan. Thank you, Sal.

Sal Kimmich:

Thank you very much. I also am a machine learning engineer by training. I work primarily in scaling Kubernetes, and I used to work with supercomputers to do real-time brain image processing. So I’m excited to have this chat today. Thank you all.

Stefano Maffulli:

Wonderful. And finally, Alek Tarkowski. He’s the director of strategy at Open Future, another European think tank, which develops new approaches to an open internet that maximize the societal benefits of shared data, knowledge and culture. He has extensive experience with public interest advocacy, movement building and research into the intersection of society, culture and digital technologies. He is also a sociologist by training and holds a PhD in sociology from the Polish Academy of Sciences. He’s worked with the Prime Minister of Poland as a strategic advisor, among other experience very relevant to this field. Thank you

Alek Tarkowski:

You know as well that I’m on the board of Creative Commons, which I think is quite relevant.

Stefano Maffulli:

Very relevant, very relevant, super relevant, <laugh>, super relevant. So there are three main points that I’d like to cover with you all today. One is how AI is different from or similar to other technologies that we’ve seen before. The other is what lessons we can learn from open source that can enable collaboration and speed up the progress of AI. And then, what are the responsibilities that you see businesses should have to keep society safe from the abuse of AI, or from abusive AI. So let’s start with the first topic, a question for all of you, and you can pick who wants to answer. One of the comments I heard often is that AI is somewhat different from any other technology we’ve seen before, and therefore it poses really unique problems regarding ethics and responsible use. But you know, throughout history we’ve seen technologies emerge with great promise and potential peril all the time, with debates about whether they would be too dangerous to be given to the public. Like from firearms to social media to genetic engineering, we’ve seen a lot of technologies where the public or, you know, corporations wanted to retain full control. What’s your take? How is AI different from other technologies and industries specifically? Maybe David?

David Kanter:

We should go back to the original technology and mythology, right? If we think about fire, right? You know, in the Greek myth, the Titan Prometheus, I believe, who brought fire to humanity was not supposed to do that, got chained up to a rock and had a very unpleasant future. So it’s not just, you know, modern technology, right? I think this is a trope that actually goes back, you know, thousands of years in some ways.

Stefano Maffulli:

Indeed.

David Kanter:

Or you could even say Adam and Eve you know, for those, the Christian.

Stefano Maffulli:

Apples.

David Kanter:

Abrahamic traditions. But yeah.

Stefano Maffulli:

So yeah. So how about AI and fire?

David Kanter:

Yeah, does someone else want to take this? I’m always happy to talk, but I wanna –

Sal Kimmich:

I can jump onto that metaphor for a second, right? Because I think it is pretty crucial to consider that fire is only dangerous if it’s not containable. And in this case, there’s one context in this discussion, which is the runaway algorithm, which always gets brought up a lot. But when we’re talking about artificial intelligence and the way that we need to be viewing it, one of the things that I think is most important is: are we starting wildfires? So the question here, a lot of times, where we’re at really in this stage of our maturity in understanding AI as a sector, is understanding the computational costs and how to do large scale computing efficiently at a global scale. That, for me, would be containing the fire, letting us keep our humanity, and also getting to learn new stuff about the world with these beautiful machines.

Stefano Maffulli:

So do you think there is a difference? Do you think this is really brand new, or have we seen it all before?

Astor Nummelin Carlberg:

Well, maybe I can add just one thing here, of course taking a step back, because it’s sometimes more comfortable than saying something specific about AI and trying to predict the future. But it is interesting, like you mentioned in your intro, that from a policy perspective, in the kind of political discussions, we have heard very much the same reactions to a new technology. We can go all the way back to fire, but also to encryption. But we can also look at other sectors that are not AI. There are also questions around, let’s say, open sourcing or increasing the availability of knowledge around chip designs, which is considered something very strategic for governments. And we see a lot of analogues in other technology areas, and especially from my angle into this through open source. And it of course brings not only a lot of access, but it also brings a lot of speed to the development itself. And those, I think, are kind of the two elements in the open source AI policy discussion that really stand out to me.

Stella Biderman:

So I’m gonna strike out against what seems to be the dominant line here. I don’t think AI is particularly different or particularly special. I think that the way AI works is not particularly well understood by policy and legal experts, and it often doesn’t fit the mold of currently existing laws and regulations. And I’m sure we’ll talk about that soon when we talk about, you know, what does it mean to have an open source AI. But on a philosophical level, I don’t think it’s special. I think that, you know, oftentimes when there are new types of technology, be it AI, be it the internet, be it whatever, you need to adapt legal and societal frameworks to have this more nuanced and inclusive view of technology and how to regulate it.

Stella Biderman:

But I don’t think that there’s something fundamentally special about artificial intelligence. I see people on Twitter all the time talking about the importance of open sourcing AI on the internet, and a lot of people respond to me like, well, you wouldn’t want to give everyone the plans to build a nuclear weapon. And I think it’s really important to push back on that: that is not the scale, that is not what anyone is talking about doing. And I think that there’s a tendency to assume that the latest and greatest technology must be, you know, dangerous on a world-cataclysmic scale. And I don’t think that. You know, if OpenAI had open sourced the GPT-3 model, for example, as soon as they made it, I don’t think that’s in the same conversation as, you know, giving everyone in the world their own personal nuke and –

Stella Biderman:

Policy makers use language like that. Academics use language like that. But I think it’s really dangerous and irresponsible, honestly.

David Kanter:

I would say that my views are actually, you know, fairly close to Stella’s. I think there are a lot of ways in which, you know, AI is not that different from technologies that came before. I think there are some important ways where it is different, some idiosyncrasies, right? To pick one in particular, AI explainability, explainability for ML and neural networks, is to me a reaction to the fact that, you know, neural networks are currently somewhat inscrutable. And, you know, I’m not old enough to understand whether steam power was inscrutable given the understanding of physics at that point in time. But, you know, I sort of think, let’s go look at this technology and find the ways that it is, you know, sort of fundamentally different. And I think there may be, not fundamentally, but sort of second order characteristics of ML that are a little bit different, and that requires maybe a little bit different tooling to think about.

David Kanter:

But I think one of the things that is a first order, top level difference, getting into the legal side, and particularly with respect to open source, is that there is a lot of intuition that we collectively have built up around open source that focuses on code. And so the big thing today, and, you know, this is not an original David Kanter thought, Andrej Karpathy, you know, sort of said: look, in the context of ML, data is essentially the new code, right? And so what that means is all of a sudden we now have a regime where data was something that, you know, was not gone over with a fine-tooth comb, trying to understand licensing and different combinations and things like that. And now that’s something that we will need to do.

David Kanter:

And then there’s both data taking first stage along with code, and then there’s the interactions of those, and then all the different parties. So I think it exposes a lot of complexity, which is, I think, what you were saying, Astor, right? There are policy areas that we haven’t touched. But yeah, it doesn’t seem to me that AI at the top level is so fundamentally different. I think there’s also a lot of confusion, as with any new technology. Like, one experience I had was going to a conference where there were a lot of people talking about AGI and –

Stefano Maffulli:

AGI stands for?

David Kanter:

Artificial General Intelligence. And this is sort of the idea... so here’s the issue: I’m not actually sure what it means. I had emailed Stella, and both of us studied math at U Chicago, which only comes in one flavor, which is theoretical. And so you start with a very crisp, clear definition of anything. And one of the challenges I find is that getting a definition of AGI is sort of like, you know, holding a greased watermelon in the ocean: it just keeps on slipping away, and that’s fine, right? It sort of reflects some societal fears, right? Is my job gonna be displaced? And I think those are interesting things to take into account. But I think there’s a lot of education and understanding that needs to happen, because I think when many people pose these fears, it’s because they aren’t intimately familiar with it.

David Kanter:

And it’s sort of like any technology: it’s going to start out as sort of arcane black magic until at one point it’s prosaic. And flight is a great example of that. Like, I don’t know anyone who takes a flight on a commercial airliner and says, I need to study what plane I’m on to figure out if I’m gonna die cause it’s gonna crash. Now, that was a reasonable thing to do, potentially, in like the 1910s. But, you know, today flight is just this amazingly magical and beautiful thing that’s in the background for almost everyone. And one day I hope AI will get there.

Stella Biderman:

I think that’s actually a really interesting analogy because flight is exceptionally dangerous.

David Kanter:

Mm-Hmm.

Stella Biderman:

Nowadays nobody ever dies on airplanes. The odds of dying in the crash of an airplane are exceptionally small. It is arguably the greatest safety accomplishment of the technological sector, developing safe commercial air flight, just full stop. But the way that we got there was basically most of the governments of the world getting together and telling the overwhelming majority of people they weren’t allowed to build airplanes, and creating a very small monopoly. You know, people think about, you know, there’s a handful of companies in the world that produce basically all of the aircraft, we’re talking now commercial aircraft. There are like six or seven companies that produce basically every commercial airliner in existence. It doesn’t matter if you’re flying an American airline, if you’re flying a British airline, or flying in the UAE or China. There’s just a very small number of companies that produce all of them, and they have extremely careful regulations, both internally and externally from government, about, you know, quality control and manufacturing processes and all of this stuff. And ultimately that is how come we have these extraordinarily complex systems that can take tens of thousands of people, hundreds of thousands of people, into the air every day and not kill them.

Alek Tarkowski:

If I could add one more analogy, it’s housing. There’s this amazing episode of the 99% Invisible podcast, which is a podcast about design, which talks about how houses didn’t have fire safety systems for a very long time and were basically very dangerous technologies in which everyone had to live. The state of the art system was the basket with which you could try to lower the baby from the top floor, hoping that at least the baby would survive, right? And then a major breakthrough happened, which had to do both with technology and with standards and safety codes. And I always liked that. I also like, Stella, your example of GPT-3. I always thought that moment where they said, we have this system, but the responsible thing for us to do will be not to release it, is a very interesting one.

Alek Tarkowski:

And of course I like it that we’re talking about technology and the public discourse on technology, and I think those are two very different things, and I’m more an expert on the latter. So I will not comment on the technology. But for me, that GPT-3 move was quite symbolic, and it’s repeated a lot, both in the policy debate among policy makers and in industry debates, basically things like responsible and ethical. I’m not saying these haven’t been discussed before with regard to technologies, of course they were, but if we think about the traditional mode of, like, ship it while breaking things of many of the web, maybe not technologies, but products, right, or business models, I think there is something new, right? And we can discuss whether, for instance, the policy proposals around safe AI are good or misjudged. But if we look at the policy debate, there is something new here, and maybe we can come back to it. For me, that becomes very practically visible with the new RAIL licenses, which suddenly say, let’s take the open source stack and attach to it a responsible-use module, right? And that, for me, feels very interesting. I guess we’ll talk about it.

David Kanter:

Do you think that the statement of, we built this thing and it’s so powerful that we’re not gonna release it for ethical reasons... I mean, to me, I sort of read that as partially being marketing, to be perfectly honest.

Alek Tarkowski:

Very much so, I think, but it nevertheless built the debate.

Stefano Maffulli:

That’s it. Yeah. I got that same impression. It built the foundation of, oh my god, and freaked more people out, by stating that from a powerful organization.

Alek Tarkowski:

And about two weeks ago, again, I read on Twitter someone saying, Stable Diffusion is reckless, OpenAI is being responsible.

Stefano Maffulli:

Mm-hmm.

Alek Tarkowski:

Which I find very interesting in terms of the debate.

Stefano Maffulli:

In fact, it is very interesting. So if I can summarize, I think what I’ve heard from you is this: one of you mentioned the speed at which AI has been moving as a differentiator, somewhat, but not very important. The most important thing that you highlighted, David, is the importance of data and how its level of importance has changed a little bit, plus explainability and other technical issues that are still not clear inside the researcher community, or at least, you know, that still have lots of jumps to go through to evolve. But overall, basically, you’re all agreeing, it sounds to me, that this is nothing too special.

Sal Kimmich:

I’m not sure if I do. I really do want us to make a clarification here, because what I’ve heard discussed so far still very much sounds like large scale machine learning. It is absolutely different when we are looking at building something which is true, real artificial general intelligence, right? It’s very different in that case, because I’m not then going from collecting and curating a data set to then be able to have a machine that’s very efficient at producing known outcomes. What I’m then doing is having a machine sometimes make its own choices about exploring that data set, to then go on to build a data set. Now, for me, we don’t have anything existing in policy, and we cannot be doing this on a case by case basis. If this is going to be a policy enactment, it has to be categorical. Categorically, right now, I do not think we have anything in place to be able to identify to whom the intent belongs for negative secondary effects of a data set produced by, or an algorithm produced by, an artificial agent.

Sal Kimmich:

Right now in the US alone, if that’s done by a developer, that intent goes back to the developer. If it’s built by a developer who uses generative code, does that go back to that developer, or a percentage of it? Policy can help us with that. And I think we need to distinguish right now the ethics in open source and the ethics in artificial intelligence, cuz they’re super different. The premise for what we’re doing for ethics in open source is all around intellectual property. If we are actually having a discussion here about artificial intelligence, and not just large scale ML, AI is actually a discussion much more around the policy of intent. So that is what I really want to understand: are we able to parse that apart into two separate discussions? Cause I think they’re two very different ethics.

Stella Biderman:

Can you give a couple of examples of what would be artificial intelligence in this lingo? Some examples of things that come to mind are, like, OpenAI’s GPT-3, someone mentioned Stable Diffusion, you have reinforcement learning algorithms like AlphaZero. Which category do each of those fall under?

Sal Kimmich:

So I think, for these, Stella, if we are producing something which very fundamentally is data set first, which all of these models run on, right? This is how we’re getting our stacks to pull from. If it’s a data set first that we’re then learning from, then I think this contextualization that you’re working with is good. I think these stacks that we’re able to pull from make a lot of sense. But that’s very different from some of the things that I’ve built, where I was simply telling a computer: here’s a massive global data set of real time sensors, I want you then to go out into that state space, explore it, and decide yourself how you wanna do the feature representation. I think really the way that that feels as a developer is that, for the intent and the production of those features, I don’t even always tell it what it should initially optimize around. That does feel different from developing a deterministic pipeline.
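To make that distinction concrete, here is a minimal, purely illustrative Python sketch assuming toy problems on both sides: a dataset-first pipeline where the developer curates every example up front, versus an agent that generates its own experience by exploring a small state space with an invented reward. None of the data, the environment, or the reward comes from the panel; it only sketches the two development styles.

```python
import random

# Dataset-first: the data is fixed up front; the developer curates it and the
# model only fits a known input -> output mapping.
dataset = [(x, 2 * x + 1) for x in range(5)]     # curated pairs, known in advance
w, b = 0.0, 0.0
for _ in range(2000):
    x, y = random.choice(dataset)
    err = (w * x + b) - y
    w -= 0.01 * err * x                           # plain gradient step
    b -= 0.01 * err

# Agent-driven: the "dataset" is produced by the agent's own choices as it
# explores a toy five-state chain; the developer never enumerates the examples.
states, actions = range(5), (-1, +1)
q = {(s, a): 0.0 for s in states for a in actions}
s = 0
for _ in range(2000):
    a = random.choice(actions) if random.random() < 0.2 else max(actions, key=lambda act: q[(s, act)])
    s_next = min(max(s + a, 0), 4)
    reward = 1.0 if s_next == 4 else 0.0          # reward invented for illustration
    q[(s, a)] += 0.1 * (reward + 0.9 * max(q[(s_next, a2)] for a2 in actions) - q[(s, a)])
    s = 0 if s_next == 4 else s_next
```

In the first loop every example was chosen by the developer; in the second, the state, action and reward the model learns from are produced by the agent’s own exploration choices, which is the difference in intent being pointed at here.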

Astor Nummelin Carlberg:

Could I make what becomes a meta comment again? Bear with me. It’s very interesting, of course, to listen to you as experts talking about this, but from my point of view, and I’m bringing you in here also, Alek, since we are both working a lot in the Brussels space, there’s an interesting question around this general space. Of course I made the point about the speed of technological development on the one hand, but there’s also the limited number of people who actually have a deep understanding of how these systems work and what the potential effects on society could be. At the same time, there is this instinct or reaction that we need to regulate this space. Then it creates this question in my mind: who then holds the responsibility of education and learning and, like, teaching or explaining, if it is a very limited number of people that can actually, let’s say, bring that education? Because I think that’s quite different now, isn’t it? Compared to fire, compared to steam engines. How fast the ideas of a limited number of people can have effects very broadly in society, the time to market, so to say, is a lot quicker, and policy making is not always super comfortable with that kind of speed. So how do we square this circle? With, for example, two experts such as yourselves here, who does the responsibility fall onto?

Stella Biderman:

I think that’s a good point, but I don’t think that that is necessarily core to artificial intelligence. It’s more core to the way that most modern advanced artificial intelligences are developed, and specifically who they are developed by. Whether you’re interested in text generation or text-to-image modeling or reinforcement learning for playing games, the overwhelming majority of the research in this field is controlled by very large tech companies, and a very small number of them globally. And, you know, they have a lot of money and resources and influence to be able to, first of all, pump out this research very quickly. You know, a lot of what’s currently en vogue in AI is, roughly speaking: if you have twice as much money, you can finish the problem twice as quickly.

Stella Biderman:

These models are extraordinarily parallelizable. Your ability to actually purchase GPUs or purchase data assets is the primary limiting factor. And so, to make this a little personal, I’ve trained a 20 billion parameter language model called GPT-NeoX, and that took me about three months. If I had enough money, I could have done that in three weeks instead of three months. The difference there is solely about the number of GPUs I can afford to pay for. And, you know, we also know that some of these AIs, so DALL-E 2, for example, that OpenAI developed, was trained in less than a month. And that’s really a statement about resources, not something about the AI itself; it’s a statement about the fact that OpenAI actually had the resources to go do that. Someone else could have trained the same model in a year. But yeah,

Stefano Maffulli:

Yeah, in fact, I think this is an important topic, because in my mind it’s one of the largest differences between AI and machine learning, the way we’re talking about them, and classic software, right? Classic software today can be developed with less than a hundred bucks, a text editor, and lots of open source software to write applications. But when it comes to AI models, you know, we’re talking about half a million dollars and up to get the data set. But one thing that happened, I mean, this was not always the case: when software started to appear, machines were expensive, hardware was expensive, the availability of basic software was almost nonexistent. And so it was hackers who came together and started to democratize and share the means of production, if you want to use that sort of framing. So what do you think we need to do in order to achieve the same, to accelerate and create this basic set of commons that open source has enabled?

Sal Kimmich:

Well, I mean, at least this answer I don’t think is so much on a theoretical scale. I think this is something we’re working on at a corporate level right now. So it’s really interesting: in the federal mandate that’s come down for cybersecurity, we have to put in an SBOM, right? We have to put in a software bill of materials; we’ve gotta get a discrete outline of exactly what was utilized. Now, they put that out with the intent and thinking really just around cloud architectures. They didn’t really think about some of the more complex architectures that we’ve been looking at. Now, this has brought up a really fundamental question, now that we have to be able to label the data sets in those SBOMs and state their true provenance. And when we’re speaking about these big data sets that people have been pulling from, these really problematic, centralized, closed off data sets, even when we’re able to get into those to clarify their provenance, they’re literally trying to go through the top six data sets that corporations use right now and see if they’re valid, see if the images that they took were even valid to be used legally in the first place.

Sal Kimmich:

And they’re finding problems in every single data set. So it’s not even as if leaning on the corporate model has worked for us; we’re running into problems that look a lot like open source, just because of the size of the data that we need and the way that it has been scraped so far.

Stella Biderman:

This is a huge cultural issue in machine learning. Historically speaking, machine learning researchers tend to think that if they go out and collect or reprocess or repackage data, it’s its own thing and they can license it however they want, and there is no provenance before that. And, not to put too fine a point on it, that’s false, that is just simply not true. But it is historically the way the overwhelming majority of ML researchers have conducted themselves. And now we’re in a really awkward spot where there are a lot of very widely used datasets that explicitly have falsified provenance, that are used by thousands or even more researchers all the time, and there’s no real ability to either prevent that from happening in the future or kind of undo it. Building the correct documentation of even a relatively modest, by modern standards, dataset is an exceptionally large amount of work, and it’s not something that the organizations and companies that have the resources to do it really care about doing.

Stefano Maffulli:

Right? How do we fix that? Any thoughts?

Stella Biderman:

We could make it financially viable for companies to do this in the future, but the US government’s track record with actually levying penalties like that is basically nonexistent. So I wouldn’t hold out too much hope in that regard.

Sal Kimmich:

Yeah. But on the positive side, one of the people that’s doing some of the best work around this for NLP right now is on the call, so, right, there are two sides to the coin. But it really does take having that understanding. I think that’s perhaps where that level of education needs to come in: you know, yeah, we could teach it in schools, but right now the people that need to be aware of it are probably the legal entities of large corporations, because this is a massive lawsuit that we are all aware could be coming down the line. If you’re using two or three of these data sets, which several of these corporations are, the provenance is known to be invalid. And so they would have to then resource, and this is where it gets interesting.

Sal Kimmich:

This is where open source does come back into it, because in order to effectively resource the data sets and the computation for this, it does make sense to share both that compute and the outcomes, and to efficiently store those outcomes in a way that’s feature engineered, so that people can query them and pull what they need, but so that we’re not storing unnecessary data if we don’t need to. And that, again, is a bit of its own question. I think there are computational answers to what that is, but not everyone agrees with me as to what should be kept in a data set. So yeah, so much work to be done.

Stella Biderman:

I feel like there’s a disclosure I have to make at this point, which is that I was one of the people who created one of the largest currently widely used data sets for training these models. Most of the data in it is above board; some of it is not. We were trying to put together approximately 1.5 terabytes of text. We basically went with the standard of: it’s okay to use something if it’s widely used in machine learning already, because we’re a bunch of people, like, in a Discord channel, hacking away and trying to train our own AI. Obviously people aren’t gonna change their decision making based on what we choose, and we were okay with that, you know: if it’s already widely used by people who have a lot more resources, and by companies, then fine. But anything new that we were gonna pull in needed to actually have a real license on it. Of course, the paper that we put out describing that data set is now my most commonly cited paper and is widely used across the world. So I feel kind of bad about that, for sure.

David Kanter:

I mean, one of the things I would say, and I have spent a lot of time dealing with licensing and data sets because of MLPerf: in the two data sets we built, we were extremely particular about making sure that they had licenses that were compatible with the intended use. And so, to step back, the intended uses of our speech data sets, one is keyword spotting, one is full on automatic speech recognition, meant we wanted to support both commercial and research applications. And so we were only using data that was CC BY or friendlier, essentially. I think we actually, in one of the papers, may have had CC BY-SA, but that has its own problems just from the commercial standpoint, potentially. One of the things that I’ve found surprising, and, you know, I think a lot of folks are gonna run headlong into, is that when you’re doing research, you can kind of get away with whatever you want.

David Kanter:

Like ImageNet, for example, is a super classic machine learning data set. The licensing around that is, to put it politely, a quagmire. Some of our benchmarks are built around ImageNet. I would love to fix that in the near term, but one of the challenges is just that everyone historically uses it. And the other thing that is complicated is there’s not any sort of real uniform agreement on commercial use, right? And so, to some extent, first of all, you know, for all of us technologists in the room, one of the bizarre characteristics of the legal system is that, you know, many things are not considered settled until they have been fully litigated in the courts. And if you think about that as a programmer, you’re like, litigating things in court should be an exception handling process.

David Kanter:

Not the inner loop. It’s kind of weird, but, you know, it is what it is. And then the other thing is, even things that seem very unambiguous, like, oh, commercial use is granted, some legal departments will be more conservative. They might say, for instance: hey, you have a license that allows any use. Now, in David’s definition, any use includes training an ML model on that data. But some people might say, and as a matter of policy it might be a good thing, I don’t wanna weigh in on the policy aspects, but you could take the interpretation, well, that license was granted before, you know, people knew about AI, so when they said any use, they didn’t really know what they were granting, right? And so there are a lot of issues around licensing and making sure that things can be clean. And I think this is one of the areas where there’s a lot of opportunity to make the ML space lower friction.
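As a concrete illustration of the kind of filtering described above, keeping only data whose license is CC BY or friendlier so that both commercial and research use stay possible, here is a minimal Python sketch. The records, license identifiers and the decision to exclude share-alike are hypothetical, not MLCommons’ actual tooling.

```python
# Hypothetical records; in practice these would come from the raw corpus metadata.
records = [
    {"id": "clip-001", "license": "CC-BY-4.0"},
    {"id": "clip-002", "license": "CC-BY-SA-4.0"},
    {"id": "clip-003", "license": "CC-BY-NC-4.0"},
    {"id": "clip-004", "license": "CC0-1.0"},
]

# Licenses treated as "CC BY or friendlier" for a dataset meant to allow both
# commercial and research use. Share-alike is left out here because, as noted
# above, it raises its own questions for commercial redistribution.
ALLOWED = {"CC0-1.0", "CC-BY-4.0"}

def usable(record):
    """Keep only records whose license is on the explicit allowlist."""
    return record["license"] in ALLOWED

kept = [r for r in records if usable(r)]
dropped = [r for r in records if not usable(r)]
print(f"kept {len(kept)}, dropped {len(dropped)}")
```

An explicit allowlist, rather than a per-record judgment call, is what makes the resulting dataset’s terms easy to state and audit later.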

Stefano Maffulli:

Yeah, for sure. Yeah. Alek, I just wanted to hear from you about this, because you’ve studied this space a lot.

Alek Tarkowski:

Yeah. Oh, I’ve been looking at a very specific case, because it’s an interesting thing from the perspective of content licensing with Creative Commons licenses, and that’s the case of the use of photographs for face recognition training data sets. And this goes back, by now, a decade. It’s, for me, a fascinating case, because in 2014, when the YFCC100M dataset was built with 100 million images taken mainly from Flickr, maybe with Wikimedia Commons, it seems around a quarter of them were openly licensed photographs. It’s really huge. I know today these numbers are not so huge, the data sets are much bigger, and by the way, I don’t think they have a lot of openly licensed content. But that was the big case, and in basically exploring these issues, great work was largely done by Adam Harvey.

Alek Tarkowski:

He’s a Berlin based American activist, researcher and artist who, for instance, created MegaFace, a search engine where you can search whether you’re in the dataset, which currently seems to be becoming, which I find very interesting, a tool that’s also being run online by Andy Baio. And apparently other people are doing it, which makes me a bit hopeful. And, sorry, I’m jumping around the topic a bit, but when you ask what the solutions are, maybe it’s these little steps. They of course don’t solve everything, but if I see that in a few years something shifts from being an art project, basically, you know, a critical art project, to something that starts to feel like maybe a standard, that’s a good step. But basically these cases show that all over the place, just the license compliance, as you say, is a quagmire.

Alek Tarkowski:

Maybe that’s the easiest way to describe it: it is really confusing. You have really big research projects, usually with companies involved in the research, which take a very laissez-faire approach to how they understand the license. Admittedly, there will be people who will immediately tell you that, in the end, it’s not even clear why they wanted to use openly licensed content, because it’s very probable that, especially in the US, this is all fair use. And indeed, especially when your data set is actually not the photographs, for instance in the case we’re studying it is basically a list of URLs, then from a purely copyright point of view there might not be an issue. I think it’s not purely a copyright issue, by the way, and I hope we can apply some of the broader issues we discussed in the previous session to frame this issue, not just as: am I being compliant with copyright law? I think that’s too narrow. What I would see as an approach: I would like to have one data set, and I think it’s slowly happening, that will admit that in the past this has not been state of the art, define a really high standard, and run that data set and its governance against that high standard. Because I think this needs to be self-regulated. There will probably be policy debates on data sets on the legislative side; I don’t think it should go that way. So yeah, my 2 cents.

David Kanter:

Actually, can I inject a more poignant example that will tie together some interesting AI questions as well as some licensing conundra? I think conundra should be the plural of conundrum, by the way. Conundrums sounds okay, but conundra sounds cooler to me. So, I mean, everyone here has probably heard about Copilot, right, by GitHub, which was done by ingesting a very large amount of code. Now, a lot of that code has interesting licenses, varied licenses. And so it’s not really clear what the output actually is in terms of licensing. But one of the things that I would say anecdotally, and Stella and Sal, you should, you know, please correct me if I’m wrong, is that it’s not fully understood the extent to which, with deep neural networks, there is both a memorization function they perform, where they can emit some of the input that they received in some scenarios.

David Kanter:

And then there is also a transformative aspect, and you more commonly see the transformative expression, right? The memorization is relatively rare, but it can happen. And I believe there are instances where, you know, you have things that are just emitting memorized inputs. And so then the interesting question is, suppose you have a GPL input and you have the potential of emitting a copy of that, right? How does that work? Sort of my intuition is, well, if you’re emitting what was previously GPL code, it definitely is GPL on the output, right?
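One rough way to probe the memorization side mentioned here is to check whether generated output contains long verbatim windows from the training corpus. The sketch below is a naive n-gram overlap heuristic with invented snippets; it is not how Copilot or any particular model is audited, and overlap is only a signal of memorization, not a copyright determination.

```python
def ngrams(tokens, n=8):
    """All contiguous n-token windows of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_memorized(generated, training_corpus, n=8):
    """Flag generated text that shares a long verbatim window with training data.

    A shared 8-token window is a heuristic signal of memorization, nothing more.
    """
    gen = ngrams(generated.split(), n)
    for doc in training_corpus:
        if gen & ngrams(doc.split(), n):
            return True
    return False

# Invented example: a tiny "training corpus" and a generated snippet that
# repeats one of its windows verbatim.
corpus = ["int main ( void ) { return 0 ; } /* GPL-licensed toy snippet */"]
output = "int main ( void ) { return 0 ; }"
print(looks_memorized(output, corpus))  # True: the output repeats a training window
```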

Stefano Maffulli:

It’s a huge legal conversation. The jury is still out, from what I understand. And there is definitely lots of thinking going on on the legal side; we have a bunch of legal experts coming up in the next few days, and this is definitely one of the questions for them, for sure. But it is an interesting thing. I mean, it’s the same if you remove the code from the picture and start talking about art. You know, so many artists are so confused by now, because they look at, you know, graphic designs and stuff that are produced that look exactly as if they made them. So what’s happening there? How do we deal with it?

Sal Kimmich:

I mean, it does come down, and again, I’m just constricting this to US court precedents, it does come down to the change in the look and feel that sort of defines an intent of change. So you can take a website that is literally, you know, like an e-scooter website, and turn it into an e-bike website with a different color, and that is legally sufficient, because the argument that was used as the precedent behind that in the US, which I think is ridiculous, is the idea that you can have an entirely new song by doing, like, a Weird Al Yankovic cover, right? By taking the exact same style, the exact same intonation, and replacing, quite literally in this case, different text snippets, right? In order to produce something which is a new consumable object. Now, that’s the way that we would be thinking about this if we’re thinking that it’s taking larger snippets of code. But I think it’s actually a little bit more interesting, and I think this is a nuance that we’re missing with Copilot, which is that they’re genuinely short snippets. And so that begins to look and feel a lot more like, at least for me: what is the minimally viable unit of intellectual property on the web? Is it that one line of code that I wrote? Is it those four lines of code?

Stefano Maffulli:

And it’s not an easy answer. I mean, from what I hear from lawyers, there is no easy answer to that, cause it depends on case by case scenarios, and it depends on how good your lawyers are at defending that position in court, too.

Alek Tarkowski:

But I think the interesting thing is that it sort of is a legal issue and also isn’t, because when you look at it, it also shows that, for now at least, law is not working. I mean, the amount of clearly copyrighted images of Mickey Mouse you can find on Lexica, you know, produced by Stable Diffusion under, hypothetically, CC0, you know, it’s not a license, but the CC0 sort of tool or mechanism, is just staggering. Okay, it doesn’t go into millions, but it’s quite big. And what does it tell us? For me, it’s about enforcement. And I know lawyers will be interested in asking: so how do we get that enforced? Is it enforceable? I think when we have a broader conversation about business or society, there are also interesting questions. Does it need to be enforced? Stella, you shared that search engine.

Alek Tarkowski:

It’s done by the artists Mat Dryhurst and Holly Herndon, who say it’s a post-copyright project. Yes, they want to enforce some form of protection of their basic creativity, but they’re not interested in repeating the copyright debates, which I find very interesting. And maybe it also suggests that, vice versa, when we look at tools like open source licenses or open content licenses that build information commons, there are lessons to be learned from the last 20 years, but it’s maybe also a good moment to ask: do we really want to repeat, you know, exactly the same moves? I find it certainly sort of fresh and exciting that people openly say: our values are the same, it’s about some balance between sharing but also protection of what I hold dear, but we want to do it differently.

Stefano Maffulli:

Yeah. And as long as we’re talking about copyright, you’re basically pulling me into this: I started to wonder whether this is the right moment in history to think about something else, like to ditch copyright, especially when we think about open source, and imagine something new. And let me be a little bit more clear. If you go back to the sixties, the seventies and the early eighties, software started to emerge, and there was a conscious policy decision at that point from the hardware manufacturer IBM to separate, to unbundle, the software from the hardware, because they had this fear of being sued for creating a trust or having a monopoly. And so this disentangled the two pieces of a computer system that they were used to selling together.

Stefano Maffulli:

And they made a decision, they made a call, and they said, well, we’re gonna use copyright. And it wasn’t until the eighties that the courts actually said, yes, okay, copyright applies to source code and binary code. So today, you know, we talked about it, there are new things, there are new artifacts, like these data sets that create a model. And it seems like everybody is thinking in terms of copyright as we decide what to do. But is that the right thing to think of? Is that the right framework?

David Kanter:

Can I actually ask a very basic question? Is it a matter of settled case law what a trained model versus an untrained model is? Like what legal rubric it falls under?

Stella Biderman:

By an untrained model, do you mean, like, randomly initialized weights? So usually the way these models are trained is that you define an architecture, and then you fill in a billion numbers with random numbers sampled from zero to one, or something close to that, and then you train it, and those numbers change from random values to non-random values.
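A minimal numpy sketch of what is being described here: the “untrained model” is just an architecture whose weights are random numbers, and training only nudges those numbers toward values that fit the data. The toy target function, sizes and learning rate are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Untrained model": an architecture (a single linear layer) whose weights
# are just random numbers.
w = rng.standard_normal(3)
b = rng.standard_normal()
print("random weights:", w, b)

# Toy training data for the invented target y = x . [1, 2, 3] + 0.5
X = rng.standard_normal((256, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5

# "Training": gradient descent moves the random numbers toward values that
# fit the data; the architecture itself never changes.
for _ in range(500):
    pred = X @ w + b
    grad_w = 2 * X.T @ (pred - y) / len(y)
    grad_b = 2 * np.mean(pred - y)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

print("trained weights:", w, b)   # now close to [1, 2, 3] and 0.5
```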

David Kanter:

Right? Precisely. And so you have sort of this notion of model architecture, right? And then to extend this a little bit more, there’s this notion that you start with random stuff and then you train it, but there may be multiple steps in training, maybe using different data sets, right? The classic thing is, you know, Stella, your work was right on large language models, which you might train up to a certain point, and then someone downstream could potentially fine tune them, right? But, I mean, is copyright in fact the right rubric? You know, maybe the answer is there isn’t a legal rubric for a trained model at the current point, right? Because you can think of it in some senses as code and in some senses as data.

Stefano Maffulli:

Right, exactly. And it’s the output of a machine, and from what I understand, traditionally the output of a machine is non-copyrightable, at least in the United States. Again, I’m not a lawyer, but this is what they told me. And so we’re somewhat assuming, like, eh, let’s slap on a license that is deeply rooted in copyright and assume that that’s the right thing to do. But maybe we need to invent something new. Maybe we have an opportunity. What do you guys think?

Stella Biderman:

So the US Copyright Office actually very recently granted a copyright over an AI-generated image. There are actually two cases of this that I think are really illustrative of what current legal standards are. So in both cases, someone took an AI that takes text as input and generates images as output. And one person submitted a patent, sorry, a copyright application regarding their ownership of the image that came out of the AI, and that was granted.

Stefano Maffulli:

Yeah, I discussed this with one of our lawyers, and we probably have to write an article about this, because it’s fascinating. It’s not clear whether the US PTO, or the US Copyright Office, knew that the creation was actually generated by an AI.

Stella Biderman:

The application explicitly stated that; I’ve read the application. So that may or may not have been properly taken into account in their decision making, but it was certainly disclosed. And then the other example I wanted to bring up is that someone else submitted basically the same application, but they wanted to have the AI own the copyright.

Stefano Maffulli:

Yeah, that’s, I mean, there are cases where the copyright office did not accept the registration because of that. But without going too much into the weeds of the legal conversation, because we don’t have lawyers here, I mean, Alek, are you a lawyer? No, you’re a sociologist, so we’ll talk about the legal details more in depth in the future. But what is interesting to me is that at one point the European Commission created new rights. They created a right to data mining. They created an ad-hoc right to database structures. So, you know, is it completely out of the picture to invent something that is more useful and specific to AI, to create that commons that has powered open source and open culture, open knowledge, open science, you know, all of the opens that we have? They come out of a hack. Like, let’s think about that. The hacker community in the late seventies and eighties hacked copyright and created copyleft, and they established these concepts. They established the policies, they established the norms, the social contracts, that ended up creating this wider array of open knowledge that we call open source, that we call Creative Commons, that we call many different things, but they all have the same root: they’re hacks on top of copyright. Do we have an opportunity here that we’re missing?

Sal Kimmich:

I mean, I like to be a little bit hopeful and also a little bit pragmatic. I think that the spirit of open source, where it’s come from, was built up like this: you literally just have a generation of engineers who, because of the computation that was available to them, have almost exclusively worked on static architectures. The ethics around static architectures are: do no harm. Machine learning and artificial intelligence is a different sport and has much different social implications. I think that we should leave this out of the question of, sort of, case by case, what feels good to us as a society. If you wanna categorically see if there’s a way to regulate this, we have to start by actually putting the telemetry in place. So right now, I think what has really helped is this massive constraint from the US government of making every single federal deliverable with AI/ML within it be able to first provide something clear, demonstrable, and machine readable. And again, this is 2022.

Sal Kimmich:

And for the first time we’re actually demanding that we know what the data is, what its provenance was, what scripts you used, what their provenance was, and then what the actual date is that you produced the final outcome of this deliverable. So you’ve got an encapsulation of what was produced with that. Then we’ve got a really, really nice taxonomy that we can see. Are there perhaps specific types of pipelines that we do not want to be disclosing the data from? Maybe, but that would have to be categorical for me, right? It would not be case by case.
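A minimal sketch of the kind of machine-readable record described here: the data, its provenance, the scripts that touched it, and the date of the final outcome, pinned by content hashes. The field names and values are hypothetical and do not follow any specific SBOM standard such as SPDX or CycloneDX.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_of(data: bytes) -> str:
    """Content hash so the manifest pins exactly which bytes went in."""
    return hashlib.sha256(data).hexdigest()

# Placeholder bytes stand in for the real dataset archive and preprocessing script.
dataset_bytes = b"...speech_train_v1 archive bytes..."
script_bytes = b"# preprocess.py contents"

manifest = {
    "datasets": [
        {
            "name": "speech_train_v1",                         # hypothetical name
            "source": "https://example.org/speech_train_v1",   # placeholder URL
            "license": "CC-BY-4.0",
            "sha256": sha256_of(dataset_bytes),
        }
    ],
    "scripts": [
        {"path": "preprocess.py", "sha256": sha256_of(script_bytes)}
    ],
    "produced_at": datetime.now(timezone.utc).isoformat(),     # date of the final outcome
}

print(json.dumps(manifest, indent=2))
```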

Stella Biderman:

You’ve mentioned this before; is there somewhere I can learn about this new policy approach, Sal?

Sal Kimmich:

Yeah, so I really recommend you check out Datatology. That’s a group that’s working on this right now. I’ll put it right into the Slack, cuz we need more people involved.

David Kanter:

And I would just say, you know, one of the first principles that we built into MLPerf was this notion that we wanted things to be reproducible, right? And that requires probably many of the same properties that Sal has outlined. I think to a large extent that is a very good practice, you know, being able to identify how we pre-processed the data, right? Where did it come from? And just, you know, enabling things to be reproduced by third parties is a critical aspect of trust, right?

Stefano Maffulli:

Yeah. Yeah.

David Kanter:

Do we have, by the way, do we have a timetable? Do we have to shift gears between your three high level questions at all?

Stefano Maffulli:

No, no, we definitely –

Alek Tarkowski:

I think we’ve been shifting gears. Yeah, we’re already in high gear.

Stefano Maffulli:

Right? We are in high gear, I wanted to say. Yeah, we have another half an hour, and I think we touched on a bunch of the topics, but the one thing that maybe we should go back to and think a little bit more about is the challenges and opportunities that corporations and businesses have to create this collaboration. What do you hear, David, from your members? Maybe that’s a starting point for the conversation. What do they want? What do they hope for?

David Kanter:

Okay, so, you know, I should be clear: with respect to my organization, our goal is making machine learning better, and to some extent that means faster, through things like MLPerf, driving up quality and accuracy through other benchmarks, and enabling adoption of ML. For shorthand, I sort of think of our goal as: grow the pie of ML and extend the benefits to more people. I mean, one of the things that was, you know, interesting to me, and I think we had talked about this a little bit before, is that a lot of the interface with the regulatory side and the ethical side is not the high order bit for my organization. We’re very engineering focused, right? We want to build things.

David Kanter:

And so, to me, I think some of the conversation around ethics and responsibility is great to the extent that we can, for example, use it to inform a test. Let me give you a hypothetical example. Say we decide as a society, and there’s one issue, which is that we are not one society, right, there are many governments, there are not necessarily shared values, and that’s also something I have to grapple with, but let’s just say we decide that we would like ML algorithms to be equally accurate for men and for women. And I’m gonna ignore everything else on the spectrum to make this just a really contrived example. That’s something that you can actually probably reduce to practice with a bunch of tests, and it can help to measure, right? And I think that’s something that’s very important. But, you know, I think it’s important for technology to play a role in some of these discussions and to advise, so that you don’t end up with policy that is bonkers, for lack of a better term.

David Kanter:

But, you know, one of the things that I find a little bit challenging is I don’t think there is uniform agreement on overall directions. I mean, as an example, Alek, you had mentioned data sets for facial recognition before. Within my constituency, my member companies, I don’t know if there is any sort of consensus on whether that is a data set that we would wanna produce. And in fact, my instinct is to say, I don’t wanna produce a data set like that, because it’s too gray of an area for me to encompass. And there are both blessings and curses to that, right? Some of the folks who are in, say, automotive are vastly more thorough in how they inspect ML before it is ever deployed, compared to things like, you know, advertising and search, which are essentially largely unregulated.
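The contrived hypothetical above, equal accuracy for men and for women, is the sort of requirement that can be reduced to practice as a test. Here is a minimal Python sketch, with invented data, a deliberately biased toy classifier, and an assumed tolerance:

```python
def accuracy_by_group(examples, predict, groups=("men", "women"), tolerance=0.05):
    """Check whether a classifier's accuracy differs across groups by more than
    an agreed tolerance. `examples` are (features, label, group) triples."""
    correct = {g: 0 for g in groups}
    total = {g: 0 for g in groups}
    for features, label, group in examples:
        total[group] += 1
        correct[group] += int(predict(features) == label)
    acc = {g: correct[g] / total[g] for g in groups if total[g]}
    gap = max(acc.values()) - min(acc.values())
    return acc, gap, gap <= tolerance

# Invented evaluation data and a deliberately biased toy "model", just to exercise the test.
data = [((0.9,), 1, "men"), ((0.2,), 0, "men"),
        ((0.8,), 1, "women"), ((0.3,), 1, "women")]
acc, gap, ok = accuracy_by_group(data, predict=lambda x: int(x[0] > 0.5))
print(acc, gap, "within tolerance" if ok else "accuracy gap too large")
```

The point is that the acceptance criterion (here an assumed five point maximum gap) is written down and checked mechanically, rather than asserted.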

Stefano Maffulli:

Yeah. Yeah. So, basically, summarizing what I heard: there are very strong norms inside your community and among your constituents about what is acceptable behavior, what the constraints are around responsibility, or the feeling of responsibility?

David Kanter:

I would actually phrase it differently. I would say that I want to focus on things where there’s a clear intersection for all or almost all of my members. Our sort of byword is grudging consensus, which is not that everyone agrees, but everyone walks out of the room and no one is crying, right? At least most people are happy and a few people might be grumbling. And so part of what that means is that where you don’t have consensus, that is oftentimes an area where I think it makes sense to shy away from. And on a lot of the regulatory aspects, this is certainly something where I think we can add a lot of value, because we’ve dealt with many of these issues and have a great deal of understanding. But I don’t think it is my role to dictate policy to my members.

Stefano Maffulli:

Yeah. And Astor, I mean, now we are seeing a little bit of movement, also the progress of the rapidly advancing AI Act in Europe, thinking about regulating AI and its use and how it's produced. And Alek too. What are your thoughts on this? What do you think the businesses in Europe should be looking forward to?

Astor Nummelin Carlberg:

I mean, I think it’s an interesting point that David is making there. It’s a question then. So the kind of approach, and this has of course been welcomed by some and been criticized by others, but it’s this division of you knowdifferent rankings of risk applications and that is the approach that the EU took in the AI Act. And in some ways isn’t that, you know, linking that to what David said, it is taking, at least in the European context, some of that responsibility away from a person like David to, to not dictate policy, but in fact, in these high risk applications, the regulator would step in.

Sal Kimmich:

Yeah. So we learned this lesson in cybersecurity already, which is why I would like to tie these policy decisions to the telemetry that we're putting in place for cybersecurity, right? It's exactly the same model. You have known and unknown risks, and you can define those. Make sure that there is a scrapable database available that lets you know if something has changed. For example, one of my concerns going into the future is that not all of the databases that you'll be pulling from are necessarily static databases, right? The data itself may change and drift over time, and you need to know whether or not it's still trustable as a signal that you need for your algorithm. When we're stepping into that space, I think we need to think about it a little bit differently. But I guess there are two issues here.
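
For what it's worth, one way to make "is this still trustable" checkable is to publish summary statistics with each dataset snapshot and compare them over time. Below is a minimal sketch using the population stability index on one numeric feature; the synthetic data and the 0.2 threshold are illustrative only.

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """Rough drift score for one numeric column: ~0 means stable, > 0.2 is usually worth a look."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip empty bins so the log term stays defined.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Hypothetical example: two snapshots of the same numeric feature, months apart.
reference = np.random.default_rng(0).normal(0.0, 1.0, 5000)
current = np.random.default_rng(1).normal(0.3, 1.2, 5000)
psi = population_stability_index(reference, current)
print(f"PSI = {psi:.3f} -> {'drifted' if psi > 0.2 else 'stable'}")
```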

Sal Kimmich:

I still think, fundamentally, and this is unfortunate because I want the answer, we still haven't answered the question of whether we need a new policy for artificial intelligence. What I would say on the other side is I don't think we need new licenses for static pipelines, for databases. We just need them to be explicitly tied to the correct objects in the code. And as soon as we get that, you have the same ability to dissect and ascertain what level of risk you have, in the same way we're doing, in a regulated way, for cybersecurity.

David Kanter:

And actually, I just wanna point out there's a great point that Sal made that I want to amplify, cuz it's something I've experienced personally, which is that data sets do mutate over time, and in particular many image data sets, for policy reasons, right? There is a right to remove things. You can say, you know, I bought a new house, my house was in the database before, I'd like it removed, which seems like a very reasonable thing. Well, now you've changed the data set. And that has to be countenanced: there's a cost to it, which is that it makes research and a lot of other things very difficult. And so one of the things that I've looked at with some trepidation is that, you know, GDPR essentially makes almost every dataset that contains personal data conditional, right?

David Kanter:

So every single dataset that contains anything that might conceivably be personal information, or can be used to derive personal information, is not a static dataset. And I just cannot emphasize that enough. And I think one of the aspects here that is potentially challenging, and again, not saying it shouldn't be done, but this is where the conversation is helpful, is that as there seems to be more regulation, my intuition says regulators will want to be able to pull more control on behalf of their citizens and on behalf of their policy objectives, which will make things even less stable.
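
One practical response, sketched here as an illustration rather than anything MLCommons actually ships, is to fingerprint every dataset release so downstream work can pin the exact version it trained and evaluated on, and detect when records were removed or changed.

```python
import hashlib
import json
from pathlib import Path

def dataset_manifest(root: Path) -> dict:
    """Hash every file in a dataset directory so a release can be pinned and compared."""
    entries = {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }
    combined = hashlib.sha256(json.dumps(entries, sort_keys=True).encode()).hexdigest()
    return {"files": entries, "dataset_sha256": combined}

def diff_manifests(old: dict, new: dict) -> dict:
    """Report which records disappeared or changed between two releases."""
    removed = sorted(set(old["files"]) - set(new["files"]))
    changed = sorted(
        f for f in set(old["files"]) & set(new["files"])
        if old["files"][f] != new["files"][f]
    )
    return {"removed": removed, "changed": changed}

# Hypothetical usage: compare the snapshot a paper used with today's download.
# old = dataset_manifest(Path("snapshots/2022-01"))
# new = dataset_manifest(Path("snapshots/2022-10"))
# print(diff_manifests(old, new))
```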

Alek Tarkowski:

So admittedly, at least in Europe, data sets are not regulated, as far as I understand, right? It's systems, and especially deployed systems, with the whole language around users, or basically organizational users of systems. I was just today at the round table on the AI Act, which was interesting; there was both business there and civil society. And it's interesting, they showed a diagram someone drew of a sort of decision tree around the compliance issues if the AI Act is introduced, and even the representatives of, sort of, digital rights said that it doesn't sound very realistic. And I think that's the key question: realistic enforcement, right? What kind of scenario will that be? I'm not from the industry, so obviously I have some problems; some of the language to me feels too simplistic, about just protecting market actors by making their life easier, right?

Alek Tarkowski:

So for instance, there’s a very strong narrative that you need to help small and medium size enterprises, which in principle is good. But, you know, I looked and Clearview AI has around 30 employees. So if we take that logic, we just stuff them into that category and finish the conversation. So I think it requires a bit more balancing. Similarly, obviously, you’ve probably seen it, I think it’s relatively, I know the issue has been there, but the issue of regulating open source general purpose AI that has emerged I think only recently and will be a big conversation.

David Kanter:

I'm sorry, what do you mean by general purpose AI?

Alek Tarkowski:

Well, there are definitions thrown around; don't ask me to quote them, cause there are around three of them: one coming from the Slovenian presidency at the end of 2021, one from the French in May, and now there's a Czech version. So I, I can provide you with the details, but –

David Kanter:

Yeah, yeah, yeah.

Stefano Maffulli:

Sounds to me like the AGI that you referred to before.

Alek Tarkowski:

Regulations leave a lot of space for positive policies, right? That don't just aim to build guardrails, but really ask questions about, okay, we have some values, we have some ethics, and we also have a positive vision of generative technologies. And I think that's very hard to do. I would like to see a good approach to data sets. I think in the public interest there is a role for public actors, and I liked, Sal, what you talked about; that's exactly the kind of impact on the ecosystem I would like to see. But this conversation is not happening, as far as I know, for instance in Europe. And a lot of people believe that's good, because they're a bit scared that governments can be, you know, heavy handed.

Sal Kimmich:

Well, I'd love to bring this back just real quick and then we can jump back in. Why don't we take this back to aviation, because that is one of those places where we are still in a space of innovation. And I'm very careful about where I walk around on earth, because as a pilot, I know that there are three or four different airspaces that I can be underneath at any given time: Class A, B, and C. Class A, that's where you're getting all of this regulated space; that is big airplanes, everything's very controlled. Class C, they don't care if you strapped yourself to a drone: as long as there's nothing underneath you, which is what those spaces are zoned for, you're at a low risk of doing external harm. So I think it's important here to not put into place regulation which removes the central agency of a developer to still look and perform what is a fundamentally creative practice, right? That would be like telling architects they're not allowed to design. I think if we are looking at regulating this seriously, again, steal from what we've learned from cybersecurity: focus on what is most mission critical to global work, and then triage from there; the lessons that you learn in that mission critical space are useful. But ultimately, if somebody wants to go and run something bizarre on their own computers, on their own time, and pay their own carbon cost to do that, I don't wanna regulate that space.

David Kanter:

I think one of the things that is great, that actually both of you were saying, and I want to just tease this out and make it explicit, cuz it's a value that at MLCommons we feel strongly about within MLPerf, which is, you know, if you put rules into place, whether it's rules for a benchmark or laws or policies, I think it's actually very important that they actually be enforceable, right? And I think that is a very good thing to keep in mind. And I think that's actually one way in which AI is maybe a little bit different, because of the lack of explainability. You know, just a very simple example: in the United States, it is illegal to discriminate on many different bases for financial decisions like issuing mortgages, right?

David Kanter:

And I think one of the concerns about getting not just big AI models or ML models, but even small ones, in the loop is: if you can't actually explain what's going on, then you may have inadvertently violated the law, right? Which is obviously a bad situation, but also, you know, it complicates the regulatory side: how would you prove that there was bias going on, right? And there's sort of a question of who carries the burden: must you prove that there is an absence of bias, or, you know, can it be taken on good faith? And I think, to some extent, to me the European AI Act is saying, you know, we're willing to shift that burden of proof, that burden of responsibility, based on how important we think that application is.

Stefano Maffulli:

Are we witnessing, I mean, are these concerns, these issues, related only to the fact that AI is fairly new and being deployed with relatively rudimentary tools?

Stella Biderman:

What is the word "rudimentary" doing in that sentence? Why?

Stefano Maffulli:

Going back to the fire thing, you know, we started putting fires inside houses and we didn't know that carbon monoxide intoxication was a thing, or we didn't fireproof everything else, but we still tried it. And, you know, sometimes I get that feeling. And I don't come from software, I come from architecture, like buildings and stuff. And I remember the first time I started looking at how software is deployed: when you plan a bridge and you start building a bridge, there are lots of standards, there are lots of specifications, and there's not the same level of control with software. And I think, Sal, with the analogy with flying, it's something similar. I get this feeling that someone has given to a bank this software that automatically decides, removes the humans from the picture and decides whether you are worthy of a mortgage, without really having the frameworks around it to explain the decision, to prove that you're actually taking the right steps, that you are protecting, you know, the people who are applying for a mortgage, the society applying for a mortgage. That's what I mean by rudimentary.

Astor Nummelin Carlberg:

But I also think that many policy makers share that instinct as well. I mean, if you're looking at what is happening in Brussels and in many European member states, it's essentially moving away from the wait-and-see approach of the last 20 or 30 years of digital regulation. The general feeling is: we didn't like where that took us, and we need to start acting before the fact and take the risk of perhaps hindering certain developments. And this is a concern of ours, because one perspective that we look at regulation through is how it would affect, let's say, open source ecosystems, and, let's say, an individual open source developer's ability to participate in collaborative innovation. But now the view, you know, it's going into an ex-ante approach, broadly speaking: act early so that we don't find ourselves in the same situation we found ourselves in with, for example, privacy online.

Stefano Maffulli:

And I don't know what's going on here. So someone has joined. Yeah, and I wish I could click on this.

Astor Nummelin Carlberg:

Hello?

David Kanter:

Hello Amahd.

Stefano Maffulli:

Oh God. Okay, so why don't we go back to, well, one other question I had for Alek actually, because you were saying you had that meeting yesterday or this morning. What are the reactions from corporations when they see this regulation coming ex-ante?

Alek Tarkowski:

Mm.

Stefano Maffulli:

If you can share,

Alek Tarkowski:

It's interesting, because I think the high level view is interesting. Basically, I think the take is very different between corporations, international ones, and either small and medium businesses or their representatives, right? And obviously, and again, my feeling is that I follow the European policy debate, but I think it's the same everywhere: usually, in general, businesses are against regulation, right? That's their sort of first reaction. But to be fair, I think it's not just about that. I think there's a lot of criticism of the approach taken by the European Commission; of course it's very hard to simplify it, right? But there is this general sense that this focus on risk prevention, solely on that, is simply maybe not the best policy choice. But I feel right now, and this is where I start to have a problem, that all the businesses are basically trying to build a carve-out, right?

Alek Tarkowski:

And I think it would be good to have a conversation about why that is needed, right? And for instance, even on open source, which we are discussing today, I think in that conversation, I saw the papers, which I think make a fair point, that we need to look very closely at these open source approaches, right? To this so-called, I know, David, you have a problem with the term, general purpose AI. I can explain a bit; maybe I should have done so instead of joking that it's complicated. So basically, when they say general purpose, the way this regulation is structured, they're really interested in the application of specific high risk cases, right? So for instance, high risk is facial recognition. And they imagine a system that's just a facial recognition system that's provided, built by some company as a system, and then deployed, let's say, by a city, right?

Alek Tarkowski:

And they can sort of imagine what's happening there. And then they suddenly realize that these, what they call general purpose, are systems, like large models, that can be used in multiple ways. You can use it, for instance, maybe it has a military use, maybe it has a security use, and maybe it has some kind of publicly beneficial use, I dunno, in, let's say, education or in farming; maybe agriculture is an easier case, right? And this is where they introduce this term: they somehow need to deal with systems which they think can have both positive and negative uses. But the point I'm trying to make, thinking about the deep dive, is that I think this will be a really important conversation, to be able to explain what's specific about these open source approaches such that they, you know, warrant a different response to responsibility, basically. Because my thought immediately is, and coming back to your question about what's new in this approach as opposed to open source, I need to admit it is from the AI research community, this impulse to have new licenses that basically put responsibility and responsible use so much in the limelight. So I think this conversation about responsibility and open source, on one hand in the licensing space, on the other, in parallel, in the policy making space, is simply super interesting.

Sal Kimmich:

Okay. But, I mean, these ethics clauses, like the ethics statement on the license, have no teeth in my experience. So I'll give you one example from my real life. I cannot talk about a lot of this stuff from DC, but this one I can, cuz it was precontract. So they were asking a bunch of consulting firms to come together and test the IMDB data set, to try to find what was most predictive, to give us at least a few parameters of what they thought would be most interesting, after which step, from the open data set, you would then be selected to go on and work with their real data sets under security clearance, right? So you had a pretty good idea that this was gonna be used in a high stakes environment. The requirement for this first stage was that everybody used the open data set and made their outcomes openly available, right?

Sal Kimmich:

So this is just a GitHub page that goes out there. Now, I have just created something, and the most predictive element I found in the entire IMDB, well, it had to be both efficient and predictive, is that just by averaging the color of the posters, you've got the highest likelihood of knowing what type of movie it's gonna be: drama, thriller, et cetera. Now, yes, those were movie posters, but in that case, in my license, I needed to make an explicit statement that, even if not used by the individuals I was intending to, quote unquote, hand it off to, this open GitHub repository should not be used over human data, right? So these are the kinds of circumstances that we're seeing ourselves in all the time, this sort of, it's literally legally called negative externalities: the consequences not intended from the original use. I think that sort of intent is really underexplored.
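
For the curious, the kind of baseline Sal describes takes only a few lines. The sketch below is a hypothetical reconstruction, with invented file paths and scikit-learn as the classifier, not the original deliverable.

```python
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def mean_poster_color(path: Path) -> np.ndarray:
    """Average RGB value of a poster image -> a 3-number feature vector."""
    with Image.open(path) as img:
        return np.asarray(img.convert("RGB"), dtype=float).reshape(-1, 3).mean(axis=0)

# Hypothetical layout: posters/<genre>/<movie>.jpg
poster_dir = Path("posters")
features, labels = [], []
for poster in poster_dir.glob("*/*.jpg"):
    features.append(mean_poster_color(poster))
    labels.append(poster.parent.name)  # genre taken from the folder name

X_train, X_test, y_train, y_test = train_test_split(
    np.array(features), np.array(labels), test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("genre accuracy from average poster color:", clf.score(X_test, y_test))
```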

David Kanter:

I want to echo Sal's point. Like, I think enforceability of licenses, even in the open source context, is not super strong. And so adding more heavy lifting makes it very hard. And also, I mean, the other thing I would point out is that, you know, I studied math as well as economics at the University of Chicago. And so one of the things you will see is business structures and organizations that, you know, may allow for regulatory arbitrage, right? Like Uber in its early days, just, you know, as a very simple example. But, you know, it's very hard to see how things are used if they're entirely internal, right? So, yeah, I mean, there's just a lot of challenges in this space.

Stefano Maffulli:

Yeah. Yeah, I agree.

Stella Biderman:

Mm-Hmm.

Stefano Maffulli:

Indeed. Indeed. Okay. So we are pretty much at the top of the hour, and I'd like to close the panel with some very quick thoughts about a bright future: one where AI systems are mature, where we understand them perfectly, where we have all the tools, all the understanding that we need. So what do you think would have to happen for that to materialize, for that super understanding? What is gonna be the one thing that you think will bring us there?

Alek Tarkowski:

It's actually maybe quite simple in comparison to what you're doing. As I said, I'm, like, an information commons person. And by the way, I really appreciate Mike Linksvayer being here, and I really like his piece that frames this conversation, in that he connects the open source, or sort of code, debate and the information commons, or, like, content, debate. I think this is really quite new. And so what I would like to happen: I think in all of the open content world, the feeling is that suddenly these AI users came and they surprised everyone, and our users are confused about whether they wanted it, maybe they didn't want it, do they wanna opt out? There's some kind of general confusion and mess. And if we can sort that out and basically come up with good governance of data sets, that would be, for me, really cool, and hopefully not that complicated in the long run.

Sal Kimmich:

Yeah. Well, I mean, my argument in this space, and what I will be happy to help with for the next few years, is making sure, one, that there's a general understanding that ML and AI is not a static architecture; it's a highly composable architecture, meaning the licensing that has to be aligned with it has to itself be highly composable. This is why I'm not saying we need new licenses; we need new ways to attach those licenses to the relevant sub-objects within them. But I think we're getting there.
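
As a rough illustration of what attaching licenses to sub-objects could look like in practice, here is a hypothetical machine-readable manifest; the component names and license identifiers are invented for the example, not taken from any existing standard.

```python
import json

# A hypothetical manifest: each composable piece of an ML system carries its own
# license and provenance instead of one blanket license for the whole repository.
manifest = {
    "model": {"name": "example-classifier", "version": "1.2.0"},
    "components": [
        {"object": "weights/encoder.safetensors", "license": "Apache-2.0",
         "source": "trained in-house"},
        {"object": "weights/decoder.safetensors", "license": "OpenRAIL-M",
         "source": "fine-tuned from a third-party checkpoint"},
        {"object": "data/train_manifest.json", "license": "CC-BY-4.0",
         "source": "public dataset snapshot, 2022-09"},
        {"object": "src/preprocess.py", "license": "MIT",
         "source": "project repository"},
    ],
}

def licenses_for(manifest: dict) -> dict:
    """Map each sub-object to its declared license so tooling can check compatibility."""
    return {c["object"]: c["license"] for c in manifest["components"]}

print(json.dumps(licenses_for(manifest), indent=2))
```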

Stefano Maffulli:

Yeah. Stella, what do you wish for?

Stella Biderman:

A lot of things –

Stefano Maffulli:

Hard work?

Stella Biderman:

So I definitely agree with what Alek said. I think that would be phenomenal. So, to say something a little different: we've talked a little bit about explainability in machine learning. Almost all of the explainability research that is occurring right now, and has occurred in the past five, ten years, has kind of been through the lens of: you have an object, a machine learning algorithm that's been trained, and we want to explain the decisions it makes in terms of which inputs go in and which outputs come out. And this is actually leaving out, I think, a really crucial thing, which is the influence of the training data and how the model's behavior and capabilities evolved over time. And this is just something that almost nobody studies, and I think that making more efforts on studying this is going to be really important and essential, because you can only have a very limited view of what a model is doing if your story about how it behaves is entirely independent of the training data.
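
One concrete flavor of the gap Stella describes is training-data attribution, for example TracIn-style scores that ask which training examples pushed the model toward a given prediction. The toy sketch below uses a hand-rolled logistic regression on synthetic data; it is an illustration of the idea, not a production method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def example_gradient(w, x, y):
    """Gradient of the logistic loss for a single example, given weights w."""
    return (sigmoid(w @ x) - y) * x

def attribution_scores(w, X_train, y_train, x_test, y_test):
    """TracIn-style score: dot product of each training gradient with the test gradient.
    A positive score means a gradient step on that training example would lower the test loss,
    i.e. that example pushes the model toward the test label."""
    g_test = example_gradient(w, x_test, y_test)
    return np.array([example_gradient(w, x, y) @ g_test
                     for x, y in zip(X_train, y_train)])

# Toy setup: 2-feature logistic regression trained with plain gradient descent.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(float)

w = np.zeros(2)
for _ in range(500):
    grads = (sigmoid(X_train @ w) - y_train)[:, None] * X_train
    w -= 0.1 * grads.mean(axis=0)

x_test, y_test = np.array([1.0, 0.2]), 1.0
scores = attribution_scores(w, X_train, y_train, x_test, y_test)
print("most influential training examples:", np.argsort(scores)[-5:][::-1])
```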

Stefano Maffulli:

Yeah. Yeah. Makes all the sense.

Astor Nummelin Carlberg:

And I would build on Alek's point and just say that, in addition to untangling some of those confusions and having good governance in place, I hope we also end up in a situation where this new technology doesn't look too similar to the introductions of technologies of the last two decades, which just created a lot of concentrated power and resources in certain organizations. Instead, we find a way that this governance system also engages more people and really creates value for a broader set of humanity.

Stefano Maffulli:

David?

David Kanter:

All right. Yeah. I was slightly at a loss for what to say initially. I mean, I think actually everything everyone said was really good. One of the things that I sort of think about, and very strongly believe in at MLCommons, is, you know, the ability to create open data, open metrics, and all of these things to help democratize AI. Again, that's the goal of my mission. And I think it speaks to part of what you were saying, Astor, right? Which is, you know, fundamentally, how do we take something that people today see as magic and make it ordinary magic that suffuses our daily lives, in a way that doesn't violate the expectations of individuals, right? That's, to some extent, Alek, what you're saying, right? We don't want to end up on the wrong side of the famous equation that happiness equals reality minus expectations, right?

David Kanter:

And we want to keep people happy. And I think for a lot of that to happen, we do need to forge considerable clarity on licensing, on interactions, on how these things should work. In part because one of the things that I see as critically important, and a metaphor I like to use, is that a lot of ML and AI has been developed by digitally native entities, right? They have huge amounts of data. But one of the things that longer term I think is very important is to extend those capabilities and that magic to more classic, analog-centric entities. You know, to take the example I like: there's so much magic that is being worked on the internet, and not to pick on particular companies, but Amazon is an internet retailer. How can we bring some of that magic into the hands of a mom and pop shop, and what does it take to get there? There's so much friction we need to eliminate in terms of making things easier to train, easier to deploy, et cetera. So that's my parting thought.

Stefano Maffulli:

Thank you. Thank you very much. With that, thanks everyone, we are at the top of the hour, we made it. Thanks Astor, Alek, David, Sal and Stella. This has been phenomenal. Everyone, we'll talk on Thursday with a panel discussion focusing on society, and we will have speakers from the Electronic Frontier Foundation, Creative Commons, Hugging Face, and Luis Villa, who is a category of his own. So thanks everyone. See you Thursday.

David Kanter:

Thank you, Stefano, for the excellent moderation and guidance.

Stefano Maffulli:

Thank you so much.

Alek Tarkowski:

Thank you. Bye bye. Bye everyone. Have a good day.

Stefano Maffulli:

Bye.
