If AI Learns from GPL Code, Is the Output Also GPL? — The Hardest Copyleft Question in the Age of Generative AI

A New Question in the Age of AI

In recent years, the development environment has changed noticeably. In the past, writing code was entirely the developer’s responsibility. While IDE autocomplete features and simple code generation tools existed, the structure and implementation of a program ultimately had to be created by humans. However, with the rise of generative AI, this assumption is rapidly changing. Tools like GitHub Copilot, Claude Code, and ChatGPT have gone beyond simple assistance and become an integral part of the coding process. Many developers now rely on AI not only for function implementation and test code, but even for entire module structures.

This shift clearly brings advantages in terms of productivity. With just a few lines of description, developers can generate dozens of lines of code, and when working with unfamiliar libraries or languages, AI can quickly provide useful examples. However, as code generation becomes easier, new questions begin to emerge. In the open source ecosystem, these questions become even more pronounced. AI models are trained on vast amounts of public repositories, which likely include not only permissive licenses such as MIT and BSD, but also strong Copyleft licenses like GPL.

At this point, a natural question arises.
If an AI has learned from GPL-licensed code, should the code it generates also be subject to GPL? This is not just a matter of legal curiosity. If the answer is yes, then a large amount of code generated by AI tools could be required to comply with GPL obligations. On the other hand, if the answer is no, the protective structure of Copyleft licenses themselves could begin to weaken.

This issue is already moving beyond theory into real-world debate. When GitHub Copilot was first introduced, some members of the community raised concerns that it had been trained on GPL-licensed code. The question of whether the licenses of training data affect AI-generated output is not merely theoretical—it has the potential to become a real legal dispute. Developers are now facing a new question: not just “where did this code come from,” but “what did the AI that generated this code learn from?”

This is the point where this discussion begins. Open source licensing systems were designed under the assumption that humans write code. But as AI becomes deeply involved in the code generation process, that assumption is starting to break down. GPL was created to protect the freedom to share and modify code. But if AI can learn from that code and generate new versions of it, how far should the rules of GPL extend?

To answer this question, it is necessary to first understand the structure of the GPL itself. Without recognizing that GPL is not just a simple open source license, it becomes difficult to fully grasp the debate around AI-generated code. For this reason, the next section will examine the core philosophy and structure of the GPL license.

The Core Principle of the GPL License: Copyleft

Open source licenses may appear similar on the surface, but they are built on fundamentally different philosophies. Licenses such as MIT and BSD are known for being relatively permissive. They place minimal restrictions on code reuse—you can include the code in software, modify it, and even incorporate it into commercial products. In most cases, the only requirement is to retain the original copyright notice.

However, GPL takes a completely different approach. It is not merely a license that allows source code to be shared—it is designed to enforce the continued freedom of the code. The key concept that explains this structure is Copyleft. Copyleft, as the name suggests, represents a reversal of traditional copyright. It ensures that once code is made freely available, that freedom must be preserved in all subsequent versions.

This concept emerged alongside the GNU project. When Richard Stallman initiated the free software movement, his primary goal was not simply to make code available, but to guarantee the long-term freedom of software. Simply releasing code is not enough—someone could take that code, modify it, and turn it into proprietary software. Copyleft was created precisely to prevent this scenario.

The basic rule of GPL is straightforward. If you distribute a program that includes GPL-licensed code, that program must also be released under the GPL. In other words, software built on GPL code cannot be closed again. This rule is not just a technical guideline—it is a legal obligation. For this reason, GPL is often described as a “viral license.”

Because of this structure, GPL occupies a unique position in the open source ecosystem. On one hand, it serves as a powerful tool for promoting the spread of free software. On the other hand, it can be seen as burdensome in corporate environments, since using GPL code may require making the resulting software publicly available. As a result, many companies prefer licenses like MIT or Apache and approach GPL usage with caution.

However, determining when GPL rules apply is not always straightforward. Simply building a program alongside GPL code does not automatically trigger GPL obligations. The key factor is whether the resulting code qualifies as a derivative work. The obligations of GPL operate precisely on this concept.

Therefore, to understand the debate around AI-generated code, it is essential to clearly grasp this idea. Whether AI-generated code is subject to GPL ultimately comes down to a single question: is the code produced by AI a derivative work of existing GPL-licensed code?

When GPL Applies: Derivative Works

The most important concept in determining whether the GPL license applies is that of a derivative work. This is not a rule unique to open source licensing, but a broader legal concept used throughout copyright law. In simple terms, it refers to a new work created based on an existing one. Translated books, rearranged music, or modified versions of existing programs are typical examples.

The same principle applies to software. If part of existing code is copied, modified, or used as the structural basis for a new program, it is highly likely to be considered a derivative work. This is exactly where the rules of GPL come into effect. If you distribute a derivative work based on GPL-licensed code, the result must also be released under the GPL.

However, there is an important boundary here. Not all similar code qualifies as a derivative work. Copyright law distinguishes between ideas and expression. Functional aspects such as what a program does or the algorithms it uses fall under the realm of ideas and are not protected by copyright. What is protected is the concrete expression—the code itself.

For example, imagine two developers independently implementing the same algorithm in entirely different environments. The resulting code may look very similar. However, if neither developer referenced the other’s code, it is likely to be considered independent creation rather than copying. Similarity alone does not establish copyright infringement.

This principle has played a crucial role in the software industry on multiple occasions. In particular, legal disputes involving interfaces or APIs have often centered on this distinction. There can be multiple ways to implement the same functionality, and simply achieving the same result does not constitute infringement. Ultimately, what matters legally is the flow of information—whether the original code was actually referenced.

This concept applies directly to the issue of AI-generated code. Whether such code is subject to GPL ultimately comes down to the same question: is the generated code based on the expression of existing GPL code, or is it a new expression that merely implements the same functionality? These two scenarios lead to entirely different legal outcomes.

At this point, the introduction of AI makes the existing legal framework more complex. With human developers, the origin of code can usually be traced with relative clarity. But AI models are trained on vast amounts of data and generate code probabilistically. So where should the “origin” of AI-generated code be considered? What kind of legal relationship exists between the code an AI has learned and the code it produces?

To answer these questions, it is necessary to first understand how AI models actually learn and generate code. In the next section, we will examine the structure of how large language models learn and produce code from a more technical perspective.

How AI Models Learn Code

In the previous section, we examined the concept of derivative works, which lies at the heart of the GPL debate. Whether code is subject to GPL ultimately depends on how much of the original code’s expression is reflected in the new code. When human developers write code, this relationship can be traced relatively clearly. It is often possible to determine where code was copied from or which projects were referenced through records and version history. However, when AI becomes involved in the code generation process, the situation becomes far more complex. AI does not operate by simply copying or modifying specific pieces of code.

Large language models (LLMs) typically function as probabilistic models at the level of tokens. Whether dealing with code or natural language, all input is broken down into small units called tokens, and the model generates output by calculating the probability of the next token. In this process, the model does not store specific code like a database. Instead, it learns patterns and statistical relationships from vast amounts of training data. For example, common structures of Python functions or typical coding styles associated with certain libraries are embedded within the model as probability distributions.

Because of this learning method, AI-generated output is often described as “reconstructed code.” The model does not remember a specific file, but generates new code from a compressed representation of countless code patterns. As a result, the same algorithms or very similar code structures may appear in its output. In particular, widely used open source libraries or common algorithm implementations are likely to appear multiple times in the training data, which increases the likelihood that generated code will strongly reflect those patterns.

At this point, an important debate emerges.
What does it mean, from a copyright perspective, when an AI model generates code by reflecting patterns from its training data? Human developers also learn from and reference multiple projects when writing code. Experienced developers, having internalized various styles and patterns, are more likely to produce similar code. This raises the question: is AI learning fundamentally different from human learning, or is it simply the same process at a much larger scale?

At the same time, however, AI learning differs from human memory in important ways. Human developers typically do not reproduce specific code verbatim unless they intentionally copy or reference it. In contrast, LLMs have been observed in some cases to reproduce portions of training data almost exactly. This phenomenon is referred to as the memorization problem. Code fragments that appear repeatedly in training datasets are more likely to be regenerated with high probability.

Because of this characteristic, the issue of AI-generated code is not just about whether AI “understands” code, but about what it is actually reproducing. If a model effectively reproduces the expression of a specific GPL-licensed codebase, it could be seen not as a mere implementation of an algorithm, but as a reuse of the original code’s expression. On the other hand, if the same algorithm is implemented in a completely different way, it may be considered an independent creation.

In this way, the learning process of AI models creates a gray area that is not clearly defined within existing copyright frameworks. The model does not directly copy code, yet it is also difficult to regard its output as entirely original creation. To understand the legal implications of AI-generated code, it is necessary to examine real-world controversies. One of the most representative cases that brought this issue into public debate was the GitHub Copilot controversy.

The Copilot Controversy: AI Code Generation and the GPL Issue

The debate over AI-generated code and open source licensing began in earnest with the introduction of GitHub Copilot in 2021. Copilot, built on an OpenAI model, was trained on vast amounts of code from GitHub repositories and provided developers with code autocomplete functionality. When it was first released, many developers were excited about the potential productivity gains. The experience of having entire functions or algorithm implementations suggested after typing just a few characters was fundamentally different from traditional development environments.

At the same time, concerns quickly began to emerge within the community. A significant portion of Copilot’s training data came from public GitHub repositories. These repositories include not only permissive licenses like MIT and Apache, but also strong Copyleft licenses such as GPL and AGPL. As this became widely known, developers naturally began to ask a critical question: if Copilot was trained on GPL-licensed code, shouldn’t the code it generates also be subject to GPL?

The debate did not remain limited to community discussion. Some developers and legal experts began presenting cases where Copilot-generated code closely resembled code from specific open source projects. In certain tests, Copilot was even shown to reproduce fragments of code almost verbatim. These cases strengthened concerns that AI code generation might not be purely pattern-based, but could involve the reproduction of training data.

Eventually, the controversy escalated into a class-action lawsuit. At the center of the legal dispute were questions about whether training AI models on open source code constitutes copyright infringement, and whether the generated output should inherit the original license. The plaintiffs argued that since Copilot was trained on GPL-licensed code, its outputs should also be bound by GPL obligations. On the other hand, GitHub and Microsoft maintained that the model does not copy or store specific code, but instead learns statistical patterns, and therefore does not violate copyright.

This debate expanded beyond a single product into a broader discussion about the future of software development in the age of AI. If AI-generated code is considered GPL simply because the model was trained on GPL code, then AI coding tools could become difficult to use in corporate environments. Conversely, if no such restrictions apply, the protective structure of Copyleft licenses could be significantly weakened.

The significance of the Copilot controversy lies precisely at this point. It revealed that AI code generation is not merely a technical innovation, but a problem that directly collides with the entire framework of software copyright. And addressing this issue cannot be done through technical explanations alone—it inevitably moves into the realm of legal interpretation.

For this reason, the next section will examine what it means, from a legal perspective, for AI models to learn from code, and whether the training process itself could be considered copyright infringement.

Legal Perspective: Is Training a Form of Reproduction?

The AI code generation debate ultimately arrives not at technology, but at law. Technically, it can be argued that models do not directly copy code but generate it probabilistically. However, legal judgment operates under a different framework. The central question in copyright law is always the same: does a particular act constitute reproduction of an existing work or the creation of a derivative work? Therefore, whether the act of training AI models on code falls into these categories becomes a critical issue.

Some legal experts interpret AI training as a form of data analysis. From this perspective, the model does not copy or redistribute the training data, but instead analyzes it to extract statistical patterns. This interpretation is often linked to the concept of “transformative use.” If original data is used in a new and transformed way, it may not constitute copyright infringement. In fact, there are precedents where search engines and text analysis tools have been legally justified based on this reasoning.

On the other hand, some experts argue that AI training is effectively a form of large-scale reproduction. Training a model requires downloading vast amounts of code and loading it into memory, which inherently involves creating full copies of the data. Additionally, studies have shown that models can, in some cases, reproduce specific code fragments almost exactly. These findings support the claim that AI training may not be mere analysis, but a form of potential reproduction.

What makes this issue even more complex is the internal structure of AI models. While models do not store training data as explicit copies, they can still reproduce certain code patterns with remarkable accuracy. In other words, there may be no direct copy of the original code within the model, yet similar expressions can reappear in the output. How this situation should be interpreted within existing copyright frameworks remains unclear.

Ultimately, the legal debate converges on a few key questions: Is the act of training an AI model on code itself a form of copyright infringement? And can AI-generated code be considered a derivative work of existing code? There are no definitive answers yet. Related lawsuits are ongoing in multiple jurisdictions, and courts are still grappling with how to interpret the nature of AI technology.

This debate goes far beyond a single technical issue. If AI training is deemed to be copyright infringement, most existing large language models could face serious legal challenges. Conversely, if training is recognized as entirely lawful, the protective mechanisms intended by open source licenses could be weakened. Ultimately, this issue leads to a broader question about how AI technology will coexist with existing copyright systems.

And at this point, another fundamental question emerges. Even if AI-generated code is determined not to be a copy of existing code, one issue still remains: can AI-generated code itself be protected by copyright? This is the question we will explore more deeply in the next section.

A More Complex Issue: Copyright of Generated Code

In the previous section, we examined the legal implications of how AI models learn code. However, the debate does not end there. The issue of AI-generated code introduces a more fundamental question: can code created by AI itself be protected by copyright? This question goes beyond the licensing of training data and directly determines the legal status of the generated output.

In recent cases, U.S. courts have emphasized the principle of human authorship. According to this principle, copyright applies only to works created through human creativity. Works produced by non-human entities are, in principle, not eligible for copyright protection. As courts have increasingly refused to grant copyright to AI-generated images and text, this principle has become more clearly defined. If the same reasoning is applied to code, AI-generated code may also be ineligible for copyright protection.

This situation creates an interesting legal paradox. Open source licenses generally operate on the basis of copyright—the copyright holder grants others permission to use the code under certain conditions. But if AI-generated code has no copyright, then there may be no legal basis to apply a license at all. In that case, applying licenses such as MIT, Apache, or GPL could become legally meaningless.

This is not merely a theoretical issue. Developers using AI code generation tools are already facing this reality. Suppose a developer describes a function to an AI and receives generated code. In that case, who is the author of the code? Is it the AI model, the developer who wrote the prompt, or the countless contributors whose code was included in the training data? The current legal framework does not provide a clear answer to this question.

An even more intriguing possibility is that if AI-generated code is not protected by copyright, it may effectively fall into a state close to the public domain. Works without copyright can be freely used by anyone. At the same time, however, if the generated code reproduces the expression of an existing work, it may still be subject to that original copyright. In other words, AI-generated code can exist in a contradictory state—having no copyright of its own, while still being influenced by someone else’s copyright.

Because of this complexity, the issue of AI-generated code cannot be resolved simply by asking whether GPL applies. The more fundamental question is the legal status of the generated code itself. And depending on how this question is answered, the future of Copyleft licenses may also change significantly.

If AI Output Were GPL

Let us now consider a thought experiment. Suppose that AI-generated code is deemed to be influenced by its training data and therefore subject to GPL. In other words, if an AI model has learned from GPL-licensed code, then the code it generates must also follow the GPL. At first glance, this assumption may seem fair. If the goal is to protect the freedom of GPL code, then any derivative output should preserve that same freedom.

However, extending this logic reveals a much larger problem than expected. Modern large language models are trained on billions of code files. Within this data, a wide range of open source licenses are mixed together, and it is highly likely that a significant portion includes GPL-licensed code. If licensing is determined based on training data influence, then most AI-generated code could effectively be subject to GPL.

In such a scenario, the development environment would change dramatically. Developers using AI code generation tools would face the risk that their code might unintentionally carry GPL obligations. This becomes particularly serious in commercial software development. If a program includes GPL code, it may require the entire source code to be disclosed. From a corporate perspective, simply using AI code generation tools could become a legal risk.

This scenario could also have unexpected effects on the open source ecosystem. If most AI-generated code is considered GPL, it could lead to a situation where GPL effectively “spreads” across newly created code. This outcome may differ from the original intent of Copyleft. While GPL was designed to promote the spread of free software, in the age of AI it could be applied in ways that were never anticipated.

In reality, this scenario has already sparked significant debate. Some developers argue that AI-generated code should indeed be treated as GPL, while others counter that such an approach would make AI technology practically unusable. In corporate environments especially, the use of AI coding tools could become a serious legal liability, potentially limiting the adoption of the technology.

Ultimately, this thought experiment highlights an important point. The issue of AI-generated code is not merely an ethical question of whether GPL should be enforced—it is a broader question of whether the current licensing framework can function as intended in the age of AI.

Conversely, If GPL Does Not Apply

Now let us consider the opposite scenario. Suppose that even if an AI model has learned from GPL-licensed code, the generated output is not subject to GPL. Under this interpretation, the AI has merely learned patterns, and the resulting code is considered a new creation. Many AI companies and some legal experts argue that this is the more realistic direction.

If this view is accepted, the legal risks associated with AI code generation tools would be significantly reduced. Developers could freely use AI-generated code without worrying about specific license obligations. In corporate environments, AI coding tools could be adopted more aggressively. From the perspective of technological diffusion, this conclusion appears far more practical.

However, this interpretation also introduces new problems. It could weaken the protective structure intended by Copyleft licenses. For example, imagine a developer inputs an entire GPL-licensed project into an AI system and asks it to “rewrite the same functionality in a different way.” If the generated output is treated as new code, it could be distributed without GPL obligations.

Such a scenario could undermine the core philosophy of Copyleft. GPL was designed to preserve the freedom of code, but with the emergence of AI rewriting, it may become possible to bypass these rules. In fact, some developers argue that AI code generation exposes the structural limitations of Copyleft licenses.

This issue is not limited to GPL alone. The entire open source licensing system was built on the assumption that human developers write code. But in an era where AI becomes a central actor in code generation, that assumption may no longer hold. As a result, AI code generation is emerging as a new variable that tests the boundaries of existing licensing frameworks.

At this point, we return to the original question. If AI learns from GPL code, should its output also be GPL? As we have seen, there is still no clear answer. The issue lies at the intersection of technology, law, and open source philosophy.

For this reason, the next section will step back and examine the debate from a broader perspective. We will look at the questions AI code generation raises for the open source ecosystem and explore the possible directions this discussion may take going forward.

Conclusion: We Still Don’t Have the Answer

So far, we have followed a long path guided by a single question: if AI learns from GPL-licensed code, should its output also be GPL? At first glance, this question seems simple. Open source licenses have existed for decades, and their rules appear relatively clear. However, with the emergence of AI code generation, the issue has suddenly become far more complex. Legal concepts such as derivative works, reproduction, and the scope of copyright were all designed around human developers. But now, the process of writing code involves not only humans but also large probabilistic models that generate new code based on billions of learned code fragments.

This shift is undermining the very assumptions on which open source licensing systems were built. GPL was designed as a powerful mechanism to preserve code freedom, but its rules were created for an environment where humans write and modify code. In contrast, AI models do not copy or edit code in a traditional sense—they learn patterns and generate new code probabilistically. This creates a gray area that existing copyright concepts struggle to explain. There is still no clear standard for determining whether AI-generated code is a derivative work or an independent creation.

On top of this, another issue emerges: whether AI-generated code itself has copyright at all. If code created by AI is not considered a human work, it may not qualify for copyright protection. In that case, there may be no legal basis to apply a license in the first place. This would affect not only permissive licenses like MIT or Apache, but also Copyleft licenses like GPL. In other words, AI code generation raises a deeper question—not just whether GPL applies, but how the very concept of licensing should function.

The debate around this issue is likely to continue. Lawsuits related to AI training data are already underway in multiple countries, and courts are still trying to interpret AI technology within existing legal frameworks. Some rulings have leaned toward denying copyright protection for AI-generated works, while other discussions focus on how to protect the rights of training data. However, there is still no clear legal precedent defining the relationship between AI-generated code and open source licensing.

What we are witnessing is likely a transitional phase. AI code generation is already part of modern development environments, yet the legal challenges it introduces remain unresolved. Open source licensing may need new interpretations—or even entirely new rules—to adapt to this change. Some developers fear that AI could weaken Copyleft licenses, while others argue that in an era where code generation becomes trivial, the meaning of licensing itself may evolve. Regardless of the perspective, one fact is clear: AI is transforming not only how software is built, but also its legal structure.

The question explored in this article does not lead to a definitive conclusion. Whether AI-generated code should inherit GPL remains unanswered—technically, legally, and philosophically. Yet the importance of this question lies precisely in its ability to challenge the assumptions we have long taken for granted about the open source ecosystem. AI code generation is not just a shift in development tools—it is forcing us to reconsider how software is created and shared.

And at this point, the next discussion begins. So far, we have explored the tension between GPL and AI-generated code. But to fully understand this issue, one more concept must be examined in depth: clean-room implementation. This legal strategy, historically used to avoid copyright infringement while reproducing functionality, is becoming newly relevant in the age of AI. In the next article, we will take a closer look at how clean-room implementation works in practice and under what legal conditions it is recognized.