Author: Deng Jianpeng, Zhao Zhisong; Source: Journal of Xinjiang Normal University
This article was first published in the Journal of Xinjiang Normal University in 2025.
Abstract: With the rapid development of generative artificial intelligence, the release and open-sourcing of DeepSeek have attracted wide attention. Through breakthroughs in its algorithmic model, knowledge distillation, and chain-of-thought technology, DeepSeek has achieved low-cost training and efficient output. These technological changes cut both ways for the regulation of generative AI: on the one hand, chain-of-thought reasoning and open-sourcing of the model alleviate the "algorithm black box" dilemma and facilitate regulatory review; on the other hand, they raise new challenges for privacy protection and data compliance, intellectual property, the "hallucination problem," and model security. DeepSeek's breakout success shows once again that the regulation of generative AI should be future-oriented, with privacy protection and data security, the balance between technological innovation and protection, model prompting and feedback mechanisms, and model security as regulatory priorities, so as to prevent technical abuse and malicious attacks and to safeguard and promote the safe and healthy development of generative AI.
Keywords: DeepSeek; Generative AI; Legal compliance; Supervision; Artificial Intelligence
Author profile: Deng Jianpeng, Professor and Doctoral Supervisor at the School of Law, Central University of Finance and Economics, and Director of its Financial Technology Rule of Law Research Center; Zhao Zhisong, master's student at the School of Law, Central University of Finance and Economics.
Summary
Technical features and legal issues of DeepSeek: knowledge distillation technology and compliance issues; the chain-of-thought model and compliance issues; regulatory advantages and new problems of open-source models.
The positive impact of DeepSeek on the current regulatory dilemma: on the one hand, technologies such as model open-sourcing and chain-of-thought reasoning help alleviate the "algorithm black box" dilemma; on the other hand, the active open-sourcing of large models represented by DeepSeek not only enhances public trust but also gives regulators more convenient and transparent conditions for review.
New challenges brought by DeepSeek's technological changes: intensified challenges to privacy protection and data compliance; intellectual property disputes; an aggravated "hallucination problem"; and model security issues.
One dilemma of generative AI regulation is that regulatory measures must constantly adapt to rapidly developing technology. Some scholars argue that privacy protection, model security, technical standards, and open-source licenses should be governed by "soft law," such as flexible and diverse, cooperatively tested, and cross-border industry standards. Regulators should cooperate broadly with enterprises, industry organizations, and enterprise alliances, and combine "soft law" with "hard law" through various mechanisms to regulate the development of generative AI.
Since OpenAI, an American artificial intelligence company, launched ChatGPT, content-generating AI has moved from single-modality applications to multimodal deployment that empowers entire industries, and the rapidly developing technology continues to change the world. At the end of December 2024, DeepSeek, a Chinese AI company, released and open-sourced the DeepSeek-V3 model, whose performance is comparable to today's top closed-source models; in January 2025 it released the reasoning model DeepSeek-R1, which matched or exceeded OpenAI's o1 on multiple benchmarks. Unlike models that rely on massive data and computing power, DeepSeek has attracted widespread attention with its Group Relative Policy Optimization (GRPO) algorithm, knowledge distillation, and long chain-of-thought (Long-CoT) technology. Some scholars believe that DeepSeek achieved performance comparable to top products at as little as one tenth of their training cost, setting new benchmarks in reasoning efficiency and scenario adaptation, achieving low-cost, efficient model training and output, and demonstrating excellent capabilities.
DeepSeek's low-cost, efficient training represents the latest major breakthrough in China's generative AI technology and will further promote the wide application of large-model technology, but some of its techniques demand regulators' attention. Current academic research on generative AI regulation focuses mainly on algorithm and data compliance, with particular attention to regulating false information, ethical risks, and tort liability and exemptions. Algorithm compliance centers on algorithm review, algorithm explanation, and algorithm accountability, while data compliance centers on the legality of data sources and data privacy protection. DeepSeek's technical applications can, to a certain extent, ease some current difficulties in AI regulation: for example, they improve the interpretability and transparency of the "algorithm black box," and open-sourcing the model facilitates regulatory review. However, some of these technical means also pose new regulatory challenges, including heightened risks to privacy protection and data compliance, intellectual property disputes arising from distillation technology, an aggravated "hallucination problem," and model security risks. Some scholars believe we have reached a historical turning point: part of the major course of current history is being driven by decisions made by non-human intelligence, which is why the fallibility of computer networks has become so dangerous; when computers become drivers of history, such errors can bring disaster. Therefore, while attending to DeepSeek's technical features, we should focus on the regulatory trends accompanying cutting-edge technological development. Driven by market competition and technological needs, companies usually optimize algorithm explanation and pursue algorithm transparency of their own accord in order to enhance product competitiveness and user trust. Regulation should therefore focus on areas where enterprises lack such self-motivation but public interests and security are at stake, protecting the rights and interests of vulnerable groups and preventing technology abuse and malicious attacks.
I. Technical features and legal issues of DeepSeek
DeepSeek adopts a series of technical innovations such as Multi-Head Latent Attention (MLA). Some of these techniques concern only algorithmic logic and have no bearing on regulation; here we focus on DeepSeek's technical features and legal issues from a regulatory perspective.
(I) Knowledge distillation technology and compliance issues
Knowledge distillation is one of the key means by which DeepSeek optimizes model performance and improves resource efficiency. Its core method is to train a smaller "student model" to imitate the outputs or intermediate features of a larger "teacher model," enabling low-cost, efficient deployment. The small model imitates the large model's predictions, which are not simple answers but predicted probability distributions (akin to "soft labels") that carry richer information, so the small model performs better in testing. From a technical perspective, although current distillation techniques work well, distilling directly from final predictions is not optimal and is inferior to distilling from the model's intermediate-layer features. In addition, the knowledge distillation loss formula has certain defects: the small model mixes together data that should be analyzed separately, which degrades knowledge transfer.
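To make the "soft label" idea concrete, the following is a minimal sketch of the classic distillation loss in the style of Hinton et al., assuming PyTorch and hypothetical teacher/student logits; it illustrates the generic technique only, not DeepSeek's actual training code.

```python
# Minimal soft-label knowledge distillation sketch (generic technique, not
# DeepSeek's implementation). Assumes teacher/student logits are available.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine cross-entropy on hard labels with a KL term that pushes the
    student toward the teacher's softened probability distribution."""
    # "Soft labels": the teacher's probability distribution at temperature T.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between student and teacher; T^2 keeps gradients comparable.
    kd_loss = F.kl_div(soft_student, soft_targets,
                       reduction="batchmean") * temperature ** 2
    # Standard supervised loss on the ground-truth ("hard") labels.
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1 - alpha) * ce_loss
```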
From a regulatory perspective, current legal disputes over distillation technology concentrate on data compliance and intellectual property. If the "teacher model" is trained on defective or inaccurate data, those defects are transferred to and amplified in the "student model" during knowledge transfer, which may further worsen the "hallucination problem." There are no clear legal rules on data sources, processing methods, or data privacy protection in the application of distillation, nor a clear legal allocation of responsibility for data defects amplified in transfer. Although distillation technology is relatively mature, the boundary between legitimate model use and infringement remains blurred. Distillation is not simple code copying but deep-level utilization. Even where open-source models permit use, neither open-source licenses nor industry norms clearly define the specific restrictions on distillation, and detailed rules for closed-source models are likewise lacking, making it difficult to determine at what point distillation remains fair use and at what point it constitutes infringement, leaving a gap in regulatory standards.
(II) Chain-of-Thought Model and Compliance Issues
DeepSeek's chain-of-thought model is a technical iteration of the instruction-prompt model, and its strong results may gradually push large models from instruction-following toward reasoning. Chain-of-thought technology improves a large model's reasoning ability, the interpretability of its thinking process, the controllability of the model, and the flexibility of its output through serialized reasoning, and is suited to highly complex tasks. It differs from instruction-prompt models in that the latter rely mainly on preset prompts to guide output and face limitations on complex tasks. The chain-of-thought model simulates the human thinking process, constructs a logically coherent chain of reasoning, reasons and analyzes step by step, and displays that process. This not only improves contextual understanding but also captures earlier information when processing long texts or multi-turn dialogues, maintaining logical coherence and generating more accurate and reasonable answers.
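As a simple illustration of the difference described above, the sketch below contrasts an instruction-style prompt with a chain-of-thought prompt; the wording and the example question are illustrative assumptions, not DeepSeek's actual prompt templates.

```python
# Illustrative prompts only; the question and phrasing are hypothetical.
question = "A store sells pens at 3 yuan each. How much do 12 pens cost?"

# Instruction-prompt style: the model is asked only for the final answer.
instruction_prompt = f"Answer the question directly.\nQuestion: {question}\nAnswer:"

# Chain-of-thought style: the model is asked to expose intermediate steps,
# which makes the reasoning inspectable but also surfaces any data it recalls.
cot_prompt = (
    f"Question: {question}\n"
    "Let's reason step by step, showing each intermediate step, "
    "then state the final answer."
)
```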
From a regulatory perspective, the "algorithm black box" of instruction-prompt models means they output only final results, creating risks of data and privacy leakage in those outputs. The chain of thought displays the intermediate steps of reasoning, which may contain privacy fragments from the training data, infringing personal privacy or other rights and interests and triggering compliance challenges. In addition, the chain of thought may aggravate the "hallucination problem." An instruction-prompt model simply gives a result, whose possible inaccuracy users can accept to some degree. But because the chain of thought mimics "human reasoning" and lets users clearly see the thinking process, incorrect reasoning or content without factual basis may be passed along layer by layer, causing the final result to deviate from the facts. Erroneous reasoning steps may lead the model to wrong conclusions, and users may trust those wrong results more precisely because they can see the reasoning, leading to misguided decisions, economic losses, and infringement of others' rights. It is therefore necessary to further clarify countermeasures against the "hallucination problem" of large models and to effectively flag and mitigate its impact.
(III) Regulatory advantages and new problems of open source models
As an open-source model, DeepSeek fully opens its code, data, and algorithms. Some scholars believe that open-source models may lead the development of large models and bring new business opportunities. From the perspective of social trust, users and developers can analyze open-source models technically, verify their effects and capabilities, develop them to their full potential, and give the public a clear understanding of what the models can do. In terms of technological innovation, open-source models greatly stimulate developers' creativity: both developers and individual users can carry out continuous secondary development around DeepSeek's open-source models and absorb or optimize their techniques and improvements. Openness also improves transparency and trust: technicians can analyze the model's operating logic in depth, discover potential vulnerabilities, and fix them in time. Compared with closed-source models such as ChatGPT, regulators can achieve penetrating supervision of open-source models and oversee their safe and reliable operation.
From a regulatory perspective, although open-source models promote technological progress, they also raise new issues. Open-source licenses contain rules governing model use, but malicious users may violate them, using the model for commercial purposes, competitive development, or unauthorized redistribution. On security responsibility, even if developers such as DeepSeek declare in their user agreements that they bear no responsibility for the consequences of use, a unilateral statement does not exempt them from all legal liability: the developers and service providers of generative AI remain responsible for maintaining its basic security. If the model itself has security vulnerabilities that are maliciously exploited and cause damage, the developer cannot be completely exempted from responsibility.
II. DeepSeek's impact and challenges on supervision
From the perspective of technological development, the application of DeepSeek-related technologies has eased the regulatory dilemma while also bringing new challenges.
(I) Positive impact on the current regulatory dilemma
On the one hand, technologies such as model open-sourcing and chain-of-thought reasoning help alleviate the "algorithm black box" dilemma. Regulations such as the Provisions on the Administration of Algorithm Recommendation for Internet Information Services and the Interim Measures for the Administration of Generative Artificial Intelligence Services mostly adopt advocacy rules rather than mandatory rules on algorithm transparency and explainability. The "algorithm black box" is a difficult point in regulating generative AI; in essence, it is the insufficient disclosure and explanation of generative AI algorithms. Under the impact of algorithmic bias, a crisis of algorithmic trust inevitably arises between the public and generative AI. Yet algorithm model structures are complex and their decision-making processes are hard for humans to understand and explain, so it is difficult for users, regulators, and even developers to analyze and assess the algorithms rationally. Early regulatory thinking therefore sought to improve algorithm transparency. Some scholars have proposed methodological paths such as "attribution explanation" and "counterfactual explanation"; others have proposed regulatory measures such as transparency reports and algorithm testing, as well as observable, decomposable, and simulatable transparency requirements. Still others argue that algorithm transparency has no significant effect: even if technically feasible, it may not achieve the expected results. Technical limitations, theoretical disputes, and merely advocacy-based norms have thus long left regulators facing major difficulties in evaluating model security, reliability, and fairness.
DeepSeek's development shows that, in order to enhance model understanding and user trust, companies have a degree of self-motivation to meet the technical requirements of algorithm explanation and algorithm transparency. Even if done for commercial purposes, this offers a path to alleviating the "algorithm black box" problem and thereby helps improve regulatory technology. The chain-of-thought model used by DeepSeek makes the decision-making process more transparent by simulating human reasoning. When handling a task, the model reasons step by step along the chain of thought, making the basis and logic of each step relatively clear. At the same time, the chain of thought fully displays and records the intermediate steps of reasoning, giving users and regulators more observable and analyzable information, which helps them understand the process from input data to final decision and provides a basis for evaluating the rationality and fairness of model decisions.
On the other hand, the active open-sourcing of large models represented by DeepSeek not only enhances public trust but also gives regulators more convenient and transparent conditions for review. The competition between open-source and closed-source generative AI models has gone on for a long time. Technology giants such as OpenAI, with strong R&D strength and resource advantages, chose closed source to protect technical secrets and commercial interests; models that chose open source could not match the technical performance of OpenAI's closed-source models. DeepSeek's progress may indicate that open-source models have become a trend in generative AI that can no longer be ignored, and this new form has a certain positive impact on regulation.
In a closed-source environment, regulators' review of large models is constrained by information asymmetry and trade secrets: unable to obtain the source code or internal algorithmic details, they can evaluate model performance and compliance only through external observation and testing of generated content. Relevant model tests show that tracing problems in closed-source models faces obvious obstacles. By contrast, open-source models are open, adaptable to different scenarios, friendly to professional users, highly transparent, and compatible, which greatly facilitates both technological innovation and regulatory review. Model open-sourcing covers the model architecture, training data, model parameters, and more. For the architecture, the algorithm can be fully tested to evaluate its rationality, fairness, and security; for training data, regulators can directly check whether data collection, cleaning, labeling, and use comply with legal and technical standards, averting data abuse or privacy infringement in time; for model parameters, the open code can be reviewed directly to understand in detail the model's data processing flow, algorithm implementation, and training process. Open-sourcing also supports continuous supervision: compared with algorithm review and filing, regulators can obtain the latest model developments in time, track the model's evolution, and ensure it meets regulatory requirements throughout its life cycle. More importantly, open-sourcing encourages users and peers to participate in algorithm supervision. Developers or competitors may discover potential problems while using or scrutinizing open-source models and report them to regulators, helping regulators find and solve problems promptly, promoting compliance of open-source models, and reducing legal and technical risks.
(II) New challenges brought by DeepSeek’s technological changes
1. Intensified challenges in privacy protection and data compliance
The data compliance of generative AI mainly covers data sources, data processing, and data storage and transmission. DeepSeek's chain-of-thought and knowledge distillation technologies have brought breakthroughs in model performance, but they have also intensified the challenges of privacy protection and data compliance.
Regarding data sources, when generative AI first emerged some researchers pointed out that data quality and provenance are key factors in its development, and that generative AI may face the prospect of running out of training corpora. As noted above, the distillation technique adopted by DeepSeek is essentially knowledge transfer from a "teacher model" to a "student model." Distillation reduces the cost of data collection and annotation, greatly improving data quality while lowering training costs, but it also faces two compliance risks: first, if the teacher model's training data has legal defects, the student model is indirectly tainted, making it harder to review privacy protection and the legality of data sources; second, the method may give rise to intellectual property disputes.
Regarding data processing, knowledge distillation efficiently transfers knowledge from large models to small ones. To help the student model learn from the teacher model, it may perform feature extraction and transformation on the data, and such massive data processing can increase the risk that de-identified data is re-identified, raising the risk of user privacy leakage. At the same time, chain-of-thought technology involves deep mining and analysis of data during reasoning. That reasoning process is no longer hidden in a "black box" but is displayed and saved in logs, and the reasoning process is itself a process of handling data and information. For example, when DeepSeek is asked to evaluate a public figure, its chain of thought crawls various URLs, some of which may contain unverified or demonstrably false information; likewise, when a user asks it to block harmful information, the chain of thought still displays the harmful content in its reasoning. Greater data exposure thus increases data compliance risk.
In data storage and transmission, generative AI relies on vast amounts of data, usually stored on servers in multiple geographical locations, and ensuring the security and integrity of that data is a major challenge. According to DeepSeek's privacy policy, personal data is stored on DeepSeek's servers in China, but open-source models are prone to data exposure. Some users found that a publicly accessible DeepSeek ClickHouse database allowed visitors full control of database operations and access to internal data, including more than one million lines of log streams containing chat records, keys, backend details, and other highly sensitive information. Although DeepSeek fixed the issue immediately, the risk still deserves attention. In data sharing and cross-border flows, generative AI requires an international perspective, but data protection laws and regulations vary from country to country. In recent years the legal protection of personal information has received growing attention worldwide; the EU, for example, imposes relatively strict requirements, and the General Data Protection Regulation (GDPR) sets strict conditions for data access, storage, modification, deletion, and cross-border transfer. Although countries differ between "free flow" and "cross-border data control," data security is a key limiting principle for cross-border data. China's generative AI institutions can therefore further raise their data compliance standards and thereby enhance their global competitiveness.
2. Disputes over intellectual property
While achieving low-cost, efficient training, knowledge distillation may trigger a series of intellectual property disputes. Its essence is to exploit the training results of the "teacher model." Some argue that using distillation to build directly competing large-model products violates the teacher model's terms of service; others hold that secondary development based on the output of advanced large models is an industry practice and raises no dispute.
In model distillation, the essence is to use the teacher model's training results as input knowledge for efficient use by the student model. Technically, it is difficult to draw a clear line between utilization and infringement. Because the student model learns from the teacher model, some degree of similarity is inevitable; determining at what point that similarity exceeds reasonable use and constitutes infringement requires clearer detection methods and standards at the technical level. Moreover, different model architectures and algorithms behave differently in distillation, further complicating technical standards. Major generative AI systems open their platforms for sharing through open source or API (application programming interface) access, whether for commercial purposes or technological innovation, but in either mode users' use, copying, and modification should be subject to certain restrictions, whether through license agreements or relevant regulations and specifications. If a generative AI system uses another company's model as a "teacher model" without authorization during distillation, or relies excessively on the teacher model's proprietary technology and lacks originality in its use and learning of that knowledge, intellectual property infringement may result.
From a legal perspective, the existing intellectual property system lags in dealing with disputes arising from model distillation. Traditional intellectual property law mainly protects innovations with relatively clear, specific content; for generative AI models, which are opaque, highly complex, and rapidly evolving, the legal definition and scope of protection are unclear. The industry lacks clear normative standards for related disputes, and existing licenses mostly take an evasive stance. For example, the GPL (GNU General Public License) provides that if a software distributor initiates a patent infringement lawsuit against others, accusing them of infringing its patents by using the software, GPLv3 automatically terminates all patent licenses granted to the litigating party. The core of intellectual property is to strike a balance between protecting technology and incentivizing innovation. Distillation affects both the social trust and market competitiveness of the original large model and the innovation and development of AI technology, yet overly strict technical protection may allow monopolists to use intellectual property disputes as a means of preserving their position.
3. The "hallucination problem" is aggravated
The "hallucination problem" of large models is an inherent drawback of all models. The "hallucination problem" refers to the content generated by the model seems accurate, but in fact it is fabricated or lacks data support. The "hallucination problem" of the model seriously affects the reliability of the model output content, and thus affects social trust. With the application of DeepSeek model distillation technology and thinking chain technology, although the logical reasoning ability has been improved, the "hallucination problem" has gradually become prominent and aggravated. For example, in actual use, some users found that the DeepSeek large model outputs false and fabricated files, page numbers, professional terms, etc. in the thinking chain deduction process and the final result, and the model expression content, logical deduction implementation and details are too perfect, and the knowledge reserve of ordinary users is difficult to correct. Even if the user points out the specific error, the model will still "fabricate" the relevant content.
In the training stage, the hallucination problem is mainly affected by training data quality. If the training data contains much inaccurate or incomplete information, the model tends to absorb it during learning and reproduce errors when generating content. As noted above, the knowledge transfer of distillation amplifies data defects even as it improves data quality: data trained through distillation may be flawed or erroneous, and repeated use pushes final outputs toward extremes, with high-quality data yielding accurate, rigorous content and flawed data amplifying the hallucination problem. At the same time, the downside of chain-of-thought technology's improvement of reasoning ability is that it leads ordinary users to place more trust in the output. The chain of thought is rigorous reasoning carried out by the model; in the course of that reasoning it progressively processes data that may be biased or incomplete, and the more complex the reasoning steps, the higher the probability of introducing erroneous information. When dealing with problems in specialized fields, the model may reach wrong conclusions for lack of sufficiently accurate data or because of inaccurately captured information.
The aggravation of DeepSeek's hallucination problem reminds us that although generative AI developers have a certain self-motivation to improve the interpretability and transparency of their algorithms, the hallucination problem may be amplified in step with the enhancement of model capabilities. Some scholars have proposed improving AI accuracy by building an accurate "external" private database with retrieval-augmented generation (RAG), but this approach imposes a heavy data load and is better suited to personalized model designs than to base models. It is therefore urgent to explore reasonable regulatory measures that encourage generative AI providers to actively address the hallucination problem and to prevent unpredictable losses caused by users' excessive trust in its output.
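For reference, the following is a minimal sketch of the retrieval-augmented generation (RAG) pattern mentioned above, assuming hypothetical embed() and generate() functions; it illustrates only the retrieve-then-augment idea of grounding answers in a private database.

```python
# Minimal RAG sketch; embed() and generate() are hypothetical placeholders
# for an embedding model and a language model call.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Rank documents in the private database by similarity to the query.
    scores = [cosine(query_vec, v) for v in doc_vecs]
    top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]
    return [docs[i] for i in top]

def answer_with_rag(question, docs, doc_vecs, embed, generate):
    # Prepend retrieved passages so the model can ground its answer in
    # verified material instead of relying only on parametric memory.
    context = "\n".join(retrieve(embed(question), doc_vecs, docs))
    prompt = f"Answer using only the context below.\nContext:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```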
4. Model security issues
After the DeepSeek model was released, it suffered large-scale cyberattacks from overseas that directly affected registration, access, and normal use of the system. This was a rare large-scale cross-border attack since the advent of generative AI models, and it exposed a series of typical network security issues, including sophisticated attack methods, cross-border attacks, gaps in security protection technology, insufficient emergency response capacity, data security risks, and supply chain security risks. These problems not only affected DeepSeek's normal operation but also remind us to pay closer attention to the security of large models.
Generative AI large models are inherently vulnerable in security terms. Researchers classify the attacks they may suffer as model theft, data reconstruction, membership inference, data poisoning, prompt injection, indirect prompt injection, model hijacking, sponge examples, and the like. Model architectures and algorithm designs have inherent defects: models are sensitive to small perturbations in input data, and attackers can use various means to break through their defenses. The main current response is to enhance model robustness, that is, the ability of a model or algorithm to keep operating stably and performing well in the face of uncertainty, interference, anomalies, or adverse conditions. As large models move from basic vertical deployments to "AI+" models combined with highly specialized fields such as medicine, finance, and law, the scope and severity of security issues in these key areas are increasing. In commercial applications, although technological innovation brings efficiency and convenience, ensuring the security of new technology is always the top priority of any commercial organization; when new technologies are applied to existing business models, their security and data privacy protections must be fully evaluated.
III. Key Directions for Regulation of Generative AI
One of the difficulties in regulating generative AI is that regulatory measures must constantly adapt to rapidly developing technology. Some scholars argue that privacy protection, model security, technical standards, and open-source licenses should be governed by "soft law," such as flexible and diverse, cooperatively tested, and cross-border industry standards. Regulators should cooperate broadly with enterprises, industry organizations, and enterprise alliances, and combine "soft law" with "hard law" through various mechanisms to regulate the development of generative AI.
(I) Strengthening Privacy Protection and Data Security Supervision
Clarifying the scope of use of personal data and the remedies available when it is infringed, and balancing personal data protection against the development of the digital economy, are central to adapting to today's digital economy and should be one of the key directions for regulating generative AI.
In the data collection stage, the regulation of generative AI should establish relatively clear norms, requiring that the scope and purpose of data collection be explicit and that collection be necessary and lawful. When collecting user data, the principle of data minimization should be followed. Numerous data breach incidents show that generative AI providers should improve user consent mechanisms so that users clearly understand what data is being collected. DeepSeek's privacy policy states that "the input content collected by the service and the corresponding output content will be used to improve and optimize the quality of DeepSeek services, provided that they are processed by secure encryption technology, strictly de-identified, and cannot be re-identified to specific individuals." Such privacy clauses are common in current large-model service agreements and seem to have become a business model of "exchanging services for data." The regulation of generative AI does not absolutely prohibit this business model, but the right to choose should rest with users.
In the data processing stage, data should be classified and graded according to factors such as sensitivity, and differential privacy techniques should be used to apply different protection measures and usage restrictions to data at different levels. In view of the synthetic data produced by data distillation, new rights and new rules for data processing in machine learning scenarios should be created, and a system governing the use of synthetic data should be prescribed. As synthetic data and large volumes of data are processed and analyzed, the risk that de-identified data becomes re-identifiable during model processing keeps rising. While using massive data, developers should, on the basis of the data classification and grading protection system, standardize the technologies used at every stage of the data life cycle: setting sensitive terms for private data, using algorithms and data analysis to dynamically monitor and provide feedback on model operation and processing (using algorithms to govern problems caused by algorithms), and promptly identifying and properly handling private and sensitive data that appears during model operation.
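As one illustration of such dynamic monitoring of private and sensitive data, the sketch below flags and redacts a few common sensitive patterns in model inputs or outputs; the patterns are illustrative assumptions and do not constitute a complete classification-and-grading scheme.

```python
# Rule-based monitoring sketch for privacy-sensitive strings; the patterns
# below are illustrative assumptions, not an exhaustive compliance scheme.
import re

SENSITIVE_PATTERNS = {
    "cn_mobile": re.compile(r"\b1[3-9]\d{9}\b"),      # mainland mobile number
    "cn_id_card": re.compile(r"\b\d{17}[\dXx]\b"),    # 18-digit resident ID
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan_for_sensitive(text):
    """Return the names of matched patterns so higher-grade data can be
    masked or routed for stricter handling before storage or display."""
    return [name for name, pat in SENSITIVE_PATTERNS.items() if pat.search(text)]

def redact(text):
    # Replace every match with a placeholder before the text is logged or shown.
    for pat in SENSITIVE_PATTERNS.values():
        text = pat.sub("[REDACTED]", text)
    return text
```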
In the data storage and transmission stage, developers should be encouraged to adopt encrypted, secure storage architectures. Sensitive data can be encrypted and stored under schemes such as data classification to prevent theft or tampering during storage, and secure transmission protocols can be used to ensure the confidentiality and integrity of data in transit. On cross-border data flows, differences in technological development have produced divergent laws and positions among countries. For generative AI to be globally competitive, developers and service providers must offer large models that meet the requirements of different countries. China therefore needs to further regulate cross-border data transmission and processing, formulate a data strategy consistent with its sovereignty, security, and development interests, conduct data security assessments of data recipients and processors, enhance data security protection capabilities, and safeguard national data security.
(II) Balance between technological innovation and technological protection
Knowledge distillation and model open-sourcing can both trigger intellectual property disputes, and the two are technically intertwined: the object of knowledge distillation is usually an open-source model. In the Internet era some scholars have advocated weakening intellectual property rights and sharing benefits, but excessive weakening clearly undermines right holders' motivation to keep innovating. In response to the resulting disputes, distillation and open-source behavior need to be regulated, the risks of such disputes addressed reasonably, and the conflict between technological innovation and technological protection resolved.
First, the nature of the conduct must be defined precisely, distinguishing "technical improvement" from "infringing copying." If the student model is trained only on the teacher model's outputs (such as prediction data), the question is whether this exceeds the permitted scope of use and protection of the original model; if the student model learns only the teacher model's general knowledge and public techniques and independently improves and innovates on the key points, it does not constitute infringement; otherwise it does. For open-source models, open-sourcing is not a waiver of rights but a new copyright licensing model protected by a license, and the nature of the conduct should be judged under the constraints of that license; in essence it is the open-sourcer's disposition of rights and compliance with open-source community norms.
Second, a special review mechanism for large models should be established. Regulators may require disclosure of the key technical processes of a distilled model, such as data sources and alignment methods. At the same time, the burden of proof can be shifted toward the party with the technical advantage, requiring the accused party to prove the independence of its model. For open-source models, "command-and-control" supervision should give way to agile governance, using institutional mechanisms and resource allocation to regulate model use while encouraging innovation, so as to prevent abuse and malicious use.
Third, regulators should establish cooperation mechanisms with the open-source community. Whether a model is distilled or open-sourced, it depends to some extent on the community's open-source licenses, which are contracts but also rest on the traditional intellectual property system. Regulators can guide the community, through technical standards, science and technology ethics, and industry norms, to establish and dynamically update intellectual property agreements suited to current technological trends; regulate open-sourcing and distillation through "soft law" such as cooperative, inclusive, and flexible self-regulatory and autonomous norms; provide channels for opening intellectual property; and set a bottom line for the development and use of new technologies.
(III) Improve model prompting and feedback mechanisms
Although researchers have tried various methods to alleviate the hallucination problem of large models, DeepSeek's tests show that the stronger the model, the more convincing its "hallucinations" become. DeepSeek states clearly in its user agreement that "all outputs provided by this service are answered by the artificial intelligence model, which may contain errors or omissions. They are for your reference only and you should not use the output content as professional advice." But such standard-form terms buried in the service agreement should not by themselves become a unilateral basis for exemption. As the technology develops, service providers should improve the model's safety prompts and feedback mechanisms, regulating their own conduct while warning users of risks, so as to provide a basis for allocating responsibility.
From a regulatory perspective, while encouraging technical improvement, generative AI service providers should be required to adopt adequate prompting and feedback mechanisms to mitigate risk. Although the duty of care of generative AI service providers does not currently include a general obligation to review generated content, it should include an obligation to prominently flag content suspected of infringement. Such flags should not appear only in standard-form terms such as service agreements; prominent prompts should accompany answers in professional fields such as finance, medicine, and law. On feedback, regulators should require service providers to establish and improve user feedback channels and to promptly collect the problems and anomalies users find when using the model. At present, under the DeepSeek user agreement, the channel for reporting infringing, illegal, or false information is email, which cannot provide targeted feedback when users encounter reasoning errors, hallucinations, or false information while using the model. Feedback channels should be convenient, efficient, and targeted; the industry should be encouraged to formulate standardized processes and offer multiple channels so that users can easily submit feedback. A model performance monitoring system can also be established to publish the feedback, response, and resolution rates for hallucination problems, driving developers to improve the technology.
In dealing with "hallucination problems", regulatory authorities should strengthen the prevention function of the law rather than the sanction function. At present, it is not appropriate to require generative artificial intelligence service providers to bear full responsibility for the authenticity and accuracy of the generated content, but generative artificial intelligence service providers should assume the responsibility of the operating entity, fulfill clear obligations such as content control and traceability marking, and take effective response measures to ensure that the problem can be traced and improved in a timely manner. For example, the instruction prompts of the input model, the output content of the model, and the key steps in the decision-making process should be recorded, and data integrity and traceability should be ensured through storage methods such as logs. When problems occur in the model, the generative AI service provider should ensure that the problem can be quickly located through the traceability system, and take targeted improvement measures to improve the reliability of the model.
(IV) Model security supervision
DeepSeek's server data leak and the large-scale attacks against it show that model security is not only an enterprise problem; it also concerns users' personal privacy, social interests, and national security. Model security, data protection, and resistance to attack should therefore become one of the key directions of generative AI regulation in the future.
On the one hand, model security standards should be formulated. Generative AI governance standards can play an important role in "implementing legislation and supervision while connecting to technical practice." The industry should be encouraged to formulate security technical standards applicable to generative AI, covering data de-identification, adversarial robustness, database access, and security vulnerability detection and repair, to comprehensively evaluate model security and submit test reports to regulators. Developers can, with reference to OpenAI's red-teaming practice, conduct their own adversarial tests such as functional testing, performance testing, and vulnerability testing, and should be encouraged to make the results public. Such standards and requirements should, however, be "soft standards" rather than hard requirements: in an era of rapid technological development, overly stringent security standards should not be imposed at excessive compliance cost to enterprises; instead, science and technology ethics should be the basic principle and risk management the basic system. Through a security certification mechanism, models meeting security standards can be granted administrative licenses for application in fields such as medicine and finance, which both improves model security and reliability and helps expand the model's fields of application.
On the other hand, the detection of, early warning about, and contingency plans for model security should be a regulatory focus, requiring developers to use technical means such as data monitoring to track, in real time, the security status of data in model storage and operation. A data breach early-warning mechanism can be established so that abnormal attacks, data leaks, or security threats are responded to and reported promptly, improving the ability to detect and respond to data security risks. At the same time, regulators should conduct regular inspections and tests, bring in third-party assessment agencies for professional technical issues such as security performance, take administrative measures such as risk warnings or rectification deadlines against models with serious security hazards, and publish test results when necessary.
IV. Conclusion and Outlook
DeepSeek's technical applications have attracted wide attention and emulation in generative AI research and development. Shortly after DeepSeek's release, one team released a new model with technical indicators comparable to DeepSeek-R1 at a training cost of less than US$50. While technological development drives rapid growth in the industry, it also brings new challenges to the regulation of generative AI. DeepSeek's distillation technology, chain-of-thought model, and open-source strategy have played an important role in improving model performance, expanding application scenarios, and promoting technological innovation. Although DeepSeek's technical applications have to some extent eased the "algorithm black box" dilemma and facilitated regulatory review, they have also brought new challenges to privacy protection and data compliance, intellectual property, the "hallucination problem," and model security.
In the future, AI technology will keep developing; it will drive a scientific and technological revolution while also bringing risks and challenges, and the importance of regulation will only grow. We must therefore look ahead and anticipate the development of generative AI: further standardize data collection, processing, storage, and transmission, effectively protect personal privacy, and strengthen the data compliance system; for technologies such as distillation and open-sourcing, encourage the application of "soft law" such as industry and technical standards, precisely define the nature of conduct, and balance technological innovation with technological protection; and establish effective model prompting, labeling, feedback, and traceability mechanisms to reduce the hallucination risk of large models. As the technology matures, more stringent model security standards can be formulated to guard against technical risks and protect public interests and social security while giving full play to the advantages of AI. Regulatory research on future technologies should also take an international perspective, strengthening global coordination and cooperation in AI governance, promoting a Chinese approach to generative AI regulation within the contemporary international order, and responding to the global challenges of AI development.
Source | Journal of Xinjiang Normal University (Philosophy and Social Sciences Edition) first published online