Source: AI Technology Review
What changes will happen to China's big model startup circle after the DeepSeek storm?
In recent days, the author has spoken with many industry insiders and found that China's big model circle is currently split between two extremes: one extremely enthusiastic, the other extremely frosty.
The former is represented by computing power manufacturers and model service providers who actively embrace the DeepSeek ecosystem, as well as open source beneficiaries who were previously unable to take part in the big model "arms race". The latter is mainly represented by other Chinese big model startups (commonly known as the "Big Model Six Tigers") and the VCs that have invested in them over the past two years. The result is a situation of "ice and fire".
It is understood that some VC teams that invested in large model companies with first-tier valuations over the past two years are preparing for, or already undergoing, internal grilling. The main lines of questioning are as follows:
"Why can DeepSeek train such a powerful model at such a low cost, while the large model companies we invested in have raised billions of dollars but cannot do it?"
"The essence of DeepSeek's success this time is that its technology is innovative and powerful enough. XXX doesn't even have a basic large model technology team, so why should we invest in it?"
"XXX also has a very strong talent team, and has the experience and ambition to train base large models. Why didn't it become DeepSeek? What supports such a high valuation?"
"Now that DeepSeek has emerged, who will still invest in the Six Tigers of big models? Which of them has any hope of going public? If none, should we push for a buyback or exit next?"
……
“Why didn’t it become DeepSeek” and “Why is there only one DeepSeek in China” are questions that almost all big model practitioners and VCs have been asking since the Spring Festival. These two questions can almost cover all the anxiety about big model innovation in China at present. Only by seriously discussing these two questions can we answer another more important question: How to become DeepSeek?
By comparing AI innovation in China and the United States, we have tried to convey one message to the industry: Chinese AI needs national pride. In this article, we hope to draw on the development history of China's big models over the past four years to explore further:
Does China lack technical idealists like DeepSeek?
If China does not lack them, have such technical teams been fully discovered and given corresponding systemic support from society? If not, why?
As an industry account that has followed big models since GPT-3 broke out in 2020, we do not intend to answer such a broad and profound question in this article; we only present, from a third-party perspective, some facts and opinions that may be relevant.
1 Systematic dislocation
Before 2023, there were only four large model companies in China: Zhipu, Mianbi, Shenyan and Lingxin (later acquired by Zhipu), all of them from Tsinghua University. After 2023, the number of large model startups grew to more than a dozen. From a technical point of view, the direct cause was the open-sourcing of Llama, but the most fundamental reason was what everyone believed at the time:
The technical threshold of large models is high, but they are not impossible to imitate. Building on existing open source large models lowers the technical difficulty further, and the argument that "technology cannot constitute a commercial barrier" was rampant.
Under this "rule" of collective consensus, reviewing several developments in China's large model entrepreneurship since the ChatGPT explosion in 2023 makes the current abnormal state of the field easier to understand:
First, because the market's awe of technological innovation had weakened, after the ChatGPT explosion in 2023 only Zhipu, among the first batch of China's large model technology pathfinders, became the darling of capital, breaking through the 20 billion yuan valuation mark and entering the first echelon of large models. (Moonshot AI, known in Chinese as "The Dark Side of the Moon", was founded in 2023, so it is not included.)
The other two startups that came out of the Tsinghua Natural Language Processing Laboratory (THUNLP), Mianbi and Shenyan, had far less voice in the capital market than the new players that came later.
This is especially true of Mianbi Intelligence (Shenyan chose to focus on products). Mianbi was the first company in China to propose building a "civilian version of the large model", the company whose technical vision and innovation direction most resemble DeepSeek's, and it was even founded earlier than DeepSeek. Yet it was not until the end of 2024 that it closed a RMB 300 million financing round, at a valuation of less than RMB 3.5 billion, far from the RMB 20 billion threshold of the first echelon.
According to conversations between Leifeng.com AI Technology Review and more than 50 large model investors over the past two years, there are several main reasons why Zhipu and Mianbi, which both originated from Tsinghua University and share the same technological first-mover advantage and outstanding young technical talent, have fared so differently:
First, among the Tsinghua academic teams pursuing base models, VCs bet on only one company, because "they have reservations about professors starting their own businesses". Second, Zhipu's vision was easier to understand: when it said "benchmarking OpenAI" in its early external financing, VCs understood immediately. Mianbi, however, emphasized optimizing the efficiency of underlying model training from the beginning, so in 2023, when hot money was most plentiful, it was considered an "AI Infra" company similar to Luchen and Silicon Base.
Because Mianbi Intelligence did not raise much money in 2023, it could not invest in large base models. Training a large base model such as DeepSeek V3 is what demonstrates the importance of efficient training most directly; in 2024 Mianbi could only pursue small on-device models, which endorse "efficient training" far less convincingly than DeepSeek V3 does.
When raising funds in 2022 and 2023 under the banner of "efficient training", Mianbi was turned down by nearly all the VCs it approached.
Second, again against the backdrop of a general lack of awe for technology: after the wave of large models arrived in 2023, China's AI technology VCs did not actually settle down to study AGI technology. Instead, in order to get a seat at the table quickly, they put money into "serial successful entrepreneurs who have won battles", even if those teams had no prior experience in large model research and development.
Among them, the most typical representatives are Wang Huiwen's Light Years Away and Wang Xiaochuan's Baichuan Intelligence.
Among the big model companies currently valued at more than 20 billion yuan, only a few founders, such as Zhipu's Tang Jie and Yang Zhilin of Moonshot AI (The Dark Side of the Moon), began exploring big model technology in 2020, before big models broke into the mainstream. Most of the teams at Baichuan Intelligence, MiniMax and Step Star did not start until after 2023.
For example, Yan Junjie, the founder of MiniMax, comes from computer vision, whereas big models initially addressed language intelligence (multimodality is another chapter). However, MiniMax first won capital's favor through its breakout product Glow rather than its underlying big model technology, so it belongs to a different dimension, and people close to Yan Junjie all describe him as "very technically oriented."
DeepSeek's R&D team also learned big model technology from scratch, studying papers and running experiments, so nothing suggests that a team that had never trained a big model could not make up its technical shortfall through hard study after 2023. But judging from the industry's development over the past two years, Baichuan Intelligence has not upgraded its base model frequently, and its focus has shifted to big models for the medical industry.
Because it does not train video models and similar large models, Baichuan's R&D costs are lower than those of other companies and its cash flow is abundant. But this benefits only Baichuan and does not contribute to the development of the big model industry as a whole.
When resources are limited, if teams without technical capabilities occupy a large share of capital while teams with technical capabilities receive very little, this systematic dislocation of money and talent is destined to produce only regrets and no future.
If AGI big model technology really had no room left to grow and each company's technical barriers had gradually leveled out, then the Internet era's strategy of competing on resources and capital might still win the last piece of the pie. However, entrepreneurs who hold technology in awe keep a clear head: they can still see the shortcomings of existing big model algorithms and architectures in training and inference, and they know that AGI still has many specific, difficult problems to solve.
In other words, the ability to keep innovating on underlying technology is still the moat of big model companies, and the Internet methodology of pure resource competition does not apply to the current development of big models in China. But these words are unlikely to be accepted by most Chinese technology VCs, because big model investment in 2023 and 2024 even featured "club deal" games...
Over the past two years of big model development, a VC unwilling to learn the technology may have done more damage than an R&D team unwilling to learn it.
The bubble period will eventually end. After the tide recedes, it will be clear who is swimming naked.
2 AGI is hard to come by
Another effect of the market's lack of awe for technology is that, in order to cater to the market (and of course to break out of the encirclement of large companies), China's large model startups have over the past two years shifted their focus from long-term AGI to short-term revenue and product polishing.
This change in strategy also stems from the industry misjudgment described above, namely that large models have no room left for innovation. Entrepreneurs determined to pursue AGI must attend to both business and technology, while teams that are skeptical of AGI, or thoroughly confused by market voices, either give up pre-training, turn to consumer applications, or simply fine-tune industry models on top of open source models.
It took two and a half years to get from GPT-3 to ChatGPT, yet the market generally imposed a "rule": domestic large models should go from base model to commercialization in only two years. Although some large model companies manage to pursue "L2" and "L4" in parallel, no company can match DeepSeek's purity in its investment of talent and research resources.
When the financing war had just begun in the first half of 2023, one analysis in the industry was that, after the "baptism" of the previous generation of AI companies, China's VCs had shortened their patience for commercialization at large model companies from five or even eight years to within three. This may be the general dilemma of China's large model companies.
It is well known that DeepSeek focuses on AGI research, relies on the existing reserves of Liang Wenfeng and Huanfang Quantitative, and has not raised outside funding. "I have money, so I don't need to listen to the outside world; I can do whatever I want." This is what many large model companies envy about DeepSeek.
Recently Zhu Xiaohu, who had previously criticized AGI, changed his tune and said that, because of DeepSeek, he is now willing to invest in AGI companies. It can be said that DeepSeek changed VCs' views with its technical strength. But a crueler reality is that a large number of teams with strong innovation capabilities may fall on the eve of the new era because they cannot raise money.
"Commercial thinking" shows up not only among some technology VCs, but also in the career choices of R&D talent.
According to feedback from headhunters, the company that spent most lavishly on talent in China in 2024 was undoubtedly ByteDance. A divide between large companies and startup teams has formed, and over the past year it has become common for large model talent to move from startups to large companies. For example, according to AI Technology Review, outstanding candidates in NLP, multimodal and reinforcement learning whom DeepSeek had picked for its AGI work chose ByteDance when deciding between the two.
According to headhunters who served DeepSeek in the early days, DeepSeek also hoped to recruit top talent from overseas teams such as Google, Meta, and OpenAI, but progress was not smooth, so it had to settle for second best and train its own talent.
Beyond money, investing in AGI also requires people: a group of absolute technical idealists, along with an excellent organizational culture. DeepSeek's success may not be replicable, but from V2 and V3 to R1 and R1-Zero, its technical results reflect its advantages in funding, talent and ideals, and organizational culture.
Before DeepSeek, the saying "Jiukun in the north, Huanfang in the south" was already well known in quantitative finance, as were the quantitative industry's high requirements for technical talent: the benchmark is basically graduates of the top two universities and gold medalists in informatics competitions. Teams are often small but extremely capable. According to AI Technology Review, DeepSeek's team in the first half of 2024 numbered only a little over 40 people, most of them top-two-university technical experts from the original Huanfang.
Continuing Huanfang's style, DeepSeek's recruitment bar has always been very high. For example, it began looking for technical experts in multimodal and reinforcement learning in mid-2024, but after more than half a year of recruiting the positions were still vacant; it would rather leave a post empty than fill it with the wrong person. After R1 became popular, the number of resumes submitted rose dramatically, but according to people familiar with the matter, "there are not many suitable ones."
DeepSeek's internal organization is also very flat. According to AI Technology Review, there is only one boss across both Beijing and Hangzhou: Liang Wenfeng, the founder of DeepSeek. "Below Liang Wenfeng, basically everyone is a rank-and-file worker."
In addition, Liang Wenfeng's personal style is also very obvious: he has a strong belief in technology, is full of curiosity and thirst for knowledge about AGI, and is very hardworking. People familiar with Liang Wenfeng described him as "speaking very, very slowly, thinking for a long time before expressing each sentence, and expressing himself very concisely. Although concise, his words often hit the nail on the head."
The team culture of DeepSeek is very similar to that of companies like Yushu and Momenta: the people at the top are all technology enthusiasts, with a natural awe of and curiosity about technology; at the same time, decision-making is clearly centralized while the culture is flat, so when technology exploration runs into difficulties, resources can be coordinated from the top down and information passes quickly between the top and the front line.
Yushu and DeepSeek also apply their own standards when recruiting, which differ greatly from the formulaic interview routines common in the market. Interested readers can look into them further.
DeepSeek's Liang Wenfeng began exploring how to train stronger models at lower cost very early, when the industry generally did not understand it. Similarly, Wang Xingxing of Yushu started making quadruped robot dogs when people still did not understand robot dogs, and Cao Xudong of Momenta started working on L2 and L4 at the same time when the autonomous driving industry was generally fixated on L4.
Entrepreneurial teams that dare to go against the mainstream need a strong rebellious spirit. In AI Technology Review's exchanges with many investors, this kind of "rebellion" is easily put down to youth, but in my opinion the confidence to rebel ultimately comes from a team's understanding of, judgment about, and technical confidence in the problems it wants to solve; that is, a firm belief that its direction is the future and will bring huge value.
3 Taste of Innovation
After V2 set off a price war, Liang Wenfeng commented on this technological achievement in an interview with "Undercurrent": "Among the many innovations that happen every day in the United States, this is a very ordinary one."
After V3 and R1, Liang Wenfeng has not yet commented publicly, but for DeepSeek and Liang Wenfeng, until AGI is fully realized, perhaps the innovations of V3 and R1 are also just "very ordinary ones." This is not to deny the breakthroughs and merits of the latter two, but to highlight that teams with high aspirations often score a 100-point result as 80 and keep pursuing extra points.
After R1 was released, a senior reinforcement learning scholar in the industry told AI Technology Review: "After replacing the RL+SFT paradigm with pure RL algorithms, I think AGI will be realized in three years at the latest."
Sam Altman said that AI will surpass humans in 2025, and Musk also said that AGI can be realized in 2026 at the latest. Amid the various predictions of "AGI time points", although it is hard for us to judge when it will happen, we can feel that this major trend is under way.
The trend is clear, and DeepSeek's example has made everyone realize at least two facts: first, AGI technology has not hit its ceiling; second, China's technology teams are capable of innovations that lead the world in AGI. Rather than basking in DeepSeek's victory, it is more important to ask how to advance China's AGI next.
In the past half month, the DeepSeek storm has changed how large companies, startups, computing power manufacturers, investors and others perceive AGI development. Some elephant-in-the-room problems that were ignored in the past have received renewed attention, and some old views have been overturned. But one change is consistent: everyone realizes that at this stage, achieving AGI still requires idealism.
Compared to guessing what OpenAI or DeepSeek will do next, it is more important to infer what technical problems AGI needs to solve. In other words, innovation is more important than imitation.
In fact, according to AI Technology Review's interviews over the past year, besides DeepSeek there are many AI researchers in China who keep insisting on innovation and keep proposing new solutions to unresolved problems. To name just a few:
Professor Ma Yi, Dean of the Institute of Computing and Data Science at the University of Hong Kong, has emphasized in the past two years that the large models currently trained by high computing power have knowledge, not intelligence. Different from the black box characteristics of deep learning, Ma Yi's team has been committed to studying explainable and controllable artificial intelligence algorithms and frameworks (white box theory).
At CNCC 2024, Zhipu's Tang Jie discussed the next stage of multimodal technology. The Zhipu team has been exploring multimodal large models since 2021. According to the team, in its early exploration it encountered the following problem: when multimodal data such as text, images, voice and video are fed into training at the same time, data from one modality seems to weaken the knowledge or intelligence of another. Although multimodality is a trend, there is still a lot of room for research on how to optimize cross-modal data alignment, collect high-quality data, and enhance the common sense and reasoning ability of multimodal models.
According to conversations with several founding members of the Mianbi team in March 2024, the current mainstream large model architecture cannot solve several key problems well, which makes it hard to approach AGI; examples include learning from experience and spatial memory. People, for instance, become more proficient by practicing one thing many times, or quickly get familiar with a new environment and transfer what they learned on another problem to it. Such abilities are not easy to express with the current Transformer.
With the development of embodied intelligence, AGI will naturally divide into cloud AGI and edge AGI. Edge AGI refers to a model that can naturally perceive the environment, perform high-level reasoning, and make complex multi-step decisions based on that reasoning. The currently popular "big brain" and "small brain" designs for embodied intelligence are developing along this trend, and many problems remain to be solved in this direction. Solving them requires not only resources but also strong technical strength and technical vision.
After the release of o1, much research in the large model field began moving toward reasoning. Meanwhile, according to rumors, Google's Gemini team has recently completed a new generation of base models and opened testing to a small number of users.
Although Google's stock price plummeted in 2023 after it was beaten by OpenAI, a look at Google's big model technology from June 2020 to 2022 shows that its approach is to build a full system from the bottom up, from underlying computing power and architecture to upper-level algorithms. This may also be an important reason why Google Gemini was able to gain momentum later.
The same is true for DeepSeek. According to DeepSeek's technical disclosures, its approach to big model research is likewise to work upward from the underlying ten-thousand-GPU cluster and HAI framework, building an interlocking technical system.
Only by staying wary of authority, always working backwards from the essence of the problem, and innovating with conviction can a team lead the trend. Short-term quick money may flow to the lucky, but long-term resources should flow to the teams that make the best use of them.
I hope that in 2025, there will no longer be only one DeepSeek in China.