Why DeepSeek is different, in three charts

DeepSeek, a Chinese AI company that was little known until recently, has become the buzz of the tech industry after releasing a series of large language models that rival many of the best AI models in the world.

On January 20, DeepSeek unveiled R1, its most talked-about large language model. In recent days, the AI assistant rose to the top of the Apple App Store, pushing OpenAI’s long-dominant ChatGPT to the number two spot.

Silicon Valley is in a frenzy over the model’s unexpected rise and its ability to outperform top U.S. models on a variety of benchmarks, especially since the Chinese company boasts that it was produced at a fraction of the cost.

The shock in U.S. tech circles has sparked a reckoning in the industry, suggesting that perhaps AI developers don’t need enormous sums of money and resources to improve their models. Instead, researchers are beginning to realize that it may be possible to make these processes more cost-effective and energy-efficient without sacrificing capability.

R1 followed the release of DeepSeek’s previous model, V3, in late December. Then, on Monday, DeepSeek unveiled Janus-Pro-7B, another powerful AI model, this one multimodal, meaning it can digest different kinds of media beyond text.

Here is what sets DeepSeek’s large language models apart.


Size

DeepSeek is outperforming its competitors with a large, capable model that runs just as well on far fewer resources, despite being created by a smaller team with significantly less money than the leading American tech companies.

The reason is that the model uses a mixture-of-experts architecture, which breaks the big model down into many smaller submodels, or “experts,” each specialized in handling a particular kind of data or task. Unlike the conventional approach, which employs every part of the model for every input, each expert is activated only when its specific knowledge is relevant.


According to a technical paper its developers released, V3 contains 671 billion parameters (the internal settings the model updates as it learns) but activates only 37 billion of them for any given token.
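
To make the routing idea concrete, here is a minimal sketch of top-k expert routing in Python. The expert count, dimensions, and gating math are illustrative stand-ins, not DeepSeek’s actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8  # total experts in the layer (illustrative)
TOP_K = 2        # experts activated per token (illustrative)
DIM = 16         # hidden dimension (illustrative)

# Each "expert" is a small feed-forward weight matrix; the router scores
# how well each expert matches the incoming token.
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((DIM, NUM_EXPERTS))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a token to its top-k experts and mix their outputs."""
    scores = x @ router                # affinity of the token to each expert
    top = np.argsort(scores)[-TOP_K:]  # keep only the k best matches
    weights = np.exp(scores[top])
    weights /= weights.sum()           # normalized gate weights
    # Only the selected experts run; every other expert's parameters sit
    # idle, which is why activated parameters are far fewer than the total.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(DIM)
print(moe_forward(token).shape)  # (16,)
```

In a model like V3, the same principle scales up: the total parameter count covers every expert, while the activated count covers only the few experts the router selects per token.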

Instead of using a standard penalty-based approach to load balancing, which can hurt performance, the company also devised its own load-balancing system that makes more dynamic adjustments to ensure no expert is overloaded or underused.
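
The following toy sketch shows the general shape of such a bias-adjustment rule: experts that handled more than their share of tokens get their routing bias nudged down, underused experts get nudged up, and no penalty term is added to the training loss. The update rule and constants here are simplified assumptions, not DeepSeek’s implementation.

```python
import numpy as np

NUM_EXPERTS = 8
GAMMA = 0.01  # bias step size (a hypothetical value)

def update_bias(bias: np.ndarray, expert_loads: np.ndarray) -> np.ndarray:
    """Nudge per-expert routing biases toward a balanced load."""
    mean_load = expert_loads.mean()
    # Overloaded experts are biased down (picked less often next batch);
    # underloaded experts are biased up. The loss function is untouched.
    return bias - GAMMA * np.sign(expert_loads - mean_load)

bias = np.zeros(NUM_EXPERTS)
loads = np.array([120, 95, 80, 130, 60, 110, 90, 115], dtype=float)
bias = update_bias(bias, loads)
print(bias)  # busy experts drift negative, idle experts drift positive
```

The adjusted bias is added to each expert’s routing score before the top-k selection, quietly steering new tokens toward idle experts.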

Together, these techniques allow DeepSeek to maintain a large roster of experts, and keep adding more, without slowing the whole system down.

Additionally, the model employs a method known as inference-time compute scaling, which lets it dial its computational effort up or down depending on the task at hand, rather than always running at full capacity. Asking a simple question, for instance, may turn only a few of the model’s proverbial gears, whereas a request for a more intricate analysis might draw on the full model.
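
As a rough illustration of the idea (the heuristic and step counts below are invented for this example, not DeepSeek’s actual mechanism), a dispatcher in this spirit might assign a reasoning budget like this:

```python
def reasoning_budget(prompt: str) -> int:
    """Assign more inference steps to prompts that look more complex."""
    hard_markers = ("prove", "derive", "analyze", "step by step")
    if any(marker in prompt.lower() for marker in hard_markers):
        return 64  # a long reasoning pass for complex analysis
    return 4       # a short pass suffices for a simple question

print(reasoning_budget("What is the capital of France?"))    # 4
print(reasoning_budget("Analyze this dataset for trends."))  # 64
```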

Combined, these methods make running such a big model far more efficient than was previously possible.


Training cost

Because of DeepSeek’s design, its models can be trained more quickly and at a lower cost than those of its rivals.

DeepSeek claims that V3, the base model from which R1 was developed, took less than $6 million and only two months to build, even as top U.S. tech companies continue to spend billions of dollars a year on AI. DeepSeek was also forced to build its models with Nvidia’s less powerful H800 chips, because U.S. export restrictions cut off its access to the best AI chips, including Nvidia’s H100s.


One of the company’s biggest innovations is its mixed-precision framework, which combines low-precision 8-bit floating point numbers (FP8) with full-precision 32-bit floating point numbers (FP32). The former are quicker to process and require less memory, but they can also be less accurate.

Rather than relying on one format alone, DeepSeek uses FP8 for the majority of its calculations and switches to FP32 for a few critical operations where accuracy is paramount, saving memory, time, and money.
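
The trade-off is easy to demonstrate numerically. The sketch below mimics low-precision storage with an 8-bit integer grid as a crude stand-in for FP8, then does the matrix arithmetic in 32-bit floats; it illustrates the principle, not DeepSeek’s framework.

```python
import numpy as np

def fake_fp8(x: np.ndarray) -> np.ndarray:
    """Crudely mimic 8-bit storage: snap values to a 255-level grid."""
    scale = np.max(np.abs(x)) / 127.0  # per-tensor scaling factor
    return np.round(x / scale).astype(np.int8).astype(np.float32) * scale

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64))

exact = a.astype(np.float32) @ b.astype(np.float32)  # full precision
mixed = fake_fp8(a) @ fake_fp8(b)  # 8-bit inputs, 32-bit accumulation

# The cheap version lands close to the exact one while storing each
# matrix entry in a single byte instead of four.
print(np.abs(exact - mixed).mean())
```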

According to some industry experts, DeepSeek’s limited resources may have forced it to innovate, raising the possibility that AI developers could accomplish more with less.


Performance

Even on its shoestring budget, DeepSeek’s benchmark scores are competitive with the most recent state-of-the-art models from leading American AI developers.

On the Artificial Analysis Quality Index, an independent ranking of AI models, R1 and OpenAI’s o1 are nearly tied. R1 already outperforms a host of other models, including Google’s Gemini 2.0 Flash, Anthropic’s Claude 3.5 Sonnet, Meta’s Llama 3.3-70B, and OpenAI’s GPT-4o.

One of its primary features is its use of chain-of-thought reasoning, which breaks difficult tasks into smaller, manageable steps and shows its reasoning as it works. The technique lets the model backtrack and revise earlier steps, mimicking human thinking, while allowing users to follow its logic.
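
A generic example of what chain-of-thought output looks like (the prompt and arithmetic here are illustrative, not taken from R1):

```python
# An illustrative chain-of-thought exchange; the format is generic,
# not R1's exact output.
prompt = ("A train travels 120 km in 2 hours, then 60 km in 1 hour. "
          "What is its average speed?")
chain_of_thought = (
    "1. Total distance: 120 + 60 = 180 km.\n"
    "2. Total time: 2 + 1 = 3 hours.\n"
    "3. Average speed: 180 / 3 = 60 km/h."
)
print(chain_of_thought)
print("Answer: 60 km/h")
```

Because each step is written out, a wrong turn at step 2 can be caught and revised before the final answer is committed.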

When V3 was released last month, it, too, performed on par with Claude 3.5 Sonnet. Even before R1 arrived, V3 outperformed Alibaba’s Qwen2.5-72B, previously China’s top AI model, as well as Llama 3.3-70B and GPT-4o.


Meanwhile, DeepSeek says its newest model, Janus-Pro-7B, outperformed Stability AI’s Stable Diffusion 3 Medium and OpenAI’s DALL-E 3 across a number of benchmarks.
