
A lot of AIs with a lot of terrible names: how to choose your AI model

Since early 2025, AI labs have been churning out so many new models that I struggle to keep up.

But the trends say no one cares! There is only ChatGPT:

How come?

The new models are great, but their naming is complete chaos. On top of that, you can no longer tell models apart by benchmarks. The easy “this one is the best, everyone uses it” answer does not work anymore.

In short, there are many wonderful AI models on the market, but few people actually use them.

This is a shame!

I will try to untangle the naming chaos, explain the benchmark crisis, and share tips on how to choose the right model for your needs.

Many models, terrible names

Amodei has long joked that we might create AGI before we learn to name our models clearly. Google traditionally leads the confusion game:

To be fair, there is a logic to it. Each “base” model now gets a stream of updates, and they are not always groundbreaking enough to justify a new version number. That is where all these suffixes come from.

To simplify things, I put together a table of the major labs' models, stripping away all the unnecessary details.

So, what are these types of models?

  1. There are huge, powerful base models. They are impressive, but slow and wildly expensive.

  2. That is why distillation was invented: take a base model, train a more compact model on its answers, and you get almost the same capabilities, only faster and cheaper (a toy sketch follows below).

  3. Reasoning (“thinking”) models deserve a special mention. The best performers now follow multi-step chains to a solution: plan, execute, and check the result. Effective, but pricey.

There are also specialized models: ones for deep research, ultra-cheap ones for simple tasks, and domain-specific ones for fields such as medicine and law, plus a separate family for images, video, and audio. I left all of these out to avoid confusion, and I intentionally skipped some other models and labs to keep the table as simple as possible.
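To make the distillation idea from point 2 concrete, here is a toy sketch in PyTorch. It is not any particular lab's recipe: the two linear layers are stand-ins for a large teacher and a small student, and the student is trained to match the teacher's softened output distribution.

```python
import torch
import torch.nn.functional as F

# Toy distillation step: the student learns to imitate the teacher's
# output distribution (soft labels) instead of hard ground-truth labels.
teacher = torch.nn.Linear(16, 8)   # stand-in for a huge base model
student = torch.nn.Linear(16, 8)   # stand-in for a compact model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0                            # temperature softens the teacher's probabilities

x = torch.randn(32, 16)            # a batch of (toy) inputs
with torch.no_grad():              # the teacher is frozen
    teacher_probs = F.softmax(teacher(x) / T, dim=-1)

student_log_probs = F.log_softmax(student(x) / T, dim=-1)
# KL divergence between the student's and teacher's distributions,
# scaled by T^2 as in the original distillation paper.
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T**2

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Real labs distill giant LLMs over billions of tokens, but the core trick is exactly this: copy the teacher's soft answers, not just the right/wrong labels.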

Sometimes, more details make things worse.

All models are essentially equal

It has become difficult to pick a clear winner. Andrej Karpathy recently described this as an “evaluation crisis”.

It is no longer clear which benchmarks to look at. MMLU is outdated, SWE-bench is very narrow, and Chatbot Arena is so popular that the labs have learned to game it.

Currently, there are several ways to evaluate models:

  1. Narrow benchmarks measure very specific skills, such as Python coding or hallucination rates. But models have become smarter and mastered more tasks, so you cannot capture their level on a single scale.

  2. Comprehensive benchmark suites try to capture multiple dimensions with many sub-scores. But comparing all those scores quickly becomes chaotic: people end up juggling five or ten of these complex benchmarks at once! One model wins here, another there. Good luck making sense of it.

LiveBench alone contains three benchmarks within each category, and it is just one suite among dozens.

  3. Arenas, where humans blindly compare model answers based on personal preference. Models get an Elo rating, like chess players: win often and your Elo goes up. This worked great until the models got close to one another.

A 35-point rating gap means the stronger model wins only 55% of the time.

As in chess, the lower-rated player still has a decent chance of winning. Even with a 100-point gap, the “worse” model still wins in more than a third of cases.
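Those percentages come straight from the standard Elo expected-score formula, and they are easy to check yourself:

```python
# Expected win probability of the higher-rated player/model,
# per the standard Elo formula: E = 1 / (1 + 10^(-diff / 400)).
def elo_win_probability(rating_diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

for diff in (35, 100):
    print(f"{diff:>3}-point gap: stronger side wins {elo_win_probability(diff):.0%} of the time")
# Output:
#  35-point gap: stronger side wins 55% of the time
# 100-point gap: stronger side wins 64% of the time (the "worse" model takes the other 36%)
```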

Once again: some tasks are solved better by one model, others by another. Pick the higher-ranked one in the list, and maybe one of your ten requests will come out better. Which one, and how much better?

Who knows.

So, how do you choose?

Since there are no better options, Karpathy suggests relying on “vibe checks”.

Test the models yourself and see which one feels right. Sure, it is easy to fool yourself this way.

It is subjective and vulnerable to bias – but it is practical.

Here is my personal advice:

  1. If the task is new, open several tabs with different models and compare the results. Trust your gut: pick whichever model needs the fewest edits or corrections (a comparison-harness sketch follows this list).
  2. If the task is a familiar one, just use the model that has worked best for you.
  3. Forget about chasing leaderboards. Focus on the UX you enjoy, and prioritize the subscriptions you are actually willing to pay for.
  4. If you still want numbers, try https://livebench.ai/. Its creators claim it fixes common benchmark problems such as benchmark hacking, test-set contamination, narrow coverage, and subjective judging.
  5. If you are building a product, here is a great guide from Hugging Face on creating your own benchmark: https://github.com/huggingface/evaluation-guidebook
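If you want to make the multi-tab comparison from tip 1 a bit more systematic, here is a minimal sketch. `ask_model`, `model-a`, and `model-b` are placeholders, not real APIs; wire `ask_model` to whichever provider SDK you actually use.

```python
# Minimal personal "vibe check" harness: run your own recurring tasks
# through several models side by side and tally which answers you prefer.

def ask_model(model: str, prompt: str) -> str:
    # Placeholder: replace with a real call to your provider's chat API.
    return f"[{model}'s answer to: {prompt[:40]}...]"

MY_TASKS = [
    "Summarize this meeting transcript in five bullet points: ...",
    "Write a polite follow-up email about an overdue invoice.",
    "Explain this regex to a junior developer: ^(?=.*\\d).{8,}$",
]
MODELS = ["model-a", "model-b"]  # hypothetical model names

for task in MY_TASKS:
    print(f"\n=== {task[:60]} ===")
    for model in MODELS:
        print(f"\n--- {model} ---\n{ask_model(model, task)}")
    # Note which answer you preferred; after a few dozen tasks the
    # tallies become your own personal leaderboard.
```

Over time this gives you exactly what Karpathy's vibe check promises: a ranking grounded in your tasks, not someone else's benchmark.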

Meanwhile, if you have been waiting for a sign to try something other than ChatGPT, this is it:

https://claude.ai/

https://gemini.google.com/

https://grok.com/

https://chat.deepseek.com/

https://openai.com/

Next up, I will cover the highlights of each model and summarize other people's vibe checks.

If you enjoyed this and don't want to miss the next article, subscribe!

There is more to come!
