
Published: 2025-03-23 21:05
The Latest Read on o3
Two researchers, NVIDIA's Jim Fan and OpenAI research scientist Nat McAleese, shared their views on o3 on social media, covering the model's enormous leap in general-domain reasoning, the technical logic behind it, its outlook, and a question you may be wondering about: is o3 just gaming the benchmarks?
Jim Fan: scaling up "single-point RL superintelligence"
Jim Fan argues that the essence of o3 is relaxing the idea of a "single-point RL superintelligence" so that it covers a broader range of practical problems. He points out that it is nothing new for AI to achieve astonishing results with reinforcement learning (RL) in a specific domain.
For example, AlphaGo in Go, AlphaStar in StarCraft, and Boston Dynamics' e-Atlas robot on specific maneuvers can all be called superintelligent, well beyond average human ability.
Similarly, on AIME, SWE-Bench, and FrontierMath, o3 demonstrates ability beyond the vast majority of humans. Unlike AlphaGo, o3's breakthrough lies in cracking the hard problem of reward functions for complex mathematics and software engineering. This means o3 is no longer an RL specialist good at only a single task, but shows strong RL ability across a much larger set of useful tasks.
However, Jim Fan also notes that o3's reward engineering cannot cover the entire range of human cognition. This explains why o3 can astonish top experts in some domains yet fail on simple children's puzzles, just as we cannot expect AlphaGo to play poker and win.
Jim Fan sees o3 as a huge milestone that lays out a clear roadmap, while stressing that much work remains to be done.
Nat McAleese: a huge leap in general-domain reasoning
Nat McAleese emphasizes that o3 represents a huge leap for RL in general-domain reasoning. He first recaps the o1 model, a large language model trained with RL, and notes that o3 is the result of scaling RL up further on top of o1.
o3's performance is impressive across many domains:
- Competitive programming: in recent programming contests, o3 rivals some of the world's top programmers, with an estimated CodeForces rating above 2700, exceeding even Nat's prior expectations.
- GPQA: o3 scores 87.7% on GPQA, far above any previously known external LLM (e.g. Gemini Flash 2 at 62%) and o1's 78%.
- Software engineering: o3 scores 71.7% on SWE-bench Verified, a large jump over previous models.
- Hard math: on FrontierMath 2024-11-26, o3 raises accuracy from 2% to 25%.
- ARC: o3 scores 87.5% on the semi-private test set and 91.5% on the public validation set.
McAleese specifically stresses that, to rule out training-data leakage, OpenAI takes data contamination very seriously and verified o3's performance on datasets guaranteed unseen, such as ARC and FrontierMath, ensuring the results are reliable: o3 is not gaming the benchmarks.
Original posts
Jim Fan: Senior Research Manager at NVIDIA and co-founder of the GEAR lab (embodied AI); lead of the GR00T project (solving general-purpose robotics); PhD from Stanford University; OpenAI's first intern.
Thoughts about o3: I'll skip the obvious part (extraordinary reasoning, FrontierMath is insanely hard, etc). I think the essence of o3 is about relaxing a single-point RL super intelligence to cover more points in the space of useful problems.
The world of AI is no stranger to RL achieving god-level stunts.
AlphaGo was a super intelligence. It beats the world champion in Go - well above 99.999% of regular players. AlphaStar was a super intelligence.
It bests some of the greatest e-sport champion teams on StarCraft. Boston Dynamics e-Atlas was a super intelligence. It performs perfect backflips. Most human brains don't know how to send such sophisticated control signals to their limbs.
Similar statement can be made for AIME, SWE-Bench, and FrontierMath - they are like Go, which requires exceptional domain expertise above 99.99....% of average people. o3 is a super intelligence when operating in these domains.
The key difference is that AlphaGo uses RL to optimize for a simple, almost trivially defined reward function: winning the game gives 1, losing gives 0.
Learning reward functions for sophisticated math and software engineering are much harder. o3 made a breakthrough in solving the reward problem, for the domains that OpenAI prioritizes. It is no longer an RL specialist for single-point task, but an RL specialist for a bigger set of useful tasks.
Yet o3's reward engineering could not cover ALL distribution of human cognition. This is why we are still cursed by Moravec's paradox. o3 can wow the Fields Medalists, but still fail to solve some 5-yr-old puzzles like the one below. I am not at all surprised by this cognitive dissonance, just like we wouldn't expect AlphaGo to win Poker games.
Huge milestone. Clear roadmap. More to do.
Nat McAleese: researcher at OpenAI; previously worked at DeepMind.
o3 represents enormous progress in general-domain reasoning with RL — excited that we were able to announce some results today! Here’s a summary of what we shared about o3 in the livestream
o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very, very impressive.
Firstly and most importantly: we tested on recent unseen programming competitions and find that the model would rank amongst some of the best competitive programmers in the world, with an estimated CodeForces rating over 2700.
This is a milestone (codeforces better than Jakub Pachocki) that I thought was further away than December ‘24; these competitions are hard and extremely competitive; the model is absurdly good.
Scores are impressive elsewhere too. 87.7% GPQA diamond towers over any LLM I’m aware of externally (I believe non-o1 sota is gemini flash 2 at 62%?), as well as o1’s 78%. Unknown noise ceiling, so this may even understate o3 science improvements over o1.
o3 can also do software engineering, setting a new state of the art on SWE-bench verified with 71.7%, massively improving over o1.
With scores this strong, you might fear accidental contamination. Avoiding this is something OAI is obviously obsessed with; but thankfully we also have some test sets that are strongly guaranteed uncontaminated: ARC and FrontierMath… What do we see there?
Well, on FrontierMath 2024-11-26 o3 improves the state of the art from 2% to 25% accuracy. These are absurdly hard strongly held out math questions. And on ARC, the semi-private test set and public validation set scores are 87.5% (private) and 91.5% (public).
So at least in those cases, we know with true certainty that results are not due to memorization (and very sure in all the other evals I describe as unseen too; I'm just tremendously paranoid).
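A common way practitioners screen for the accidental contamination McAleese describes is to flag any eval item that shares a long n-gram with a training document. The sketch below is a generic illustration of that technique, not OpenAI's actual pipeline; the n=8 threshold and the sample strings are illustrative choices.

```python
def ngrams(text, n=8):
    # All whitespace-tokenized n-grams of the text, lowercased.
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(eval_item, training_docs, n=8):
    # Flag the item if any long n-gram from it also appears in training data.
    probe = ngrams(eval_item, n)
    return any(probe & ngrams(doc, n) for doc in training_docs)

train = ["the quick brown fox jumps over the lazy dog near the river bank"]
seen = "we ask: the quick brown fox jumps over the lazy dog near what?"
fresh = "compute the rank of a generic four by four integer matrix quickly"
print(is_contaminated(seen, train), is_contaminated(fresh, train))  # True False
```

Exact n-gram matching catches verbatim leakage only; paraphrased leaks need fuzzier checks, which is one reason strongly held-out sets like ARC and FrontierMath are the more convincing evidence.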
We’ve also found that we can use o3 to train faster and cheaper models without losing as much performance as you might expect: o3-mini is a mighty little beast, and I’m hopeful that Hongyu will share a good thread on how it stacks up.
Are there any catches? Well, as the ARC team outlined in our release, o3 is also the most expensive model ever at test-time. But what that means is we’ve unlocked a new era where spending more test-time compute can produce improved performance up to truly absurd levels.
My personal expectation is that token prices will fall and that the most important news here is that we now have methods to turn test-time compute into improved performance up to a very large scale.
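One well-known mechanism for turning extra test-time compute into accuracy, in the spirit McAleese describes, is self-consistency: sample many candidate answers and take the majority vote. The sketch below uses a mock noisy solver as a stand-in for sampling a reasoning chain (it is not o3's mechanism, and the 60% single-sample accuracy is an invented number for illustration).

```python
import random
from collections import Counter

def noisy_solver(rng):
    # Stand-in for sampling one reasoning chain: correct (42) 60% of the
    # time, otherwise a scattered wrong answer in 0..99.
    return 42 if rng.random() < 0.6 else rng.randrange(100)

def majority_vote(n_samples, rng):
    # Spend n_samples times the compute, return the most common answer.
    votes = Counter(noisy_solver(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
acc1 = sum(noisy_solver(rng) == 42 for _ in range(1000)) / 1000
accN = sum(majority_vote(25, rng) == 42 for _ in range(1000)) / 1000
print(acc1, accN)  # more samples per question -> higher accuracy
```

Because wrong answers are scattered while the correct one is concentrated, accuracy climbs steeply with the sampling budget, which is the basic shape of the "more test-time compute, better performance" curve.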
The models will only get better with time; and almost nobody (on a grand scale) can still beat them at programming competitions or math. Merry Christmas!
As Sam mentioned at the start of the stream: this is not a model that you can talk to yet... unless you sign up to red team it with us! https://openai.com/index/early-access-for-safety-testing/
Source: AI寒武纪. Original title: 《o3 没有“刷榜”》 (“o3 Is Not Gaming the Benchmarks”).
Risk notice and disclaimer: Markets carry risk; invest with caution. This article does not constitute personal investment advice and does not take into account any individual user's investment objectives, financial situation, or needs. Users should consider whether any opinion, view, or conclusion herein fits their particular circumstances. Investment on this basis is at your own risk.