=== TAG === AI Models === HEADLINE === GPT-5.4 Is Here — And It Just Beat Human Performance === META_DESC === 75% on OSWorld vs. the 72.4% human baseline. The first AI to outperform humans at native desktop tasks — not a plugin, not a preview. Native. === DATE === April 23–26, 2026 === AUTHOR === Jane Sterling === READ_TIME === 9-minute read === HERO_IMG === img/content.png === SCRIPT_LABEL === Video Script (9 min, clean transcript for captioning) === SCRIPT === Let me ask you something. What does it actually mean when an AI model beats human performance? Not on a trivia test. Not on a multiple choice exam. On a benchmark that measures whether it can sit down at a computer, look at the screen, and get things done — the same way you do. Because that just happened. OpenAI launched GPT-5.4 this week, and buried inside the announcement is something that I think most people are going to walk right past: this model scored 75% on OSWorld. The human baseline on that same benchmark is 72.4%. AI just beat us at using a computer. Now before we go full apocalypse mode — let's actually talk about what this means, because it's more interesting and more complicated than the headline suggests. OSWorld is a benchmark that tests whether an AI can navigate a real desktop environment. We're talking screenshots, keyboard inputs, mouse clicks. Tasks like "open this file, find this information, fill in this form, send this email." Things a competent human assistant would do without thinking. GPT-5.4 does them better. What's significant here isn't just the score. It's that this is the first time OpenAI has baked computer use natively into a mainline flagship model. Previous versions could browse the web with plugins, could use tools, could write code — but this is the model itself reaching out and controlling a computer as a first-class capability. Not a bolt-on. Not a preview feature. Native. So what else did they ship? The context window is now one million tokens. To put that in perspective — the complete works of Shakespeare are about 900,000 words. You could feed this model the entire collected output of one of history's greatest writers and still have room left over. In practical terms, this means you can dump an entire codebase, an entire legal document archive, an entire research library into a single conversation and work with it in one shot. No chunking. No losing context halfway through. On the coding side — 57.7% on SWE-bench Pro. That's frontier-level performance for real-world software engineering tasks. Not toy problems. Actual GitHub issues from actual repositories. On knowledge work — 83% on GDPval. That benchmark measures performance on graduate-level professional tasks across law, medicine, finance, and research. And the price? $2.50 per million input tokens, $15 per million output tokens. Let that sit for a second. You're getting a model that outperforms humans at desktop automation, handles one million tokens of context, codes at a near-professional level, and thinks through complex professional problems — for two dollars and fifty cents per million tokens. Now here's the part I want you to actually think about, because this is where it starts to matter for people who aren't AI researchers. The shift from AI-as-chatbot to AI-as-coworker is happening right now. Not in five years. Not in the next model cycle. Now. When a model can sit at a computer, understand the screen, and take actions — that model can do your job. Not your whole job. Not every part of it. 
But the parts that involve navigating software, pulling information, filling out forms, running processes — those parts are now on the table. For some people, that's terrifying. For others, it's the most powerful productivity lever they've ever seen. The honest answer is it's both, and which one it is for you depends entirely on whether you figure out how to use it before someone else does. One thing worth flagging: GPT-5.4 also has five-level reasoning effort control. You can dial up or down how hard the model thinks depending on what you need. Simple task? Fast, cheap inference. Complex multi-step problem? Full reasoning effort. That's a level of control we haven't had before, and it changes how you'd build workflows around this. The model is rolling out across ChatGPT and the API now. The computer use feature specifically is in the API, which means right now it's mostly developers who can get to it — but that changes fast. So where does this leave things? GPT-5.4 is a meaningful step, not a gimmick release. The computer use score alone makes it a serious contender for any workflow involving desktop automation. The context window alone changes what's possible in long-document analysis. The combination of both, at this price point, makes it one of the most practically capable models available right now. The question isn't whether this matters. It does. The question is what you do about it. Stay sharp. — Jane Sterling, Sterling Intelligence === ANNOTATED_LABEL === Annotated Script (with b-roll & cut cues) === ANNOTATED_HTML === [TALKING HEAD — hook]
Let me ask you something.
What does it actually mean when an AI model beats human performance? Not on a trivia test. Not on a multiple choice exam. On a benchmark that measures whether it can sit down at a computer, look at the screen, and get things done — the same way you do.
Because that just happened.
[B-ROLL: company-logo:openai]OpenAI launched GPT-5.4 this week, and buried inside the announcement is something that I think most people are going to walk right past: this model scored 75% on OSWorld. The human baseline on that same benchmark is 72.4%.
[STAT CARD: "75% vs 72.4% human baseline"]AI just beat us at using a computer.
[CUT] [TALKING HEAD — transition]Now before we go full apocalypse mode — let's actually talk about what this means, because it's more interesting and more complicated than the headline suggests.
[VOICEOVER — scene 2] [B-ROLL: screen-capture:osworld-benchmark]OSWorld is a benchmark that tests whether an AI can navigate a real desktop environment. We're talking screenshots, keyboard inputs, mouse clicks. Tasks like "open this file, find this information, fill in this form, send this email." Things a competent human assistant would do without thinking. GPT-5.4 does them better.
[B-ROLL: ai-abstract]What's significant here isn't just the score. It's that this is the first time OpenAI has baked computer use natively into a mainline flagship model. Previous versions could browse the web with plugins, could use tools, could write code — but this is the model itself reaching out and controlling a computer as a first-class capability. Not a bolt-on. Not a preview feature. Native.
[/VOICEOVER] [TALKING HEAD — transition]So what else did they ship?
[VOICEOVER — scene 3] [B-ROLL: stills:shakespeare-library]The context window is now one million tokens. To put that in perspective — the complete works of Shakespeare are about 900,000 words. You could feed this model the entire collected output of one of history's greatest writers and still have room left over. In practical terms, this means you can dump an entire codebase, an entire legal document archive, an entire research library into a single conversation and work with it in one shot. No chunking. No losing context halfway through.
[STAT CARD: "1,000,000 tokens"] [B-ROLL: code-terminal]On the coding side — 57.7% on SWE-bench Pro. That's frontier-level performance for real-world software engineering tasks. Not toy problems. Actual GitHub issues from actual repositories.
[STAT CARD: "57.7% SWE-bench Pro"] [B-ROLL: stills:professional-knowledge-work]On knowledge work — 83% on GDPval. That benchmark measures performance on graduate-level professional tasks across law, medicine, finance, and research.
[STAT CARD: "83% GDPval"] [B-ROLL: finance-charts]And the price? $2.50 per million input tokens, $15 per million output tokens.
[STAT CARD: "$2.50 / $15 per million tokens"]Let that sit for a second. You're getting a model that outperforms humans at desktop automation, handles one million tokens of context, codes at a near-professional level, and thinks through complex professional problems — for two dollars and fifty cents per million tokens.
[/VOICEOVER] [CUT] [TALKING HEAD — transition]Now here's the part I want you to actually think about, because this is where it starts to matter for people who aren't AI researchers.
[VOICEOVER — scene 4] [B-ROLL: ai-abstract]The shift from AI-as-chatbot to AI-as-coworker is happening right now. Not in five years. Not in the next model cycle. Now. When a model can sit at a computer, understand the screen, and take actions — that model can do your job. Not your whole job. Not every part of it. But the parts that involve navigating software, pulling information, filling out forms, running processes — those parts are now on the table.
[B-ROLL: stills:office-workers]For some people, that's terrifying. For others, it's the most powerful productivity lever they've ever seen. The honest answer is it's both, and which one it is for you depends entirely on whether you figure out how to use it before someone else does.
[/VOICEOVER] [B-ROLL: screen-capture:chatgpt-reasoning-slider]One thing worth flagging: GPT-5.4 also has five-level reasoning effort control. You can dial up or down how hard the model thinks depending on what you need. Simple task? Fast, cheap inference. Complex multi-step problem? Full reasoning effort. That's a level of control we haven't had before, and it changes how you'd build workflows around this.
[B-ROLL: screen-capture:openai-api-dashboard]The model is rolling out across ChatGPT and the API now. The computer use feature specifically is in the API, which means right now it's mostly developers who can get to it — but that changes fast.
[CUT] [TALKING HEAD — sign-off]So where does this leave things?
GPT-5.4 is a meaningful step, not a gimmick release. The computer use score alone makes it a serious contender for any workflow involving desktop automation. The context window alone changes what's possible in long-document analysis. The combination of both, at this price point, makes it one of the most practically capable models available right now.
The question isn't whether this matters. It does. The question is what you do about it.
Stay sharp.
— Jane Sterling, Sterling Intelligence
=== ARTICLE_HTML ===OpenAI just shipped GPT-5.4 — and if you only read one thing about it, read this: it scored 75% on OSWorld. The human baseline is 72.4%. For the first time, an AI model has outperformed humans at using a computer.
Not at trivia. Not at pattern recognition. At actually sitting in front of a desktop, looking at the screen, and getting things done.
In this video, Jane Sterling breaks down everything you need to know about GPT-5.4 — what it actually does, why the numbers matter, what changed from previous versions, and what this means for how you work.
GPT-5.4 is OpenAI's latest flagship model, launched this week and now rolling out across ChatGPT and the API. It is the first mainline OpenAI model to incorporate native computer use capabilities — not as a plugin or preview, but as a core feature of the model itself.
Previous GPT models could use tools, browse the web, and write code. GPT-5.4 can look at a screenshot of your desktop and take action. It understands what it's seeing, decides what to do, and executes — keyboard inputs, mouse clicks, form fills, file navigation.
That is a fundamentally different kind of AI capability than anything we've seen deployed at this scale before.
OSWorld — 75.0%
OSWorld is a benchmark that measures AI performance on real desktop computer tasks. Not simulations. Actual desktop environments with real software. GPT-5.4 scores 75.0%. The human baseline is 72.4%. GPT-5.2, the previous version, scored 47.3%. That's a 27.7-point improvement in a single model generation.
SWE-bench Pro — 57.7%
SWE-bench Pro measures performance on real-world software engineering tasks pulled from actual GitHub repositories. 57.7% is frontier-level: the model resolves more than half of the benchmark's real-world issues to a professional standard.
GDPval — 83%
GDPval measures graduate-level professional task performance across law, medicine, finance, and research domains. 83% is not a research curiosity number — it's a number that should make every knowledge worker think carefully about the next two years.
Context Window — 1,000,000 tokens
One million tokens. The complete works of Shakespeare come in at around 900,000 words. You could load an entire codebase, a complete legal archive, a full research library, or a year's worth of company communications into a single GPT-5.4 conversation and work with it without losing a thread.
The jump from 47.3% to 75% on OSWorld is the headline, but the architectural changes underneath it are what make GPT-5.4 significant rather than incremental.
GPT-5.4 merges the frontier coding capabilities previously exclusive to GPT-5.3-Codex into the main model. Previously, if you wanted top-tier coding performance, you used a different model variant. Now those capabilities are the baseline.
The five-level reasoning effort control is new and genuinely useful. You can tell the model how hard to think. Simple request? Fast inference, minimal cost. Complex multi-step reasoning problem? Full effort, longer output, deeper analysis. That control changes how you'd architect workflows — you're not paying for maximum reasoning when you don't need it.
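The announcement frames this as a per-request dial but doesn't document the exact API surface. If it follows the shape of OpenAI's existing reasoning controls in the Responses API, selecting effort would look roughly like the sketch below; the "gpt-5.4" model string and the effort level names are assumptions, since the announcement promises five levels without naming them.

```python
from openai import OpenAI

client = OpenAI()

# Cheap, fast inference for a simple extraction task.
# "gpt-5.4" and the effort names are assumptions; the announcement
# promises five levels but does not name them.
quick = client.responses.create(
    model="gpt-5.4",
    reasoning={"effort": "minimal"},
    input="Extract the invoice number from this email: ...",
)

# Full effort for a complex, multi-step problem.
deep = client.responses.create(
    model="gpt-5.4",
    reasoning={"effort": "high"},
    input="Plan a zero-downtime migration of this schema: ...",
)

print(quick.output_text)
print(deep.output_text)
```

The practical payoff is routing: classify requests by difficulty first, then spend full reasoning budget only where it earns its cost.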
$2.50 per million input tokens.
$15 per million output tokens.
For context: a typical 9-minute YouTube script is roughly 1,300 words, or about 1,700 tokens of output. At $15 per million output tokens, that's about two and a half cents per script: roughly 39 scripts for a dollar, or 588 for the $15 a million output tokens costs. At scale, the cost calculus for GPT-5.4 becomes very interesting very quickly for anyone building AI-powered workflows.
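To check that math yourself (the tokens-per-word ratio is an estimate, roughly 1.3 tokens per English word):

```python
OUTPUT_PRICE = 15.00 / 1_000_000   # dollars per output token
TOKENS_PER_SCRIPT = 1_700          # ~1,300 words at ~1.3 tokens/word (estimate)

cost_per_script = TOKENS_PER_SCRIPT * OUTPUT_PRICE
print(f"~${cost_per_script:.4f} per script")                                   # ~$0.0255
print(f"~{1 / cost_per_script:.0f} scripts per dollar")                        # ~39
print(f"~{1_000_000 // TOKENS_PER_SCRIPT} scripts per $15 of output tokens")   # ~588
```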
Let's be direct about this.
A model that can control a computer — that can see a screen, understand what's on it, and take meaningful action — is not just a better chatbot. It is a digital worker. Not a complete replacement for human judgment. Not capable of everything. But capable of the parts of knowledge work that involve navigating software, pulling and organizing information, executing repetitive processes, and filling out forms.
Those tasks account for a significant percentage of white-collar working hours. Research suggests that between 30% and 40% of the tasks office workers perform on any given day fall into this category.
GPT-5.4's computer use capability puts those tasks in scope for automation right now, at production quality, through a commercially available API.
The million-token context window extends this further. Long-document analysis, cross-referencing multiple lengthy sources, maintaining coherent understanding across an entire project over time — all of these become tractable in a way they simply weren't before.
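A practical first step is counting tokens before you send anything. Here's a minimal sketch using OpenAI's tiktoken library; the encoding name is an assumption, since the announcement doesn't say which tokenizer GPT-5.4 uses, and "my_codebase" is a placeholder path:

```python
import pathlib
import tiktoken

# "o200k_base" is the encoding recent OpenAI models use; whether
# GPT-5.4 shares it is an assumption.
enc = tiktoken.get_encoding("o200k_base")

total = 0
for path in pathlib.Path("my_codebase").rglob("*"):  # placeholder directory
    if path.is_file() and path.suffix in {".py", ".ts", ".md", ".sql"}:
        total += len(enc.encode(path.read_text(errors="ignore")))

verdict = "fits" if total <= 1_000_000 else "does not fit"
print(f"{total:,} tokens -> {verdict} in a 1M-token window")
```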
Developers building automation pipelines — this is your new baseline. The computer use API changes what's buildable without custom vision systems or brittle UI automation scripts; a sketch of the basic loop follows this list.
Legal, medical, and finance professionals — the GDPval score means this model is operating in your domain at a level of competence that matters. The question isn't whether AI is coming for knowledge work. It's already here.
Business operators — any workflow that involves humans navigating software to extract, move, or process information is now a candidate for automation with GPT-5.4. The ROI math on that has changed significantly this week.
Anyone building AI-powered products — the context window alone unlocks use cases that weren't viable before. Full codebase analysis. Complete document review. End-to-end research synthesis. These are now single-session tasks.
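For the developers in that list: the announcement doesn't document the computer use API itself, but any pipeline built on it reduces to an observe-decide-act loop. The sketch below shows that shape using pyautogui for the desktop side; decide_next_action is a hypothetical stand-in for the real model call, and the action schema is invented for illustration.

```python
import time
import pyautogui  # screenshots and synthetic keyboard/mouse input

def decide_next_action(screenshot, goal):
    """Hypothetical stand-in for the GPT-5.4 computer use call:
    send the current screenshot plus the goal, get back one action."""
    raise NotImplementedError("wire this to the actual API")

def run_task(goal, max_steps=20):
    for _ in range(max_steps):
        shot = pyautogui.screenshot()            # observe the desktop
        action = decide_next_action(shot, goal)  # model decides
        if action["type"] == "done":
            return True
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.write(action["text"], interval=0.02)
        time.sleep(0.5)  # let the UI settle before re-observing
    return False  # step budget exhausted without finishing
```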
Computer use at 75% means it fails 25% of the time. In a fully autonomous deployment, that failure rate has consequences. Human oversight of AI-driven desktop tasks remains important, particularly for anything with financial, legal, or irreversible consequences.
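One way to see why oversight matters: per-task success rates compound across chained work. Assuming each task succeeds independently at 75% (independence is a simplification; real failures correlate), a quick Python check:

```python
# End-to-end success of a chain of independent 75%-reliable tasks.
for n in (1, 3, 5, 10):
    print(f"{n:>2} chained tasks -> {0.75 ** n:6.1%} success")
# 1 -> 75.0%, 3 -> 42.2%, 5 -> 23.7%, 10 -> 5.6%
```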
The model is also rolling out gradually. The computer use feature is API-first, which means the average ChatGPT user isn't getting full access immediately. If you want to work with it at full capability right now, you're in the API.
GPT-5.4 is one data point in a larger pattern. Every major lab is converging on the same destination: AI that doesn't just respond to you, but acts on your behalf. Computer use is OpenAI's entry into that space at production quality.
Anthropic has Claude with agentic capabilities. Google has Gemini with workspace integration. The competition is not about who can write the most impressive essay anymore. It's about who can get things done in the real world.
GPT-5.4 just moved the line.
Subscribe to Sterling Intelligence for weekly breakdowns of what's actually happening in AI — no hype, no filler, just the signal.
New videos every week.
— Jane Sterling
Some links in the description may be affiliate links. If you purchase through them, I may earn a small commission at no extra cost to you. I only recommend things worth recommending.
=== YOUTUBE_DESC === GPT-5.4 just beat humans at using a computer. 75% on OSWorld. Human baseline: 72.4%. This is the first time an AI model has outperformed humans at sitting down at a desktop and getting real work done — and it changes everything about how "agent AI" gets built from here.

OpenAI shipped GPT-5.4 this week, and buried inside the announcement is a number most people are going to walk right past. On OSWorld — a benchmark that tests whether an AI can navigate a real desktop environment, take screenshots, click, type, and execute actual tasks — GPT-5.4 scored 75%. Humans score 72.4% on the same test. GPT-5.2, just one model generation ago, scored 47.3%. That is a 27.7-point jump in a single release.

In this episode, Jane Sterling breaks down what GPT-5.4 actually is, why the OSWorld score matters more than any other benchmark result OpenAI has published this year, and what native computer use means for the people whose jobs involve navigating software all day.

Key numbers covered:
• OSWorld — 75.0% (human baseline 72.4%)
• SWE-bench Pro — 57.7% (frontier-level real-world coding)
• GDPval — 83% (graduate-level professional tasks)
• Context window — 1,000,000 tokens (the complete works of Shakespeare are ~900,000 words)
• Price — $2.50 / million input tokens, $15 / million output tokens
• Five-level reasoning effort control — new in this release

We cover what changed from GPT-5.2, how GPT-5.4 folds in the coding power of GPT-5.3-Codex, who should actually care (developers, legal / medical / finance professionals, business operators, product builders), the honest caveats (25% failure rate on computer use is not trivial), and how this fits the broader shift from AI-as-chatbot to AI-as-coworker.

⏱ Chapters
00:00 The headline nobody's talking about
01:00 What is OSWorld and why it matters
02:30 Context window, SWE-bench, GDPval
04:30 Pricing
05:30 What this means for your work
07:30 The honest caveat and who should pay attention

🔔 Subscribe to Sterling Intelligence for weekly breakdowns of what's actually happening in AI — no hype, no filler, just the signal.
https://www.youtube.com/@SterlingIntelligence

— Jane Sterling, Sterling Intelligence

#GPT54 #OpenAI #ChatGPT #AINews #ComputerUse #OSWorld #AIAgents #AgenticAI #SWEBench #AIBenchmarks #SterlingIntelligence #JaneSterling #AIWeekly #ArtificialIntelligence #TechNews2026

=== TITLES_HTML ===Expression. Serious-concerned, eyebrows very slightly drawn. Not shocked, not smiling — the face you make when you read a number and realize the implications. Closed mouth, subtle tension at the jaw.
Head position. Squared to camera with a very slight forward lean. Chin neutral, eye line level. Conveys authority and "you need to hear this" without being theatrical.
Wardrobe. Dark blazer, minimalist. No jewelry that catches light. Consistent with the Sterling Intelligence brand palette (black, charcoal, gold accent only).
Eye direction. Direct to camera, locked. The thumbnail's job is to make the viewer feel addressed. Alternate take: eyes cut sharply to the right toward the 75% overlay.
Lighting. Key light from upper-left at ~4800K, soft fill on the right at 25% intensity. Deep shadow on the left jaw line for drama. Subtle rim light from behind-right to lift her off the background.
Scene setup. Near-black charcoal background with a subtle green-teal gradient in the far upper-right (faint OpenAI brand nod). Shallow depth of field — Jane tack-sharp, background soft. Optional ghosted OSWorld desktop-grid motif at 15% opacity behind her shoulder.
Position. Right third of the frame, stacked scoreboard — "AI" row on top with 75%, "HUMAN" row below with 72.4%.
Font. JetBrains Mono Bold for the numbers (monospace reads as data); Inter Black for the labels.
Color scheme. "AI" label in gold (#c8a84b), 75% in pure white with a faint red (#dc2626) underglow. "HUMAN" label in muted gray (#888), 72.4% in white. 3px black stroke on every character for legibility.
Accent detail. Small caps header above the scoreboard: "OSWORLD BENCHMARK" in 11px gold. Makes it read as a credible data card rather than a clickbait claim.
Position. Lower-left third, large, stacked on two lines — "IT" on top, "BEAT US" below. Close to Jane's shoulder so the eye travels face → text.
Font. Bebas Neue Bold or Impact, condensed all-caps, tight tracking.
Color scheme. "IT" in white, "BEAT US" in bright red (#dc2626) at 115% scale of "IT". 3px black stroke throughout. Faint outer glow on "BEAT US" to pop against the dark background.
Accent detail. Gold sub-tag below: "GPT-5.4 vs HUMAN — OSWORLD 75%" in Inter Bold 16px, #c8a84b gold. Backs the shock claim with proof.
Position. Centered upper band, then Jane's face dominant lower two-thirds.
Font. Inter Black all caps, wide tracking (~120), stretched across full frame width.
Color scheme. Base text in white, but the word "USE" overlaid with a transparent glassy gold (#c8a84b at 80%) to visually separate. 2px black stroke.
Accent detail. Red underline under "COMPUTER USE" at 4px. Smaller gold subtitle below: "NATIVE IN GPT-5.4" in Inter Bold 18px, #c8a84b gold. Positions the story as capability-first rather than scoreboard-first.