Aran Komatsuzaki — GeistHaus

Ex-OpenAI Chief Research Officer Bob McGrew on the Next Frontier of AI

Aran Komatsuzaki Jan 3, 2025

Insights from Redpoint’s “Unsupervised Learning” Podcast

Few figures in AI are as quietly pivotal as Bob McGrew. After a decade at Palantir, he joined OpenAI, where he served as Chief Research Officer until late 2024, helping shape GPT-3, GPT-4, and the o1 reasoning model. In this conversation, he covers AI model limits, new form factors, enterprise adoption, robotics, and the slow path toward AGI.

1. The Big Debate: Have AI Models Hit a Wall?

A recurring theme is whether large language models (LLMs) like GPT-4 have maxed out their capabilities.

Bob: “From the outside, it feels like everything’s accelerating. From the inside, it looks different.”

Inside vs. Outside Perspectives

Outside: AI seems to have exploded overnight with ChatGPT in 2022, then GPT-4 in 2023.
Inside: Each step (GPT-3 to GPT-4) required a 100x compute scale-up, heavy R&D, and massive data centers—costly, slow, and complex to orchestrate.

Bob notes that moving beyond GPT-4 will require even larger data centers and new algorithmic approaches, but don’t expect any immediate “GPT-5” label. Labs can rebrand future models as they see fit.

2. Beyond Pre-Training: Reinforcement Learning & Test-Time Compute

A major new direction for improving LLMs, without always requiring bigger data centers, is using reinforcement learning (RL) to train models that spend more compute at inference time. Bob points to OpenAI’s “o1” model (internally a successor to GPT-4), which uses RL to produce longer, more coherent chains of reasoning.

Bob: “o1 is effectively a new generation. It’s a 100x compute increase at test-time, though we didn’t call it GPT-5.”

Extended Reasoning at Inference

Instead of a quick one-second response, the model can “think” for 30–60 seconds, amplifying its ability to handle complex prompts.
This approach sidesteps building another gargantuan data center. You leverage the same GPU cluster but let the model “run longer.”

Bob believes 2025 will see more emphasis on test-time compute for deeper chains of thought—rather than solely relying on bigger pre-trained models.
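
To make the "spend more compute at test time" idea concrete, here is a minimal sketch of one very simple form of it: sampling several independent answers and taking a majority vote. This is not how o1 works (o1 relies on RL-trained chains of thought); it is a swapped-in, simpler technique used only to show that extra inference compute can buy accuracy. The function name and the assumptions (a two-choice question, independent samples, odd sample counts) are illustrative and not from the podcast.

```python
from math import comb


def majority_vote_accuracy(p_single: float, n_samples: int) -> float:
    """Probability that a majority of n independent samples is correct.

    Assumptions (illustrative only): a two-choice question, each sample
    is independently correct with probability p_single, and n_samples is
    odd so that ties cannot occur.
    """
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    needed = n_samples // 2 + 1  # smallest count that forms a majority
    return sum(
        comb(n_samples, k) * p_single**k * (1 - p_single) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


if __name__ == "__main__":
    # More samples means more inference compute on the same hardware.
    for n in (1, 5, 25, 101):
        print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions, accuracy rises with the number of samples whenever a single sample is right more than half the time, which is the basic trade the section describes: the same cluster, run for longer.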

3. AI in 2025: Slowing Progress or Shifting Focus?

Bob predicts AI’s trajectory will continue but shift away from “release a bigger model every few months” to focusing on:

New form factors beyond simple chat interfaces.
Reliability and enterprise trust.
Deeper integration into real-world workflows (robotics, enterprise apps, specialized tools).

Bob: “Reliability becomes a huge deal once a system is acting in the world on your behalf.”

4. New Form Factors: Where Do We Go After Chatbots?

Chatbots shine for quick Q&A, brainstorming, or coding, but they’re limiting for:

Long-term tasks (e.g., project management).
Enterprise workflows where an AI needs to act autonomously.

Agents, Not Just Chat

Bob envisions agents that:

Connect to APIs and tools.
Maintain a long chain of thought over hours or days.
Run autonomously, refining work step by step.

Bob’s example: Programming is a great fit for an agent that can iterate on code over minutes or hours. A policy analyst might do likewise for complex briefs.

5. Reliability & Enterprise Integration

Why Reliability Is Critical

In consumer contexts, an AI glitch is often a small annoyance. But in enterprise settings, errors can lead to:

Financial losses
Regulatory issues
Security breaches
Damaged relationships

Each additional “9” of reliability (e.g., going from 99% to 99.9%) is an order-of-magnitude engineering leap.
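
The arithmetic behind the "nines" can be sketched in a few lines. The helper names and the independence assumption for multi-step tasks are illustrative, not something stated in the podcast; the point is only that each extra nine cuts failures tenfold, and that small per-step error rates compound quickly once an agent chains many steps.

```python
def expected_failures(nines: int, requests: int = 1_000_000) -> float:
    """Expected failures out of `requests` at a reliability of `nines` nines.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    failure_rate = 10.0 ** (-nines)
    return requests * failure_rate


def task_success(per_step_reliability: float, steps: int) -> float:
    """Chance of finishing `steps` steps with no error.

    Assumes (for illustration) that steps fail independently.
    """
    return per_step_reliability**steps


if __name__ == "__main__":
    for nines in (2, 3, 4):
        print(nines, "nines:", expected_failures(nines), "failures per million")
    for p in (0.99, 0.999):
        print(p, "per step over 100 steps:", round(task_success(p, 100), 3))
```

Under the independence assumption, a 100-step task at 99% per-step reliability succeeds well under half the time, which is one way to see why reliability matters so much more for agents than for single chat replies.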

Next Big Challenge: Integration

Enterprises store info in Slack, Docs, or specialized tools.
AI must be secure and handle complex permissioning.
Bob sees many startups building connectors so LLMs can access data safely—akin to Palantir’s early mission of unifying enterprise data.

6. Going “Hands-On”: Agents and Computer-Use

Anthropic’s “Computer-Use” feature (AI controlling a virtual mouse/keyboard) intrigues Bob. Instead of building specialized APIs for each app, the AI uses the universal interface of a screen and cursor.

Bob: “It’s one big hammer that can do anything, but it’s expensive—models have to parse enormous UI steps.”

He predicts that technology demos of AI “controlling” a computer today will be 10x better in a year, and in two years, potentially mainstream—though cost and reliability will remain hurdles.

7. Multimodal, Video, and Creativity: The Sora Launch

Bob discusses Sora, OpenAI’s video-generation model. Video is notably expensive and complex to model (30+ frames per second), demanding huge resources to train and run.

Bob: “Sora is widely available through ChatGPT Plus. It’s a milestone since generating or editing video was beyond typical LLM capabilities.”

He expects a full-length AI-generated movie within two years, albeit more of a “co-production” between a director and AI rather than a fully autonomous creation.

8. Robotics: Five Years Away (Again)

Why It’s So Challenging

Bob joined OpenAI initially to focus on robotics. Progress has been slower than anticipated—hardware constraints, tricky real-world physics, and bridging simulation with messy real environments (like cloth or cardboard) present huge obstacles.

Bob: “Simulators excel at rigid objects, but real warehouses have floppy materials. That’s tough.”

Where We Stand

Vision and RL have improved, enabling more advanced warehouse robots.
Consumer-level home robots remain far off due to unstructured environments and safety concerns.
Within five years, Bob expects more advanced robots in retail or manufacturing—though the “home butler robot” may take longer.

9. The Complexity of Automating Jobs

Despite GPT-4 and beyond surpassing many human benchmarks, jobs aren’t vanishing overnight. A job comprises many tasks, and AI typically handles only some.

Bob: “Tasks are fractal. Even if you automate a core part, there’s still a leftover ‘human’ component that’s tough to remove.”

Boring Problems, Big Opportunities

Bob sees massive potential in automating “boring” tasks—procurement, compliance, data validation—giving companies “infinite patient workers” that handle grunt work. It’s less flashy but extremely impactful.

10. The Impact on Productivity—Especially Consultants

Consultants benefit greatly from AI, since much of their job is synthesizing large volumes of information—exactly what LLMs excel at. Bob notes bottom-half performers see the biggest productivity boost, narrowing skill gaps.

Still, there’s always a more nuanced layer demanding human strategic decision-making, ensuring that AI doesn’t completely replace the consultant’s role.

11. The Traits of Top AI Researchers

Bob has worked closely with world-class researchers (Ilya Sutskever, Alec Radford, etc.). A defining trait is grit—they persist through countless failed experiments until something finally works.

Bob: “Aditya Ramesh spent 18 months trying to generate ‘a pink panda skating on ice.’ At first, it was all blurs, but eventually, it clicked.”

12. The Evolution of OpenAI’s Mission

OpenAI has “refounded” itself multiple times:

Nonprofit to Capped For-Profit: Massive capital was essential.
Microsoft Partnership: Controversial but crucial for compute resources.
API & Product Focus: Moving from pure research to revenue.
Consumer & Enterprise: ChatGPT and robust enterprise integrations.

These shifts happened rapidly, redefining OpenAI’s culture and strategy each time.

13. The Challenge of Scaling: On the Road to AGI

Bob is cautious about the term AGI, viewing it as a continuum, not a sudden arrival. Each new generation demands huge system engineering feats—in hardware, data, optimization, and distribution.

Bob: “We solved a big chunk of reasoning. Now the challenge is scaling it, which is very hard—but it’s the path forward.”

He imagines a future with self-driving commutes and AI-run offices that still feel mundane, even though they’re powered by astonishing technology.

14. The Future of AI & Human Agency

As intelligence gets cheaper, Bob argues that human agency—the ability to decide what to do—remains our most valuable resource.

Bob: “We might ask an AI to build a ‘cool product,’ but we still have to define what ‘cool’ means. That part is still very human.”

15. AI in Social Sciences and Academia

Originally trained in Game Theory, Bob left academia due to slow research cycles and career incentives he found unaligned. He sees product development as “applied social science” at scale.

One intriguing idea: “fake users” powered by advanced LLMs for product teams to do rapid prototyping or user research—before releasing features to real customers.

16. Reflections and Future Plans

Why Bob Left OpenAI

After eight intense years—during which he helped launch GPT-3, GPT-4, and o1—Bob felt he’d accomplished what he wanted. Shipping o1 was a capstone.

Bob: “It’s a hard job. Once I felt done, it was time to hand off.”

He’s now exploring new ventures, particularly in robotics, and taking a break after a long sprint at OpenAI.

Parenting in an AI World

Though AI can solve math and write code, Bob still teaches his child those fundamentals because they foster thinking skills—essential in any future.

17. Rapid-Fire Questions & Closing Thoughts

Overhyped: Claims of “radical new AI architectures.” At scale, many flashy ideas don’t pan out.
Underhyped: The “o1” model—already a generational leap, yet overshadowed by the name.
Looking Ahead: Bob expects continued progress, shifting focus toward reliability, integration, and scaling rather than purely bigger models.

Bob: “Keep working on it. Progress won’t slow—it’ll just change direction.”

Key Takeaways & Final Reflections

Scaling Continues, but Differently
- Huge leaps require big compute, but future releases may appear under new names.
Reinforcement Learning and Test-Time Compute
- “o1” proves we can boost reasoning by letting models “think longer.”
Reliability Matters
- Each extra “9” of reliability is a 10x engineering challenge—crucial for enterprise adoption.
Video & Robotics
- Sora marks a leap in multimodal AI.
- Robotics is edging forward, especially in structured industrial contexts, but still lags in homes.
Human Work Is Still Needed
- The truly “human” part—setting goals, deciding next steps—remains.
OpenAI’s Transformations
- From nonprofit to a “platform juggernaut,” pushing the whole field forward.
Agency as the Scarce Resource
- Intelligence may be plentiful, but deciding how to use it—that’s still on us.

Further Reading & Listening

Full Podcast Episode: Ex-OpenAI Chief Research Officer: What Comes Next for AI?
Bob’s Twitter: @bobmcgrewai

Final Word

Bob McGrew’s path—from early Palantir to OpenAI’s cutting-edge labs—reflects how AI has grown from niche projects to a force reshaping nearly every industry. For all the talk of AI “surpassing humans,” Bob emphasizes that true breakthroughs still require massive engineering, reliability leaps, and thoughtful integration into everyday workflows.

Ultimately, the tech might feel mundane once it’s baked into self-driving cars or AI-run offices, but that ordinariness masks tremendous complexity. Whether you’re a researcher, founder, or enthusiast, Bob’s story underlines how much is yet to be built—and how crucial human agency remains in charting AI’s path forward.

“It’s not slowing; it’s just changing—fast.” – Bob McGrew

Now is the time to experiment, build, and embrace the next wave of powerful, generative, and integrated AI systems. As Bob would say, find that “Hill” you’re willing to climb because the breakthroughs keep on coming.

Thank you for reading. For more in-depth discussions, subscribe to Redpoint’s “Unsupervised Learning” podcast.

http://arankomatsuzaki.wordpress.com/?p=872

Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

Aran Komatsuzaki Jan 3, 2025

A Long-Form Blog Post Summarizing Lex Fridman Podcast #452

This post summarizes and expands on, section by section, the ideas discussed by Dario Amodei (other speakers are not covered) during his conversation with Lex Fridman on the Lex Fridman Podcast (#452). It weaves together topics ranging from Anthropic’s language model Claude to AI safety, scaling laws, regulation, team building, and the philosophical meaning of life with ever more powerful AI.

Table of Contents

Introduction
Scaling Laws
Limits of LLM Scaling
Competition with OpenAI, Google, xAI, Meta
Opus 3.5
Sonnet 3.5
Claude 4.0
Criticism of Claude
AI Safety Levels
ASL-3 and ASL-4
Computer Use
Government Regulation of AI
Hiring a Great Team
Post-Training
Constitutional AI
Machines of Loving Grace
AGI Timeline
Programming
Meaning of Life
Concluding Reflections

Introduction

The conversation begins with Lex Fridman introducing his guest, Dario Amodei, CEO and co-founder of Anthropic—an AI research company that has made waves with its large language model Claude. From the outset, we learn that Anthropic isn’t just aiming to challenge tech titans like OpenAI or Google; it’s also weaving AI safety deeply into its fabric.

Dario traces Anthropic’s origin story to a single goal: pushing AI capabilities while ensuring robust guardrails protect society from their misuse. In his view, the field of AI has evolved breathtakingly fast. Five years ago, few would have believed an AI could write code or conjure near-photorealistic images. Now, these are old news. The upshot? AGI (Artificial General Intelligence)—once considered a distant dream—may be creeping up sooner than we think.

Dario balances optimism with a strong note of caution. On one hand, advanced AI might spark breakthroughs in medicine, biology, physics, and more. On the other, it could concentrate unprecedented power in a few hands if not responsibly managed. For Anthropic, simply building “bigger” AI is never enough; alignment with human values—from equity to accountability—is paramount.

That ethos sets Anthropic apart. While many organizations mention safety, few make it a central pillar. The stage is thus set for a conversation that explores everything from technical wonders (Claude’s architecture, scaling laws) to big-picture philosophy (how AI might reshape the destiny of humankind).

Scaling Laws

A hot topic early in the talk is “scaling laws.” In essence, scaling laws say that as you keep feeding a model more compute, data, and parameters, its performance improves smoothly and predictably, and new capabilities can appear along the way. Dario’s fascination with scaling began around 2014, back in his days at Baidu, where speech recognition systems got noticeably better just by adding more data and bigger architectures.

He points to NLP milestones like GPT-1 and GPT-2, which revealed emergent capabilities once they reached a certain size. The takeaway? While many believed deep learning would hit a wall with language, “the bitter lesson” has been that scaling up neural networks outperforms most hand-crafted solutions.

In Dario’s words, “It’s almost like a chemical reaction.” As long as you maintain the right “ingredients” (data, compute, parameters), a larger volume tends to produce a stronger result. This has been consistent across various AI tasks—image generation, math proofs, coding, you name it. Where some once predicted diminishing returns, each new generation of bigger models has left them stunned.
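
Scaling laws are usually written as power laws: loss falls as a fixed power of the resource being scaled. The sketch below shows that functional form only. The constants (`scale_ref`, `exponent`, `irreducible`) are made-up illustrative values, not figures from the podcast or from any fitted model.

```python
def power_law_loss(
    scale: float,
    scale_ref: float = 1.0,
    exponent: float = 0.1,
    irreducible: float = 0.0,
) -> float:
    """Loss predicted by a simple power law in `scale`.

    `scale` stands for whichever ingredient is being grown (parameters,
    data, or compute). All default constants are hypothetical.
    """
    return irreducible + (scale_ref / scale) ** exponent


if __name__ == "__main__":
    # Each 10x increase in scale multiplies the reducible loss by the
    # same factor, which is what makes the trend easy to extrapolate.
    for scale in (1, 10, 100, 1000):
        print(scale, round(power_law_loss(scale), 4))
```

The useful property is the regularity: every tenfold increase in scale multiplies the reducible loss by the same constant, so a trend measured on small models can be extended to larger ones.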

Limits of LLM Scaling

But is scaling an infinite escalator, or must it plateau at some point? Dario admits that the field has heard countless predictions of “looming limits”—such as poor data quality or crushing hardware costs—but these obstacles keep getting solved. Still, practical considerations can’t be ignored. The internet is awash in junk data, and advanced models eventually need curated or generated data rather than random web scraps. Additionally, the cost of training has soared—tens of thousands of GPUs running for weeks isn’t cheap.

Yet Dario sees paths around these hurdles. More efficient GPU generations, specialized chips, and collaborations among major tech players may keep the scaling train chugging forward. While no one can guarantee that bigger models alone will yield AGI, Dario remains confident we’ll get “very close” to human-level reasoning in multiple domains by continuing to scale. Novel architectures or transformer variants might further supercharge progress if we hit a plateau.

Competition with OpenAI, Google, xAI, Meta

The discussion then shifts to competition between Anthropic and AI behemoths like OpenAI, Google DeepMind, Meta, and Elon Musk’s xAI. Dario downplays the notion of a purely capability-based rivalry, insisting that Anthropic’s true goal is to steer AI in a responsible direction. For him, the ideal competition is a “race to the top,” where labs vie for the safest, most interpretable models—not a “race to the bottom,” ignoring safety in pursuit of raw power.

A big part of Anthropic’s strategy is leading by example: publishing interpretability research, championing robust safety practices, and nudging the industry to adopt higher standards. In Dario’s eyes, if everyone starts associating “good AI research” with building systems responsibly, it could raise the bar for everyone. The nightmare scenario is a chaotic arms race in which no one invests properly in safety. Anthropic is determined to avoid that outcome.

Opus 3.5

When Anthropic released Claude 3 Opus in March 2024, it was already a milestone, and Opus 3.5 is meant to take things further. Lex asks: “What’s with these incremental updates?” Dario explains that each release reflects extensive training cycles, safety evaluations, and user feedback loops.

Training such a large model can swallow up weeks or months on enormous GPU clusters, after which Anthropic performs RLHF (Reinforcement Learning from Human Feedback), red-teaming, and alignment checks. In short, there’s a methodical approach: scale up the capabilities, but scale up the safety measures too. Dario calls it their “responsible scaling policy.”

Sonnet 3.5

The medium-tier Sonnet 3.5 surprised developers by making coding leaps they hadn’t expected. Many in the coding community believed large language models were mostly for novices or basic tasks—until Sonnet 3.5 began churning out production-ready code. In real-world benchmarks, it soared from single-digit success to about 50% on complex GitHub tasks.

This wasn’t due to one magic trick—rather, it was the sum of countless little improvements in pre-training, RLHF, and alignment. Each new “.5” release might seem minor, but it can contain dozens of upgrades. Dario laughs about the headaches of naming conventions, but “continuous waves of progress” require continuous refinement.

Claude 4.0

Naturally, Lex is curious about Claude 4.0—the rumored next big leap. Dario explains that finalizing a “4.0” name or release date is tricky because it unites many parallel efforts: data curation, advanced alignment research, new compute infrastructures, etc. When they all align in readiness, Anthropic might label it “4.0” or something else entirely.

What’s clear is Anthropic’s commitment to extensive safety before dropping a major version. As these models gain massive potential, ensuring they’re well-guarded and responsibly deployed is more vital than ever. Dario acknowledges the tension between fast iteration and deep safety checks, likening it to the software world: “You can plan for version 4.0, but you don’t truly know if it’s ready until you test it thoroughly—sometimes in the wild.”

Criticism of Claude

Unsurprisingly, Claude draws criticism from some users who say it’s gotten “dumber” or “less creative” over time. Dario has heard the chatter—especially on Reddit and developer forums. However, he points out that Anthropic rarely alters the core model weights unless they roll out a new version. Most smaller updates revolve around refusal policies or tone adjustments.

Why does it feel “dumber”? Often, users themselves shift their prompting style or raise their standards as the model’s performance improves. There’s also a tricky whack-a-mole aspect to alignment: tone down apology messages, and you might risk introducing rudeness. The goal is to strike a delicate balance between being helpful, neutral, and polite in a world where user expectations are vastly diverse.

AI Safety Levels

Stepping back from user experience, we hear about Anthropic’s “Responsible Scaling Policy” (RSP)—a set of guidelines to keep increasingly powerful models in check. Two major risks loom large:

Misuse by humans (e.g., bioweapon design, sophisticated malware).
Autonomy threats (where advanced models might act on goals misaligned with human instructions).

To contain these threats, Anthropic introduces AI Safety Levels (ASLs)—labels indicating the risk potential of a model. Currently, Claude sits at ASL-2. At ASL-3, a model could enable sophisticated hacking or bioweapon designs that go beyond publicly available knowledge. If a model crosses that line, the RSP demands more guardrails—think stricter filters, better interpretability, or limited access.

ASL-3 and ASL-4

ASL-3 implies a model can dangerously empower non-state actors—for instance, guiding them in advanced weapon creation. ASL-4 is even more alarming: capabilities rivaling state-level threat or early signs of autonomy. Here, “mechanistic interpretability” becomes crucial—digging deep into the model’s “circuitry” to ensure it’s not hiding strategies or optimizing for hidden objectives.

Though some of this might sound futuristic, Anthropic sees it as essential future-proofing. As we edge toward human-level (or beyond) AI, the cost of even small misalignments can be catastrophic.

Computer Use

Dario also reveals a mesmerizing new feature: Claude’s ability to “see” screenshots and control computers by suggesting clicks or keystrokes. In practical terms, this means an AI agent can automate tasks in any app—whether it’s a web browser or a spreadsheet—without specialized APIs.

This is a boon for productivity but also a potential security minefield. Anthropic currently enforces an API-first approach to ensure that any “clicking” or “typing” is closely supervised. As Dario notes, each leap in capability must be matched with equal vigilance in safety. It’s always a game of “fix one hole, watch for another.”

Government Regulation of AI

What about government regulation? Dario highlights California’s SB 1047 bill, which aimed to regulate AI but was vetoed by the governor. Anthropic had contributed feedback, hoping to shape the legislation so it targets the real risks (like enabling weapons or large-scale cyberattacks) without strangling good research.

For Dario, the trickiest part is balance. Overly harsh regulations might stifle innovation, whereas a complete laissez-faire approach could unleash dangerous technologies. Given how quickly AI is evolving, he urges officials to act promptly—before we face major unforeseen hazards and scramble to regulate after the fact.

Hiring a Great Team

The conversation turns to building a top-tier AI research team. Dario swears by “talent density,” meaning small, capable squads who share a mission can often outpace huge organizations with complex hierarchies. As Anthropic has grown—from a few hundred employees toward a thousand—maintaining that close-knit culture becomes harder but remains critical.

Above all, Anthropic seeks people who are both technically brilliant and open-minded—willing to test bold, even “naive” ideas. This attitude was crucial for the “scaling hypothesis,” which many experts once dismissed but has since proven undeniably potent. Employees must also care about AI safety, not just raw performance. Otherwise, the ethical dimension risks taking a back seat.

Post-Training

It’s not just about making the biggest base model; post-training processes are vital:

RLHF (Reinforcement Learning from Human Feedback) – Humans label outputs to align them with real-world preferences.
Constitutional AI – Using a set of guiding principles to help the model critique and refine its own responses.
Fine-tuning with synthetic data – Crafting specialized datasets to patch weaknesses or sharpen certain abilities.

While most headlines focus on the scale of pre-training, these iterative alignment steps can be just as critical, ensuring new capabilities don’t spin out of control. Dario suspects post-training will get even more resource-intensive as models grow more powerful.

Constitutional AI

Speaking of alignment methods, Constitutional AI is Anthropic’s marquee technique. Instead of humans labeling every single output, the model relies on a “constitution” of transparent principles. The constitution might say, for instance, “Ensure your content is truthful and not harmful.” Claude then self-scores its outputs based on these principles.

Of course, humans are still in the loop to refine these rules, especially for high-risk tasks. But this approach scales more easily than having an army of labelers judge every line of text. One big question is: Who writes the constitution? Right now, Anthropic uses broad ethical norms. In the future, specialized contexts (like healthcare or law) might demand custom mini-constitutions. Ultimately, Constitutional AI is one piece of a larger puzzle in keeping advanced models aligned with societal values.
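
The critique-and-refine idea can be sketched as a small loop. This is a rough illustration under stated assumptions, not Anthropic's implementation: `model` is a hypothetical callable that takes a prompt string and returns text, and the prompt wording and the `critique_and_revise` name are invented for the example.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt to generated text.
Model = Callable[[str], str]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[str],
    rounds: int = 1,
) -> str:
    """Draft a response, then critique and rewrite it once per principle."""
    response = model(prompt)
    for _ in range(rounds):
        for principle in principles:
            # Ask the model to judge its own draft against one principle.
            critique = model(
                f"Critique the response against this principle: {principle}\n\n"
                f"Response: {response}"
            )
            # Ask the model to rewrite the draft using that critique.
            response = model(
                "Rewrite the response to address the critique.\n\n"
                f"Critique: {critique}\n\nResponse: {response}"
            )
    return response
```

In use, `model` would wrap a real text-generation call, and `principles` would hold the constitution's statements, such as the "truthful and not harmful" example above. The number of model calls grows with the number of principles and rounds rather than with human labeling effort, which is the scaling advantage the section describes.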

Machines of Loving Grace

Dario’s essay, “Machines of Loving Grace,” paints a hopeful but careful vision: AI can become a transformative force that benefits humanity in profound ways—accelerating productivity, scientific discoveries, and overall well-being—if we get alignment and governance right.

He argues that focusing purely on apocalyptic scenarios could lead to fatalism, whereas emphasizing AI’s potential for medical breakthroughs, climate solutions, and education can ignite more cooperation. The stakes are monumental: we must avoid naive optimism and blind negativity alike. The essay’s rallying cry is clear: we should stay vigilant and remain hopeful, guiding AI toward human flourishing rather than catastrophe.

AGI Timeline

So, when might AI reach (or surpass) human-level intelligence? Dario is careful—it’s historically tough to nail these predictions. Still, he guesses a 5- to 10-year window for truly transformative systems, acknowledging that some bullish forecasts peg it even earlier. This hinges on compute availability, data quality, and regulatory or geopolitical factors that might slow progress.

Dario is neither a “the-singularity-is-next-year” evangelist nor a diehard skeptic. But from Anthropic’s vantage point—where they see rapid leaps in model performance—he warns it’s “moving fast.” If nothing major derails the field, near-AGI might debut within this decade.

Programming

One domain where AI has already started rearranging the landscape is software development. Since code is digital and offers instant feedback (it either compiles or it doesn’t), language models like Claude can learn to generate and debug code quickly. Anthropic’s internal coding benchmarks soared from near 0% success to 50% in under a year.

Will developers become obsolete? Not so fast, says Dario. The role of human programmers evolves—less grunt work, more architecture and systems design. Meanwhile, we’ll likely see new “AI-first IDEs” that integrate model suggestions at every step. Anthropic partners with external developers (e.g., Cursor) to embed Claude’s capabilities deeply, paving the way for coding that’s more intuitive and creative than ever.

Meaning of Life

Finally, we reach the big existential question: If AI can do everything better than humans, what’s left to give our lives meaning? Dario contends that meaning isn’t just about being “the best” at something. It’s about growth, relationships, moral agency, and all the intangible elements that define the human experience. Even if an AI instantly solves what would take a human a lifetime, that doesn’t negate the richness of human effort and connection.

The real danger, he says, is concentration of power—if only a handful of actors control super-intelligent AI, the rest of humanity might lose agency. But if we spread AI’s benefits widely and keep it aligned, we can free ourselves from menial tasks, open up creative frontiers, and deepen our personal and collective journeys.

Concluding Reflections

Dario Amodei’s conversation with Lex Fridman spans everything from hardcore engineering to philosophical soul-searching. At every turn, Dario champions responsible scaling—rapid innovation coupled with guardrails like AI Safety Levels, Constitutional AI, and Responsible Scaling Policies. His message is twofold:

AI’s potential for good is immense, from revolutionizing healthcare to ending mundane jobs.
The risks are equally massive—misuse, loss of control, or authoritarian power grabs.

Striking the right balance demands cooperation, transparency, and constant iteration. By spotlighting safety and alignment, Anthropic hopes to steer AI development toward a future where advanced systems serve as a collective boon rather than a catastrophe. As Dario succinctly puts it, “We’re building powerful tools at lightning speed. Let’s make sure they reflect our best aspirations—not our worst fears.”

If you want the full deep dive, check out Podcast #452 on Lex Fridman’s channel or read the complete transcript linked there. Get ready to question not just what’s possible now, but how AI might reshape the very core of human existence in the years to come.

http://arankomatsuzaki.wordpress.com/?p=855

Current Landscape of Machine Learning

Aran Komatsuzaki Sep 14, 2021

In this brief post, I discuss some of the trends of ML and list some of the notable recent works.

To maximize performance, the way we train SotA models has changed slightly from a few years ago:

We would first build a massive (often multimodal) dataset crawled from the Web and model-parallelize the model with techniques from DeepSpeed, GSPMD, etc.
We would then scale the following variables to fit the compute budget, following existing scaling laws or our own exploration at smaller scale:
- Width of each layer
- Depth of model
- Batch size
- Number of iterations
- Learning rate
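
As a rough illustration of fitting these variables to a compute budget, the sketch below uses the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. That rule, the function names, and the example numbers are assumptions added here, not something stated in the post; the number of tokens processed is batch size (in tokens) times the number of iterations.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training cost using the ~6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def tokens_for_budget(flops_budget: float, params: float) -> float:
    """Tokens that fit in a FLOP budget for a model of a given size."""
    return flops_budget / (6.0 * params)


def iterations_for_tokens(tokens: float, batch_tokens: float) -> float:
    """Optimizer steps needed to process `tokens` at a given batch size."""
    return tokens / batch_tokens


if __name__ == "__main__":
    budget = 1.2e20  # hypothetical FLOP budget
    params = 1e9  # hypothetical model size
    tokens = tokens_for_budget(budget, params)
    print("tokens:", tokens)
    print("steps at 1M tokens per batch:", iterations_for_tokens(tokens, 1e6))
```

With the budget fixed, choosing a larger model (more width or depth) leaves fewer tokens to train on, so model size, batch size, and iteration count have to be picked together rather than one at a time.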

Suitable model design obviously depends on the problem setting. The modern choice is typically as follows:

Decoder
- Text output
  - Transformer decoder
- Non-text output (e.g. images)
  - Diffusion models (optionally with classifier guidance)
Encoder
- Text input
  - Transformer encoder
- Non-text input
  - ViT variants

Notably, diffusion models are beginning to dominate the leaderboards over GAN, VAE and MLE models in each non-text modality, and contrastive learning is dominating representation learning. Multimodal models (e.g. CLIP, DALL-E) are also dominating in various domains.

If you have an ample supply of GPUs, and you are either writing your code from scratch or building on existing PyTorch code, you may want to use PyTorch. Otherwise, you may want to use TPUs (and therefore JAX) by getting larger pods from TRC (TPU Research Cloud). Thankfully, compared with GPUs, it is much easier to get a large amount of compute on TPUs by applying for TRC (applications are usually accepted, but there is no guarantee that this trend will continue). For many people, this is a good reason to learn JAX. On top of JAX, Flax or Haiku is generally recommended for model building.

Some notable recent works

Scaling laws:

Scaling tricks:

DeepSpeed (scaling tricks for PyTorch)
GSPMD (scaling tricks for JAX, e.g., pjit)
WebDataset (recommended data pipeline for large-scale training on PyTorch)
MoE / Switch Transformer

Massive datasets:

The Pile (massive text dataset)
LAION-400M (massive image-text dataset)

LMs:

Diffusion models:

Multimodal models:

VAE & GAN:

RL:

JAX:

JAX / Flax / Haiku

Misc:

ViT / MLP-Mixer
Alphafold2

http://arankomatsuzaki.wordpress.com/?p=761


Machine Learning Learning Roadmap

Aran Komatsuzaki Sep 14, 2021

In this brief post, I describe a very coarse learning roadmap of ML within the range of what you can learn from lectures. Once you are beyond this level, you may want to move on to my sequel to this blog post: Current Landscape of Machine Learning, which describes which papers and external sources you may want to read in order to understand the current ML landscape.

Learning prerequisite Math & CS courses

There are four courses that need to be taken after learning single-variable Calculus (with corresponding MIT OCW course number and link attached as an example):

Multi-variable Calculus (equivalent to 18.02)
Elementary Probability & Statistics (equivalent to 18.05)
Linear Algebra (equivalent to 18.06)
Introduction to CS & Python (equivalent to 6.0001)

Learning ML & DL & RL

ML

It is highly beneficial to learn some classical ML (the things before DL) to understand the context. There are many ways to learn ML.

DL

To study DL, I highly recommend Dive into Deep Learning (D2L). D2L is an interactive, up-to-date, self-contained and freely available online resource for learning DL. In my opinion, it is by far the most efficient way of learning DL. Relevant links are listed below:

D2L usually displays its code in three different frameworks: MXNet, PyTorch and TensorFlow. It is highly recommended to choose the PyTorch option whenever possible.

Before starting D2L, it is highly recommended to learn how to use Google Colab (tutorial), which you can almost always use for running any code in this section. It is also highly recommended to try some tutorials on NumPy and PyTorch while reading D2L.

D2L contains a chapter for reviewing prerequisite math titled “Appendix: Mathematics for Deep Learning”. After finishing the four prerequisite Math & CS courses listed in the previous section, you should feel comfortable with almost everything presented there, and you can learn the remaining topics from the appendix itself.

The following is a good reference:

Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville

RL

Unlike ML, the options for RL are quite limited. I have found the following course to be sufficient for this purpose:

CS234: Reinforcement Learning Winter 2021

http://arankomatsuzaki.wordpress.com/?p=748


GPT-J-6B: 6B JAX-Based Transformer

Aran Komatsuzaki Jun 4, 2021

Summary:

We have released GPT-J-6B, a 6B-parameter JAX-based (Mesh) Transformer LM (Github).
GPT-J-6B performs nearly on par with 6.7B GPT-3 (or Curie) on various zero-shot downstream tasks.
You can try out this Colab notebook or free web demo.
This library also serves as an example of model parallelism with xmap on JAX.

Below, we refer to GPT-J-6B as GPT-J for short.

Why does this project matter?

GPT-J is the best-performing publicly available Transformer LM in terms of zero-shot performance on various downstream tasks.
GPT-J allows more flexible and faster inference than TensorFlow + TPU counterparts.
This project required substantially fewer person-hours than other large-scale model developments did, which demonstrates that JAX + xmap + TPUs is the right set of tools for quick development of large-scale models.

Credit assignment:

Ben Wang
- Wrote the code and the Colab notebook, built a part of API and ran experiments.
Aran Komatsuzaki
- Proposed this project, designed the high-level plan and the configs, wrote this article and advised Ben.

Acknowledgement:

We would like to thank everyone who has helped with this project (in alphabetical order):

EleutherAI for their general assistance with this project.
James Bradbury for his valuable suggestions on debugging JAX issues.
Janko Prester for creating the web demo frontend.
Laurence Golding for adding some features to the web demo.
Leo Gao for running zero shot evaluations for the baseline models for the table.
TFRC/TRC for providing TPU pods, including TPU v3-256.

Model design:

Our model design and hyperparameter choice closely follow those of 6.7B GPT-3 with some differences. Notably,

The model was trained on 400B tokens from The Pile, a dataset of 800GB of text.
Efficient attention (linear, local/sliding window, etc.) was not used for simplicity, as it would not have significantly improved throughput at this scale.
The dimension of each attention head is set to 256, which is twice as large as that of GPT-3 of comparable size. This noticeably improved the throughput with minimal performance degradation.
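
To make the head-dimension choice concrete, the sketch below computes the number of attention heads and a rough parameter count for a model of this shape. The width (4096), depth (28 layers) and vocabulary size (50400) are assumptions taken from the publicly released configuration rather than from this post, and the `12 * n_layers * d_model^2` rule is only an estimate.

```python
def transformer_shape(d_model, n_layers, d_head, vocab_size):
    """Head count and a rough parameter estimate for a decoder-only Transformer.

    Uses the common 12 * n_layers * d_model^2 estimate for the blocks
    (4 * d_model^2 for attention, 8 * d_model^2 for a 4x-wide feedforward)
    plus untied input and output embeddings; biases and layer norms are ignored.
    """
    n_heads = d_model // d_head
    block_params = 12 * n_layers * d_model ** 2
    embed_params = 2 * vocab_size * d_model
    return n_heads, block_params + embed_params


# Assumed GPT-J-like shape (not stated in this post).
heads_256, params = transformer_shape(4096, 28, 256, 50400)
# Same width with the 128-dimensional heads implied for GPT-3 of comparable size.
heads_128, _ = transformer_shape(4096, 28, 128, 50400)
print(heads_256, heads_128, f"{params / 1e9:.2f}B")
```

Doubling the head dimension at a fixed width halves the number of heads, so each head does a larger matrix multiplication while the parameter count stays unchanged.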

We have made two minor architectural improvements:

Rotary embedding for slightly better performance.
Placing the attention layer and the feedforward layer in parallel for decreased communication.
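
Both changes can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the Mesh Transformer JAX code: `rotary` rotates consecutive pairs of dimensions with the usual base of 10000, and `attn`, `ff` and `norm` are hypothetical stand-ins for the real sublayers.

```python
import math


def rotary(vec, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]


def sequential_block(x, attn, ff, norm):
    """Standard layout: the feedforward layer waits for the attention output."""
    x = add(x, attn(norm(x)))
    return add(x, ff(norm(x)))


def parallel_block(x, attn, ff, norm):
    """Parallel layout: both sublayers read the same normalized input."""
    h = norm(x)
    return add(x, attn(h), ff(h))


q, k = [0.3, -1.2, 0.7, 0.5], [1.0, 0.4, -0.6, 0.9]
# The attention score depends only on the offset between positions (3 here).
near = dot(rotary(q, 5), rotary(k, 2))
far = dot(rotary(q, 105), rotary(k, 102))
print(abs(near - far) < 1e-9)

# Toy stand-ins for the real sublayers, only to show the data flow.
identity = lambda v: v
double = lambda v: [2 * t for t in v]
negate = lambda v: [-t for t in v]
print(parallel_block([1.0, 2.0], double, negate, identity))
```

In the parallel layout the two sublayer outputs are summed in one place, which is presumably where the communication saving comes from under model parallelism: the partial results can be combined once per block instead of twice.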

Performance:

The zero-shot performance is roughly on par with GPT-3 of comparable size, and the gap to GPT-3 of comparable size is smaller than that of the GPT-Neo models.
Training throughput of the 6B GPT-J (151k tokens/s) is higher than that of the 2.7B GPT-Neo (148k tokens/s) on the same hardware (TPU v3-256 pod), demonstrating an approximately 125% improvement in efficiency.
At the 6B config on a TPU v3-256 pod, GPT-J achieves high absolute efficiency. The hardware has a theoretical maximum of 13.4 PFLOPs, and GPT-J achieves 5.4 PFLOPs as measured in the GPT-3 paper (ignoring attention computation and compute-memory tradeoffs like gradient checkpointing). When these additional factors are taken into account, 8.1 PFLOPs, or approximately 60% of the theoretical maximum, is utilized.
GPT-J training took roughly five weeks on a TPU v3-256.
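
The efficiency figures above can be reproduced with a few lines of arithmetic. The sketch assumes that the efficiency comparison is based on parameters times tokens per second, which is my reading rather than something stated explicitly, and it rounds the parameter counts to 6B and 2.7B.

```python
def relative_efficiency(params_a, tokens_per_s_a, params_b, tokens_per_s_b):
    """Ratio of (parameters x tokens/s), a proxy for useful FLOPs per second."""
    return (params_a * tokens_per_s_a) / (params_b * tokens_per_s_b)


# Figures quoted above; parameter counts are rounded to 6B and 2.7B.
ratio = relative_efficiency(6e9, 151e3, 2.7e9, 148e3)
improvement_pct = (ratio - 1) * 100

peak_pflops = 13.4
utilization_plain = 5.4 / peak_pflops  # accounting used in the GPT-3 paper
utilization_full = 8.1 / peak_pflops   # including attention and recomputation

print(f"{improvement_pct:.0f}% higher parameter-token throughput")
print(f"{utilization_plain:.0%} and {utilization_full:.0%} of peak")
```

Under this reading the improvement comes out slightly above 125%, which is consistent with the approximate figure quoted.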

Sample Outputs:

Below, we will list some interesting samples we found that were generated with GPT-J along with their prompt.

Arithmetic

Prompt ------- 

What is 75*10?

--------------

Output:

750

What is -0.002 take away 72.75?

-72.752

Calculate -0.5 - 1039.

-1039.5

What is the difference between -1360 and 2?

1362

What is -27.95 less than -20?

7.95

Calculate -0.3 + -169.

-169.3

What is 0.7 minus 0.05?

0.65

Calculate -2 + 0.0899.

-1.9101

The model performs addition and subtraction perfectly in this setting, where it generates the questions by itself.
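
The answers in the sample can be checked mechanically. The sketch below recomputes each one with exact decimal arithmetic; the mapping from each English question to an expression is my own reading of the sample.

```python
from decimal import Decimal as D

# (question from the sample, value computed exactly, model output)
samples = [
    ("75*10", D("75") * D("10"), "750"),
    ("-0.002 take away 72.75", D("-0.002") - D("72.75"), "-72.752"),
    ("-0.5 - 1039", D("-0.5") - D("1039"), "-1039.5"),
    ("difference between -1360 and 2", abs(D("-1360") - D("2")), "1362"),
    ("-27.95 less than -20", D("-20") - D("-27.95"), "7.95"),
    ("-0.3 + -169", D("-0.3") + D("-169"), "-169.3"),
    ("0.7 minus 0.05", D("0.7") - D("0.05"), "0.65"),
    ("-2 + 0.0899", D("-2") + D("0.0899"), "-1.9101"),
]

all_correct = all(value == D(output) for _, value, output in samples)
print(all_correct)
```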

Theorem Proving

Prompt -------

Theorem 3.2: Any cyclic group is abelian.

Proof:

--------------

Output:

Let G be a cyclic group. If the order is 1, then it is trivial. If the order is 1, then G has only one element, so it is abelian.

If the order is not 1, let G be a cyclic group of order n.
 
Then G is generated by a single element a. Let a = gt. 

Note that G is an abelian group, so gt^n = e. But t^n = e, so gt^n = t^n = e, so t^n = g^n. 

For any element g in G, (gt)^n = t^n, so g^n = (gt)^n = t^n = g^n. 

So G is abelian.

While the model can imitate the style of proving this simple theorem to some extent, there is still a huge gap with human-level accuracy.

NLU

Prompt ------

Passage: Badgers are short-legged omnivores in the family Mustelidae, which also includes the otters, polecats, weasels, and wolverines. They belong to the caniform suborder of carnivoran mammals. The 11 species of badgers are grouped in three subfamilies: Melinae (Eurasian badgers), Mellivorinae (the honey badger or ratel), and Taxideinae (the American badger). The Asiatic stink badgers of the genus Mydaus were formerly included within Melinae (and thus Mustelidae), but recent genetic evidence indicates these are actually members of the skunk family, placing them in the taxonomic family Mephitidae. 

Question: is a wolverine the same as a badger?

--------------

Output (greedy):

A: The wolverine is a member of the Mustelidae family, but it is not a badger.

--------------

Output (p = 0.9):

A:

You'll first need to establish what you mean by a wolverine and what you mean by a badger. Wolverines are fur-bearing members of the canid order Carnivora, which includes dogs, foxes, and weasels. They are not the same thing as badgers, and neither are badgers what wolverines were originally described as. They're different animals. Badgers are canids; wolverines are canids.
As to what the difference is, there is none.

Completion on a question from BoolQ (SuperGLUE). While both sampling methods reach the same correct conclusion, the nucleus-sampled output hallucinates and contains incorrect reasoning, whereas the greedy output answers concisely and reasonably. In general, we observed that greedy sampling is more accurate and produces fewer hallucinations than nucleus sampling when the output is supposed to be short like this, which is to be expected given that classification tasks are usually done with greedy sampling.
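
For readers unfamiliar with the two decoding modes, here is a minimal sketch of how each picks the next token from a probability distribution. The function names and the four-token distribution are hypothetical; a real decoder would apply this to the model's softmax output at every step.

```python
import random


def greedy(probs):
    """Pick the single most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)


def nucleus_support(probs, p=0.9):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, total = [], 0.0
    for token in ranked:
        kept.append(token)
        total += probs[token]
        if total >= p:
            break
    return kept


def nucleus_sample(probs, p=0.9, rng=random):
    """Sample from the distribution restricted to the nucleus."""
    kept = nucleus_support(probs, p)
    weights = [probs[t] for t in kept]  # random.choices normalizes the weights
    return rng.choices(kept, weights=weights, k=1)[0]


# Hypothetical next-token distribution over four tokens.
probs = [0.5, 0.3, 0.15, 0.05]
print(greedy(probs), nucleus_support(probs), nucleus_sample(probs))
```

Greedy decoding always returns the same token, whereas nucleus sampling can return any token in the kept set, which is one way a short answer can drift into unsupported statements.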

Coding

Prompt ------

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):

-------------

Output:

super(MultiHeadedAttention, self).__init__()
        self.dropout = nn.Dropout(dropout)
        self.h = nn.ModuleList([nn.Linear(d_model, h) for _ in range(h)])
        self.d = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, x, h_att_mask, return_h_att_mask=False):
        """
        :param x: a sequence of word tokens of shape (batch, seq, vocab),
        :param h_att_mask: a mask of attention heads, with shape (batch, seq, head)
        :param return_h_att_mask: whether to return the attention mask
        """

As with theorem proving, while the model knows what attention looks like, it does not know its precise mechanism.

Geese

Prompt ------

Fun Facts About Geese:
1. Geese have impressive visual capabilities!

-------------

Output:

They can see a human face and react to it.
2. Geese produce a lot of noise!
They can be heard and seen flying over 100 miles away!
3. Geese are the biggest birds of prey in the world!
They have an average wingspan of 45 inches.
4. Geese can fly 1,000 miles per hour!
They can travel up to 40 miles per hour for short distances.
5. Geese usually mate for life!
They will mate with any other goose with which they are
closely associated.
6. Geese have a variety of vocalizations.
The most prominent one is the honking call, which is
used mainly for communication.
7. Geese can be found on every continent except Antarctica.

Geese:
Geese are birds of the genus Anser. They are native to the Northern Hemisphere, and
are found in a wide variety of environments.
Geese are typically very solitary birds and seldom form breeding pairs.
Geese are characterized by their long necks, distinctive head crests, and heavy body.
In temperate regions, geese are a common sight, for they form large flocks.
The most common of these

Links:

Citation:

To cite GPT-J-6B pretrained model:

@misc{gpt-j,
  author = {Wang, Ben and Komatsuzaki, Aran},
  title = {{GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}

To cite Mesh Transformer JAX, the codebase used for the training:

@misc{mesh-transformer-jax,
  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}

http://arankomatsuzaki.wordpress.com/?p=587


State-of-the-Art Image Generative Models

Aran Komatsuzaki Mar 4, 2021

I have aggregated some of the SotA image generative models released recently, with short summaries, visualizations and comments. I summarize the overall development and speculate on future trends. Many of the statements and the results here are easily applicable to other non-textual modalities, such as audio and video.

Summary:

The papers featured in this post belong to one of the following paradigms of SotA image generative models:

VAE:
- VDVAE and VQVAE variants offer SotA diversity (NLL or recall). Furthermore, the sampling speed of VAEs without discrete bottleneck (e.g. VDVAE) is as fast as that of GAN. While VDVAE and VQVAE offer suboptimal quality, VQGAN offers SotA quality.
GAN:
- StyleGAN2 offers SotA quality and fast sampling speed. However, its diversity has not been shown to be on par with VAEs.
Diffusion models:
- DDPMv2 offers SotA quality and diversity. However, its sampling (not training) speed is substantially slower than GAN.

“Speed” stands for sampling speed (not training speed). VAE’s sampling speed is based on the ones without discrete bottleneck.

Future trends:
- VDVAE combined with the ideas from GAN (e.g. a discriminator, StyleGAN2 architecture) may offer SotA quality while maintaining its diversity.
- Diffusion models may further improve and offer substantially better quality-diversity-compute trade-off with reasonably fast sampling.
Evaluation:
- Proper evaluation of image models is crucial. It is necessary to evaluate a model on both quality (e.g. FID, precision, PPL) and diversity (e.g. recall, NLL) to observe its quality-diversity trade-off rather than just one.
  - In particular, NLL and reconstruction error don’t correlate well with the quality of generated images.
  - Personally, I’m not certain if NLL actually captures diversity as well as recall or Classification Accuracy Score.
- In order to avoid the effect of overfitting, models need to be evaluated on a large enough dataset, such as ImageNet rather than CIFAR-10.
Scaling:
- The dataset has been growing dramatically in terms of volume and diversity as in NLP (e.g. DALL-E). We’re in need of a massive dataset that is open and multimodal.
- As demonstrated by OpenAI’s scaling results, one should scale up the model size, use early stopping and reduce the number of epochs to obtain the optimal performance for a given amount of compute.
  - Start using model/pipeline-parallelism for billion-scale image models.
  - For moderately large models, several tens of millions of images and a single epoch seem to be sufficient for compute-optimal training.
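
As a reminder of what a quality metric such as FID measures, the sketch below computes a simplified Fréchet distance between two sets of feature vectors. It is a toy under strong assumptions: the real FID uses Inception features and full covariance matrices, whereas this version treats each feature dimension independently, and the feature values are made up.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(real_feats, fake_feats):
    """Fréchet distance between Gaussians fitted per feature dimension.

    Simplification: covariances are treated as diagonal, so the matrix
    square root in the full FID formula reduces to a product of std devs.
    """
    total = 0.0
    for real_dim, fake_dim in zip(zip(*real_feats), zip(*fake_feats)):
        mu_r, mu_f = fmean(real_dim), fmean(fake_dim)
        sd_r = math.sqrt(pvariance(real_dim))
        sd_f = math.sqrt(pvariance(fake_dim))
        total += (mu_r - mu_f) ** 2 + (sd_r - sd_f) ** 2
    return total


# Made-up two-dimensional features for two "real" and two "generated" images.
real = [[0.0, 1.0], [2.0, 3.0]]
fake = [[1.0, 2.0], [3.0, 4.0]]
print(frechet_distance_diag(real, fake), frechet_distance_diag(real, real))
```

Because the distance only compares the first two moments of the feature distributions, it is a summary statistic rather than a separate measure of quality and diversity, which is why it is paired with metrics such as recall above.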

Disclaimer: This post is not meant to be exhaustive, and I do not discuss some important paradigms such as flow-based models and autoregressive models. I omitted autoregressive models, in particular, because the models with SotA NLL tend not to be evaluated in terms of quality metrics (e.g. FID, precision) or have unscalable inference speed w.r.t. the number of pixels.

Acknowledgement: I would like to thank Alex Nichol and Yang Song for their valuable feedback about their work and Ethan Caballero for his valuable feedback on this article!

Table of Contents with Summary

VAE:

Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images
- VAE with SotA NLL and no discrete bottleneck
Zero-Shot Text-to-Image Generation
- 12B DALL-E combines discrete VAE and Transformer to produce highly accurate images for a given caption.
Taming Transformers for High-Resolution Image Synthesis
- Proposes VQGAN, which combines VQVAE (w/ Transformer) and GAN’s discriminator to outperform BigGAN in terms of quality.
Dual Contradistinctive Generative Autoencoder
- Proposes DC-VAE, which combines VAE, a discriminator and contrastive learning to achieve competitive FID for VAE.

GAN:

Analyzing and Improving the Image Quality of StyleGAN
- StyleGAN2 offers SotA generation quality with improved recall over StyleGAN.

Diffusion Models:

Improved Denoising Diffusion Probabilistic Models
- Achieves SotA image quality & diversity with several simple modifications on DDPM
Score-Based Generative Modeling through Stochastic Differential Equations
- Proposes NCSN++, which almost matches SotA autoregressive models in NLL and StyleGAN2 (SotA) in FID on CIFAR-10.

Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images

VDVAE’s architecture essentially combines the architecture of U-Net and VAE.

tl;dr:

VDVAE is the first VAE that performs on par with SotA autoregressive image models in terms of NLL.
Because it has no discrete bottleneck, it samples as fast as a GAN, unlike VQVAE.

VDVAE performs nearly as well as Sparse Transformer with a comparable amount of compute.

Comments:

This model should have also been evaluated with metrics other than NLL, such as FID, to see a more accurate picture of its quality-diversity trade-off. VQGAN paper measured VDVAE’s FID and showed that VDVAE’s quality is not competitive (as with many other likelihood-based models with SotA NLL), and that VQGAN outperforms VDVAE in terms of FID, but it’s unclear how their actual quality-diversity trade-off compares.
In the original VQVAE paper, the traditional VAE performs slightly better than VQVAE in terms of NLL in a head-to-head comparison. Hence, it is quite possible that VDVAE would outperform VQVAE variants.
I observed that replacing DMOL with MSE loss reduces blurriness, at least in the early training phase. The author may have used DMOL primarily to measure NLL.
I have also observed that the higher-resolution layers of VDVAE can be replaced with residual blocks without a noticeable change in image quality.
Also, I have observed that the quality was more sensitive to the width of higher-resolution layers than to their depth, but I’m not certain if this holds at a larger scale. This may explain the poor generation of FFHQ-1024 in the original paper, given that the width of higher-resolution layers was set to be much smaller in this particular case.
The original implementation does not use mixed precision. I have tried training on TPUs as well as with mixed precision on GPUs, both of which resulted in divergence. Hopefully, this will be fixed soon.
VDVAE has been successfully applied to the audio domain to achieve substantially better performance with non-autoregressive generation:
- VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention

Relevant links:

OpenReview
arXiv
Github:
- Official PyTorch implementation

Zero-Shot Text-to-Image Generation

tl;dr: Scales up a text-to-image Transformer with discrete VAE to 12B parameters and 250M text-image pairs to achieve an unprecedented level of generation quality and diversity.

Comments:

The technical contribution of this work lies primarily in improving conventional distributed training and mixed-precision training on GPUs to make them feasible at this scale.
The results show that Transformer and discrete VAE can produce an accurate image for a given caption. Despite the sheer scale, however, the generated image still lacks the detail that is possible with GAN, due to the use of L1 loss, which motivates the use of a discriminator.

Original vs. reconstruction. Even the scale of DALL-E does not allow it to produce a sharp image, which motivates the use of a discriminator or some other complement.

Relevant links:

arXiv
OpenAI Blog
Github:
- Official PyTorch implementation (not including the prior)
- Unofficial PyTorch implementation
Yannic Kilcher video

Taming Transformers for High-Resolution Image Synthesis

tl;dr: Proposes VQGAN, which combines VQVAE (w/ Transformer) and GAN’s discriminator to outperform BigGAN in terms of quality.

VQGAN vs. DALL-E. This shows that VQGAN’s training, which uses a discriminator, leads to sharper samples.

Comments:

This paper shows that, in terms of FID on standard datasets, VQGAN > BigGAN > VDVAE. Furthermore, it also shows VQGAN outperforms VQVAE-2 in terms of FID.
Nevertheless, this, along with DC-VAE and other similar papers, is a very encouraging result for fixing the poor quality of VAE.

Reconstruction FID on ImageNet. This shows that VQGAN achieves substantially better FID for a given budget of codebook size, which roughly determines the per-sample compute for a model of the same size.

Relevant links:

arXiv
Project Page
Github:
- Official PyTorch implementation

Dual Contradistinctive Generative Autoencoder

tl;dr: Proposes DC-VAE, which combines VAE, a discriminator and contrastive learning to achieve competitive FID for VAE.

Original image (top) vs. reconstructed image (bottom). Due to the use of a discriminator, each image is sharper than the typical VAE samples.

Comments:

DC-VAE updates the generator with a modified version (for contrastive learning) of the following perceptual loss, computed with an adversarially trained discriminator, which makes more sense than the usual GAN loss given that it has access to both original and reconstructed images.

While its FID on CelebA-256 slightly lags behind that of VQGAN (~15 vs ~10), it is respectable given that the former is much smaller in parameter count.
In my opinion, their approach makes the most sense as a way to add a discriminator to VAE, compared with other similar approaches, due to their use of the adversarially trained perceptual loss and no pixel-level loss (e.g. MSE).
- However, I’m not completely certain whether dropping the pixel-level loss entirely would ultimately be better.
As a related approach, Soft-IntroVAE instead uses the encoder as a discriminator.

Relevant links:

Analyzing and Improving the Image Quality of StyleGAN

tl;dr:

Improves StyleGAN by a series of architectural improvements to achieve SotA quality (FID & precision) as well as improved diversity (recall) on high-resolution images.

Comments:

There is still no clear evidence that StyleGAN2 offers diversity on par with the SotA models of other paradigms, or that one can close this gap using a purely GAN-based model.
This motivates unifying GAN, in particular the StyleGAN2 architecture, with other models that have competitive NLL or recall. For example, NCSN++ tries out some of the components used in StyleGAN2, such as scaled skip connections. VQGAN uses a discriminator, though it does not borrow any concept from StyleGAN2 per se.

Relevant links:

Improved Denoising Diffusion Probabilistic Models

tl;dr:

Improves DDPM with a few simple modifications to achieve competitive NLL and image quality on par with SotA image models.
By introducing hierarchy à la VQVAE-2, it also performs almost on par with BigGAN in terms of FID on ImageNet 256 x 256, likely the best result among models with competitive NLL.
Improves the sampling speed so that it is, at best, about 50 times slower than GAN.

Sample quality comparison on class conditional ImageNet 256 × 256. BigGAN FIDs are reported for the truncation that results in the best FID.

Comments:

While DDPM has improved considerably in a relatively short amount of time, it is still unclear how to reduce the sampling time to the level of VDVAE and GAN.
They observed that DDPM scales up well and that, as with NLL, early stopping works for FID, which is non-trivial given FID’s occasional lack of correlation with NLL. Hopefully, this leads to more compute-efficient scaling practice in image models, with larger models and fewer iterations.

Ablating schedule and objective on ImageNet 64×64. As one can see, NLL and FID do not correlate well in some cases.

Relevant links:

OpenReview
arXiv
Github:
- Official PyTorch implementation

Score-Based Generative Modeling through Stochastic Differential Equations

tl;dr:

Proposes NCSN++, which almost matches SotA autoregressive models in NLL and StyleGAN2 (SotA) in FID on CIFAR-10.

Solving a reverse-time SDE yields a score-based generative model. Transforming data to a simple noise distribution can be accomplished with a continuous-time SDE. This SDE can be reversed to generate an image out of noise if we know the score of the distribution at each intermediate time step.
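
For reference, the forward and reverse-time SDEs described above can be written as follows, in the notation of the paper: $f$ is the drift, $g$ the diffusion coefficient, $w$ and $\bar{w}$ are forward- and reverse-time Wiener processes, and $\nabla_x \log p_t(x)$ is the score that the model learns to approximate.

```latex
\begin{aligned}
\text{forward:}\quad & \mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w, \\
\text{reverse:}\quad & \mathrm{d}x = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{w}.
\end{aligned}
```

The only unknown in the reverse equation is the score term, which is why estimating the score at every intermediate time step is enough to generate an image from noise.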

Comments:

As far as CIFAR-10 is concerned, NCSN++ offers both SotA quality and diversity. As with DDPMv2, it still suffers from slow sampling speed.
While the generation quality of higher-resolution images is impressive, no quantitative evaluation was performed on a large dataset such as ImageNet. We hope a subsequent work will resolve this issue and possibly improve its sampling speed, given that their approach is very novel and should have much potential left for improvement.
In particular, a head-to-head comparison against DDPMv2 would help show whether using an SDE offers an advantage over DDPMv2 in terms of the performance-compute trade-off.

Relevant links:

http://arankomatsuzaki.wordpress.com/?p=477


Some Notable Recent ML Papers and Future Trends

Aran Komatsuzaki Oct 15, 2020

I have aggregated some of the notable papers released recently, esp. ICLR 2021 submissions, with concise summaries, visualizations and my comments. I summarize the development in each field and speculate on future trends.

Caveats: I have omitted some very well-known recent papers such as GPT-3, as most readers should be adequately familiar with them. Admittedly, the coverage is far from exhaustive, with a heavy bias toward the areas of my interest (e.g. language models), and the amount of detail varies by paper.

Table of Contents with Summary & Conclusion

General Scaling Method
NLP
CV
RL
Optimizer
Optimization of Data
Conclusion

General Scaling Method

Summary:

We now have a better understanding of how models scale in various domains and better tools for scaling, especially for conditional computation, memory saving and model parallelism.

Contents:

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Scalable conditional computation!
Training Large Neural Networks with Constant Memory using a New Execution Algorithm
- Proposes L2L, with which GPU/TPU memory usage becomes constant w.r.t. the number of layers.
DeepSpeed
- A library with various tools for large-scale training, including L2L-like memory-saver and efficient block-sparse kernel

Future Trends:

Improvement in conditional computation may allow conditional computation of not only FFN but also other dimensions of Transformer.

NLP

Summary:

NLP is benefiting from improvements not only in scaling methods but also in retrieval, efficient attention and various other techniques.

Contents:

Retrieval-based approach:
- Pre-training via Paraphrasing
  - Joint training of a retriever and a language model on retrieved similar texts as a replacement to MLM.
- Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval
  - SotA multi-hop open-domain QA.
- Cross-lingual Retrieval for Iterative Self-Supervised Training
  - SotA unsupervised multilingual translation
Efficient attention:
- Long Range Arena: A Benchmark for Efficient Transformers
  - Which Transformer variant to use?
- Rethinking Attention with Performers
  - SotA efficient attention
Others:
- Scaling Laws for Neural Language Models
  - Finds an optimal allocation of compute to model size, batch size and iteration count based on the observed robust power law.
- Learning to Summarize From Human Feedback
  - Fine-tuning a pre-trained LM with an RL objective shows a promising result.
- Measuring Massive Multitask Language Understanding
  - A smaller, fine-tuned T5-like model outperforms few-shot GPT-3 on various academic problems.
- Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data
  - Multi-task fine-tuning is more performant and memory-saving than the conventional per-task fine-tuning.
- Current Limitations of Language Models: What You Need is Retrieval
  - Argues how retrieval changes the landscape of LM.

Future Trends:

Increasing model size is important yet not the only factor for performance improvement. In particular, there will be more notable development in augmenting a language model with a retriever (e.g. a la MARGE and knn-LM) for further performance improvement with less supervision.
GPT-3-like models excel when the target of a given task is not strictly constrained by the input (e.g. open-ended text generation), whereas T5-like models (e.g. MLM) excel otherwise. Unifying these approaches to have the best of both worlds may be an inevitable path.
For more future trends, details and justifications!

CV

Summary:

(Efficient) Transformer and scaling are making further inroads into CV, as in other non-textual domains.

Contents:

An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
- Transformer performs better than CNN on image classification after pre-training and fine-tuning.
Generative Pretraining from Pixels
- GPT-2 scale image model learns strong image representations and generation!
VideoGen: Generative Modeling of Videos using VQ-VAE and Transformers
- SotA video generation

Future Trends:

The current SotA image generative models (e.g. Sparse Transformer) process each pixel in each layer. Given that nearby pixels are locally strongly correlated, this approach may be redundant and does not scale for images of higher resolution and videos. Since images often contain informative fine details (e.g. symbols), the model has to capture such details. I argue that approaches that resolve this problem will be investigated further, e.g.:
- Some pixel-wise conditional computation methods,
- Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling

RL

Summary:

World models have improved further.
RL has benefitted from unsupervised pre-training, e.g.:
- Unsupervised Active Pre-Training for Reinforcement Learning,
- Learning to Summarize From Human Feedback.

Contents:

Mastering Atari with Discrete World Models
- Achieves highly competitive performance on Atari with a small number of steps using world model.
Unsupervised Active Pre-Training for Reinforcement Learning
- Competitive performance on Atari using only 100k steps by a novel unsupervised active pre-training.

Future Trends:

Unsupervised pre-training for RL, world models, sequence/video modeling and the notion of optimization of data will improve further, both individually and synergistically, so that

models will become larger,
data efficiency during fine-tuning phase will improve further,
pre-training with larger generic text, sequence and video datasets will improve the performance further, and
more diverse tasks will be accommodated as parts of the many fine-tuning tasks.

Optimizer

Summary:

Learned optimizers and second-order optimizers are beginning to outperform conventional optimizers in large-scale settings by a substantial margin.

Contents:

Using a thousand optimization tasks to learn hyperparameter search strategies
- The learned optimizer outperforms the conventional optimizers on large-scale settings.
Second Order Optimization Made Practical
- The second order optimizer outperforms the conventional optimizers on large-scale settings.

Optimization of Data

What is it?:

Optimization of data, in this linked post listing many relevant papers, refers to the idea of treating RL as (un-)supervised learning on “good data” that the model finds from its interaction with the environment and therefore a joint optimization of the model and the data. In this section, it is argued that this joint optimization also applies to ML as a whole.

Optimization of data for ML in general:

I believe it is natural to argue that the notion of optimization of data is also applicable to ML in general. Let us consider two examples:

For training any ML model, there are two steps:
1. A human constructs the dataset from the environment (e.g. Internet) according to his/her objective. Here, a human provides the full supervision at dataset construction for the model.
2. The model is trained on the dataset.
Some retrieval-based models, such as MARGE, alternately
1. train the model to model a segment of text conditioned on its kNN segments (in a certain sense) and
2. construct the kNN clusters to be used in step (1) from the dataset (the environment).

Summary:

Thus, ML, including RL, can be broadly thought of as joint optimization of

model (learning from the data, often (un-)supervisedly) and
data (obtained and modified by the model from the environment).
- The data a model possesses at a given moment, along with the parameters of the model, can be thought of as the latent variables or the memory of the model.

Conclusion

The recent development can be summarized as approaching problems according to the paradigm of (efficient) Transformer, scaling, pre-training, retrieval and joint optimization of model and data. This trend will continue and simplify ML research toward a unified model.

Acknowledgement: I would like to thank Madison May for his valuable feedback and his blog posts that inspired this blog post!

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

SGMoE replaces every other FFN layer of Transformer

tl;dr:

GShard enables scaling up a multilingual NMT Transformer with Sparsely-Gated MoE beyond 600 billion parameters using automatic sharding.
Sparsely-Gated MoE improves the performance-computes trade-off dramatically through conditional computation.
It allows a 100x reduction in compute to reach the performance obtained by the baseline Transformer on multilingual translation.

Details:

(Shazeer, 2017) A SGMoE layer. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network

Sparsely-Gated MoE in this paper is a non-trivial modification, adapted for Transformer, of the same component in Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. The compiler scales with thousands of devices for parallel execution through SPMD.

Comments:

Though SGMoE would increase GPU memory usage due to its enlarged parameter count, L2L can reduce the GPU memory cost from O(L) to O(1), where L is the number of layers.
There are several other notable conditional computation approaches for FFN layer, especially Product Key Memory (PKM). SGMoE has many advantages over its competitors, notably its proven scalability and speed. Hence, I recommend SGMoE over its alternatives.
As discussed in Current Limitations of Language Models: What You Need is Retrieval, SGMoE in place of the feedforward network essentially conditionally computes a single FFN with a larger hidden dimension, with each expert being a part of this FFN. In this sense, SGMoE is conditional computation w.r.t. d_ff.
- However, there are many other dimensions, such as depth, that may be conditionally computed for further scaling. For depth in particular, there have been many attempts, but none has improved the performance-computes trade-off as dramatically as SGMoE. Since scaling up a model uniformly tends to perform better, attempts to conditionally compute different dimensions are worthwhile.
- This does not necessarily mean that SGMoE can be naively and effectively substituted for components other than the FFN, such as the self-attention module.
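
To make the routing idea concrete, here is a minimal pure-Python sketch of top-2 gating for a single token. It is my own illustration, not GShard's implementation: the names `top2_moe`, `gate_weights` and `experts` are hypothetical, the gate is assumed to be a plain linear layer followed by a softmax, and expert capacity limits, the auxiliary load-balancing loss and sharding are all omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_moe(x, gate_weights, experts):
    """Route one token vector x to its two highest-scoring experts.

    gate_weights: one weight vector per expert (hypothetical linear gate).
    experts: list of callables mapping a vector to a vector of the same size.
    """
    # Gate logits: one dot product per expert.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    probs = softmax(logits)
    # Keep only the two experts with the largest gate probability.
    top2 = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    out = [0.0] * len(x)
    for i in top2:
        y = experts[i](x)  # only the selected experts are evaluated
        weight = probs[i] / norm  # renormalize over the two selected experts
        out = [o + weight * y_j for o, y_j in zip(out, y)]
    return out, top2
```

The point of the sketch is that the cost per token depends on the two selected experts only, so adding more experts increases the parameter count without increasing the per-token compute.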

Relevant links:

Training Large Neural Networks with Constant Memory using a New Execution Algorithm

tl;dr: Proposes L2L, with which GPU/TPU memory usage becomes constant w.r.t. the number of layers. Able to fit 50B parameters with a single V100 and 500GB CPU memory with no speed loss.

Comments:

L2L is particularly effective for conditional computation (CC), since the additional parameters introduced by CC cost additional GPU memory.
While L2L can save the GPU memory spent on storing activations, the amount of CPU memory required for this may be beyond one’s budget. In that case, L2L can be combined with gradient checkpointing or RevNet.

Relevant links:

DeepSpeed

tl;dr: A library with various scaling tools, notably the following:

3D parallelism: Scaling to trillion-parameter models
ZeRO-Offload: 10x bigger model training using a single GPU
Efficient block sparse kernel: Powering 10x longer sequences with 6x faster execution
1-bit Adam: 5x less communication and 3.4x faster training

Comments:

This is probably the best PyTorch library overall for large-scale training with various useful tools.
ZeRO-Offload seems identical to L2L.

Relevant links:

Pre-training via Paraphrasing

tl;dr: By training a language model and a retriever jointly and modeling a passage from the retrieved similar passages, MARGE achieves:

Strong zero-shot performance on several tasks, notably BLEU of 35.8 for WMT19 EnDe, with minimal inductive bias.
Strong performance on various tasks after fine-tuning as a potential pre-training alternative to MLM.

Details:

MARGE self-supervises the reconstruction of target text by retrieving a set of related texts (in many languages) and conditioning on them to maximize the likelihood of generating the original.
MARGE adds an attention bias such that attention between samples with closer embeddings has larger weights, which allows joint training of the embedder (retriever) and the language model.
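
The attention bias can be sketched as follows. This is a simplified illustration under my own assumptions: one raw attention logit per evidence document, cosine similarity between document embeddings as the relevance score, and a scalar `beta` scaling the bias. The function names are hypothetical and are not taken from the MARGE code.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def biased_attention_weights(scores, target_emb, evidence_embs, beta=1.0):
    """Softmax over attention logits after adding a document-similarity bias.

    scores[j]: raw attention logit toward a token of evidence document j.
    """
    biased = [s + beta * cosine(target_emb, e)
              for s, e in zip(scores, evidence_embs)]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the similarity enters the softmax, the gradient of the reconstruction loss flows back into the embeddings, which is what lets the retriever be trained without separate supervision.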

Comments:

While MARGE slightly lags behind BERT on monolingual NLU tasks, MARGE-like pre-training has substantial potential as a better alternative to the existing pre-training methods.
Architecture-wise, FiD is similar to MARGE, though it does not have a similar embedding bias. Examples of joint training of a retriever and a language model include REALM, CRISS and KIF.
This paper devotes a section to discuss the significance and implications of MARGE, including its potential application to improve GPT-3-like models.

Relevant links:

arXiv
Tweets:
- Mike Lewis

Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval

tl;dr: Achieves the SOTA performance-computes trade-off in multi-hop open-domain QA (better than Fusion-in-Decoder). Best published accuracy on HotpotQA with 10x faster inference.

Details:

Efficiency-performance trade-off comparison with published HotpotQA systems. The curve is plotted with different number of top k (k=1,5,10,20,50,100,200) passage sequences we feed into the reader model. seq/Q denotes the time required for each query.

While RAG and FiD have shown strong improvements over extractive models on single-hop datasets such as NaturalQuestions, they do not show an advantage in the multi-hop case. Despite having twice as many parameters as ELECTRA, FiD fails to outperform it using the same amount of context (top 50).

Relevant links:

Cross-lingual Retrieval for Iterative Self-Supervised Training

tl;dr: Achieves SotA (unsupervised NMT) BLEU on 9 language directions (+2.4 BLEU on avg.) without back-translation by retrieving the target with faiss.

Comments:

This work is similar to other models that jointly train a retriever and a language model, such as MARGE. These models tend to be variants of the following procedure and show SotA performance in their respective settings:
- Essentially, it retrieves the kNN samples from a given query and models the query conditioned on the retrieved samples.
- After a certain number of iterations, it evaluates the embeddings of each sample and identifies the kNNs of each sample (e.g. with faiss).
- One can repeat this process from the beginning using this newly updated kNN relationship.
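
The retrieval step of this loop can be sketched with a brute-force kNN search. This is only an illustration: it uses exhaustive cosine similarity in pure Python as a stand-in for faiss, and `knn_indices` is a hypothetical helper name. In the full loop, one would re-embed all samples with the current model, call this function to refresh the neighbors, train on the refreshed neighbors for some iterations, and repeat.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_indices(embeddings, k):
    """For each sample, return the indices of its k most similar other samples."""
    neighbours = []
    for i, query in enumerate(embeddings):
        scored = [(cosine(query, e), j)
                  for j, e in enumerate(embeddings) if j != i]
        scored.sort(reverse=True)  # highest similarity first
        neighbours.append([j for _, j in scored[:k]])
    return neighbours
```

Exhaustive search costs O(n^2) comparisons, which is why a dedicated similarity-search library is used in practice.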

Relevant links:

arXiv
Tweets:
- Chau Tran

Long Range Arena: A Benchmark for Efficient Transformers

tl;dr: Various Transformer variants are benchmarked over various tasks. Performance-computes trade-off for each model is obtained as above.

Details:

BigBird and Performer may have the best trade-off overall, the latter of which also excels at textual tasks in particular.
Predictably, Sparse Transformer, the model with an explicit inductive bias for images, outperforms other models at image modeling by a substantial margin.

Comments:

Speed is subject to implementation, so the result concerning speed may change, especially if we change the batch length. However, this result offers a good coarse picture.
Routing Transformer is not considered despite its superior performance in text.
The performance of Performer may be underestimated, since they use FAVOR instead of the more performant FAVOR+.
There are some notable related surveys on efficient attention:
- Efficient Transformers: A Survey
- A Survey of Long-Term Context in Transformers

Relevant links:

OpenReview

Rethinking Attention with Performers

tl;dr: O(N)-Transformer with competitive performance that approximates regular attention with provable accuracy. Performer outperforms Reformer and Linformer.

Comments:

As discussed in Long Range Arena: A Benchmark for Efficient Transformers, Performer is likely to have the best performance-computes trade-off overall, possibly excluding Routing Transformer, which was not compared. Please refer to this paper for comprehensive performance evaluation (note: their version of Performer is not up-to-date).
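
The reason the cost is linear in sequence length can be shown with a small sketch. It assumes positive random features of the form exp(w·x - |x|^2/2)/sqrt(m), whose inner product approximates exp(q·k) in expectation when the rows w are Gaussian. The sketch uses i.i.d. Gaussian rows rather than orthogonal ones, skips the usual 1/sqrt(d) scaling, and handles only the non-causal case, so it is an illustration of the idea rather than the authors' implementation. All function names are hypothetical.

```python
import math
import random

def positive_features(x, proj):
    """Positive random features: exp(w.x - |x|^2 / 2) / sqrt(m) per row w."""
    half_sq = 0.5 * sum(a * a for a in x)
    m = len(proj)
    return [math.exp(sum(w_i * a for w_i, a in zip(w, x)) - half_sq) / math.sqrt(m)
            for w in proj]

def linear_attention(Q, K, V, proj):
    """Non-causal attention in O(N): the N x N matrix is never built."""
    m, d_v = len(proj), len(V[0])
    kv = [[0.0] * d_v for _ in range(m)]  # running sum of phi(k_j) v_j^T
    k_sum = [0.0] * m                     # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = positive_features(k, proj)
        for a in range(m):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = positive_features(q, proj)
        denom = sum(fq[a] * k_sum[a] for a in range(m))
        out.append([sum(fq[a] * kv[a][b] for a in range(m)) / denom
                    for b in range(d_v)])
    return out

def quadratic_attention(Q, K, V, proj):
    """Same kernel computed the O(N^2) way, for comparison only."""
    out = []
    for q in Q:
        fq = positive_features(q, proj)
        weights = [sum(a * b for a, b in zip(fq, positive_features(k, proj)))
                   for k in K]
        total = sum(weights)
        out.append([sum(w * v[b] for w, v in zip(weights, V)) / total
                    for b in range(len(V[0]))])
    return out

random.seed(0)
d, m, n = 4, 8, 5
proj = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
Q = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]
K = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]
V = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
fast = linear_attention(Q, K, V, proj)
slow = quadratic_attention(Q, K, V, proj)
```

Both functions compute the same quantity; the linear version simply reorders the matrix products so that keys and values are summarized once and reused for every query.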

Relevant links:

arXiv
OpenReview
Tweets:
- Xingyou Song
Github:
- PyTorch implementation by Phillip Wang

Scaling Laws for Neural Language Models

tl;dr:

The loss of an LM scales as a power law with model size (N), dataset size (D), and the amount of compute spent for training (C) over up to seven orders of magnitude.
They find the optimal allocation of a fixed compute budget to maximize the performance.
Most additional compute should be spent on enlarging the model size.

Details:

The optimal allocation of a fixed compute budget, given that a sufficiently large dataset is available. B stands for the batch size.

Dataset size needs to be large enough that the same sample is never reused.
This power law translates to the scaling depicted in the following figure, which shows that most compute should go to model enlargement:

For a billion-fold increase in compute and optimally compute-efficient training, most of the increase should go towards increased model size, with a relatively small increase in batch size, and almost none toward serial step count.

Comments:

Since the exponent for serial step count is almost zero, one can treat the step count as a constant. According to the authors, setting S = 2S_min ~ 10k and B = B_crit is an empirically reasonable choice.
- This implies that 10k iterations are all you need!
Since the exponents for batch size and dataset size are somewhat small, we can expect that even the most performant models would not require a batch size much larger than several million or a dataset much larger than 10B tokens, respectively.
- For batch size, this implies that the upper bound for a reasonable batch size is not particularly large.
- For dataset size, this somewhat matches the fact that humans do not process many more than several billion tokens in their lifetime.
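
As a small worked example of the allocation, the snippet below splits a compute increase across model size, batch size and serial steps. The exponents (0.73, 0.24, 0.03) are the ones I recall from the paper's compute-efficient frontier and should be treated as approximate; the function name `allocate` is hypothetical.

```python
# Exponents as I recall them from the paper's compute-efficient frontier;
# treat them as approximate.
EXPONENTS = {"model_size": 0.73, "batch_size": 0.24, "serial_steps": 0.03}

def allocate(compute_multiplier):
    """Split a compute increase across model size, batch size and steps."""
    return {name: compute_multiplier ** p for name, p in EXPONENTS.items()}

growth = allocate(1e9)
for name, factor in growth.items():
    print(f"{name}: x{factor:,.1f}")
```

Under these exponents, a billion-fold compute increase corresponds to a model that is millions of times larger, a batch size on the order of a hundred times larger, and a step count that does not even double.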

Relevant links:

arXiv

Learning to Summarize From Human Feedback

tl;dr: Achieves superhuman-level summarization on the TL;DR dataset by training a reward function on human feedback and fine-tuning a pre-trained generator (GPT-3 variants) with PPO.

Performance of various training procedures for different model sizes on TL;DR dataset. Supervised models are fine-tuned with an additional dataset upon the pre-trained GPT-3 variants.

Comments:

One can apply the same approach to any data that can be ranked at reasonable cost (e.g. by human annotations). In this sense, the generality of this method is quite appealing and remarkable.
This work is also notable for showing that fine-tuning an unsupervisedly pre-trained model with an RL objective produces competitive results.
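
The two scalar quantities at the heart of this recipe can be sketched as follows: a pairwise loss that trains the reward function from a human preference between two summaries, and a per-sample reward that penalizes the policy for drifting from the supervised model. This is a minimal sketch of my understanding of the setup; the function names and the default `beta` are hypothetical, and the PPO update itself is not shown.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed stably."""
    diff = reward_preferred - reward_rejected
    # log(1 + exp(-diff)) without overflow for large |diff|.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def kl_penalized_reward(reward, logp_policy, logp_supervised, beta=0.05):
    """Reward-model score minus a penalty for diverging from the supervised model.

    beta is an illustrative value, not one taken from the paper.
    """
    return reward - beta * (logp_policy - logp_supervised)
```

The loss is small when the reward function already ranks the human-preferred summary higher, and grows roughly linearly when it ranks the pair the wrong way round.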

Relevant links:

Measuring Massive Multitask Language Understanding

tl;dr: A smaller, fine-tuned T5-like model (UnifiedQA) outperforms GPT-3 on solving various academic problems, ranging from elementary mathematics to US history, with few-shot learning.

Details:

UnifiedQA uses the T5 text-to-text backbone and is fine-tuned on a certain QA dataset. It is then evaluated with few-shot learning, without further fine-tuning specific to the domain of the problem. This removes the need for a large fine-tuning set for each narrow domain, whose availability cannot be assumed in general (relevant section).

Relevant links:

Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data

tl;dr: Proposes a multi-task fine-tuning that is more performant and memory-saving than the conventional per-task fine-tuning.

Details:

Per-task fine-tuning produces a separate set of parameters for each task, so the memory cost grows with the number of tasks.

Notably, the proposed method outperforms T5 by 6 points on Super-GLUE while using one-fourth of the parameter count.

Comments:

The following papers also propose multi-task fine-tuning:
- Multilingual NMT: Multi-task Learning for Multilingual Neural Machine Translation
- NLU: HyperGrid Transformers: Towards A Single Model for Multiple Tasks
HyperGrid Transformers seem to not require a per-task decoder, which may be more desirable for extension.

Relevant links:

arXiv

Current Limitations of Language Models: What You Need is Retrieval

tl;dr:

Classifies LM approaches into efficient attention, recurrence and scaling up model size, and analyzes each.
Each of these approaches has limitations.
Retrieval should solve many of them.
Speculates on jointly training GPT-3 and a retriever a la MARGE for further improvement.

Details:

Efficient attention and recurrence do not improve the prediction of the first ∼ 10^3 tokens, which becomes a bottleneck.
Scaling up model size (e.g. efficiently with conditional computation) still results in poor performance scaling for some tasks (e.g. GPT-3 vs FiD or iPET).
The usual assumption of the availability of a large training dataset (e.g. for fine-tuning) that is very similar to the test dataset is unrealistic and needs to be re-examined.
MLM and GPT-2-like models excel at different tasks, which suggests the need for a single model that unifies them.

The conventional causal modeling (left) vs. the speculated modified MARGE (right). If the tokens in last two blocks are to be predicted (red), the attention is paid to the past tokens in the red region and the tokens in the context (green). The modified MARGE reads the immediate context (purple) to find the relevant parts across the dataset and the past of the current sample in order to use them as a context.
- Essentially, the approach on the right predicts by efficiently and sparsely referencing the indefinite past as well as cross-referencing other samples.

Comments:

Admittedly, this paper may not be considered as a notable work comparable to other works listed in this blog post. However, as the author of this paper, I believe it is noteworthy and relevant for this blog post as providing some details for my view of the future trends of LM.

Relevant links:

arXiv
Tweets:
- Aran Komatsuzaki

More Future Trends of NLP

This section expands on what is described in the Future Trends of NLP section of the Table of Contents.

The usual assumption of the availability of a large training dataset (e.g. for fine-tuning) that is very similar to the test dataset is unrealistic, which needs to be resolved.
- Measuring Massive Multitask Language Understanding avoids this problem with a large, general fine-tuning dataset and more specialized few-shot learning samples as an alternative.
- Few-shot learning samples may also be replaced with unsupervisedly retrieved samples to minimize the assumption further.
Retrieval-augmented LM may be able to attend not only to any training sample but also to the (indefinite) past in a similar way, by segmenting each sample into embeddable segments from which we can find the relevant segments with kNN search.
For more details and justifications

An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

Architecture is essentially the same as vanilla BERT, except that the embedding is a linear projection of each patch into a dense vector.

tl;dr: When pre-trained and transferred to CV tasks, Vision Transformer, needing a minimal amount of inductive bias only at preprocessing, attains excellent results compared to SOTA CNNs while requiring substantially fewer computational resources to train.

Performance versus cost for Vision Transformers and ResNets. Vision Transformers generally outperform ResNets with the same computational budget.

Comments:

Notable differences between this paper and other papers applying self-attention to CV tasks:
- The model they use is essentially the vanilla Transformer with no 2D inductive bias except at the beginning.
  - Suggests the possibility of seamless unification of architecture over various modalities.
- They perform larger-scale evaluation with focus on performance-computes trade-off.
This work further motivates the Transformer-ization of research in CV and other modalities.
Unification of architecture across various modalities would simplify and unify ML research.
The poor performance-computes trade-off of iGPT relative to the SotA CNN models in some tasks may be easily fixed.
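
The preprocessing step, which is where the only 2D inductive bias enters, can be sketched in a few lines. The sketch assumes an image stored as nested lists of shape H x W x C and a projection matrix with one row per output dimension; `patchify` and `embed` are hypothetical helper names, and position embeddings and the class token are left out.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append all channels of the pixel
            patches.append(flat)
    return patches

def embed(patches, weight):
    """Linear projection of each flattened patch; weight has one row per output dim."""
    return [[sum(w_i * p_i for w_i, p_i in zip(row, p)) for row in weight]
            for p in patches]
```

After this step the model sees an ordinary sequence of vectors, so the rest of the architecture can stay identical to a text Transformer.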

Relevant links:

OpenReview
Videos:
- Yannic Kilcher

Generative Pretraining from Pixels

tl;dr: GPT-2 scale image model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification, despite having minimal 2D inductive bias and being trained on low-resolution ImageNet without labels. The generated images are stunning!

Details:

Pre-trains Sparse Transformers of varying sizes on low-resolution ImageNet with ordinary causal modeling or BERT modeling, and then fine-tunes or linear-probes the model for classification.

Comments:

One significance of this paper is that it shows a robust scaling of the loss w.r.t. the size of the generative image model, which had not been demonstrated before, even if it was predictable.
One of the causes of the poor performance of iGPT relative to Vision Transformer is presumably that the former is trained unsupervisedly, whereas the latter is trained supervisedly with essentially the same task for both pre-training and fine-tuning (classification). If a downstream task differs from the pre-training task, the performance difference may diminish.

Relevant links:

VideoGen: Generative Modeling of Videos using VQ-VAE and Transformers

There are two sequential stages in training: training VQ-VAE (Left) and training an autoregressive transformer in the latent space (Right).

tl;dr: VQ-VAE-based, GPT-like model with 3D convolutions and axial self-attention improves the SotA bits/dim on BAIR dataset from 3.94 (Axial Attention) to 3.62.

ViZDoom HGS Unconditional Samples

Comments:

There have been many other competitive video modeling papers submitted to ICLR 2021, notably:
- A Good Image Generator Is What You Need for High-Resolution Video Synthesis (Twitter summary)

Relevant links:

Mastering Atari with Discrete World Models

Components of Dreamer. Taken from DreamerV1 paper (Hafner, 2019)

tl;dr: Proposes DreamerV2, the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model.

Performance on the Atari benchmark of 55 games over 200M steps

Details:

MuZero performs better for the same number of environment frames, but its MCTS makes it harder to parallelize. The advanced planning components of MuZero are complementary and could be applied to the accurate world models learned by DreamerV2.
The above figure, in fact, under-estimates the dominance of DreamerV2 in terms of a more robust measurement of performance. Please refer to the paper for further details on this.

Relevant links:

arXiv (DreamerV2); arXiv (DreamerV1)
OpenReview
Tweets:
- Danijar Hafner
Github:
- DreamerV1 (needs to be modified for DreamerV2 as described in the appendix)

Unsupervised Active Pre-Training for Reinforcement Learning

tl;dr: Highly competitive performance compared to canonical RL algorithms is achieved using only 100k steps on Atari by a novel unsupervised active pre-training.

Details:

APT alternates between contrastive learning on data collected by the agent and RL optimization to maximize particle based entropy. After pre-training, the encoder f_θ and the RL policy initialization can be fine-tuned for different downstream tasks to maximize task-specific reward.
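
The entropy-maximizing reward can be sketched as a nearest-neighbor computation in the encoder's embedding space. The form used here, the log of a constant plus the mean distance to the k nearest neighbors, is my simplified reading of the particle-based estimator; the function name and the defaults for `k` and `c` are hypothetical.

```python
import math

def knn_entropy_reward(z, batch, k=3, c=1.0):
    """log(c + mean distance from z to its k nearest neighbours in batch).

    z: embedding of the current state.
    batch: embeddings of other states collected by the agent.
    """
    dists = sorted(math.dist(z, other) for other in batch)
    nearest = dists[:k]
    return math.log(c + sum(nearest) / len(nearest))
```

States far from everything the agent has already visited receive a high reward, which pushes the policy toward novel regions during pre-training.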

Relevant links:

OpenReview

Using a thousand optimization tasks to learn hyperparameter search strategies

A 2D TSNE embedding of all 1162 tasks used for training the learned optimizer.

tl;dr: An optimizer learned from a dataset of hyperparameters of a thousand tasks leads to a substantial improvement over various tasks, including large-scale problems such as LM1B and ImageNet.

Details:

The learned optimizer outperforms the learning rate tuned Adam by a large margin and matches the default.

The learned optimizer outperforms the learning rate tuned Adam with both a constant learning rate, and a fixed learning rate schedule on a 53M parameter Transformer trained on LM1B.

Comments:

The following is a relevant paper:
- Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves
  - Learned optimizer that generalizes better to unseen tasks and enables automatic regularization.

Relevant links:

arXiv

Second Order Optimization Made Practical

Preconditioner statistics are computed at each step by the accelerators. Preconditioners are only computed every N steps and this computation is distributed to all available CPU cores.

tl;dr:

Presents a scalable implementation of a second-order optimizer (Shampoo) that outperforms first-order baselines on large-scale problems in terms of wall-clock time.
This is the first implementation of a second-order optimizer that outperforms Adam on large-scale NMT in terms of wall-clock time (40% reduction).

Test log-perplexity of a Transformer-Big model on WMT’14 en→fr with the batch size of 1536. Improvement of 41% in wall-time for convergence.

Relevant links:

OpenReview (with a different title)
arXiv
Tweets:
- Rohan Anil
