The History of IQ Tests: From “Mental Age” to Modern Psychometrics


Last updated: October 9, 2025 • Reading time: ~10 min

How did IQ testing evolve from a classroom tool into today’s statistically robust, multi-index assessments? This guide walks the timeline: Binet’s mental age, Wechsler’s deviation IQ, Raven’s nonverbal reasoning, CHC theory, the Flynn Effect, and computer-adaptive testing.

Key takeaways

  • Early tests used mental age; modern tests report a deviation IQ (mean 100, SD ~15).
  • Contemporary batteries measure multiple broad abilities (e.g., verbal, fluid reasoning, working memory, processing speed).
  • The Flynn Effect raised average scores across the 20th century; tests are regularly renormed.
  • Online, computer-adaptive designs use IRT to improve fairness, precision, and efficiency.
  • IQ predicts some outcomes—but it doesn’t capture creativity, personality, or emotional intelligence.

Before IQ: the search for a general ability (late 1800s–early 1900s)

Late-19th-century pioneers tried to quantify individual differences using sensory and reaction-time tasks. The decisive turn came with Charles Spearman (1904), who observed positive correlations across diverse tasks and proposed a general factor, g, underlying cognitive performance.

Binet & Simon (1905): a practical school tool

Alfred Binet and Théodore Simon built the first widely adopted intelligence scale to help schools identify children who might benefit from tailored instruction. Items were arranged by the typical age of mastery, yielding a child’s mental age based on performance.

Stern’s ratio IQ (1912)

William Stern introduced the tidy formula IQ = (mental age / chronological age) × 100. It worked reasonably well for children but poorly for adults, whose “mental age” does not scale linearly with years.
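As a worked example, here is a minimal sketch of Stern’s formula in Python; the function name and the ages are purely illustrative, not drawn from any test manual:

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Stern's ratio IQ: (mental age / chronological age) x 100."""
    return (mental_age / chronological_age) * 100

# A 10-year-old who performs like a typical 12-year-old:
print(ratio_iq(mental_age=12, chronological_age=10))   # 120.0

# The formula breaks down for adults, because "mental age" plateaus
# while chronological age keeps growing:
print(ratio_iq(mental_age=16, chronological_age=40))   # 40.0 -- not meaningful
```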

Stanford–Binet & mass testing (1916–1920s)

Lewis Terman’s Stanford–Binet adaptation popularized IQ in the U.S. Group tests like Army Alpha and Army Beta scaled administration during WWI, demonstrating efficiency—and the risk of cultural and educational bias when interpretation lags behind usage.

Wechsler & deviation IQ (1939 →)

David Wechsler’s scales (e.g., WAIS for adults, WISC for children) reframed IQ as a deviation score: how far an individual’s performance sits from the age-based mean in standard deviation units (mean 100, typically SD ≈ 15). He also emphasized multi-index profiles—verbal comprehension, working memory, processing speed, and visual–spatial reasoning—moving beyond a single number.
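To make the deviation-score idea concrete, here is a minimal sketch assuming made-up age-group norm values; real batteries use published, age-stratified norm tables rather than a single mean and SD:

```python
def deviation_iq(raw_score: float, age_mean: float, age_sd: float,
                 mean: float = 100.0, sd: float = 15.0) -> float:
    """Express a raw score as its distance from the age-group mean,
    in standard-deviation units, rescaled to mean 100 / SD 15."""
    z = (raw_score - age_mean) / age_sd    # standard-deviation units
    return mean + sd * z

# A raw score one SD above the age-group mean maps to about 115:
print(deviation_iq(raw_score=62, age_mean=50, age_sd=12))   # 115.0
```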

Nonverbal & “culture-fair” tests

Raven’s Progressive Matrices (late 1930s) and Cattell’s Culture Fair Test sought to reduce language and cultural loading, focusing on abstract pattern reasoning (fluid intelligence). Fairness improved, though no test is entirely culture-free; these tools remain valuable for tapping reasoning with minimal verbal demands.

From g to CHC: modern ability models

Raymond Cattell distinguished fluid (Gf) and crystallized (Gc) intelligence. Work by Horn and Carroll converged into the CHC framework (Cattell–Horn–Carroll), a hierarchical model with g at the top, broad abilities in the middle (e.g., Gf, Gc, Gv, Gs, Gwm), and specific skills at the base. Modern batteries explicitly sample these domains.

The Flynn Effect

Throughout the 20th century, average performance on many subtests rose by a few points per decade—a pattern dubbed the Flynn Effect. Proposed drivers include improved nutrition, expanded schooling, cognitively richer environments, and test familiarity. To keep the population mean at 100, publishers renorm tests regularly.
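A toy calculation shows why renorming matters; every number below is invented for illustration only:

```python
# The same raw score, scored against the old norms and against norms
# collected from a population whose average performance has risen.
raw, old_mean, new_mean, norm_sd = 50.0, 50.0, 53.0, 10.0

old_iq = 100 + 15 * (raw - old_mean) / norm_sd   # 100.0 on the old norms
new_iq = 100 + 15 * (raw - new_mean) / norm_sd   #  95.5 after renorming
print(old_iq, new_iq)
```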

Measurement advances: IRT & computer-adaptive testing

Contemporary assessments leverage Item Response Theory (IRT) and computer-adaptive testing (CAT), estimating both item difficulty and a person’s latent ability. This improves precision and efficiency, and supports transparent reporting with confidence intervals and domain-level indices.
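As a rough sketch of the idea, the snippet below uses a two-parameter logistic (2PL) model with invented item parameters; production CAT systems add proper ability estimation, exposure control, and content balancing:

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT model: probability of a correct response at ability theta,
    for an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Adaptive step: given the current ability estimate, administer the
# remaining item that is most informative at that estimate.
item_bank = [{"a": 1.2, "b": -1.0}, {"a": 0.9, "b": 0.2}, {"a": 1.5, "b": 0.8}]
theta_hat = 0.5
next_item = max(item_bank, key=lambda it: item_information(theta_hat, it["a"], it["b"]))
print(next_item)   # the item whose parameters are most informative near theta_hat
```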

What IQ does—and doesn’t—tell us

IQ correlates with academic achievement and some job performance, particularly where complex learning and reasoning matter. Yet IQ is not a measure of creativity, wisdom, personality, motivation, values, or emotional intelligence. Responsible practice emphasizes multi-method assessment, attention to language and cultural context, and cautious, humane interpretation.

Timeline at a glance

  • 1904 — Spearman proposes g
  • 1905 — Binet–Simon scale (mental age)
  • 1912 — Stern’s ratio IQ
  • 1916 — Stanford–Binet (U.S. adaptation)
  • WWI — Army Alpha/Beta group tests
  • 1930s–40s — Raven’s Matrices; Wechsler scales
  • 1960s–1990s — Gf/Gc and CHC frameworks mature
  • 1980s–2000s — Flynn Effect widely documented; regular renorming
  • 2000s → — IRT/CAT mainstream; online delivery; richer index reporting

What to read next

Try the main IQ test now

About this page: Part of the TakeIQtest learning hub. Our goal is to explain IQ with clarity and care—no hype, no shortcuts.

Prefer a conversational walkthrough? Subscribers can chat with our AI to explore concepts from this article, interpret scores, and plan practice.