InfantSimulator | Why do infants learn language so fast? A reverse engineering approach

Summary
How do infants learn their first language(s)? The popular yet controversial 'statistical learning hypothesis' posits that they learn by gradually collecting statistics over their language inputs. This is strikingly similar to how current AI Large Language Models (LLMs) learn, and suggests that simple statistical mechanisms may be sufficient to attain adult-like language competence. But are they? Estimates of language inputs to children show that by age 3, they have received 2 to 3 orders of magnitude less data than LLMs of similar performance. And the gap grows exponentially larger with children's age. Worse, when models are fed speech instead of text, they learn even more slowly. How are infants such efficient learners?
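To make the scale of that gap concrete, here is a back-of-envelope sketch. The daily word count and the LLM training budget are illustrative assumptions (published estimates vary widely), not figures produced by the project.

```python
import math

# Back-of-envelope illustration of the child-vs-LLM data gap.
# All numbers below are assumptions chosen for illustration only.

WORDS_PER_DAY = 13_000   # assumed words of speech heard by a child per day
DAYS_BY_AGE_3 = 3 * 365  # roughly the first three years of life
LLM_TOKENS = 3e9         # assumed training budget of an LLM of comparable performance

child_words = WORDS_PER_DAY * DAYS_BY_AGE_3       # ~14 million words
gap = math.log10(LLM_TOKENS / child_words)        # ~2.3 orders of magnitude

print(f"Child input by age 3: ~{child_words / 1e6:.0f}M words")
print(f"Assumed LLM budget:   ~{LLM_TOKENS / 1e9:.0f}B tokens")
print(f"Gap: ~{gap:.1f} orders of magnitude")
```

With a frontier-scale training budget (hundreds of billions of tokens) instead of a performance-matched one, the same arithmetic yields a gap of 4 or more orders of magnitude, which is why the gap widens as children age.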

This project tests the hypothesis that, in addition to statistical learning, infants benefit from 3 mechanisms that accelerate their learning rate. (1) They are born with a vocal tract, which helps them understand the link between abstract motor commands and speech sounds and decode noisy speech inputs more efficiently. (2) They have an episodic memory enabling them to learn from unique events, instead of gradually learning from thousands of repetitions. (3) They start with an evolved learning architecture optimized for generalisation from few, noisy inputs.
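As a purely illustrative contrast for mechanism (2), the toy sketch below opposes a learner that needs many co-occurrences to one that stores a single episode. The class names, the threshold, and the word-referent framing are hypothetical and do not describe the project's actual model.

```python
from collections import Counter

# Toy contrast: gradual statistical learning vs. one-shot episodic memory.

class StatisticalLearner:
    """Accepts a word-referent pairing only after many co-occurrences."""
    def __init__(self, threshold=50):
        self.counts = Counter()
        self.threshold = threshold

    def observe(self, word, referent):
        self.counts[(word, referent)] += 1

    def knows(self, word, referent):
        return self.counts[(word, referent)] >= self.threshold

class EpisodicLearner:
    """Stores a unique event and can reuse it immediately."""
    def __init__(self):
        self.episodes = {}

    def observe(self, word, referent):
        self.episodes[word] = referent  # a single exposure is stored as an episode

    def knows(self, word, referent):
        return self.episodes.get(word) == referent

stat, epi = StatisticalLearner(), EpisodicLearner()
stat.observe("giraffe", "GIRAFFE")
epi.observe("giraffe", "GIRAFFE")
print(stat.knows("giraffe", "GIRAFFE"))  # False: still needs ~49 more repetitions
print(epi.knows("giraffe", "GIRAFFE"))   # True: learned from one unique event
```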

Our approach is to build a computational model of the learner (an infant simulator) which, when fed realistic language input, produces outcome measures comparable to children's (laboratory experiments, vocabulary estimates). This gives a quantitative estimate of the efficiency of each of these 3 mechanisms, as well as new testable predictions. We start with English and French, which both have large, accessible annotated speech corpora and documented acquisition landmarks, and focus on the first three years of life. We then help build similar resources across a larger set of languages by fostering a cross-disciplinary community that shares tools, data and analysis methods.
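The evaluation logic can be sketched as follows. The vocabulary norms, the simulator interface (count_known_words), and the fit metric are assumptions used for illustration, not specifications from the project.

```python
# Minimal sketch of the evaluation loop: probe a simulator after increasing
# amounts of input and compare it to assumed child vocabulary norms.

# Assumed comprehension-vocabulary norms (words) at 12, 24, and 36 months.
CHILD_NORMS = {12: 80, 24: 300, 36: 1000}

def fit_to_norms(simulator):
    """Mean relative error between simulated and normative vocabulary sizes."""
    errors = []
    for age_months, norm in CHILD_NORMS.items():
        predicted = simulator.count_known_words(after_months=age_months)  # assumed API
        errors.append(abs(predicted - norm) / norm)
    return sum(errors) / len(errors)

# Comparing fit_to_norms() for simulators trained with and without a given
# mechanism (vocal tract, episodic memory, evolved architecture) would quantify
# how much input that mechanism saves for a given level of performance.
```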
More information & hyperlinks
Web resources: https://cordis.europa.eu/project/id/101142705
Start date: 01-01-2025
End date: 31-12-2029
Total budget - Public funding: 2 494 625,00 Euro - 2 494 625,00 Euro
Cordis data

Status

SIGNED

Call topic

ERC-2023-ADG

Update Date

21-11-2024
Structured mapping
Horizon Europe
HORIZON.1 Excellent Science
HORIZON.1.1 European Research Council (ERC)
HORIZON.1.1.1 Frontier science
ERC-2023-ADG ERC ADVANCED GRANTS