HPLT | High Performance Language Technologies

Summary
High Performance Language Technologies (HPLT) is a space combining petabytes of natural language data with large-scale model training. With trillions of words of text, the space will be the largest open text collection. Cleaning and privacy protecting services improve the quality and ethical properties of the text. Going beyond static repositories that require the user to individually analyze each data set, the project will rate data sets by how much they improve end-to-end language models and machine translation systems. Continuous integration of models and data will result in free downloadable high-quality models for all official European Union languages and beyond. The models will be reproducible with information and evaluation metrics shown in a publicly available dashboard. By focusing on training at scale, the project complements the inference-focused European Language Grid, which in turn will be used for model deployment. Datasets, models and information about them will be published in recognized FAIR data repositories, aggregation catalogues and marketplaces for easy discovery, access, replication, and exploitation.
Unfold all
/
Fold all
More information & hyperlinks
Web resources: https://cordis.europa.eu/project/id/101070350
Start date: 01-09-2022
End date: 31-08-2025
Total budget - Public funding: 4 058 287,50 Euro - 3 880 687,00 Euro
Cordis data

Original description

High Performance Language Technologies (HPLT) is a space combining petabytes of natural language data with large-scale model training. With trillions of words of text, the space will be the largest open text collection. Cleaning and privacy protecting services improve the quality and ethical properties of the text. Going beyond static repositories that require the user to individually analyze each data set, the project will rate data sets by how much they improve end-to-end language models and machine translation systems. Continuous integration of models and data will result in free downloadable high-quality models for all official European Union languages and beyond. The models will be reproducible with information and evaluation metrics shown in a publicly available dashboard. By focusing on training at scale, the project complements the inference-focused European Language Grid, which in turn will be used for model deployment. Datasets, models and information about them will be published in recognized FAIR data repositories, aggregation catalogues and marketplaces for easy discovery, access, replication, and exploitation.

Status

SIGNED

Call topic

HORIZON-CL4-2021-DATA-01-03

Update Date

09-02-2023
Images
No images available.
Geographical location(s)
Structured mapping
Unfold all
/
Fold all
Artificial Intelligence, Data and Robotics Partnership (ADR)
ADR Partnership Call 2021
HORIZON-CL4-2021-DATA-01-03 Technologies for data management (AI, Data and Robotics Partnership) (IA)
Horizon Europe
HORIZON.2 Global Challenges and European Industrial Competitiveness
HORIZON.2.4 Digital, Industry and Space
HORIZON.2.4.7 Advanced Computing and Big Data
HORIZON-CL4-2021-DATA-01
HORIZON-CL4-2021-DATA-01-03 Technologies for data management (AI, Data and Robotics Partnership) (IA)