Summary
Task 7.5: Create and operate a platform and common data models for sharing data and remote access (M1-M60) (UMCU, ARS, BBMRI, GSK, JAN, Elevate). A state-of-the-art digital research environment with ISO-certified and GDPR-compliant services for remote collaboration will be subcontracted and operated. Access to the application server will only be allowed using two-factor authentication. The environment will be able to host multiple research projects, each with its own secured area to share data and results and to provide access through remote desktop clients. The infrastructure will offer several analytical tools (e.g. R, SQL databases, Shiny, Stata), word processing software, and utilities. To streamline and conduct the data characterisation and distributed data analytics for the different WPs (1 & 2) that want to use the platform for distributed analytics, an operations team will be installed to coordinate the various tasks. This will involve negotiating data access and access rules, distributing instructions and scripts, and facilitating the transfer of results to the DRE. For management, documentation and tracking of the different tasks we will operate TASKA, which was developed in the IMI-EMIF project. Standard operating procedures & training webinars for DAPs will be developed whenever necessary.
Common data models for data characterisation and demonstration studies will be defined together with WP1 and WP2, using existing CDMs (EUROlinkCAT, EU-ADR, OMOP, EMIF Use Cases, SENTINEL, PCORnet, LifeCycle) as a starting point, and standard procedures will be developed that run against the chosen CDMs.
We envision that different CDMs will be chosen at different steps of the data flow. First, a set of common input files (D2), which will encompass approximately 4 tables (see Figure 3.1b, Part B): for instance, the identifiers of mothers and children with birthdate will be stored (in Population), each event of delivery will be stored (in Events), and gestational age, with the associated date, will be stored (in Measurement). Second, datasets of study-specific variables (D3 in the figure): for instance, LMP (last menstrual period) will be stored as a variable derived from the event of delivery and the gestational age. Third, datasets specific to the study design (D4 in the figure): for instance, if the study design is a case-control, D4 will encompass the dataset of case-sets. Even though we will aim to create syntactically stable CDMs, the content stored in the CDMs, and the values allowed for the different columns, will be data source- and project-specific: for instance, if a data source directly records the last menstrual date, this will be stored and used along with the derived LMP. However, the data transformation procedures will be programmed centrally, as far as possible.
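As a simple illustration of the D2-to-D3 derivation step described above, the LMP can be computed from the delivery date and the gestational age recorded at delivery. The sketch below is a minimal toy example with an invented function name and dates; it is not the actual centrally programmed transformation procedure:

```python
from datetime import date, timedelta

def derive_lmp(delivery_date: date, gestational_age_days: int) -> date:
    """Derive the last menstrual period (LMP) date from a delivery event
    (stored in Events) and the gestational age recorded at the same date
    (stored in Measurement)."""
    return delivery_date - timedelta(days=gestational_age_days)

# Hypothetical record: delivery on 2020-09-15 at 280 days of gestation
lmp = derive_lmp(date(2020, 9, 15), 280)  # -> date(2019, 12, 10)
```

If a data source also records the last menstrual date directly, that recorded value would be stored alongside this derived one, as noted above.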
All datasets up to D3 will remain stored within the premises of the DAPs. D4 will be shared within the study team using the secure remote environment. In case one or more of the study designs of the demonstration projects requires that D4 contain information that the DAPs are not allowed to share, a distributed implementation of the statistical analysis (t4) will be developed. For instance, if estimation of a propensity score is needed, a distributed estimation of the regression will be implemented, following similar experience in Sentinel.
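To illustrate the principle behind such a distributed regression (the sketch below is our own toy example, not the Sentinel implementation): each DAP computes summary statistics on its local records, here the gradient contributions of a logistic regression, and only those aggregates leave the site. Summing the per-site gradients reproduces exactly the gradient that would be obtained on the pooled individual-level data, so individual records never need to be shared:

```python
import math

def local_gradient(beta, rows):
    """Gradient of the logistic log-likelihood over one DAP's records.
    Each row is (features, outcome); only these sums leave the site."""
    grad = [0.0] * len(beta)
    for x, y in rows:
        p = 1.0 / (1.0 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
        for j, xi in enumerate(x):
            grad[j] += (y - p) * xi
    return grad

def aggregate_step(beta, site_gradients, lr=0.1):
    """Coordinating centre sums the per-site gradients and updates beta."""
    total = [sum(g[j] for g in site_gradients) for j in range(len(beta))]
    return [b + lr * t for b, t in zip(beta, total)]

# Two hypothetical sites with invented toy data: x = (intercept, exposure)
site_a = [((1.0, 0.0), 0), ((1.0, 1.0), 1)]
site_b = [((1.0, 1.0), 1), ((1.0, 0.0), 0)]
beta = [0.0, 0.0]
for _ in range(200):
    grads = [local_gradient(beta, site_a), local_gradient(beta, site_b)]
    beta = aggregate_step(beta, grads)
```

The same pattern extends to the iterative fitting of a propensity-score model: each round, sites exchange only aggregate gradient (or score and information) terms with the coordinating centre.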
To support semantic and syntactic harmonisation, available tools will be leveraged, including UMLS, the OHDSI ontologies and tools, the IDMP standards and the Article 57 database, the ADVANCE CodeMapper, and VaccO. DAPs will be trained with e-learning materials to use these tools. Since we are prepared to work both with data sources that have already been mapped to the OMOP structure and with data sources that are in their original format, different processes will be supported to map the local data to the common input files, depending on whether the local data is in OMOP or not. In particular, if the local data is in OMOP, data so
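For data sources that are not in OMOP, mapping local terminology codes to the concepts of the common input files could in principle be driven by a reviewed mapping table, as sketched below. The table, the local code, and the function name are invented for illustration; this is not the output format of CodeMapper or of the OHDSI vocabulary tools:

```python
from typing import Optional

# Invented example mapping table: (source vocabulary, source code) -> concept
EXAMPLE_MAP = {
    ("ICD10", "O80"): "delivery",      # single spontaneous delivery
    ("LOCAL", "DELIV01"): "delivery",  # hypothetical DAP-specific code
}

def map_event(vocabulary: str, code: str) -> Optional[str]:
    """Return the common-input-file concept for a local code, or None
    if the code is not (yet) mapped and needs review by the DAP."""
    return EXAMPLE_MAP.get((vocabulary, code))
```

Unmapped codes (returning None) would be flagged back to the DAP for review, so that the mapping tables improve over the course of the project.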
More information & hyperlinks