Community Data

Covid-19 Data Commons Toolkit

This collection contains approximately 6,000 datasets (after preprocessing) related to COVID-19. A t-SNE plot visualizes the BioBERT embeddings computed from the dataset abstracts. Several clusters emerge around keywords such as "vaccine" and "ICU", and we are exploring them further.


Exploratory Data Analysis

This collection of approximately 26,554 COVID-19 datasets (after preprocessing) was sourced from the Figshare and Zenodo FAIR Stations. Users can run Exploratory Data Analysis on this resource and freely download the data they need.


Harmonising FAIR DataHub

Choose one of three levels of analysis: data level, field level (column names), or description level. Next, choose your preferred FAIR Station and select data by keyword. Then decide whether to view the mapped terms, the CSV, or closely related files ranked by similarity score. Similarity scores between two files are also available at each level.
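As a rough illustration of what "closely related files based on similarity score" can mean, the sketch below ranks files by a precomputed score and keeps the top few. The function name `closest_files`, the file names, and the score values are hypothetical placeholders chosen for illustration; they are not taken from the toolkit.

```python
def closest_files(scores, k=3, min_score=0.0):
    """Return up to k (file, score) pairs with score >= min_score, highest first.

    `scores` maps a file name to its similarity with the selected file.
    """
    kept = [(name, s) for name, s in scores.items() if s >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Made-up scores for three hypothetical files.
example_scores = {"file_a.csv": 0.91, "file_b.csv": 0.42, "file_c.csv": 0.77}
top = closest_files(example_scores, k=2, min_score=0.5)
print(top)
```

The `min_score` cut-off is an assumption of this sketch; a dashboard could equally show a fixed number of neighbours without any threshold.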


Select Analysis Type:

- Description level: The dataset descriptions were processed with the BioBERT model, which generated a 768-dimensional vector for every description. These vectors capture each description's context and semantic characteristics in a high-dimensional space. A cosine similarity matrix was then computed from the vectors. The matrix quantifies how related two datasets are based on their respective descriptions.
- Field level: The similarity of field compositions between two datasets was assessed at the concept level using the Jaccard similarity score. Two mapping strategies were employed: 1) 1:1 mapping, where each column value directly corresponded to a unique SNOMED CT concept, ensuring strict equivalence between data and medical terms, and 2) 1:N mapping, where column headers were mapped to single SNOMED CT terms after additional processing, allowing flexibility for handling ambiguities or grouping related terms.
- Data level: KL divergence is used to measure differences between the similar columns (data) identified in the field-level analysis.
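The three measures named above (cosine similarity, Jaccard similarity, and KL divergence) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the toolkit's implementation: the short vectors stand in for 768-dimensional BioBERT embeddings, the concept identifiers are placeholders rather than real SNOMED CT codes, and the KL divergence is computed over category frequencies with a small smoothing constant `eps`, which is a choice made here to avoid division by zero.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarity matrix for a list of vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


def jaccard(concepts_a, concepts_b):
    """Jaccard score: size of the intersection over size of the union."""
    a, b = set(concepts_a), set(concepts_b)
    if not a and not b:
        return 0.0  # convention chosen in this sketch for two empty sets
    return len(a & b) / len(a | b)


def kl_divergence(values_p, values_q, eps=1e-9):
    """KL(P || Q) between the category distributions of two columns.

    Counts are smoothed with eps so that a category missing from one
    column does not produce log(0) or a division by zero.
    """
    count_p, count_q = Counter(values_p), Counter(values_q)
    categories = set(count_p) | set(count_q)
    total_p = sum(count_p.values()) + eps * len(categories)
    total_q = sum(count_q.values()) + eps * len(categories)
    result = 0.0
    for c in categories:
        p = (count_p[c] + eps) / total_p
        q = (count_q[c] + eps) / total_q
        result += p * math.log(p / q)
    return result


# Toy stand-ins for description embeddings (real ones have 768 dimensions).
embeddings = [[1.0, 0.0, 1.0], [2.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
matrix = similarity_matrix(embeddings)

# Placeholder concept identifiers, not real SNOMED CT codes.
fields_a = {"concept_1", "concept_2", "concept_3"}
fields_b = {"concept_2", "concept_3", "concept_4"}
field_score = jaccard(fields_a, fields_b)

# Two hypothetical categorical columns matched at the field level.
column_a = ["yes", "yes", "no"]
column_b = ["yes", "no", "no"]
column_divergence = kl_divergence(column_a, column_b)
```

Note that KL divergence is asymmetric, so `kl_divergence(a, b)` and `kl_divergence(b, a)` generally differ. For numeric columns, the values would first need to be binned into categories, which this sketch does not cover.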