Community Data

Covid-19 Data Commons Toolkit

This collection contains approximately 6000 datasets (after preprocessing) related to COVID-19. A t-SNE plot visualizes the BioBERT embeddings generated from the dataset abstracts. Several clusters form around keywords such as vaccine and ICU, and we are exploring them.

Exploratory Data Analysis

This collection contains approximately 1250 datasets (after preprocessing) related to COVID-19, available for users to perform exploratory data analysis.

Tindering FAIR DataHub

Select Analysis Type:

The dataset descriptions were processed using the BioBERT model, which generated a vector for every description. These 768-dimensional vectors represent each description's context and semantic characteristics in a high-dimensional space. A cosine similarity matrix was then produced by computing the cosine similarity between every pair of description vectors. This matrix measures how related two datasets are by quantifying the similarity of their descriptions.

Field-level analysis assessed the similarity of field compositions between two datasets at the concept level using the Jaccard similarity score. Two mapping strategies were employed: 1) 1:1 mapping, where each column value directly corresponded to a unique SNOMED CT concept, ensuring strict equivalence between data and medical terms, and 2) 1:N mapping, where column headers were mapped to single SNOMED CT terms after additional processing, allowing flexibility for handling ambiguities or grouping related terms.

KL divergence is used to measure differences between similar columns (data) identified by the field-level analysis.
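The three measures described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the toolkit's actual implementation: the function names, the toy 3-dimensional vectors (standing in for the 768-dimensional BioBERT vectors), the placeholder concept identifiers, the treatment of columns as categorical values, and the smoothing constant are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def cosine_similarity_matrix(vectors):
    """Symmetric matrix of pairwise cosine similarities."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


def jaccard_similarity(concepts_a, concepts_b):
    """Size of the intersection over size of the union of two concept sets."""
    a, b = set(concepts_a), set(concepts_b)
    if not a and not b:
        return 0.0  # assumption: two empty field sets count as dissimilar
    return len(a & b) / len(a | b)


def kl_divergence(values_p, values_q, smoothing=1e-9):
    """KL(P || Q) between the empirical distributions of two categorical columns.

    The additive smoothing term is an assumption of this sketch; it keeps the
    result finite when a category appears in only one of the two columns.
    """
    counts_p, counts_q = Counter(values_p), Counter(values_q)
    categories = set(counts_p) | set(counts_q)
    total_p = sum(counts_p.values()) + smoothing * len(categories)
    total_q = sum(counts_q.values()) + smoothing * len(categories)
    divergence = 0.0
    for category in categories:
        p = (counts_p[category] + smoothing) / total_p
        q = (counts_q[category] + smoothing) / total_q
        divergence += p * math.log(p / q)
    return divergence


if __name__ == "__main__":
    # Toy description vectors (hypothetical stand-ins for BioBERT output).
    vectors = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
    print(cosine_similarity_matrix(vectors))

    # Placeholder concept identifiers, not real SNOMED CT codes.
    fields_a = {"concept_a", "concept_b", "concept_c"}
    fields_b = {"concept_b", "concept_c", "concept_d"}
    print(jaccard_similarity(fields_a, fields_b))  # 0.5

    # Two columns that the field-level step matched as similar.
    print(kl_divergence(["yes", "yes", "no"], ["yes", "no", "no"]))
```

KL divergence is asymmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ. A symmetric comparison would need an extra step, such as averaging the two directions.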