Dataset Level Screen

Last updated 2 months ago

A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via sales@dnanexus.com).

A Note on Data:

The data used in this section of the Academy documentation can be downloaded from https://synthea.mitre.org/downloads

The citation for this synthetic dataset is:

Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007

Dataset Level Screen

The Dataset Level screen is the default screen when you open Data Profiler. It contains the Table Relationships and Summary pages. This section describes each component of the screen and its key values.

The default screen of Data Profiler is at the Table Relationships page of the Dataset level

Manage Tables

The Manage Tables controller lets you hide or show tables in the data profile. Tables that are hidden from the Entity Relationship Diagram (ERD) are also hidden from the Data Hierarchy. To manage the table display, click the ‘Manage’ button in the bottom-right corner of the screen, use the toggles to hide or show tables, and then click the ‘Apply’ button to apply the changes.

Open the ‘Manage Tables’ controller to show/hide the table(s)

The data profile is updated after the ‘patients’ table is hidden

Table Relationships

A Relationship Diagram (left) with some selected edges highlighted in blue. The selected edges create a Diagram of Overlaps (right)

The Table Relationships page shows a simplified Entity Relationship Diagram (ERD) displayed as a graph. Each node represents a table in your dataset, and each edge represents a column that links two tables. The linked columns are those given in the referenced_entity_field of the data_dictionary. Each edge points from a foreign-key column to the primary-key column it references.

FAQs

Question: Some tables are supposed to be linked to each other. Why do they appear unlinked in Data Profiler?

Answer: The linkage between any two tables is determined by the data_dictionary. Data Profiler does not add or remove linkages in a dataset. Check your data_dictionary again and make sure that the linkage is specified correctly.
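One way to check the linkages is to list them straight from the data_dictionary. The sketch below is illustrative only: it assumes the data_dictionary is a CSV with entity, name, and referenced_entity_field columns, and that references are written as "entity:field". The sample rows are hypothetical. It lists every declared link and flags links whose target is not itself declared, which is one possible reason for tables appearing unlinked.

```python
import csv
import io

# Hypothetical excerpt of a data_dictionary. The column names and the
# "entity:field" reference format are assumptions made for illustration.
DATA_DICTIONARY_CSV = """entity,name,type,referenced_entity_field
patients,patient_id,string,
measurements,measurement_id,string,
measurements,patient_id,string,patients:patient_id
diagnosis,patient_id,string,patient:patient_id
"""


def list_linkages(csv_text):
    """Return (from_table, from_column, to_table, to_column) for each declared link."""
    edges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        ref = (row.get("referenced_entity_field") or "").strip()
        if not ref:
            continue  # this column does not reference another table
        to_table, _, to_column = ref.partition(":")
        edges.append((row["entity"], row["name"], to_table, to_column))
    return edges


def dangling_linkages(csv_text):
    """Return links whose target table/column is not declared in the dictionary."""
    declared = {
        (row["entity"], row["name"])
        for row in csv.DictReader(io.StringIO(csv_text))
    }
    return [e for e in list_linkages(csv_text) if (e[2], e[3]) not in declared]


edges = list_linkages(DATA_DICTIONARY_CSV)
dangling = dangling_linkages(DATA_DICTIONARY_CSV)

for from_table, from_column, to_table, to_column in edges:
    print(f"{from_table}.{from_column} -> {to_table}.{to_column}")
for edge in dangling:
    print("Target not found in dictionary:", edge)
```

In this sample, the diagnosis row points at "patient" rather than "patients", so it is reported as a link with no matching target.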

By clicking on one or more edges, you can view a Diagram of Overlaps that shows how many values the linked columns share between the tables. There are several chart types for a Diagram of Overlaps:

Venn Diagram

The Venn diagram is the default chart type for a Diagram of Overlaps. Each set in the diagram is a table in the selection, and each number is the count of values from the linked column that fall in that region of the diagram.

Question: How should I interpret a Venn diagram with two tables, patients and measurements, whose intersection has a value of 90? The column is patient_id.

Answer: The patients and measurements tables share 90 patient_id values, which means that 90 patients have measurement data.
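The same reading can be reproduced with plain set operations. This is a minimal sketch using made-up patient_id values, and it assumes the diagram counts distinct values; it is not Data Profiler's own code.

```python
# Hypothetical patient_id values, chosen so that the overlap is 90.
patients_ids = {f"P{i:03d}" for i in range(1, 101)}      # P001 .. P100
measurement_ids = {f"P{i:03d}" for i in range(11, 111)}  # P011 .. P110

# The three regions of a two-set Venn diagram.
shared = patients_ids & measurement_ids             # in both tables
only_patients = patients_ids - measurement_ids      # patients without measurements
only_measurements = measurement_ids - patients_ids  # measurement IDs with no patient row

print("patients only:    ", len(only_patients))
print("shared:           ", len(shared))
print("measurements only:", len(only_measurements))
```

The "shared" region corresponds to the intersection value in the question above: patients that also appear in the measurements table.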

Euler Diagram

Euler diagrams follow the same concept as Venn diagrams. The only difference is that the size of each overlap section is proportional to its overlap value.

Upset Plot

An Upset plot counts the values in every non-empty combination of the selected tables. This plot type scales better than a Venn or Euler diagram as the number of tables grows.

A common use case for an Upset plot is to help answer questions such as “How many patients have full information across tables?”. By creating an Upset plot between the “patients” table and other tables (e.g. diagnosis, measurement, sequence_run, etc.), we can answer the question by looking at the number of patient IDs that are shared across all tables.
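The counting behind such a plot can be sketched in a few lines. The example below uses hypothetical table contents and assumes exclusive combinations, where each value is counted once under the exact set of tables that contain it, as in a standard UpSet plot. Whether Data Profiler counts combinations in exactly this way is an assumption.

```python
from collections import Counter

# Hypothetical patient_id values per table.
tables = {
    "patients": {"p1", "p2", "p3", "p4", "p5"},
    "diagnosis": {"p1", "p2", "p3"},
    "measurement": {"p1", "p2", "p4"},
}


def upset_counts(sets_by_table):
    """Count values per exact combination of tables that contain them."""
    counts = Counter()
    all_values = set().union(*sets_by_table.values())
    for value in all_values:
        combo = frozenset(
            name for name, values in sets_by_table.items() if value in values
        )
        counts[combo] += 1
    return counts


counts = upset_counts(tables)

# Patients with full information are those present in every table.
complete = counts[frozenset(tables)]

for combo, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(sorted(combo), n)
print("present in all tables:", complete)
```

Only non-empty combinations ever appear in the result, because a combination is recorded only when at least one value belongs to it.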

Summary Page

The Summary page provides a summary of both the tables and the columns in the dataset. The sections below describe each part.

The summary of all Tables and Columns in the Dataset

Table Summary

The Table Summary shows information about all tables in the dataset. Each row displays various statistics for a table in your dataset, including:

  • # Columns, # Rows: the number of columns, the number of rows

  • Column types: the data types of the columns in the table

  • Duplication Rate: the proportion of rows that duplicate another whole row in the table

  • Missing Rate: the proportion of empty cells in the table
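To make the two rates concrete, here is a small sketch that computes them for a hypothetical table. Data Profiler's exact formulas are not stated on this page, so the definitions are assumptions: a cell counts as missing when it is None or an empty string, and the duplication rate is the share of rows that repeat an earlier identical row.

```python
from collections import Counter

# Hypothetical table rows: (patient_id, gender, birth_date).
rows = [
    ("p1", "F", "1980-01-01"),
    ("p2", "M", None),           # missing birth_date
    ("p1", "F", "1980-01-01"),   # exact duplicate of the first row
    ("p3", "", "1975-06-30"),    # empty string treated as missing (assumption)
]


def missing_rate(rows):
    """Fraction of cells that are empty (None or "")."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 0.0
    missing = sum(1 for cell in cells if cell is None or cell == "")
    return missing / len(cells)


def duplication_rate(rows):
    """Fraction of rows that repeat an earlier, identical row."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(rows)


print(f"Missing Rate: {missing_rate(rows):.2%}")
print(f"Duplication Rate: {duplication_rate(rows):.2%}")
```

The same two functions work for a single column if each row is reduced to a one-element tuple holding that column's value.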

You can click on the hamburger button at the header of each column to sort or filter the data as needed.

Clicking on the hamburger button to sort or filter the data

Column Summary

The Column Summary provides details about every column in the dataset. Each row presents the following information for a specific column:

  • Column name: name of the column

  • Key type: the attributes that are used to define the relationships of tables

  • Description: the title of a column (if provided in the data dictionary file)

  • Provided type: the data type of the column as specified in the data dictionary file. If no data dictionary is provided, this is ‘unknown’

  • Inferred types: the data type of the column as inferred by Data Profiler when no data dictionary is provided. If a data dictionary is provided, this is the same as the Provided type

  • Missing Rate: the proportion of empty cells in the column

  • Duplication Rate: the proportion of duplicated values in the column

You can also click on the hamburger button at the header of each column to sort or filter the data as needed.

Resources

To create a support ticket if there are technical issues:

  1. Go to the Help header (same section where Projects and Tools are) inside the platform

  2. Select “Contact Support”

  3. Fill in the Subject and Message to submit a support ticket.

Full Documentation
  • DNAnexus Sales: sales@dnanexus.com

  • Synthea synthetic data downloads: https://synthea.mitre.org/downloads

  • Citation DOI: https://doi.org/10.1016/j.ibmed.2020.100007