Dataset Level Screen

A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales by email.

A Note on Data:

The data used in this section of the Academy documentation can be downloaded from: https://synthea.mitre.org/downloads

The citation for this synthetic dataset is:

Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007


The Dataset-level screen is the default screen when you open Data Profiler. It contains the Table Relationships and Summary pages. In this section, we describe each component of the screen and its key values.

The default screen of Data Profiler is at the Table Relationships page of the Dataset level

Manage Tables

The Manage Tables controller lets you hide or show tables in the data profile. Tables that are hidden from the ERD are also hidden from the Data Hierarchy. To manage the table display, click the ‘Manage’ button in the bottom-right corner of the screen, use the toggles to hide or show tables, and then click the ‘Apply’ button to apply the changes.

Open the ‘Manage Tables’ controller to show/hide the table(s)

The data profile is updated after the ‘patients’ table is hidden

Table Relationships

A Relationship Diagram (left) with some selected edges highlighted in blue. The selected edges create a Diagram of Overlaps (right)

This is a simplified Entity Relationship Diagram (ERD) displayed as a graph. Each node represents a table in your dataset, and each edge represents a column that links two tables. The linked columns are those specified in the referenced_entity_field of the data_dictionary. The direction of an edge represents the reference from a foreign-key column to a primary-key column.
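To make the graph structure concrete, here is a minimal Python sketch of how edges could be derived from a data dictionary. It is an illustration only, not Data Profiler's implementation. The table and column names are hypothetical, and the `table:column` format used for referenced_entity_field values is an assumption.

```python
import csv
import io

# Hypothetical data dictionary excerpt. The "table:column" format of
# referenced_entity_field is an assumption made for this sketch.
DATA_DICTIONARY_CSV = """entity,name,type,primary_key_type,referenced_entity_field
patients,patient_id,string,global,
conditions,condition_id,string,,
conditions,patient_id,string,,patients:patient_id
encounters,encounter_id,string,,
encounters,patient_id,string,,patients:patient_id
"""


def build_edges(dictionary_csv):
    """Return directed edges as (source_table, source_column, target_table, target_column)."""
    edges = []
    for row in csv.DictReader(io.StringIO(dictionary_csv)):
        reference = (row.get("referenced_entity_field") or "").strip()
        if not reference:
            continue  # this column does not reference another table
        target_table, target_column = reference.split(":", 1)
        edges.append((row["entity"], row["name"], target_table, target_column))
    return edges


edges = build_edges(DATA_DICTIONARY_CSV)
for source_table, source_column, target_table, target_column in edges:
    # Each edge points from the foreign-key column to the primary-key column.
    print(f"{source_table}.{source_column} -> {target_table}.{target_column}")
```

A table whose columns never appear in any edge would show up as an unlinked node in such a graph.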

FAQs

Question: Some tables are supposed to be linked to each other. Why do they appear unlinked in Data Profiler?

Answer: The linkage between any two tables is determined by the data_dictionary. Data Profiler does not remove or add linkages to a dataset. Check your data_dictionary and make sure that the linkage is correctly specified.

By clicking on one or more edges, you can view a Diagram of Overlaps that shows how many values the linked columns share between the tables. There are several chart types for a Diagram of Overlaps:

Venn Diagram

The Venn diagram is the default chart type of the Diagram of Overlaps. Each set in this diagram is a table in the selection. The numbers show how many values of the selected column fall into each region of the diagram.

Question: How should I interpret a Venn diagram of two tables, patients and measurements, whose intersection has a value of 90? The column is patient_id.

Answer: The patients and measurements tables share 90 patient_id values. In other words, 90 patients have measurements data.
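The numbers in this example can be reproduced with plain set operations. The sketch below uses made-up patient_id values and assumes that overlaps are counted over distinct values, which this page does not state explicitly.

```python
# Hypothetical patient_id values, for illustration only.
patient_ids = {f"P{i:03d}" for i in range(1, 121)}           # 120 patients
measurement_rows = [f"P{i:03d}" for i in range(1, 91)] * 2   # 90 patients, 2 rows each

# Assumption: repeated values in a column are counted once.
measurement_ids = set(measurement_rows)

both = patient_ids & measurement_ids               # intersection of the two sets
only_patients = patient_ids - measurement_ids      # patients without measurements
only_measurements = measurement_ids - patient_ids  # measurements without a patient row

print(len(both), len(only_patients), len(only_measurements))
```

Here the intersection holds 90 values, matching the scenario in the question above, and 30 patients have no measurements.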

Euler Diagram

Euler diagrams share the same concept as Venn diagrams. The only difference is that the size of each overlap section is proportional to its overlap value.

Upset Plot

The Upset plot counts the values in every non-empty combination of the selected tables. This plot type scales better to many tables than the Venn or Euler diagram.

A common use case of the Upset plot is to help answer questions such as “How many patients have full information across tables?”. By creating an Upset plot between the “patients” table and other tables (e.g. diagnosis, measurement, sequence_run, etc.), we can answer this question by looking at the number of patient ids that are shared across all tables.
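The counting behind such a plot can be sketched in a few lines of Python. This is an illustrative approximation under two assumptions: ids are counted as distinct values, and each id is assigned to exactly one combination of tables. The table contents are hypothetical.

```python
from collections import Counter


def upset_counts(table_ids):
    """Count distinct ids for each exclusive combination of tables."""
    membership = {}
    for table, ids in table_ids.items():
        for value in set(ids):
            membership.setdefault(value, set()).add(table)
    # Each id lands in exactly one combination: the set of tables it appears in.
    return Counter(frozenset(tables) for tables in membership.values())


# Hypothetical patient ids per table.
table_ids = {
    "patients": ["p1", "p2", "p3", "p4", "p5"],
    "diagnosis": ["p1", "p2", "p3"],
    "measurement": ["p1", "p2", "p4"],
}

counts = upset_counts(table_ids)
complete = counts[frozenset(table_ids)]  # ids present in every selected table
print(f"Patients with full information: {complete}")
for combination, count in counts.items():
    print(sorted(combination), count)
```

In this toy data, two patient ids appear in all three tables, so two patients have full information.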

Summary Page

The Summary page provides a summary of both the tables and the columns in the Dataset. Each section is described below.

The summary of all Tables and Columns in the Dataset

Table Summary

The Table Summary shows information about all tables in the dataset. Each row displays various statistics for a table in your dataset, including:

# Columns, # Rows: the number of columns, the number of rows
Column types: data type of all columns in a table
Duplication Rate: the rate of duplication of a whole row in the table
Missing Rate: the rate of having an empty cell in the table
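As a rough illustration of these statistics, the sketch below computes them for a small in-memory table. The exact formulas used by Data Profiler are not given on this page, so the definitions here are assumptions: the missing rate is taken as empty cells divided by all cells, and the duplication rate as the share of rows that repeat an earlier row. The sample rows are hypothetical.

```python
def table_summary(rows, columns):
    """Summarize a table given as a list of dicts (assumed definitions, see above)."""
    n_rows = len(rows)
    n_cells = n_rows * len(columns)
    # A cell counts as missing when it is None or an empty string.
    missing = sum(1 for row in rows for col in columns if row.get(col) in (None, ""))
    distinct_rows = len({tuple(row.get(col) for col in columns) for row in rows})
    return {
        "n_columns": len(columns),
        "n_rows": n_rows,
        "missing_rate": missing / n_cells if n_cells else 0.0,
        "duplication_rate": (n_rows - distinct_rows) / n_rows if n_rows else 0.0,
    }


# Hypothetical rows, for illustration only.
columns = ["patient_id", "gender", "city"]
rows = [
    {"patient_id": "p1", "gender": "F", "city": "Boston"},
    {"patient_id": "p1", "gender": "F", "city": "Boston"},  # duplicate of the first row
    {"patient_id": "p2", "gender": "M", "city": ""},        # empty cell
    {"patient_id": "p3", "gender": None, "city": "Salem"},  # empty cell
]

summary = table_summary(rows, columns)
print(summary)
```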

You can click on the hamburger button at the header of each column to sort or filter the data as needed.

Clicking on the hamburger button to sort or filter the data

Column Summary

The Column Summary provides details about every column in the dataset, with each row presenting the following information for a specific column:

Column name: name of the column
Key type: the attributes that are used to define the relationships of tables
Description: the title of a column (if provided in the data dictionary file)
Provided type: the type of data in the column which is specified in the data dictionary file. If the data dictionary is not provided, it is ‘unknown’
Inferred types: the type of data in the column inferred by Data Profiler if the data dictionary is not provided. If the data dictionary is provided, it will be the same as the Provided type
Missing Rate: the rate of having an empty cell in a column
Duplication Rate: the rate of duplication of values in a column
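The column-level statistics can be sketched in the same spirit. The type-inference rule and the rate definitions below are assumptions made for illustration and are not Data Profiler's actual logic: values are tried as integers, then floats, and otherwise treated as strings, while the duplication rate is computed over non-empty values only.

```python
def infer_type(values):
    """Guess a column type from its non-empty string values (hypothetical rule)."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "unknown"
    for name, cast in (("integer", int), ("float", float)):
        try:
            for value in present:
                cast(value)
            return name
        except ValueError:
            continue  # at least one value does not parse; try the next type
    return "string"


def column_summary(values):
    """Return the inferred type, missing rate, and duplication rate of one column."""
    present = [v for v in values if v not in (None, "")]
    missing_rate = (len(values) - len(present)) / len(values) if values else 0.0
    duplication_rate = (len(present) - len(set(present))) / len(present) if present else 0.0
    return {
        "inferred_type": infer_type(values),
        "missing_rate": missing_rate,
        "duplication_rate": duplication_rate,
    }


# Hypothetical column values, for illustration only.
age_summary = column_summary(["34", "51", "", "34"])
city_summary = column_summary(["Boston", "Salem", "Boston", "Lowell"])
print(age_summary)
print(city_summary)
```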

You can also click on the hamburger button at the header of each column to sort or filter the data as needed.

Resources

Full Documentation

To create a support ticket if there are technical issues:

Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
