Column Level Screen

A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via sales@dnanexus.com).

A Note on Data:

The data used in this section of the Academy documentation can be downloaded from https://synthea.mitre.org/downloads

The citation for this synthetic dataset is:

Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007

String Column

Column-level screen shows a string column

For columns containing string data, the data profiler will display several statistics and charts to help analyze the data.

The statistics include:

  • The missing rate, expressed as a percentage of the missing values in the column.

  • The number of unique values present in the column.

The charts provided include:

  • Top Records Bar Chart: This chart displays the top values that occur most frequently in the column. You can select how many top records to display using a dropdown list. By hovering over the bars, you can see the exact count of records for each value.

  • Character Length Distribution Chart: This chart shows how the lengths of the strings are distributed. By hovering over different parts of the chart, you can view the range of character lengths and how frequently each range occurs. The chart also reports the average string length in the column and the standard deviation, which measures how much the string lengths vary.

  • Boxplot: The boxplot provides a visual summary of the data in terms of its distribution, showing the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.

  • Grouping Frequency Chart: This chart displays how often unique values in the current column occur when grouped with values from another column. You can choose the column to group by using a dropdown list.
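
As a rough illustration of what the string statistics above represent, the sketch below recomputes them for a small made-up column using only the Python standard library. The function name `profile_string_column`, the use of `None` for missing values, and the sample data are all hypothetical. Data Profiler's own computation may differ, for example in how it treats empty strings or in which standard deviation formula it uses.

```python
from collections import Counter
from statistics import mean, pstdev


def profile_string_column(values, top_n=3):
    """Summarize a string column; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    lengths = [len(v) for v in present]
    return {
        # Missing rate as a percentage of all rows.
        "missing_rate_pct": 100.0 * (len(values) - len(present)) / len(values),
        # Number of distinct non-missing values.
        "unique_count": len(set(present)),
        # Most frequent values, analogous to the Top Records Bar Chart.
        "top_records": Counter(present).most_common(top_n),
        # Average string length and its population standard deviation.
        "mean_length": mean(lengths),
        "std_length": pstdev(lengths),
    }


sample = ["M", "F", "F", None, "F", "M", "unknown", None]
summary = profile_string_column(sample)
print(summary)
```

Here `top_n` plays the role of the dropdown that selects how many top records to display.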

Float & Integer

Column-level screen shows a float column

For columns containing float or integer data, the data profiler provides several statistics and charts to help analyze the data.

The statistics include:

  • The missing rate, displayed as a percentage of missing values.

  • The standard deviation, which measures the spread of the data values.

  • The interquartile range, which is the difference between the 75th and 25th percentiles of the data.

The charts provided include:

  • Distribution Chart: This chart displays the distribution of values in the column. You can hover over the chart to view the range of values and their frequencies.

  • Boxplot: The boxplot visually represents the distribution of the data, showing the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.

  • Grouping Frequency Chart (Two Way Plot): This chart shows the frequency of unique values in the current column, grouped with values from another column. You can select the column for grouping from a dropdown list.
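
To make the numeric statistics above concrete, here is a minimal sketch that computes the missing rate, standard deviation, interquartile range, and the five values a boxplot displays. It is an illustration under stated assumptions, not Data Profiler's implementation: the function name `profile_numeric_column` is hypothetical, missing values are represented as `None`, the sample standard deviation is used, and quartiles use the "inclusive" interpolation method. Other tools may pick different quartile conventions and give slightly different numbers.

```python
from statistics import quantiles, stdev


def profile_numeric_column(values):
    """Summarize a numeric column; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    # Quartiles with the "inclusive" method; other conventions exist.
    q1, median, q3 = quantiles(present, n=4, method="inclusive")
    return {
        "missing_rate_pct": 100.0 * (len(values) - len(present)) / len(values),
        # Sample standard deviation of the non-missing values.
        "std": stdev(present),
        # Interquartile range: 75th percentile minus 25th percentile.
        "iqr": q3 - q1,
        # Minimum, Q1, median, Q3, maximum, as shown in the boxplot.
        "boxplot": (min(present), q1, median, q3, max(present)),
    }


sample = [1.0, 2.0, 3.0, 4.0, 5.0, None, 6.0, 7.0, 8.0, 9.0]
stats = profile_numeric_column(sample)
print(stats)
```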

Datetime

Column-level screen shows a datetime column

For columns containing datetime data, the data profiler provides several statistics and charts for in-depth analysis.

The statistics include:

  • The missing rate, displayed as a percentage of missing values.

  • The standard deviation, measuring the dispersion of the datetime values.

  • The Mode, showing the mode/format of the datetime data in the column.

The charts provided include:

  • Distribution Chart: This chart shows the distribution of datetime values in the column. You can hover over the chart to view the range of values and their frequencies.

  • Boxplot: The boxplot visually represents the distribution of the datetime data, displaying the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.

  • Radar Chart: This chart displays the frequency of values, grouped by year, month, or day. You can change the grouping option using the dropdown at the top.

  • Grouping Frequency Chart (Two Way Plot): This chart shows the frequency of unique datetime values in the current column, grouped with values from another column. You can select the column for grouping from a dropdown list.
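
The grouping behind the Radar Chart can be pictured as counting datetime values by one calendar component at a time. The sketch below is a hypothetical illustration: the function name `group_frequencies` and the sample dates are made up, and it assumes that "day" means day of the month, which the text above does not specify.

```python
from collections import Counter
from datetime import datetime


def group_frequencies(timestamps, by="month"):
    """Count datetime values by calendar year, month, or day of month."""
    if by not in ("year", "month", "day"):
        raise ValueError(f"unsupported grouping: {by}")
    # Skip missing values (None) and read the chosen component of each date.
    return Counter(getattr(ts, by) for ts in timestamps if ts is not None)


sample = [
    datetime(2020, 3, 14),
    datetime(2020, 3, 20),
    datetime(2020, 4, 2),
    datetime(2021, 3, 14),
]
monthly = group_frequencies(sample, by="month")
yearly = group_frequencies(sample, by="year")
print(monthly, yearly)
```

Changing the `by` argument corresponds to changing the grouping option in the dropdown at the top of the chart.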

Pairwise plot between columns

Although each column type has a different layout on the Column-level Screen, the pairwise plot between columns is a component common to all of them.

You can create a plot between the current column and any other column from the same table. However, not all columns are available for this feature. Data Profiler lists a column only if it satisfies one of the following conditions:

  • It is not a string column

  • It is a string column that meets both of these requirements:

    • It is not a primary key

    • It has no more than 30 unique values
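
One reading of the conditions above can be written as a small predicate. This is a sketch of the rule as described in this section, not Data Profiler's code: the `ColumnInfo` class, its field names, and the dtype labels are hypothetical, and it assumes a string column must meet both sub-conditions.

```python
from dataclasses import dataclass

MAX_UNIQUE_STRING_VALUES = 30  # threshold stated in the conditions above


@dataclass
class ColumnInfo:
    """Hypothetical column metadata used only for this illustration."""
    name: str
    dtype: str  # e.g. "string", "float", "integer", "datetime"
    is_primary_key: bool
    unique_count: int


def is_pairwise_candidate(col):
    """Return True if the column would be offered for a pairwise plot."""
    if col.dtype != "string":
        return True
    # String columns must not be primary keys and must have few unique values.
    return (not col.is_primary_key) and col.unique_count <= MAX_UNIQUE_STRING_VALUES


columns = [
    ColumnInfo("patient_id", "string", True, 10000),
    ColumnInfo("gender", "string", False, 2),
    ColumnInfo("city", "string", False, 450),
    ColumnInfo("age", "integer", False, 90),
]
candidates = [c.name for c in columns if is_pairwise_candidate(c)]
print(candidates)
```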

Resources

To create a support ticket for a technical issue:

  1. Go to the Help header (in the same section as Projects and Tools) inside the platform.

  2. Select "Contact Support".

  3. Fill in the Subject and Message to submit a support ticket.

Full Documentation
sales@dnanexus.com
https://synthea.mitre.org/downloads
https://doi.org/10.1016/j.ibmed.2020.100007