Introduction to Data Profiler

A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).

What is the Data Profiler?

The Data Profiler is an app within the DNAnexus Tool Library that supports data cleaning and harmonization. It organizes your data into three levels of information: Dataset level, Table level, and Column level. Each level surfaces interactive visualizations of data quality, data coverage, and descriptive statistics to help you understand your data and identify potential issues. The Data Profiler also includes an Explorer Mode, where you can create customizable visualizations using simple drag-and-drop functionality for deeper exploration beyond the standard metrics. Researchers can bring their data to the Platform and use the Data Profiler app to explore it and quickly evaluate its readiness for downstream analysis.

Why use the Data Profiler?

The Data Profiler app saves significant time by generating consistent and comprehensive reports on data quality. It supports informed decision-making by allowing experts to fully understand the data before downstream analysis. Profiling data continuously, from data collection and cleaning through feature engineering, shows how the data evolves and helps maintain consistent quality throughout the data transformation process. It also helps identify potential issues early, enabling adjustments that optimize analysis and performance.

Core features of Data Profiler

This tool quickly analyzes and visualizes large datasets supplied as CSV files, Parquet files, or a DNAnexus Apollo Dataset (or Cohort). The point-and-click solution efficiently provides summary statistics and visualizations, enabling a comprehensive understanding of the data. It also highlights data inconsistencies and complexities (e.g., missing and imbalanced data) in a logical and organized manner, guiding you through the structure and content of your data.
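To make "missing and imbalanced data" concrete, the following is a minimal sketch of the kind of column-level summary a profiling tool reports. It is not the Data Profiler's implementation; the sample table, the column names, and the `column_summary` helper are all hypothetical and exist only for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample table; the empty "smoker" cell stands for a missing value.
SAMPLE = """patient_id,sex,smoker
P0001,F,no
P0002,F,
P0003,M,no
P0004,F,no
"""


def column_summary(rows, column):
    """Return the missing fraction and the value counts for one column."""
    values = [row[column] for row in rows]
    present = [value for value in values if value != ""]
    missing_fraction = 1 - len(present) / len(values) if values else 0.0
    return {"missing_fraction": missing_fraction, "counts": Counter(present)}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for name in rows[0]:
    print(name, column_summary(rows, name))
```

In this sample, `smoker` is missing for one row in four, and `sex` is imbalanced (three `F` values against one `M`). These are the patterns the app surfaces through its visualizations.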

Getting Started

Access to the App

There are two ways to run the application:

  1. Direct Access: Go to this link to open the app.

  2. Platform Navigation: In the top navigation bar, click Tools and open the Tool Library. Search for the “Data Profiler” app, select it, then select Run on the app's documentation page to start the app.

Inputs

To run the app, you need to provide the required input: either .csv or .parquet files, or a DNAnexus Apollo Dataset (or Cohort).

If you run the app with .csv files or .parquet files, there is an optional input for the Data Dictionary. This is the same Data Dictionary used by Data Model Loader to generate the DNAnexus Apollo Dataset.

| Input name | Mandatory/Optional | Input type/format | Description |
|---|---|---|---|
| input_files | Optional | A list of CSV, TSV, TXT, or parquet files | This is the data that will be profiled by this application. Each file is a table in your dataset. Only one of the following two options should be provided: input_files and dx_record. |
| dx_record | Optional | A DNAnexus Apollo Dataset (or Cohort) | The data in this Dataset (or Cohort) will be profiled by this application. |
| data_dictionary | Optional | A CSV file | This file indicates the relationship between the tables in input_files. If not provided, the table relationship will be inferred in the job. |
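The input rules above can be checked before launching a job. The sketch below is a hypothetical pre-flight helper, not part of the app: the `check_inputs` function and its error messages are assumptions, and it treats "only one of input_files and dx_record" as "exactly one", since the app needs some data to profile.

```python
def check_inputs(input_files=None, dx_record=None, data_dictionary=None):
    """Validate a planned set of inputs; return which data source is used.

    Hypothetical helper: the parameter names mirror the documented input
    names, but the checks are an interpretation of the documented rules.
    """
    has_files = bool(input_files)
    has_record = dx_record is not None
    # Exactly one data source is expected: files or a Dataset/Cohort record.
    if has_files == has_record:
        raise ValueError("Provide exactly one of input_files or dx_record.")
    # Assumption: data_dictionary describes tables in input_files, so it is
    # flagged when no files are given.
    if data_dictionary is not None and not has_files:
        raise ValueError("data_dictionary applies only together with input_files.")
    return "input_files" if has_files else "dx_record"


print(check_inputs(input_files=["patients.csv", "encounters.csv"],
                   data_dictionary="data_dictionary.csv"))
```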

Tables for Inputs

For this example, there are 2 tables in your dataset:

  • patients.csv: a table with patient IDs and other clinical information about each patient

  • encounters.csv: a table of encounters (i.e., hospital visits) for all patients in patients.csv

patients.csv

| patient_id | name |
|---|---|
| P0001 | John Doe |
| P0002 | Jane Roe |

encounters.csv

| encounter_id | patient_id |
|---|---|
| E0001 | P0001 |
| E0002 | P0001 |
| E0003 | P0002 |
| E0004 | P0002 |

In this example dataset, patients.csv contains 2 patients, and each patient visited the hospital twice.
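The relationship between the two example tables can be verified with a few lines of Python. This is a minimal sketch that embeds the example rows as strings rather than reading files. It checks that every encounter refers to a known patient and counts the visits per patient.

```python
import csv
import io
from collections import Counter

# The two example tables from this section, embedded as CSV text.
PATIENTS_CSV = """patient_id,name
P0001,John Doe
P0002,Jane Roe
"""

ENCOUNTERS_CSV = """encounter_id,patient_id
E0001,P0001
E0002,P0001
E0003,P0002
E0004,P0002
"""

patients = list(csv.DictReader(io.StringIO(PATIENTS_CSV)))
encounters = list(csv.DictReader(io.StringIO(ENCOUNTERS_CSV)))

# Encounters whose patient_id does not appear in patients.csv.
known_ids = {patient["patient_id"] for patient in patients}
orphans = [e["encounter_id"] for e in encounters if e["patient_id"] not in known_ids]

# Number of encounters per patient (the "many" side of the relationship).
visits = Counter(e["patient_id"] for e in encounters)

print("orphan encounters:", orphans)        # orphan encounters: []
print("visits per patient:", dict(visits))  # visits per patient: {'P0001': 2, 'P0002': 2}
```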

Data Dictionary

Even though data_dictionary is optional, it is crucial for cross-table functions in Data Profiler. We highly recommend creating one for your dataset.

The data_dictionary is a CSV file that tells Data Profiler how to connect patients.csv and encounters.csv. Given this example, the linked column between these tables is patient_id. The data_dictionary can be as simple as:

| entity | name | type | primary_key_type | referenced_entity_field | relationship |
|---|---|---|---|---|---|
| patients | patient_id | string | | | |
| encounters | encounter_id | string | | | |
| encounters | patient_id | string | | patients:patient_id | many_to_one |

There are more columns in the data_dictionary that are not mentioned in this example. However, those columns are not required. If you are interested in the full form of data_dictionary or the meaning of each column, please visit this documentation.
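The example data_dictionary can also be generated programmatically. The sketch below uses only the six columns shown in this example and leaves cells empty where the example shows no value. It writes the reference as `patients:patient_id` without a space, which is an assumption about the expected format. The `render_data_dictionary` helper is hypothetical.

```python
import csv
import io

# Column order as shown in the example data_dictionary.
FIELDS = ["entity", "name", "type", "primary_key_type",
          "referenced_entity_field", "relationship"]

# One row per column of each table; keys left out become empty cells.
ROWS = [
    {"entity": "patients", "name": "patient_id", "type": "string"},
    {"entity": "encounters", "name": "encounter_id", "type": "string"},
    {"entity": "encounters", "name": "patient_id", "type": "string",
     "referenced_entity_field": "patients:patient_id",  # assumed entity:field form
     "relationship": "many_to_one"},
]


def render_data_dictionary(rows):
    """Return the data dictionary as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


text = render_data_dictionary(ROWS)
print(text)
```

To produce a file for upload, write `text` to a path such as `data_dictionary.csv` (the file name is only an example).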

There is no need to specify anything in the OUTPUTS section. Once your inputs are ready, click Start Analysis to begin.

Job Settings

In the Review & Start modal, you can either customize the job settings before running the applet or leave them at their default values. The settings you can modify include:

  • Job Name

  • Output Location

  • Priority

  • Spending Limit

  • Instance Type

Once you’ve made your adjustments or are satisfied with the default settings, click Launch Analysis to start the job.

Opening the App

After launching the analysis, you will be redirected to the Monitor screen. From there, click the job name to view the job details.

It may take a few minutes for the applet to be ready. To check the status, click View Log and wait for the message indicating that the applet is ready. Once you see the message, click Open Worker URL to launch the app.

The Data Profiler is an HTTPS application on the DNAnexus Platform, which means it should be accessed via the Job URL. It typically takes a few minutes for the web interface to be ready. If you encounter any issues while visiting the Job URL, you can check the job logs for the following message:

[Figure: Logs from a job instance of Data Profiler indicating the web interface is ready]

If this line appears in your job logs, it confirms that the Data Profiler is ready to be accessed through the Job URL.

If you attempt to click the button before the URL is ready, you may encounter a “502 Bad Gateway” error. This is not a problem; it simply means you need to wait a bit longer until the environment is fully prepared.
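If you prefer to wait programmatically rather than refresh the page, the retry logic can be sketched as below. This is an illustration under assumptions, not a supported client: the `http_status` and `wait_until_ready` helpers are hypothetical, the Job URL normally requires an authenticated browser session (so a plain request may return a status other than 200 even when the app is up), and connection failures raised as `URLError` are not handled.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def wait_until_ready(probe, attempts=30, delay_seconds=10, sleep=time.sleep):
    """Call probe() until it stops returning 502.

    Returns True as soon as a non-502 status is seen, or False after
    all attempts are used up.
    """
    for attempt in range(attempts):
        if probe() != 502:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Example usage (JOB_URL is a placeholder for your own Job URL):
# ready = wait_until_ready(lambda: http_status(JOB_URL))
```

Passing the probe as a callable keeps the retry loop independent of how the status is obtained, so it can be exercised without network access.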

Selecting the data fields to profile

If you run Data Profiler with a DNAnexus Apollo Dataset (or Cohort), you will be able to select the specific data fields to profile. If you want to profile the whole Dataset, select all data fields and start the job by clicking on the “Start profiling” button.

[Figure: The table to select columns (data fields) to profile]

Resources

Full Documentation

To create a support ticket for a technical issue:

  1. Go to the Help header (in the same section as Projects and Tools) on the Platform

  2. Select “Contact Support”

  3. Fill in the Subject and Message to submit a support ticket.
