Introduction to Data Profiler
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via sales@dnanexus.com).
The Data Profiler is an app within the DNAnexus Tool Library that supports data cleaning and harmonization. It organizes your data into three levels of information: Dataset level, Table level, and Column level. Each level surfaces interactive visualizations on data quality, data coverage, and descriptive statistics to help you understand and identify potential data issues. The Data Profiler also includes an Explorer Mode where you can create customizable visualizations using simple drag-and-drop functionality, for deeper exploration beyond the standard metrics. Researchers can bring their data to the Platform and leverage the Data Profiler app to explore and quickly evaluate the readiness of the data for downstream analysis.
The Data Profiler app saves significant time by generating consistent and comprehensive reports on data quality. It supports informed decision-making, allowing experts to fully understand the data before downstream analysis. Profiling data continuously, from data collection and cleaning through feature engineering, shows how the data evolves and helps maintain consistent quality throughout the data transformation process. This helps identify potential issues early, enabling adjustments that optimize analysis and performance.
This tool quickly analyzes and visualizes large datasets provided as CSV files, Parquet files, or a DNAnexus Apollo Dataset (or Cohort). The point-and-click solution efficiently provides summary statistics and visualizations, enabling a comprehensive understanding of the data. It also highlights data inconsistencies and complexities (e.g., missing and imbalanced data) in a logical and organized manner, guiding you through the structure and content of your data.
There are two ways to run the application:
Direct Access: Go to the Data Profiler app page to open the app.
Platform Navigation: In the top navigation bar, select Tools, open the Tool Library, search for the “Data Profiler” app, select it, then select Run within the documentation to start the app.
To run the app, you need to provide the required input files, which are .csv or .parquet files, or a DNAnexus Apollo Dataset (or Cohort).
If you run the app with .csv files or .parquet files, there is an optional input for the Data Dictionary. This is the same Data Dictionary used by Data Model Loader to generate the DNAnexus Apollo Dataset.
| Input name | Mandatory/Optional | Input type/format | Description |
| --- | --- | --- | --- |
| input_files | Optional | A list of CSV, TSV, TXT, or Parquet files | This is the data that will be profiled by this application. Each file is a table in your dataset. Only one of the following two options should be provided: input_files or dx_record. |
| dx_record | Optional | A DNAnexus Apollo Dataset (or Cohort) | The data in this Dataset (or Cohort) will be profiled by this application. |
| data_dictionary | Optional | A CSV file | This file indicates the relationship between the tables in input_files. If not provided, the table relationship will be inferred in the job. |
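The input rules above can be checked before launching a job. The following is a minimal sketch, not part of the Data Profiler app: `check_inputs` is a hypothetical helper, and treating "neither input provided" as an error is an assumption based on the note that only one of input_files and dx_record should be provided.

```python
from typing import Optional, Sequence

# Extensions mirror the formats listed for input_files in the inputs table.
SUPPORTED_EXTENSIONS = (".csv", ".tsv", ".txt", ".parquet")


def check_inputs(input_files: Optional[Sequence[str]] = None,
                 dx_record: Optional[str] = None) -> str:
    """Hypothetical pre-flight check; returns which input mode is in use."""
    has_files = bool(input_files)
    has_record = bool(dx_record)
    if has_files and has_record:
        raise ValueError("Provide input_files or dx_record, not both.")
    if not has_files and not has_record:
        # Assumption: a run needs at least one data source to profile.
        raise ValueError("Provide either input_files or dx_record.")
    if has_files:
        for name in input_files:
            if not name.lower().endswith(SUPPORTED_EXTENSIONS):
                raise ValueError(f"Unsupported file type: {name}")
        return "input_files"
    return "dx_record"


print(check_inputs(input_files=["patients.csv", "encounters.csv"]))
```

The check only looks at file names; it does not open the files or contact the Platform.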
Tables for Inputs
For this example, there are 2 tables in your dataset:
patients.csv: a table with patient IDs and other clinical information about each patient
encounters.csv: a table of encounters (i.e., hospital visits) of all patients in patients.csv
patients.csv

| patient_id | name |
| --- | --- |
| P0001 | John Doe |
| P0002 | Jane Roe |
encounters.csv

| encounter_id | patient_id |
| --- | --- |
| E0001 | P0001 |
| E0002 | P0001 |
| E0003 | P0002 |
| E0004 | P0002 |
In this example dataset, there are 2 patients in patients.csv, and each patient visited the hospital twice.
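To make the relationship between the two example tables concrete, the sketch below rebuilds them in memory with Python's standard `csv` module and counts encounters per patient. It only illustrates the example data; it is not how the Data Profiler itself reads or analyzes files.

```python
import csv
import io
from collections import Counter

# The two example tables, written out as CSV text.
PATIENTS_CSV = """patient_id,name
P0001,John Doe
P0002,Jane Roe
"""

ENCOUNTERS_CSV = """encounter_id,patient_id
E0001,P0001
E0002,P0001
E0003,P0002
E0004,P0002
"""

patients = list(csv.DictReader(io.StringIO(PATIENTS_CSV)))
encounters = list(csv.DictReader(io.StringIO(ENCOUNTERS_CSV)))

# Count hospital visits per patient via the shared patient_id column.
visits_per_patient = Counter(row["patient_id"] for row in encounters)

# Encounters whose patient_id has no matching row in patients.csv.
patient_ids = {row["patient_id"] for row in patients}
orphans = [row["encounter_id"] for row in encounters
           if row["patient_id"] not in patient_ids]

print(dict(visits_per_patient))  # {'P0001': 2, 'P0002': 2}
print(orphans)                   # []
```

Every encounter points to exactly one patient, while a patient can have several encounters, which is the many-to-one relationship between the tables.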
Data Dictionary
Even though data_dictionary is optional, it is crucial for cross-table functions in Data Profiler. We highly recommend creating one for your dataset.
The data_dictionary is a CSV file that tells Data Profiler how to connect patients.csv and encounters.csv. Given this example, the linked column between these tables is patient_id. The data_dictionary can be as simple as:
| entity | name | type | primary_key_type | referenced_entity_field | relationship |
| --- | --- | --- | --- | --- | --- |
| patients | patient_id | string | | | |
| encounters | encounter_id | string | | | |
| encounters | patient_id | string | | patients:patient_id | many_to_one |
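A data_dictionary like the one above can be written by hand or generated. Below is a minimal sketch using Python's standard `csv` module. The cells left empty in the example table stay empty here, and writing the reference as `patients:patient_id` (entity and field joined by a colon, no space) is an assumption about the expected format. `build_data_dictionary` is a hypothetical helper name.

```python
import csv
import io

# Column order follows the example data_dictionary table.
FIELDS = ["entity", "name", "type", "primary_key_type",
          "referenced_entity_field", "relationship"]

# One row per column of each table; keys not listed are written as empty cells.
ROWS = [
    {"entity": "patients", "name": "patient_id", "type": "string"},
    {"entity": "encounters", "name": "encounter_id", "type": "string"},
    {"entity": "encounters", "name": "patient_id", "type": "string",
     # Assumption: "<entity>:<field>" with no space around the colon.
     "referenced_entity_field": "patients:patient_id",
     "relationship": "many_to_one"},
]


def build_data_dictionary(rows):
    """Return the data_dictionary as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


data_dictionary_csv = build_data_dictionary(ROWS)
print(data_dictionary_csv)
```

To save the result, write `data_dictionary_csv` to a file such as `data_dictionary.csv` and provide it as the data_dictionary input.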
There is no need to specify anything in the OUTPUTS section. Once your inputs are ready, click Start Analysis to begin.
In the Review & Start modal, you can either customize the job settings before running the applet or leave them at their default values. The settings you can modify include:
Job Name
Output Location
Priority
Spending Limit
Instance Type
Once you’ve made your adjustments or are satisfied with the default settings, click Launch Analysis to start the job.
After launching the analysis, you will be redirected to the Monitor screen. From there, click the job name to view the job details.
It may take a few minutes for the applet to be ready. To check the status, click View Log and wait for the message indicating that the applet is ready. Once you see the message, click Open Worker URL to launch the app.
Logs from a job instance of Data Profiler indicating the web interface is ready
If this line appears in your job logs, it confirms that the Data Profiler is ready to be accessed through the Job URL.
If you attempt to click the button before the URL is ready, you may encounter a “502 Bad Gateway” error. This is not a problem; it simply means the environment is not fully prepared yet and you need to wait a bit longer.
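The "wait and retry" behavior described above can be sketched as a small polling loop. This is a generic illustration, not a DNAnexus API: `wait_until_ready` and `get_status` are hypothetical names, and how you obtain the HTTP status of the Job URL (including any authentication it requires) is left to the caller and is not covered here.

```python
import time
from typing import Callable


def wait_until_ready(get_status: Callable[[], int],
                     attempts: int = 30,
                     delay_seconds: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the status is no longer 502, or give up after `attempts`.

    get_status is a caller-supplied function (hypothetical) that returns the
    current HTTP status code of the Job URL.
    """
    for attempt in range(attempts):
        if get_status() != 502:
            return True
        # Pause between checks, but not after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Simulated run: two "502 Bad Gateway" responses, then the interface is up.
simulated = iter([502, 502, 200])
ready = wait_until_ready(lambda: next(simulated), attempts=5, delay_seconds=0)
print(ready)  # True
```

Checking the job log for the readiness message remains the approach the documentation describes; the loop only mirrors the advice to wait and try again.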
If you run Data Profiler with a DNAnexus Apollo Dataset (or Cohort), you will be able to select the specific data fields to profile. If you want to profile the whole Dataset, select all data fields and start the job by clicking on the “Start profiling” button.
The table to select columns (data fields) to profile
To create a support ticket if there are technical issues:
Go to the Help header (in the same navigation bar as Projects and Tools) on the Platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
There are more columns in the data_dictionary that are not mentioned in this example. However, those columns are not required. For the full form of the data_dictionary or the meaning of each column, please refer to the data_dictionary documentation.
The Data Profiler is a web application on the DNAnexus Platform, which means it should be accessed via the Job URL. It typically takes a few minutes for the web interface to be ready. If you encounter any issues while visiting the Job URL, you can check the job logs for the following message: