# Introduction to Data Profiler

*A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via <sales@dnanexus.com>).*

### What is the Data Profiler?

The Data Profiler is an app within the DNAnexus Tool Library that supports data cleaning and harmonization. It organizes your data into three levels of information: Dataset level, Table level, and Column level. Each level surfaces interactive visualizations on data quality, data coverage, and descriptive statistics to help you understand your data and identify potential issues. The Data Profiler also includes an Explorer Mode, where you can create customizable visualizations using simple drag-and-drop functionality for deeper exploration beyond the standard metrics. Researchers can bring their data to the Platform and use the Data Profiler app to explore the data and quickly evaluate its readiness for downstream analysis.

### Why use the Data Profiler?

The Data Profiler app saves significant time by generating consistent and comprehensive reports on data quality. It supports informed decision-making by allowing experts to fully understand the data before downstream analysis. Profiling data continuously, from data collection and cleaning through feature engineering, shows how the data evolves and helps maintain consistent quality throughout the transformation process. This helps identify potential issues early, enabling adjustments that optimize analysis and performance.

### Core features of Data Profiler

This tool quickly analyzes and visualizes large datasets provided as CSV files, Parquet files, or a DNAnexus Apollo Dataset (or Cohort). The point-and-click solution efficiently provides summary statistics and visualizations, enabling a comprehensive understanding of the data. It also highlights data inconsistencies and complexities (e.g., missing and imbalanced data) in a logical and organized manner, guiding you through the structure and content of your data.

### Getting Started

#### Access to the App

There are two ways to run the application:

1. Direct Access: Go to [this link](https://platform.dnanexus.com/app/data-profiler) to open the app.
2. Platform Navigation: In the top navigation bar, select Tools, open the Tool Library, search for the “Data Profiler” app, select it, then select Run within the documentation to start the app.

   <figure><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfvpZ1-x2HDETSbFampKwksOgtJiFGXH4f83iXNINyr-baFsImadU7sCIqmt6E6syVgrvT1e4VNai2TYT28y5Bf1VOBIeuSPcP0mXbwgDQAGBf42ssTd0FjOzg-HpNiH8cZ-b8dYC8UUXqDHNQm2KE?key=T_OUW-aqDrRPdE-yzmUnjwHD" alt=""><figcaption></figcaption></figure>

#### Inputs

To run the app, you need to provide the required input: either .csv or .parquet files, or a DNAnexus Apollo Dataset (or Cohort).

If you run the app with .csv files or .parquet files, there is an optional input for the Data Dictionary. This is the same Data Dictionary used by Data Model Loader to generate the DNAnexus Apollo Dataset.


| Input name       | Mandatory/Optional | Input type/format                         | Description                                                                                                                                                 |
| ---------------- | ------------------ | ----------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| input\_files     | Optional           | A list of CSV, TSV, TXT, or Parquet files | This is the data that will be profiled by this application. Each file is a table in your dataset. Provide only one of input\_files and dx\_record.            |
| dx\_record       | Optional           | A DNAnexus Apollo Dataset (or Cohort)     | The data in this Dataset (or Cohort) will be profiled by this application. Provide only one of input\_files and dx\_record.                                  |
| data\_dictionary | Optional           | A CSV file                                | <p>This file indicates the relationship between the tables in input\_files.</p><p>If not provided, the table relationship will be inferred in the job.</p> |
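The input rules in the table can be sketched as a small helper that assembles a `dx run` command line. This is a minimal sketch under stated assumptions: the app name `data-profiler` is inferred from the app URL on this page, the helper `build_dx_run_args` is hypothetical, and the command is only built as a list of arguments, not executed.

```python
from typing import List, Optional


def build_dx_run_args(
    input_files: Optional[List[str]] = None,
    dx_record: Optional[str] = None,
    data_dictionary: Optional[str] = None,
    app_name: str = "data-profiler",  # assumption: inferred from the app URL
) -> List[str]:
    """Build an argument list for `dx run`, enforcing the input table's rule."""
    has_files = bool(input_files)
    has_record = dx_record is not None
    # The table allows only one of input_files and dx_record.
    if has_files == has_record:
        raise ValueError("Provide exactly one of input_files or dx_record.")

    args = ["dx", "run", app_name]
    if has_files:
        # Array inputs are passed by repeating the -i flag once per file.
        args += [f"-iinput_files={path}" for path in input_files]
    else:
        args.append(f"-idx_record={dx_record}")
    if data_dictionary is not None:
        args.append(f"-idata_dictionary={data_dictionary}")
    return args


if __name__ == "__main__":
    print(" ".join(build_dx_run_args(
        input_files=["patients.csv", "encounters.csv"],
        data_dictionary="data_dictionary.csv",
    )))
```

The file names are placeholders; on the Platform you would substitute the paths or IDs of your own files or Dataset record.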

**Tables for Inputs**

For this example, there are 2 tables in your dataset:

* patients.csv: a table with patient IDs and other clinical information of the patient
* encounters.csv: a table of encounters (i.e., hospital visits) of all patients in patients.csv

patients.csv

| patient\_id | name     |
| ----------- | -------- |
| P0001       | John Doe |
| P0002       | Jane Roe |

encounters.csv

| encounter\_id | patient\_id |
| ------------- | ----------- |
| E0001         | P0001       |
| E0002         | P0001       |
| E0003         | P0002       |
| E0004         | P0002       |

In this example dataset, there are 2 patients in patients.csv, and each patient visited the hospital twice.
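To make the relationship between the two example tables concrete, the following sketch loads the same rows with Python's standard `csv` module, counts encounters per patient, and checks that every encounter points at a known patient. It only illustrates the example data and is not part of the Data Profiler itself.

```python
import csv
import io
from collections import Counter

# The two example tables, written out as CSV text.
PATIENTS_CSV = """patient_id,name
P0001,John Doe
P0002,Jane Roe
"""

ENCOUNTERS_CSV = """encounter_id,patient_id
E0001,P0001
E0002,P0001
E0003,P0002
E0004,P0002
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


patients = read_rows(PATIENTS_CSV)
encounters = read_rows(ENCOUNTERS_CSV)

# Count how many encounters reference each patient_id.
visits_per_patient = Counter(row["patient_id"] for row in encounters)

# Encounters whose patient_id does not exist in patients.csv.
known_ids = {row["patient_id"] for row in patients}
orphans = [row["encounter_id"] for row in encounters
           if row["patient_id"] not in known_ids]

print(dict(visits_per_patient))  # {'P0001': 2, 'P0002': 2}
print(orphans)                   # []
```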

**Data Dictionary**

Even though data\_dictionary is optional, it is crucial for cross-table functions in Data Profiler. We highly recommend creating one for your dataset.

The data\_dictionary is a CSV file that tells Data Profiler how to connect patients.csv and encounters.csv. Given this example, the linked column between these tables is patient\_id. The data\_dictionary can be as simple as:

| entity     | name          | type   | primary\_key\_type | referenced\_entity\_field | relationship  |
| ---------- | ------------- | ------ | ------------------ | ------------------------- | ------------- |
| patients   | patient\_id   | string |                    |                           |               |
| encounters | encounter\_id | string |                    |                           |               |
| encounters | patient\_id   | string |                    | patients:patient\_id      | many\_to\_one |

There are more columns in the data\_dictionary that are not mentioned in this example. However, those columns are not required. If you are interested in the full form of data\_dictionary or the meaning of each column, please visit this [documentation](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#table-1-data-dictionary-file-description-data_dictionary.csv).
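As a rough illustration, the minimal data dictionary for this example can be produced with Python's `csv` module. This sketch assumes the entity names match the file names (`patients`, `encounters`) and that the reference is written as `entity:field`; the helper names `render_data_dictionary` and `foreign_keys` are hypothetical and exist only for this example.

```python
import csv
import io

# Column order used in the minimal example dictionary.
FIELDNAMES = [
    "entity", "name", "type",
    "primary_key_type", "referenced_entity_field", "relationship",
]

ROWS = [
    {"entity": "patients", "name": "patient_id", "type": "string"},
    {"entity": "encounters", "name": "encounter_id", "type": "string"},
    {
        "entity": "encounters", "name": "patient_id", "type": "string",
        # Assumed spelling of the link to the parent table's column.
        "referenced_entity_field": "patients:patient_id",
        "relationship": "many_to_one",
    },
]


def render_data_dictionary(rows):
    """Return the dictionary as CSV text; missing cells are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES,
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def foreign_keys(rows):
    """List (child_entity, child_field, parent_entity, parent_field) links."""
    links = []
    for row in rows:
        ref = row.get("referenced_entity_field", "")
        if ref:
            parent_entity, parent_field = (p.strip() for p in ref.split(":", 1))
            links.append((row["entity"], row["name"], parent_entity, parent_field))
    return links


if __name__ == "__main__":
    print(render_data_dictionary(ROWS))
    print(foreign_keys(ROWS))
```

Writing the returned text to a file such as `data_dictionary.csv` gives a file you could supply as the data\_dictionary input. Consult the linked documentation for the authoritative column definitions.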

There is no need to specify anything in the OUTPUTS section. Once your inputs are ready, click Start Analysis to begin.

### Job Settings

In the Review & Start modal, you can either customize the job settings before running the app or leave them at their default values. The settings you can modify include:

* Job Name
* Output Location
* Priority
* Spending Limit
* Instance Type

Once you’ve made your adjustments or are satisfied with the default settings, click Launch Analysis to start the job.

### Opening the App

After launching the analysis, you will be redirected to the Monitor screen. From there, click the job name to view the job details.

It may take a few minutes for the app to be ready. To check the status, click View Log and wait for the message indicating that the app is ready. Once you see the message, click Open Worker URL to launch the app.

The Data Profiler is an [HTTPS application](https://documentation.dnanexus.com/developer/apps/https-applications) on the DNAnexus Platform, which means it should be accessed via the Job URL. It typically takes a few minutes for the web interface to be ready. If you encounter any issues while visiting the Job URL, you can check the job logs for the following message:

<figure><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeAmNWMz_mWrvlqqnrkZGpPDi5TuEwuIlQluqBlB7snM0Jqw46n8hI9Q_pFmh0JmbUw1iyEI6n7f3FUZ4GedzyG3fJwb7SMbhwdzugAG6BhdCWRY_QCsvszLLGvMxVHgjYaS4tXK9xX3AA35vphNSc?key=T_OUW-aqDrRPdE-yzmUnjwHD" alt=""><figcaption>Logs from a job instance of Data Profiler indicating the web interface is ready</figcaption></figure>

If this line appears in your job logs, it confirms that the Data Profiler is ready to be accessed through the Job URL.

If you click Open Worker URL before the URL is ready, you may encounter a “502 Bad Gateway” error. This is not a problem; it simply means the environment is not fully prepared yet and you need to wait a bit longer.
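The "wait and try again" behavior described above can be sketched as a simple retry loop. This is only an illustration: `wait_until_ready` and its `probe` callable are hypothetical, and the sketch does not show how to reach the Job URL itself, which in practice requires being logged in to the Platform.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], int],
    attempts: int = 30,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns 200; keep waiting while it returns 502.

    probe is a hypothetical callable that returns the HTTP status code
    observed when requesting the Job URL.
    """
    for attempt in range(attempts):
        status = probe()
        if status == 200:
            return True
        if status != 502:
            # Anything other than 502 is unexpected and worth investigating.
            raise RuntimeError(f"Unexpected HTTP status: {status}")
        if attempt < attempts - 1:
            sleep(delay_seconds)  # 502 means "not ready yet": wait and retry
    return False


if __name__ == "__main__":
    # Simulated probe: two 502 responses, then the interface is ready.
    responses = iter([502, 502, 200])
    print(wait_until_ready(lambda: next(responses), delay_seconds=0.0))
```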

### Selecting the data fields to profile

If you run Data Profiler with a DNAnexus Apollo Dataset (or Cohort), you will be able to select the specific data fields to profile. If you want to profile the whole Dataset, select all data fields and start the job by clicking on the “Start profiling” button.

<figure><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXd6eawIif24GZmgfM3qABKdTEV2JXGYhJCSv7J7B91A3oLynN31ieEyUWfXx-Vd2dpDK1DelV6c_O19TZd9yEynIXuWfTZ_Oi1dnvCXc3ne1E0j6bOxFizljPB2OdxOs5JHG0yBTjYeoFaBSJ-CnOI?key=T_OUW-aqDrRPdE-yzmUnjwHD" alt=""><figcaption>The table to select columns (data fields) to profile</figcaption></figure>

### Resources

[Full Documentation](https://documentation.dnanexus.com/)

To create a support ticket for technical issues:

1. Go to the Help header (in the same section as Projects and Tools) inside the Platform
2. Select “Contact Support”
3. Fill in the Subject and Message to submit a support ticket.

