Introduction to ML JupyterLab

ML JupyterLab is an app in the AI/ML Accelerator package. A license is required to use the AI/ML Accelerator package. For more information, contact DNAnexus Sales via [email protected].

What is ML JupyterLab?

ML JupyterLab is an exclusive version of JupyterLab on DNAnexus that is designed for machine learning (ML) development on clinical and multi-omics data. It retains the core features of JupyterLab, such as a web-based interactive development environment for notebooks, code, and data, while adding enhancements for ML, data science, and distributed computing. It lets users work efficiently with large datasets through the Ray distributed engine and preinstalled ML libraries.

Why use ML JupyterLab?

ML JupyterLab is the ideal environment for data scientists, researchers, and engineers working on complex ML workflows, large-scale datasets, and distributed computing tasks. Key benefits include:

Ease of Setup: With the pre-configured ML environment, the app eliminates the need for manual installation of ML libraries and tools, reducing setup time.
Scalability for Large Datasets: ML JupyterLab enables users to process massive datasets across distributed clusters utilizing the Ray engine, making it suitable for high-demand ML workloads.
Resource Monitoring: With the Ray dashboard, users can track job performance and use that information to optimize cluster capacity.
Simplified Dependency Installation: Installing and managing libraries is straightforward with automatic detection and resolution of conflicts. This enables users to easily add or update ML libraries without concerns about dependency issues.
Portability: Users can work with data stored in DNAnexus projects without downloading it to the instance, and can easily run the same ML workflow in different projects.
Security and Compliance: ML JupyterLab is built on the DNAnexus platform environment with high levels of security and compliance, making it a trusted solution for industries like healthcare and the life sciences.
Efficient Collaboration: Users can save and share environment configurations, allowing easy replication of workspaces across projects and teams. This saves time and ensures consistency in ML workflows.

Core features of ML JupyterLab

Distributed Engines: ML JupyterLab integrates Ray as the distributed computing engine, allowing users to scale their ML workflows across multiple nodes. This enables efficient handling of large datasets and complex computations, streamlining distributed ML tasks.
Preinstalled Popular ML Packages: ML JupyterLab provides built-in support for popular ML libraries such as Scikit-learn, transformers, XGBoost, LightGBM, TensorFlow, and PyTorch. These preinstalled packages ensure that users have access to the latest tools for building, training, and deploying ML models without needing to install or manage dependencies manually.
Seamless Package Installation and Dependency Management: In addition to preinstalled ML libraries, users can easily install new packages within ML JupyterLab. The environment automatically detects and resolves any dependency conflicts, providing a smooth experience when adding or updating libraries for specific projects. This feature ensures that users can customize their workspace effortlessly without breaking existing configurations.
Save and Share Environment Configurations: Users can save their environment configuration to a custom environment file and share that file across DNAnexus projects or with team members, enabling quick replication of environments for new projects. This feature helps maintain consistency across teams and reduces setup time.
Integrated Data Profiler: With a Data Profiler license, users can launch the Data Profiler app inside an ML JupyterLab notebook via the dxprofiler package. The profiler provides essential statistics such as missing and duplication rates, data distributions, and correlations, allowing users to gain insights into their datasets quickly without additional tools.
Large-Scale Data Processing: ML JupyterLab leverages Modin, a parallel dataframe library compatible with pandas, to efficiently process large-scale datasets. It automatically distributes dataframe operations across the Ray cluster, allowing users to handle large datasets without modifying their existing pandas code.
Direct Access to Data in DNAnexus Projects: Users can read and write data stored in DNAnexus projects directly from ML JupyterLab using fsspec-dnanexus. This avoids a separate download step when handling large datasets and keeps the workflow integrated with the DNAnexus platform.
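
As a rough illustration of the distributed-engine feature above, the following sketch fans a toy function out as Ray tasks. It is not taken from the ML JupyterLab documentation: `score_chunk` and `score_all` are hypothetical names, and the call to `ray.init()` with no arguments is an assumption (how a notebook attaches to the app's Ray cluster is not described here). The sketch falls back to a plain loop when Ray is not installed, so it can also be tried outside the app.

```python
# Minimal sketch: run a per-chunk function as parallel Ray tasks.
# Ray is listed as the distributed engine in ML JupyterLab; elsewhere it may be absent.
try:
    import ray
except ImportError:
    ray = None  # fall back to serial execution below


def score_chunk(chunk):
    """Hypothetical stand-in for per-chunk work: sum of squares."""
    return sum(x * x for x in chunk)


def score_all(chunks):
    """Score every chunk, as Ray tasks when Ray is available."""
    if ray is None:
        return [score_chunk(c) for c in chunks]
    if not ray.is_initialized():
        # Assumption: default initialization is sufficient in this environment.
        ray.init()
    remote_score = ray.remote(score_chunk)          # wrap as a Ray remote function
    futures = [remote_score.remote(c) for c in chunks]  # schedule one task per chunk
    return ray.get(futures)                         # block until all results arrive


# Four chunks of 1000 integers each.
chunks = [list(range(i, i + 1000)) for i in range(0, 4000, 1000)]
results = score_all(chunks)
print(results)
```

Each call to `remote_score.remote(...)` returns a future immediately, and `ray.get` collects the results in submission order, so the output list lines up with `chunks`. On a multi-node cluster the tasks can be scheduled on any worker, which is why the function takes all of its input as an argument rather than reading notebook state.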

Resources

Full Documentation

To create a support ticket for a technical issue:

Go to the Help header inside the platform (in the same section as Projects and Tools).
Select “Contact Support”.
Fill in the Subject and Message to submit a support ticket.
