# Spark on ICA Bench

## Running a pyspark application

The JupyterLab environment is configured by default with three additional kernels:

* PySpark – [Local](#pyspark-local)
* PySpark – [Remote](#pyspark-remote)
* PySpark – [Remote – Dynamic](#pyspark-remote-dynamic)

When one of these kernels is selected, the Spark context is automatically initialised and can be accessed through the **sc** object.

<figure><img src="https://3193631692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MWUqIqZhOK_i4HqCUpT%2Fuploads%2Fgit-blob-ac6458176a9e464c5bcb17227a179c38922f6205%2Fimage%20(70).png?alt=media" alt=""><figcaption></figcaption></figure>

### PySpark - Local

The PySpark - Local runtime environment launches the **Spark driver locally** on the workspace node, and all Spark **executors** are created on the **same node**. It does **not require a Spark cluster** and can be used for smaller Spark applications that do not exceed the capacity of a single node.

The Spark configuration can be found at `/data/.spark/local/conf/spark-defaults.conf`.

{% hint style="info" %}
Making changes to the configuration requires a restart of the Jupyter kernel.
{% endhint %}

### PySpark - Remote

The PySpark - Remote runtime environment launches the **Spark driver locally** on the workspace node and interacts with the Manager to schedule tasks onto **executors created across the Bench Cluster**.

This configuration does not dynamically spin up executors, so it will **not** trigger **auto-scaling** when using a Dynamic Bench cluster.

The Spark configuration can be found at `/data/.spark/remote/conf/spark-defaults.conf`.
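Because this profile allocates its executors statically, the executor footprint is typically pinned through Spark's standard properties: `spark.executor.instances` (YARN/Kubernetes) or `spark.cores.max` (standalone). The values below are illustrative, not necessarily the defaults shipped in this file:

```
# YARN/Kubernetes: fixed number of executors for the application
spark.executor.instances 4

# standalone: cap the total cores the application may claim
spark.cores.max 16
```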

{% hint style="info" %}
Making changes to the configuration requires a restart of the Jupyter kernel.
{% endhint %}

### PySpark - Remote - Dynamic

The PySpark - Remote - Dynamic runtime environment launches the **Spark driver locally** on the workspace node and interacts with the Manager to schedule tasks onto **executors created across the Bench Cluster**.

This configuration increases or decreases the number of requested executors as the workload changes, which causes a Dynamic Bench cluster to **auto-scale**.

The Spark configuration can be found at `/data/.spark/remote/conf-dynamic/spark-defaults.conf`.
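This behaviour corresponds to Spark's standard dynamic-allocation properties. An illustrative configuration (not necessarily the defaults shipped in this file) looks like:

```
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 8
spark.dynamicAllocation.shuffleTracking.enabled true
```

With settings like these, the driver requests executors as tasks queue up and releases idle ones, which is what drives the Bench cluster to scale out and back in.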

{% hint style="info" %}
Making changes to the configuration requires a restart of the Jupyter kernel.
{% endhint %}

## Job resources

Every cluster member has a fixed capacity, determined by the [Resource](https://help.ica.illumina.com/project/p-bench/bench-clusters#configuration) model selected for that member.

A Spark application consists of one or more **jobs**. Each job consists of one or more **stages**, and each stage consists of one or more **tasks**. Tasks are handled by executors, and executors run on workers (cluster members).

The following setting defines the number of CPUs needed per task:

```
spark.task.cpus 1
```

The following settings define the size of a single executor, which handles the execution of tasks:

```
spark.executor.cores 4
spark.executor.memory 4g
```

The above example allows an executor to handle 4 tasks concurrently, sharing a total of 4 GB of memory. Depending on the resource model chosen (e.g. standard-2xlarge), a single cluster member (worker node) can run multiple executors concurrently (e.g. 32 cores and 128 GB support 8 concurrent executors on a single cluster member, limited here by cores rather than memory).
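The capacity arithmetic above can be sketched in a few lines of Python (the function names and the node size are illustrative, not part of any ICA API; the node figures mirror the standard-2xlarge example in the text):

```python
def tasks_per_executor(executor_cores, task_cpus):
    """Concurrent task slots per executor: spark.executor.cores / spark.task.cpus."""
    return executor_cores // task_cpus

def executors_per_node(node_cores, node_mem_gb, executor_cores, executor_mem_gb):
    """How many executors of the given size fit on one cluster member.

    The binding constraint is whichever resource (cores or memory) runs out first.
    """
    return min(node_cores // executor_cores, node_mem_gb // executor_mem_gb)

print(tasks_per_executor(4, 1))            # 4 concurrent tasks per executor
print(executors_per_node(32, 128, 4, 4))   # 8 executors on a 32-core / 128 GB node
```

Here the executor count is bound by cores (32 / 4 = 8), not by memory (128 / 4 = 32).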

## Spark User Interface

The Spark UI can be accessed via the cluster; the Web Access URL is displayed on the Workspace details page.

This Spark UI registers all applications submitted through one of the Remote Jupyter kernels. It provides an overview of the registered workers (cluster members) and the applications running in the Spark cluster.

<figure><img src="https://3193631692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MWUqIqZhOK_i4HqCUpT%2Fuploads%2Fgit-blob-ac252984719a8a23448cfc8c6aa503ccb44cc4e8%2Fimage%20(71).png?alt=media" alt=""><figcaption></figcaption></figure>

### Spark Reference documentation

See the [Apache Spark configuration reference](https://spark.apache.org/docs/3.5.6/configuration.html).
