# Spark on ICA Bench

## Running a pyspark application

The JupyterLab environment is configured by default with three additional kernels:

* PySpark – [Local](#pyspark-local)
* PySpark – [Remote](#pyspark-remote)
* PySpark – [Remote – Dynamic](#pyspark-remote-dynamic)

When one of these kernels is selected, the Spark context is automatically initialised and can be accessed through the **sc** object.

<figure><img src="https://3193631692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MWUqIqZhOK_i4HqCUpT%2Fuploads%2Fgit-blob-ac6458176a9e464c5bcb17227a179c38922f6205%2Fimage%20(70).png?alt=media" alt=""><figcaption></figcaption></figure>

### PySpark - Local

The PySpark - Local runtime environment launches the **Spark driver locally** on the workspace node, and all Spark **executors** are created on the **same node**. It does **not require a Spark cluster** and can be used for smaller Spark applications that do not exceed the capacity of a single node.

The Spark configuration can be found at `/data/.spark/local/conf/spark-defaults.conf`.

{% hint style="info" %}
Making changes to the configuration requires a restart of the Jupyter kernel.
{% endhint %}

### PySpark - Remote

The PySpark - Remote runtime environment launches the **Spark driver locally** on the workspace node and interacts with the Manager to schedule tasks onto **executors created across the Bench Cluster**.

This configuration does not dynamically spin up executors, so it will **not** trigger **auto-scaling** when using a Dynamic Bench cluster.

The Spark configuration can be found at `/data/.spark/remote/conf/spark-defaults.conf`.
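Because this profile allocates its executors statically, the executor footprint is typically pinned through Spark's standard properties: `spark.executor.instances` (YARN/Kubernetes) or `spark.cores.max` (standalone). The values below are illustrative, not necessarily the defaults shipped in this file:

```
# YARN/Kubernetes: fixed number of executors for the application
spark.executor.instances 4

# standalone: cap the total cores the application may claim
spark.cores.max 16
```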

{% hint style="info" %}
Making changes to the configuration requires a restart of the Jupyter kernel.
{% endhint %}

### PySpark - Remote - Dynamic

The PySpark - Remote - Dynamic runtime environment launches the **Spark driver locally** on the workspace node and interacts with the Manager to schedule tasks onto **executors created across the Bench Cluster**.

This configuration increases or decreases the number of requested executors as the workload changes, which causes a Dynamic Bench cluster to **auto-scale**.

The Spark configuration can be found at `/data/.spark/remote/conf-dynamic/spark-defaults.conf`.
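This behaviour corresponds to Spark's standard dynamic-allocation properties. An illustrative configuration (not necessarily the defaults shipped in this file) looks like:

```
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 8
spark.dynamicAllocation.shuffleTracking.enabled true
```

With settings like these, the driver requests executors as tasks queue up and releases idle ones, which is what drives the Bench cluster to scale out and back in.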

{% hint style="info" %}
Making changes to the configuration requires a restart of the Jupyter kernel.
{% endhint %}

## Job resources

Every cluster member has a fixed capacity, determined by the [Resource](https://help.ica.illumina.com/project/p-bench/bench-clusters#configuration) model selected for that member.

A Spark application consists of one or more **jobs**. Each job consists of one or more **stages**, and each stage consists of one or more **tasks**. Tasks are handled by executors, and executors run on workers (cluster members).

The following setting defines the number of CPUs needed per task:

```
spark.task.cpus 1
```

The following settings define the size of a single executor, which handles the execution of tasks:

```
spark.executor.cores 4
spark.executor.memory 4g
```

The above example allows an executor to handle 4 tasks concurrently, sharing a total of 4 GB of memory. Depending on the resource model chosen (e.g. standard-2xlarge), a single cluster member (worker node) can run multiple executors concurrently (e.g. 32 cores and 128 GB support 8 concurrent executors on a single cluster member, limited here by cores rather than memory).
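The capacity arithmetic above can be sketched in a few lines of Python (the function names and the node size are illustrative, not part of any ICA API; the node figures mirror the standard-2xlarge example in the text):

```python
def tasks_per_executor(executor_cores, task_cpus):
    """Concurrent task slots per executor: spark.executor.cores / spark.task.cpus."""
    return executor_cores // task_cpus

def executors_per_node(node_cores, node_mem_gb, executor_cores, executor_mem_gb):
    """How many executors of the given size fit on one cluster member.

    The binding constraint is whichever resource (cores or memory) runs out first.
    """
    return min(node_cores // executor_cores, node_mem_gb // executor_mem_gb)

print(tasks_per_executor(4, 1))            # 4 concurrent tasks per executor
print(executors_per_node(32, 128, 4, 4))   # 8 executors on a 32-core / 128 GB node
```

Here the executor count is bound by cores (32 / 4 = 8), not by memory (128 / 4 = 32).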

## Spark User Interface

The Spark UI can be accessed via the cluster; the Web Access URL is displayed on the Workspace details page.

This Spark UI registers all applications submitted through one of the Remote Jupyter kernels. It provides an overview of the registered workers (cluster members) and the applications running in the Spark cluster.

<figure><img src="https://3193631692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MWUqIqZhOK_i4HqCUpT%2Fuploads%2Fgit-blob-ac252984719a8a23448cfc8c6aa503ccb44cc4e8%2Fimage%20(71).png?alt=media" alt=""><figcaption></figcaption></figure>

### Spark Reference documentation

See the [Apache Spark configuration reference](https://spark.apache.org/docs/3.5.6/configuration.html).
