.. _ref-docker-active-learning:

Active Learning
===============

.. warning::

    **The Docker Archive documentation is deprecated**

    The old workflow described in these docs will not be supported with new Lightly Worker
    versions above 2.6. Please switch to our `new documentation page `_ instead.

Lightly makes use of active learning scores to select the samples which will yield
the biggest improvements of your machine learning model. The scores are calculated
on-the-fly based on model predictions and provide the selection algorithm with feedback
about the uncertainty of the model for the given sample.

.. note:: The active learning features require Lightly Worker version 2.2 or newer.
    You can check your installed version of the Lightly Worker by running the
    :ref:`ref-docker-setup-sanity-check`.

Prerequisites
-------------

In order to do active learning with Lightly, you will need the following things:

- The installed Lightly docker (see :ref:`ref-docker-setup`)
- A dataset with a configured datasource (see :ref:`ref-docker-with-datasource-datapool`)
- Your predictions uploaded to the datasource (see :ref:`ref-docker-datasource-predictions`)

.. note:: The dataset does not need to be new! For example, an initial selection without
    active learning can be used to train a model. The predictions from this model can then
    be used to improve your dataset by adding new images to it through active learning.


Selection
---------

Once you have everything set up as described above, you can do an active learning
iteration by specifying the following three things in your Lightly docker config:

- `method`
- `active_learning.task_name`
- `active_learning.score_name`

Here's an example of how to configure an active learning run:

.. tabs::

    .. tab:: Web App

        **Trigger the Job**

        To trigger a new job you can click on the schedule run button on the dataset
        overview as shown in the screenshot below:

        .. figure:: ../integration/images/schedule-compute-run.png

        After clicking on the button you will see a wizard to configure the parameters
        for the job.

        .. figure:: ../integration/images/schedule-compute-run-config.png

        In this example we have to set the `active_learning.task_name` parameter in the
        docker config. Additionally, we set the `method` to `coral`, which simultaneously
        considers the diversity and the active learning scores of the samples. All other
        settings are default values. The resulting docker config should look like this:

        .. literalinclude:: code_examples/active_learning_worker_config.txt
            :caption: Docker Config
            :language: javascript

        The Lightly config remains unchanged.

    .. tab:: Python Code

        .. literalinclude:: code_examples/python_run_active_learning.py

After the worker has finished its job you can see the selected images together with their
active learning scores in the web-app.


Active Learning with Custom Scores (not recommended as of March 2022)
----------------------------------------------------------------------

.. note:: This approach is not recommended anymore as of March 2022 and will be
    deprecated in the future!

To run an active learning step with the Lightly docker, we need to perform three steps:

1. Create an `embeddings.csv` file. You can use your own models or the Lightly docker for this.
2. Add your active learning scores as an additional column to the embeddings file.
3. Use the Lightly docker to perform an active learning iteration on the scores.

Create Embeddings
^^^^^^^^^^^^^^^^^

You can create embeddings using your own model. Just make sure the resulting
`embeddings.csv` file matches the required format: :ref:`ref-cli-embeddings-lightly`.
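
For illustration only, a minimal sketch of writing such a file with plain Python could
look like the following. The file names, labels, and embedding dimension are placeholders,
and the column layout mirrors the example table shown further below; see
:ref:`ref-cli-embeddings-lightly` for the authoritative format.

.. code-block:: python

    # write_embeddings.py -- minimal sketch for writing an embeddings.csv
    # with columns: filenames, embedding_0 ... embedding_N, labels.
    # Replace the placeholder data with the output of your own model.
    import csv

    import numpy as np

    # placeholder data -- in practice these come from your model and dataset
    filenames = ["cats/0001.jpg", "dogs/0005.jpg", "cats/0014.jpg"]
    embeddings = np.random.rand(len(filenames), 4)  # shape: (n_samples, n_dims)
    labels = [0, 1, 0]

    with open("embeddings.csv", "w", newline="") as f:
        writer = csv.writer(f)
        header = (
            ["filenames"]
            + [f"embedding_{i}" for i in range(embeddings.shape[1])]
            + ["labels"]
        )
        writer.writerow(header)
        for filename, embedding, label in zip(filenames, embeddings, labels):
            writer.writerow([filename, *embedding.tolist(), label])
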
Alternatively, you can run the docker as usual, as described in the
:ref:`rst-docker-first-steps` section. The only difference is that you set the number of
samples to be selected to 1.0, which simply creates embeddings of the full dataset.
E.g. create and run a bash script with the following content:

.. code-block:: bash

    # Have this in a step_1_run_docker_create_embeddings.sh

    INPUT_DIR=/path/to/your/dataset
    SHARED_DIR=/path/to/shared
    OUTPUT_DIR=/path/to/output

    LIGHTLY_TOKEN= # put your token here
    N_SAMPLES=1.0

    docker run --gpus all --rm -it \
        -v ${INPUT_DIR}:/home/input_dir:ro \
        -v ${SHARED_DIR}:/home/shared_dir:ro \
        -v ${OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=${LIGHTLY_TOKEN} \
        lightly.loader.num_workers=4 \
        stopping_condition.n_samples=${N_SAMPLES} \
        method=coreset \
        enable_training=True \
        lightly.trainer.max_epochs=20

Running it will create a terminal output similar to the following:

.. code-block::

    [2021-09-29 13:32:11] Loading initial dataset...
    [2021-09-29 13:32:11] Found 372 input images in input_dir.
    [2021-09-29 13:32:11] Lightly On-Premise License is valid
    [2021-09-29 13:32:11] Checking for corrupt images (disable with enable_corruptness_check=False).
    Corrupt images found: 0: 100%|██████████████████| 372/372 [00:01<00:00, 310.35it/s]
    [2021-09-29 13:32:14] Training self-supervised model.
    GPU available: True, used: True
    [2021-09-29 13:32:57,696][lightning][INFO] - GPU available: True, used: True
    TPU available: None, using: 0 TPU cores
    [2021-09-29 13:32:57,697][lightning][INFO] - TPU available: None, using: 0 TPU cores
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    [2021-09-29 13:32:57,697][lightning][INFO] - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

      | Name      | Type       | Params
    -----------------------------------------
    0 | model     | SimCLR     | 11.2 M
    1 | criterion | NTXentLoss | 0
    -----------------------------------------
    11.2 M    Trainable params
    0         Non-trainable params
    [2021-09-29 13:34:29,772][lightning][INFO] - Saving latest checkpoint...
    Epoch 19: 100%|████████████████████████████████| 23/23 [00:04<00:00, 5.10it/s, loss=2.52, v_num=0]
    [2021-09-29 13:34:29] Embedding images.
    Compute efficiency: 0.90: 100%|█████████████████████████| 24/24 [00:01<00:00, 21.85it/s]
    [2021-09-29 13:34:31] Saving embeddings to output_dir/2021-09-29/13:32:11/data/embeddings.csv.
    [2021-09-29 13:34:31] Unique embeddings are stored in output_dir/2021-09-29/13:32:11/data/embeddings.csv
    [2021-09-29 13:34:31] Normalizing embeddings to unit length (disable with normalize_embeddings=False).
    [2021-09-29 13:34:31] Normalized embeddings are stored in output_dir/2021-09-29/13:32:11/data/normalized_embeddings.csv
    [2021-09-29 13:34:31] Sampling dataset with stopping condition: n_samples=372
    [2021-09-29 13:34:31] Skipped sampling because the number of remaining images is smaller than the number of requested samples.
    [2021-09-29 13:34:31] Writing report to output_dir/2021-09-29/13:32:11/report.pdf.
    [2021-09-29 13:35:04] Writing csv with information about removed samples to output_dir/2021-09-29/13:32:11/removed_samples.csv
    [2021-09-29 13:35:04] Done!

Running the script creates an `embeddings.csv` file in the output directory. Locate it and
save its path; it may be found under e.g. `/path/to/output/2021-09-28/15:47:34/data/embeddings.csv`.
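
Because every run writes into a new `<date>/<time>` subdirectory, it can be handy to look
up the newest `embeddings.csv` programmatically. The snippet below is a minimal sketch,
assuming the `OUTPUT_DIR` mount (`/path/to/output`) from the script above:

.. code-block:: python

    # find_latest_embeddings.py -- minimal sketch, assumes the output layout
    # shown in the terminal output above: <OUTPUT_DIR>/<date>/<time>/data/embeddings.csv
    from pathlib import Path

    output_dir = Path("/path/to/output")  # same as OUTPUT_DIR in the script above
    candidates = sorted(output_dir.glob("*/*/data/embeddings.csv"))
    if not candidates:
        raise FileNotFoundError(f"No embeddings.csv found under {output_dir}")
    embeddings_csv = candidates[-1]  # date/time folders sort lexicographically, newest last
    print(embeddings_csv)
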
The `embeddings.csv` file should look similar to this:

+----------------+--------------+--------------+--------------+--------------+---------+
| filenames      | embedding_0  | embedding_1  | embedding_2  | embedding_3  | labels  |
+================+==============+==============+==============+==============+=========+
| cats/0001.jpg  | 0.29625183   | 0.50055015   | 0.36491454   | 0.8156051    | 0       |
+----------------+--------------+--------------+--------------+--------------+---------+
| dogs/0005.jpg  | 0.36491454   | 0.29625183   | 0.38491454   | 0.36491454   | 1       |
+----------------+--------------+--------------+--------------+--------------+---------+
| cats/0014.jpg  | 0.8156051    | 0.59055015   | 0.29625183   | 0.50055015   | 0       |
+----------------+--------------+--------------+--------------+--------------+---------+

Add Active Learning Scores
^^^^^^^^^^^^^^^^^^^^^^^^^^

You can use the predictions from your model as active learning scores.

.. note:: You can also use your own scorers. Just make sure that you get a value between
    `0.0` and `1.0` for each sample, where a number close to `1.0` indicates a very
    important sample that you want to be selected with a higher probability.
    A minimal sketch of such a scorer is shown after the script below.

We provide a simple Python script to append a list of `scores` to the `embeddings.csv` file.

.. code-block:: python

    # Have this in a step_2_add_al_scores.py
    from typing import Iterable
    import csv

    """
    Run your detection model here.
    Use the scorers offered by lightly to generate active learning scores.
    """

    # Let's assume that you have one active learning score for every image.
    # WARNING: The order of the scores MUST match the order of filenames
    # in the embeddings.csv.
    # Fill in your scores here; this must be an iterable of floats,
    # e.g. a list of floats or a 1d numpy array.
    scores: Iterable[float] = ...

    # define the function to add the scores to the embeddings.csv
    def add_al_scores_to_csv(
        input_file_path: str,
        output_file_path: str,
        scores: Iterable[float],
        column_name: str = "al_score",
    ):
        with open(input_file_path, 'r') as read_obj:
            with open(output_file_path, 'w') as write_obj:
                csv_reader = csv.reader(read_obj)
                csv_writer = csv.writer(write_obj)

                # add the column name
                first_row = next(csv_reader)
                first_row.append(column_name)
                csv_writer.writerow(first_row)

                # add the scores
                for row, score in zip(csv_reader, scores):
                    row.append(str(score))
                    csv_writer.writerow(row)

    # use the function
    # adapt the following line to use the correct path to the embeddings.csv
    input_embeddings_csv = '/path/to/output/2021-07-28/12:00:00/data/embeddings.csv'
    output_embeddings_csv = input_embeddings_csv.replace('.csv', '_al.csv')
    add_al_scores_to_csv(input_embeddings_csv, output_embeddings_csv, scores)

    print("Use the following path to the embeddings_al.csv in the next step:")
    print(output_embeddings_csv)
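
As mentioned in the note above, you can also plug in your own scorer as long as it yields
one value in `[0.0, 1.0]` per image. A common choice is an uncertainty score derived from
the model's predicted class probabilities. The following is a minimal sketch of a
least-confidence scorer; the `probabilities` array is a placeholder for your model's
softmax outputs, ordered like the filenames in the `embeddings.csv`:

.. code-block:: python

    # custom_scorer.py -- minimal sketch of a least-confidence uncertainty scorer.
    # `probabilities` is a placeholder: one row of softmax class probabilities per
    # image, in the same order as the filenames in embeddings.csv.
    import numpy as np

    probabilities = np.array([
        [0.10, 0.85, 0.05],  # confident prediction -> low score
        [0.40, 0.35, 0.25],  # uncertain prediction -> high score
        [0.55, 0.30, 0.15],
    ])

    num_classes = probabilities.shape[1]
    # least-confidence uncertainty, rescaled to [0.0, 1.0]:
    # 0.0 if the top class has probability 1, 1.0 if all classes are equally likely
    scores = (1.0 - probabilities.max(axis=1)) * num_classes / (num_classes - 1)

    assert ((0.0 <= scores) & (scores <= 1.0)).all()
    print(scores)  # pass these as `scores` to step_2_add_al_scores.py
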
Running `step_2_add_al_scores.py` will produce terminal output similar to the following:

.. code-block::

    (base) user@machine:~/GitHub/playground/docker_with_al$ sudo python3 step_2_add_al_scores.py
    Use the following path to the embeddings_al.csv in the next step:
    /path/to/output/2021-07-28/12:00:00/data/embeddings_al.csv

Your `embeddings_al.csv` should look similar to this:

+----------------+--------------+--------------+--------------+--------------+---------+-----------+
| filenames      | embedding_0  | embedding_1  | embedding_2  | embedding_3  | labels  | al_score  |
+================+==============+==============+==============+==============+=========+===========+
| cats/0001.jpg  | 0.29625183   | 0.50055015   | 0.36491454   | 0.8156051    | 0       | 0.7231    |
+----------------+--------------+--------------+--------------+--------------+---------+-----------+
| dogs/0005.jpg  | 0.36491454   | 0.29625183   | 0.38491454   | 0.36491454   | 1       | 0.91941   |
+----------------+--------------+--------------+--------------+--------------+---------+-----------+
| cats/0014.jpg  | 0.8156051    | 0.59055015   | 0.29625183   | 0.50055015   | 0       | 0.01422   |
+----------------+--------------+--------------+--------------+--------------+---------+-----------+

Run Active Learning using the Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At this point you should have an `embeddings_al.csv` file with the active learning scores
in a column named `al_score`. We can now perform an active learning iteration using the
`coral` selection strategy.

In order to do the selection on the `embeddings_al.csv` file, we need to make this file
accessible to the docker. We can do this by using the `shared_dir` feature of the docker
as described in :ref:`docker-sampling-from-embeddings`. E.g. use the following bash script:

.. code-block:: bash

    #!/bin/bash -e
    # Have this in a step_3_run_docker_coral.sh

    INPUT_DIR=/path/to/your/dataset/
    SHARED_DIR=/path/to/shared/
    OUTPUT_DIR=/path/to/output/

    EMBEDDING_FILE= # insert the path printed in the last step here,
                    # e.g. /path/to/output/2021-07-28/12:00:00/data/embeddings_al.csv
    cp ${EMBEDDING_FILE} ${SHARED_DIR} # copy the embedding file to the shared directory
    EMBEDDINGS_REL_TO_SHARED=embeddings_al.csv

    LIGHTLY_TOKEN= # put your token here
    N_SAMPLES= # choose how many samples you want to use here, e.g. 0.1 for 10 percent

    docker run --gpus all --rm -it \
        -v ${INPUT_DIR}:/home/input_dir:ro \
        -v ${SHARED_DIR}:/home/shared_dir:ro \
        -v ${OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=${LIGHTLY_TOKEN} \
        lightly.loader.num_workers=4 \
        stopping_condition.n_samples=${N_SAMPLES} \
        method=coral \
        enable_training=False \
        dump_dataset=True \
        upload_dataset=False \
        embeddings=${EMBEDDINGS_REL_TO_SHARED} \
        active_learning_score_column_name="al_score" \
        scorer=""

Your terminal output should look similar to this:

.. code-block::

    [2021-09-29 09:36:27] Loading initial embedding file...
    [2021-09-29 09:36:27] Output images will not be resized.
    [2021-09-29 09:36:27] Found 372 input images in shared_dir/embeddings_al.csv.
    [2021-09-29 09:36:27] Lightly On-Premise License is valid
    [2021-09-29 09:36:28] Removing exact duplicates (disable with remove_exact_duplicates=False).
    [2021-09-29 09:36:28] Found 0 exact duplicates.
    [2021-09-29 09:36:28] Unique embeddings are stored in shared_dir/embeddings_al.csv
    [2021-09-29 09:36:28] Normalizing embeddings to unit length (disable with normalize_embeddings=False).
    [2021-09-29 09:36:28] Normalized embeddings are stored in output_dir/2021-09-29/09:36:27/data/normalized_embeddings.csv
    [2021-09-29 09:36:28] Sampling dataset with stopping condition: n_samples=10
    [2021-09-29 09:36:28] Sampled 10 images.
    [2021-09-29 09:36:28] Writing report to output_dir/2021-09-29/09:36:27/report.pdf.
    [2021-09-29 09:36:56] Writing csv with information about removed samples to output_dir/2021-09-29/09:36:27/removed_samples.csv
    [2021-09-29 09:36:56] Done!