Tutorial: AVI Development via Jupyter

This tutorial consists of two parts:

  1. First, we will use a Jupyter notebook to get some data from the Gaia archive and perform some basic analysis.
  2. Second, we will create an AVI based on the analysis codes from our Jupyter notebook

Warning

An internet connection and Docker are required for this tutorial

Source code

The code for this tutorial is available in the Parameter Space repository: https://github.com/parameterspace-ie/tutorial-avi-jupyter

Background

First, let’s recall how GAVIP works, and what an AVI is:

  1. The Gaia satellite will create more than 1 petabyte of data, too much for most users to download
  2. GAVIP is designed to let users run their code near the archive:
    1. Code can be run in a hosted Jupyter notebook
    2. Code can be packaged as an AVI (forming a reusable configurable tool)

More information on GAVIP and its components is available in the Portal section.

Jupyter

Using a Jupyter notebook is the quickest way to start accessing Gaia data, and doing some analysis of it.

GAVIP allows users to start a Jupyter notebook in the portal (in this tutorial, we will do exactly that).

Once it is running, we will explain the motivations toward using an AVI instead of Jupyter, then migrate the analysis code to an AVI.

A reusable tool (AVI)

Once you have some analysis being performed in a Jupyter notebook, you may want to package your analysis as a reusable tool for others. Packaging your code as an AVI allows you to:

  1. Provide a user interface of your choosing
  2. Store historic results
  3. Perform more advanced visualizing of results

In addition, AVIs can perform more computationally intensive tasks: Jupyter notebooks can use a maximum of 1GB of RAM, whereas AVIs can work with a configurable RAM allocation.

AVI parts

AVIs conceptually are made of two parts:

  1. The front-end (providing the user-interface)
  2. The back-end (does the analysis)

The front-end is run whenever a user requests an AVI; the back-end (your analysis) is queued and run when resources are available.

More technical discussion of an AVI and how it works can be found in the next development tutorial; for now, knowing that there is a separation and that they run separately is enough to proceed.

Starting the notebook

In this tutorial we will use the hosted Jupyter notebook provided by GAVIP. Note: During AVI development (discussed further later), the AVI is run locally in “standalone” mode. While in this mode, a Jupyter notebook is provided for the developer. Either notebook is suitable.

There is a tutorial on Jupyter notebooks available for those unfamiliar with them.

To start the notebook in GAVIP, perform the following:

[Image: tutorial_signin.png]
  • In the navigation bar at the top of the page, click the ‘Jupyter’ button.

    • A notification will be presented which will tell you when your Jupyter notebook is ready to use
[Images: tutorial_jupyterstart.png, tutorial_jupyer_selectnotebook.png]
  • Once uploaded, open the notebook

  • The notebook is documented to bring you through a quick demonstration of how to download some Gaia data and interact with it, including:

    • Downloading the data using TAP+
    • Parsing the resulting VOtable to a pandas dataframe
    • Getting some quick statistics using pandas_profiling
  • Feel free to change the code in the notebook and run through the steps as desired.

This concludes the Jupyter portion of this tutorial.

Creating an AVI

Technical background

At this point, we will provide a bit more technical information on how AVIs work so that this part of the tutorial is clearer.

The tools provided by users must be isolated from each other in GAVIP. This could be done by running the tools in virtual machines, but instead we run them in ‘Containers’.

Isolating code in Containers rather than virtual machines is analogous to making a sandbox in your operating system, rather than emulating hardware for a complete system to run on. It is more efficient in many ways as a result. In addition, a complete copy (image) of an AVI with all software dependencies (including the Anaconda suite) is small enough to download comfortably.

The AVI itself forms a Django project (this tutorial assumes that you are not familiar with Django, but it is worth stating here).

Note

This tutorial is intentionally kept as simple as possible. When creating your own AVI front-end, it is recommended that you use the Django documentation and tutorials as a resource.

Overall structure

AVIs provide a front-end (user interface) and back-end (one or more analysis pipelines). The AVI front-end lets the user interact with your AVI and lets them submit work for any of your analysis pipelines. The AVI back-end simply performs the analysis given the parameters chosen by the user. AVIs use a local database for storing analysis parameters (often referred to as an AVI job), and results.

The AVI code provided by the developer is run within the AVI framework (which handles processes such as user authentication and background processing).

Back-end structure

Analysis work (or any long-running processing) is kept in a file called tasks.py.

In this tutorial we will break down the code from the Jupyter notebook above into smaller logical operations (downloading and analysing the data). These logical operations are referred to as tasks, and are chained together to form a “pipeline”. More detail on the back-end code will be provided later in this tutorial, and is also provided in the more advanced technical tutorial.

This forms our back-end; running it in the background and queueing jobs is all handled by GAVIP.

Front-end structure

As mentioned, AVIs store jobs and results in a database, so we create a description (or model) of the parameters that our AVI needs in models.py; this is then used to structure the database.

The AVI interface is a web interface, so AVIs typically include some HTML files. In this tutorial we have two files:

  1. index.html which provides the main page of our interface (where we can enter a query, and view previous jobs)
  2. job_result.html which is used to render one of our results

Users will need a way to enter the parameters for our back-end, so we need some sort of form or input to be available in the AVI front-end. We usually don’t want to write this manually (though developers are free to do so!) so we will get the AVI to generate an HTML form for us from our model. How the form is generated is defined in forms.py.

Users will generally interact with your AVI by requesting a page or submitting a form. We create the functions to handle these operations in views.py.

Finally, we map URLs to different view functions in urls.py.

So, to summarise the files used in this tutorial’s AVI front-end:

File        Description
models.py   Defines the structure of the AVI job data stored in the AVI database
HTML files  Define the user interface
forms.py    Generates forms to be inserted into the AVI interface
views.py    Functions for handling user-AVI interaction (rendering an HTML page, or storing AVI job parameters)
urls.py     Maps URLs to different functions in views.py

Download the tutorial AVI

Rather than specifying the content to be pasted into each file, we will copy the AVI from the tutorial Git repository, and examine each file separately. After that, we will start the AVI to look at it in action!

To start, clone or download a copy of the tutorial code from https://github.com/parameterspace-ie/tutorial-avi-jupyter such that the avi directory is somewhere convenient on your machine.

Now that you have the AVI downloaded, let’s have a look through the files. It is recommended that you open this folder in your favourite (or any) text editor now.

(tasks.py) The back-end code

Note

Open avi/tasks.py

See also

The more technical tutorial provides further detail on pipelines and how they are built

Summary

In tasks.py our analysis code from Jupyter is broken down into separate tasks which together form an “analysis pipeline”.

The file typically consists of a number of classes, each defining an output() function, a requires() function, and a run() function.

  • The output() function specifies the output of a particular task.

    • This allows your analysis pipeline to automatically check if the output exists already.
    • If so it will skip that part of your pipeline (this is very useful if you have a large file in common between different pipelines).
  • The requires() function lets you define dependencies between the tasks in your pipeline.

  • The run() function lets you define the work that each task in your pipeline must perform.
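The three functions above can be sketched in plain Python. This is an illustrative stand-in only: the real AVI framework supplies its own task base class, and the Task, DownloadData, and ProcessData classes below are hypothetical simplifications of what tasks.py contains.

```python
# Illustrative sketch of the output()/requires()/run() pattern described
# above. The Task base class here is a hypothetical stand-in for the one
# the AVI framework provides; DownloadData and ProcessData are simplified.
import os
import tempfile

workdir = tempfile.mkdtemp()

class Task:
    def output(self):
        raise NotImplementedError
    def requires(self):
        return []          # no dependencies by default
    def run(self):
        raise NotImplementedError
    def execute(self):
        # Skip the task entirely if its output already exists,
        # otherwise run its dependencies first, then itself.
        if os.path.exists(self.output()):
            return
        for dep in self.requires():
            dep.execute()
        self.run()

class DownloadData(Task):
    def output(self):
        return os.path.join(workdir, "gaia_result.vot")
    def run(self):
        # In the real AVI this work is done by the TAP+ service task.
        with open(self.output(), "w") as f:
            f.write("<VOTABLE/>")

class ProcessData(Task):
    def requires(self):
        return [DownloadData()]
    def output(self):
        return os.path.join(workdir, "profile.html")
    def run(self):
        with open(DownloadData().output()) as f:
            data = f.read()
        with open(self.output(), "w") as f:
            f.write("profiled %d bytes" % len(data))

ProcessData().execute()
print(open(ProcessData().output()).read())  # profiled 10 bytes
```

Calling ProcessData().execute() a second time returns immediately, because its output file already exists — the same skip-if-present behaviour described in the bullets above.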

Examination

The imports at the top of the file are similar to the ones used in our Jupyter notebook, but there are a few additional imports: we are now also importing some Python “classes” to help define our pipeline tasks.

At the top of the file we also import a “service” class.

Note

Services were mentioned in the Jupyter notebook from earlier in the tutorial. Services basically provide some reusable functionality which can be easily added to your analysis pipeline. In this case, it handles submitting and managing the TAP+ job for our ADQL query, then downloading the results to the path specified in the output() function.

The analysis code within tasks.py (see ProcessData.run()) is similar to the later code in our Jupyter notebook. In this case, there is some additional code to store the pandas_profiling output into a file.

About half the code from the Jupyter notebook (used for downloading the Gaia data) cannot be found in tasks.py because it is performed by the service in our pipeline (note how the DownloadData class does not specify a run() function).

(models.py) The AVI models

Note

Open avi/models.py

Summary

All models which are used to store the parameters for an AVI analysis pipeline must extend the ‘AviJob’ class. This class is used to automatically run and track the progress of your AVI pipelines.

Examination

In this file we can see a single class being defined (SimpleJob) which stores a query parameter as a character field, with a default value identical to the query we used in our notebook (this is purely done for convenience).

In addition to the query parameter, we also store a parameter called pipeline_task: this is used by the AviJob class to determine which analysis task to invoke. Note that the value ProcessData is the name of the task in tasks.py which performed the analysis of the Gaia data and had a dependency on the DownloadData class.

(forms.py) The AVI forms

Note

Open avi/forms.py

Summary

Django forms are a way of letting Django generate an HTML form for you based on your model (SimpleJob in our case).

Note

This is the recommended approach for beginning your AVI interface.

Examination

In this file we can see a single class being defined (QueryForm) which extends ModelForm.

See also

The Django documentation on model forms: https://docs.djangoproject.com/en/1.10/topics/forms/modelforms/

We simply define the model to create the form from, what fields to exclude (which are provided by the AviJob class), and some additional specification to help style the generated form.

This form will be used when rendering the AVI interface to the user.
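As a rough illustration of what ModelForm does behind the scenes, the sketch below generates HTML inputs from a field description while excluding the field supplied by the AviJob class. The field list and default value are assumptions for demonstration, not the actual QueryForm code.

```python
# Hypothetical sketch of form generation from a model description.
# The field names mirror the SimpleJob model; the default is a placeholder.
model_fields = [
    ("query", "text", "SELECT ..."),           # user-editable
    ("pipeline_task", "text", "ProcessData"),  # provided by AviJob
]
excluded = {"pipeline_task"}  # fields the user should not edit

def render_form(fields, excluded):
    rows = []
    for name, input_type, default in fields:
        if name in excluded:
            continue
        rows.append('<input type="%s" name="%s" value="%s">'
                    % (input_type, name, default))
    return "\n".join(rows)

html = render_form(model_fields, excluded)
print(html)
```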

(views.py) The AVI views

Note

Open avi/views.py

Summary

Each function within views.py will usually handle some aspect of user-AVI interaction.

See also

The Django documentation on views: https://docs.djangoproject.com/en/1.10/#the-view-layer

Examination

In this file there are three functions:

  • index() renders the index.html file to the user (this is the main view in the AVI front-end)
  • run_query() retrieves the query parameter when the user submits their ADQL query using the AVI front-end
  • job_result() retrieves a job given a job ID, and renders job_result.html to show us the pandas profiling output

index()

In the index() view, our QueryForm from forms.py is used to create a “context” (simply a dictionary used to help render an HTML file - index.html in this case). Later we will see the form being used in index.html.

run_query()

In the run_query() view we are retrieving the query parameter submitted by the user. We then create an instance of the SimpleJob model from models.py.

Note

Once an instance of a model extending AviJob (SimpleJob in this case) is created, the corresponding analysis pipeline is automatically queued for processing in GAVIP. When developing an AVI locally, the processing occurs immediately.
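The queueing behaviour described in the note can be pictured with a small plain-Python stand-in. The names here are illustrative; the real AviJob class does this inside the AVI framework.

```python
# Plain-Python stand-in for the behaviour described above: saving a job
# instance queues the pipeline named by its pipeline_task parameter.
queue = []

class AviJob:
    def save(self):
        # In GAVIP this enqueues the pipeline for background processing;
        # in standalone mode it would be processed immediately.
        queue.append(self.pipeline_task)

class SimpleJob(AviJob):
    pipeline_task = "ProcessData"
    def __init__(self, query):
        self.query = query

SimpleJob("SELECT ...").save()
print(queue)  # ['ProcessData']
```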

job_result()

This view is used to render the result (the pandas profiling in our case) for a particular job, given its ID. We can see in the code that job_id is a parameter to this function, which is then used to retrieve an instance of SimpleJob. Once retrieved, we open the file generated by the analysis pipeline and add it to the “context” to be used when rendering job_result.html (similarly to the index() view).

Note

The result_path attribute of the job is the path specified by the ProcessData output() function in tasks.py
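Stripped of Django, the flow of job_result() looks roughly like the sketch below. The dictionary-based job store and the file contents are illustrative assumptions, not the actual views.py code.

```python
# Hedged sketch of the job_result() flow using plain Python: look up a
# job by ID, read the file its pipeline produced, build a render context.
import os
import tempfile

# Stand-in for the SimpleJob table; in Django this would be the database.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w") as f:
    f.write("pandas profiling output")
jobs = {42: {"result_path": path}}

def job_result(job_id):
    job = jobs[job_id]                   # Django: SimpleJob.objects.get(id=job_id)
    with open(job["result_path"]) as f:  # file written by the pipeline's output()
        profiling = f.read()
    # The context is then used to render job_result.html
    return {"job_id": job_id, "pandas_profiling": profiling}

context = job_result(42)
print(context["job_id"])  # 42
```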

(index.html) The AVI HTML files

Note

Open avi/templates/avi/index.html

Summary

This file is an HTML template. It is very similar to a standard HTML file but includes some mark-up to let Django dynamically create some content (saving us effort).

See also

The Django documentation on templates: https://docs.djangoproject.com/en/1.10/#the-template-layer

Examination

At the start of the file we “extend” a file:

{% extends "base/base.html" %}

By doing this we are extending an existing HTML template and including our code in it. This base template imports some JavaScript and stylesheets so we don’t have to (including Bootstrap and jQuery). Note: this is only provided for your convenience as a developer, and is not necessary for your AVI.

We then specify some content within a “block” called ‘avi_title’:

{% block avi_title %}
Tutorial AVI
{% endblock avi_title %}

This simply results in “Tutorial AVI” becoming the title of the AVI web page.

Then we specify the main content of our page within a “block” called ‘avi_content’:

{% block avi_content %}
...
{% endblock avi_content %}

In this block, we are providing our main page, which consists of:

  1. Our QueryForm from forms.py
  2. A “plugin” used to generate a useful interactive table of all jobs performed in the AVI

The QueryForm is embedded in a <form> in our AVI page as follows:

{{ query_form.as_table }}

Note

This uses the ‘as_table’ method on the query_form object, which is passed in the context created for index.html in views.py

The AVI job list plugin is embedded in our page using:

{% include "plugins/gavip_job_table.html" %}

(job_result.html) The AVI HTML files

Note

Open avi/templates/avi/job_result.html

Summary

This file is an HTML template. It is very similar to a standard HTML file but includes some mark-up to let Django dynamically create some content (saving us effort).

See also

The Django documentation on templates: https://docs.djangoproject.com/en/1.10/#the-template-layer

Examination

Similarly to index.html we are extending the base template.

In the ‘avi_title’ block we have changed the page title to be “Job Result: {{job_id}}”. This uses a value from the context created in views.py in the job_result() function.

In the ‘avi_content’ block we show a panel containing some helpful information (using Bootstrap), and then we embed the content of the ‘pandas_profiling’ object passed into the view context. Note: we use the ‘safe’ filter when embedding this content so that Django doesn’t automatically escape it.

(urls.py) Mapping URLs to views

Note

Open avi/urls.py

Summary

Now that we have seen the view functions, the HTML templates being used in the views, the forms and the models, we finally map URLs to the view functions to create a functional AVI.

Examination

In urls.py we import some views from our AVI, then create a list of URL patterns, each mapping a URL pattern (a regular expression) to a particular view.

The first and second URLs are straightforward, mapping ‘/’ to our index view, and ‘/run_query/’ to our run_query view, respectively.

The third URL contains a regular expression so that ‘/result/XX’ will pass the integer XX to the job_result view in the ‘job_id’ parameter (as we saw in views.py).
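The capture described above works like any Python regular expression with a named group. The exact pattern in avi/urls.py may differ slightly; this is an illustration of the idea only.

```python
# Illustrative only: how a regex URL pattern captures the job ID that is
# passed to the job_result view. The actual pattern in urls.py may differ.
import re

pattern = re.compile(r"^result/(?P<job_id>[0-9]+)/?$")
match = pattern.match("result/42")
print(match.group("job_id"))  # 42
```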

The final URL maps ‘/job_list/’ as the root to a collection of URLs imported from ‘plugins’. Note: this is used by the job_list plugin used in index.html (further detail is available in the more technical AVI tutorial).

Note

All URLs in this file are prefixed with ‘/avi/’ by the AVI framework. So, for example, the run_query URL will actually be ‘XXXXX/avi/run_query’ where XXXXX is the AVI URL.

Starting your AVI

Now that we have an AVI created, we will download a container image and run it with our AVI code.

The container image is an identical copy of what exists in an AVI run by GAVIP - i.e. it includes all the software dependencies, including the Anaconda software suite and the AVI framework. This means that preparing an AVI environment on your machine usually requires just two commands!

When an AVI is run in GAVIP, it is configured to store all data products in a special directory (/data) which is mounted from your User Space. When running the AVI locally, we will create a directory to mount to the data directory in the AVI (so that our results are easily accessible during development).

Once we have the container image (referred to as AVI template, as it is the basis of an AVI), we will create a data directory for the AVI, then start up a container and mount in our AVI code so it runs in the framework.

Download the image

Here we use Docker to download the image:

docker pull repositories.gavip.science/ps_avi_python_2:develop

Note

sudo may be required depending on your installation of Docker

Create a Data directory

Here we will create a directory in /tmp (but this location can be changed as you wish). The only requirement is that it also includes a ‘logs’ subdirectory:

mkdir -p /tmp/my_data_volume/logs

Start the AVI

Now we can start the AVI. For this command, the AVI folder from the tutorial is assumed to exist in /tmp (change this as required)

  • (avi folder) /tmp/avi
  • (data folder) /tmp/my_data_volume
docker run -d --name myfirstavi \
-e SETTINGS=settings.standalone \
-v /tmp/avi:/opt/gavip_avi/avi \
-v /tmp/my_data_volume:/data \
repositories.gavip.science/ps_avi_python_2:develop \
supervisord

You should see a unique ID output after this command is run; this is the unique ID of the container. We could use this in later Docker commands, but instead we will use the more convenient container name ‘myfirstavi’.

Retrieve the AVI IP address

To view your AVI, we need to find its IP address using Docker:

docker inspect --format '{{ .NetworkSettings.IPAddress }}' myfirstavi

Once you have your container IP address, view the IP at port 10000 in your browser. For example: http://172.17.0.2:10000

Use your AVI

You have now started your first AVI, and should see something similar to the image below.

[Image: tutorial_avirunning.png]

Things to do with your AVI:

  • Submit an ADQL query
    • Watch it get added to your AVI job list automatically
    • The progress in the job list is calculated based on the ratio of tasks completed in your pipeline to those required
  • Submit the same query twice
    • Notice how the job is completed almost instantly
    • This is because our pipeline is now checking for existing files before doing unnecessary work
  • Submit a different ADQL query
    • This one should take some time to complete (like the first one)
  • Check the parameters given to your job by clicking on the Info icon
  • Search your jobs using the Search box
  • Click the Result button to view your job result
    • We now see the pandas profiling from our Jupyter notebook
  • Delete a job by selecting the ‘Delete’ option from the dropdown box
    • Many options are unavailable in standalone mode, but can be used when the AVI is deployed by GAVIP.

Other things to do with your AVI:

  • View the ‘Queue Monitor’ from the navigation bar at the top of your AVI interface
    • This shows the status of your job queue
    • Remember that your analysis pipeline is queued and run separately from your AVI interface
  • Open the ‘Jupyter Notebook’ provided by your local AVI container
    • The earlier portion of this tutorial could be run from the Jupyter notebook service provided by the AVI

Starting your AVI with the GAVIP Client

If you have the GAVIP client installed (available on LiveLink), starting your AVI is simpler.

Create a Data directory

Here we will create a directory in /tmp as before:

mkdir -p /tmp/my_data_volume/logs

Start the AVI

The AVI folder from the tutorial is assumed to exist in /tmp (change this as required)

  • (avi folder) /tmp/avi
  • (data folder) /tmp/my_data_volume
gavip_client development start_avi \
--avi-path /tmp/avi --data-path /tmp/my_data_volume \
--avi-template=ps_avi_python_2:develop

This command will download the required AVI container image (AVI template), then provide the name of the container, its URL, and information on how to access the running AVI from a terminal.