Writing a new DQR task
This guide will show how to develop a new task and add the script to dqrtasks. As an example, we will walk through the code for dqr-largest-value. Note that this script is designed for this tutorial, and is not a “reviewed” script.
The dqr-largest-value script is designed to read in data, find the maximum value of the data in a small chunk of time around an event, and then estimate the significance (with a p-value) of that measurement.
The completed results page for this example task, using data from LIGO Livingston around GW170817, can be found here:
Starting off
The first thing to do for a new script is to create a new folder in dqrtasks/dqrtasks/ to house your code. For this example, we created the folder dqrtasks/dqrtasks/largest. The next step is to add a file to this folder named __init__.py.
There aren’t any requirements for what this file should contain;
this file is just required for the package to be imported.
To make imports easier, one can add the following line to the file:
from . import *
Now let’s start writing our new script.
We create the file largest.py
and add in the required python shebang
and import the packages that
we will need for this script.
For this example, it looks like the following:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import os, json, argparse, time, sys, logging
import numpy
from gwpy.timeseries import TimeSeries
from gwdetchar.io import html as htmlio
from MarkupPy import markup
Next we want to create a function that will run the required code. You can include as many functions as needed (like any Python code), but it is strongly suggested that there be a “top-level” function that can run the entire script.
This new function should take in all of the required variables as input; it is strongly recommended that as few variables as possible are “hard-coded”. The function we are making only needs 5 variables:
def largest_value(channel,
                  gps_time,
                  outdir='./',
                  total_duration=256.,
                  chunk_duration=1.):
    ifo = channel[:2]
Now that we’ve set things up, we can start on the code that will run the actual analysis.
Data access
Accessing the data is one of the most important parts of the code to get right.
One of the most significant differences between different types of IGWN
computing environments is how the data is stored and queried.
To ensure that data can be accessed in the largest
number of computing environments,
it is strongly suggested that the gwpy function
gwpy.timeseries.TimeSeries.get()
be used to access data.
An example use of this function is below:
# get data
start_data = int(gps_time - total_duration/2.)
end_data = int(gps_time + total_duration/2.)
data = TimeSeries.get(channel=channel,
                      start=start_data,
                      end=end_data,
                      )
Using gwpy.timeseries.TimeSeries.get() allows data to be accessed by your code using datafind methods at LIGO and EGO clusters, in addition to nds2.
Besides allowing your code to be used at all IGWN computing
sites, having this flexibility in data access also
increases the robustness of the code.
In cases where data is not accessible using one of these
methods,
alternate methods will be attempted automatically.
Now that we have data, let’s analyze it!
Data analysis
The analysis for this task is quite simple.
We crop the data down to a small window around our event (with a window width of chunk_duration) and then find the maximum value of the data during this window.
# find max value
short_data = data.crop(gps_time-chunk_duration/2., gps_time+chunk_duration/2.).value
max_val = max(short_data)
Most tasks will require more complicated data analysis, but this should suffice for our example. The most involved component will be calculating the significance of this measurement, which will be explained in the next section.
Calculating the p-value
In order to support increased automation of DQR tasks, your task should report a significance estimate (using a p-value) if at all possible. This p-value should be the probability of obtaining the observed result under the null hypothesis, where the null hypothesis is that the data quality issue considered by the task is not present. In other words, a low p-value means that a data quality issue is likely present. In this case, the “data quality issue” we are considering is that the considered channel has an anomalously large value.
To estimate the significance of our measurement, we will repeat the measurement on a large number of time windows (with the same width) to estimate the distribution of values this task would return. We will loop through all of the data we downloaded (256 seconds in this example).
# find same value for every chunk
max_val_array = []
chunk_starts = numpy.arange(start=start_data,
                            stop=end_data,
                            step=chunk_duration)
for t in chunk_starts:
    short_data_chunk = data.crop(t, t + chunk_duration).value
    max_val_chunk = max(short_data_chunk)
    max_val_array.append(max_val_chunk)
max_val_array = numpy.array(max_val_array)
Now that we have our distribution, the p-value is simply the fraction of samples as large as or larger than our measurement. If our measurement were the largest out of 256 samples, then the p-value we would return is 1/256, or 0.00390625. The p-value cannot be 0 in this case, as we are limited by our finite sample size.
# calculate the p-value: the fraction of chunks at least as loud,
# with 1 added so the p-value is never exactly 0
points_larger = sum(max_val_array > max_val)
p_value = (points_larger + 1.)/len(max_val_array)
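As a sanity check on this estimator, we can apply the same formula to synthetic data. The sketch below is not part of the task itself: it uses standard-library Gaussian noise in place of real strain data, and all of its variable names are local to the example.

```python
import random

random.seed(1)  # make the check reproducible

n_chunks = 256  # one chunk per second of a 256-second stretch

# simulate the per-chunk maxima of some stationary noise
max_val_array = [max(random.gauss(0., 1.) for _ in range(16))
                 for _ in range(n_chunks)]

# pretend the chunk containing the "event" is the first one
max_val = max_val_array[0]

# same estimator as above: fraction of chunks at least as loud
points_larger = sum(1 for v in max_val_array if v > max_val)
p_value = (points_larger + 1.) / len(max_val_array)

# the +1 keeps the p-value bounded away from 0
assert 1. / n_chunks <= p_value <= 1.
```

Repeating this check with many random seeds, the returned p-values should be spread roughly uniformly between 1/256 and 1, since pure noise contains no data quality issue.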
This concludes the “analysis” portion of the example. The next few sections will detail how to save and present these results.
Plotting the results
Although this task was designed so that its results can be analyzed automatically, it is also strongly beneficial if these results can be confirmed by a human (if needed). To help this process, we should make a few plots that show how we came to this p-value and give some context for our result. We will make two plots: one showing the distribution of measured values, and a second showing the time series of our data:
# first the distribution of data
fig = plt.figure(figsize=[8,4])
ax = fig.gca()
ax.hist(max_val_array, bins=50, histtype='step',
        label="Full dataset")
ax.axvline(max_val,label='Measured value',color='r',ls='--')
ax.set_xlabel('Amplitude')
ax.set_ylabel('Counts')
ax.set_title('')
ax.grid(True)
ax.legend(loc='upper right')
fig.tight_layout()
hist_name = '%s-hist_%d.png'%(ifo,int(gps_time))
fig.savefig(os.path.join(outdir, hist_name))
# now the timeseries
plot = data.plot(figsize=[8,4], epoch=gps_time)
ts_name = '%s-timeseries-%d.png'%(ifo,int(gps_time))
plot.savefig(os.path.join(outdir, ts_name))
As part of generating these plots, we save them in the same output directory as all our other results will be stored. We also store the plot names as variables which we will be using later.
Saving the results
Next we save our numerical results to a .json file. There are a number of required fields for this result file, including (information to be confirmed). This data can be saved in a relatively straightforward manner:
# json file
json_name = os.path.join(outdir, 'data.json')
out_dict = dict(
    username=os.getenv('USER', default='unknown'),
    hostname=os.getenv('HOSTNAME', default='unknown'),
    tables=None,
    date=time.strftime("%H:%M:%S UTC %a %d %b %Y", time.gmtime()),
    directory=outdir,
    p_value=p_value,
    comment='p-value based on data from %.2f seconds' % total_duration,
)
with open(json_name, 'w') as fp:
    json.dump(out_dict, fp)
This .json
file will be used by the task manager to
identify that a task has finished and to display summary results.
These steps will be handled by the task manager, and don’t need
to be part of the task.
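Because the task manager relies on this file, it can be worth checking locally that the .json output parses and contains the expected fields. Below is a minimal sketch using a stand-in result dictionary; the field names follow the example above, and the exact list of required fields is still to be confirmed.

```python
import json
import os
import tempfile

# a stand-in for the dictionary written by the task
out_dict = dict(
    username='unknown',
    hostname='unknown',
    tables=None,
    date='12:00:00 UTC Mon 01 Jan 2024',
    directory='./',
    p_value=0.00390625,
    comment='p-value based on data from 256.00 seconds',
)

with tempfile.TemporaryDirectory() as outdir:
    json_name = os.path.join(outdir, 'data.json')
    with open(json_name, 'w') as fp:
        json.dump(out_dict, fp)

    # read the file back, roughly as the task manager would
    with open(json_name) as fp:
        result = json.load(fp)

# check that the fields this example writes survived the round trip
for key in ('username', 'hostname', 'date', 'directory', 'p_value', 'comment'):
    assert key in result
assert 0. < result['p_value'] <= 1.
```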
What this task should do, however, is create a results page just for this task. We will walk through this step next.
Displaying the results
In order to make the results of this task
easier to interpret, we need to develop
an html page to display the numbers and plots we generated.
There are no requirements for this page beyond it being easy to interpret.
However, there are tools
that are a part of the gwdetchar
package that
help simplify the process of creating an html results
page in python.
This section walks through a basic example of a results page,
but more intricate pages may be required for complex tasks.
The first step in making our results page is using the
new_bootstrap_page()
function.
A number of details need to be filled in here,
including a name for the page and some of the headings that will be shown.
# set up page
links = [str(gps_time)]
title = '%s Significance for GPS Time %.2f' % (ifo, gps_time)
(brand, class_) = htmlio.get_brand(ifo, 'Large Val', int(gps_time))
navbar = htmlio.navbar(links, class_=class_, brand=brand)
page = htmlio.new_bootstrap_page(
    title='%s Largest Value | %d' % (ifo, int(gps_time)),
    navbar=navbar)
page.h1(title, class_='pb-2 mt-3 mb-2 border-bottom')
The html page is styled using bootstrap, and the MarkupPy package allows html elements (such as h1 or div) to be written using python syntax.
Examples of the class syntax for bootstrap can be seen in this cheat sheet: https://hackerthemes.com/bootstrap-cheatsheet/#custom-radio
This example can be used directly as a base, but different formats can easily be configured.
The next step of this results example is adding the plots. We will start a new row and then add both plots, one after the other:
# add in the plots
page.div(class_='row', id_='main-plot')
page.div(class_='col-md-6')
img1 = htmlio.FancyPlot(hist_name)
page.add(htmlio.fancybox_img(img1))
page.div.close() # col-md-6
page.div(class_='col-md-6')
img2 = htmlio.FancyPlot(ts_name)
page.add(htmlio.fancybox_img(img2))
page.div.close() # col-md-6
page.div.close() # main-plot
The class 'col-md-6' means that each plot will take up half of the page width, as long as the browser window is wide enough to show them side by side.
In the case of a smaller window (e.g. on a mobile device),
the plots will be resized or stacked automatically to keep them
viewable.
Next we will make a new row that includes a table.
We first create an array of tuples with the content
and then add in a parameter_table()
with this content.
We also set the start and end times of the analysis,
which are treated separately in the table.
# make a table with some important details
content = [
    ('Channel', markup.oneliner.code(channel)),
    ('Max value', "{:e}".format(max_val)),
    ('P-value', '{:.4f}'.format(p_value))]
page.h2('Parameters', class_='mt-4 mb-4', id_='parameters')
page.div(class_='row')
page.div(class_='col-md-12 col-sm-12')
page.add(htmlio.parameter_table(content, start=start_data, end=end_data))
page.div.close() # col-md-12 col-sm-12
page.div.close() # row
Finally, we add in the command line that was used to generate the page. This is automatically generated using sys.argv[0]. By including the explicit call used to generate the page, we are able to easily document how each analysis was generated.
More extensive descriptions
of how an analysis was generated are also beneficial,
but including the command line will generally
be sufficient.
# save the command line used to run this program
PROG = os.path.basename(sys.argv[0])
page.h3('Command-line:')
page.add(htmlio.get_command_line(about=False, prog=PROG))
Now that we’ve set up the entire html results page,
it is time to save the page.
The .html
file is saved in the same
folder as the other results.
We also include a link to the documentation page,
which will be included as a link in the footer.
# save the html page
html_path = os.path.join(outdir, 'index.html')
htmlio.close_page(page, html_path,
about='https://detchar.docs.ligo.org/dqrtasks/loudestamp.html')
We’ve now finished the entire function used to generate this task. The last step of setting up the code will be adding in support for running the task on the command line, which is explained in the next section.
Setting up the command-line
Adding support for the command line is straightforward since we have a single function that runs the task.
Our first step is to add a new function named main()
that ingests the required arguments.
We use the argparse
package to include all of the
required variables.
These arguments are then passed to the function
that we wrote earlier:
def main(args=None):
    logging.basicConfig()
    logger = logging.getLogger(__process_name__)
    logger.setLevel(logging.DEBUG)
    parser = argparse.ArgumentParser(description=__doc__,
                                     prog=__process_name__)
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='increase verbose output')
    parser.add_argument('-V', '--version', action='version',
                        version=__version__)
    parser.add_argument('-o', '--output-dir', type=os.path.abspath,
                        help='Directory for all output')
    parser.add_argument('--gps-time', type=float, required=True,
                        help='GPS time of the event to analyze')
    parser.add_argument('--channel', type=str, required=True,
                        help='Name of the channel to analyze')
    parser.add_argument('--total-duration', type=float, default=256,
                        help='Duration of data used to estimate the p-value')
    parser.add_argument('--chunk-duration', type=float, default=1.0,
                        help='Width of the analysis window around the event')
    args = parser.parse_args(args=args)

    # call the above function
    largest_value(channel=args.channel,
                  gps_time=args.gps_time,
                  outdir=args.output_dir,
                  total_duration=args.total_duration,
                  chunk_duration=args.chunk_duration)
Typically, there should be an argument for every variable that was included as an input for the task function.
We now add a few additional lines that specifically allow
the main()
function to be used on the command line:
# allow the script to be run on the command line
if __name__ == "__main__":
    main(args=None)
At this point, the file largest.py
is complete and can be saved.
We now need to make sure that the script we wrote is installed as part of the
package.
To accomplish this, we add our script to the setup.py
file
as one of the console_scripts
options as follows:
entry_points={
    "console_scripts": [
        'dqr-largest-value=dqrtasks.largest.largest:main',
    ],
},
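As an aside, if the package's metadata ever moves from setup.py to a pyproject.toml file, the same entry point could be declared there instead; a sketch, assuming a setuptools-style build (this file is not part of the current repository layout):

```toml
[project.scripts]
dqr-largest-value = "dqrtasks.largest.largest:main"
```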
The above code will connect the command line option dqr-largest-value to the script we just wrote, which can be found via the path dqrtasks.largest.largest:main.
Reinstalling dqrtasks
should allow this script to be run on the command line!
Documentation
The last important step is documenting the new task. A new documentation page should be added to the repository detailing how the task is run.
The required documentation sections are:
…
Work in progress