Writing a new DQR task
This guide will show how to develop a new task and add the script to dqrtasks. As an example, we will walk through the code for dqr-largest-value. Note that this script is designed for this tutorial, and is not a “reviewed” script.
The dqr-largest-value script is designed to read in data, find the maximum value of the data in a small chunk of time around an event, and then estimate the significance (with a p-value) of that measurement.
The completed results page for this example task, using data from LIGO Livingston around GW170817, can be found here:
Starting off
The first thing to do for a new script is to create a new folder in dqrtasks/dqrtasks/ to house your code. For this example, we created the folder dqrtasks/dqrtasks/largest. The next step is to add a file to this folder named __init__.py.
There aren’t any requirements for what this file should contain;
this file is just required for the package to be imported.
To make imports easier, one can add the following line to the file:
from . import *
Now let’s start writing our new script.
We create the file largest.py
and add in the required python shebang
and import the packages that
we will need for this script.
For this example, it looks like the following:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import os, json, argparse, time, sys, logging
import numpy
from gwpy.timeseries import TimeSeries
from gwdetchar.io import html as htmlio
from MarkupPy import markup
Next we want to create a function that will run the required code. You can include as many functions as needed (like any Python code), but it is strongly suggested that there be a “top-level” function that can run the entire script.
This new function should take in all of the required variables as input; it is strongly recommended that as few variables as possible are “hard-coded”. The function we are making only needs 5 variables:
def largest_value(channel,
                  gps_time,
                  outdir='./',
                  total_duration=256.,
                  chunk_duration=1.):
    ifo = channel[:2]
Now that we’ve set things up, we can start on the code that will run the actual analysis.
Data access
Accessing the data is one of the most important parts of the code to get right.
One of the most significant differences between different types of IGWN
computing environments is how the data is stored and queried.
To ensure that data can be accessed in the largest
number of computing environments,
it is strongly suggested that the gwpy function
gwpy.timeseries.TimeSeries.get()
be used to access data.
An example use of this function is below:
# get data
start_data = int(gps_time - total_duration/2.)
end_data = int(gps_time + total_duration/2.)
data = TimeSeries.get(channel=channel,
                      start=start_data,
                      end=end_data,
                      )
Using gwpy.timeseries.TimeSeries.get() allows data to be accessed by your code using datafind methods at LIGO and EGO clusters, in addition to nds2.
Besides allowing your code to be used at all IGWN computing
sites, having this flexibility in data access also
increases the robustness of the code.
In cases where data is not accessible using one of these
methods,
alternate methods will be attempted automatically.
Now that we have data, let’s analyze it!
Data analysis
The analysis for this task is quite simple.
We crop the data down to a small window around our event (with a window width of chunk_duration) and then find the maximum value of the data during this window.
# find max value
short_data = data.crop(gps_time-chunk_duration/2., gps_time+chunk_duration/2.).value
max_val = max(short_data)
Most tasks will require more complicated data analysis, but this should suffice for our example. The most involved component will be calculating the significance of this measurement, which will be explained in the next section.
Calculating the p-value
In order to support increased automation of DQR tasks, your task should report a significance estimate (using a p-value) if at all possible. This p-value should be the probability of obtaining the observed result under the null hypothesis, where the null hypothesis is that the data quality issue considered by the task is not present. In other words, a low p-value means that a data quality issue is likely present. In this case, the “data quality issue” we are considering is that the considered channel has an anomalously large value.
To estimate the significance of our measurement, we will repeat the measurement on a large number of time windows (with the same width) to estimate the distribution of values this task would return. We will loop through all of the data we downloaded (256 seconds in this example).
# find same value for every chunk
max_val_array = []
chunk_starts = numpy.arange(start=start_data,
                            stop=end_data,
                            step=chunk_duration)
for t in chunk_starts:
    short_data_chunk = data.crop(t, t + chunk_duration).value
    max_val_chunk = max(short_data_chunk)
    max_val_array.append(max_val_chunk)
max_val_array = numpy.array(max_val_array)
Now that we have our distribution, the p-value is simply the fraction of samples as large as or larger than our measurement. If our measurement were the largest out of 256 samples, then the p-value we would return is 1/256, or 0.00390625. The p-value cannot be 0 in this case, as we are limited by our finite sample size.
# calculate the p-value: the fraction of chunks at least as loud,
# with 1 added so the p-value is never exactly 0
points_larger = sum(max_val_array > max_val)
p_value = (points_larger + 1.)/len(max_val_array)
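As a sanity check on this estimator, we can apply the same formula to synthetic data. The sketch below is not part of the task itself: it uses standard-library Gaussian noise in place of real strain data, and all of its variable names are local to the example.

```python
import random

random.seed(1)  # make the check reproducible

n_chunks = 256  # one chunk per second of a 256-second stretch

# simulate the per-chunk maxima of some stationary noise
max_val_array = [max(random.gauss(0., 1.) for _ in range(16))
                 for _ in range(n_chunks)]

# pretend the chunk containing the "event" is the first one
max_val = max_val_array[0]

# same estimator as above: fraction of chunks at least as loud
points_larger = sum(1 for v in max_val_array if v > max_val)
p_value = (points_larger + 1.) / len(max_val_array)

# the +1 keeps the p-value bounded away from 0
assert 1. / n_chunks <= p_value <= 1.
```

Repeating this check with many random seeds, the returned p-values should be spread roughly uniformly between 1/256 and 1, since pure noise contains no data quality issue.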
This concludes the “analysis” portion of the example. The next few sections will detail how to save and present these results.
Plotting the results
Although this task was designed so that its results can be analyzed automatically, it is also strongly beneficial if these results can be confirmed by a human (if needed). To help this process, we should make a few plots that show how we came to this p-value and give some context for our result. We will make two plots: one showing the distribution of measured values, and a second showing the time series of our data:
# first the distribution of data
fig = plt.figure(figsize=[8,4])
ax = fig.gca()
ax.hist(max_val_array, bins=50, histtype='step',
        label="Full dataset")
ax.axvline(max_val,label='Measured value',color='r',ls='--')
ax.set_xlabel('Amplitude')
ax.set_ylabel('Counts')
ax.set_title('')
ax.grid(True)
ax.legend(loc='upper right')
fig.tight_layout()
hist_name = '%s-hist_%d.png'%(ifo,int(gps_time))
fig.savefig(os.path.join(outdir, hist_name))
# now the timeseries
plot = data.plot(figsize=[8,4], epoch=gps_time)
ts_name = '%s-timeseries-%d.png'%(ifo,int(gps_time))
plot.savefig(os.path.join(outdir, ts_name))
As part of generating these plots, we save them in the same output directory as all our other results will be stored. We also store the plot names as variables which we will be using later.
Saving the results
Next we save our numerical results to a .json file. There are a number of required fields for this result file, including (information to be confirmed). This data can be saved in a relatively straightforward manner:
# json file
json_name = os.path.join(outdir, 'data.json')
out_dict = dict(
    username=os.getenv('USER', default='unknown'),
    hostname=os.getenv('HOSTNAME', default='unknown'),
    tables=None,
    date=time.strftime("%H:%M:%S UTC %a %d %b %Y", time.gmtime()),
    directory=outdir,
    p_value=p_value,
    comment='p-value based on data from %.2f seconds' % total_duration,
)
with open(json_name, 'w') as fp:
    json.dump(out_dict, fp)
This .json
file will be used by the task manager to
identify that a task has finished and to display summary results.
These steps will be handled by the task manager, and don’t need
to be part of the task.
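Because the task manager relies on this file, it can be worth checking locally that the .json output parses and contains the expected fields. Below is a minimal sketch using a stand-in result dictionary; the field names follow the example above, and the exact list of required fields is still to be confirmed.

```python
import json
import os
import tempfile

# a stand-in for the dictionary written by the task
out_dict = dict(
    username='unknown',
    hostname='unknown',
    tables=None,
    date='12:00:00 UTC Mon 01 Jan 2024',
    directory='./',
    p_value=0.00390625,
    comment='p-value based on data from 256.00 seconds',
)

with tempfile.TemporaryDirectory() as outdir:
    json_name = os.path.join(outdir, 'data.json')
    with open(json_name, 'w') as fp:
        json.dump(out_dict, fp)

    # read the file back, roughly as the task manager would
    with open(json_name) as fp:
        result = json.load(fp)

# check that the fields this example writes survived the round trip
for key in ('username', 'hostname', 'date', 'directory', 'p_value', 'comment'):
    assert key in result
assert 0. < result['p_value'] <= 1.
```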
What this task should do, however, is create a results page just for this task. We will walk through this step next.
Displaying the results
In order to make the results of this task
easier to interpret, we need to develop
an html page to display the numbers and plots we generated.
There are no requirements for this page beyond it being easy to interpret.
However, there are tools
that are a part of the gwdetchar
package that
help simplify the process of creating an html results
page in python.
This section walks through a basic example of a results page,
but more intricate pages may be required for complex tasks.
The first step in making our results page is using the
new_bootstrap_page()
function.
A number of details need to be filled in here,
including a name for the page and some of the headings that will be shown.
# set up page
links = [str(gps_time)]
title = '%s Significance for GPS Time %.2f' % (ifo, gps_time)
(brand, class_) = htmlio.get_brand(ifo, 'Large Val', int(gps_time))
navbar = htmlio.navbar(links, class_=class_, brand=brand)
page = htmlio.new_bootstrap_page(
    title='%s Largest Value | %d' % (ifo, int(gps_time)),
    navbar=navbar)
page.h1(title, class_='pb-2 mt-3 mb-2 border-bottom')
The html page is styled using bootstrap, and the MarkupPy package allows html elements (such as h1 or div) to be written using python syntax.
Examples of the class syntax for bootstrap can be seen in this cheat sheet: https://hackerthemes.com/bootstrap-cheatsheet/#custom-radio
This example can be used directly as a base, but different formats can easily be configured.
The next step of this results example is adding the plots. We will start a new row and then add both plots, one after the other:
# add in the plots
page.div(class_='row', id_='main-plot')
page.div(class_='col-md-6')
img1 = htmlio.FancyPlot(hist_name)
page.add(htmlio.fancybox_img(img1))
page.div.close() # col-md-6
page.div(class_='col-md-6')
img2 = htmlio.FancyPlot(ts_name)
page.add(htmlio.fancybox_img(img2))
page.div.close() # col-md-6
page.div.close() # main-plot
The class 'col-md-6' means that each plot will take up half of the page width, as long as the browser window is wide enough to show them side by side.
In the case of a smaller window (e.g. on a mobile device),
the plots will be resized or stacked automatically to keep them
viewable.
Next we will make a new row that includes a table.
We first create an array of tuples with the content
and then add in a parameter_table()
with this content.
We also set the start and end times of the analysis,
which are treated separately in the table.
# make a table with some important details
content = [
    ('Channel', markup.oneliner.code(channel)),
    ('Max value', "{:e}".format(max_val)),
    ('P-value', '{:.4f}'.format(p_value))]
page.h2('Parameters', class_='mt-4 mb-4', id_='parameters')
page.div(class_='row')
page.div(class_='col-md-12 col-sm-12')
page.add(htmlio.parameter_table(content, start=start_data, end=end_data))
page.div.close() # col-md-12 col-sm-12
page.div.close() # row
Finally, we add in the command line that was used to generate the page. This is automatically generated using sys.argv[0]. By including the explicit call used to generate the page, we are able to easily document how each analysis was generated.
More extensive descriptions
of how an analysis was generated are also beneficial,
but including the command line will generally
be sufficient.
# save the command line used to run this program
PROG = os.path.basename(sys.argv[0])
page.h3('Command-line:')
page.add(htmlio.get_command_line(about=False, prog=PROG))
Now that we’ve set up the entire html results page,
it is time to save the page.
The .html
file is saved in the same
folder as the other results.
We also include a link to the documentation page,
which will be included as a link in the footer.
# save the html page
html_path = os.path.join(outdir, 'index.html')
htmlio.close_page(page, html_path,
about='https://detchar.docs.ligo.org/dqrtasks/loudestamp.html')
We’ve now finished the entire function used to generate this task. The last step of setting up the code will be adding in support for running the task on the command line, which is explained in the next section.
Setting up the command-line
Adding support for the command line is straightforward since we have a single function that runs the task.
Our first step is to add a new function named main()
that ingests the required arguments.
We use the argparse
package to include all of the
required variables.
These arguments are then passed to the function
that we wrote earlier:
def main(args=None):
    logging.basicConfig()
    logger = logging.getLogger(__process_name__)
    logger.setLevel(logging.DEBUG)
    parser = argparse.ArgumentParser(description=__doc__,
                                     prog=__process_name__)
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='increase verbose output')
    parser.add_argument('-V', '--version', action='version',
                        version=__version__)
    parser.add_argument('-o', '--output-dir', type=os.path.abspath,
                        help='Directory for all output')
    parser.add_argument('--gps-time', type=float, required=True,
                        help='GPS time of the event to analyze')
    parser.add_argument('--channel', type=str, required=True,
                        help='Name of the channel to analyze')
    parser.add_argument('--total-duration', type=float, default=256,
                        help='Duration of data used to estimate the p-value')
    parser.add_argument('--chunk-duration', type=float, default=1.0,
                        help='Width of the analysis window around the event')
    args = parser.parse_args(args=args)

    # call the above function
    largest_value(channel=args.channel,
                  gps_time=args.gps_time,
                  outdir=args.output_dir,
                  total_duration=args.total_duration,
                  chunk_duration=args.chunk_duration)
Typically, there should be an argument for every variable that was included as an input for the task function.
We now add a few additional lines that specifically allow
the main()
function to be used on the command line:
# allow the script to be run on the command line
if __name__ == "__main__":
    main(args=None)
At this point, the file largest.py
is complete and can be saved.
We now need to make sure that the script we wrote is installed as part of the
package.
To accomplish this, we add our script to the setup.py
file
as one of the console_scripts
options as follows:
entry_points={
    "console_scripts": [
        'dqr-largest-value=dqrtasks.largest.largest:main',
    ],
},
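As an aside, if the package's metadata ever moves from setup.py to a pyproject.toml file, the same entry point could be declared there instead; a sketch, assuming a setuptools-style build (this file is not part of the current repository layout):

```toml
[project.scripts]
dqr-largest-value = "dqrtasks.largest.largest:main"
```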
The above code will connect the command line option dqr-largest-value to the script we just wrote, which can be found via the path dqrtasks.largest.largest:main.
Reinstalling dqrtasks
should allow this script to be run on the command line!
Documentation
The last important step is documenting the new task. A new documentation page should be added to the repository detailing how the task is run.
The required documentation sections are:
…
Work in progress