9 stable releases

new 2.1.0 Apr 16, 2025
2.0.1 Mar 7, 2025
2.0.0 Feb 28, 2025
1.2.0 Feb 18, 2025
0.4.0 Aug 15, 2021

MPL-2.0 license

giant-squid

An alternative MWA ASVO client. For general help on using the MWA ASVO, please visit: MWA ASVO wiki.


NOTE FOR HPC USERS

Please read this wiki article if you are running giant-squid on HPC systems.


giant-squid was originally created as a library to do MWA ASVO related tasks in the Haskell programming language (it is now available in Rust). However, it's not just a library; the giant-squid executable acts as an alternative to the manta-ray-client and may better suit users for a few reasons:

  1. By default, giant-squid stream untars the downloads from MWA ASVO. In other words, rather than downloading a potentially large (> 100 GiB!) tar file and then untarring it yourself (thereby occupying double the space of the original tar and performing a very expensive IO operation), giant-squid extracts the files as they are downloaded. To keep the tar file instead of extracting it, use --keep-tar.

  2. If --keep-tar is specified, then giant-squid will support resuming partial downloads and continue where it left off if the download command is run after a download was interrupted or failed. In addition, if the file to download already exists and matches the expected file size and checksum, then giant-squid will skip downloading the file again.

  3. giant-squid does not require a CSV file to submit jobs; this is instead handled by command line arguments.

  4. For any commands that accept obsids or job IDs, it is possible to use text files instead. These files are unpacked as if you had typed them out manually, and each entry of the text file(s) is checked for validity (all ints and all 10 digits long); any exceptions are reported and the command fails.

  5. One can ask giant-squid to print your MWA ASVO queue as JSON; this makes parsing the state of your jobs in another programming language much simpler.

  6. By default, giant-squid will validate the hash of the archive. You can skip this check with --skip-hash.

Usage

Print help text

giant-squid -h

This also applies to all of the commands, e.g.

giant-squid download -h

Print the giant-squid version

giant-squid --version
giant-squid -V

(Useful if things are changing over time!)

List MWA ASVO jobs

giant-squid list
giant-squid l

List MWA ASVO jobs in JSON

The following commands are equivalent:

giant-squid list --json
giant-squid list -j
giant-squid l -j

Example output:

giant-squid list -j
{"325430":{"obsid":1090528304,"jobId":325430,"jobType":"DownloadVisibilities","jobState":"Ready","files":[{"fileName":"1090528304_vis.zip","fileSize":10762878689,"fileHash":"ca0e89e56cbeb05816dad853f5bab0b4075097da"}]},"325431":{"obsid":1090528432,"jobId":325431,"jobType":"DownloadVisibilities","jobState":"Ready","files":[{"fileName":"1090528432_vis.zip","fileSize":10762875021,"fileHash":"9d9c3c0f56a2bb4e851aa63cdfb79095b29c66c9"}]}}

jobType is allowed to be any of:

  • Conversion
  • DownloadVisibilities
  • DownloadMetadata
  • DownloadVoltage
  • CancelJob

jobState is allowed to be any of:

  • Queued
  • WaitCal
  • Staging
  • Staged
  • Downloading
  • Preprocessing
  • Imaging
  • Delivering
  • Ready
  • Error: Text (e.g. "Error: some error message")
  • Expired
  • Cancelled

Example reading this in Python:

$ giant-squid list -j > /tmp/asvo.json
$ ipython
Python 3.8.0 (default, Oct 23 2019, 18:51:26)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.10.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import json

In [2]: with open("/tmp/asvo.json", "r") as h:
   ...:     q = json.load(h)
   ...:

In [3]: q.keys()
Out[3]: dict_keys(['216087', '216241', '217628'])
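
Once the queue is loaded, ordinary dictionary operations apply. The following is a minimal sketch that uses the two-job example output shown earlier as inline data; the helper name ready_files is hypothetical and not part of giant-squid, and the handling of a missing "files" entry is an assumption.

```python
import json

# Two-job sample copied from the example `giant-squid list -j` output.
SAMPLE = """
{"325430": {"obsid": 1090528304, "jobId": 325430,
            "jobType": "DownloadVisibilities", "jobState": "Ready",
            "files": [{"fileName": "1090528304_vis.zip",
                       "fileSize": 10762878689,
                       "fileHash": "ca0e89e56cbeb05816dad853f5bab0b4075097da"}]},
 "325431": {"obsid": 1090528432, "jobId": 325431,
            "jobType": "DownloadVisibilities", "jobState": "Ready",
            "files": [{"fileName": "1090528432_vis.zip",
                       "fileSize": 10762875021,
                       "fileHash": "9d9c3c0f56a2bb4e851aa63cdfb79095b29c66c9"}]}}
"""


def ready_files(queue):
    """Return (job_id, file_name, size_in_bytes) for each file of each Ready job."""
    rows = []
    for job in queue.values():
        if job.get("jobState") != "Ready":
            continue
        # Assumption: "files" may be absent or null for jobs without products.
        for entry in job.get("files") or []:
            rows.append((job["jobId"], entry["fileName"], entry["fileSize"]))
    return rows


queue = json.loads(SAMPLE)
for job_id, name, size in ready_files(queue):
    print(f"{job_id}\t{name}\t{size / 1024**3:.1f} GiB")
```

Summing the sizes returned by ready_files is a cheap way to check for enough free disk space before starting a batch of downloads.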

Filter MWA ASVO job listing

giant-squid list takes an optional list of identifiers that can be used to filter the job listing. These identifiers can be either a list of job IDs or a list of obsids, but not both.

Additionally, the --states and --types options can be used to further filter the output.

Both take a comma-separated, case-insensitive list of values from the jobState and jobType lists above, respectively. Values can be provided in TitleCase, UPPERCASE, lowercase, kebab-case, snake_case, or even SPoNgeBOb-CAse.

Example: show only jobs that match all of the following conditions:

  • obsid is 1234567890 or 1234567891
  • jobType is DownloadVisibilities, DownloadMetadata or CancelJob
  • jobState is Preprocessing or Queued

giant-squid list \
   --types dOwNlOaD__vIsIbIlItIeS,download-metadata,CANCELJOB \
   --states PrepRoCeSsInG,__Q_u_e_u_e_D__ \
   1234567890 1234567891
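
How giant-squid matches these spellings internally is not described here. One simple rule that reproduces the behaviour in the example is to drop every character that is not a letter or digit and compare case-insensitively. The sketch below is only an illustration of that rule; the names canonical and parse_list are hypothetical.

```python
JOB_TYPES = ["Conversion", "DownloadVisibilities", "DownloadMetadata",
             "DownloadVoltage", "CancelJob"]
# "Error" stands in for "Error: Text"; the message part is not matched here.
JOB_STATES = ["Queued", "WaitCal", "Staging", "Staged", "Downloading",
              "Preprocessing", "Imaging", "Delivering", "Ready", "Error",
              "Expired", "Cancelled"]


def _key(name):
    """Strip separators such as '-' and '_' and fold to lower case."""
    return "".join(ch for ch in name if ch.isalnum()).lower()


def canonical(name, allowed):
    """Map a loosely spelled name onto its canonical form, or raise ValueError."""
    lookup = {_key(item): item for item in allowed}
    try:
        return lookup[_key(name)]
    except KeyError:
        raise ValueError(f"unrecognised value: {name!r}") from None


def parse_list(arg, allowed):
    """Parse a comma-separated option value such as the --types argument."""
    return [canonical(part, allowed) for part in arg.split(",")]


print(parse_list("dOwNlOaD__vIsIbIlItIeS,download-metadata,CANCELJOB", JOB_TYPES))
print(parse_list("PrepRoCeSsInG,__Q_u_e_u_e_D__", JOB_STATES))
```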

Example: manual hash validation with Bash and jq

This example pipes the output of giant-squid list -j into jq, then downloads, verifies and untars each ready job. It is the equivalent of what giant-squid download does, but with the extra overhead of storing the tar to disk (as with -k).

set -eux
giant-squid list -j --types download_visibilities --states ready \
  | jq -r '.[]|[.jobId,.files[0].fileUrl//"",.files[0].fileSize//"",.files[0].fileHash//""]|@tsv' \
  | tee ready.tsv
while read -r jobid url size hash; do
   # note: it's a good idea to check you have enough disk space here using $size.
   wget "$url" -O "${jobid}.tar" --progress=dot:giga --wait=60 --random-wait
   sha1=$(sha1sum "${jobid}.tar" | cut -d' ' -f1)
   if [ "$sha1" != "$hash" ]; then
      echo "Download failed, hash mismatch. Expected $hash, got $sha1"
      exit 1
   fi
   tar -xf "${jobid}.tar"
done < ready.tsv
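
The same size and SHA-1 check can be done from Python when a tar has already been saved to disk. This is a minimal sketch using only the standard library; the function names are hypothetical, and it is not a description of how giant-squid itself verifies downloads.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a large tar never sits fully in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_size, expected_sha1):
    """True only if both the size and the SHA-1 agree with the job listing."""
    if os.path.getsize(path) != int(expected_size):
        return False  # cheap check first; avoids hashing an obviously bad file
    return sha1_of_file(path) == expected_sha1.lower()
```

The expected_size and expected_sha1 arguments correspond to the fileSize and fileHash fields of the JSON job listing.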

Download MWA ASVO jobs

To download job ID 12345 to your current directory '.':

giant-squid download 12345
# or
giant-squid d 12345

To download obsid 1065880128 to your current directory '.':

giant-squid download 1065880128
# or
giant-squid d 1065880128

(giant-squid differentiates between job IDs and obsids by the length of the number specified; 10-digit numbers are treated as obsids. If the MWA ASVO ever serves up more than a billion jobs, you have permission to be upset with me. The same applies if this code is still being used in the year 2296.)
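
The length rule can be sketched in a few lines. The function names below are hypothetical. The date calculation additionally assumes that obsids are GPS seconds counted from 1980-01-06 and ignores leap seconds; that assumption is not stated above, but it is consistent with the year mentioned.

```python
from datetime import datetime, timedelta

GPS_EPOCH = datetime(1980, 1, 6)  # assumed origin of obsid timestamps


def classify(identifier):
    """Return "obsid" for 10-digit numbers and "job" for any other length."""
    text = str(identifier).strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a positive integer: {identifier!r}")
    return "obsid" if len(text) == 10 else "job"


def last_ten_digit_obsid_date():
    """Date of the largest 10-digit obsid under the GPS-seconds assumption."""
    return GPS_EPOCH + timedelta(seconds=9_999_999_999)


for item in ["12345", "1065880128"]:
    print(item, "->", classify(item))
print("10-digit obsids run out in", last_ten_digit_obsid_date().year)
```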

Text files containing job IDs or obsids may be used too.

You can specify the directory to download to by passing the --download-dir option (short version -d) to the download command. Omitting this will default to your current directory (.).

To download obsid 1065880128 to your /tmp directory:

giant-squid download --download-dir /tmp 1065880128
# or
giant-squid d -d /tmp 1065880128

By default, giant-squid will perform stream untarring. Disable this with -k (or --keep-tar).

The MWA ASVO provides a SHA-1 of its downloads. giant-squid will verify the integrity of your download by default. Pass --skip-hash to the download command to skip this check.

Jobs which were submitted with the /scratch data delivery option behave differently from jobs submitted with the other data delivery options. When attempting to download a /scratch job, if the path of the job (e.g. /scratch/mwaops/asvo/12345) is reachable from the current host, it will be moved to the current working directory. Otherwise, it will be skipped.

Download performance: Concurrent Downloads

By default, giant-squid will download 4 jobs concurrently (assuming you have specified 4 or more jobs to download). This can help throughput if you have a good Internet connection. To change the value, specify --concurrent-downloads N or -c N, where N is an integer greater than or equal to 1.

Download performance: Changing the buffer size

By default, when downloading, giant-squid will store 100 MiB of the download in memory before writing to disk. This is friendlier on disks (especially those belonging to supercomputers!), and can make downloads faster.

The amount of data to cache before writing can be tuned by setting GIANT_SQUID_BUF_SIZE. e.g.

export GIANT_SQUID_BUF_SIZE=50
giant-squid download 12345

would use 50 MiB of memory to cache the download before writing.
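
To make the idea concrete, here is a small sketch of buffering a stream in memory and writing it out only once the buffer is full. It illustrates the general technique, not giant-squid's actual Rust implementation; the function names and the read_size parameter are hypothetical, and only the GIANT_SQUID_BUF_SIZE variable and its 100 MiB default come from the text above.

```python
import io
import os

DEFAULT_BUF_MIB = 100  # default stated above


def buffer_size_bytes(environ=None):
    """Read GIANT_SQUID_BUF_SIZE (in MiB) and convert it to bytes."""
    environ = os.environ if environ is None else environ
    mib = int(environ.get("GIANT_SQUID_BUF_SIZE", DEFAULT_BUF_MIB))
    if mib < 1:
        raise ValueError("buffer size must be at least 1 MiB")
    return mib * 1024 * 1024


def buffered_copy(src, dst, buf_size, read_size=64 * 1024):
    """Copy src to dst, flushing to dst only when buf_size bytes are held.

    Returns the number of write calls made, which is what a large buffer
    is meant to keep small.
    """
    buffer = bytearray()
    writes = 0
    while True:
        chunk = src.read(read_size)
        if not chunk:
            break
        buffer += chunk
        if len(buffer) >= buf_size:
            dst.write(buffer)
            buffer.clear()
            writes += 1
    if buffer:  # flush whatever is left at end of stream
        dst.write(buffer)
        writes += 1
    return writes


source = io.BytesIO(b"x" * 1000)
sink = io.BytesIO()
print(buffered_copy(source, sink, buf_size=300, read_size=100), "writes")
```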

Resuming Interrupted Downloads

  • giant-squid will attempt to resume an existing/interrupted download when the download command includes the --keep-tar option.
  • Without the --keep-tar option, giant-squid stream untars files (i.e. it downloads the tar from MWA ASVO and, in memory, untars files on the fly) which means it is not possible for giant-squid to be able to reliably resume an interrupted download.
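
The mechanism giant-squid uses to resume is not described here. A common way to resume an HTTP download is to compare the size of the partial file on disk with the expected size and request only the missing bytes with a Range header. The sketch below shows that decision only; resume_plan is a hypothetical name, and whether giant-squid or the MWA ASVO storage works exactly this way is an assumption.

```python
import os


def resume_plan(path, expected_size):
    """Decide what to do with a possibly partial file at `path`.

    Returns (action, headers) where action is one of:
      "download" - nothing on disk yet, fetch the whole file
      "resume"   - partial file present, fetch the remainder with a Range header
      "verify"   - size already matches, only the checksum remains to be checked
      "restart"  - file is larger than expected, so it cannot be a valid prefix
    """
    have = os.path.getsize(path) if os.path.exists(path) else 0
    if have == 0:
        return ("download", None)
    if have == expected_size:
        return ("verify", None)
    if have > expected_size:
        return ("restart", None)
    # HTTP byte ranges are zero-based, so the next missing byte is index `have`.
    return ("resume", {"Range": f"bytes={have}-"})
```

A file that reaches the "verify" state would still need its SHA-1 compared with the fileHash from the job listing before the download is skipped.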

Submit MWA ASVO jobs

A Note On Delivery Options

Before submitting any MWA ASVO job, you will need to decide where you want the data to be delivered. There are up to three options depending on your user profile.

Delivery: Acacia (Default)

  • The default option for all job types except voltage downloads (voltage jobs cannot be delivered to Acacia due to their size).
  • Files are tarred up and uploaded to Pawsey's Acacia object store.
  • To submit a job with the Acacia delivery option, specify -d acacia or --delivery acacia on any job submission command.
  • A URL that expires in 7 days is generated, allowing you to download the file via giant-squid, wget, curl, etc. from anywhere in the world.

Delivery: Pawsey Scratch Filesystem

  • You can request that your job's files be delivered to Pawsey's /scratch filesystem.
  • To submit a job with the scratch delivery option, specify -d scratch or --delivery scratch on any job submission command.
  • You can also optionally pass delivery-format tar to instruct MWA ASVO to deliver a tar of the files, rather than all of the individual files.
  • This option is only available to users who have a Pawsey account with MWA group access and whose Pawsey Group has been set in their MWA ASVO profile by an MWA ASVO administrator.
    • Please contact support to request this.
  • NOTE: all Pawsey users in the specified Pawsey Group can access your job's files. If you prefer to keep your data private, choose the acacia delivery option, as only you have the download URL.

Delivery: Down Under Geosolutions (DUG) Filesystem

  • You can request that your job's files be delivered to DUG's filesystem.
  • You can also optionally pass delivery-format tar to instruct MWA ASVO to deliver a tar of the files, rather than all of the individual files.
  • To submit a job with the DUG delivery option, specify -d dug or --delivery dug on any job submission command.
  • Voltage jobs cannot currently be delivered to DUG.
  • This option is only open to users who have a Curtin University DUG account and whose DUG Group has been set in their MWA ASVO profile by an MWA administrator.
    • Please contact support to request this.
  • NOTE: all DUG users in the specified DUG Group can access your job's files. If you prefer to keep your data private, choose the acacia delivery option, as only you have the download URL.

Changing Your Default Delivery Option

  • You can set the environment variable GIANT_SQUID_DELIVERY to acacia, scratch or dug if you don't want to keep specifying the delivery option on the command line.

Visibility downloads

A "visibility download job" refers to a job which provides a zip containing gpubox files, a metafits file and cotter flags for a single obsid.

To submit a visibility download job for the obsid 1065880128:

giant-squid submit-vis 1065880128
# or
giant-squid sv 1065880128

Text files containing obsids may be used too.

If you want to check that your command works without actually submitting the obsids, then you can use the --dry-run option (short version -n).

Conversion downloads

To submit a conversion job for obsid 1065880128:

giant-squid submit-conv 1065880128
# or
giant-squid sc 1065880128

Text files containing obsids may be used too.

The default conversion options can be found by running the help text:

giant-squid submit-conv -h

To change the default conversion options and/or specify more options, specify comma-separated key-value pairs like so:

giant-squid submit-conv 1065880128 -p avg_time_res=0.5,avg_freq_res=10

If you want to check that your command works without actually submitting the obsids, then you can use the --dry-run option (short version -n). More messages (including what giant-squid uses for the conversion options) can be accessed with -v (or --verbose). e.g.

$ giant-squid submit-conv 1065880128 -nv -p avg_time_res=0.5,avg_freq_res=10
20:40:24 [INFO] Would have submitted 1 obsids for conversion, using these parameters:
{"output": "uvfits", "job_type": "conversion", "flag_edge_width": "160", "avg_freq_res": "10", "avg_time_res": "0.5"}
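
The comma-separated key=value syntax is straightforward to reproduce when preparing or checking parameter strings in a script. The sketch below parses such a string and overlays it on a set of defaults. The three default values are simply the keys in the example output that were not given on the command line; treating them as the full default set is an assumption, and the real defaults should be taken from giant-squid submit-conv -h. The function names are hypothetical.

```python
# Assumed defaults, inferred from the example dry-run output only.
ASSUMED_DEFAULTS = {
    "output": "uvfits",
    "job_type": "conversion",
    "flag_edge_width": "160",
}


def parse_params(text):
    """Turn "a=1,b=2" into {"a": "1", "b": "2"}; values stay as strings."""
    params = {}
    for pair in text.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {pair!r}")
        params[key.strip()] = value.strip()
    return params


def build_params(text, defaults=ASSUMED_DEFAULTS):
    """User-supplied pairs override the defaults on key collisions."""
    return {**defaults, **parse_params(text)}


print(build_params("avg_time_res=0.5,avg_freq_res=10"))
```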

Metadata downloads

A "metadata download job" refers to a job which provides a zip containing a metafits file and cotter flags for a single obsid.

To submit a metadata download job for the obsid 1065880128:

giant-squid submit-meta 1065880128
# or
giant-squid sm 1065880128

Text files containing obsids may be used too.

If you want to check that your command works without actually submitting the obsids, then you can use the --dry-run option (short version -n).

Voltage downloads

A "voltage download job" refers to a job which provides the raw voltages for one or more obsids.

To submit a voltage download job for the obsid 1065880128:

giant-squid submit-volt --delivery scratch --offset 0 --duration 8 1065880128
# or
giant-squid submit-volt -d scratch -o 0 -u 8 1065880128

Text files containing obsids may be used too.

If you want to check that your command works without actually submitting the obsids, then you can use the --dry-run option (short version -n).

For MWAX_VCS or MWAX_BUFFER voltage observations you can optionally pass --from_channel and --to_channel to restrict the job to only the receiver coarse channel range specified (inclusive). MWA receiver channel numbers range from 0-255, and multiplying by 1.28 will result in the center frequency (in MHz) of that channel. Each MWA observation nominally has 24 coarse channels.
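
The channel-to-frequency rule above is simple arithmetic. A short sketch, with hypothetical function names and an arbitrary example range of 24 channels:

```python
CHANNEL_WIDTH_MHZ = 1.28  # multiplier given above


def center_freq_mhz(channel):
    """Center frequency in MHz of an MWA receiver coarse channel (0 to 255)."""
    if not 0 <= channel <= 255:
        raise ValueError(f"receiver channel out of range: {channel}")
    return channel * CHANNEL_WIDTH_MHZ


def channel_range_mhz(from_channel, to_channel):
    """(channel, center frequency) pairs for an inclusive channel range."""
    return [(ch, center_freq_mhz(ch)) for ch in range(from_channel, to_channel + 1)]


# Example: 24 contiguous coarse channels; the start channel is arbitrary.
for channel, freq in channel_range_mhz(109, 132):
    print(f"channel {channel}: {freq:.2f} MHz")
```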

Unlike other jobs, you cannot choose to have your files tarred up and uploaded to Pawsey's Acacia for remote download or to DUG's filesystem, as the data is generally too large. If you are in the mwaops or mwavcs Pawsey groups and you have asked an MWA ASVO admin to set the Pawsey group in your MWA ASVO profile, you can request that the files be left on Pawsey's /scratch filesystem. To submit a job with the /scratch option, set the environment variable GIANT_SQUID_DELIVERY=scratch or pass -d scratch or --delivery scratch.

Resubmitting jobs

By default, the MWA ASVO server will not allow you to submit a new job which has exactly the same settings/parameters as an existing job in your queue (except errored jobs). You can, however, override this behaviour by specifying --allow-resubmit (short version -r) on any job submission.

Installation

Pre-compiled

Have a look at the GitHub releases page.

Building from crates.io

  • Install Rust

  • Run cargo install mwa_giant_squid

    • The final executable will be at ~/.cargo/bin/giant-squid

    • This destination can be configured with the CARGO_HOME environment variable.

Building from source

  • Install Rust

  • Clone this repo and cd into it

    git clone https://github.com/MWATelescope/giant-squid && cd giant-squid

  • Run cargo install --path .

    • The final executable will be at ~/.cargo/bin/giant-squid

    • This destination can be configured with the CARGO_HOME environment variable.

Docker

You can run giant-squid using Docker:

docker run mwatelescope/giant-squid:latest -h

Other

The Haskell code is still available on chj's GitLab. Switching to Rust means that the code is more efficient and the code is easier to read (sorry Haskell. I love you, but you're weird).
