Hail 0.2 on the AWS Cloud

Quick Start Reference Deployment


August 2020
Adam Perry, Privo and Adam Tebbe, Goldfinch Pharmaceuticals
Paul Underwood, AWS and Roy Hasson, AWS

Visit our GitHub repository for source files and to post feedback, report bugs, or submit feature ideas for this Quick Start.

This Quick Start was created by Privo in collaboration with Amazon Web Services (AWS). Quick Starts are automated reference deployments that use AWS CloudFormation templates to deploy key technologies on AWS, following AWS best practices.

Overview

This Quick Start is for users who want to use an open-source library for scalable data exploration and analysis, with particular emphasis on genomics.

Amazon may share who uses AWS Quick Starts with the AWS Partner Network (APN) Partner that collaborated with AWS on the content of the Quick Start.

Hail 0.2 on AWS

This Quick Start, built in collaboration with Privo, helps to simplify building, managing, and interacting with Hail 0.2 clusters in your Amazon Web Services (AWS) account. Hail 0.2 is an open-source library built for Apache Spark to provide scalable data exploration and analysis, with a particular emphasis on genomics.

Using Hail 0.2, researchers can perform genomic analysis more quickly and efficiently. Hail 0.2 makes it easier to use Spark programming techniques to process genetic data (genomic data frames). It also helps to simplify dealing with multiple input formats by creating a common data structure (Hail MatrixTable).

This deployment uses Amazon EMR in combination with Apache Spark to scale workloads across instances, from single-node ad hoc processes to production-scale genome-wide association studies (GWAS) on large datasets.

Hail 0.2 is maintained by a team in the Neale lab at the Stanley Center for Psychiatric Research of the Broad Institute of MIT and Harvard and the Analytic and Translational Genetics Unit of Massachusetts General Hospital. This Quick Start packages Hail 0.2 into your AWS environment and makes it easy to spin up clusters and integrated notebooks for analysis.

Cost

You are responsible for the cost of the AWS services used while running this Quick Start. There is no additional cost for using the Quick Start.

The AWS CloudFormation template for this Quick Start includes configuration parameters that you can customize. Some of the settings, such as the instance type, affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you use. Prices are subject to change.

After you deploy the Quick Start, enable the AWS Cost and Usage Report to deliver billing metrics to an Amazon Simple Storage Service (Amazon S3) bucket in your account. It provides cost estimates based on usage throughout each month and aggregates the data at the end of the month. For more information about the report, see the AWS documentation.

Software licenses

Hail 0.2 is released under the MIT License.

Architecture

Deploying this Quick Start for a new virtual private cloud (VPC) with default parameters builds the following Hail 0.2 environment in the AWS Cloud.

Figure 1. Quick Start architecture for Hail 0.2 on AWS

After the Hail Service Catalog portfolio is deployed, you can spin up notebook instances and Hail 0.2 clusters that can talk to each other through Sparkmagic and Livy.

Overview

As shown in figure 1, the Quick Start sets up the following:

  • A Hail 0.2 AWS Service Catalog portfolio, allowing you to create and manage your own Hail clusters.

  • Four AWS CodeBuild pipelines to support building various combinations of Hail 0.2.x releases, Variant Effect Predictor (VEP) versions, and Loss-Of-Function Transcript Effect Estimator (LOFTEE) plug-ins.

  • An Amazon SageMaker instance that lets you stand up and tear down JupyterLab notebook environments that integrate with Hail clusters (through Sparkmagic and Livy).

  • An Amazon EMR cluster that lets you stand up and tear down Hail 0.2 clusters as needed.

  • An Amazon Simple Storage Service (Amazon S3) SageMaker bucket to back up launched notebook environments.

  • An Amazon S3 bucket for staging Hail artifacts.

  • An optional VPC configured with a private subnet, according to AWS best practices, to provide you with your own virtual network on AWS.

Planning the deployment

Specialized knowledge

This deployment guide requires a moderate level of familiarity with AWS services. If you’re new to AWS, visit the Getting Started Resource Center and the AWS Training and Certification website. These sites provide materials for learning how to design, deploy, and operate your infrastructure and applications on the AWS Cloud.

AWS account

If you don’t already have an AWS account, create one at https://aws.amazon.com by following the on-screen instructions. Part of the sign-up process involves receiving a phone call and entering a PIN using the phone keypad.

Your AWS account is automatically signed up for all AWS services. You are charged only for the services you use.

Technical requirements

Before you launch the Quick Start, your account must be configured as specified in the following table. Otherwise, deployment might fail.

Resource limits

If necessary, request service quota increases for the following resources. You might need to request increases if your existing deployment currently uses these resources, and this Quick Start deployment could result in exceeding the default quotas. The Service Quotas console displays your usage and quotas for some aspects of some services. For more information, see the AWS documentation.

Resource                    This deployment uses

VPCs                        1*
Elastic IP addresses        2*
IAM roles                   1
ml.t3.medium instances      1 (per Hail SageMaker notebook)**
r5.xlarge, m5.xlarge        1 each (per Hail cluster)**

*The VPC is optional. If you don’t deploy the VPC, these resources aren’t used.

**These are the defaults. You can change instance types for both the notebook and cluster nodes. You can also specify more nodes in your cluster.

Supported regions

The following regions are currently supported by this Quick Start.

  • us-east-1 (N. Virginia)

  • us-east-2 (Ohio)

  • us-west-1 (N. California)

  • us-west-2 (Oregon)

  • ca-central-1 (Canada Central)

  • eu-central-1 (Frankfurt)

  • eu-west-1 (Ireland)

  • eu-west-2 (London)

  • eu-west-3 (Paris)

  • ap-southeast-1 (Singapore)

  • ap-southeast-2 (Sydney)

  • ap-south-1 (Mumbai)

  • ap-northeast-1 (Tokyo)

  • ap-northeast-2 (Seoul)

  • sa-east-1 (São Paulo)

  • eu-north-1 (Stockholm)

  • ap-east-1 (Hong Kong)

  • me-south-1 (Bahrain)

  • af-south-1 (Cape Town)

  • eu-south-1 (Milan)

Certain regions are available on an opt-in basis. Refer to the AWS Documentation on Managing Regions for more information.
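As a pre-flight check, the list above can be encoded so that a script rejects an unsupported Region before you launch the stack. The following is a minimal sketch, not part of the Quick Start. The helper name `check_region` is hypothetical, and the set of opt-in Regions is an assumption that you should confirm against the AWS documentation on managing Regions.

```python
# Regions listed as supported by this Quick Start (copied from the list above).
SUPPORTED_REGIONS = {
    "us-east-1", "us-east-2", "us-west-1", "us-west-2", "ca-central-1",
    "eu-central-1", "eu-west-1", "eu-west-2", "eu-west-3",
    "ap-southeast-1", "ap-southeast-2", "ap-south-1",
    "ap-northeast-1", "ap-northeast-2", "sa-east-1", "eu-north-1",
    "ap-east-1", "me-south-1", "af-south-1", "eu-south-1",
}

# Assumption: these Regions are disabled by default and must be enabled
# in the account before anything can be deployed there.
OPT_IN_REGIONS = {"ap-east-1", "me-south-1", "af-south-1", "eu-south-1"}


def check_region(region: str) -> str:
    """Classify a Region name against the supported list (hypothetical helper)."""
    if region not in SUPPORTED_REGIONS:
        return "unsupported"
    if region in OPT_IN_REGIONS:
        return "supported (opt-in required)"
    return "supported"
```

For example, `check_region("af-south-1")` reports that the Region is supported but must be enabled first.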

EC2 key pairs

Make sure that at least one Amazon EC2 key pair exists in your AWS account in the Region where you plan to deploy the Quick Start. Make note of the key pair name. You need it during deployment. To create a key pair, follow the instructions in the AWS documentation.

For testing or proof-of-concept purposes, we recommend creating a new key pair instead of using one that’s already being used by a production instance.

IAM permissions

Before launching the Quick Start, you must log in to the AWS Management Console with IAM permissions for the resources and actions the templates deploy.

The AdministratorAccess managed policy within IAM provides sufficient permissions, although your organization may choose to use a custom policy with more restrictions.

Deployment options

This Quick Start provides two deployment options:

  • Deploy Hail 0.2 into a new VPC (end-to-end deployment). This option builds a new AWS environment that includes the VPC, subnets, NAT gateways, security groups, bastion hosts, and other infrastructure components. It then sets a number of AWS Systems Manager Parameter Store values with the appropriate resource IDs for components such as the VPC, subnets, and roles. The Hail 0.2 Service Catalog products reference these parameters automatically when you deploy notebooks or clusters.

  • Deploy Hail 0.2 into an existing VPC. This option provisions Hail 0.2 in your existing AWS infrastructure. It sets the necessary Parameter Store values used by the Hail 0.2 Service Catalog products to point to the existing resources that you want to use.

The Quick Start provides a single template for both options. It also lets you configure Classless Inter-Domain Routing (CIDR) blocks, instance types, Hail, VEP, and Amazon EMR settings, as discussed later in this guide.

For details about operations, maintenance, and cluster management, see the readme file.

Deployment steps

Sign in to your AWS account

  1. Sign in to your AWS account at https://aws.amazon.com with an Identity and Access Management (IAM) user role that has the necessary permissions. For details, see Planning the deployment earlier in this guide.

  2. To help confirm that your AWS account is configured correctly, see the Technical requirements section.

Launch the Quick Start

You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using this Quick Start. For full details, see the pricing pages for each AWS service used by this Quick Start. Prices are subject to change.
  1. Sign in to your AWS account, and choose one of the following options to launch the AWS CloudFormation template. For help with choosing an option, see deployment options earlier in this guide.

Deploy Hail 0.2 into a new VPC on AWS

Deploy Hail 0.2 into an existing VPC on AWS

If you’re deploying Hail 0.2 into an existing VPC, make sure that your VPC has two private subnets in different Availability Zones for the workload instances, and that the subnets aren’t shared. This Quick Start doesn’t support shared subnets. These subnets require NAT gateways in their route tables, to allow the instances to download packages and software without exposing them to the internet.

Also, make sure that the domain name option in the Dynamic Host Configuration Protocol (DHCP) options is configured as explained in the Amazon VPC documentation. Provide your VPC settings when you launch the Quick Start.
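The subnet requirements above can be checked programmatically before launch. The sketch below is illustrative only: `check_private_subnets` is a hypothetical helper that inspects subnet records shaped like the entries returned by the Amazon EC2 DescribeSubnets API. Treating `MapPublicIpOnLaunch` as a sign of a public subnet is a heuristic assumption; a complete check would also inspect the route tables for a NAT gateway.

```python
from typing import Dict, List


def check_private_subnets(subnets: List[Dict], account_id: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(subnets) < 2:
        problems.append("need at least two subnets")
    else:
        zones = {s["AvailabilityZone"] for s in subnets}
        if len(zones) < 2:
            problems.append("subnets must be in different Availability Zones")
    for s in subnets:
        # A subnet owned by another account is a shared subnet.
        if s.get("OwnerId") != account_id:
            problems.append(f"{s['SubnetId']} is shared from another account")
        # Heuristic only: private subnets normally do not auto-assign public IPs.
        if s.get("MapPublicIpOnLaunch"):
            problems.append(f"{s['SubnetId']} assigns public IPs on launch")
    return problems
```

With boto3 installed, the input could come from `boto3.client("ec2").describe_subnets(SubnetIds=[...])["Subnets"]`.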

Each deployment takes about 10 minutes to complete.

  2. Check the AWS Region that’s displayed in the upper-right corner of the navigation bar, and change it if necessary. This Region is where the network infrastructure for Hail 0.2 is built. The template is launched in the us-east-1 Region by default.

  3. On the Create stack page, keep the default setting for the template URL, and then choose Next.

  4. On the Specify stack details page, change the stack name if needed. Review the parameters for the template. Provide values for the parameters that require input. For all other parameters, review the default settings and customize them as necessary.

In the following tables, parameters are listed by category and described separately for the deployment options. When you finish reviewing and customizing the parameters, choose Next.

Unless you are customizing the Quick Start templates for your own deployment projects, we recommend that you keep the default settings for the parameters labeled Quick Start S3 bucket name, Quick Start S3 bucket Region, and Quick Start S3 key prefix. Changing these parameter settings automatically updates code references to point to a new Quick Start location. For more information, see the AWS Quick Start Contributor’s Guide.

Launch into an existing VPC

Table 1. AWS Quick Start Configuration
Parameter label (name) Default value Description

Quick Start S3 bucket name (QSS3BucketName)

aws-quickstart

S3 bucket name for the Quick Start assets. Quick Start bucket name can include numbers, lowercase letters, uppercase letters, and hyphens (-). It cannot start or end with a hyphen (-).

Quick Start S3 key prefix (QSS3KeyPrefix)

quickstart-hail/

S3 key prefix for the Quick Start assets. Quick Start key prefix can include numbers, lowercase letters, uppercase letters, hyphens (-), and forward slash (/).

Quick Start S3 bucket region (QSS3BucketRegion)

us-east-1

The AWS Region where the Quick Start S3 bucket (QSS3BucketName) is hosted. When using your own bucket, you must specify this value.
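If you point the Quick Start at your own bucket, the character rules described in Table 1 can be checked before you launch. The sketch below encodes only the rules stated in the table; the function names and regular expressions are our own, not taken from the template, and Amazon S3 itself applies additional naming rules that are not checked here.

```python
import re

# Letters, digits, and hyphens; must not start or end with a hyphen.
_BUCKET_RE = re.compile(r"[0-9a-zA-Z]([0-9a-zA-Z-]*[0-9a-zA-Z])?")
# Letters, digits, hyphens, and forward slashes.
_PREFIX_RE = re.compile(r"[0-9a-zA-Z/-]*")


def is_valid_bucket_name(name: str) -> bool:
    """Check a value for QSS3BucketName against the rules in Table 1."""
    return _BUCKET_RE.fullmatch(name) is not None


def is_valid_key_prefix(prefix: str) -> bool:
    """Check a value for QSS3KeyPrefix against the rules in Table 1."""
    return _PREFIX_RE.fullmatch(prefix) is not None
```

Both defaults pass: `is_valid_bucket_name("aws-quickstart")` and `is_valid_key_prefix("quickstart-hail/")` return `True`.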

Table 2. Network Settings
Parameter label (name) Default value Description

Existing VPC ID (pVpcId)

Requires input

Required - SageMaker security group is created in this VPC.

Existing Subnet ID (pSubnetId)

Requires input

Required for existing VPC target. Subnet for EMR Cluster and SageMaker Notebook Instances. Must reside in the existing VPC.

Existing Subnet Type (pSubnetType)

private

Required for existing VPC target. Public subnets deploy resources with public IPs. Private subnets do not. Private subnets are recommended.

Table 3. Hail Settings
Parameter label (name) Default value Description

Hail Bucket Name (pHailBucket)

Requires input

EMR logs, cluster manifests, and VEP configuration files are placed here.

Create Hail Bucket (pCreateHailBucket)

yes

Select No to use an existing bucket.

Sagemaker Home Directory Bucket Name (pSageMakerBucket)

Requires input

Bucket for common Jupyter notebooks and SageMaker home directory backups.

Create SageMaker Bucket (pCreateSageMakerBucket)

yes

Select No to use an existing bucket.

EBS KMS Key ARN (pKmsEbsArn)

Requires input

Optional - if the source AMI is encrypted specify the full key ARN. Otherwise, leave blank. This does NOT automatically enable EBS encryption.

Table 4. Tagging
Parameter label (name) Default value Description

Environment Tag (pTagEnvironment)

development

Environment type for default resource tagging.

Owner Tag (pTagOwner)

Requires input

Optional - Owner of the resources. Person/Department, etc.

Launch into a new VPC

Table 5. Network Configuration
Parameter label (name) Default value Description

VPC target (VPCTarget)

existing

Choose "new" to use the AWS VPC Quick Start to create a new VPC with three public and three private subnets. If you choose "existing", VPCId and SubnetId network parameters are required.

New VPC CIDR (VPCCIDR)

10.0.0.0/16

Required for a new VPC. A /16 address space is recommended for a new VPC.

Existing VPC ID (VPCId)

Requires input

Required for existing VPC target.

Existing subnet ID (VPCSubnetId)

Requires input

Required for an existing VPC target. Subnet ID in the existing VPC in which EMR Cluster and SageMaker Notebook Instances will be launched. A private subnet is recommended.

Existing subnet type (VPCSubnetType)

private

Required for an existing VPC target. The type of subnet in which compute resources are deployed and custom AMIs are created. It should match the type of the subnet specified by the existing subnet ID. A private subnet is recommended.

Table 6. Hail Configuration
Parameter label (name) Default value Description

Hail S3 bucket name (HailS3BucketName)

Requires input

The name of the S3 bucket that stores EMR logs, cluster manifests, and VEP configuration files.

Create Hail S3 bucket (CreateHailBucket)

yes

Select No to use an existing bucket.

Sagemaker home directory S3 bucket name (SageMakerS3BucketName)

Requires input

The name of the S3 bucket for common notebooks and SageMaker home directory backups.

Create SageMaker home directory S3 bucket (CreateSageMakerBucket)

yes

Select No to use an existing bucket.

EBS KMS Key ARN (KMSEbsKeyArn)

Requires input

(Optional) The full KMS Key ARN if Region level EBS encryption is enabled. This does not automatically encrypt your AMI.

Table 7. Tagging Configuration
Parameter label (name) Default value Description

Environment type (TagEnvironmentType)

development

Environment type for default resource tagging.

Owner (TagOwner)

Requires input

(Optional) Owner for default resource tagging. Suggested values are <User Name>, <Department Name>, <Project Name>, etc.

Table 8. AWS Quick Start Configuration
Parameter label (name) Default value Description

Quick Start S3 bucket name (QSS3BucketName)

aws-quickstart

S3 bucket name for the Quick Start assets. Quick Start bucket name can include numbers, lowercase letters, uppercase letters, and hyphens (-). It cannot start or end with a hyphen (-).

Quick Start S3 key prefix (QSS3KeyPrefix)

quickstart-hail/

S3 key prefix for the Quick Start assets. Quick Start key prefix can include numbers, lowercase letters, uppercase letters, hyphens (-), and forward slash (/).

Region of Quickstart bucket (QSS3BucketRegion)

us-east-1

The AWS Region where the Quick Start S3 bucket (QSS3BucketName) is hosted.

  5. On the options page, you can specify tags (key-value pairs) for resources in your stack and set advanced options. When you’re done, choose Next.

  6. On the Review page, review and confirm the template settings. Under Capabilities, select the two check boxes to acknowledge that the template creates IAM resources and might require the ability to automatically expand macros.

  7. Choose Create stack to deploy the stack.

  8. Monitor the status of the stack. When the status is CREATE_COMPLETE, the Hail 0.2 deployment is ready.

  9. Use the values displayed in the Outputs tab for the stack, as shown in figure 2, to view the created resources.

Figure 2. Hail 0.2 outputs after successful deployment
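The console steps above can also be scripted. The following is a rough sketch under several assumptions: it uses the boto3 SDK, the parameter keys are the names shown in parentheses in Tables 5 through 8, the template URL must be supplied by you (this guide does not state it), and the capability list is our guess at what the two check boxes on the Review page correspond to. The function names are hypothetical.

```python
from typing import Dict, List


def build_parameters(values: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert {name: value} into the list format CloudFormation expects."""
    # Per Table 5, an "existing" VPC target needs the VPC and subnet IDs.
    if values.get("VPCTarget", "existing") == "existing":
        missing = [k for k in ("VPCId", "VPCSubnetId") if not values.get(k)]
        if missing:
            raise ValueError(f"missing required parameters: {missing}")
    return [{"ParameterKey": k, "ParameterValue": str(v)} for k, v in values.items()]


def outputs_to_dict(stack: Dict) -> Dict[str, str]:
    """Flatten the Outputs list of a described stack into a dictionary."""
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


def deploy(stack_name: str, template_url: str, values: Dict[str, str]) -> Dict[str, str]:
    """Create the stack, wait for CREATE_COMPLETE, and return its outputs."""
    import boto3  # imported here so the helpers above work without the SDK

    cfn = boto3.client("cloudformation")
    cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,  # supplied by the caller; not given in this guide
        Parameters=build_parameters(values),
        # Assumption: IAM resources and macro expansion, as acknowledged in the console.
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
    )
    # Blocks until the stack reaches CREATE_COMPLETE; raises if creation fails.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    return outputs_to_dict(stack)
```

A call such as `deploy("hail", template_url, {"VPCTarget": "new", "HailS3BucketName": "my-hail-bucket", "SageMakerS3BucketName": "my-hail-notebooks"})` would return the same key-value pairs that the Outputs tab displays; the bucket names here are placeholders.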

Post deployment steps

Grant permissions to the Hail 0.2 Service Catalog Portfolio

  1. Open the AWS Service Catalog console and choose the Hail Products portfolio.

  2. Choose the Groups, roles, and users (1) tab, and then choose the Add groups, roles, users button.

  3. Choose the IAM users, roles, or groups that you want to allow to launch Hail 0.2 clusters or notebooks. Even though you deployed the portfolio, you do not have access to it by default, so you must explicitly grant yourself permissions in the same way.
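The same grant can be made through the AWS SDK. This is a minimal sketch that assumes boto3 and assumes the portfolio's display name is "Hail Products", as shown in step 1; the helper names are hypothetical, and only the first page of portfolio results is searched.

```python
from typing import Dict, List, Optional


def find_portfolio_id(portfolio_details: List[Dict],
                      display_name: str = "Hail Products") -> Optional[str]:
    """Return the ID of the portfolio with the given display name, if present."""
    for portfolio in portfolio_details:
        if portfolio.get("DisplayName") == display_name:
            return portfolio["Id"]
    return None


def grant_access(principal_arn: str) -> None:
    """Associate an IAM user, role, or group ARN with the Hail portfolio."""
    import boto3  # imported here so find_portfolio_id works without the SDK

    sc = boto3.client("servicecatalog")
    # Only the first page of results is inspected in this sketch.
    details = sc.list_portfolios()["PortfolioDetails"]
    portfolio_id = find_portfolio_id(details)
    if portfolio_id is None:
        raise LookupError("Hail portfolio not found")
    sc.associate_principal_with_portfolio(
        PortfolioId=portfolio_id,
        PrincipalARN=principal_arn,
        PrincipalType="IAM",
    )
```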

Launch a Hail cluster

  1. Return to the Service Catalog and choose the Products link at the top of the left navigation pane. Non-administrators who visit the Service Catalog in the console are taken directly to this location.

  2. Select the stacked dots on the Hail 0.2 EMR Cluster product and choose Launch Product.

  3. Provide a name for the Hail 0.2 cluster and choose Next to see the following Hail 0.2 cluster options.

Table 9. Cluster Primary Settings
Parameter label Default Description

Cluster Name

Requires input

Name of the EMR cluster

Hail AMI

''

(Optional) Custom AMI, specific version from the public AMI list, or empty value. If empty, the latest public Hail with VEP AMI is used.

EMR Release

emr-5.29.0

Amazon EMR release version to use for cluster nodes.

Root Volume Size

100

Root volume size in GB for all cluster instances.

EBS KMS Key ARN

''

(Optional) The full KMS Key ARN if region level EBS encryption is enabled. Note, this does NOT automatically encrypt your AMI.

Instance Termination Protection

false

Choose True to enable instance termination protection on Master and Core nodes of the cluster.

Allow SSM Shell Access from SageMaker Notebook Instances

false

Choose True to allow SSM Shell access to cluster nodes from SageMaker Notebook instances. To fully enable the setting, select True when creating SageMaker notebook instances.

Table 10. Master Instance Settings
Parameter label Default Description

Master Node Size

m5.xlarge

Instance type to use for EMR master node

Table 11. Core Instance Settings
Parameter label Default Description

Number of Core Nodes

1

Number of core nodes to launch with the cluster. Must be >= 1.

Core Instance Size

r5.xlarge

Instance type to use for EMR core nodes

Scratch Volume Size

100

Secondary GP2 data volume size in GB for CORE nodes. Available on /mnt

Table 12. Auto Scaling Task Node Settings
Parameter label Default Description

Market

ON_DEMAND

Select “SPOT” to use Spot Instances for task nodes. Spot Instances are requested with a maximum price equal to the On-Demand price.

Minimum number of Task Nodes

1

Value of 0 disables task nodes and auto scaling.

Maximum number of Task Nodes

1

Must be equal to or greater than minimum.

Task Node Size

r5.large

Instance type to use for EMR task nodes

Table 13. Tagging
Parameter label Default Description

Environment Tag

development

Environment type for default resource tagging.

Owner Tag

''

(Optional) - Owner of the resources. Person/Department, etc.

Proceed through the Service Catalog wizard, accepting default values for the Tag Options, Notifications, and Review phases of the wizard. Then choose Launch.

The cluster is provisioned, and the status changes to succeeded when complete.
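The task node rules in Table 12 interact: a minimum of 0 turns off task nodes and auto scaling, and the maximum can never be lower than the minimum. The sketch below spells out those rules as stated in the table. The function name and return convention are hypothetical and are not part of the Service Catalog product.

```python
def validate_task_nodes(minimum: int, maximum: int, market: str = "ON_DEMAND") -> bool:
    """Validate task node settings.

    Returns True if task nodes and auto scaling are enabled, False if the
    settings disable them, and raises ValueError if the settings are invalid.
    """
    if market not in ("ON_DEMAND", "SPOT"):
        raise ValueError("market must be ON_DEMAND or SPOT")
    if minimum < 0:
        raise ValueError("minimum number of task nodes cannot be negative")
    if maximum < minimum:
        raise ValueError("maximum must be equal to or greater than minimum")
    # A minimum of 0 disables task nodes and auto scaling.
    return minimum > 0
```

With the defaults (minimum 1, maximum 1), the function returns `True`, which corresponds to a single task node with no room to scale out.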

Launch a Hail Notebook

  1. Go back to the Service Catalog’s product list page, and choose Hail SageMaker Notebook Instance > Launch product.

  2. Provide a name for your Service Catalog product launch in the Product version phase of the launch wizard.

  3. Supply the following parameters when you reach the Parameters phase.

Table 14. Instance Details
Parameter label Default Description

Instance Name

Requires input

Used as the name of the notebook instance and S3 backup location. A user name is recommended (for example, jsmith).

Instance Type

ml.t3.medium

Instance type to use for the notebook instance

Volume Size

20

Size in GB of the EBS volume used by the notebook instance

Allow SSM Shell Access to EMR Nodes

false

Choose True to allow SSM Shell access to cluster nodes from SageMaker notebook instances. To be fully enabled, set this setting to True when creating an EMR cluster.

Table 15. Tagging
Parameter label Default Description

Environment Tag

development

Environment type for default resource tagging.

Owner Tag

''

(Optional) - Owner of the resources. Person/Department, etc.

Proceed through the Tag Options, Notifications, and Review phases of the wizard, accept the default settings, and choose Launch to create the notebook instance.

Open the notebook, connect to the cluster, and conquer

  1. Go to SageMaker in the web console, and choose Notebook instances in the left navigation pane. You will see a notebook instance with the name that you specified when you launched the notebook from the Service Catalog.

  2. Choose that notebook instance and then choose the Open JupyterLab hyperlink.

    • After the notebook launches, navigate to the common-notebooks folder to see example notebooks that show how to connect to the EMR cluster and begin your Hail 0.2 session.
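For orientation, the notebook reaches the cluster through Livy's REST interface, which Sparkmagic drives on your behalf. The sketch below shows roughly what those requests look like. It is not taken from the example notebooks: the helper names are hypothetical, port 8998 is Livy's default and is assumed to be unchanged on the cluster, and the Hail snippet assumes the remote PySpark session exposes a SparkContext named `sc`.

```python
import json
import urllib.request
from typing import Tuple

LIVY_PORT = 8998  # Livy's default port (assumed unchanged on the cluster)

# Code sent to the cluster; it runs remotely, where Hail is installed.
HAIL_HELLO = "import hail as hl\nhl.init(sc)\n"


def new_session_request(master_dns: str) -> Tuple[str, bytes]:
    """Build the URL and JSON body that ask Livy for a new PySpark session."""
    url = f"http://{master_dns}:{LIVY_PORT}/sessions"
    return url, json.dumps({"kind": "pyspark"}).encode("utf-8")


def statement_request(master_dns: str, session_id: int, code: str) -> Tuple[str, bytes]:
    """Build the URL and JSON body that submit code to an existing session."""
    url = f"http://{master_dns}:{LIVY_PORT}/sessions/{session_id}/statements"
    return url, json.dumps({"code": code}).encode("utf-8")


def post(url: str, body: bytes) -> dict:
    """Send a JSON POST request and decode the JSON response."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

In practice you rarely issue these calls yourself; running a cell in a Sparkmagic kernel has the same effect.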

Best practices for using Hail 0.2 on AWS

Although you can work with Hail on EMR through the notebook, you might also want to access and explore the cluster hosts directly. If you need to access the hosts, make sure that you set Allow SSM Shell Access from SageMaker Notebook Instances to true when you launch the cluster through the Service Catalog. To start sessions from a notebook instance, also set Allow SSM Shell Access to EMR Nodes to true when you launch that notebook instance.

You can then start SSH sessions on those nodes by using the Start Session feature in the AWS Systems Manager Session Manager console. You can also start a session from your local machine using the AWS Command Line Interface (AWS CLI) or a notebook’s JupyterLab console, as demonstrated below:

SSH
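As a rough illustration of the AWS CLI route, the sketch below looks up the master node of a cluster and opens a Session Manager shell on it. It assumes boto3, the AWS CLI, and the Session Manager plugin are installed locally and that SSM shell access was enabled at launch; the function names are hypothetical.

```python
from typing import Dict, List


def master_instance_ids(list_instances_response: Dict) -> List[str]:
    """Extract EC2 instance IDs from an EMR ListInstances response."""
    return [i["Ec2InstanceId"] for i in list_instances_response.get("Instances", [])]


def start_session_command(instance_id: str) -> List[str]:
    """Build the AWS CLI command that opens a Session Manager shell."""
    return ["aws", "ssm", "start-session", "--target", instance_id]


def open_shell_on_master(cluster_id: str) -> None:
    """Find the cluster's master node and start an interactive session on it."""
    import subprocess

    import boto3  # imported here so the helpers above work without the SDK

    emr = boto3.client("emr")
    response = emr.list_instances(ClusterId=cluster_id, InstanceGroupTypes=["MASTER"])
    ids = master_instance_ids(response)
    if not ids:
        raise LookupError(f"no master instance found for cluster {cluster_id}")
    subprocess.run(start_session_command(ids[0]), check=True)
```

The equivalent one-off command is `aws ssm start-session --target <instance-id>`, where the instance ID is that of the node you want to reach.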

FAQ

Q. I encountered a CREATE_FAILED error when I launched the Quick Start.

A. If AWS CloudFormation fails to create the stack, relaunch the template with Rollback on failure set to Disabled. (This setting is on the Options page under Advanced in the AWS CloudFormation console.) With this setting, the stack’s state is retained and the instance is left running, so you can troubleshoot the issue. (For Windows, look at the log files in %ProgramFiles%\Amazon\EC2ConfigService and C:\cfn\log.)

When you set Rollback on failure to Disabled, you continue to incur AWS charges for this stack. Make sure to delete the stack when you finish troubleshooting.

For additional information, see Troubleshooting AWS CloudFormation on the AWS website.

Q. I encountered a size limitation error when I deployed the AWS CloudFormation templates.

A. Launch the Quick Start templates from the links in this guide or from another S3 bucket. If you deploy the templates from a local copy on your computer or from a location other than an S3 bucket, you might encounter template size limitations. For more information about AWS CloudFormation quotas, see the AWS documentation.

Send us feedback

To post feedback, submit feature ideas, or report bugs, use the Issues section of the GitHub repository for this Quick Start. If you’d like to submit code, please review the Quick Start Contributor’s Guide.

Quick Start reference deployments

GitHub repository

You can visit our GitHub repository to download the templates and scripts for this Quick Start, to post your comments, and to share your customizations with others.


© 2020, Amazon Web Services Inc., or its affiliates, and Privo. All rights reserved.

Notices

This document is provided for informational purposes only. It represents AWS’s current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether expressed or implied. This document does not create any warranties, representations, contractual commitments, conditions, or assurances from AWS, its affiliates, suppliers, or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

The software included with this paper is licensed under the Apache License, version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the accompanying "license" file. This code is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either expressed or implied. See the License for specific language governing permissions and limitations.