Data Lake Foundation on the AWS Cloud

Quick Start Reference Deployment


August 2020
Dave May, AWS Quick Start team

Visit our GitHub repository for source files and to post feedback, report bugs, or submit feature ideas for this Quick Start.

This Quick Start was created by Amazon Web Services (AWS). Quick Starts are automated reference deployments that use AWS CloudFormation templates to deploy key technologies on AWS, following AWS best practices.

Overview

This Quick Start reference deployment guide provides step-by-step instructions for deploying Data Lake Foundation on the AWS Cloud.

AWS may share information about who uses its Quick Starts with the AWS Partner Network (APN) Partner that collaborated with AWS on the content of the Quick Start.

Data Lake Foundation on AWS

A data lake is a repository that holds a large amount of raw data in its native (structured or unstructured) format until the data is needed. Storing data in its native format lets you accommodate any future schema requirements or design changes.

Increasingly, customer data sources are dispersed among on-premises data centers, software-as-a-service (SaaS) providers, APN Partners, third-party data providers, and public datasets. Building a data lake on AWS offers a foundation for storing on-premises, third-party, and public datasets at low cost and high performance. A portfolio of descriptive, predictive, and real-time agile analytics built on this foundation can help answer important business questions, such as predicting customer churn and propensity to buy, detecting fraud, optimizing industrial processes, and recommending content.

This Quick Start is for developers who want to get started with AWS-native components for a data lake in the AWS Cloud. When this foundational layer is in place, you may choose to augment the data lake with tools from independent software vendors (ISVs) and SaaS providers.

The Quick Start builds a data lake foundation that integrates AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Kinesis, Amazon Athena, AWS Glue, Amazon Elasticsearch Service (Amazon ES), Amazon SageMaker, and Amazon QuickSight. The data lake foundation provides these features:

  • Data submission, including batch submissions to Amazon S3 and streaming submissions via Amazon Kinesis Data Firehose.

  • Ingest processing, including data validation, metadata extraction, and indexing via Amazon S3 events, Amazon Simple Notification Service (Amazon SNS), AWS Lambda, Amazon Kinesis Data Analytics, and Amazon ES.

  • Dataset management through Amazon Redshift transformations and Kinesis Data Analytics.

  • Data transformation, aggregation, and analysis through Amazon Athena, Amazon Redshift Spectrum, and AWS Glue.

  • Building and deploying machine learning models using Amazon SageMaker.

  • Search by indexing metadata in Amazon ES and displaying it on Kibana dashboards.

  • Publishing into an S3 bucket for use by visualization tools.

  • Visualization with Amazon QuickSight.
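As a concrete illustration of the streaming-submission feature above, the following sketch shows how one event might be encoded as a record for a Kinesis Data Firehose delivery stream. This is a hypothetical example, not part of the Quick Start templates; Firehose delivery streams commonly batch newline-delimited JSON records into S3 objects, and the field names here are invented for illustration.

```python
import json

def make_submission_record(event: dict) -> bytes:
    """Encode one event as a newline-delimited JSON record, the shape
    Kinesis Data Firehose commonly batches into S3 objects."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")

record = make_submission_record({"user_id": 42, "action": "click"})
# With boto3, this payload would typically be sent as:
#   firehose.put_record(DeliveryStreamName=stream_name, Record={"Data": record})
```

Newline-delimited records make downstream processing simpler, because each S3 object produced by Firehose can be split on line boundaries by Athena, Glue, or Lambda consumers.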

The usage model diagram in the following figure illustrates the key actors and use cases that the data lake enables, in the context of the key component areas that make up the data lake. This Quick Start provisions foundational data lake capabilities and optionally demonstrates key use cases for each type of actor in the usage model.

Figure 1. Usage model for Data Lake Foundation Quick Start

The following figure illustrates the foundational components of the data lake and how they relate to the usage model. Using your data and business flow, the components interact through recurring and repeatable data lake patterns.

Figure 2. Capabilities and components in the Data Lake Foundation Quick Start

Cost

You are responsible for the cost of the AWS services used while running this Quick Start. There is no additional cost for using the Quick Start.

The AWS CloudFormation template for this Quick Start includes configuration parameters that you can customize. Some of the settings, such as the instance type, affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you use. Prices are subject to change.

After you deploy the Quick Start, enable the AWS Cost and Usage Report to deliver billing metrics to an Amazon S3 bucket in your account. The report provides cost estimates based on usage throughout each month and aggregates the data at the end of the month. For more information about the report, see the AWS documentation.

Software licenses


Because this Quick Start uses AWS-native solution components, there are no costs or license requirements beyond AWS infrastructure costs. This Quick Start also deploys Kibana, which is an open-source tool that’s included with Amazon ES.

Architecture

Deploying this Quick Start for a new virtual private cloud (VPC) with default parameters builds the following Data Lake Foundation environment in the AWS Cloud.

Figure 3. Quick Start architecture for Data Lake Foundation on AWS

As shown in Figure 3, the Quick Start sets up the following:

  • A virtual private cloud (VPC) that spans two Availability Zones and includes two public and two private subnets.*

  • An internet gateway to allow access to the internet.*

  • In the public subnets:

    • Managed network address translation (NAT) gateways to allow outbound internet access for resources in the private subnets.*

    • Linux bastion hosts in an Auto Scaling group to allow inbound Secure Shell (SSH) access to Amazon Elastic Compute Cloud (Amazon EC2) instances in public and private subnets.*

    • Amazon Redshift and Redshift Spectrum for data aggregation, analysis, transformation, and creation of curated and published datasets.

  • In the private subnets:

    • An Amazon S3 endpoint.

    • The Data Lake wizard.

  • AWS Identity and Access Management (IAM) roles to provide permissions to access AWS resources (for example, to permit Amazon Redshift and Amazon Athena to read and write curated datasets).

  • An Amazon SageMaker instance, which you can access by using AWS authentication.

  • Integration with other Amazon services, such as Amazon S3, Amazon Athena, AWS Glue, AWS Lambda, Amazon ES with Kibana, Amazon Kinesis, and Amazon QuickSight.

* The template that deploys the Quick Start into an existing VPC skips the components marked by asterisks and prompts you for your existing VPC configuration.

The following figure shows how these components work together in a typical end-to-end process flow.

Figure 4. Data lake foundation process flow

Planning the deployment

Specialized knowledge

This deployment guide requires a moderate level of familiarity with AWS services. If you’re new to AWS, visit the Getting Started Resource Center and the AWS Training and Certification website. These sites provide materials for learning how to design, deploy, and operate your infrastructure and applications on the AWS Cloud.

Before you deploy this Quick Start, we recommend that you become familiar with the AWS services it uses, including Amazon S3, Amazon Redshift, Amazon Kinesis, Amazon Athena, AWS Glue, Amazon ES, Amazon SageMaker, and Amazon QuickSight.

AWS account

If you don’t already have an AWS account, create one at https://aws.amazon.com by following the on-screen instructions. Part of the sign-up process involves receiving a phone call and entering a PIN using the phone keypad.

Your AWS account is automatically signed up for all AWS services. You are charged only for the services you use.

Technical requirements

Before you launch the Quick Start, your account must be configured as specified in the following table. Otherwise, deployment might fail.

Resource limits

If necessary, request service quota increases for the following resources. You might need to request increases if your existing deployment currently uses these resources, and this Quick Start deployment could result in exceeding the default quotas. The Service Quotas console displays your usage and quotas for some aspects of some services. For more information, see the AWS documentation.

Resource                      This deployment uses
VPCs                          1
Elastic IP addresses          1
IAM roles                     9
Auto Scaling groups           1
EC2 instances                 1
Amazon Redshift clusters      1
Amazon SageMaker instances    1

Supported Regions

  • us-east-1 (N. Virginia)

  • us-east-2 (Ohio)

  • us-west-1 (N. California)

  • us-west-2 (Oregon)

  • ca-central-1 (Canada Central)

  • eu-central-1 (Frankfurt)

  • eu-west-1 (Ireland)

  • eu-west-2 (London)

  • eu-west-3 (Paris)

  • ap-southeast-1 (Singapore)

  • ap-southeast-2 (Sydney)

  • ap-south-1 (Mumbai)

  • ap-northeast-1 (Tokyo)

  • ap-northeast-2 (Seoul)

  • sa-east-1 (São Paulo)

  • eu-north-1 (Stockholm)

  • ap-east-1 (Hong Kong)

  • me-south-1 (Bahrain)

Certain Regions, such as ap-east-1 (Hong Kong) and me-south-1 (Bahrain), are available on an opt-in basis. For more information, see the AWS documentation on managing Regions.

IAM permissions

Before launching the Quick Start, you must log in to the AWS Management Console with IAM permissions for the resources and actions the templates deploy.

The AdministratorAccess managed policy within IAM provides sufficient permissions, although your organization may choose to use a custom policy with more restrictions.

Deployment options

This Quick Start provides two deployment options:

  • Deploy Data Lake Foundation into a new VPC (end-to-end deployment). This option builds a new AWS environment consisting of the VPC, subnets, NAT gateways, security groups, bastion hosts, and other infrastructure components. It then deploys Data Lake Foundation into this new VPC.

  • Deploy Data Lake Foundation into an existing VPC. This option provisions Data Lake Foundation in your existing AWS infrastructure.

The Quick Start provides separate templates for these options. It also lets you configure Classless Inter-Domain Routing (CIDR) blocks, instance types, and Data Lake Foundation settings, as discussed later in this guide.

Deployment steps

Sign in to your AWS account

  1. Sign in to your AWS account at https://aws.amazon.com with an IAM user or role that has the necessary permissions. For details, see Planning the deployment, earlier in this guide.

  2. Ensure that your AWS account is configured correctly, as discussed in the Technical requirements section.

Launch the Quick Start

You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using this Quick Start. For full details, see the pricing pages for each AWS service you use. Prices are subject to change.

  1. Sign in to your AWS account, and choose one of the following options to launch the AWS CloudFormation template. For help with choosing an option, see Deployment options, earlier in this guide.

Deploy Data Lake Foundation into a new VPC on AWS

Deploy Data Lake Foundation into an existing VPC on AWS

If you’re deploying Data Lake Foundation into an existing VPC, ensure that your VPC has two private subnets in different Availability Zones for the workload instances and that the subnets aren’t shared. This Quick Start doesn’t support shared subnets. To allow instances to download packages and software without exposing them to the internet, the subnets require NAT gateways in their route tables.

Also, ensure that the domain name in the DHCP options is configured, as explained in the Amazon VPC documentation. Provide your VPC settings when you launch the Quick Start.

Each deployment takes about 30 minutes to complete.

  2. Check the AWS Region that’s displayed in the upper-right corner of the navigation bar, and change it if necessary. This is where the network infrastructure for Data Lake Foundation is built. The template is launched in the us-east-1 Region by default.

  3. On the Create stack page, keep the default setting for the template URL, and then choose Next.

  4. On the Specify stack details page, change the stack name, if needed, and review the parameters for the template. Provide values for the parameters that require input. For all other parameters, review the default settings, and customize them as necessary.

In the following tables, parameters are listed by category and described separately for the deployment options. When you finish reviewing and customizing the parameters, choose Next.

Unless you are customizing the Quick Start templates for your own deployment projects, we recommend that you keep the default settings for the parameters labeled Quick Start S3 bucket name, Quick Start S3 bucket Region, and Quick Start S3 key prefix. Changing these parameter settings automatically updates code references to point to a new Quick Start location. For more information, see the AWS Quick Start Contributor’s Guide.

  5. On the Options page, you can specify tags (key-value pairs) for resources in your stack and set advanced options. When you’re done, choose Next.

  6. On the Review page, review and confirm the template settings. Under Capabilities, select the two check boxes to acknowledge that the template creates IAM resources and might require the ability to automatically expand macros.

  7. Choose Create stack to deploy the stack.

  8. Monitor the status of the stack. When the status is CREATE_COMPLETE, the Data Lake Foundation deployment is ready.

  9. To view the created resources, use the values displayed on the Outputs tab for the stack, as shown in Figure 5.

Figure 5. Data Lake Foundation outputs after successful deployment
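As an alternative to the console flow above, the stack can be launched from the AWS CLI with `aws cloudformation create-stack`, supplying parameters from a file. The sketch below is hypothetical: the parameter keys and values are illustrative and must match the ones defined in the actual template you deploy.

```json
[
  {"ParameterKey": "AvailabilityZones", "ParameterValue": "us-east-1a,us-east-1b"},
  {"ParameterKey": "KeyPairName",       "ParameterValue": "my-key-pair"},
  {"ParameterKey": "RemoteAccessCIDR",  "ParameterValue": "203.0.113.0/24"}
]
```

You would pass this file with `--parameters file://params.json` and acknowledge IAM resource creation and macro expansion with `--capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND`, mirroring the two check boxes on the console’s Review page.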

Test the deployment

Confirm the following:

  • The S3 buckets listed on the Outputs tab for the stack are available in the Amazon S3 console. The Quick Start provisions distinct S3 buckets for submissions, curated datasets, and published results.

  • If you launched the Quick Start with Enable Redshift set to yes, Amazon Redshift is accessible from the Java Database Connectivity (JDBC) endpoint specified on the Outputs tab for the stack. For the user name and password, use the values that you specified when you launched the Quick Start.

  • The Kinesis delivery stream for streaming submissions, listed on the stack’s Outputs tab, is available in the Kinesis console.

  • The Amazon Elasticsearch Service (Amazon ES) cluster listed on the Outputs tab for the stack is available in the Amazon ES console.

  • The Kibana endpoint listed on the Outputs tab is accessible from a web-browser client within the remote-access CIDR that you specified when you launched the Quick Start.
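When connecting a SQL client to the Redshift endpoint from the Outputs tab, it can help to split the JDBC URL (of the form `jdbc:redshift://host:port/database`) into its parts. The following stdlib-only sketch does that; the sample endpoint is hypothetical.

```python
import re

def parse_redshift_jdbc(url: str) -> dict:
    """Split a Redshift JDBC URL into host, port, and database name."""
    m = re.fullmatch(r"jdbc:redshift://([^:/]+):(\d+)/(\w+)", url)
    if not m:
        raise ValueError(f"not a Redshift JDBC URL: {url!r}")
    host, port, database = m.groups()
    return {"host": host, "port": int(port), "database": database}

# Hypothetical endpoint of the shape shown on the stack's Outputs tab:
parts = parse_redshift_jdbc(
    "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/quickstart"
)
```

The resulting host, port, and database values can be plugged into any client that doesn’t accept JDBC URLs directly, such as `psql` or a Python driver.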

Optional: Using your own dataset

The Data Lake Foundation provides a base for your processes. Use this infrastructure for the following:

  • Ingest batch submissions, which results in curated Amazon S3 datasets. You can then use your own SQL scripts to load curated datasets into Amazon Redshift.

  • Ingest streaming submissions from Amazon Kinesis Data Firehose.

  • Auto-discover curated datasets using AWS Glue crawlers, and transform curated datasets using AWS Glue jobs.

  • Use your own SQL queries to analyze Amazon Redshift data.

  • Analyze your data with Amazon Kinesis Data Analytics by creating your own applications that read streaming data from Kinesis Data Firehose.

  • Publish the results of analytics to the published datasets bucket.

  • Get a high-level picture of your data lake by using Amazon ES, which indexes the metadata of S3 objects.

  • Use Amazon Athena to run ad hoc analytics on your curated datasets and Amazon QuickSight to visualize the datasets in the published datasets bucket. You can also use Amazon Athena or Amazon Redshift as data sources for Amazon QuickSight.
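To make the auto-discovery step above concrete: an AWS Glue crawler samples your curated data and infers a table schema from it. The toy sketch below illustrates that idea for a small CSV sample; it is a drastic simplification (hypothetical column names, three types only), not the crawler’s actual algorithm.

```python
import csv
import io

def infer_schema(csv_text: str) -> dict:
    """Naive column-type inference over a CSV sample: int, double, or string.
    (A toy illustration of what a Glue crawler automates at scale.)"""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    schema = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(v.lstrip("-").isdigit() for v in values):
            schema[col] = "int"
        else:
            try:
                for v in values:
                    float(v)  # raises ValueError for non-numeric text
                schema[col] = "double"
            except ValueError:
                schema[col] = "string"
    return schema

sample = "id,price,name\n1,9.99,widget\n2,14.50,gadget\n"
schema = infer_schema(sample)
```

Once a crawler has populated the AWS Glue Data Catalog with such a schema, Athena and Redshift Spectrum can query the S3 data in place without loading it first.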

Figure 6. Infrastructure deployed when launching the Quick Start

Optional: Adding VPC definitions

When you launch the deployment option that creates a new VPC, the Quick Start uses VPC parameters that are mapped within the AWS CloudFormation templates. If you download the templates from the GitHub repository, you can add new VPC definitions to the mapping and choose one of the named VPC definitions when you launch the Quick Start.

The following table shows the parameters within each VPC definition. You can create as many VPC definitions as you need within your environments. When you deploy the Quick Start, use the VPC Definition parameter to specify the configuration you want to use.

Parameter             Default        Description
NumberOfAZs           2              Number of Availability Zones to use in the VPC.
PublicSubnet1CIDR     10.0.1.0/24    CIDR block for the public (DMZ) subnet 1, located in Availability Zone 1.
PrivateSubnet1CIDR    10.0.2.0/24    CIDR block for private subnet 1, located in Availability Zone 1.
PublicSubnet2CIDR     10.0.3.0/24    CIDR block for the public (DMZ) subnet 2, located in Availability Zone 2.
PrivateSubnet2CIDR    10.0.4.0/24    CIDR block for private subnet 2, located in Availability Zone 2.
VPCCIDR               10.0.0.0/16    CIDR block for the VPC.
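A custom VPC definition must keep every subnet CIDR inside the VPC CIDR, with no overlap between subnets. A small sketch using Python’s stdlib ipaddress module can sanity-check a definition before you add it to the mapping; the values below are the defaults from the table above.

```python
from ipaddress import ip_network
from itertools import combinations

vpc = ip_network("10.0.0.0/16")
subnets = {
    "PublicSubnet1CIDR":  ip_network("10.0.1.0/24"),
    "PrivateSubnet1CIDR": ip_network("10.0.2.0/24"),
    "PublicSubnet2CIDR":  ip_network("10.0.3.0/24"),
    "PrivateSubnet2CIDR": ip_network("10.0.4.0/24"),
}

def validate(vpc, subnets):
    """Every subnet must be contained in the VPC CIDR; no two may overlap."""
    for name, net in subnets.items():
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} {net} is outside the VPC CIDR {vpc}")
    for (a, na), (b, nb) in combinations(subnets.items(), 2):
        if na.overlaps(nb):
            raise ValueError(f"{a} {na} overlaps {b} {nb}")
    return True

ok = validate(vpc, subnets)
```

Running a check like this before deployment catches the most common customization mistake: a subnet CIDR that falls outside, or collides with, the address space the other parameters assume.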

FAQ

Q. I encountered a CREATE_FAILED error when I launched the Quick Start.

A. If AWS CloudFormation fails to create the stack, we recommend that you relaunch the template with Rollback on failure set to Disabled. (This setting is under Advanced in the AWS CloudFormation console, Options page.) With this setting, the stack’s state is retained, and the instance remains running so you can troubleshoot the issue. (For Windows, look at the log files in %ProgramFiles%\Amazon\EC2ConfigService and C:\cfn\log.)

When you set Rollback on failure to Disabled, you continue to incur AWS charges for the stack. Be sure to delete the stack when you finish troubleshooting.

For additional information, see Troubleshooting AWS CloudFormation on the AWS website.

Q. I encountered a size limitation error when I deployed the AWS CloudFormation templates.

A. We recommend that you launch the Quick Start templates from the links in this guide or from another S3 bucket. If you deploy the templates from a local copy on your computer or from a location other than an S3 bucket, you might encounter template size limitations. For more information about AWS CloudFormation quotas, see the AWS documentation.

Q. I deployed the Quick Start in the EU (London) Region, but it didn’t work.

A. This Quick Start includes services that aren’t supported in all Regions. See the pages for Amazon Kinesis Data Firehose, AWS Glue, Amazon SageMaker, and Amazon Redshift Spectrum on the AWS website for a list of supported Regions.

Q. Can I use the Quick Start with my own data?

A. Yes. See Optional: Using your own dataset, earlier in this guide, for ways to ingest, transform, and analyze your own batch and streaming data.

Send us feedback

To post feedback, submit feature ideas, or report bugs, use the Issues section of the GitHub repository for this Quick Start. If you’d like to submit code, please review the Quick Start Contributor’s Guide.


GitHub repository

You can visit our GitHub repository to download the templates and scripts for this Quick Start, to post your comments, and to share your customizations with others.


© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Notices

This document is provided for informational purposes only. It represents AWS’s current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether expressed or implied. This document does not create any warranties, representations, contractual commitments, conditions, or assurances from AWS, its affiliates, suppliers, or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

The software included with this paper is licensed under the Apache License, version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the accompanying "license" file. This code is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either expressed or implied. See the License for specific language governing permissions and limitations.