Amazon AWS Certified Database Specialty – Amazon Redshift Part 1
June 17, 2023

1. Redshift overview

All right, in this section we are going to talk about Redshift. Redshift is a data warehousing solution, or a solution for analytical workloads. All right, so let's begin. Redshift is an OLAP database, or online analytical processing database, or a data warehousing solution provided by AWS. It's based on PostgreSQL. Redshift allows you to store and query petabytes of data, both structured and semi-structured, across your data warehouse and your data lake, and you can query it using standard SQL. Redshift claims to provide 10x the performance of other competing data warehouses. And Redshift is a columnar data storage solution, so it stores data in columns instead of rows.

And this makes Redshift efficient at computing aggregation queries. It uses massively parallel processing, or MPP, for query execution, and it's a highly available solution as well. And as I mentioned earlier, we use a SQL interface for performing queries on the Redshift database. And Redshift integrates well with analytical tools like Amazon QuickSight or Tableau. And to load data into Redshift, you can use S3, Kinesis Firehose, DynamoDB, DMS, and so on. A Redshift cluster can contain anywhere from 1 to about 128 nodes, and each node holds about 160 GB of data. You can provision multiple nodes, but only within a single AZ. Remember, Redshift is not a multi-AZ solution.
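To make the columnar storage point concrete, here is a hypothetical aggregation query (the table and column names are made up purely for illustration). Because Redshift stores each column separately, a query like this only has to scan the two columns it touches instead of reading every row in full:

-- Hypothetical example: a columnar store only scans the columns the query needs
SELECT product_category,
       SUM(sale_amount) AS total_sales
FROM sales_facts
GROUP BY product_category
ORDER BY total_sales DESC;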

The Redshift architecture has one leader node, which is used for query planning and for result aggregation, which means you communicate with Redshift through the leader node. And then there are up to 128 compute nodes. So the compute nodes are the ones that gather data from the database and send it back to the leader node, and then the leader node aggregates the results and sends them back to the calling application. Then, just like any other database service, we have backup and restore features, security with VPC, IAM, and KMS, and monitoring with CloudWatch.

Then we also have something called Redshift enhanced VPC routing. You can use this to route the COPY and UNLOAD traffic through your VPC. And we are going to talk about this later in this section, so let's park it for now. And remember that Redshift is a provisioned solution. It's worth it when you have sustained usage. Otherwise, if you only have sporadic queries, or if you're just going to need analytics once in a blue moon, then you can definitely consider using Athena, which is a low-cost solution as compared to Redshift. All right, so now let's go ahead and create a Redshift cluster.

2. Creating a Redshift cluster – Hands on

All right. In this lecture, let's create our first Redshift cluster. Before we can create our Redshift cluster, we have to create an appropriate IAM role. So here I am in the IAM dashboard: click on Roles from the left side menu and then choose to create a new role. Then make sure AWS service is selected, and you have to look for the Redshift service here. Here it is. And then we choose the Redshift – Customizable use case and continue. Now here we can attach appropriate policies. I'm going to attach Redshift full access, as this is just a demo. And we also need to provide access to S3, as we'll be loading data from S3. So I'm going to attach the S3 read-only access policy here.

Later in the section, we are also going to query S3 data using Spectrum, and Redshift Spectrum integrates with Amazon Athena and AWS Glue. We are also going to add those permissions right now, so we don't have to come back to IAM again. So I'm going to add Athena full access and Glue full access permissions here, and continue. So we have chosen these four policies. Give this role a name, let's say my redshift role, okay? And create the role. So here we have our role with these four policies. All right, now we can go to Redshift. So here I am in the Oregon region, us-west-2, and I'm going to create a new cluster. So here we have the cluster identifier. I'm going to stick to the default name.

Then under node type, you can choose either RA3 nodes or DC2 nodes. Since this is just a demo, I'm going to go with the smallest node size available here, so that's dc2.large. We're going to use two nodes, so we have a multi-node cluster. And do pay attention to the estimated pricing here. If you're using a new AWS account, or if you have never used Redshift before, then you might be eligible for the free tier. Otherwise, you should pay attention to the pricing mentioned here. All right, then here we have the database name and the database port. I'm going to stick to the defaults, then provide a user password. Further, under cluster permissions, we choose the IAM role: select the role that we just created and add it.

Then under Additional Configuration, I'm not going to stick to the defaults, so I'll flip this toggle off. And then under Network Security, you can choose the VPC you want to place the cluster in. I'm going to stick to the default VPC as well as the default security group. Just ensure that the security group that you use here allows inbound connections on the Redshift port, that is, port 5439. All right. And then we want to make this instance, or this cluster, publicly accessible, so I'm going to choose yes here. And we do not need any Elastic IP addresses here. Ideally, you would not want to make your cluster publicly accessible, but since we want to connect to this cluster from our own computer, I'm going to set this to yes.

All right, then the database configurations, maintenance windows, monitoring options, and backup options: we can leave them all at the default values and create our cluster. And now we can see that the cluster is being created. And while the cluster is creating, I'm going to show you the SQL client that I'm going to use for this demo. What I'm going to use is Aginity Pro. Open the Aginity Pro website, and you can download the Aginity Pro client from here and install it on your computer or on your Mac. I already have this installed on my computer, so I'm not going to do this, and I'm going to close this.

And when you open Aginity Pro, this is how it looks. We can add connections from here, or you can also add connections from the File menu using the Edit Connections option. Okay, so once our cluster is ready, we'll come back to this place and add our server here. All right, the Redshift cluster is still being created, so I'm going to pause this video here and come back once the cluster is available. All right, now we can see that our Redshift cluster is available. So let's click through it and go to the Properties tab. And here we can see the endpoint that we can use to connect to our cluster. So let's copy that and let's go to Aginity Pro. Here we can add a new connection using Edit Connections.

Click this plus icon and choose Redshift. Let's name the connection, say Redshift. And now under host, we simply paste in the endpoint that we copied and remove the port and the database name. This is the host. Then we have the port. We're not using SSL, and we'll use standard authentication. The user will be awsuser. So if you go back to the console, you can see that the master username we created the cluster with is awsuser. All right? And then also provide the password that you used, and the database name. You can leave this empty since we are using the default database name; otherwise, you can provide the database name like this. All right? And before we can save this, we also have to provide the JDBC driver for Redshift.

And there is nothing installed here, so you can download it from here. Okay, just click this link to download the driver files. And now that the driver is showing up here, we can test this connection. The connection is successful, so we save this and we can close. Then from the database explorer, we can access our Redshift cluster. All right, you can see that we have our dev database here. Of course it's going to be empty. So we are able to connect to our cluster; if you want, you can run the quick sanity-check query shown below. And in the next lecture, we are going to load data into this database. All right, so let's continue to the next lecture.
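A minimal sanity check from the SQL editor looks like this; these are standard PostgreSQL-style functions that Redshift supports, so no tables are needed yet:

-- Confirm which database, user, and engine version we are connected to
SELECT current_database(), current_user, version();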

3. Redshift architecture

Now that we have created a Redshift cluster, let's look at the Redshift architecture. And as I mentioned earlier, Redshift is a massively parallel columnar database that runs within a VPC. So it stores data in columns, as against rows in other relational databases. We have a single leader node and multiple compute nodes, up to 128 compute nodes. And you can connect to Redshift using JDBC or ODBC drivers for PostgreSQL, and you talk to the leader node using the SQL endpoint. When the leader node receives a query, it distributes that query job across the different compute nodes, and each compute node then partitions its portion of the job across its slices.
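If you are curious how many slices your compute nodes actually have, one way to peek (assuming your user is allowed to read the system views) is the STV_SLICES system view:

-- Each row maps a slice to the compute node that owns it
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;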

And once the computation is done, the results are sent to the leader node, which aggregates the results from all the compute nodes and returns them to the client. And this kind of architecture definitely helps Redshift perform complex query operations in the shortest possible amount of time. All right, so let's continue. So, Redshift node types. There are three types of nodes you can use with Redshift. The first one is dense compute nodes, or DC2. These can be used for compute-intensive workloads, and they come with local SSD storage. Then you have DS2, or dense storage nodes, and as the name suggests, these are for storage-intensive workloads, for example, for large data warehouses.

And these use HDDs, or hard disk drives. And the third, newer type of node that Redshift provides is RA3 nodes with managed storage. This again is for large data warehouses. It uses large local SSDs and is definitely recommended over DS2 nodes. And the way RA3 nodes with managed storage work is that they automatically offload your data to S3 if it grows beyond the local storage. So it's definitely a good idea to use RA3 nodes for your Redshift cluster. And when you use RA3 nodes, your compute and managed storage are billed separately. All right, so that's about it. Let's continue.

4. Loading data into Redshift

Now let’s look at how to load data into redshift. Now, typically, data from OLTP systems is loaded into redshift for analytical purposes, or the Bi purposes or business intelligence purposes. So you first load your data from your OLTP databases into S three and then use the Copy command in redshift to load data from S Three to your redshift cluster. And you can also load data from Kinesis Firehose. So Firehose data can be stored in S three, and then you use the Copy command to load data from S three into redshift. So what exactly is this Copy command? So, Copy command is simply a SQL statement that you use to load data stored in S Three into redshift. You can also use it to load data from other sources like DynamoDB or hive.

And when you copy data from S3, or from DynamoDB or Hive, using the COPY command, that data is persistently stored in your Redshift cluster. So remember that it is billed against your Redshift storage. So let's look at this COPY command in a little more detail. This is how the COPY command looks. In this particular example, we are copying data from the S3 bucket into the Redshift table named users, and we are providing the IAM credentials that allow Redshift to read from S3. Then we are specifying the delimiter, which is a pipe in this case, as this file is a pipe-delimited or pipe-separated file, and the region in which your S3 bucket and your cluster are.
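As a rough sketch, the statement described above would look something like this; the bucket path and the account ID in the role ARN are placeholders, not real values:

-- Load pipe-delimited data from S3 into the users table, authenticating with an IAM role
COPY users
FROM 's3://my-tickit-bucket/allusers_pipe.txt'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/myRedshiftRole'
DELIMITER '|'
REGION 'us-west-2';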

And in the credentials here, we are using the AWS IAM role. You can also use your user credentials, like the access key ID and the secret access key. And if you like, you can also use temporary security credentials. So if you're using temporary security credentials, then you provide the session token as well: along with the access key ID and secret access key, you will provide the temporary session token. So these are three different ways you can use the credentials clause here. And another way to provide your IAM role is by using the IAM_ROLE parameter, which you can use instead of the credentials clause.
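For comparison, here are sketches of the other credential styles just mentioned, again with placeholder values: plain access keys, temporary credentials with a session token, and the IAM_ROLE parameter used in place of the credentials clause:

-- Access key ID and secret access key
COPY users FROM 's3://my-tickit-bucket/allusers_pipe.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|' REGION 'us-west-2';

-- Temporary security credentials: add the session token
COPY users FROM 's3://my-tickit-bucket/allusers_pipe.txt'
CREDENTIALS 'aws_access_key_id=<temp-access-key-id>;aws_secret_access_key=<temp-secret-access-key>;token=<temp-session-token>'
DELIMITER '|' REGION 'us-west-2';

-- IAM_ROLE parameter instead of the CREDENTIALS clause
COPY users FROM 's3://my-tickit-bucket/allusers_pipe.txt'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
DELIMITER '|' REGION 'us-west-2';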

So there are different ways you can use this command; what I have shown here is just one of them. All right, to use the COPY command, we first create an IAM role. And if you don't have a Redshift cluster, you create that as well. You attach the IAM role to the cluster, the cluster can then assume the IAM role on your behalf, and then you use the COPY command to load the data from S3. So why don't we go ahead into a hands-on session and try this out ourselves? So let's do a quick hands-on to load data from S3 into Redshift.

5. Loading data from S3 into Redshift – Hands on

In this lecture, let’s look at how to load data into your redshift cluster from an S Three bucket. So for learning purposes, AWS provides us with some sample redshift data set. And you can download it from the AWS documentation website. Right, so let’s look for redshift getting started, right, and click on the very first link. And here, if you look at the step six load sample data, this is going to provide us with some sample data set. All right? So if you scroll down, you will find that you can download a ticket DB file here, and you can load this file into your S Three bucket.

And alternatively, AWS also hosts this data in its own S3 bucket, which is called awssampledbuswest2. So you can use either of these approaches; I'm going to simply use the data hosted by AWS. All right, so let's go to Aginity Pro. And here we can create a new tab to make requests. All right? So in here, we first have to create the tables so we can load data into them. So let's go back to the AWS documentation, and it provides us with the schema of the tables to load this sample data. So what we are going to do is copy this over, copy all the CREATE TABLE statements, and paste them into Aginity Pro. And then we can execute these statements one by one.
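For reference, the CREATE TABLE statement for the users table in that guide looks roughly like this; the column list is abbreviated here, so use the full statement from the documentation when you follow along:

-- Abbreviated sketch of the TICKIT users table definition
CREATE TABLE users (
    userid    INTEGER NOT NULL DISTKEY SORTKEY,
    username  CHAR(8),
    firstname VARCHAR(30),
    lastname  VARCHAR(30),
    city      VARCHAR(30),
    state     CHAR(2),
    email     VARCHAR(100),
    phone     CHAR(14)
    -- ...remaining columns omitted; copy the full statement from the docs
);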

So make sure you choose your Redshift cluster and the database, then select the statement you want to execute and click on this button to run it. And we can see that the query is successful. Similarly, I'm going to run all the other queries. All right, so all our tables have been created. We can refresh the dev database, and now if you expand public, you can see that we have all these tables created for us. All right, so now we have the schema ready, and we can load data from S3 using the COPY command. So let's go back to the AWS documentation and copy these COPY commands. So I'll open a new tab here and paste in the COPY commands.

And you have to change the bucket name here. The bucket name that we're going to use is awssampledbuswest2; this is the bucket that hosts all the sample data, and I'm going to replace it everywhere, right? And the AWS region that we are in is Oregon, which is us-west-2. So I'm going to change the AWS region here to us-west-2 and similarly replicate it in all the statements. And then here we have to provide the IAM role ARN of the IAM role that we want to use for this purpose. So I'm going to go back to IAM and copy the ARN of the My redshift role. So copy the role ARN, and now what I'm going to do is paste in the IAM role ARN here.

It looks like some autocorrect has happened here; it should actually be an equals sign followed by the IAM role ARN. So I'm going to copy this and replicate it in all the commands. And now we can run all the COPY commands one by one, and we can see that 49,990 records have been loaded successfully into the users table. All right, let's run the remaining commands as well. All right, so now that we have loaded all the records, let's go ahead and try to run some queries on the data. So let's go back to the AWS documentation. And here we have some sample queries. I'm going to copy them over into a new tab, and then let's try to run these queries.
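The queries I'm pasting in look roughly like this; they are adapted from the same getting-started material, and the date value is just whichever sample date you pick:

-- Table definition of the sales table, including distkey and sortkey flags
SELECT * FROM pg_table_def WHERE tablename = 'sales';

-- Total quantity sold on a given date
SELECT SUM(qtysold)
FROM sales, date
WHERE sales.dateid = date.dateid
  AND caldate = '2008-01-05';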

This query is going to give us the definition of the sales table. And here we go: the schema of the table, along with the distkey, sort key, and all the other details, is displayed here. Similarly, if we want to find the total sales on a given date, it's going to run a SUM aggregation here. So let's try that out, and we can see that it returns the sum, or the aggregated value. And you can experiment with the remaining queries in a similar fashion if you like; I'm not going to do that. The only purpose of this lecture was to demonstrate how to load data from S3 into your Redshift cluster. So I hope this lecture was useful, and let's continue to the next one.

6. More ways to load data into Redshift

Now, there are a couple more ways you can load data into Redshift, apart from loading it from S3. You can use AWS Glue. Glue is an ETL service that you can use to load data into Redshift. ETL stands for extract, transform, and load. So it typically pulls data from different source systems, then transforms that data to suit the Redshift structures or Redshift schemas, and then loads that data into Redshift. You can also use other ETL tools from APN partners; these are external tools for the same purpose. Or yet another option is to use Data Pipeline. Data Pipeline can pull data from different sources, like, for example, DynamoDB, and you can load it into Redshift.

And if you’re migrating from on premise, for example, you could use the AWS import export services like AWS Snowball. Snowball is a physical device that AWS ships to your premises. And then you can load data into snowball and ship it back to AWS. Then AWS will load that data into s three. Apart from using snowball, you can also is AWS Direct Connect, which is a private connection between your data center or your on premise data center and AWS. So these are different ways you can load your data into redshift. All right, let’s continue.
