Optimizing Redshift for Sigma
This article details best practices for ensuring optimal performance when using Sigma on top of Redshift. The procedures and functions described below are focused on the easiest ways to achieve the greatest lift when it comes to performance improvement/optimization. Ofcourse, a data schema/structure properly tuned for your use cases is always suggested. Since data tuning and optimization procedures such as: setting correct distribution and sorting keys, table statistics, and column compression/encoding may require in-depth work from a data or redshift specialist in your organization and the specifics of the tuning are use case dependent we will consider those as additional optimization techniques and will be covered as secondary/use case specific techniques towards the end of this article
The sections below are ordered in terms of possible greatest lift to performance and it is suggested that you follow implementing the suggestions below starting at the top and going down the list.
An Amazon Redshift instance is comprised of nodes which are its individual computing resources and are organized into clusters. When creating a new Redshift instance be sure to choose the correct node type. The three node types are DC(Dense Compute) and DS(Dense Storage), and RA3. DC2 and RA3 nodes use SSD for local storage and are optimized for compute intensive workloads. Since data is stored locally on the nodes, adding more compute nodes will not only increase storage capacity but performance as well. DC and RA3 nodes are typically better for analytic workloads so we suggest using one of those two. The decision of which of those two to use will depend largely on how much data storage you need and what might be a more cost effective solution for you.
See Redshift Cluster documentation for more information: Redshift Cluster & Node Types
Sigma Materialization & Redshift Materialized Views
Sigma Dataset Materialization
Redshift takes large performance hits when joining tables, especially when the joins involve large tables that may not have proper distribution keys set. The easiest way to alleviate this problem is to utilize Sigma materialization. Materialization can be used on any Sigma Dataset built on a connection with write access enabled. Materialization will take the query for a dataset and write it as a single new table in Redshift and update the data in that table based on a schedule you set. What this means is that Sigma is now querying a single table containing multiple tables of data already joined rather than querying the tables individually and combining them via JOIN operations on the fly.
- Join tables in the dataset so that the dataset is comprised of multiple tables
- Make sure to limit dataset to only columns needed (See Below Section — Only use columns needed)
- Make sure to filter down dataset to only rows needed (See Below Section — Decrease Row Count)
- Turn on Materialization (See docs: Sigma Materialization) and set refresh schedule
Redshift Materialized Views
Redshift Materialized Views work pretty much exactly like Sigma Materialization works. They are pre-computed views based on a SQL result that can be based off one or multiple tables.
Note: If you’ve setup Sigma Materialization there is no need to setup Redshift Materialized Views on the same queries as they essentially accomplish the same thing. The same is true vice vera, if you’ve setup Redshift Materialized Views there is no need to setup Sigma Materialization on those same Redshift Materialized Views in most situations. A situation where you might still want to use Sigma Materialization on top of a Redshift Materialized View would be if you added an additional table via a Join to the View in Sigma. Sigma Materialization gives you an easy straightforward way to create and set the refresh schedules for these views while the setup of Redshift Materialized Views may require Redshift domain expertise and specific Redshift console access to implement.
You can turn queries into Redshift Materialized Views by wrapping the query in a CREATE MATERIALIZED VIEW with AUTO REFRESH statement. This ofcourse will have to be done outside of Sigma, most likely within the Redshift console.
See Redshift Docs: Create Redshift Materialized View
Only Use Columns Needed — Limit Worksheets/Workbooks
Redshift is a columnar database, meaning data is stored as columns and not rows. As such, the less columns selected means the less data that has to be scanned. Redshift is more performant when you select only the columns you actually need to work with. Loading all the columns of a table or ones that just don’t get used in your analysis will hurt performance. Begin all your worksheets/workbooks by only loading needed columns or deleting the ones you don’t use within them.
Decrease Row Count using Filters
In Redshift, the less rows queried means less data has to move between nodes for processing. As such, less rows in your worksheet/workbook mean better performance . Minimize the number of rows being queried from a worksheet/workbook by using filters. Applying filters early on in your work will ensure that your worksheets/workbooks will be subsequently more performant.
See Sigma Docs: Creating Sigma Filters
Using Relative Date Filters
One way to ensure that your data is being filtered down to only the data you need is to use relative date filters. If your use case only requires you to use the most recent 90 days of data, a relative date filter will ensure that Sigma is only loading the most recent 90 days of data on load. Using relative date filters save you from querying more data than is necessary while also saving you the hassle of constantly updating your datetime filter.
Relative Date Filters can be applied by adding a filter to any date type field, then selecting the “relative” tab in the date picker, selecting the correct time scale (year, month, day) and number of days for example:
Redshift Concurrency Scaling
As more and more users are added to Sigma the accompanying load on Redshift will increase and that can mean slower performance. Luckily, Concurrency Scaling allows Redshift to add capacity dynamically in response to workload. Redshift does this by spinning up additional server clusters and distributing the workload across the newly added clusters. This functionality allows Redshift to improve performance dynamically when the number of users hitting Sigma at a given time spikes or if there is a spike in expensive worksheets/queries being used.
- Allows you to scale from 1 to 10 clusters
- Free billing credits provided for concurrency scaling can often cover the extra cost
- Redshift Concurrency Scaling Explained
Redshift workload management allows you to setup queues for different types of workloads, then assign/allocate specific amounts of memory to each queue, ensuring more important workloads receive more compute power and are not bogged down by less important queries. If you have different types of workloads hitting the Redshift instance that Sigma is using you can take advantage of WLM to assign greater compute resources to Sigma workloads.
- What you can do here is create a separate queue for Sigma workloads and assign the queue more memory and slots than other queues if you want to give Sigma queries priority over other queries
- Redshift WLM
If queries are consistently taking too long to complete, you may want to consider manually resizing your Redshift Cluster. Elastic/Classic resizing is different than concurrency scaling as it is a manual exercise and not something automatically kicked off based on load like concurrency scaling. Classic resizing will first put the original cluster in read-only mode, create the new cluster, then transfer the data from the old cluster to the new one. A big drawback for classic resizing is the amount of time it takes for the whole operation to complete.
Elastic resizing is different in that it allows for the addition of nodes or modification of node type of an existing cluster. In an elastic resize operation existing workloads on the cluster are paused and then data is redistributed to the new nodes. The benefit of using elastic resize over classic resize is that the operation will occur much faster, often in a matter of minutes. The negative aspect of using elastic resize is that you will lose access to the original clusters monitoring metrics and database statistics, you will also lose access to many of the system views and tables.
A good metric to use to determine if your cluster needs to be resized is CPU utilization. If CPU utilization is consistently above a specific threshold that you consider high, it might be time to resize your cluster and add additional nodes.
Additional/ Use Case Specific Optimization Options:
Redshift advisor analyzes your cluster and develops recommendations to help improve performance on the cluster. Advisor is only supposed to display recommendations that may give a significant improvement to performance. Some of the types of recommendations that advisor can give revolve around distribution keys, sort keys, table compression, and table statistics.
Distribution keys determine how rows for a given table are distributed across nodes in your cluster. Distribution keys can become very important with regards to performance especially when joining large tables. The key here, no pun intended…HA!, is to set keys on both tables so that rows that will be joined together are co-located on the same node. If a commonly used dataset or worksheet in sigma is joining large tables, these assets might achieve performance gains by setting the proper distribution keys on both tables. Please note that there are tradeoffs to optimizing dist keys to improve join performance on two specific tables, such as if a third table is joined to one of the other two tables on a different column than specified in the distribution key. That join could be less performant than if no distribution keys were set. Another important note here is that using Sigma Materialization or Redshift Materialized Views to complete a join will do away with the need to set distribution keys on those tables.
Redshift sort keys determine the order in which rows for a given table are stored. Applying the appropriate sort key for a table can improve performance, especially for Sigma datasets and worksheets using range specific filters (think date range filters). Properly established sort keys allow Redshift to skip over large numbers of blocks/data during table scans and hence improve performance. The two types of sort keys are compound keys and interleaved keys. Compound keys will deliver better performance than interleaved keys in most situations. Compound keys will sort the data by multiple columns based in the order they are specified upon table creation. If your use case requires you to often filter by values within a given region with a certain date range your compound key creation might look something like this COMPOUND SORTKEY(date, region). One thing to note is that as new rows are added to the table, you will eventually need to run VACUUM operations to resort the newly added data as they will be unsorted until that is explicitly done. Again, just as with distribution keys, you can forego doing this database operation if for a particular use case Sigma Materialization or Redshift materialized views are used, and with with appropriate sorting being applied on the materialized dataset.
Tables Statistics Recommendations
Redshift uses a cost based optimizer to create custom query plans for all queries. Query plans help determine how much memory should be allocated for a given query. If a query plan does not allocate sufficient memory for a given query, that query will spill to disk and take a significant hit in its performance. As tables grow, queries may need to be allocated more memory than they had been before, that’s why it is important for table statistics to be re-analyzed periodically. Using Redshift advisor and running the ANALYZE command will help you do this.
Column compression and encoding allows you to compress column data so that it is consuming less cluster storage. It not only reduces the amount of storage you use though, it also helps query performance as there is less data to scan. Redshift advisor will provide the query to use to determine which tables have significant uncompressed data. You can then use the ANALYZE COMPRESSION command to get suggested compression encodings to recreate the table with.