Amazon Redshift Consultancy

 

Amazon says: “Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more”

Although Amazon Redshift is based on a very old version of PostgreSQL (8.0.2.  from 2005), Amazon Redshift and PostgreSQL have a number of sigificant differences

Storage: Row vs Column

Traditional databases like Postgres store data in rows.  Redshift stores data in columns. The data effectively becomes the key and a whole colum can be loaded into memory relatively quickly

 

OLTP vs. OLAP

Traditional databases are good at OLTP (Online Transaction Processing);  Redshift is optimised for OLAP (analytical/reporting/data warehouse load),  and makes a lot of tradeoffs along the way

Speed

This is a key attribute of Redshift – it has a Massively Parallel Processing (MPP) Architecture leveraged over a number of nodes.  Having data stored as columns of a single datatype optimises the compression strategy and dramatically reduces I/O overhead.

Database size

This is a difficult topic because use-cases can differ enormously,  so these figures are only illustrative.

A traditional Postgres database can reasonably handle up to 5 TB;  bespoke approaches using sharding and other technologies can reach Petabytes.

Redshift is Petabyte scale

Scaling

Postgres can scale vertically or horizontally via read only replicas,  each of which is a complete database copy;  although it is also possible to devise manual solutions using partitioning and sharding this adds to development and support complexity;  performance also needs to be considered..

Redshift is designed out of the box to deliver horizontal scaling by adding extra compute nodes

Redshift oddities

Unique, primary key, and foreign key constraints are informational only. These constraints may be defined (to help the query optimiser) but are not enforced – so the developer must take responsibility for enforcing constraints – and if the constraint is corrupted queries may return wrong results!

Redshift does not support indexes. However queries can be optimised using sort and distribution keys.

Fewer datatypes than Postgres (eg JSON, XML, Interval are missing)

 When to use Redshift?

 

This is very much use-case dependant,  but we use the following pointers to help make a decision.  In many cases the real test is to set up a proof of concept and run performance tests.

 

Postgres pointers

  • OLTP load – or light/medium reporting/analytical
  • Forecast size less than 10TB (ballpark depending on many things – could be lower or much higher)
  • You want control to host on-premise or in the cloud or change as you wish
  • Cost is a significant constraint
  • Tables have a modest number/size of columns (eg dozens not hundreds/thousands especially very wide columns)
  • You need Postgres features not present in Redshift – eg data types, technologies

Redshift Pointers

  • OLAP load – especially heavy duty/high performance
  • Forecast size Petabyte+
  • You are comfortable with vendor lock-in
  • Cost is not a problem
  • You can ensure unique key constraints and referential integrity
  • You are comfortable maintaining skills in-house for (another) database technology
  • Tables are very wide – hundreds/thousands of rows,  a lot of (wide) text columns