

This is the fourth article in the ‘Data Lake Querying in AWS’ blog series, in which we introduce different technologies to query data lakes in AWS, i.e. in S3.

In the first article of this series, we discussed how to optimise data lakes by using proper file formats (Apache Parquet) and other optimisation mechanisms (partitioning); we also introduced the concept of the data lakehouse. We presented an example of how to convert raw data (most data landing in data lakes is in a raw format such as CSV) into partitioned Parquet files with Athena and Glue in AWS; a sketch of that conversion follows the list below. In that example, we used a dataset from the popular TPC-H benchmark and generated three versions of the dataset:

- Raw (CSV): 100 GB; the largest tables are lineitem with 76 GB and orders with 16 GB, split into 80 files.
- Parquets without partitions: 31.5 GB; the largest tables are lineitem with 21 GB and orders with 4.5 GB, also split into 80 files.
- Partitioned Parquets: 32.5 GB; the largest tables, which are partitioned, are lineitem with 21.5 GB and orders with 5 GB, with one partition per day; each partition has one file, and there are around 2,000 partitions for each table. The rest of the tables are left unpartitioned.
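To make that conversion concrete, here is a minimal CTAS sketch in Athena; the bucket, database, and table names are hypothetical placeholders rather than the exact setup used in this series:

```sql
-- Hypothetical names; Athena CTAS that rewrites the raw CSV table to S3
-- as Parquet files partitioned by ship date.
CREATE TABLE tpch.lineitem_parquet
WITH (
    format            = 'PARQUET',
    external_location = 's3://my-data-lake/tpch/lineitem_parquet/',
    partitioned_by    = ARRAY['l_shipdate']
) AS
SELECT l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity,
       l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus,
       l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment,
       l_shipdate   -- partition columns must come last in the SELECT list
FROM tpch.lineitem_csv;
```

Bear in mind that a single Athena CTAS or INSERT INTO statement can write at most 100 partitions, so producing the roughly 2,000 daily partitions mentioned above means running the load in batches (or using a Glue job instead).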
In the second article, we ran queries (also defined by the TPC-H benchmark) with Athena and compared the performance we got depending on the dataset characteristics; as expected, the best performance was observed with Parquet files. However, we found out that for some situations a pure data lake querying engine like Athena may not be the best choice, so in our third and latest blog article we introduced Redshift as an alternative for situations in which Athena is not recommended.

As we discussed in that blog post, Amazon Redshift is a highly scalable AWS data warehouse with its own data structure that optimises queries, and it additionally offers Spectrum to query external data. We also mentioned that the combination of S3, Athena and Redshift is the basis of the data lakehouse architecture proposed by AWS.
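As a quick reminder of how that external querying looks on the Redshift side, a Spectrum setup boils down to an external schema over the Glue Data Catalog; the IAM role and object names below are hypothetical:

```sql
-- Hypothetical role and database names; Spectrum reads the S3 data in place.
CREATE EXTERNAL SCHEMA spectrum_tpch
FROM DATA CATALOG
DATABASE 'tpch'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role';

-- External tables registered in Glue are then queryable like local ones.
SELECT l_returnflag, SUM(l_quantity) AS total_qty
FROM spectrum_tpch.lineitem_parquet
GROUP BY l_returnflag;
```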

In this article, we will look at Snowflake, another cloud data warehouse technology that is widely used and that can also be deployed to build a lakehouse. We have already mentioned Snowflake in previous articles like Big Data Analytics in Amazon Web Services and Real-time Analytics with AWS, in which we demonstrated batch and real-time data platforms on AWS with Snowflake as the data warehouse technology. We are not going to make a thorough comparison between Snowflake and Redshift; nonetheless, if you are interested in discovering more about the benefits of both, you should check out our blog post describing how we migrated from Redshift to Snowflake.

Specifically, we will demonstrate how to use Snowflake to query the three different versions of the TPC-H dataset using external tables. We will also see how to import data into the Snowflake cluster and how to query this internal data in order to – spoiler alert! – get the best performance. We will also delve into some technical aspects like its configuration and how it connects to Tableau, concluding with an overview of Snowflake’s capabilities.
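As a preview of what that looks like in practice, the sketch below shows the two access patterns; the stage, bucket, and table names are hypothetical, and only two of the lineitem columns are spelled out for brevity:

```sql
-- Hypothetical names; a stage pointing at the Parquet version of the dataset.
CREATE OR REPLACE STAGE tpch_stage
  URL = 's3://my-data-lake/tpch/parquet/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

-- External table: the data stays in S3 and is read on the fly.
CREATE OR REPLACE EXTERNAL TABLE lineitem_ext (
  l_returnflag STRING AS (value:l_returnflag::STRING),
  l_quantity   NUMBER AS (value:l_quantity::NUMBER)
)
WITH LOCATION = @tpch_stage/lineitem/
FILE_FORMAT = (TYPE = PARQUET);

-- Internal table: the data is imported into Snowflake's own storage, which
-- is where the best performance is to be expected.
CREATE OR REPLACE TABLE lineitem (l_returnflag STRING, l_quantity NUMBER);

COPY INTO lineitem
FROM (SELECT $1:l_returnflag::STRING, $1:l_quantity::NUMBER
      FROM @tpch_stage/lineitem/)
FILE_FORMAT = (TYPE = PARQUET);

-- Internal data can then be queried like any other table.
SELECT l_returnflag, SUM(l_quantity) FROM lineitem GROUP BY l_returnflag;
```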

Snowflake is a fully managed cloud data warehouse that offers a cloud-based data storage and analytics service. It goes beyond data warehousing, as Snowflake is also positioned as a data lakehouse. Snowflake is built on top of Amazon Web Services, Microsoft Azure, or Google Cloud Platform. It combines the best of a PaaS and a SaaS service: there is no hardware or software to select, install, configure, or manage.
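To illustrate that last point: compute in Snowflake is just another SQL object, so spinning up (and auto-suspending) a cluster takes a single statement; the warehouse name below is a hypothetical example:

```sql
-- Hypothetical name; compute is provisioned, resized, and suspended via SQL,
-- with no hardware or software to manage.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
       AUTO_SUSPEND   = 60    -- suspend after 60 idle seconds
       AUTO_RESUME    = TRUE;
```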
