Greenplum Database, an open source data warehouse, provides many advantages over traditional data warehousing systems.
- Open Source
- Commodity Hardware
- Shared Nothing Architecture
- High Availability
- Workload Management
- Scalability and Flexibility
When compared to the competition, Greenplum Database is the only product that performs exceptionally well in all of these categories.
Open Source Data Warehouse
The global adoption and momentum building around Linux have demonstrated the power and value of utilizing open source software in the enterprise. Open source offers many of the benefits that have been missing from the traditional proprietary commercial software industry. Open source software:
- Insulates enterprises from vendor lock-in
- Lowers the cost of ownership
- Leverages the efforts of a global developer community
Commodity Hardware
One of Greenplum Database’s greatest strengths is that it can run on off-the-shelf, low-cost commodity servers. Greenplum Database was designed specifically to take advantage of the tremendous price/performance advantages that commodity computing delivers over traditional proprietary SMP-based systems.
Greenplum Database supports standard hardware configurations from Dell, HP, Sun, and other hardware vendors. A typical Greenplum Database compute host has the following hardware resources:
- 2 dual-core CPUs (typically Xeon or Opteron)
- 16 GB of RAM
- 2 Gigabit Ethernet interfaces
- 1 SATA RAID disk controller per 8 drives
- 16 SATA 400 GB hard drives
By leveraging commodity systems, Greenplum Database requires less than $25,000 (US) of hardware per terabyte of usable warehousing capacity.
Shared Nothing Architecture
Business Intelligence (BI) processing normally involves repeated scanning of the entire contents of a deep repository of data to compute the results of complex queries. On the other hand, most of today’s general-purpose relational database management systems have been designed for Online Transaction Processing (OLTP) applications, where simple queries are repeatedly processed using small amounts of data. Databases designed to handle OLTP workloads often perform poorly when faced with BI applications that require full-table scans, many table joins, sorting, or aggregation against very large volumes of data.
When a query scans the entire contents of the data stored in a database, its speed will be limited by the bandwidth of its connections to the physical storage. Greenplum Database’s shared-nothing approach separates the physical storage into small units on individual segment instances, each with a dedicated, independent high-speed channel connection to local disks.
These segment instances are connected by the Greenplum Database Interconnect and database optimizer technology. They perform work in parallel and use all disk connections simultaneously. As a result, the database system consists of a number of self-contained parallel processing units and is able to scale storage capacity and processing power together to answer complex queries on growing data repositories. Each segment instance acts as a self-contained database processor that owns and manages a distinct portion of the overall data.
Because shared-nothing databases automatically distribute data and make query workloads parallel across all available hardware, they dramatically outperform general-purpose database systems for BI workloads.
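How data is divided among the segments is determined when a table is created, using Greenplum's DISTRIBUTED BY clause. As a minimal sketch (the table and column names are illustrative), each row's distribution key is hashed to assign the row to a segment:

```sql
-- Rows are hashed on sale_id and distributed across all segments,
-- so scans and joins on this table run in parallel.
-- (Table and column names are illustrative.)
CREATE TABLE sales (
    sale_id     integer,
    sale_date   date,
    amount      numeric(10,2)
) DISTRIBUTED BY (sale_id);
```

Choosing a distribution key with high cardinality, such as a unique identifier, spreads rows evenly and avoids hotspots on individual segments.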
High Availability
Greenplum Database provides redundancy of its components so that there is no single point of failure in the system. Greenplum Database achieves a high degree of failover protection through cyclical data redundancy: each data segment is mirrored on an alternate host, and each segment instance manages one distinct segment of data (either a primary or a backup copy).
In a typical Greenplum Database implementation, there is generally one primary segment instance per CPU, often several per host. When an active segment instance fails, Greenplum Database automatically redirects connections to the backup segment on an alternate host.
Parallel Data Loading and Single Row Error Handling
One challenge of large-scale, multi-terabyte data warehouses is loading large amounts of data within a given maintenance window. Greenplum supports fast, parallel data loading with its external tables feature. Using external tables, data can be loaded at rates in excess of 2 TB per hour.
External tables provide an easy way to perform basic extraction, transformation, and loading (ETL) tasks that are common in data warehousing. External table files are read in parallel by the Greenplum Database segment instances, so they also provide a means for fast data loading. External tables consist of flat files that reside outside of the database. Creating an external table allows you to access these flat files as though they were a regular database table. External table data can be queried directly (and in parallel) using regular SQL commands.
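As an illustrative sketch, an external table might be defined over files served by gpfdist, Greenplum's parallel file distribution server, and then loaded with a plain INSERT. The host name, port, file paths, and table definitions here are assumptions, not part of any particular deployment:

```sql
-- Define a readable external table over pipe-delimited text files
-- served by a gpfdist instance on an ETL host.
-- (Host name, port, paths, and column definitions are illustrative.)
CREATE EXTERNAL TABLE ext_sales (
    sale_id   integer,
    sale_date date,
    amount    numeric(10,2)
)
LOCATION ('gpfdist://etl1:8081/sales/*.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- All segments read the files in parallel during the load.
INSERT INTO sales SELECT * FROM ext_sales;
```

Because every segment instance pulls data from the file server simultaneously, load throughput scales with the number of segments rather than being bottlenecked on a single loading process.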
Data loads can be run in ‘single row error isolation’ mode, allowing administrators to filter out bad rows during a load operation into a separate error table, while still loading properly formatted rows. Administrators can control the acceptable error threshold for a load operation, giving them control over the quality and flow of data into the database.
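A sketch of single row error isolation applied to an external table follows; the table names, location, and the 100-row threshold are illustrative, and the exact clause syntax varies by Greenplum version:

```sql
-- Rows that fail to parse are written to err_sales instead of
-- aborting the load; the load itself aborts only if more than
-- 100 rows are rejected on any segment.
-- (Names, location, and the threshold are illustrative.)
CREATE EXTERNAL TABLE ext_sales_safe (
    sale_id   integer,
    sale_date date,
    amount    numeric(10,2)
)
LOCATION ('gpfdist://etl1:8081/sales/*.txt')
FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS INTO err_sales
SEGMENT REJECT LIMIT 100 ROWS;
```

After the load, administrators can query the error table to inspect and repair the rejected rows.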
Workload Management
The purpose of Greenplum workload management is to limit the number of active queries in the system at any given time in order to avoid exhausting system resources such as memory, CPU, and disk I/O. This is accomplished by creating role-based resource queues. A resource queue has attributes that limit the size and/or total number of queries that can be executed by the users (or roles) in that queue. By assigning all of your database roles to the appropriate resource queue, administrators can control concurrent user queries and prevent the system from being overloaded.
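As a sketch of how role-based resource queues are defined (the queue name, thresholds, and role name are illustrative, and the exact attribute syntax varies by Greenplum version):

```sql
-- Limit users in this queue to 10 concurrently active statements,
-- and reject queries whose estimated cost exceeds the threshold.
-- (Queue name, limits, and role name are illustrative.)
CREATE RESOURCE QUEUE reporting_queue
    ACTIVE THRESHOLD 10
    COST THRESHOLD 1000000.0;

-- Assign a role to the queue; its queries are then governed by it.
ALTER ROLE report_user RESOURCE QUEUE reporting_queue;
```

Queries submitted beyond the queue's limits wait until a slot frees up, rather than competing for memory and I/O with the queries already running.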
Scalability and Flexibility
Greenplum Database allows for incremental, host-centric expansion. Compute, bandwidth, or mass storage capacity shortfalls can be addressed simply by adding an extra host (or hosts) to the system. With linear scalability inherent to the Greenplum Database architecture, you can easily model and make provisions for how many hosts will be required to support data warehouse growth.
Compared to traditional data warehousing solutions, Greenplum Database offers the best price/performance ratio on the market.
Greenplum Database achieves its tremendous performance advantages through parallelism. SQL statements executed within Greenplum Database are broken into smaller components, and all components are worked on at the same time by the individual segments to deliver a single result set. All relational operations—such as table scans, index scans, joins, aggregations, and sorts—execute in parallel across the segments simultaneously. Each segment performs its portion of an operation independently, using only the data it owns. This parallel execution delivers results up to 100 times faster than traditional database management systems.
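The parallel plan for any statement can be inspected with the standard EXPLAIN command; in Greenplum, the output also shows the motion nodes that move partial results between segments for the final answer. A minimal sketch (the table and column names are illustrative):

```sql
-- Show the parallel plan for an aggregation: each segment scans and
-- partially aggregates its own data, and a motion node gathers the
-- partial results. (Table and column names are illustrative.)
EXPLAIN
SELECT sale_date, sum(amount)
FROM sales
GROUP BY sale_date;
```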