Greenplum Database and PostgreSQL
The object-relational database management system known as PostgreSQL is derived from the POSTGRES package written at the University of California at Berkeley. With almost three decades of development behind it, PostgreSQL is now the most advanced open-source database available.
Greenplum Database is built upon the PostgreSQL 8.2.5 code base and has many similarities to PostgreSQL. For example, many of the client and server applications, configuration files, supported SQL commands, and syntax are the same as or very similar to those in PostgreSQL.
Greenplum Database is essentially several PostgreSQL instances acting as one cohesive database management system. The internals of PostgreSQL have been modified or supplemented to support the parallel structure of Greenplum Database. For example, the system catalog has been supplemented to track all of the segment instances that make up a Greenplum database. The query parser, query planner, query optimizer, and query executor processes have been modified and enhanced to be able to execute queries in parallel across all of the segments.
Data Query and Manipulation Language (DQL/DML) is essentially supported as it is in PostgreSQL. SELECT, INSERT, UPDATE, and DELETE are DQL/DML commands. All other SQL commands are considered Data Definition Language (DDL) or utility commands. Most DDL and utility SQL statements are supported in Greenplum Database as they are in PostgreSQL, with a few minor exceptions. See “SQL Support” for more information.
Greenplum Features and Benefits
Greenplum Database, one of many open source data warehouses, provides many advantages over traditional data warehousing systems:
- Open Source
- Commodity Hardware
- Shared Nothing Architecture
- High Availability
- Workload Management
- Scalability and Flexibility
- Performance
When compared to the competition, Greenplum Database is the only product that performs exceptionally well in all of these categories.
Open Source Data Warehouse
The global adoption and momentum building around Linux have demonstrated the power and value of utilizing open source software in the enterprise. Open source offers many of the benefits that have been missing from the traditional proprietary commercial software industry. Open source software:
- Insulates enterprises from vendor lock-in
- Lowers the cost of ownership
- Leverages the efforts of a global developer community
Commodity Hardware
One of Greenplum Database’s greatest strengths is that it can run on off-the-shelf, low-cost commodity servers. Greenplum Database was designed specifically to take advantage of the tremendous price/performance advantages that commodity computing delivers over traditional proprietary SMP-based systems.
Greenplum Database supports standard hardware configurations from Dell, HP, Sun, and other hardware vendors. A typical Greenplum Database compute host has the following hardware resources:
- 2 dual-core CPUs (typically Xeon or Opteron)
- 16 GB of RAM
- 2 Gigabit Ethernet interfaces
- 1 SATA RAID disk controller per 8 drives
- 16 SATA 400 GB hard drives
By leveraging commodity systems, Greenplum Database requires less than $25,000 (US) of hardware per terabyte of usable warehousing capacity.
Shared Nothing Architecture
Business Intelligence (BI) processing normally involves repeated scanning of the entire contents of a deep repository of data to compute the results of complex queries. On the other hand, most of today’s general-purpose relational database management systems have been designed for Online Transaction Processing (OLTP) applications, where simple queries are repeatedly processed using small amounts of data. Databases designed to handle OLTP workloads often perform poorly when faced with BI applications that require full-table scans, many table joins, sorting, or aggregation against very large volumes of data.
When a query scans the entire contents of the data stored in a database, its speed will be limited by the bandwidth of its connections to the physical storage. Greenplum Database’s shared-nothing approach separates the physical storage into small units on individual segment instances, each with a dedicated, independent high-speed channel connection to local disks.
These segment instances are connected by the Greenplum Database Interconnect and database optimizer technology. They perform work in parallel and use all disk connections simultaneously. As a result, the database system consists of a number of self-contained parallel processing units and is able to scale storage capacity and processing power together to answer complex queries on growing data repositories. Each segment instance acts as a self-contained database processor that owns and manages a distinct portion of the overall data.
Because shared-nothing databases automatically distribute data and make query workloads parallel across all available hardware, they dramatically outperform general-purpose database systems for BI workloads.
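The shared-nothing idea described above can be illustrated with a minimal sketch. It assumes a hypothetical four-segment system, uses CRC32 as a stand-in hash (not necessarily the hash Greenplum Database uses), and uses threads to stand in for independent segment instances. The names `segment_for`, `distribute`, and `parallel_scan` are invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SEGMENTS = 4  # hypothetical segment count for this sketch


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution key to a segment using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments=NUM_SEGMENTS):
    """Place each row on exactly one segment, so every segment owns a distinct slice."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def scan_segment(segment_rows, predicate):
    """Full scan of one segment's local data only."""
    return [row for row in segment_rows if predicate(row)]


def parallel_scan(segments, predicate):
    """Scan all segments at the same time and gather the matching rows."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(scan_segment, segments, [predicate] * len(segments)))
    return [row for part in partials for row in part]


rows = [{"order_id": i, "amount": i * 10} for i in range(1, 101)]
segments = distribute(rows, "order_id")
big_orders = parallel_scan(segments, lambda r: r["amount"] > 900)
```

Because each row lives on exactly one segment, no segment needs another segment's data to finish its part of the scan; only the final gather step touches all of the partial results.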
High Availability
Greenplum Database provides for redundancy of its components so that there is no single point of failure in the Greenplum Database system. Greenplum Database provides a high degree of system fail-over through cyclical data redundancy. Each data segment is mirrored on an alternate host, where each segment instance manages one distinct segment of data (either a primary or a backup copy).
In a typical Greenplum Database implementation, there is generally one primary segment instance per CPU, several to a host. When an active segment instance fails, Greenplum Database automatically redirects connections to the backup segment on an alternate host.
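As a rough sketch of the mirroring and failover behavior described above, the following assumes a simple ring layout in which each segment's mirror sits on the next host. That placement rule, and the names `build_layout` and `route`, are assumptions made for illustration and do not describe how Greenplum Database actually places mirrors.

```python
def build_layout(num_hosts, segments_per_host):
    """Assign each segment a primary host and a mirror on the next host in the ring."""
    if num_hosts < 2:
        raise ValueError("mirroring on an alternate host needs at least two hosts")
    layout = {}
    for seg in range(num_hosts * segments_per_host):
        primary_host = seg // segments_per_host
        mirror_host = (primary_host + 1) % num_hosts  # always a different host
        layout[seg] = {"primary": primary_host, "mirror": mirror_host}
    return layout


def route(layout, seg, failed_hosts=frozenset()):
    """Return the host that should serve a segment, falling back to its mirror."""
    placement = layout[seg]
    if placement["primary"] not in failed_hosts:
        return placement["primary"]
    if placement["mirror"] not in failed_hosts:
        return placement["mirror"]
    raise RuntimeError("segment %d has no live copy" % seg)


layout = build_layout(num_hosts=4, segments_per_host=2)
normal_host = route(layout, 0)                       # served by the primary
failover_host = route(layout, 0, failed_hosts={0})   # served by the mirror
```

With this layout, losing any single host leaves every segment reachable, because a segment's primary and mirror never share a host.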
Parallel Data Loading and Single Row Error Handling
One challenge of large scale, multi-terabyte data warehouses is getting large amounts of data loaded within a given maintenance window. Greenplum supports fast, parallel data loading with its external tables feature. Using external tables, data can be loaded in excess of 2 TB an hour.
External tables provide an easy way to perform basic extraction, transformation, and loading (ETL) tasks that are common in data warehousing. External table files are read in parallel by the Greenplum Database segment instances, so they also provide a means for fast data loading. External tables consist of flat files that reside outside of the database. Creating an external table allows you to access these flat files as though they were a regular database table. External table data can be queried directly (and in parallel) using regular SQL commands.
Data loads can be run in ‘single row error isolation’ mode, allowing administrators to filter out bad rows during a load operation into a separate error table, while still loading properly formatted rows. Administrators can control the acceptable error threshold for a load operation, giving them control over the quality and flow of data into the database.
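The idea behind single row error isolation can be sketched as follows. This is not Greenplum's loader: the three-column row format, the `reject_limit` parameter, and the `LoadAborted` exception are all hypothetical, chosen only to show good rows being loaded while bad rows are diverted to an error list until a threshold is exceeded.

```python
import csv
import io


class LoadAborted(Exception):
    """Raised when the number of rejected rows exceeds the allowed threshold."""


def parse_row(fields):
    """Hypothetical row format: id (int), name (non-empty text), amount (float)."""
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d" % len(fields))
    row_id, name, amount = fields
    if not name:
        raise ValueError("name is empty")
    return (int(row_id), name, float(amount))


def load(source, reject_limit):
    """Load well-formed rows; divert malformed rows to an error list."""
    loaded, errors = [], []
    for line_no, fields in enumerate(csv.reader(source), start=1):
        try:
            loaded.append(parse_row(fields))
        except ValueError as exc:
            errors.append((line_no, fields, str(exc)))
            if len(errors) > reject_limit:
                raise LoadAborted("more than %d bad rows" % reject_limit)
    return loaded, errors


data = (
    "1,widget,9.99\n"
    "2,gadget,not-a-number\n"
    "3,gizmo,4.50\n"
    "four,doohickey,1.00\n"
    "5,sprocket,2.25\n"
)
loaded, errors = load(io.StringIO(data), reject_limit=2)
```

Here `loaded` holds the three well-formed rows and `errors` records the line number, raw fields, and reason for each of the two rejected rows. Lowering `reject_limit` to 1 would abort the load at the second bad row.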
Workload Management
The purpose of Greenplum workload management is to limit the number of active queries in the system at any given time in order to avoid exhausting system resources such as memory, CPU, and disk I/O. This is accomplished by creating role-based resource queues. A resource queue has attributes that limit the size and/or total number of queries that can be executed by the users (or roles) in that queue. By assigning every database role to the appropriate resource queue, administrators can control concurrent user queries and prevent the system from being overloaded.
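A resource queue that caps the number of active queries can be approximated with a counting semaphore. The sketch below is only an analogy: the `ResourceQueue` class, the role names, and the limit of three active queries are assumptions for illustration, and limits on query size are left out.

```python
import threading
import time


class ResourceQueue:
    """Admit at most `active_limit` queries at once; the rest wait for a free slot."""

    def __init__(self, name, active_limit):
        self.name = name
        self._slots = threading.BoundedSemaphore(active_limit)
        self._lock = threading.Lock()
        self.active = 0
        self.peak_active = 0

    def run(self, query_fn):
        with self._slots:  # blocks while the queue is full
            with self._lock:
                self.active += 1
                self.peak_active = max(self.peak_active, self.active)
            try:
                return query_fn()
            finally:
                with self._lock:
                    self.active -= 1


def submit(role, query_fn, role_to_queue):
    """Run a query through the queue assigned to the submitting role."""
    return role_to_queue[role].run(query_fn)


def fake_query(n):
    time.sleep(0.01)  # stand-in for real query work
    return n * n


adhoc = ResourceQueue("adhoc", active_limit=3)
role_to_queue = {"analyst": adhoc, "report_user": adhoc}  # hypothetical roles

results = []
results_lock = threading.Lock()


def worker(n):
    value = submit("analyst", lambda: fake_query(n), role_to_queue)
    with results_lock:
        results.append(value)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All eight queries complete, but `adhoc.peak_active` never exceeds three, which is the behavior the queue is meant to guarantee.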
Scalability and Flexibility
Greenplum Database allows for incremental, host-centric expansion. Compute, bandwidth, or mass storage capacity issues can be easily addressed and corrected simply by adding an extra host (or hosts) into the system. With linear scalability inherent to the Greenplum Database architecture, you can easily model and make provisions for how many hosts will be required to support data warehouse growth.
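If capacity really does scale linearly with the number of hosts, the host count for a given data volume is simple arithmetic. The sketch below assumes a hypothetical usable capacity per host and a hypothetical annual growth rate; neither figure comes from Greenplum documentation.

```python
import math


def hosts_required(target_tb, usable_tb_per_host):
    """Hosts needed for a target volume, assuming capacity grows linearly with hosts."""
    if target_tb < 0 or usable_tb_per_host <= 0:
        raise ValueError("target must be >= 0 and per-host capacity must be > 0")
    return math.ceil(target_tb / usable_tb_per_host)


def expansion_plan(current_tb, annual_growth_rate, years, usable_tb_per_host):
    """Project data volume year by year and the host count each year would need."""
    plan = []
    for year in range(years + 1):
        projected_tb = current_tb * (1 + annual_growth_rate) ** year
        plan.append(
            (year, round(projected_tb, 2), hosts_required(projected_tb, usable_tb_per_host))
        )
    return plan


# Hypothetical inputs: 10 TB today, 50% growth per year, 2.5 TB usable per host.
plan = expansion_plan(current_tb=10, annual_growth_rate=0.5, years=2, usable_tb_per_host=2.5)
```

Each tuple in `plan` is (year, projected terabytes, hosts required), so the difference between consecutive host counts is the number of hosts to add that year.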
Performance
Compared to traditional data warehousing solutions, Greenplum Database offers the best price/performance ratio on the market.
Greenplum Database achieves its tremendous performance advantages through parallelism. SQL statements executed within Greenplum Database are broken into smaller components, and all components are worked on at the same time by the individual segments to deliver a single result set. All relational operations—such as table scans, index scans, joins, aggregations, and sorts—execute in parallel across the segments simultaneously. Each operation is performed on a segment independent of the data associated with the other segments. This parallel execution delivers results up to 100 times faster than traditional database management systems.
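One way to picture how a statement is split into per-segment components and combined into a single result set is a two-phase aggregate. The sketch below computes a grouped average: each segment produces partial sums and counts from its own rows, and a final step merges them. The function names and sample data are hypothetical, and threads stand in for segments; this is not a description of the actual Greenplum executor.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(segment_rows):
    """Phase 1, run on each segment: per-group sum and count over local rows only."""
    partial = defaultdict(lambda: [0.0, 0])
    for region, amount in segment_rows:
        partial[region][0] += amount
        partial[region][1] += 1
    return dict(partial)


def merge_partials(partials):
    """Phase 2: combine the per-segment partials into one average per group."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for region, (subtotal, count) in partial.items():
            totals[region][0] += subtotal
            totals[region][1] += count
    return {region: subtotal / count for region, (subtotal, count) in totals.items()}


def parallel_average(segments):
    """Run phase 1 on all segments at once, then merge into a single result set."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(partial_aggregate, segments))
    return merge_partials(partials)


# Hypothetical rows of (region, amount), already spread across three segments.
segments = [
    [("east", 10.0), ("west", 20.0)],
    [("east", 30.0)],
    [("west", 40.0), ("west", 60.0)],
]
averages = parallel_average(segments)
```

Carrying sums and counts rather than per-segment averages is what makes the merge correct: averaging the averages would give the wrong answer whenever segments hold different numbers of rows for a group.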

Greenplum: Open Source Data Warehouse
Greenplum Basics
In 2005, Greenplum released an enterprise-level massively parallel processing (MPP) version of PostgreSQL called Greenplum Database. Greenplum Database is the industry’s first MPP database server based on open-source technology. It is explicitly designed to support business intelligence (BI) applications and large, multi-terabyte data warehouses.
Greenplum Database is the first open source powered database server that can scale to support multi-terabyte data warehousing demands. It is based on PostgreSQL, the most advanced open-source database available. This section explains Greenplum Database’s similarities and differences as compared to PostgreSQL.
Since that time, Greenplum has continued to contribute actively to the PostgreSQL community by submitting features that make PostgreSQL a more robust open source database for Business Intelligence applications. Likewise, Greenplum benefits from the advances made by the PostgreSQL development community.
Greenplum History
Greenplum was formed in 2003 by the merger of Metapa and Didera with the goal of developing a low-cost, high-performance, large-scale open source data warehouse solution. Greenplum is led by pioneers in open source, database systems, data warehousing, supercomputing, and Internet performance acceleration with technical staff from companies such as Oracle, Sybase, Informix, Teradata, Netezza, Tandem, and Sun. Greenplum company headquarters are in San Mateo, California.
By utilizing open source software and commodity, off-the-shelf hardware, Greenplum’s vision is to make enterprise data as available and easy to use for business users as Web data is for consumers. Businesses need to make faster, more accurate business decisions to gain competitive advantage. The task of managing and scaling data for business reporting has traditionally been difficult and expensive, and in the past 20 years the database infrastructure on which business intelligence (BI) systems are built has not evolved significantly.
Greenplum recognizes that companies are moving to low cost computing and replacing proprietary, Unix-based hardware and software with Intel-based hardware running Linux. Greenplum’s offerings are specifically designed to help companies take advantage of the price and performance returns of Linux.

The Compelling Need For Data Warehousing
Why are companies rushing into data warehousing? Why is there a tremendous surge in interest? Data warehousing is no longer a purely novel idea just for research and experimentation. It has become a mainstream phenomenon. True, the data warehouse is not in every doctor’s office yet, but neither is it confined to only high-end businesses. More than half of all U.S. companies and a large percentage of worldwide businesses have made a commitment to data warehousing.
In every industry across the board, from retail chain stores to financial institutions, from manufacturing enterprises to government departments, and from airline companies to utility businesses, data warehousing is revolutionizing the way people perform business analysis and make strategic decisions.
In the 1990s, as businesses grew more complex, corporations spread globally, and competition became fiercer, business executives became desperate for information to stay competitive and improve the bottom line. The operational computer systems did provide information to run the day-to-day operations, but what the executives needed were different kinds of information that could be readily used to make strategic decisions. They wanted to know where to build the next warehouse, which product lines to expand, and which markets they should strengthen. The operational systems, important as they were, could not provide strategic information. Businesses, therefore, were compelled to turn to new ways of getting strategic information.
Over the past two decades, companies have accumulated tons and tons of data about their operations. Mountains of data exist. Information is said to double every 18 months. If we have such huge quantities of data in our organizations, why can’t our executives and managers use this data for making strategic decisions? Lots and lots of information exists. Why then do we talk about an information crisis? Most companies are faced with an information crisis not because of lack of sufficient data, but because the available data is not readily usable for strategic decision making. These large quantities of data are very useful and good for running the business operations, but hardly amenable for use in making decisions about business strategies and objectives.
The fact is that for nearly two decades or more, IT departments have been attempting to provide information to key personnel in their companies for making strategic decisions. Sometimes an IT department could produce ad hoc reports from a single application. In most cases, the reports would need data from multiple systems, requiring the writing of extract programs to create intermediary files that could be used to produce the ad hoc reports.
Most of these attempts by IT in the past ended in failure. The users could not clearly define what they wanted in the first place. Once they saw the first set of reports, they wanted more data in different formats. The chain continued. This was mainly because of the very nature of the process of making strategic decisions. Information needed for strategic decision making has to be available in an interactive manner. The user must be able to query online, get results, and query some more. The information must be in a format suitable for analysis.
What is a basic reason for the failure of all the previous attempts by IT to provide strategic information? What has IT been doing all along? The fundamental reason for the inability to provide strategic information is that we have been trying all along to provide strategic information from the operational systems. These operational systems such as order processing, inventory control, claims processing, outpatient billing, and so on are not designed or intended to provide strategic information. If we need the ability to provide strategic information, we must get the information from altogether different types of systems. Only specially designed decision support systems or informational systems can provide strategic information.
At this stage of our discussion, we now realize that we do need different types of decision support systems to provide strategic information. The type of information needed for strategic decision making is different from that available from operational systems. We need a new type of system environment for the purpose of providing strategic information for analysis, discerning trends, and monitoring performance.
This new system environment that users desperately need to obtain strategic information happens to be the new paradigm of data warehousing. Enterprises that are building data warehouses are actually building this new system environment. This new environment is kept separate from the system environment supporting the day-to-day operations. The data warehouse essentially holds the business intelligence for the enterprise to enable strategic decision making. The data warehouse is the only viable solution. We have clearly seen that solutions based on the data extracted from operational systems are all totally unsatisfactory.
Basically, a data warehouse is a simple concept. It involves several functions: extracting the data, loading the data, transforming the data, storing the data, and providing user interfaces.
The end result is the creation of a new computing environment for the purpose of providing the strategic information every enterprise needs desperately. There are several vendor tools available in each of these technologies. You do not have to build your data warehouse from scratch.