Components of the Impala Server

Components of the Impala Server

The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of different daemon processes that run on specific hosts within your CDH cluster.

Continue reading:

The Impala Daemon

The core Impala component is a daemon process that runs on each node of the cluster, physically represented by the impalad process. It reads and writes to data files; accepts queries transmitted from the impala-shell command, Hue, JDBC, or ODBC; parallelizes the queries and distributes work to other nodes in the Impala cluster; and transmits intermediate query results back to the central coordinator node.

You can submit a query to the Impala daemon running on any node, and that node serves as the coordinator node for that query. The other nodes transmit partial results back to the coordinator, which constructs the final result set for a query. When running experiments with functionality through the impala-shell command, you might always connect to the same Impala daemon for convenience. For clusters running production workloads, you might load-balance between the nodes by submitting each query to a different Impala daemon in round-robin style, using the JDBC or ODBC interfaces.

The Impala daemons are in constant communication with the statestore, to confirm which nodes are healthy and can accept new work.

They also receive broadcast messages from the catalogd daemon (introduced in Impala 1.2) whenever any Impala node in the cluster creates, alters, or drops any type of object, or when an INSERT or LOAD DATA statement is processed through Impala. This background communication minimizes the need for REFRESH or INVALIDATE METADATAstatements that were needed to coordinate metadata across nodes prior to Impala 1.2.

Related information: Modifying Impala Startup OptionsStarting ImpalaSetting the Idle Query and Idle Session Timeouts for impaladPorts Used by Impala,Using Impala through a Proxy for High Availability

The Impala Statestore

The Impala component known as the statestore checks on the health of Impala daemons on all the nodes in a cluster, and continuously relays its findings to each of those daemons. It is physically represented by a daemon process named statestored; you only need such a process on one node in the cluster. If an Impala node goes offline due to hardware failure, network error, software issue, or other reason, the statestore informs all the other nodes so that future queries can avoid making requests to the unreachable node.

Because the statestore‘s purpose is to help when things go wrong, it is not critical to the normal operation of an Impala cluster. If the statestore is not running or becomes unreachable, the other nodes continue running and distributing work among themselves as usual; the cluster just becomes less robust if other nodes fail while the statestore is offline. When the statestore comes back online, it re-establishes communication with the other nodes and resumes its monitoring function.

Related information:

Scalability Considerations for the Impala StatestoreModifying Impala Startup OptionsStarting ImpalaIncreasing the Statestore TimeoutPorts Used by Impala

The Impala Catalog Service

The Impala component known as the catalog service relays the metadata changes from Impala SQL statements to all the nodes in a cluster. It is physically represented by a daemon process named catalogd; you only need such a process on one node in the cluster. Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same node.

This new component in Impala 1.2 reduces the need for the REFRESH and INVALIDATE METADATA statements. Formerly, if you issued CREATE DATABASE, DROP DATABASE,CREATE TABLE, ALTER TABLE, or DROP TABLE statements on one Impala node, you needed to issue INVALIDATE METADATA on any other node before running a query there, so that it would pick up the changes to schema objects. Likewise, if you issued INSERT statements on one node, you needed to issue REFRESH table_name on any other node before running a query there, so that it would recognize the newly added data files. The catalog service removes the need to issue REFRESH and INVALIDATE METADATAstatements when the metadata changes are performed by statement issued through Impala; when you create a table, load data, and so on through Hive, you still need to issue REFRESH or INVALIDATE METADATA on an Impala node before executing a query there.

This feature, new in Impala 1.2, touches a number of aspects of Impala:

  • See Impala InstallationUpgrading Impala and Starting Impala, for usage information for the catalogd daemon.
  • The REFRESH and INVALIDATE METADATA statements are no longer needed when the CREATE TABLE, INSERT, or other table-changing or data-changing operation is performed through Impala. These statements are still needed if such operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the statements only need to be issued on one Impala node rather than on all nodes. See REFRESH Statement and INVALIDATE METADATA Statement for the latest usage information for those statements.

By default, the metadata loading and caching on startup happens asynchronously, so Impala can begin accepting requests promptly. To enable the original behavior, where Impala waited until all metadata was loaded before accepting any requests, set the catalogd configuration option --load_catalog_in_background=false.

时间: 2024-08-19 05:10:19

Components of the Impala Server的相关文章

Cloudera Impala Guide

Impala Concepts and Architecture The following sections provide background information to help you become productive using Cloudera Impala and its features. Where appropriate, the explanations include context to help understand how aspects of Impala

【原创】大数据基础之Impala(1)简介、安装、使用

impala2.12 官方:http://impala.apache.org/ 一 简介 Apache Impala is the open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon. impala是hadoop上的开源分析性数据库: Do BI-style Queries on Hadoop Impala provides

SQL Server 2014 SP1 通过补丁KB3058865提供更新,SP1一文便知

Microsoft SQL Server 2014 SP1 更新: SQLServer2014SP1-KB3058865-architecture-language.exe 安装完成后版本 12.0.4100.1. 主要特性包如下: Microsoft® SQL Server® Backup to Windows® Azure® Tool Microsoft SQL Server Backup to Windows Azure Tool 支持备份到 Windows Azure Blob 存储,加

Impala 源码分析-FE

By yhluo 2015年7月29日 Impala 3 Comments Impala 源代码目录结构 SQL 解析 Impala 的 SQL 解析与执行计划生成部分是由 impala-frontend(Java)实现的,监听端口是 21000.用户通过Beeswax 接口 BeeswaxService.query() 提交一个请求,在 impalad 端的处理逻辑是由void ImpalaServer::query(QueryHandle& query_handle, const Query

.net面试问答(大汇总)

原文://http://blog.csdn.net/wenyan07/article/details/41541489 用.net做B/S结构的系统,您是用几层结构来开发,每一层之间的关系以及为什么要这样分层? 答: 从下至上分别为:数据访问层.业务逻辑层(又或成为领域层).表示层 数据访问层:有时候也称为是持久层,其功能主要是负责数据库的访问 业务逻辑层:是整个系统的核心,它与这个系统的业务(领域)有关 表示层:是系统的UI部分,负责使用者与整个系统的交互.  优点:  分工明确,条理清晰,易

C#面试题汇总【转】

http://www.cnblogs.com/wangjisi/archive/2010/06/14/1758347.html 用.net做B/S结构的系统,您是用几层结构来开发,每一层之间的关系以及为什么要这样分层? 答: 从下至上分别为:数据访问层.业务逻辑层(又或成为领域层).表示层 数据访问层:有时候也称为是持久层,其功能主要是负责数据库的访问 业务逻辑层:是整个系统的核心,它与这个系统的业务(领域)有关 表示层:是系统的UI部分,负责使用者与整个系统的交互.  优点:  分工明确,条理

Oracle Forms 10g Tutorial Ebook Download - Oracle Forms Blog

A step by step tutorial for Oracle Forms 10g development. This guide is helpful for freshers in Oracle forms 10g. To download this ebook click the below button: Download Oracle Forms 10g eBook See Also: Oracle Forms Recipes - Get it from Google Playh

cloudera learning4:Hadoop集群规划

涉及到一些关于硬件的东西,我也不是很懂,记录下来有待以后学习. Hadoop集群一般都是由小到大,刚开始可能只有4到6个节点,随着存储数据的增加,计算量的增大,内存需求的增加,集群慢慢变大. 比如按照数据存储量增大集群,每个星期数据存储3TB数据,HDFS的block备份数为3,则集群就需要9TB的磁盘,一般还要再预估25%buffer.如果一台机器的存储量为16*3T,则大概每个月往集群中增加1台机器. 如何进行硬件选择?一般Hadoop节点分成管理节点(master node)和工作节点(w

Linking code for an enhanced application binary interface (ABI) with decode time instruction optimization

A code sequence made up multiple instructions and specifying an offset from a base address is identified in an object file. The offset from the base address corresponds to an offset location in a memory configured for storing an address of a variable