[Original] Big Data Fundamentals: Presto (1) - Introduction, Installation, and Usage

presto 0.217

Official site: http://prestodb.github.io/

I Introduction

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto is an open-source distributed SQL query engine for interactive analytic queries against data sources of all sizes, from gigabytes to petabytes, and it approaches the query speed of commercial data warehouses.

Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow "free" solution that requires excessive hardware.

Presto can query data where it already lives, including Hive, Cassandra, relational databases, and file systems such as HDFS; a single Presto query can combine data from multiple sources to produce a result; Presto thus offers a new option between the "expensive but fast" commercial solutions and the "free but slow" open-source ones.
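
For illustration only, a single federated query might look like the following sketch (the hive and mysql catalog names and the tables are hypothetical; catalogs are configured in the installation section below):

-- join fact data stored in Hive with a dimension table stored in MySQL
SELECT o.order_id, c.customer_name
FROM hive.web.orders AS o
JOIN mysql.crm.customers AS c ON o.customer_id = c.id
LIMIT 10;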

Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day.

Facebook uses Presto for interactive queries against several internal data stores, including its 300 PB data warehouse; more than 1,000 Facebook employees use Presto daily, running over 30,000 queries that together scan more than a petabyte of data per day.

Leading internet companies including Airbnb and Dropbox are using Presto.

Leading internet companies, including Airbnb and Dropbox, use Presto. Below is Airbnb's comment:

Presto is amazing. Lead engineer Andy Kramolisch got it into production in just a few days. It's an order of magnitude faster than Hive in most our use cases. It reads directly from HDFS, so unlike Redshift, there isn't a lot of ETL before you can use it. It just works.

--Christopher Gutierrez, Manager of Online Analytics, Airbnb

Architecture

Presto is a distributed system that runs on a cluster of machines. A full installation includes a coordinator and multiple workers. Queries are submitted from a client such as the Presto CLI to the coordinator. The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers.

Presto is a distributed system that runs on a cluster of machines, with one coordinator and multiple workers. A client (such as the Presto CLI) submits a query to the coordinator, which parses, analyzes, and plans the query execution and then distributes the processing to the workers.
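
Once a cluster is running, this topology can be inspected through the built-in system catalog, for example with the CLI described in the Usage section below (a sketch, assuming the default port):

# ./presto --server localhost:8080 --execute "SELECT node_id, http_uri, coordinator FROM system.runtime.nodes"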

Presto supports pluggable connectors that provide data for queries. The requirements vary by connector.

Presto provides pluggable connectors for querying external data sources, with built-in connectors for Hive, Cassandra, Elasticsearch, Kafka, Kudu, MongoDB, MySQL, Redis, and many others.

See https://prestodb.github.io/docs/current/connector.html for the full list of connectors.

II Installation

1 Download

# wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.217/presto-server-0.217.tar.gz
# tar xvf presto-server-0.217.tar.gz
# cd presto-server-0.217

2 Prepare the data directory

Presto needs a data directory for storing logs, etc. We recommend creating a data directory outside of the installation directory, which allows it to be easily preserved when upgrading Presto.
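
For example, to create the directory referenced by the node.data-dir setting shown below:

# mkdir -p /var/presto/data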

3 Prepare the configuration directory

Create an etc directory inside the installation directory. This will hold the following configuration:

  • Node Properties: environmental configuration specific to each node
  • JVM Config: command line options for the Java Virtual Machine
  • Config Properties: configuration for the Presto server
  • Catalog Properties: configuration for Connectors (data sources)
# mkdir etc

# cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

# cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

On the coordinator:

# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080

On each worker:

# cat etc/config.properties
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://example.net:8080

# cat etc/log.properties
com.facebook.presto=INFO

Notes:
1) The config.properties for the coordinator and for the workers differ; most importantly, the discovery service is enabled on the coordinator.

discovery-server.enabled: Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance will register itself with the Discovery service on startup. In order to simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the Discovery service. It shares the HTTP server with Presto and thus uses the same port.

2) If the coordinator and the workers run on different machines, set

node-scheduler.include-coordinator=false

If the coordinator and a worker run on the same machine, set

node-scheduler.include-coordinator=true

node-scheduler.include-coordinator: Allow scheduling work on the coordinator. For larger clusters, processing work on the coordinator can impact query performance because the machine’s resources are not available for the critical task of scheduling, managing and monitoring query execution.
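
For a single-machine test deployment where the coordinator also performs work, a minimal config.properties along the following lines can be used; the memory values are illustrative and follow the official deployment example:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080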

4 Configure connectors

Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog. For example, the Hive connector maps each Hive database to a schema, so if the Hive connector is mounted as the hive catalog, and Hive contains a table clicks in database web, that table would be accessed in Presto as hive.web.clicks.

Take Hive as an example:

# mkdir etc/catalog

# cat etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
#hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
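
With this catalog mounted as hive, tables are then addressed as catalog.schema.table; for instance, for the hypothetical clicks table in Hive database web mentioned in the quote above:

SELECT count(*) FROM hive.web.clicks;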

See https://prestodb.github.io/docs/current/connector.html for the configuration details of each connector.

5 Start the server

First, start it in the foreground to verify the configuration:

# bin/launcher run --verbose

Presto requires Java 8u151+ (found 1.8.0_141)

This error means that JDK 8u151 or later is required.
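
The JDK in use can be checked first; it should report 1.8.0_151 or later:

# java -version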

Once it starts cleanly, run it in the background:

# bin/launcher start
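
The launcher also provides sub-commands for day-to-day operation, for example:

# bin/launcher status
# bin/launcher stop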

III Usage

1 Download the CLI

# wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.217/presto-cli-0.217-executable.jar
# mv presto-cli-0.217-executable.jar presto
# chmod +x presto
# ./presto --server localhost:8080 --catalog hive --schema default
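
Once connected, standard SQL can be issued interactively; a short sketch (the web schema and clicks table are hypothetical):

presto:default> SHOW CATALOGS;
presto:default> SHOW TABLES FROM hive.web;
presto:default> SELECT * FROM hive.web.clicks LIMIT 10;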

2 JDBC

# wget https://repo1.maven.org/maven2/com/facebook/presto/presto-jdbc/0.217/presto-jdbc-0.217.jar
# export HIVE_AUX_JARS_PATH=/path/to/presto-jdbc-0.217.jar
# beeline -u jdbc:presto://example.net:8080/hive/sales
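
A query can also be run non-interactively; a sketch assuming the same cluster, where -n supplies a user name (the Presto JDBC driver may require one, depending on your setup) and -e runs a single statement:

# beeline -u jdbc:presto://example.net:8080/hive/sales -n presto_user -e "SELECT 1"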

Reference: https://prestodb.github.io/overview.html

Original article: https://www.cnblogs.com/barneywill/p/10478959.html
