Cheaper in Bulk

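Just as mget lets us retrieve multiple documents at once, the bulk API lets us make multiple create, index, update, or delete requests in a single request. This is particularly useful if you need to index a data stream such as log events, which can be queued up and indexed in batches of hundreds or thousands.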

The bulk request body has a slightly unusual format:

{ action:{ metadata }}\n
{ request body        }\n
{ action:{ metadata }}\n
{ request body        }\n
...

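This format is like a stream of valid one-line JSON documents joined together by newline ("\n") characters. Two points are important to note:

1: Every line must end with a newline character ("\n"), including the last line. These newlines are the markers used to separate the lines.

2: No line may contain an unescaped newline character, because that would break the line-by-line parsing. This means the JSON must not be pretty-printed.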
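The two rules above can be sketched in Python. This is a minimal illustration, not part of any Elasticsearch client: the helper name to_bulk_body is made up for this example, and it relies only on the standard json module, which escapes newlines inside string values when no indent is given.

```python
import json

def to_bulk_body(lines):
    """Serialize a list of dicts into a newline-delimited bulk body.

    Hypothetical helper for illustration only.
    """
    # json.dumps without `indent` keeps each dict on a single line and
    # escapes any newline inside a string value as \n (rule 2).
    # Every line, including the last one, gets a trailing newline (rule 1).
    return "".join(json.dumps(line) + "\n" for line in lines)

body = to_bulk_body([
    {"delete": {"_index": "website", "_type": "blog", "_id": "123"}},
    {"index": {"_index": "website", "_type": "blog"}},
    {"title": "Line one\nline two"},  # embedded newline gets escaped
])
```

The resulting string can be sent as the request body as-is; the embedded newline in the title survives as an escape sequence rather than splitting the line.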

The section "Why the funny format?" explains why the bulk API uses this format.

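The action/metadata line specifies which action to perform on which document.

The action must be one of index, create, update, or delete. The metadata should specify the _index, _type, and _id of the document to be indexed, created, updated, or deleted.

For example, a delete request looks like this: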

{"delete":{"_index":"website","_type":"blog","_id":"123"}}

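The request body line consists of the document _source itself, that is, the fields and values that the document contains. It is required for index and create operations, which makes sense: you must supply the document to index. It is also required for update operations, where it should contain the same body you would pass to the update API: doc, upsert, script, and so on. The delete action is the only one that takes no request body line.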

{"create":  {"_index":"website","_type":"blog","_id":"123"}}
{"title":    "My first blog post"}
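The rule about which actions need a request body line can be expressed as a small check. The sketch below is an assumption-laden illustration: bulk_lines and the two constant sets are names invented here, and the validation mirrors only what the text above states.

```python
import json

# Actions that must be followed by a request body line, per the text above.
ACTIONS_WITH_BODY = {"index", "create", "update"}
# Actions that must not be followed by a request body line.
ACTIONS_WITHOUT_BODY = {"delete"}

def bulk_lines(action, metadata, body=None):
    """Return the one or two JSON lines for a single bulk subrequest.

    Hypothetical helper for illustration only.
    """
    if action in ACTIONS_WITH_BODY:
        if body is None:
            raise ValueError(f"{action} requires a request body line")
        return [json.dumps({action: metadata}), json.dumps(body)]
    if action in ACTIONS_WITHOUT_BODY:
        if body is not None:
            raise ValueError(f"{action} takes no request body line")
        return [json.dumps({action: metadata})]
    raise ValueError(f"unknown bulk action: {action!r}")

meta = {"_index": "website", "_type": "blog", "_id": "123"}
delete_lines = bulk_lines("delete", meta)
create_lines = bulk_lines("create", meta, {"title": "My first blog post"})
```

Here delete_lines holds a single line, while create_lines holds the action/metadata line followed by the document source.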

If no _id is specified, Elasticsearch generates one automatically:

{"index":{"_index":"website","_type":"blog"}}
{"title":    "My second blog post"}

Putting several operations together, a complete bulk request looks like this:

POST /_bulk
{"delete":{"_index":"website","_type":"blog","_id":"123"}}
{"create":{"_index":"website","_type":"blog","_id":"123"}}
{"title":    "My first blog post"}
{"index":  {"_index":"website","_type":"blog"}}
{"title":    "My second blog post"}
{"update":{"_index":"website","_type":"blog","_id":"123","_retry_on_conflict":3}}
{"doc":{"title":"My updated blog post"}}

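Notice that the delete action on the first line has no request body; it is followed immediately by the next action.

Remember the final newline character after the last line.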

The response body contains an items array that lists the result of each subrequest, in the same order as the requests were sent:

{
   "took":4,
   "errors":false,
   "items":[
      {  "delete":{
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version":2,
            "status":   200,
            "found":    true
      }},
      {  "create":{
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version":3,
            "status":   201
      }},
      {  "create":{
            "_index":   "website",
            "_type":    "blog",
            "_id":      "EiwfApScQiiy7TIKFxRCTw",
            "_version":1,
            "status":   201
      }},
      {  "update":{
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version":4,
            "status":   200
      }}
   ]
}

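The top-level errors flag is false, which means that all subrequests completed successfully.

Each subrequest is executed independently, so the failure of one does not affect the others. If any subrequest fails, the top-level errors flag in the response is set to true, and the error details are reported under the corresponding item: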

POST /_bulk
{"create":{"_index":"website","_type":"blog","_id":"123"}}
{"title":    "Cannot create - it already exists"}
{"index":  {"_index":"website","_type":"blog","_id":"123"}}
{"title":    "But we can update it"}

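In the response to this request, we can see that creating document 123 failed because the document already exists, while the subsequent index request on the same document succeeded: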

{
   "took":3,
   "errors":true,
   "items":[
      {  "create":{
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "status":   409,
            "error":    "DocumentAlreadyExistsException[[website][4][blog][123]: document already exists]"
      }},
      {  "index":{
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version":5,
            "status":   200
      }}
   ]
}

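The top-level errors flag is true, which means that one or more subrequests failed.

The status of 409 on the create item is the HTTP status code for a conflict.

The error field on that item explains why the request failed.

The index item, with a status of 200, shows that the second request succeeded.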

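This also means that bulk requests are not atomic: they cannot be used to implement transactions. Each subrequest is processed separately, so the success or failure of one does not interfere with the others.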
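Because a bulk request can partly succeed, a client has to inspect the items array rather than rely on the HTTP status of the whole request. The following is a minimal sketch assuming the response shape shown above (a top-level errors flag, and an error field on each failed item); failed_items is a hypothetical helper name.

```python
def failed_items(response):
    """Return (position, action, _id, status, error) for each failed subrequest.

    Hypothetical helper; assumes the response shape shown in this section.
    """
    # The top-level flag lets us skip the scan when nothing failed.
    if not response.get("errors"):
        return []
    failures = []
    for position, item in enumerate(response["items"]):
        # Each item is a single-key dict: {action: result}.
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append(
                (position, action, result.get("_id"),
                 result["status"], result["error"])
            )
    return failures

response = {
    "took": 3,
    "errors": True,
    "items": [
        {"create": {"_index": "website", "_type": "blog", "_id": "123",
                    "status": 409,
                    "error": "DocumentAlreadyExistsException"
                             "[[website][4][blog][123]: document already exists]"}},
        {"index": {"_index": "website", "_type": "blog", "_id": "123",
                   "_version": 5, "status": 200}},
    ],
}
failures = failed_items(response)
```

Since items are returned in request order, the position in each tuple identifies which subrequest to retry or report.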

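Don't Repeat Yourself

Perhaps you are batch-indexing logging data into the same index, with the same type. Having to specify the same metadata for every document is wasteful. Instead, just as with the mget API, the bulk request accepts a default /_index or /_index/_type in the URL: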

POST /website/_bulk
{"index":{"_type":"log"}}
{"event":"User logged in"}

You can still override the _index and _type in the metadata line; the values in the URL are used only as defaults:

POST /website/log/_bulk
{"index":{}}
{"event":"User logged in"}
{"index":{"_type":"blog"}}
{"title":"Overriding the default type"}
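The way URL defaults and per-line metadata combine can be modeled on the client side. This is only a rough model of the behavior described above, not Elasticsearch's implementation; effective_metadata is a name invented for this sketch.

```python
def effective_metadata(url_path, metadata):
    """Merge defaults taken from a bulk URL with one metadata line.

    Hypothetical helper: a client-side model of the defaulting rule.
    """
    parts = [p for p in url_path.split("/") if p]
    if not parts or parts[-1] != "_bulk":
        raise ValueError("not a bulk endpoint: " + url_path)
    prefix = parts[:-1]  # whatever precedes /_bulk
    defaults = {}
    if len(prefix) >= 1:
        defaults["_index"] = prefix[0]
    if len(prefix) >= 2:
        defaults["_type"] = prefix[1]
    # Values on the metadata line win over the URL defaults.
    return {**defaults, **metadata}

first = effective_metadata("/website/log/_bulk", {})
second = effective_metadata("/website/log/_bulk", {"_type": "blog"})
```

With the request above, first resolves to index website and type log, while second keeps the index from the URL but takes the type blog from its own metadata line.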

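How Big Is Too Big?

The node that receives the request has to load the entire bulk request into memory, so the bigger the request, the less memory is left for other requests. There is an optimal bulk request size. Above that size, performance no longer improves and may even degrade.

The optimal size is not a fixed number. It depends on your hardware, on the size and complexity of your documents, and on your indexing and search load. Fortunately, this sweet spot is easy to find:

Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents, or fewer if your documents are very large.

It is also useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start with is around 5-15MB.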
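The search for a good batch size can be sketched as a simple timing loop. Everything here is an assumption for illustration: measure_batch_sizes is a made-up helper, and send stands for whatever callable in your own code posts one batch to the bulk endpoint.

```python
import time

def measure_batch_sizes(docs, send, sizes=(1000, 2000, 5000)):
    """Return {batch_size: documents per second} for each candidate size.

    `send` is a hypothetical callable that submits one batch of documents.
    """
    rates = {}
    for size in sizes:
        batch = docs[:size]
        start = time.perf_counter()
        send(batch)
        elapsed = time.perf_counter() - start
        # Guard against a zero reading from a very fast (or fake) sender.
        rates[size] = len(batch) / elapsed if elapsed > 0 else float("inf")
    return rates

# Stand-in sender that only records batch sizes, so the sketch runs offline.
sent = []
rates = measure_batch_sizes(list(range(6000)), lambda batch: sent.append(len(batch)))
```

In a real measurement each size should be run several times against a cluster under typical load, since a single timing is noisy.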
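One way to respect both limits is to cut batches by document count and by byte size at the same time. The sketch below is an illustration under stated assumptions: chunk_actions is a hypothetical helper, its default limits are just the starting values mentioned above, and size is measured as the UTF-8 length of the serialized lines.

```python
import json

def chunk_actions(actions, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield bulk bodies that stay within a document and a byte limit.

    `actions` is an iterable of (action_line, source) pairs, where source
    is None for actions without a request body. Hypothetical helper.
    """
    chunk, count, size = [], 0, 0
    for action_line, source in actions:
        text = json.dumps(action_line) + "\n"
        if source is not None:
            text += json.dumps(source) + "\n"
        nbytes = len(text.encode("utf-8"))
        # Flush before adding if this entry would exceed either limit.
        # A single oversized entry is still emitted, alone in its chunk.
        if chunk and (count + 1 > max_docs or size + nbytes > max_bytes):
            yield "".join(chunk)
            chunk, count, size = [], 0, 0
        chunk.append(text)
        count += 1
        size += nbytes
    if chunk:
        yield "".join(chunk)

actions = [({"index": {}}, {"n": i}) for i in range(5)]
by_count = list(chunk_actions(actions, max_docs=2))
by_bytes = list(chunk_actions(actions, max_bytes=50))
```

Each yielded string is a complete bulk body ending in a newline, so it can be posted to a bulk endpoint that supplies the default index and type.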

Original: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/bulk.html

Date: 2024-08-04 18:29:24