The Splunk indexing process


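Event: Events are records of activity in log files, stored in Splunk indexes. Put simply, each line of a processed log or call detail record is one event.
Source type: identifies the format of the data. Put simply, each specific log format can be defined as its own source type. Splunk ships with more than 500 predefined types for well-known data formats, including Apache logs, logs from common operating systems, and logs from Cisco and other network devices.
Index: The index is the repository for Splunk Enterprise data. Splunk transforms incoming data into events, which it stores in indexes. The term has two meanings: it names the physical store for the data, and it also names a processing action, as in "Splunk indexes your data". That process produces two kinds of data:
· The raw data in compressed form (rawdata)
· Indexes that point to the raw data, plus some metadata files (index files)
Indexer: An indexer is a Splunk Enterprise instance that indexes data. The everyday term "index" is also used for this specific Splunk component, the Indexer, which is one kind of Splunk Enterprise instance.
Bucket: The two kinds of data stored in an index are organized by age into separate directories, called buckets.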


Search Head: the front end that handles searches.
Deployment Server: acts as a configuration management center that manages the other nodes centrally.

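Forwarder: collects data, preprocesses it, and forwards it to the indexers (consume data and forward it on to indexers). Forwarders and indexers together form a mechanism similar to Flume's Agent and Collector. Forwarder functions include: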
· Tagging of metadata (source, sourcetype, and host)
· Configurable buffering
· Data compression
· SSL security
· Use of any available network ports
· Running scripted inputs locally



Indexer: turns incoming data into indexed form. This is the indexing process, also called event processing. It includes:
· Separating the datastream into individual, searchable events. (line breaking)
· Creating or identifying timestamps. (timestamp recognition)
· Extracting fields such as host, source, and sourcetype. (processing of the common metadata fields)
· Performing user-defined actions on the incoming data, such as identifying custom fields, masking sensitive data, writing new or modified keys, applying breaking rules for multi-line events, filtering unwanted events, and routing events to specified indexes or servers.
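
The user-defined actions in the last bullet (filtering unwanted events and routing events to specific indexes) can be illustrated with a small sketch. This is not Splunk configuration syntax: the rule patterns, the index names `main` and `errors`, and the function `route_events` are all hypothetical and only show the idea.

```python
import re
from collections import defaultdict

# Hypothetical rules for illustration only (not Splunk configuration).
DROP_PATTERN = re.compile(r"\bDEBUG\b")          # events to filter out
ROUTES = [(re.compile(r"\bERROR\b"), "errors")]  # (pattern, target index)
DEFAULT_INDEX = "main"


def route_events(events):
    """Drop unwanted events and group the rest by target index name."""
    indexes = defaultdict(list)
    for event in events:
        if DROP_PATTERN.search(event):
            continue  # filter unwanted events
        target = DEFAULT_INDEX
        for pattern, index_name in ROUTES:
            if pattern.search(event):
                target = index_name  # first matching rule wins
                break
        indexes[target].append(event)
    return dict(indexes)


events = [
    "2024-01-01 10:00:00 INFO service started",
    "2024-01-01 10:00:01 DEBUG cache warm",
    "2024-01-01 10:00:02 ERROR disk full",
]
routed = route_events(events)
print(routed)
```

The DEBUG event is dropped, the ERROR event lands in the `errors` group, and everything else falls through to the default group.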

Parts of an indexer cluster (distributed deployment)

An indexer cluster is a group of Splunk Enterprise instances, or nodes, that, working in concert, provide a redundant indexing and searching capability. Each cluster has three types of nodes:

  • A single master node to manage the cluster.
  • Several to many peer nodes to index and maintain multiple copies of the data and to search the data.
  • One or more search heads to coordinate searches across the set of peer nodes.

The master node manages the cluster. It coordinates the replicating activities of the peer nodes and tells the search head where to find data. It also helps manage the configuration of peer nodes and orchestrates remedial activities if a peer goes down.

The peer nodes receive and index incoming data, just like non-clustered, stand-alone indexers. Unlike stand-alone indexers, however, peer nodes also replicate data from other nodes in the cluster. A peer node can index its own incoming data while simultaneously storing copies of data from other nodes. You must have at least as many peer nodes as the replication factor. That is, to support a replication factor of 3, you need a minimum of three peer nodes.
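
The replication-factor constraint can be made concrete with a short sketch. It assumes a simple ring order for choosing target peers; how a real cluster picks targets is decided by the master node and is not described here, so `pick_replication_targets` is purely a hypothetical helper.

```python
def pick_replication_targets(peers, source, replication_factor):
    """Return the peers that receive copies of data indexed on `source`.

    The source peer holds one copy itself, so replication_factor - 1
    other peers are chosen. The ring order is an assumption made for
    illustration, not the cluster's actual selection policy.
    """
    if len(peers) < replication_factor:
        raise ValueError(
            f"replication factor {replication_factor} needs at least "
            f"{replication_factor} peers, got {len(peers)}"
        )
    start = peers.index(source)
    ring = peers[start + 1:] + peers[:start]  # every peer except the source
    return ring[:replication_factor - 1]


targets = pick_replication_targets(["peer1", "peer2", "peer3"], "peer2", 3)
print(targets)
```

With three peers and a replication factor of 3, every peer ends up holding a copy; with only two peers the function raises `ValueError`, mirroring the minimum-peer rule stated above.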

The search head runs searches across the set of peer nodes. You must use a search head to manage searches across indexer clusters. In other words, it sends the search request to the indexer nodes and then merges the results they return.
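
A minimal scatter-gather sketch of this idea follows. It assumes each peer holds a list of `(timestamp, text)` events and that results are wanted newest first; the functions `search_peer` and `search_head` are hypothetical names, and real distributed search involves far more than this.

```python
import heapq


def search_peer(peer_events, keyword):
    """Run the search on one peer: matching (timestamp, text) pairs, newest first."""
    hits = [event for event in peer_events if keyword in event[1]]
    return sorted(hits, reverse=True)


def search_head(peers, keyword):
    """Send the search to every peer, then merge the sorted partial results."""
    partial_results = [search_peer(events, keyword) for events in peers.values()]
    # heapq.merge combines already-sorted inputs without re-sorting everything.
    return list(heapq.merge(*partial_results, reverse=True))


peers = {
    "peer1": [
        ("2024-01-01T10:00:00", "ERROR disk full"),
        ("2024-01-01T10:00:05", "INFO ok"),
    ],
    "peer2": [("2024-01-01T10:00:03", "ERROR timeout")],
}
results = search_head(peers, "ERROR")
print(results)
```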

For most purposes, it is recommended that you use forwarders to get data into the cluster.

Here is a diagram of a basic, single-site indexer cluster, containing three peer nodes and supporting a replication factor of 3:

This diagram shows a simple deployment, similar to a small-scale non-clustered deployment, with some forwarders sending load-balanced data to a group of indexers (peer nodes), and the indexers sending search results to a search head. There are two additions that you don't find in a non-clustered deployment:

  • The indexers are streaming copies of their data to other indexers.
  • The master node, while it doesn't participate in any data streaming, coordinates a range of activities involving the search peers and the search head.

How indexing works

Splunk Enterprise can index any type of time-series data (data with timestamps). When Splunk Enterprise indexes data, it breaks it into events, based on the timestamps.

Event processing

Event processing occurs in two stages, parsing and indexing. All data that comes into Splunk Enterprise enters through the parsing pipeline as large (10,000 bytes) chunks. During parsing, Splunk Enterprise breaks these chunks into events which it hands off to the indexing pipeline, where final processing occurs.
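
As a rough illustration of chunked input being broken into events, the sketch below cuts a byte stream into fixed-size chunks and then reassembles newline-terminated events, carrying a partial line over from one chunk to the next. Treating every newline as an event boundary is a simplifying assumption; the helper names `iter_chunks` and `break_into_events` are hypothetical.

```python
def iter_chunks(data: bytes, chunk_size: int = 10_000):
    """Yield the input in fixed-size chunks, as it might arrive at the parser."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def break_into_events(chunks):
    """Split a stream of byte chunks into newline-terminated events."""
    pending = b""  # partial line carried over from the previous chunk
    for chunk in chunks:
        pending += chunk
        *complete, pending = pending.split(b"\n")
        for line in complete:
            if line:
                yield line.decode("utf-8")
    if pending:  # last event without a trailing newline
        yield pending.decode("utf-8")


data = b"event one\nevent two\nevent three\n"
# A tiny chunk size forces events to straddle chunk boundaries.
parsed = list(break_into_events(iter_chunks(data, chunk_size=7)))
print(parsed)
```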

While parsing, Splunk Enterprise performs a number of actions, including:

  • Extracting a set of default fields for each event, including host, source, and sourcetype.
  • Configuring character set encoding.
  • Identifying line termination using linebreaking rules. While many events are short and only take up a line or two, others can be long.
  • Identifying timestamps or creating them if they don't exist. At the same time that it processes timestamps, Splunk identifies event boundaries.
  • Splunk can be set up to mask sensitive event data (such as credit card or social security numbers) at this stage. It can also be configured to apply custom metadata to incoming events.
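
Several of these parsing steps can be sketched together: attaching default fields, identifying a timestamp or creating one when none is found, and masking sensitive data. The timestamp format, the card-number pattern, and the function `parse_event` are assumptions made for illustration; Splunk's own timestamp recognition and masking are configured differently and handle many more formats.

```python
import re
from datetime import datetime

# Assumed formats for this sketch only.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
CARD_RE = re.compile(r"\b(?:\d{4}-){3}(\d{4})\b")


def parse_event(raw, host, source, sourcetype, now=None):
    """Attach default fields, find or create a timestamp, mask card numbers."""
    match = TIMESTAMP_RE.search(raw)
    if match:
        timestamp = datetime.strptime(match.group(), "%Y-%m-%d %H:%M:%S")
    else:
        timestamp = now or datetime.now()  # create a timestamp if none exists
    masked = CARD_RE.sub(r"XXXX-XXXX-XXXX-\1", raw)  # keep only the last 4 digits
    return {
        "_time": timestamp,
        "host": host,
        "source": source,
        "sourcetype": sourcetype,
        "_raw": masked,
    }


event = parse_event(
    "2024-03-05 12:30:00 payment card=1234-5678-9012-3456 ok",
    host="web01",
    source="/var/log/pay.log",
    sourcetype="payment_log",
)
print(event["_time"], event["_raw"])
```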

In the indexing pipeline, Splunk Enterprise performs additional processing, including:

  • Breaking all events into segments that can then be searched upon. You can determine the level of segmentation, which affects indexing and searching speed, search capability, and efficiency of disk compression.
  • Building the index data structures.
  • Writing the raw data and index files to disk, where post-indexing compression occurs.
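
To show why segmentation makes events searchable, here is a toy inverted index. It assumes the simplest possible segmentation (lowercased runs of letters and digits), which is much cruder than Splunk's configurable segmentation; `segment`, `build_index`, and `search` are hypothetical helpers.

```python
import re
from collections import defaultdict


def segment(event):
    """Break an event into lowercase alphanumeric segments (an assumed rule)."""
    return {token.lower() for token in re.split(r"[^A-Za-z0-9]+", event) if token}


def build_index(events):
    """Map each segment to the positions of the events that contain it."""
    index = defaultdict(set)
    for position, event in enumerate(events):
        for token in segment(event):
            index[token].add(position)
    return index


def search(index, events, *terms):
    """Return the events that contain every search term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    hits = set.intersection(*postings) if postings else set()
    return [events[i] for i in sorted(hits)]


events = [
    "ERROR disk full on web01",
    "INFO login ok on web01",
    "ERROR timeout on db02",
]
index = build_index(events)
print(search(index, events, "error", "web01"))
```

The raw events play the role of the rawdata, and the `index` dictionary plays the role of the index files that point back into it.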

The breakdown between parsing and indexing pipelines is of relevance mainly when deploying forwarders. Heavy forwarders can parse data and then forward the parsed data on to indexers for final indexing. Some source types - those that reference structured data - require configuration on the forwarder prior to indexing. See "Extract data from files with headers".

For more information about events and what happens to them during the indexing process, see the chapter "Configure event processing" in the Getting Data In Manual.

Note: Indexing is an I/O-intensive process.

This diagram shows the main processes inherent in indexing:

Note: This diagram represents a simplified view of the indexing architecture. It provides a functional view of the architecture and does not fully describe Splunk Enterprise internals. In particular, the parsing pipeline actually consists of three pipelines: parsing, merging, and typing, which together handle the parsing function. The distinction can matter during troubleshooting, but does not generally affect how you configure or deploy Splunk Enterprise.

How indexer acknowledgment works

In brief, indexer acknowledgment works like this: The forwarder sends data continuously to the receiving peer, in blocks of approximately 64kB. The forwarder maintains a copy of each block in memory until it gets an acknowledgment from the peer. While waiting, it continues to send more data blocks.

If all goes well, the receiving peer:

1. receives the block of data, parses and indexes it, and writes the data (raw data and index data) to the file system.

2. streams copies of the raw data to each of its target peers.

3. sends an acknowledgment back to the forwarder.

The acknowledgment assures the forwarder that the data was successfully written to the cluster. Upon receiving the acknowledgment, the forwarder releases the block from memory.

If the forwarder does not receive the acknowledgment, that means there was a failure along the way. Either the receiving peer went down or that peer was unable to contact its set of target peers. The forwarder then automatically resends the block of data. If the forwarder is using load-balancing, it sends the block to another receiving node in the load-balanced group. If the forwarder is not set up for load-balancing, it attempts to resend data to the same node as before.
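
The acknowledgment flow described above can be modeled in a few lines. This is a simplified, synchronous sketch: a real forwarder keeps sending further blocks while it waits for acknowledgments, and replication to target peers is left out. The `Peer` and `Forwarder` classes and their methods are hypothetical, not part of any Splunk API.

```python
class Peer:
    """A receiving peer that stores blocks and acknowledges them."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.stored = []

    def receive(self, block):
        """Store the block; return True as the acknowledgment, False on failure."""
        if not self.healthy:
            return False
        self.stored.append(block)
        return True


class Forwarder:
    """Keeps each block in memory until some peer acknowledges it."""

    def __init__(self, peers):
        self.peers = peers   # load-balanced group of receiving peers
        self.unacked = {}    # block id -> block awaiting acknowledgment
        self.next_id = 0

    def send(self, block):
        block_id = self.next_id
        self.next_id += 1
        self.unacked[block_id] = block
        for peer in self.peers:  # on failure, try the next peer in the group
            if peer.receive(block):
                del self.unacked[block_id]  # acknowledged: release from memory
                return peer.name
        return None  # no acknowledgment: block stays in memory for a retry


down, up = Peer("peer1", healthy=False), Peer("peer2")
forwarder = Forwarder([down, up])
acked_by = forwarder.send(b"block-0")
print(acked_by, forwarder.unacked)
```

When no peer acknowledges, the block remains in `unacked`, which is what lets the forwarder resend it later.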

Important: To ensure end-to-end data fidelity, you must explicitly enable indexer acknowledgment for each forwarder that's sending data to the cluster, as described earlier in this topic. If end-to-end data fidelity is not a requirement for your deployment, you can skip this step.

For more information on how indexer acknowledgment works, read "Protect against loss of in-flight data" in the Forwarding Data manual.

Date: 2024-10-09 15:54:31
