Background:
1. System overview: analysts manually read each research report and enter its classification, abstract, and other metadata; the system then uses that metadata to look up the URI of the report.
2. Current implementation: the old system stores the abstracts and other metadata in MSSQL and splits the different keywords into separate columns to support search.
3. Problems:
- Queries are cumbersome and rigid: for example, to find reports from a given institution whose title contains "周报" (weekly report), the user has to tick the corresponding fields and then fill in each condition separately.
- Queries are slow: with close to ten million records, response time is 4-5 seconds.
4. Improvement: move search to Elasticsearch and support fuzzy multi-keyword queries across several fields (the fields are short texts rather than long documents, so relevance scoring via _score is not used).
- For example, typing "国泰君安 周报" returns all weekly reports from 国泰君安.
1. Create the index
Text fields use the IK Chinese analyzer (ik_max_word). Field meanings: title = report title, author = author, institution = institution, industry = industry, grade = rating, doc_type = report category, time = publish time, doc_uri = report URL (not indexed), doc_size = file size (not indexed), market = market code.

curl -X PUT 'localhost:9200/src_test_1' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc_test": {
      "properties": {
        "title":       { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_max_word" },
        "author":      { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_max_word" },
        "institution": { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_max_word" },
        "industry":    { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_max_word" },
        "grade":       { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_max_word" },
        "doc_type":    { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_max_word" },
        "time":        { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis" },
        "doc_uri":     { "type": "text", "index": false },
        "doc_size":    { "type": "integer", "index": false },
        "market":      { "type": "byte" }
      }
    }
  }
}'
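The same index can also be created from Python instead of curl. The sketch below is a minimal, hedged equivalent: it assumes an Elasticsearch 5.x/6.x cluster on localhost:9200 (where mapping types such as doc_test are still supported) and the official elasticsearch-py client; it adds an existence check and prints the stored mapping so the result can be verified.

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes ES is reachable at localhost:9200

# Shared definition for the ik_max_word text fields in the mapping above
text_field = {"type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_max_word"}

index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "doc_test": {  # mapping type, as in the curl command (pre-7.x style)
            "properties": {
                "title": text_field, "author": text_field, "institution": text_field,
                "industry": text_field, "grade": text_field, "doc_type": text_field,
                "time": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"},
                "doc_uri": {"type": "text", "index": False},
                "doc_size": {"type": "integer", "index": False},
                "market": {"type": "byte"}
            }
        }
    }
}

# Create the index only if it does not exist yet, then print the stored mapping to verify it
if not es.indices.exists(index="src_test_1"):
    es.indices.create(index="src_test_1", body=index_body)
print(es.indices.get_mapping(index="src_test_1"))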
2. Import the data (CSV, in batches)
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()

data_will_insert = []
x = 1

# Read the CSV with pandas; if the text is garbled, add encoding="ISO-8859-1"
src_data = pd.read_csv('ResearchReportEx.csv')

for index, i in src_data.iterrows():
    x += 1
    # Flush a batch of roughly 100000 documents at a time
    if x % 100000 == 99999:
        # Bulk insert into Elasticsearch
        success, _ = bulk(es, data_will_insert, index='src_test_1', raise_on_error=True)
        print('Performed %d actions' % success)
        data_will_insert = []

    # Map the exchange type to a market code
    if i['ExchangeType'] == 'CN':
        market = 0
    elif i['ExchangeType'] == 'HK':
        market = 1
    elif i['ExchangeType'] == 'World':
        market = 2
    else:
        market = 99

    data_will_insert.append({
        "_index": 'src_test_1',
        "_type": 'doc_test',
        '_source': {
            'title': i['Title'],
            'author': i['AuthorName'],
            'time': i['CreateTime'] + ':00',
            'institution': i['InstituteNameCN'],
            'doc_type': i['KindName'] if pd.isna(i['Kind2Name']) else i['KindName'] + '|%s' % i['Kind2Name'],
            'industry': '' if pd.isna(i['IndustryName']) else i['IndustryName'],
            'grade': '' if pd.isna(i['GradeName']) else i['GradeName'],
            'doc_uri': i['FileURL'],
            'doc_size': i['Size'],
            'market': market
        }
    })

# Insert whatever is left over in the list
if len(data_will_insert) > 0:
    success, _ = bulk(es, data_will_insert, index='src_test_1', raise_on_error=True)
    print('Performed %d actions' % success)
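A quick sanity check after the import helps catch silent bulk failures. This is a small sketch under the same assumptions as above (local cluster, index src_test_1, type doc_test): refresh the index so the freshly bulk-indexed documents become searchable, then compare the document count with the number of rows in the CSV.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Newly bulk-indexed documents only become visible to search after a refresh
es.indices.refresh(index='src_test_1')

# The count should match the number of rows read from ResearchReportEx.csv
print(es.count(index='src_test_1', doc_type='doc_test')['count'])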
3. Query
import time
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

# Elasticsearch connection
es = Elasticsearch()

# Decorator that reports a function's run time
def cal_run_time(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = func(*args, **kwargs)
        end_time = time.time()
        print(str(func) + '---run time--- %s' % str(end_time - start_time))
        return res
    return wrapper

@cal_run_time
def query_in_es():
    body = {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": "国泰 报告",
                            "type": "cross_fields",  # match across fields
                            "fields": ["title", "institution", "grade", "doc_type", "author", "industry"],  # search these 6 fields
                            "operator": "and"
                            # i.e. every keyword in the query must appear somewhere in the combined fields
                        }
                    },
                    {
                        "range": {
                            "time": {
                                "gte": '2018-02-01'  # default lower bound on publish time
                            }
                        }
                    }
                ]
            }
        }
    }

    # Scroll through all documents matching the query
    scanResp = scan(es, body, scroll="10m", index="src_test_1", doc_type="doc_test", timeout="10m")

    row_num = 0
    for resp in scanResp:
        print(resp['_source'])
        row_num += 1
    print(row_num)

query_in_es()
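For the interactive case from the background, where the user types something like "国泰君安 周报" into a single search box, the scroll-based scan above can be replaced by a plain search that returns only the first page of hits. The helper below is a sketch rather than the system's actual API: the function name search_reports and the default page size of 20 are assumptions, while the query body, index name, and field list are the same as above.

from elasticsearch import Elasticsearch

es = Elasticsearch()

def search_reports(keywords, gte_time='2018-02-01', size=20):
    # Every space-separated keyword must appear somewhere across the six text fields
    body = {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": keywords,
                            "type": "cross_fields",
                            "fields": ["title", "institution", "grade", "doc_type", "author", "industry"],
                            "operator": "and"
                        }
                    },
                    {"range": {"time": {"gte": gte_time}}}
                ]
            }
        }
    }
    resp = es.search(index="src_test_1", doc_type="doc_test", body=body, size=size)
    return [hit["_source"] for hit in resp["hits"]["hits"]]

# Example: the first page of weekly reports from 国泰君安
for doc in search_reports("国泰君安 周报"):
    print(doc["title"], doc["doc_uri"])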
※ Test result: queries are noticeably faster; a multi-keyword query now returns in a few tenths of a second.
Original article: https://www.cnblogs.com/dxf813/p/8447196.html