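In most deployments MongoDB serves as the data storage component, and as a database it generally should not take on more than that.
From a specialization standpoint, handing text search over to a dedicated search engine is often the better choice.
The common search engines usually have ready-made tools for MongoDB, so the two can be combined easily.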
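1. Sphinx and mongodb-sphinx
Sphinx is a full-text search engine written in C++. It integrates very well with MySQL and can import data from MySQL with little effort.
Sphinx does not provide native support for other databases, but it does provide the xmlpipe2 interface: any program that implements this interface can feed data to Sphinx.
For MongoDB, mongodb-sphinx (https://github.com/georgepsarakis/mongodb-sphinx) is one implementation of the xmlpipe2 interface.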
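To make the xmlpipe2 idea concrete, here is a minimal Python sketch of the kind of XML stream such a program writes to standard output for Sphinx to read. It is not the code used by mongodb-sphinx; the function name to_xmlpipe2 and its parameters are hypothetical, and the field and attribute names are assumed to be valid XML element names.

```python
from xml.sax.saxutils import escape, quoteattr


def to_xmlpipe2(docs, text_fields, attributes):
    """Render documents as an xmlpipe2 stream (hypothetical helper).

    docs:        iterable of (doc_id, dict) pairs; doc_id is a positive integer
    text_fields: names of the fields to full-text index
    attributes:  mapping of attribute name -> Sphinx attribute type
    """
    out = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>"]

    # The schema declares which elements are full-text fields
    # and which are attributes, together with their types.
    out.append("<sphinx:schema>")
    for name in text_fields:
        out.append("<sphinx:field name=%s/>" % quoteattr(name))
    for name, attr_type in attributes.items():
        out.append("<sphinx:attr name=%s type=%s/>"
                   % (quoteattr(name), quoteattr(attr_type)))
    out.append("</sphinx:schema>")

    # One <sphinx:document> element per record; values are XML-escaped.
    for doc_id, doc in docs:
        out.append('<sphinx:document id="%d">' % doc_id)
        for name in list(text_fields) + list(attributes):
            value = escape(str(doc.get(name, "")))
            out.append("<%s>%s</%s>" % (name, value, name))
        out.append("</sphinx:document>")

    out.append("</sphinx:docset>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = [(1, {"title": "a < b", "created": 1366045854})]
    print(to_xmlpipe2(sample, ["title"], {"created": "timestamp"}))
```

In a real setup the records would come from a MongoDB cursor instead of the in-memory sample, and the script would be named in the xmlpipe_command setting of a Sphinx source of type xmlpipe2.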
mongodb-sphinx ships with a sample Stack Overflow data set and sample run parameters. Import the sample data into MongoDB, then run the following command to load the data into Sphinx:
./mongodb-sphinx.py -d stackoverflow -c posts --text-fields profile_image link --attributes last_activity_date _id --attribute-types timestamp string --timestamp-from=1366045854 --id-field=post_id
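Commonly used parameters include:
-d: the database; -c: the collection; -H: the MongoDB host address; -p: the MongoDB port
-f: the start timestamp; -u: the end timestamp; -t: the fields to build the full-text index on
-a: attributes, which are not full-text indexed; --attribute-types: the type of each attribute given in -a, such as string, timestamp, or integer
--id-field: the field used as the document ID; --threads: the number of threads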
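One important caveat: mongodb-sphinx assumes that _id in the MongoDB data is an ObjectID, that is, an ID that carries time information. If you use your own ID scheme, the time-based filtering will not work correctly and you will need to modify the code yourself.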
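The "time information" is the first 4 bytes of an ObjectID, which store the creation time as big-endian seconds since the Unix epoch. The sketch below shows how a timestamp can be read from, or turned into, an ObjectID hex string without any driver. The helper names are hypothetical, and whether mongodb-sphinx filters in exactly this way is an assumption; the point is only to show why a custom _id breaks time-based selection.

```python
def objectid_timestamp(oid_hex):
    """Return the creation time (Unix seconds) embedded in an ObjectID.

    oid_hex is the 24-character hex form of the ObjectID.
    """
    if len(oid_hex) != 24:
        raise ValueError("an ObjectID is 12 bytes, i.e. 24 hex characters")
    # The leading 4 bytes (8 hex characters) are the big-endian timestamp.
    return int(oid_hex[:8], 16)


def objectid_floor(timestamp):
    """Return the smallest ObjectID hex string for the given Unix time.

    Comparing _id against this value is equivalent to comparing
    creation times, which only works when _id really is an ObjectID.
    """
    return "%08x" % timestamp + "0" * 16


if __name__ == "__main__":
    boundary = objectid_floor(1366045854)
    print(boundary, objectid_timestamp(boundary))
```

With a custom _id such as an integer or a string key, neither function is meaningful, so a separate timestamp field would have to be used for the range selection instead.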
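2. Elasticsearch and Mongo-Connector
In Elasticsearch 2.0 and earlier, the tool commonly used to connect MongoDB data to Elasticsearch was mongodb-river.
From Elasticsearch 5 onward, however, plugins can no longer be installed the way they were in earlier versions, so the mongodb-river tutorials found online no longer work.
In addition, mongodb-river has not been updated for several years, so its support for Elasticsearch 5 may lag behind that of other tools.
MongoDB provides a similar tool, Mongo-Connector (https://github.com/mongodb-labs/mongo-connector).
Installation is simple: pip install mongo-connector
Mongo-Connector supports several different search engines. For Elasticsearch it supports multiple versions, including 1.x, 2.x, and 5.x; you only need to install the corresponding doc manager.
Alternatively, install both in one step with pip install 'mongo-connector[elastic5]'.
Before using it, switch MongoDB to replica set mode; only then does MongoDB record an oplog.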
$ mongod --replSet singleNodeRepl
$ mongo
> rs.initiate()
# MongoDB is now running on port 27017
Next, write a configuration file, for example to set the password:
{"authentication": {"password": "XXX"}}
The project ships with a sample configuration file:
{
    "__comment__": "Configuration options starting with '__' are disabled",
    "__comment__": "To enable them, remove the preceding '__'",
    "mainAddress": "localhost:27017",
    "oplogFile": "/var/log/mongo-connector/oplog.timestamp",
    "noDump": false,
    "batchSize": -1,
    "verbosity": 0,
    "continueOnError": false,
    "logging": {
        "type": "file",
        "filename": "/var/log/mongo-connector/mongo-connector.log",
        "__format": "%(asctime)s [%(levelname)s] %(name)s:%(lineno)d - %(message)s",
        "__rotationWhen": "D",
        "__rotationInterval": 1,
        "__rotationBackups": 10,
        "__type": "syslog",
        "__host": "localhost:514"
    },
    "authentication": {
        "__adminUsername": "username",
        "__password": "password",
        "__passwordFile": "mongo-connector.pwd"
    },
    "__comment__": "For more information about SSL with MongoDB, please see http://docs.mongodb.org/manual/tutorial/configure-ssl-clients/",
    "__ssl": {
        "__sslCertfile": "Path to certificate to identify the local connection against MongoDB",
        "__sslKeyfile": "Path to the private key for sslCertfile. Not necessary if already included in sslCertfile.",
        "__sslCACerts": "Path to concatenated set of certificate authority certificates to validate the other side of the connection",
        "__sslCertificatePolicy": "Policy for validating SSL certificates provided from the other end of the connection. Possible values are 'required' (require and validate certificates), 'optional' (validate but don't require a certificate), and 'ignored' (ignore certificates)."
    },
    "__fields": ["field1", "field2", "field3"],
    "__namespaces": {
        "excluded.collection": false,
        "excluded_wildcard.*": false,
        "*.exclude_collection_from_every_database": false,
        "included.collection1": true,
        "included.collection2": {},
        "included.collection4": {
            "includeFields": ["included_field", "included.nested.field"]
        },
        "included.collection5": {
            "rename": "included.new_collection5_name",
            "includeFields": ["included_field", "included.nested.field"]
        },
        "included.collection6": {
            "excludeFields": ["excluded_field", "excluded.nested.field"]
        },
        "included.collection7": {
            "rename": "included.new_collection7_name",
            "excludeFields": ["excluded_field", "excluded.nested.field"]
        },
        "included_wildcard1.*": true,
        "included_wildcard2.*": true,
        "renamed.collection1": "something.else1",
        "renamed.collection2": {
            "rename": "something.else2"
        },
        "renamed_wildcard.*": {
            "rename": "new_name.*"
        },
        "gridfs.collection": {
            "gridfs": true
        },
        "gridfs_wildcard.*": {
            "gridfs": true
        }
    },
    "docManagers": [
        {
            "docManager": "elastic_doc_manager",
            "targetURL": "localhost:9200",
            "__bulkSize": 1000,
            "__uniqueKey": "_id",
            "__autoCommitInterval": null
        }
    ]
}
Then run mongo-connector -c config.json to start synchronizing the data.
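If you would rather generate config.json than edit it by hand, a small Python sketch such as the following may help. It uses only keys that appear in the sample configuration (mainAddress, oplogFile, authentication, docManagers); the helper names build_config and write_config are hypothetical, the default values are copied from the sample, and the docManager name is an assumption that should be adjusted to match whichever doc manager you actually installed.

```python
import json


def build_config(mongo_address="localhost:27017",
                 es_url="localhost:9200",
                 oplog_file="/var/log/mongo-connector/oplog.timestamp",
                 doc_manager="elastic_doc_manager",
                 password=None):
    """Build a minimal mongo-connector configuration as a dict.

    Defaults mirror the sample file; doc_manager is an assumption and
    must match the doc manager package installed on your system.
    """
    config = {
        "mainAddress": mongo_address,
        "oplogFile": oplog_file,
        "docManagers": [
            {"docManager": doc_manager, "targetURL": es_url}
        ],
    }
    # Only emit the authentication section when a password is supplied.
    if password is not None:
        config["authentication"] = {"password": password}
    return config


def write_config(path, config):
    """Write the configuration dict to path as indented JSON."""
    with open(path, "w") as fh:
        json.dump(config, fh, indent=4)


if __name__ == "__main__":
    write_config("config.json", build_config(password="XXX"))
```

Running the script writes config.json to the current directory, which can then be passed to mongo-connector with the -c option as described above.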