dtSearch Tutorial: Full-Text Database Indexing
A simple index
Start a new C# project, add a reference to the dtSearch library, and put a using directive for the dtSearch.Engine namespace at the start of the source file.
Creating an index under program control with dtSearch is exceptionally simple. All you need is an IndexJob object.
You simply set the properties of the IndexJob object to specify the index you want to create and call one of the Execute methods to build or update the index.
So what do you have to specify to create an index?
First you have to say where you want the index to be created, by setting the IndexPath property.
The example uses the default location that the dtSearch Desktop utility uses for the indexes it creates, but there is no particular reason to prefer it. Notice that you specify the directory in which the files for the index are created.
Next you have to specify the folders and files that you would like to index. This is achieved using the FoldersToIndex string collection. You can add as many folder paths to this collection as you need. For the example we will add just one.
You can add <+> to the end of the path to signify that all of the subfolders should be indexed. If you don't add <+> then just the content of the specified folder is indexed. You can also add include and exclude filters to specify which types of file are to be indexed. For simplicity we will ignore filters.
Finally, we have to set some "Action" properties that indicate how the indexing operation should be performed. The ActionCreate property has to be set to true for the indexing operation to create a new index. If the index already exists then it is overwritten. The ActionAdd property allows new documents to be added to the index. To create a new empty index and add files to it you have to set both to true.
The IndexJob is now set up with minimal configuration and we can start it going. The simplest way to do this is to use the Execute method. This starts the indexing and returns only when the index is complete, with a Boolean to indicate success or failure. So, to complete the program, all we have to add is a call to Execute.
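The complete program is only a few lines long: it creates the IndexJob, sets the index path, adds the folder to index, sets the two action properties and calls Execute.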
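A minimal sketch of such a program follows. It assumes the dtSearch .NET assembly is referenced and exposes the dtSearch.Engine namespace; the class name and both paths are hypothetical placeholders, not the values used in the original example.

```csharp
using System;
using dtSearch.Engine;

class SimpleIndexExample
{
    static void Main()
    {
        // Hypothetical paths: replace with your own index and document folders.
        const string indexPath = @"C:\dtSearchDemo\Index";
        const string docsPath = @"C:\dtSearchDemo\Docs<+>"; // <+> includes subfolders

        IndexJob indexJob = new IndexJob();
        indexJob.IndexPath = indexPath;         // directory that holds the index files
        indexJob.FoldersToIndex.Add(docsPath);  // one folder here; add more as needed
        indexJob.ActionCreate = true;           // create (or overwrite) the index
        indexJob.ActionAdd = true;              // add the documents to it

        bool ok = indexJob.Execute();           // blocks until indexing has finished
        Console.WriteLine(ok ? "Index built." : "Indexing failed.");
    }
}
```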
Execute may be simple but it isn't really of much use.
Do you really want your indexing program to wait unresponsively while the index is constructed?
No, probably not.
In most cases the construction of an index takes more time than you can afford to have the UI blocked for. The standard solution is to run the long blocking process on another thread, and dtSearch makes this very easy for you.
Instead of calling Execute, all you have to do is call ExecuteInThread; the call returns immediately and the indexing proceeds on another thread. You can monitor and control the progress of the indexing using IsThreadDone, AbortThread and so on.
Implementing a full indexing application using these facilities is fairly easy, since everything works as you would expect, but to keep the example simple we will avoid the slight complication of making the indexing asynchronous. In this case it doesn't matter too much because the index is small and is completed in a few minutes or less.
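As a rough sketch of the threaded approach, the loop below starts the job with ExecuteInThread and polls until it finishes, asking it to stop if a time limit passes. It assumes IsThreadDone and AbortThread are parameterless methods on IndexJob, which should be checked against the API reference for your dtSearch version; the class name, paths, time limit and polling interval are all hypothetical.

```csharp
using System;
using System.Threading;
using dtSearch.Engine;

class ThreadedIndexExample
{
    static void Main()
    {
        IndexJob indexJob = new IndexJob();
        indexJob.IndexPath = @"C:\dtSearchDemo\Index";            // hypothetical path
        indexJob.FoldersToIndex.Add(@"C:\dtSearchDemo\Docs<+>");  // hypothetical path
        indexJob.ActionCreate = true;
        indexJob.ActionAdd = true;

        indexJob.ExecuteInThread();   // returns at once; indexing runs on another thread

        DateTime deadline = DateTime.UtcNow.AddMinutes(10);  // arbitrary time limit
        bool abortRequested = false;

        // Assumption: IsThreadDone() reports whether the background job has ended.
        while (!indexJob.IsThreadDone())
        {
            if (!abortRequested && DateTime.UtcNow > deadline)
            {
                indexJob.AbortThread();   // ask the job to stop, then keep polling
                abortRequested = true;
            }
            Thread.Sleep(250);            // a UI program would use a timer instead
        }

        Console.WriteLine(abortRequested ? "Indexing aborted." : "Indexing finished.");
    }
}
```

In a forms application the same check would normally live in a timer callback, so that the UI thread is never put to sleep.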
Other data sources
One of the nice things about dtSearch is that it tends to implement facilities in ways that are simple, direct and probably the way you would choose to do it as well. Of course this means that you don't get the chance to use a lot of new jargon but you also get the program completed quicker.
Rather than implementing lots of different interfaces to work with standard data exchange protocols, dtSearch simply provides a DataSource class. A class derived from it can use any protocol you care to name internally to retrieve the data, and then presents it to the indexing engine in a simple and uniform way.
Now in all probability you are already an expert on ADO, LINQ or RSS and so I'm not going to go over any of these technologies. What I am going to concentrate on is how the DataSource class is used to feed the data to the indexing engine.
Let's get started.
Creating a custom DataSource
The basic idea is very simple - you have to create a class that inherits from DataSource. You have to override a few of the DataSource methods to provide the data to the search engine.
You can provide the data to the search engine either via DocText, DocStream, DocBytes or DocIsFile. The difference is that DocText is a simple string and the other three provide binary data that is treated as if it was a file of a specified format.
There are only two methods you have to implement - GetNextDoc and Rewind.
The GetNextDoc method has to get the next "document", be it a row in a database table or a file downloaded by any means you want to use, and present it to the indexing engine via one of the properties listed above. It simply returns true or false to indicate success or failure.
The Rewind method simply resets the document sequence so that the next GetNextDoc returns the first document in the sequence. It too returns true or false to indicate success or failure.
There are some other properties that you have to set to make everything work well but these are the basic core set. Let's see how it all works.
Rather than write an example that uses ADO, LINQ or some other data protocol it is simpler to read some files from disk. It shows how everything works and you can modify it to work with any other protocol. In fact the index to be constructed is the same as the first example.
First we need to define our own class that inherits from DataSource.
Usually this would be in another file in the project but when experimenting you can include it within the form's source file. Also, to keep things simple, let's not bother writing a constructor and dispense with error checking. This is not the way you would do it in anything other than an example that has been stripped down to the bare minimum.
We need to override two methods, GetNextDoc and Rewind. The Rewind method has to reset the data import, so this is also the place to write the initialization code: it reads the list of file names in the folder and resets the position to the start of the list.
We are using standard .NET I/O classes to work with the file system, so you need to add a using directive for System.IO and declare the two private variables that the class uses: the string array files and a counter for the current position in it.
We now have a list of file names in the string array files. Notice that we really do need to check that this operation worked and return false if it didn't. In a more realistic application the Rewind might well only reset the position in the data and you would probably need to write a separate initialization method to be used internally by the DataSource class.
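One way to make that check, sketched with standard .NET calls only, is a helper that returns false instead of throwing when the folder cannot be listed. The class and method names are hypothetical, and the set of exceptions caught is an assumption about which failures matter.

```csharp
using System;
using System.IO;

static class FileLister
{
    // Fills 'files' with the paths found in 'folder'. Returns false, with an
    // empty array, when the folder is missing, unreadable or badly formed.
    public static bool TryListFiles(string folder, out string[] files)
    {
        try
        {
            files = Directory.GetFiles(folder);
            return true;
        }
        catch (Exception ex) when (ex is IOException
                                   || ex is UnauthorizedAccessException
                                   || ex is ArgumentException)
        {
            files = new string[0];
            return false;
        }
    }
}
```

A Rewind override could call this helper, store the array, reset its position counter to zero and return the helper's result.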
The GetNextDoc method could return the next file in the list in a number of different ways - as a file, as a stream or as an array of bytes. We could even read the file in and extract any text it might contain and present this as a string. In this case let's read the file into a byte array and present this to the indexing engine.
First we should check that we haven't reached the end of the list of files.
As long as there is a file to process we can process it. First we set DocName to the name of the file; notice that DocName is one of the inherited properties.
Next we set the inherited date and time stamp properties.
We also have to set DocIsFile to false to stop the indexing engine reading the file in from disk on its own. Yes, we could get it to do all of the work, but this wouldn't illustrate how to get raw data to it.
As we have decided to handle the data input ourselves we next have to read the data into a byte array. We also have to check that the file actually has some data to read.
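The read-and-check step might look like the following sketch, which uses only System.IO. The helper name ReadFileBytes and the choice to signal an empty file with null are assumptions made for illustration; fileData is the array name used in the text.

```csharp
using System;
using System.IO;

static class FileLoader
{
    // Returns the whole file as a byte array, or null when the file is empty.
    public static byte[] ReadFileBytes(string path)
    {
        using (FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            if (stream.Length == 0)
                return null;   // nothing to hand to the indexing engine

            byte[] fileData = new byte[stream.Length];
            int offset = 0;

            // Read may return fewer bytes than requested, so loop until full.
            while (offset < fileData.Length)
            {
                int read = stream.Read(fileData, offset, fileData.Length - offset);
                if (read == 0)
                    throw new EndOfStreamException("File ended before all bytes were read.");
                offset += read;
            }
            return fileData;
        }
    }
}
```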
At this point we have the entire content of the file stored in the fileData array. However, the file data has to be presented in DocBytes and we also have to set HaveDocBytes to true to indicate to the indexing engine that it has to read and process DocBytes.
We can now finish the method and the class.
The entire class is surprisingly short.
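Putting the steps together, a class along these lines would do the job. This is a sketch, not the original listing: the class name, the folder path and the filePointer field are hypothetical, File.ReadAllBytes stands in for reading the stream by hand, and DocModifiedDate and DocCreatedDate are assumed to be the names of the inherited time stamp properties.

```csharp
using System;
using System.IO;
using dtSearch.Engine;

// Hypothetical class name: feeds every file in one folder to the indexing engine.
class FolderDataSource : DataSource
{
    // Hypothetical folder; substitute the folder you want to index.
    private const string Folder = @"C:\dtSearchDemo\Docs";

    private string[] files;
    private int filePointer;   // index of the next file to hand over

    public override bool Rewind()
    {
        try
        {
            files = Directory.GetFiles(Folder);
        }
        catch (Exception)
        {
            return false;      // the folder could not be listed
        }
        filePointer = 0;
        return true;
    }

    public override bool GetNextDoc()
    {
        // Other error handling is omitted, as in the text.
        while (files != null && filePointer < files.Length)
        {
            string path = files[filePointer++];
            byte[] fileData = File.ReadAllBytes(path);
            if (fileData.Length == 0)
                continue;      // skip files with no data to read

            DocName = path;
            DocModifiedDate = File.GetLastWriteTime(path);  // assumed property name
            DocCreatedDate = File.GetCreationTime(path);    // assumed property name
            DocIsFile = false;       // we supply the bytes ourselves
            DocBytes = fileData;
            HaveDocBytes = true;     // tell the engine to read DocBytes
            return true;
        }
        return false;                // no more documents
    }
}
```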
Using the custom DataSource
Now we have the custom DataSource we can make use of it. Setting up the index creation is much the same as before - create the IndexJob, then set the index path and action properties.
Next we create an instance of the custom DataSource.
Finally we tell the IndexJob to use the data source, by assigning it to the DataSourceToIndex property, and execute the job.
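A sketch of that setup, written as a method that accepts any DataSource so that it does not depend on a particular class. The class name, method name and parameters are hypothetical, and DataSourceToIndex is assumed to be the IndexJob property that takes the data source.

```csharp
using System;
using dtSearch.Engine;

static class DataSourceIndexer
{
    // Builds a new index at 'indexPath' from whatever 'source' supplies.
    public static bool BuildIndex(DataSource source, string indexPath)
    {
        IndexJob indexJob = new IndexJob();
        indexJob.IndexPath = indexPath;        // directory for the index files
        indexJob.ActionCreate = true;          // create (or overwrite) the index
        indexJob.ActionAdd = true;             // add the supplied documents

        indexJob.DataSourceToIndex = source;   // assumed property name
        return indexJob.Execute();             // blocks until indexing finishes
    }
}
```

Calling BuildIndex with an instance of your own DataSource subclass and an index directory of your choice runs the whole job and returns the result of Execute.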
The indexing engine performs a rewind to make sure everything is initialized before it begins.
If you try this out you will discover that the contents of the index are the same as before. The program might achieve the same result but it does it in a very different way. Now you can take the same DataSource class and customize it to provide documents or raw text from any source you care to use - ODB, ADO.NET, LINQ, raw SQL, XML, RSS or any of the many web APIs.
Original article: http://www.i-programmer.info/programming/database/3408-full-text-database-indexing-with-dtsearch.html