UVa 1597 Searching the Web

The term "search engine" may not be strange to you. Generally speaking, a search engine searches the web pages available on the Internet, extracts and organizes the information, and responds to users' queries with the most relevant pages. World-famous search engines, like GOOGLE, have become very important tools for us to use when we visit the web. Such conversations are now common in our daily life:

"What does the word like ****** mean?" "Um... I am not sure, just google it."

In this problem, you are required to construct a small search engine. Sounds impossible, doesn't it? Don't worry, here is a tutorial teaching you, step by step, how to organize a large collection of texts efficiently and respond to queries quickly. You don't need to worry about fetching the web pages: all the web pages are provided to you in text format as the input data. Besides, a lot of queries are also provided to validate your system.

Modern search engines use a technique called inversion for dealing with very large sets of documents. The method relies on the construction of a data structure, called an inverted index, which associates terms (words) to their occurrences in the collection of documents. The set of terms of interest is called the vocabulary, denoted as V. In its simplest form, an inverted index is a dictionary where each search key is a term ω∈V. The associated value b(ω) is a pointer to an additional intermediate data structure, called a bucket. The bucket associated with a certain term ω is essentially a list of pointers marking all the occurrences of ω in the text collection. Each entry in each bucket simply consists of the document identifier (DID), the ordinal number of the document within the collection and the ordinal line number of the term's occurrence within the document.

Let's take Figure-1 as an example, which describes the general structure. Assume that we only have three documents to handle, shown at the right part in Figure-1. First we need to tokenize the text into words (blanks, punctuation and other non-alphabetic characters are used to separate words) and construct our vocabulary from the terms occurring in the documents. For simplicity, we don't need to consider any phrases, only a single word as a term. Furthermore, the terms are case-insensitive (e.g. we consider "book" and "Book" to be the same term) and we don't consider any morphological variants (e.g. we consider "books" and "book", "protected" and "protect" to be different terms) and hyphenated words (e.g. "middle-class" is not a single term, but separated into 2 terms "middle" and "class" by the hyphen). The vocabulary is shown at the left part in Figure-1. Each term of the vocabulary has a pointer to its bucket. The collection of the buckets is shown at the middle part in Figure-1. Each item in a bucket records the DID of the term's occurrence.

After constructing the whole inverted index structure, we may apply it to the queries. The query is in any of the following formats:

term
term AND term
term OR term
NOT term

A single term can be combined by Boolean operators: AND, OR and NOT ("term1 AND term2" means to query the documents including term1 and term2; "term1 OR term2" means to query the documents including term1 or term2; "NOT term1" means to query the documents not including term1). Terms are single words as defined above. You are guaranteed that no non-alphabetic characters appear in a term, and all the terms are in lowercase. Furthermore, some meaningless stop words (common words such as articles, prepositions, and adverbs, specified to be "the, a, to, and, or, not" in our problem) will not appear in the query, either.

For each query, the engine based on the constructed inverted index searches the term in the vocabulary, compares the terms' bucket information, and then gives the result to the user. Now can you construct the engine?
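The tokenizing rules and the bucket structure described above can be sketched in C++ roughly as follows. This is only a minimal illustration: the names `Posting`, `InvertedIndex`, `tokenize` and `buildIndex` are hypothetical, and storing each bucket as a `std::set` of (document, line) pairs is an assumption rather than something the problem prescribes.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// One bucket entry: (document id, line number within that document).
using Posting = std::pair<int, int>;
// Hypothetical name: maps each lowercase term to its bucket.
using InvertedIndex = std::map<std::string, std::set<Posting>>;

// Split a line into lowercase terms; every non-alphabetic character
// (blank, punctuation, hyphen, digit) separates words.
std::vector<std::string> tokenize(const std::string& line) {
    std::vector<std::string> terms;
    std::string cur;
    for (unsigned char c : line) {
        if (std::isalpha(c)) {
            cur += static_cast<char>(std::tolower(c));
        } else if (!cur.empty()) {
            terms.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) terms.push_back(cur);
    return terms;
}

// docs[d][l] is line l of document d (both 0-based here).
InvertedIndex buildIndex(const std::vector<std::vector<std::string>>& docs) {
    InvertedIndex index;
    for (int d = 0; d < static_cast<int>(docs.size()); ++d)
        for (int l = 0; l < static_cast<int>(docs[d].size()); ++l)
            for (const std::string& term : tokenize(docs[d][l]))
                index[term].insert(Posting(d, l));
    return index;
}
```

With this layout, "middle-class" contributes the two terms "middle" and "class", while "Books" and "books" land in the same bucket.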

Input

The input starts with an integer N (0 < N < 100) representing the number of documents provided. The next N sections are the N documents. Each section contains the document content and ends with a single line of ten asterisks:

**********

You may assume that each line contains no more than 80 characters and that the total number of lines in the N documents will not exceed 1500. Next, an integer M (0 < M <= 50000) is given, representing the number of queries, followed by M lines, one query per line. All the queries correspond to the format described above.
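Reading this format can be sketched as below. The function name `readDocuments` is hypothetical, and the sketch assumes the input is well formed. It keeps every line verbatim, so the original spacing can be printed back unchanged, and it consumes the rest of the line holding N before reading the first document.

```cpp
#include <cassert>
#include <istream>
#include <sstream>  // std::istringstream, handy for feeding test input
#include <string>
#include <vector>

// Reads N documents from `in`; each document ends with a line of ten asterisks.
// The result is indexed as docs[d][l] = line l of document d, stored verbatim.
std::vector<std::vector<std::string>> readDocuments(std::istream& in) {
    int n = 0;
    in >> n;
    std::string line;
    std::getline(in, line);  // skip the remainder of the line that holds N
    std::vector<std::vector<std::string>> docs(n > 0 ? n : 0);
    for (auto& doc : docs)
        while (std::getline(in, line) && line != "**********")
            doc.push_back(line);
    return docs;
}
```

After the call the stream is positioned at M, so the query count can be read with `in >> M` followed by another `std::getline` to skip the end of that line.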

Output

For each query, you need to find the documents satisfying the query and output just the lines within those documents that include the search terms (for a NOT query, you need to output the whole document). You should print the lines in the same order as they appear in the input. Separate different documents with a single line of 10 dashes:

----------

If no documents matching the query are found, just output a single line: "Sorry, I found nothing." The output of each query ends with a single line of 10 equal signs:

==========
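The separator rules can be isolated in one small helper, sketched here. The name `formatAnswer` is hypothetical, and skipping a matching document that has no lines to print is an assumption about an edge case the statement does not spell out. Building the whole answer as a string and writing it once also avoids flushing the stream after every line, which `endl` does.

```cpp
#include <cassert>
#include <string>
#include <vector>

// result[i] holds the lines to print for the i-th matching document,
// already in input order. Returns the full text for one query.
std::string formatAnswer(const std::vector<std::vector<std::string>>& result) {
    std::string out;
    bool printedAny = false;
    for (const auto& docLines : result) {
        if (docLines.empty()) continue;            // assumption: nothing to show
        if (printedAny) out += "----------\n";     // separator between documents
        printedAny = true;
        for (const auto& l : docLines) out += l + "\n";
    }
    if (!printedAny) out += "Sorry, I found nothing.\n";
    out += "==========\n";                         // every query ends with this
    return out;
}
```

The returned string can then be written with a single `std::cout << formatAnswer(result);` per query.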

Sample Input

4
A manufacturer, importer, or seller of
digital media devices may not (1) sell,
or offer for sale, in interstate commerce,
or (2) cause to be transported in, or in a
manner affecting, interstate commerce,
a digital media device unless the device
includes and utilizes standard security
technologies that adhere to the security
system standards.
**********
Of course, Lisa did not necessarily
intend to read his books. She might
want the computer only to write her
midterm. But Dan knew she came from
a middle-class family and could hardly
afford the tuition, let alone her reading
fees. Books might be the only way she
could graduate
**********
Research in analysis (i.e., the evaluation
of the strengths and weaknesses of
computer system) is essential to the
development of effective security, both
for works protected by copyright law
and for information in general. Such
research can progress only through the
open publication and exchange of
complete scientific results
**********
I am very very very happy!
What about you?
**********
6
computer
books AND computer
books OR protected
NOT security
very
slick

Sample Output

want the computer only to write her
----------
computer system) is essential to the
==========
intend to read his books. She might
want the computer only to write her
fees. Books might be the only way she
==========
intend to read his books. She might
fees. Books might be the only way she
----------
for works protected by copyright law
==========
Of course, Lisa did not necessarily
intend to read his books. She might
want the computer only to write her
midterm. But Dan knew she came from
a middle-class family and could hardly
afford the tuition, let alone her reading
fees. Books might be the only way she
could graduate
----------
I am very very very happy!
What about you?
==========
I am very very very happy!
==========
Sorry, I found nothing.
==========

My code timed out, and after improving it, it still timed out. Below is the timed-out code after two submissions.
#include <iostream>
#include <string>
#include <vector>
#include <set>
#include <sstream>
#include <stdio.h>
#include <cctype>                                // isalpha, tolower

using namespace std;

vector<string>lines[105][1505];                  // at most 100 documents, each with at most 1500 lines
vector<string>::iterator t;
int line_num[105], N;                            // line_num[i]: number of lines in document i; N: number of documents

bool strCmp(const string  a , const string b)         // compare a and b case-insensitively; return true if they match
{
    int aLen = a.length();
    int bLen = b.length();
    bool flag = true;

    int p = 0;
    if(!isalpha(b[0]))p = 1;
    //cout << a<<"a  b--"<<b<<endl;
    for(int i = 0;i < aLen;i++){
        if(tolower(a[i]) != tolower(b[p++])){
            flag = false;
            break;
        }
    }

    if(flag && p < bLen){
        if(isalpha(b[p]))flag = false;
    }
    return flag;
}

bool deal_find(string a,int p, int q)               // look for a in line q of document p; return true if it is present
{
    for(t = lines[p][q].begin();t != lines[p][q].end();t++){
        if(strCmp(a,*t))return true;
    }
    return false;
}

void output(int i,int j)
{
    for(t = lines[i][j].begin();t != lines[i][j].end();t++){
        if( t == lines[i][j].begin())cout<<*t;
        else cout<<" "<<*t;
    }
    cout<<endl;
}

bool WORD(string a, int k)
{
    bool flag = false, re = false;

    for(int i = 0; i < N; i++){
        for(int j = 0; j < line_num[i]; j++){

            if(deal_find(a,i,j)){
                re = true;
                if(flag&&k)cout<<"----------"<<endl;
                flag = true;
                output(i,j);
            }

        }

    }
    return re;
}

bool AND(string a, string b)
{
    bool flag = false, re = false;
    for(int i = 0; i < N; i++){
        int a0 = 0,b0 = 0;                                 // record whether a and b, respectively, were found in this document
        set<int> and_line;
        for(int j = 0; j < line_num[i]; j++){
            if(deal_find(a,i,j)){
                a0 = 1;
                and_line.insert(j);
            }
            if(deal_find(b,i,j)){
                b0 = 1;
                and_line.insert(j);
            }
        }
        if(a0 && b0){
            re = true;
            if(flag)cout<<"----------"<<endl;
            flag = true;
             set<int>::iterator iter;
             for(iter=and_line.begin();iter!=and_line.end();iter++)output(i,*iter);
        }
    }
    return re;
}

bool NOT(string a)
{
    bool flag , re = false, k = false;
    for(int i = 0; i < N; i++){
        flag = false;
        for(int j = 0; j < line_num[i]; j++){
            if(deal_find(a,i,j)){
                flag = true;
                break;
            }
        }
        if(flag)continue;
        else{
            re = true;
            if(k)cout<<"----------"<<endl;
            k = true;
            for(int j = 0;j < line_num[i]; j++)output(i,j);
        }
    }
    return re;
}

int main()
{
    int num1 = 0, M;
    cin >> N;
    int n = N;
    while(n--){                                // read the n documents
        int num2 = 0;
        string line;

        while(getline(cin,line)){
            bool flag = true;
            stringstream ss(line);
            string word;

            while(ss >> word){
                if( word[0] == '*' ){
                    flag = false;
                    break;
                }
                lines[num1][num2].push_back(word);
            }

            if(!flag)break;
            num2++;

        }
        line_num[num1] = num2;
        num1++;
    }

    cin >> M;
    bool re1,re2;
    string com;
    getchar();
    while(M--){
        getline(cin, com);
        if(com.find("AND") != string::npos){
            re1 = AND(com.substr(0,com.find_first_of(' ')), com.substr(com.find_last_of(' ')+1));
            if(!re1)cout << "Sorry, I found nothing."<<endl;
        }
        else if(com.find("OR") != string::npos){
            re1 = WORD(com.substr(0,com.find_first_of(' ')) ,0);
            cout<<"----------"<<endl;
            re2 = WORD(com.substr(com.find_last_of(' ')+1) ,0);
            if(!re1&&!re2)cout << "Sorry, I found nothing."<<endl;
        }
        else if(com.find("NOT")!= string::npos){
            re1 = NOT(com.substr(com.find_last_of(' ')+1));
            if(!re1)cout << "Sorry, I found nothing."<<endl;
        }
        else {
            re1 = WORD(com, 1);
            if(!re1)cout << "Sorry, I found nothing."<<endl;
        }

        cout << "==========" << endl;
    }

    //system("pause");
    return 0;
}
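Finally I ran it in both VS and CB and found something wrong with how the main function returns: the program had finished running, yet it still did not exit. When the following situation appeared, I had to press Enter once more, and then
it is probably an error inside the program... I understand computers less and less... T_T. I then kept revising it. The problem above is gone and the results look correct, but!!!!! It still timed out!!!!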
#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <vector>
#include <sstream>
#include <set>
#include <algorithm>
#include <iterator>
#include <cctype>                                // isalpha, tolower

using namespace std;

vector<string>lines[105][1505];                  // at most 100 documents, each with at most 1500 lines
vector<string>::iterator t;
int line_num[105], N;

#define FILE

bool deal_find(string a,int p, int q)               // look for a in line q of document p; return true if it is present
{
    for(t = lines[p][q].begin();t != lines[p][q].end();t++){
        int aLen = a.length(), bLen = (*t).length();
        bool flag = true;
        int p = 0;
        if(!isalpha((*t)[0]))p = 1;
        for(int i = 0;i < aLen;i++){
            if(tolower(a[i]) != tolower((*t)[p++])){
                flag = false;
                break;
            }
        }

        if(flag && p < bLen){
            if(isalpha((*t)[p]))flag = false;
        }
        if(flag)return true;
    }
    return false;
}

void output(int i,int j)
{
    for(t = lines[i][j].begin();t != lines[i][j].end();t++){
        if( t == lines[i][j].begin())cout<<*t;
        else cout<<" "<<*t;
    }
    cout<<endl;
}

bool WORD(string a)
{
    bool flag = false, re = false;

    for(int i = 0; i < N; i++){

        flag = false;int k = 0;
        for(int j = 0; j < line_num[i]; j++){

            if(deal_find(a,i,j)){
                if(re)flag = true;       // flag == true means lines from an earlier match have already been printed
                re = true;
                if(flag && re && !k)cout<<"----------"<<endl;
                k = 1;
                output(i,j);
            }

        }

    }
    return re;
}

bool AND(string a, string b)
{
    bool flag = false, re = false;
    for(int i = 0; i < N; i++){
        int a0 = 0,b0 = 0;                                 // record whether a and b, respectively, were found in this document
        set<int> and_line;
        for(int j = 0; j < line_num[i]; j++){
            if(deal_find(a,i,j)){
                a0 = 1;
                and_line.insert(j);
            }
            if(deal_find(b,i,j)){
                b0 = 1;
                and_line.insert(j);
            }
        }
        if(a0 && b0){
            re = true;
            if(flag)cout<<"----------"<<endl;
            flag = true;
             set<int>::iterator iter;
             for(iter=and_line.begin();iter!=and_line.end();iter++)output(i,*iter);
        }
    }
    return re;
}

bool OR(string a, string b)
{
    bool flag = false, re = false;
    for(int i = 0; i < N; i++){
        flag = true;
        int k = 1;
        for(int j = 0; j < line_num[i]; j++){
            if(deal_find(a,i,j)){
                if(flag&&k&&re){
                    cout<<"----------"<<endl;
                    k = 0;
                }
                flag = false;
                re = true;
                output(i,j);
            }
            if(deal_find(b,i,j)) {
                if(flag&&k&&re){
                    cout<<"----------"<<endl;
                    k = 0;
                }
                flag = false;
                re = true;
                output(i,j);
            }
        }
    }
    return re;
}
bool NOT(string a)
{
    bool flag , re = false, k = false;
    for(int i = 0; i < N; i++){
        flag = false;
        for(int j = 0; j < line_num[i]; j++){
            if(deal_find(a,i,j)){
                flag = true;
                break;
            }
        }
        if(flag)continue;
        else{
            re = true;
            if(k)cout<<"----------"<<endl;
            k = true;
            for(int j = 0;j < line_num[i]; j++){
                    output(i,j);
            }
        }
    }
    return re;
}

int main(int argc, char* argv[])
{
    int M, num1 = 0,num2 = 0;
    string line;
    cin >> N;
    cin.get();

    for(int i = 0; i <N; i++){
       num2 = 0;
    while(getline(cin,line)){
            if(line == "**********") break;

            stringstream ss(line);
            string word;

            while(ss >> word)lines[num1][num2].push_back(word);
            num2++;
            }
            line_num[num1] = num2;
            num1++;
        }

        cin >> M;
        bool re1,re2;
        string com;
        cin.get();
        for(int i=0;i<M;i++)
        {
            getline(cin,com);
            if(com[0]=='N')
            {
            re1 = NOT(com.substr(com.find_last_of(' ')+1));
            if(!re1)cout << "Sorry, I found nothing."<<endl;
            }
            else if(com.find("AND")!=string::npos)
            {
            re1 = AND(com.substr(0,com.find_first_of(' ')), com.substr(com.find_last_of(' ')+1));
            if(!re1)cout << "Sorry, I found nothing."<<endl;
            }
            else if(com.find("OR")!=string::npos)
            {
            re1 = OR(com.substr(0,com.find_first_of(' ')),com.substr(com.find_last_of(' ')+1));
            if(!re1)cout << "Sorry, I found nothing."<<endl;
            }
            else
            {
            re1 = WORD(com);
            if(!re1)cout << "Sorry, I found nothing."<<endl;
            }
            cout<<"=========="<<endl;
        }
    //system("pause");
    return 0;
}

I give up. This is enough to show that the problem lies in the approach itself: the wrong approach is what causes the timeout.
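Both versions above rescan every word of every line for each of the up to 50000 queries. The problem statement itself points to a different direction: build the inverted index once, then answer each query from the buckets. The following is only a sketch of that idea, not a verified accepted solution. All names (`Index`, `DocLines`, `lookup`, `queryAnd`, `queryOr`, `queryNot`) are hypothetical, the index is assumed to have been filled while reading the documents, and the AND semantics (documents containing both terms, lines containing either) follow the sample output.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using LineSet  = std::set<int>;                  // line numbers within one document
using DocLines = std::map<int, LineSet>;         // document id -> lines containing a term
using Index    = std::map<std::string, DocLines>;

// Bucket of a term, or an empty bucket if the term never occurs.
const DocLines& lookup(const Index& idx, const std::string& term) {
    static const DocLines empty;
    Index::const_iterator it = idx.find(term);
    return it == idx.end() ? empty : it->second;
}

// "a OR b": every document containing a or b, with the lines containing either.
DocLines queryOr(const Index& idx, const std::string& a, const std::string& b) {
    DocLines out = lookup(idx, a);
    for (const auto& entry : lookup(idx, b))
        out[entry.first].insert(entry.second.begin(), entry.second.end());
    return out;
}

// "a AND b": only documents containing both, with the lines containing either.
DocLines queryAnd(const Index& idx, const std::string& a, const std::string& b) {
    const DocLines& da = lookup(idx, a);
    const DocLines& db = lookup(idx, b);
    DocLines out;
    for (const auto& entry : da) {
        DocLines::const_iterator it = db.find(entry.first);
        if (it == db.end()) continue;            // this document lacks b
        LineSet merged = entry.second;
        merged.insert(it->second.begin(), it->second.end());
        out.emplace(entry.first, std::move(merged));
    }
    return out;
}

// "NOT a": ids of the documents (0 .. numDocs-1) that never contain a.
std::vector<int> queryNot(const Index& idx, const std::string& a, int numDocs) {
    const DocLines& da = lookup(idx, a);
    std::vector<int> docs;
    for (int d = 0; d < numDocs; ++d)
        if (da.count(d) == 0) docs.push_back(d);
    return docs;
}
```

A single-term query is just `lookup(idx, term)`. Because `std::map` and `std::set` iterate in increasing order, documents and lines come out in input order without extra sorting.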



				
Time: 2024-10-05 23:08:54
