HDU 4782 Beautiful Soup (simulation)

Problem link: http://acm.hdu.edu.cn/showproblem.php?pid=4782

Problem Description

  Coach Pang has a lot of hobbies. One of them is playing with “tag soup” with the help of Beautiful Soup. Coach Pang is satisfied with Beautiful Soup in every respect, except the prettify() method, which attempts to turn a soup into a nicely formatted string. He decides to rewrite the method to prettify an HTML document according to his personal preference. But Coach Pang is always very busy, so he gives this task to you. Considering that you do not know anything about “tag soup” or Beautiful Soup, Coach Pang kindly left some information with you:

  In Web development, “tag soup” refers to formatted markup written for a web page that is very much like HTML but does not consist of correct HTML syntax and document structure. In short, “tag soup” refers to messy HTML code.

  Beautiful Soup is a library for parsing HTML documents (including “tag soup”). It parses “tag soup” into regular HTML documents, and creates parse trees for the parsed pages.

  The parsed HTML documents obey the rules below.

HTML

  HTML stands for HyperText Markup Language.

  HTML is a markup language.

  A markup language is a set of markup tags.

  The tags describe document content.

  HTML documents consist of tags and texts.

Tags

  HTML is using tags for its syntax.

  A tag is composed with special characters: ‘<’, ‘>’ and ‘/’.

  Tags usually come in pairs, the opening tag and the closing tag.

  The opening tag starts with “<” and the tagname. It usually ends with a “>”.

  The closing tag starts with “</” and the same tagname as the corresponding opening tag. It ends with a “>”.

  There will not be any other angle brackets in the documents.

  Tagnames are strings containing only lowercase letters.

  Tags will contain no line break (‘\n’).

  Apart from tags, anything that occurs in the document is considered text content.

Elements

  An element is everything from an opening tag to the matching closing tag (including the two tags).

  The element content is everything between the opening and the closing tag.

  Some elements may have no content. They’re called empty elements, like <hr></hr>.

  Empty elements can be closed in the opening tag, ending with a “/>” instead of “>”.

  All elements are closed either with a closing tag or in the opening tag.

  Elements can have attributes.

  Elements can be nested (can contain other elements).

  The <html> element is the container for all other elements, it will not have any attributes.

Attributes

  Attributes provide additional information about an element.

  Attributes are always specified in the opening tag after the tagname.

  The tag name and the attributes are separated by a single space.

  An element may have several attributes.

  Attributes come in name="value" pairs like class="icpc".

  There will not be any space around the ‘=’.

  All attribute names are in lowercase.

A Simple Example

  <a href="http://icpc.baylor.edu/">ACM-ICPC</a>

  The <a> element defines an HTML link with the <a> tag.

  The link address is specified in the href attribute.

  The content of the element is the text “ACM-ICPC”

  

  You are feeling dizzy after reading all these, when Coach Pang shows up again. He starts to spout for hours about his personal preference and you catch his main points with difficulty. Coach Pang says:

  Your task is to write a program that will turn parsed HTML documents into formatted parse trees. You should print each tag or text content on its own line, preceded by a number of spaces that indicates its depth in the parse tree. The depth of the root of a parse tree (the <html> tag) is 0. He is satisfied with the tags, so you shouldn’t change anything in any tag. For text content, throw away unnecessary white spaces, including space (ASCII code 32), tab (ASCII code 9) and newline (ASCII code 10), so that words (sequences of characters without white spaces) are separated by a single space. There should not be any trailing space after each line, nor any blank line in the output. A line that contains only white spaces is also considered a blank line. You quickly realize that your only job is to deal with the white spaces.
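The whitespace rule for text content can be sketched in isolation. The snippet below is only an illustration under the rules quoted above; the function name `normalize_text` is hypothetical and is not part of the solution further down. It collapses every run of spaces, tabs and newlines into one space and drops whitespace at both ends, so an all-whitespace input becomes an empty string (which means no line is printed).

```cpp
#include <string>

// Hypothetical helper: collapse runs of space, tab and newline into single
// spaces and drop leading and trailing whitespace.
std::string normalize_text(const std::string& raw)
{
    std::string out;
    bool pending_space = false; // whitespace seen since the last word character
    for (char ch : raw)
    {
        if (ch == ' ' || ch == '\t' || ch == '\n')
        {
            pending_space = true;
        }
        else
        {
            // Emit one separator only between two words, never at the start.
            if (pending_space && !out.empty())
                out += ' ';
            out += ch;
            pending_space = false;
        }
    }
    return out;
}
```

For the first sample, the content of `<h1>` is `ACM`, a newline, then `ICPC`, which this sketch turns into `ACM ICPC`.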

Input

  The first line of the input is an integer T representing the number of test cases.

  Each test case is a valid HTML document that starts with a <html> tag and ends with a </html> tag. See sample below for clarification of the input format.

  The size of the input file will not exceed 20KB.

Output

  For each test case, first output a line “Case #x:”, where x is the case number (starting from 1).

  Then you should write to the output the formatted parse trees as described above. See sample below for clarification of the output format.

Sample Input

2
<html><body>
<h1>ACM
ICPC</h1>
<p>Hello<br/>World</p>
</body></html>
<html><body><p>
Asia Chengdu Regional</p>
<p class="icpc">
ACM-ICPC</p></body></html>

Sample Output

Case #1:
<html>
 <body>
  <h1>
   ACM ICPC
  </h1>
  <p>
   Hello
   <br/>
   World
  </p>
 </body>
</html>
Case #2:
<html>
 <body>
  <p>
   Asia Chengdu Regional
  </p>
  <p class="icpc">
   ACM-ICPC
  </p>
 </body>
</html>

Hint

Please be careful of the number of leading spaces of each line in the above sample output.

Source

2013 Asia Chengdu Regional Contest

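Problem summary:

Given a messily formatted HTML document, strip the redundant whitespace and print every tag and every run of text on its own line, indented by its depth in the parse tree.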
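The indentation rule comes down to three kinds of tags. The sketch below is one way to express it, assuming well-formed tags as the problem guarantees; the names `TagKind`, `classify_tag` and `print_depth` are hypothetical. An opening tag is printed at the current depth and then increases it, a closing tag decreases the depth first and is printed at the new depth, and a self-closing tag such as `<br/>` leaves the depth unchanged.

```cpp
#include <string>

enum class TagKind { Opening, Closing, SelfClosing };

// Classify a complete tag such as "<p class=\"icpc\">", "</p>" or "<br/>".
// Assumes the tag is well formed (starts with '<', ends with '>').
TagKind classify_tag(const std::string& tag)
{
    if (tag.size() >= 2 && tag[1] == '/')
        return TagKind::Closing;
    if (tag.size() >= 3 && tag[tag.size() - 2] == '/')
        return TagKind::SelfClosing;
    return TagKind::Opening;
}

// Return the depth at which the tag is printed and update `depth`
// so that it applies to whatever follows the tag.
int print_depth(const std::string& tag, int& depth)
{
    switch (classify_tag(tag))
    {
    case TagKind::Closing:     return --depth;
    case TagKind::Opening:     return depth++;
    case TagKind::SelfClosing: return depth;
    }
    return depth;
}
```

Text content is always printed at the current depth, which is one more than the depth of the tag that encloses it.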

The code is as follows:

#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

vector<string> vv; // tokens of the current document: whole tags and single words

int main()
{
    int t;
    int cas = 0;
    scanf("%d", &t);
    while (t--)
    {
        string ss;         // token currently being built
        int cont = 0;      // current depth in the parse tree
        vv.clear();
        int c = getchar(); // int rather than char, so EOF can be detected
        while (c != EOF)
        {
            // skip whitespace between tokens
            while (c == ' ' || c == '\n' || c == '\t')
                c = getchar();
            if (c == EOF)
                break;
            if (c != '<')
            {
                // a word: read up to the next whitespace or tag
                while (c != EOF && c != '<' && c != '\n' && c != '\t' && c != ' ')
                {
                    ss += (char)c;
                    c = getchar();
                }
                vv.push_back(ss);
                ss.clear();
            }
            else
            {
                // a tag: read through '>', keeping the spaces inside it
                while (c != EOF && c != '>')
                {
                    ss += (char)c;
                    c = getchar();
                }
                ss += '>';
                vv.push_back(ss);
                bool done = (ss == "</html>"); // end of this document
                ss.clear();
                if (done)
                    break;
                c = getchar();
            }
        }
        printf("Case #%d:\n", ++cas);
        bool flag = false; // true while a line of text is still open
        for (size_t i = 0; i < vv.size(); i++)
        {
            if (vv[i][0] == '<')
            {
                flag = false;
                if (vv[i][1] != '/') // opening or self-closing tag
                {
                    for (int j = 0; j < cont; j++)
                        printf(" ");
                    cout << vv[i] << endl;
                    int len = vv[i].size();
                    if (vv[i][len - 2] != '/') // not self-closing: go one level deeper
                        cont++;
                }
                else // closing tag
                {
                    cont--;
                    for (int j = 0; j < cont; j++)
                        printf(" ");
                    cout << vv[i] << endl;
                }
            }
            else
            {
                if (!flag) // first word of a text line: indent
                {
                    for (int j = 0; j < cont; j++)
                        printf(" ");
                }
                else // later words: single separating space
                {
                    printf(" ");
                }
                cout << vv[i];
                flag = true;
                // end the text line when a tag (or the end of input) follows
                if (i + 1 == vv.size() || vv[i + 1][0] == '<')
                    printf("\n");
            }
        }
    }
    return 0;
}
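
The solution above reads from standard input, which makes it awkward to check against the samples. As a rough sketch of the same idea in a form that is easier to test, the hypothetical function `prettify` below (not part of the submitted code) formats one document held in a string and returns the formatted text. It assumes a valid document as described in the problem statement and simply stops at an unterminated tag.

```cpp
#include <string>

// Hypothetical helper: format one in-memory document, one tag or text run
// per line, indented by depth.
std::string prettify(const std::string& doc)
{
    std::string out;
    int depth = 0;
    bool in_text = false; // true while a text line is still open
    std::size_t i = 0;
    while (i < doc.size())
    {
        char ch = doc[i];
        if (ch == ' ' || ch == '\t' || ch == '\n')
        {
            ++i; // whitespace between tokens is never copied
            continue;
        }
        if (ch == '<')
        {
            std::size_t end = doc.find('>', i);
            if (end == std::string::npos)
                break; // unterminated tag: stop (cannot happen for valid input)
            std::string tag = doc.substr(i, end - i + 1);
            if (in_text)
            {
                out += '\n'; // close the pending text line
                in_text = false;
            }
            bool closing = tag[1] == '/';
            bool self_closing = tag[tag.size() - 2] == '/';
            if (closing && depth > 0)
                --depth;
            out += std::string(depth, ' ') + tag + '\n';
            if (!closing && !self_closing)
                ++depth;
            i = end + 1;
        }
        else
        {
            // a word runs until the next whitespace character or tag
            std::size_t end = doc.find_first_of(" \t\n<", i);
            if (end == std::string::npos)
                end = doc.size();
            out += in_text ? std::string(1, ' ') : std::string(depth, ' ');
            out.append(doc, i, end - i);
            in_text = true;
            i = end;
        }
    }
    if (in_text)
        out += '\n';
    return out;
}
```

Calling `prettify` on the first sample document should reproduce the lines that follow `Case #1:` in the sample output, each terminated by a newline.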
Time: 2024-08-24 10:19:39
