Testing Regular Expressions for Web Page Parsing

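When writing a crawler, the most tedious part is writing the regular expressions: you have to try them one at a time and debug them over and over, which takes a lot of time. So I wrote a web-based tool. You enter the URL to crawl and a regular expression, and the page shows the data that the expression matched.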

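The idea is simple: send the URL and the regular expression to the server, let the server fetch the page and apply the expression, and return the results to the front end. I used bootcss (Bootstrap, for the front end) and bottle (Python, for the back end). The code is simple, but getting it to work was a bit involved. The parameter being passed is itself a URL, and the back end uses "/" to decide where a route parameter ends, so passing the value failed every time. I eventually thought of Base64-encoding the values before passing them.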
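The Base64 round trip described above can be sketched in Python 3 as follows. This is only an illustration, not the code used in this post: the helper names `encode_part`, `decode_part`, `pack` and `unpack` are hypothetical, and the sketch swaps in the URL-safe Base64 alphabet, because standard Base64 output can itself contain `/` and `+`.

```python
import base64

def encode_part(text):
    """Base64-encode text with the URL-safe alphabet ('-' and '_' replace '+' and '/')."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")

def decode_part(token):
    """Reverse encode_part."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")

def pack(url, reg):
    """Join both encoded values with a comma, which never appears in Base64 output."""
    return encode_part(url) + "," + encode_part(reg)

def unpack(webstr):
    """Split the combined string and decode both halves."""
    base64_url, base64_reg = webstr.split(",")
    return decode_part(base64_url), decode_part(base64_reg)

packed = pack("http://example.com/a/b?x=1", r'<a href="(.*?)">')
print(packed)
print(unpack(packed))
```

With this alphabet the packed string contains no `/` or `?`, so it can sit in a single path segment. The page's JavaScript encoder would have to produce the same alphabet for the server to decode it.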

webRegx.py

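# Written for Python 2 (urllib2 and the print statement).
import urllib2
import re
import json

def getHtml(url):
    # Download the page and return its raw HTML.
    html = urllib2.urlopen(url).read()
    return html

def getResult(url,reg):
    # Fetch the page, apply the regular expression, and return the matches as JSON.
    html = getHtml(url)
    reg = re.compile(reg)
    results = reg.findall(html)
    if len(results)>0:
        # Echo each match to the server console for debugging.
        for result in results:
            print result
    else:
        print "no results"
    return json.dumps(results)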

Note: getResult must return its data as JSON at the end, because the ajax call on the page expects a JSON response.
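To see what that JSON looks like without fetching a real page, here is a minimal Python 3 sketch of the same findall-then-dumps step. The function name `find_matches` and the sample HTML are hypothetical and exist only for this illustration.

```python
import json
import re

def find_matches(html, pattern):
    """Apply the pattern to the HTML text and return all matches as a JSON string."""
    return json.dumps(re.compile(pattern).findall(html))

sample_html = '<a href="/a">A</a><a href="/b">B</a>'

# One capture group: findall returns a list of strings.
print(find_matches(sample_html, r'href="(.*?)"'))

# Two capture groups: findall returns tuples, which JSON encodes as nested arrays.
print(find_matches(sample_html, r'href="(.*?)">(.*?)<'))
```

A pattern with more than one capture group therefore reaches the front end as an array of arrays, so each list item on the page holds several values rather than a single string.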

main.py

from bottle import route,request,template,run,Bottle,static_file
from webRegx import getResult
import base64

app = Bottle()

@app.route('/')
def show():
    return template('templates/index')

@app.route('/jiexi/:webstr#.*?#',method='post')
def test(webstr):
    # webstr has the form "<base64 url>,<base64 regex>"
    base64_url,base64_reg =webstr.split(",")
    url=base64.decodestring(base64_url)  # decode the URL
    reg=base64.decodestring(base64_reg)  # decode the regular expression
    return getResult(url,reg)

@app.route('/templates/:filename')
def send_static(filename):
    return static_file(filename, root='./templates')

run(app, host='localhost', port=8080)

index.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="description" content="">
    <meta name="author" content="">

    <title>Sticky Footer Template for Bootstrap</title>

   <!-- Bootstrap core CSS file -->
    <link rel="stylesheet" href="http://cdn.bootcss.com/bootstrap/3.2.0/css/bootstrap.min.css">

<!-- Optional Bootstrap theme file (usually not needed) -->
    <link rel="stylesheet" href="http://cdn.bootcss.com/bootstrap/3.2.0/css/bootstrap-theme.min.css">

<!-- jQuery file. Must be included before bootstrap.min.js -->
    <script src="http://cdn.bootcss.com/jquery/1.11.1/jquery.min.js"></script>
    <script src="./templates/base64.js"></script>
<!-- Bootstrap core JavaScript file -->
    <script src="http://cdn.bootcss.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
    <!-- Custom styles for this template -->
    <style type="text/css">
      /* Sticky footer styles
      -------------------------------------------------- */
      html {
        position: relative;
        min-height: 100%;
      }
      body {
        /* Margin bottom by footer height */
        margin-bottom: 60px;
        font-family: 'microsoft yahei', 'Times New Roman', 宋体, Times, serif;
      }
      .footer {
        position: absolute;
        bottom: 0;
        width: 100%;
        /* Set the fixed height of the footer here */
        height: 60px;
        background-color: #f5f5f5;
      }

      /* Custom page CSS
      -------------------------------------------------- */
      /* Not required for template or sticky footer method. */

      .container {
        width: auto;
        max-width: 800px;
        padding: 0 15px;
      }
      .container .text-muted {
        margin: 20px 0;
      }
    </style>

  </head>

  <body>

    <!-- Begin page content -->
    <div class="container">
      <div class="page-header">
        <h1>Regex Matching</h1>
      </div>

      <div>
          <div class="input-group input-group-lg">
            <span class="input-group-addon">url</span>
            <input type="text" class="form-control" placeholder="Enter a URL" id="url" name ="url">
          </div><br/>
          <div class="input-group input-group-lg">
            <span class="input-group-addon">reg</span>
            <input type="text" class="form-control" placeholder="Enter a regular expression" id="reg" name ="reg">
            <span class="input-group-btn">
              <button class="btn btn-default" type="submit"  onclick="HtmlRegx()" id="myButton">Search</button>
            </span>
          </div>
          <div class="modal fade" id="tip">
            <div class="modal-dialog">
             <div class="modal-content">
               <h3 class="modal-title">Notice</h3>
                <div class="modal-body"><p><h3>Loading...</h3></p></div>
              </div>
           </div>
          </div>
      </div>
      <br/>
      <div>
        <ul class="list-group" id="data-table">
        </ul>
      </div>

     </div>
    <div class="footer">
      <div class="container">
        <p class="text-muted">Place sticky footer content here.</p>
      </div>
    </div>

  </body>

<script type="text/javascript">
function HtmlRegx()
{
  var url = document.getElementById("url").value; // URL to fetch

  var reg = document.getElementById("reg").value; // regular expression
  if(url=="" || reg=="")
  {
    alert("The URL or the regular expression is empty");
    return;
  }
  // Show the loading dialog only after the inputs pass validation
  $('#tip').modal('show');
  var base64 = new Base64();
  var base64_url = base64.encode(url);
  var base64_reg = base64.encode(reg);

  // Both values travel in the path as "<base64 url>,<base64 regex>"
  var posturl = "/jiexi/"+base64_url+","+base64_reg;

  postdata(posturl,reg);
}

function postdata(url,reg)
{
    $.ajax({
            type:"POST",
            url:url,
            dataType:"json",
            success:function(data)
              {
              console.log(data[0]);
              show(data);
              },
            error:function()
              {
              // Close the dialog on failure so the page does not stay blocked
              $('#tip').modal('hide');
              alert("Request failed");
              }
            });
 }

function show(data)
{
   $('#tip').modal('hide');
    for(var i=0;i<data.length;i++)
    {
     // .text() escapes the match, so matched HTML is displayed instead of rendered
     $('<li class="list-group-item"></li>').text(String(data[i])).appendTo("#data-table");
   }
}
</script>
</html>

The query is sent with an ajax request.
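For testing the endpoint without a browser, the same request can be built from Python 3. This is a rough sketch under two assumptions: that base64.js produces standard Base64 of the UTF-8 text, and that the server from main.py is listening on localhost:8080. The names `b64` and `build_request` are hypothetical.

```python
import base64
import urllib.request

def b64(text):
    """Standard Base64 of UTF-8 text (assumed to match what base64.js emits)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def build_request(page_url, pattern, host="http://localhost:8080"):
    """Build, without sending, the POST request that the page's ajax call makes."""
    path = "/jiexi/" + b64(page_url) + "," + b64(pattern)
    return urllib.request.Request(host + path, data=b"", method="POST")

req = build_request("http://example.com/", r"<title>(.*?)</title>")
print(req.get_method(), req.full_url)

# To actually send it while main.py is running:
#   import json
#   matches = json.loads(urllib.request.urlopen(req).read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the exact path the server will receive.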

Final result:

Time: 2024-10-28 01:40:50
