『Re』Knowledge Engineering Assignment_Entity Recognition

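Assignment requirements

Environment and paths

The working directory looks something like this, 50 documents in total,

all of them in Chinese, a collection of court judgments.

Program

The program is below. I implemented everything purely with regular expressions.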

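import re
import glob
import copy

# Collect the input judgments; skip result.txt so that a rerun does not read its own output
name_list = [n for n in glob.glob('./*.txt') if not n.endswith('result.txt')]

date_totul = []
indictee_totul = []
court_totul = []
procuratorate_totul = []
# Results are appended to result.txt, one <file name> ... </file name> block per document
with open('./result.txt','a',encoding='utf-8') as f_r:
    for name in name_list:
        # On Windows glob returns paths such as .\11273.txt, so the file name follows the last backslash
        f_r.write('<{0}>\n\n'.format(name.split('\\')[-1]))
        with open(name,encoding='utf-8') as f:
            lines = f.read()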

            # Time matching
            # Forms covered: xxxx年xx月xx日; 同年xx月xx日; xxxx年xx月x旬; xxxx年xx月底; xxxx年xx月; xx月xxx日
            # Alternatives separated by | are tried left to right; a later one is used only when the earlier ones fail
            pattern_t = re.compile(
                '[0-9〇一二三四五六七八九]{4}年.{1,2}月.{1,3}日'
                '|同年.{1,2}月.{1,3}日'
                '|[0-9〇一二三四五六七八九]{4}年.{1,2}月.{1}旬'
                '|[0-9〇一二三四五六七八九]{4}年.{1,2}月底'
                '|[0-9〇一二三四五六七八九]{4}年.{1,2}月'
                '|[0-9〇一二三四五六七八九十]{1,2}月.{1,3}日')
            date_step = pattern_t.findall(lines)
            for i in date_step:
                f_r.write('<time>{0}</time>\n\n'.format(i))
            date_totul.extend(date_step)

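            # Defendant matching
            pattern_i = re.compile('被告人(.{2,4}?)[,，]|被上诉人：(.+?)。|被执行人：(.+?)。')
            pattern_i2 = re.compile('被申诉人（.+）：(.+)。|被申请人（.+）：(.+)。')
            # findall returns one tuple per match with one slot per group, and only the
            # alternative that matched is non-empty, so joining the slots yields the name.
            # (Taking item[0] alone would drop every alternative except the first.)
            candidates = [''.join(item) for item in pattern_i.findall(lines) + pattern_i2.findall(lines)]
            # Skip empty hits and sentencing phrases containing 死刑
            defendant = list(set(c for c in candidates if c and '死刑' not in c))

            if defendant:
                print(defendant)
                for item in defendant:
                    f_r.write('<defendant>{0}</defendant>\n\n'.format(item))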

            # Court matching
            # Capture up to 15 characters (lazily) between a delimiter and 人民法院
            pattern_c = re.compile('[\n。,，《；](.{,15}?人民法院)')
            _court_list = [m.group(1) for m in pattern_c.finditer(lines)]
            # Build the cleaned names in a set instead of editing a list while looping over it
            _court_step = set()
            for _court in _court_list:
                # Keep only what follows the last delimiter or connective (由, 向, 受, 和)
                court_name = re.split('[。，《、；由向受和]', _court)[-1]
                # Drop a two-character verb such as 不服 or 维持 (assumed to lead the name)
                if any(word in court_name for word in
                       ('不服', '后被', '报请', '书证', '核准', '指令', '维持', '撤销', '参照')):
                    court_name = court_name[2:]
                _court_step.add(court_name)
            # A bare 人民法院 names no particular court
            _court_step.discard('人民法院')
            _court_step = list(_court_step)

            for i in _court_step:
                f_r.write('<court>{0}</court>\n\n'.format(i))

            # Procuratorate matching: the text between 审理 and 指控
            pattern_p = re.compile('审理(.+)指控')
            procuratorate_step = list(set([m.group(1) for m in pattern_p.finditer(lines)]))
            procuratorate_totul.extend(procuratorate_step)
            for i in procuratorate_step:
                f_r.write('<procuratorate>{0}</procuratorate>\n\n'.format(i))

            # Place matching: province/city prefix followed by a lower-level unit
            place_patterns = [
                '.{2}省.+?县',
                '.{2}省.{2}市',
                '.{2}省.+?自治州',
                '.{2}省.+?乡',
                '.{2}市.{2}区',
                '.{2}市.{2}镇',
                '.{2}市.+?开发区',
            ]
            place_step = []
            for place_pattern in place_patterns:
                # Deduplicate within each pattern
                place_step.extend(set(re.findall(place_pattern, lines)))
            place_step_n = []
            for place_name in place_step:
                # A lazy .+? can still run across a sentence; drop over-long hits
                if len(place_name) > 15:
                    continue
                # Anonymised names (×× or XX): keep only the part before the mask
                for mask in ('××', 'XX'):
                    place_name = place_name.split(mask)[0]
                if place_name:
                    place_step_n.append(place_name)
        for i in place_step_n:
            f_r.write('<location>{0}</location>\n\n'.format(i))
        f_r.write('</{0}>\n\n'.format(name.split('\\')[-1]))

Here is the output for one of the files, excerpted from the result document:

<11273.txt>

<time>1991年7月3日</time>

<time>2008年8月7日</time>

<time>2008年9月16日</time>

<time>2009年3月18日</time>

<time>2011年2月6日</time>

<time>2012年2月2日</time>

<time>2013年3月28日</time>

<time>2013年6月14日</time>

<time>2014年4月14日</time>

<time>2014年10月27日</time>

<time>2013年5月8日</time>

<time>5月10日</time>

<time>二〇一五年二月二十七日</time>

<defendant>杨飞程</defendant>

<court>云南省丽江市中级人民法院</court>

<court>云南省高级人民法院</court>

<court>最高人民法院</court>

<procuratorate>丽江市人民检察院</procuratorate>

<location>云南省丽江市</location>

<location>云南省大理市</location>

<location>丽江市古城区</location>

<location>大理市</location>

</11273.txt>
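Since every value in result.txt is wrapped in a tag, reading the file back is one more small regex job. What follows is only a minimal sketch, assuming the layout shown above; `parse_result` and the shortened `sample` string are names I made up for illustration, not part of the assignment code.

```python
import re
from collections import defaultdict

# Shortened, hand-copied excerpt in the same layout as result.txt
sample = """<11273.txt>

<time>1991年7月3日</time>

<defendant>杨飞程</defendant>

<court>最高人民法院</court>

</11273.txt>
"""


def parse_result(text):
    """Group the tagged values by file name, then by tag (hypothetical helper)."""
    results = {}
    # One block per document: <name.txt> ... </name.txt>; re.S lets . cross newlines
    for block in re.finditer(r"<([^<>/]+\.txt)>(.*?)</\1>", text, flags=re.S):
        fname, body = block.group(1), block.group(2)
        entities = defaultdict(list)
        # Inside a block every entity sits on one line: <tag>value</tag>
        for tag, value in re.findall(r"<(\w+)>(.*?)</\1>", body):
            entities[tag].append(value)
        results[fname] = dict(entities)
    return results


parsed = parse_result(sample)
print(parsed["11273.txt"]["court"])
```

The backreference `\1` ties each closing tag to its opening tag, so a `<time>` value can never be closed by `</court>`.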

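re summary

I used quite a few regular expressions this time. Nothing very advanced, but I did pick up a few things, so here is a summary.

There are fairly complete introductions to regular expressions out there, but my own usage has its own points of emphasis, so this post carries on (lol).

A few basic functions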

re.compile(pattern, flags=0)

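Compiles a regular expression pattern into a regular expression object, which can then be used for matching through its match(), search() and other methods.

In practice there are two ways to call it:

    pattern.method(string)  or  re.method(pattern, string)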
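To make the two calling styles concrete, here is a small sketch on a made-up sentence (the sample text is my own, not taken from the 50 judgments). The module-level functions also accept an already compiled pattern, which is what `re.findall(pattern_i, lines)` in the program relies on.

```python
import re

text = "2013年5月8日，被告人杨飞程，"  # made-up sample sentence

pattern = re.compile("被告人(.{2,4}?)[,，]")

# Style 1: call the method on the compiled pattern object
via_object = pattern.findall(text)
# Style 2: call the module-level function with the pattern string
via_module = re.findall("被告人(.{2,4}?)[,，]", text)
# The module-level function also takes a compiled pattern
via_mixed = re.findall(pattern, text)

print(via_object, via_module, via_mixed)
```

All three calls give the same list. Compiling once mainly pays off when the same pattern is reused across many files, as in the loop over the judgments.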

Use alternation | to combine several matching rules into one pattern:

pattern_t = re.compile(
                '[0-9〇一二三四五六七八九]{4}年.{1,2}月.{1,3}日'
                '|同年.{1,2}月.{1,3}日'
                '|[0-9〇一二三四五六七八九]{4}年.{1,2}月.{1}旬'
                '|[0-9〇一二三四五六七八九]{4}年.{1,2}月底'
                '|[0-9〇一二三四五六七八九]{4}年.{1,2}月'
                '|[0-9〇一二三四五六七八九十]{1,2}月.{1,3}日')
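The order of the alternatives matters, which is why the full year-month-day form comes before the bare year-month form. A cut-down sketch with just two alternatives (simplified from pattern_t, digits only, sample date made up) shows what happens when they are swapped:

```python
import re

text = "2008年8月7日"

# Longer alternative first: the full date is matched
full_first = re.compile("[0-9]{4}年.{1,2}月.{1,3}日|[0-9]{4}年.{1,2}月")
# Shorter alternative first: it succeeds at once, so the day is never tried
short_first = re.compile("[0-9]{4}年.{1,2}月|[0-9]{4}年.{1,2}月.{1,3}日")

full_result = full_first.findall(text)
short_result = short_first.findall(text)

print(full_result)   # ['2008年8月7日']
print(short_result)  # ['2008年8月']
```

The regex engine does not look for the longest alternative. It takes the first one that succeeds at the current position, so the more specific forms have to be listed first.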

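re.findall(pattern, string, flags=0): returns a list of all matches, as strings, or as tuples when the pattern has more than one group

re.finditer(pattern, string, flags=0): returns an iterator that yields a match object for each match

The match iterator

I single this out because the iterator is really handy when working with groups: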
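What findall hands back depends on how many groups the pattern has, which is easy to trip over. A short sketch on a made-up sentence (the names 张三 and 李四 are placeholders):

```python
import re

text = "被告人张三，被上诉人：李四。"  # made-up sample sentence

# No group: findall returns the whole matches
no_group = re.findall("被告人.{2}", text)
# One group: findall returns only what the group captured
one_group = re.findall("被告人(.{2})", text)
# Two groups: findall returns tuples, with '' for the group that did not take part
two_groups = re.findall("被告人(.{2})，|被上诉人：(.+?)。", text)
# finditer yields match objects, so the whole match stays available as group(0)
whole = [m.group(0) for m in re.finditer("被告人(.{2})", text)]

print(no_group, one_group, two_groups, whole)
```

With several alternatives that each carry their own group, only one slot of each tuple is filled, so reading just the first slot loses the matches of the later alternatives.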

pattern_c = re.compile('[\n。,，《；](.{,15}?人民法院)')
_court_list = [name.group(1) for name in pattern_c.finditer(lines)]

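group(1) is the part of the match captured by the first group, and 2, 3, ... follow the same scheme; group(0) is the whole matched string, while the tuple holding every group's result comes from groups().

Greedy matching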
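A quick check of the match object methods with the court pattern, run on a made-up sentence rather than a real judgment:

```python
import re

pattern_c = re.compile("[\n。,，《；](.{,15}?人民法院)")
text = "经审理，云南省高级人民法院作出裁定。"  # made-up sample sentence

m = pattern_c.search(text)

whole_match = m.group(0)   # everything the pattern matched, delimiter included
first_group = m.group(1)   # only the parenthesised part
all_groups = m.groups()    # tuple with one entry per group

print(whole_match)
print(first_group)
print(all_groups)
```

Here group(0) still carries the leading comma that the character class consumed, which is exactly why the program reads group(1) instead.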

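Take this regular expression:

'审理(.+)指控'

To stop it from matching greedily, the ? has to come directly after the quantifier:

'审理(.+?)指控'

rather than at the end of the pattern:

'审理(.+)指控?'

where it only makes the final 控 optional and leaves .+ greedy.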
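The difference shows up as soon as one line contains the keywords twice. A minimal sketch on a made-up sentence (甲市, 乙市, 张三 and 李四 are placeholders), comparing the greedy pattern, the lazy pattern, and the variant with a trailing ?:

```python
import re

# Made-up single-line text with two 审理 ... 指控 spans
text = "审理甲市人民检察院指控张三犯罪，审理乙市人民检察院指控李四犯罪"

greedy = re.findall("审理(.+)指控", text)          # .+ runs to the LAST 指控
lazy = re.findall("审理(.+?)指控", text)           # .+? stops at the FIRST 指控
optional_tail = re.findall("审理(.+)指控?", text)  # ? applies to 控 only

print(greedy)
print(lazy)
print(optional_tail)
```

Only the lazy version returns the two procuratorate names separately. On this text the trailing-? variant captures the same single over-long string as the greedy one.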

Time: 2024-11-09 22:02:36
