PDF Data Extraction ------ 3. Parsing Demos

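1. Extracting key values from plain text strings in a PDF (done)

Overview: this is the most traditional and simplest kind of parsing. It mainly comes down to skilled use of regular expressions, both to recognize the phrase you are after and to validate it. The example below extracts a meeting date (circled in red in the original screenshot).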

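        string meetingData = GetMeetingData(pdfPath);

        // pdfPath: path of the PDF to scan. The original snippet passed an
        // undeclared variable to PDFDoc, so the path is made a parameter here.
        public string GetMeetingData(string pdfPath)
        {
            // Pass 1: coarse pattern, any fragment that mentions a meeting and contains a date.
            string patternAll = @"(?<NDAandCAMDate>会\s*议\s*.{2,15}\d{2,4}\s*年\s*\d{1,2}\s*月\s*\d{1,2}\s*日.{0,15})";
            PdfAnalyzer pa = new PdfAnalyzer();
            PDFNet.Initialize();
            PDFDoc doc = new PDFDoc(pdfPath);
            doc.InitSecurityHandler();
            List<PdfString> foundAll = pa.RegexSearchAllPages(doc, patternAll);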

            // Pass 2: precise patterns, ordered from most to least specific.
            // weekday is optional and accepts half- or full-width parentheses;
            // 日 and 天 are added so that 星期日 / 星期天 / 周日 are recognized as well.
            string date = @"(?<year>\d{2,4})年(?<month>\d{1,2})月(?<day>\d{1,2})日";
            string weekday = @"((\(|\（)(星期|周)(一|二|三|四|五|六|七|日|天)(\)|\）))?";
            string prefix = date + weekday;

            List<string> patternFilter = new List<string>();
            patternFilter.Add(prefix + @"(上午)?(?<hour>\d{1,2})(\:|点|时)(?<minute>\d{1,2})");
            patternFilter.Add(prefix + @"下午(?<hour>\d{1,2})(\:|点|时)(?<minute>\d{1,2})");
            patternFilter.Add(prefix + @"(上午)?(?<hour>\d{1,2})点半");
            patternFilter.Add(prefix + @"下午(?<hour>\d{1,2})点半");
            patternFilter.Add(prefix + @"(上午)?(?<hour>\d{1,2})(点|时)");
            patternFilter.Add(prefix + @"下午(?<hour>\d{1,2})(点|时)");
            patternFilter.Add(date);

            // Pass the list of filter patterns, not the coarse pattern string.
            return GetMeetingDateFilter(foundAll, patternFilter);
        }

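        private string GetMeetingDateFilter(List<PdfString> foundAll, List<string> patterns)
        {
            // Five spaces is the placeholder returned when no valid date is found.
            const string notFound = "     ";

            foreach (PdfString pdfString in foundAll)
            {
                // Strip spaces so that "2013 年 5 月" matches the compact patterns.
                string text = pdfString.ToString().Replace(" ", "");
                foreach (string pattern in patterns)
                {
                    Match ma = Regex.Match(text, pattern);
                    // IsValid (defined elsewhere) checks the captured groups.
                    // Assumption: the matched text itself is the value to return;
                    // the original returned the blank placeholder even on success.
                    if (ma.Success && IsValid(ma))
                        return ma.Value;
                }
            }
            return notFound;
        }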

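Notes:

a. The first pass, pa.RegexSearchAllPages(doc, patternAll), searches every page for strings that mention a meeting and contain a date.

b. The second pass runs the filter patterns over those candidates to extract the meeting date itself.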
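The two passes can be tried without any PDF library by running them on a plain string. The following is only a sketch under assumptions: MeetingDateDemo and ExtractMeetingDate are hypothetical names, the precise patterns are a shortened subset of the list above, digits are restricted to [0-9] and years to four digits so that int.Parse and DateTime accept them, and validation is delegated to the DateTime constructor in place of the IsValid helper, whose body is not shown in this post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MeetingDateDemo
{
    // Pass 1: coarse pattern, same shape as patternAll in the post.
    const string Coarse = @"会\s*议\s*.{2,15}\d{2,4}\s*年\s*\d{1,2}\s*月\s*\d{1,2}\s*日.{0,15}";

    // Pass 2: shortened subset of the precise patterns, most specific first.
    const string Date = @"(?<year>[0-9]{4})年(?<month>[0-9]{1,2})月(?<day>[0-9]{1,2})日";
    static readonly string[] Precise =
    {
        Date + @"下午(?<hour>[0-9]{1,2})(:|点|时)(?<minute>[0-9]{1,2})",
        Date + @"(上午)?(?<hour>[0-9]{1,2})(:|点|时)(?<minute>[0-9]{1,2})",
        Date,
    };

    // Returns the first valid meeting date found in text, or null if there is none.
    public static DateTime? ExtractMeetingDate(string text)
    {
        foreach (Match candidate in Regex.Matches(text, Coarse))
        {
            // Remove spaces so the compact precise patterns can match.
            string compact = candidate.Value.Replace(" ", "");
            foreach (string pattern in Precise)
            {
                Match m = Regex.Match(compact, pattern);
                if (!m.Success) continue;

                int year = int.Parse(m.Groups["year"].Value);
                int month = int.Parse(m.Groups["month"].Value);
                int day = int.Parse(m.Groups["day"].Value);
                int hour = m.Groups["hour"].Success ? int.Parse(m.Groups["hour"].Value) : 0;
                int minute = m.Groups["minute"].Success ? int.Parse(m.Groups["minute"].Value) : 0;
                // 下午 means "afternoon": convert to 24-hour time.
                if (pattern.Contains("下午") && hour < 12) hour += 12;

                try
                {
                    return new DateTime(year, month, day, hour, minute, 0);
                }
                catch (ArgumentOutOfRangeException)
                {
                    // Not a real calendar date (for example month 13): try the next pattern.
                }
            }
        }
        return null;
    }

    public static void Main()
    {
        string sample = "公司定于召开股东大会，会议时间为2013年5月20日下午2点30分，地点另行通知。";
        Console.WriteLine(ExtractMeetingDate(sample));
    }
}
```

Ordering the precise patterns from most to least specific matters: the date-only pattern would match every candidate, so it has to come last.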

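2. Extracting key values from table-like layouts in a PDF (done)

Overview: this format needs the wrapper types PdfString and PdfAnalyzer. Given a keyword, the data is extracted from a limited region around the keyword's position. The example below extracts a premium percentage (shown in the original screenshot).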

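        // Returns the premium (without the % sign) for the given RIC code,
        // or an empty string when it cannot be located.
        private string GetPremium(string path, string ricCode)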
        {
            string result = string.Empty;
            PDFDoc doc = null;
            try
            {
                PDFNet.Initialize();
                doc = new PDFDoc(path);
                doc.InitSecurityHandler();

                // No null check is needed here: a constructor never returns null.
                // A file that cannot be loaded surfaces as an exception and is
                // handled by the catch block at the end of this method.

                int x1 = 0;
                int y1 = 0;
                PdfAnalyzer pa = new PdfAnalyzer();
                List<PdfString> listX1 = pa.RegexSearchAllPages(doc, ricCode);
                // [Pp] matches either case; the original [P|p] also matched a literal '|'.
                List<PdfString> listY1 = pa.RegexSearchAllPages(doc, @"[Pp]remium");
                List<PdfString> listResult = pa.RegexSearchAllPages(doc, @"(?<Result>\d+\.\d+\%)");

                if (listX1.Count == 0 || listY1.Count == 0 || listResult.Count == 0)
                {
                    string msg = string.Format("({0}), ([Pp]remium) or the percentage value was not found, so premium is left empty.", ricCode);
                    Logger.Log(msg, Logger.LogType.Warning);
                    return result;
                }

                x1 = System.Convert.ToInt32(listX1[0].Position.x1);
                y1 = System.Convert.ToInt32(listY1[0].Position.y1);

                // Pick the first percentage whose top-left corner lies within 10 units
                // of the RIC code horizontally (x1) and of the "Premium" label vertically (y1).
                foreach (var item in listResult)
                {
                    int subX1 = Math.Abs(x1 - System.Convert.ToInt32(item.Position.x1));
                    int subY1 = Math.Abs(y1 - System.Convert.ToInt32(item.Position.y1));

                    if (subX1 <= 10 && subY1 <= 10)
                    {
                        result = item.ToString().Replace("%", "");
                        return result;
                    }
                }

                Logger.Log(string.Format("stock code:{0},extract premium failed .", ricCode), Logger.LogType.Error);
                return result;
            }
            catch (Exception ex)
            {
                // Pass ricCode as a format argument instead of concatenating it into the format string.
                string msg = string.Format("PDF analysis failed for {0}! Action: gearing and premium must be entered manually.\r\nError message: {1}", ricCode, ex.Message);
                Logger.Log(msg, Logger.LogType.Warning);
                return result;
            }
        }
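The lookup in GetPremium is essentially "find the value that shares its x coordinate with one anchor and its y coordinate with another". Here is a minimal sketch of that idea on in-memory data, so it runs without PDFNet. TextBox is a hypothetical stand-in for PdfString, and FindCell and the default tolerance of 10 units are illustrative choices taken from the snippet above, not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class AnchorLookupDemo
{
    // Hypothetical stand-in for PdfString: the text plus its top-left corner.
    public sealed class TextBox
    {
        public string Words { get; }
        public double X1 { get; }
        public double Y1 { get; }

        public TextBox(string words, double x1, double y1)
        {
            Words = words;
            X1 = x1;
            Y1 = y1;
        }
    }

    // Returns the text of the first candidate whose X1 is within tolerance of
    // xAnchor.X1 and whose Y1 is within tolerance of yAnchor.Y1, or null if none is.
    public static string FindCell(TextBox xAnchor, TextBox yAnchor,
                                  IEnumerable<TextBox> candidates, double tolerance = 10)
    {
        foreach (TextBox c in candidates)
        {
            bool sameX = Math.Abs(c.X1 - xAnchor.X1) <= tolerance;
            bool sameY = Math.Abs(c.Y1 - yAnchor.Y1) <= tolerance;
            if (sameX && sameY)
                return c.Words;
        }
        return null;
    }

    public static void Main()
    {
        // Made-up coordinates for illustration only.
        var ric = new TextBox("12345.HK", 300, 700);    // supplies the x coordinate
        var label = new TextBox("Premium", 40, 520);    // supplies the y coordinate
        var percentages = new List<TextBox>
        {
            new TextBox("3.20%", 300, 560),    // same x, different y
            new TextBox("12.50%", 302, 518),   // close to both anchors
            new TextBox("8.10%", 420, 520),    // same y, different x
        };
        Console.WriteLine(FindCell(ric, label, percentages));
    }
}
```

A fixed tolerance is simple but fragile: it has to be larger than the jitter in the coordinates and smaller than the spacing between neighboring cells, so it may need tuning per document layout.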

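3. Converting large amounts of PDF data to Excel (done)

Overview: this builds on section 2. It adds automatic fuzzy matching of row and column boundaries, and sorts the matches by position to extract the right data (the source table is shown in the original screenshot):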

private void StartExtractFile()
        {
            List<List<string>> bulkFileFilter = null;
            List<LineFound> bulkFile = null;
            PDFNet.Initialize();
            PDFDoc doc = new PDFDoc(config.FilePath1);
            doc.InitSecurityHandler();
            string patternTitle = @"コード";
            int page = 3;
            PdfString ricPosition = GetRicPosition(doc, patternTitle, page);
            if (ricPosition == null)
                return;

            string patternRic = @"\d{4}";
            string patternValue = @"(\-|\+)?\d+(\,|\.|\d)+";
            bulkFile = GetValue(doc, ricPosition, patternRic, patternValue);
            int indexOK = 0;
            bulkFileFilter = FilterBulkFile(bulkFile, indexOK);
            string filePath = Path.Combine(config.OutputFolder, string.Format("Type1ExtractedFromPdf{0}.csv", DateTime.Now.ToString("dd-MM-yyyy")));

            if (File.Exists(filePath))
                File.Delete(filePath);

            XlsOrCsvUtil.GenerateStringCsv(filePath, bulkFileFilter);
            AddResult(Path.GetFileNameWithoutExtension(filePath), filePath, "type1");
        }

        private List<List<string>> FilterBulkFile(List<LineFound> bulkFile, int indexOK)
        {
            List<List<string>> result = new List<List<string>>();

            if (bulkFile == null || bulkFile.Count == 0)
            {
                Logger.Log("no value data extract from pdf");
                return null;
            }
            int count = bulkFile[indexOK].LineData.Count;

            List<string> line = null;
            foreach (var item in bulkFile)
            {
                if (item.LineData == null || item.LineData.Count <= 0)
                    continue;

                line = new List<string>();
                if (item.LineData.Count == count)
                {
                    foreach (var value in item.LineData)
                    {
                        line.Add(value.Words.ToString());
                    }
                }
                else
                {
                    line.Add(item.LineData[0].Words.ToString());
                    for (int i = 1; i < count; i++)
                    {
                        line.Add(string.Empty);
                    }
                }
                result.Add(line);
            }

            return result;
        }

        private List<LineFound> GetValue(PDFDoc doc, PdfString ricPosition, string patternRic, string patternValue)
        {
            List<LineFound> bulkFile = new List<LineFound>();
            try
            {
                List<PdfString> ric = null;

                // Page numbers are 1-based, so use <= to include the last page.
                for (int i = 1; i <= doc.GetPageCount(); i++)
                {
                    ric = pa.RegexExtractByPositionWithPage(doc, patternRic, i, ricPosition.Position);
                    foreach (var item in ric)
                    {
                        LineFound lineFound = new LineFound();
                        lineFound.Ric = item.Words.ToString();
                        lineFound.Position = item.Position;
                        lineFound.PageNumber = i;
                        lineFound.LineData = pa.RegexExtractByPositionWithPage(doc, patternValue, i, item.Position, PositionRect.X2);
                        bulkFile.Add(lineFound);
                    }
                }
            }
            catch (Exception ex)
            {
                string msg = string.Format("\r\n         ClassName:  {0}\r\n         MethodName: {1}\r\n         Message:    {2}",
                                            System.Reflection.MethodBase.GetCurrentMethod().DeclaringType.ToString(),
                                            System.Reflection.MethodBase.GetCurrentMethod().Name,
                                            ex.Message);
                Logger.Log(msg, Logger.LogType.Error);
            }

            return bulkFile;
        }

        private PdfString GetRicPosition(PDFDoc doc, string pattern, int page)
        {
            try
            {
                List<PdfString> ricPosition = null;
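                // Use the pattern parameter instead of repeating the literal.
                ricPosition = pa.RegexSearchByPage(doc, pattern, page);
                if (ricPosition == null || ricPosition.Count == 0)
                {
                    // The original format string had three placeholders and no arguments, which throws FormatException.
                    Logger.Log(string.Format("no RIC column title found with pattern {0} on page {1} of the PDF", pattern, page));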
                    return null;
                }

                return ricPosition[0];
            }
            catch (Exception ex)
            {
                string msg = string.Format("\r\n         ClassName:  {0}\r\n         MethodName: {1}\r\n         Message:    {2}",
                                            System.Reflection.MethodBase.GetCurrentMethod().DeclaringType.ToString(),
                                            System.Reflection.MethodBase.GetCurrentMethod().Name,
                                            ex.Message);
                Logger.Log(msg, Logger.LogType.Error);
                throw;
            }
        }
    }

    struct LineFound
    {
        public string Ric { get; set; }
        public Rect Position { get; set; }
        public int PageNumber { get; set; }
        public List<PdfString> LineData { get; set; }
    }

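Notes:

a. Coordinates in a PDF are relative to the page, so the data has to be parsed and extracted page by page.

b. The rough idea: first locate the column title "コード" (Japanese for "code"), then use its position to collect the list of RICs on each page (fetch the column and sort it).

c. For each RIC in that column, fetch the values on the same row (fetch and sort), and combine the rows into a table.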
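Steps b and c amount to grouping text fragments into rows by their vertical position and then ordering each row from left to right. The following sketch shows that grouping on in-memory data. Cell is a hypothetical stand-in for PdfString, ToRows and the default tolerance of 3 units are illustrative, and the sketch assumes the default PDF coordinate system, in which y grows upward so a larger Y is higher on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowGroupingDemo
{
    // Hypothetical stand-in for PdfString: text plus the position of one corner.
    public sealed class Cell
    {
        public string Text { get; }
        public double X { get; }
        public double Y { get; }

        public Cell(string text, double x, double y)
        {
            Text = text;
            X = x;
            Y = y;
        }
    }

    // Groups cells into rows (top to bottom) and sorts each row left to right.
    // A cell joins the current row when its Y is within yTolerance of the
    // first cell of that row; otherwise it starts a new row.
    public static List<List<string>> ToRows(IEnumerable<Cell> cells, double yTolerance = 3)
    {
        var rows = new List<List<Cell>>();
        foreach (Cell cell in cells.OrderByDescending(c => c.Y))
        {
            List<Cell> current = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Y - cell.Y) <= yTolerance)
                current.Add(cell);
            else
                rows.Add(new List<Cell> { cell });
        }
        return rows
            .Select(row => row.OrderBy(c => c.X).Select(c => c.Text).ToList())
            .ToList();
    }

    public static void Main()
    {
        // Made-up coordinates: two table rows whose fragments arrive unordered.
        var cells = new List<Cell>
        {
            new Cell("987", 120, 680), new Cell("1301", 50, 700),
            new Cell("-5.6", 200, 699), new Cell("1332", 50, 680),
            new Cell("1,234", 120, 701), new Cell("+7.8", 200, 681),
        };
        foreach (List<string> row in ToRows(cells))
            Console.WriteLine(string.Join(",", row));
    }
}
```

Like the code in the post, this has to be applied one page at a time, because coordinates from different pages are not comparable.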

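Improvements:

This part still needs manual intervention in the code (for example the title pattern and the page number are hard-coded). The next step is to add automatic recognition, so that large amounts of PDF data can be assembled into tables from position information alone.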

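4. PDFs whose data is stored as images (not done)

Thoughts: I do not have a good way to handle this kind of PDF yet. It probably requires image recognition algorithms, and for now I am out of my depth with this format. I hope someone more experienced can offer some suggestions.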

Time: 2024-11-17 07:03:53
