Data Flow ->> Fuzzy Lookup & Fuzzy Grouping

Both of these transformations serve the same purpose: data cleansing.

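Fuzzy Lookup matches input values against similar values in another database table (the reference table) or in an index built on that table. It is useful for standardizing and correcting client-supplied data that may contain errors, for example attribute columns such as addresses or city names.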

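Besides the matched value, Fuzzy Lookup also outputs two extra columns: similarity and confidence. Similarity is a floating-point value between 0 and 1 that indicates how alike the input value and the matched value are. For example, the similarity between Jerry Chan and Jerry Chen might be 0.89. Confidence indicates how likely the match is to be the best one available: the higher the value, the fewer alternative candidate matches there are.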
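To make the two output columns concrete, here is a minimal Python sketch. It is not the algorithm SSIS uses: similarity is approximated with `difflib.SequenceMatcher` from the standard library, and confidence is an assumed heuristic (the gap between the best and second-best score), chosen only to illustrate the idea that fewer close alternatives means higher confidence. The function names and the sample reference list are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score (a stand-in for SSIS's own metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(value: str, reference: list[str]) -> tuple[str, float, float]:
    """Return (best_match, similarity, confidence) for one input value."""
    # Score every reference value and sort so the best candidate comes first.
    scored = sorted(((similarity(value, r), r) for r in reference), reverse=True)
    best_score, best = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Assumed heuristic: confidence grows as the best match stands out
    # more clearly from the next-best candidate.
    confidence = best_score - runner_up
    return best, best_score, confidence


# Hypothetical reference table with a single name column.
reference_names = ["Jerry Chen", "Terry Chang", "Mary Chen"]
match, sim, conf = fuzzy_lookup("Jerry Chan", reference_names)
print(match, round(sim, 2), round(conf, 2))
```

In this sketch "Jerry Chan" resolves to "Jerry Chen" with a high similarity, but the confidence is low because "Terry Chang" is almost as close. That is the situation the confidence column is meant to flag.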

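Fuzzy Lookup offers four ways to configure the index on the reference table:

1) Generate New Index: builds a temporary index on the reference columns of the reference table, uses it for matching, and deletes it when the task completes. (Microsoft's documentation describes these temporary objects as being created in the SQL Server database that the transformation connects to, not purely in memory.)

2) Generate New Index + Store New Index: builds the index and persists it in the database so that it can be reused later.

3) Generate New Index + Store New Index + Maintain Stored Index: with Maintain Stored Index checked, triggers are created on the reference table to capture changes and keep the stored index in sync.

4) Use Existing Index: reuses an index that was stored by an earlier run.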
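The four choices above amount to a small decision table over three checkboxes. The sketch below models that table in Python. `IndexMode` and `choose_index_mode` are hypothetical names made up for illustration and are not part of any SSIS API. The rule that Maintain Stored Index requires Store New Index is an assumption based on how the options are listed.

```python
from enum import Enum


class IndexMode(Enum):
    """Hypothetical labels for the four reference-table options."""
    TEMPORARY = "generate a new index and drop it afterwards"
    STORED = "generate a new index and store it"
    STORED_AND_MAINTAINED = "generate, store, and keep in sync via triggers"
    EXISTING = "reuse a previously stored index"


def choose_index_mode(generate_new: bool,
                      store: bool = False,
                      maintain: bool = False) -> IndexMode:
    """Map the checkbox combination to one of the four options."""
    if not generate_new:
        return IndexMode.EXISTING
    if maintain and not store:
        # Assumption: an index can only be maintained if it is also stored.
        raise ValueError("Maintain Stored Index requires Store New Index")
    if maintain:
        return IndexMode.STORED_AND_MAINTAINED
    if store:
        return IndexMode.STORED
    return IndexMode.TEMPORARY


print(choose_index_mode(generate_new=True, store=True).name)
```

A reasonable rule of thumb under this model: use the temporary mode for one-off loads, a stored index when the reference table rarely changes, and the maintained index when the reference table is updated between runs.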

Fuzzy Lookup Transformation: Capable of joining to external data based on data similarity,
the Fuzzy Lookup Transformation is a core data cleansing tool in SSIS. This transformation
is perfect if you have dirty data input that you want to associate to data in a table in your
database based on similar values. Later in the chapter, you’ll take a look at the details of the
Fuzzy Lookup Transformation and what happens behind the scenes.

Fuzzy Grouping Transformation: The main purpose is de-duplication of similar data. The
Fuzzy Grouping Transformation is ideal if you have data from a single source and you know
you have duplicates that you need to find.
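The de-duplication idea behind Fuzzy Grouping can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SSIS implementation: it uses `difflib.SequenceMatcher` as the similarity measure, a greedy single pass, and a hypothetical threshold of 0.8. The first value seen in each group is treated as the canonical one.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score (a stand-in for SSIS's own metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_group(values: list[str], threshold: float = 0.8):
    """Greedy grouping: a value joins the first group whose canonical
    value is at least `threshold` similar, otherwise it starts a new group."""
    groups: list[tuple[str, list[str]]] = []  # (canonical value, members)
    for value in values:
        for canonical, members in groups:
            if similarity(value, canonical) >= threshold:
                members.append(value)
                break
        else:
            # No existing group was close enough: this value becomes canonical.
            groups.append((value, [value]))
    return groups


# Hypothetical single-source data containing near-duplicates.
cities = ["Seattle", "Seatle", "New York", "New Yrok", "Boston"]
for canonical, members in fuzzy_group(cities):
    print(canonical, "<-", members)
```

Unlike the lookup case, no reference table is involved here: rows from a single source are compared with each other, which is why this transformation fits the "find my duplicates" scenario described above.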

Time: 2024-08-09 17:05:30
