Data analysis system

A data analysis system, in particular a system capable of efficiently analyzing big data, is provided. The data analysis system includes an analyst server, at least one data storage unit, a client terminal independent of the analyst server, and a caching device independent of the analyst server. The caching device includes a cache memory, a data transmission interface, and a controller. The controller obtains a data access pattern of the client terminal with respect to the at least one data storage unit, performs caching operations on the at least one data storage unit according to a caching criterion to obtain and store cache data in the cache memory, and sends the cache data to the analyst server via the data transmission interface. The analyst server analyzes the cache data to generate an analysis result, which may be used to request a change in the caching criterion.

BACKGROUND

1. Field of the Invention

The present invention relates to data analysis systems, and more particularly, to a system for analyzing big data according to caching criteria of a caching device.

2. Background of the Related Art

With information devices in wide use, data sources are becoming ever more abundant. In addition to conventional manual input and system computation, data is generated at every moment by the Internet, cloud computing, the rapid development of mobile computing and the Internet of Things (IoT), and ubiquitous mobile devices, RFID tags, and wireless sensors.

Big data cannot work by itself. A large storage unit is required to provide sufficient data storage space. A caching device, especially a solid-state storage device, typically stores replicas of data held in the large storage unit (for example, a hard disk drive) to speed up the system's data access.

BRIEF SUMMARY

One embodiment of the present invention provides a data analysis system comprising an analyst server, at least one data storage unit, a client terminal independent of the analyst server, and a caching device independent of the analyst server. The caching device comprises a cache memory, a data transmission interface, and a controller in communication with the analyst server, the client terminal, and the storage unit. The controller obtains a data access pattern of the client terminal with respect to the storage unit and performs caching operations on the storage unit according to a caching criterion to obtain and store cache data in the cache memory and send the cache data to the analyst server via the data transmission interface, thereby allowing the analyst server to analyze the cache data and generate an analysis result.

Another embodiment of the present invention provides a caching device comprising a cache memory, a data transmission interface, and a controller connected to the cache memory and the data transmission interface. The controller obtains a data access pattern of a client terminal with respect to a storage unit and performs caching operations on the storage unit according to a caching criterion to obtain and store cache data in the cache memory and send the cache data to an analyst server via the data transmission interface.

Yet another embodiment of the present invention provides a data processing method comprising: (a) obtaining a data access pattern of a client terminal with respect to a data storage unit, (b) performing caching operations on the data storage unit according to a caching criterion to obtain cache data and store it in a cache memory, and (c) sending the cache data to an analyst server via a data transmission interface so that the analyst server can analyze the cache data and generate an analysis result.

DETAILED DESCRIPTION

Embodiments of the present invention select useful information from big data in a short period of time and provide methods and tools to analyze the information thus selected. For example, highway traffic can be smoothed promptly by quickly identifying a key section of a road rather than the road in its entirety, analyzing its traffic flow data, and allocating lanes accordingly.

Instead of analyzing all the data in a storage device directly, the present invention enables a caching device to monitor, in real time, a data access pattern of a client terminal with respect to the storage device, to cache appropriate or crucial data replicas from the storage device according to caching criteria chosen to meet a wide variety of data analysis objectives and needs, and to send out the data replicas to serve as samples for data analysis.

For example, if hot data is regarded as a caching criterion, then the caching device will retrieve and send the hot data to the analyst server for analysis. The hot data includes, for example, video, personal or corporate data, or stock-related data that is intensively accessed within a fixed period of time. Afterward, the characteristics of the hot data are used to set operating policy, for example, placing popular video data at a server near the client terminal to enhance performance and service quality.

According to an embodiment of the present invention, a data analysis system comprises an analyst server, at least one data storage unit, a client terminal independent of the analyst server, and a caching device independent of the analyst server. The caching device further comprises a cache memory, a data transmission interface, and a controller connected to the analyst server, the client terminal, and the storage unit. The controller obtains a data access pattern of the client terminal with respect to the at least one data storage unit, performs caching operations on the at least one data storage unit according to a caching criterion to obtain and store cache data in the cache memory, and sends the cache data to the analyst server via the data transmission interface, such that the analyst server analyzes the cache data to generate an analysis result.

In another embodiment, the present invention further provides a caching device for use in the data analysis system and a data processing method for use with the caching device.

Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present invention should be or are in any single embodiment of the invention. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present invention. Thus, discussion of the features and advantages, and similar language, throughout this specification may, but does not necessarily, refer to the same embodiment.

Furthermore, the described features, advantages, and characteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize that the invention may be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the invention.

Referring now to FIG. 1 through FIG. 3, computer systems, methods, and computer program products are illustrated as structural or functional block diagrams or process flowcharts according to various embodiments of the present invention. The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

<Data Analysis System>

FIG. 1 is a block diagram of a data analysis system 10 according to an embodiment of the present invention. The data analysis system 10 comprises an analyst server 100, a client terminal 102, a storage unit 104, and a caching device 106. FIG. 1 does not restrict the number of analyst servers, storage units, client terminals, or caching devices in the data analysis system of the present invention.

The analyst server 100 is a server, for example, IBM's System X, Blade Center or eServer server, which has programs for executing data analytic applications, such as Microsoft's SQL Server products.

The client terminal 102 is independent of the analyst server 100 and is exemplified by a personal computer, a mobile device, or another server, which does not limit the present invention.

The storage unit 104 may, for example, be in the form of a network-attached storage (NAS), a storage area network (SAN), or a direct attached storage (DAS) to enable the client terminal 102 to perform data access. Alternatively, the storage unit 104 can be directly connected to the client terminal 102 to function as a local device for use with the client terminal 102; the present invention is not limited in this respect.

The caching device 106 is also independent of the analyst server 100. Related details are described below in conjunction with FIG. 2.

The analyst server 100, the client terminal 102, the storage unit 104, and the caching device 106 are linked, as needed, by a local bus, a local area network, the Internet, or any other data transmission channel to perform data communication. In a preferred embodiment, the caching device 106 is directly linked to the storage unit 104 via a local bus (not shown). To enhance stability and security, the analyst server 100 is independent of the client terminal 102, the storage unit 104, and the caching device 106.

<Caching Device>

FIG. 2 is a block diagram of the caching device 106 in accordance with one embodiment. The caching device 106 further comprises a cache memory 200, a controller 202, and a data transmission interface 204. Preferably, the cache memory 200 is a solid-state memory (for example, a flash memory) which reads and writes data faster than the storage unit 104 does, though the present invention is not limited thereto. The cache memory 200 may alternatively be in the form of a hard disk drive or any other storage device. The cache memory 200 and the controller 202 are linked, as needed, by a local bus, a local area network, the Internet, or any other data transmission channel to perform data communication.

The controller 202 is able to perform conventional caching operations and stores cache data (that is, replicas of specific data in the storage unit 104) in the cache memory 200. Hence, the client terminal 102 (as shown in FIG. 1) reads and writes data directly in the cache memory 200 instead of going through the slower storage unit 104. The improvements of the controller 202 over its conventional counterparts are described below in conjunction with the flow chart of FIG. 3.
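The conventional caching behavior described above can be sketched as a read-through cache. This is only an illustration under stated assumptions, not the controller of the invention: the class name `ReadThroughCache`, the dictionary standing in for the storage unit 104, and the least-recently-used eviction policy are all introduced here and do not appear in the specification.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Hypothetical stand-in for cache memory 200 in front of storage unit 104."""

    def __init__(self, storage, capacity):
        self.storage = storage          # slow backing store (a dict here, by assumption)
        self.capacity = capacity        # maximum number of cached replicas
        self.cache = OrderedDict()      # address -> replica, least recently used first
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.cache:
            self.hits += 1
            self.cache.move_to_end(address)   # mark as most recently used
            return self.cache[address]
        self.misses += 1
        value = self.storage[address]         # slow path: go to the storage unit
        self.cache[address] = value           # keep a replica for later reads
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used replica
        return value


storage_unit = {addr: f"block-{addr}" for addr in range(10)}
cache = ReadThroughCache(storage_unit, capacity=2)
for addr in (1, 2, 1, 3, 1):
    cache.read(addr)
```

Repeated reads of address 1 are served from the cache, while address 2 is evicted once a third address arrives.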

<Caching Criteria>

Step 300: the controller 202 monitors how the client terminal 102 performs data access to the storage unit 104 within a given period and calculates a data access pattern, e.g., access frequency. In this embodiment, the data access pattern is provided as a log of data access performed by the client terminal 102 to the storage unit 104 within a given period, and thus those portions of the data access pattern which are not related to the present invention are omitted.
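As a rough sketch of step 300, an access log can be reduced to a per-address access frequency over a monitoring period. The record layout (`AccessRecord` with `timestamp`, `address`, and `size` fields) and the function name `access_frequency` are hypothetical; the specification does not define a log format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRecord:
    """One hypothetical log entry for an access by client terminal 102."""
    timestamp: float   # seconds since the start of monitoring
    address: int       # address of the data accessed in storage unit 104
    size: int          # bytes transferred


def access_frequency(log, start, end):
    """Count accesses per address within the period [start, end)."""
    return Counter(r.address for r in log if start <= r.timestamp < end)


log = [
    AccessRecord(0.5, 0x10, 4096),
    AccessRecord(1.0, 0x20, 512),
    AccessRecord(1.5, 0x10, 4096),
    AccessRecord(9.0, 0x10, 4096),   # outside the monitored period below
]
pattern = access_frequency(log, start=0.0, end=5.0)
```

Here `pattern` maps each address to the number of accesses observed in the first five seconds, which is one possible form of the data access pattern.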

Step 302: in this step, the controller 202 performs caching operations on the storage unit 104 according to a caching criterion so as to obtain cache data (that is, replicas of specific data in the storage unit 104) and store the cache data in the cache memory 200.

In an embodiment, a caching criterion may relate to a given access frequency, and thus cache data may be defined as data (i.e., hot data) acquired as a result of access by the client terminal 102 to the storage unit 104 within a given period when the access frequency exceeds a given value. Alternatively, cache data may be defined as data (i.e., cold data) acquired at an access frequency below a given value. Likewise, it is also feasible to set the caching criterion to a given range of access frequency.
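The frequency-based criterion can be illustrated with a small selector over access counts. The function name `select_by_frequency`, its `low` and `high` parameters, and the sample counts are assumptions made for this sketch.

```python
def select_by_frequency(frequency, low=None, high=None):
    """Return keys whose access count lies in the requested range.

    low  -- keep keys with a count strictly above this value (hot data)
    high -- keep keys with a count strictly below this value (cold data)
    Supplying both bounds selects a range of access frequency.
    """
    selected = []
    for key, count in frequency.items():
        if low is not None and count <= low:
            continue
        if high is not None and count >= high:
            continue
        selected.append(key)
    return sorted(selected)


frequency = {"a": 120, "b": 3, "c": 45, "d": 0}   # made-up access counts
hot = select_by_frequency(frequency, low=100)
cold = select_by_frequency(frequency, high=5)
warm = select_by_frequency(frequency, low=5, high=100)
```

A single function covers the hot, cold, and range variants because each is just a different choice of bounds.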

In another embodiment, a caching criterion may relate to a given access sequence. For example, cache data may be defined as data, which consists of the latest 1000 pieces of data or the earliest 500 pieces of data, acquired as a result of access by the client terminal 102 to the storage unit 104. Likewise, it is feasible to set the caching criterion to a given range of access sequence.
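If the accesses are kept in order of occurrence, a sequence-based criterion reduces to slicing. The helper names `latest`, `earliest`, and `sequence_range` below are hypothetical, and the list of item names is made-up sample data.

```python
def latest(accesses, n):
    """The n most recently accessed items, in access order."""
    return accesses[-n:] if n > 0 else []


def earliest(accesses, n):
    """The n earliest accessed items, in access order."""
    return accesses[:n]


def sequence_range(accesses, first, last):
    """Items whose position in the access sequence lies in [first, last)."""
    return accesses[first:last]


accesses = [f"item-{i}" for i in range(2000)]   # index 0 is the earliest access
newest_1000 = latest(accesses, 1000)
oldest_500 = earliest(accesses, 500)
```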

In yet another embodiment, a caching criterion may relate to a given access period. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104 before or after a specific point in time. Likewise, it is feasible to set the caching criterion to a given range of access period.
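A period-based criterion can be sketched as a filter on access timestamps. The function `select_by_period`, its `after` and `before` parameters, and the sample file names are assumptions for illustration only.

```python
from datetime import datetime


def select_by_period(accesses, after=None, before=None):
    """Keep keys accessed strictly after `after` and/or strictly before `before`."""
    return [
        key
        for timestamp, key in accesses
        if (after is None or timestamp > after)
        and (before is None or timestamp < before)
    ]


accesses = [
    (datetime(2014, 1, 1, 9, 0), "report.csv"),
    (datetime(2014, 1, 1, 12, 0), "video.mp4"),
    (datetime(2014, 1, 1, 18, 0), "quotes.db"),
]
noon = datetime(2014, 1, 1, 12, 0)
morning = select_by_period(accesses, before=noon)
evening = select_by_period(accesses, after=noon)
```

Passing both bounds gives the range-of-access-period variant mentioned above.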

In a further embodiment, a caching criterion may relate to a given data address. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104 at a given data address. Likewise, it is feasible to set the caching criterion to a given range of data addresses.

In a still further embodiment, a caching criterion may relate to a given data size. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104, wherein the size of the data acquired is larger or smaller than a given data size. Likewise, it is feasible to set the caching criterion to a given range of data size.

In another embodiment, a caching criterion may relate to a given string. For example, cache data may be defined as data acquired as a result of access by the client terminal 102 to the storage unit 104, wherein the data acquired contains a given string. Likewise, it is feasible to set the caching criterion to any particular combination of strings.
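The address, size, and string criteria of the preceding paragraphs can each be expressed as a predicate, and predicates can be combined. The sketch below is one possible arrangement; the helper names (`address_in`, `size_over`, `contains`, `all_of`) and the dictionary layout of an item are hypothetical.

```python
def address_in(low, high):
    """Criterion: the data address lies in [low, high)."""
    return lambda item: low <= item["address"] < high


def size_over(limit):
    """Criterion: the payload is larger than `limit` bytes."""
    return lambda item: len(item["data"]) > limit


def contains(text):
    """Criterion: the payload contains the given string."""
    needle = text.encode("utf-8")
    return lambda item: needle in item["data"]


def all_of(*criteria):
    """Combine criteria; an item qualifies only if every criterion accepts it."""
    return lambda item: all(criterion(item) for criterion in criteria)


# Made-up sample items standing in for data accessed in storage unit 104.
items = [
    {"address": 0x0100, "data": b"ticker=IBM price=190"},
    {"address": 0x0200, "data": b"ticker=MSFT"},
    {"address": 0x9000, "data": b"ticker=IBM price=191 volume=5000"},
]
criterion = all_of(address_in(0x0000, 0x1000), contains("IBM"))
cached = [item["address"] for item in items if criterion(item)]
```

Treating each criterion as a small function keeps the controller logic independent of which criterion is currently in force.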

In an additional embodiment, a caching criterion may relate to a given value of at least a parameter contained in the data access pattern. Hence, in step 300, the caching criterion may be defined as a given value of a parameter available in the data access pattern calculated by the controller 202. For example, if the data access pattern comprises a data-related file name, a given file name can function as the caching criterion.

Step 302 does not necessarily follow step 300. Step 300 and step 302 can take place simultaneously, provided that cache data in step 302 is acquired after step 300.

Step 304: the controller 202 sends cache data stored in the cache memory 200 to the analyst server 100 via the data transmission interface 204. If the caching device 106 is mounted on a motherboard (not shown), the data transmission interface 204 can be a PCI-e interface or an InfiniBand interface.

Step 306: the analyst server 100 analyzes cache data to generate an analysis result. For example, an analysis result may be generated using SQL Server products of Microsoft Corporation, which are applicable to data mining as described in "Predictive Analysis with SQL Server 2008", a White Paper published by Microsoft Corporation. The present invention does not restrict the way in which cache data is analyzed.

Step 308: optionally, the analyst server 100 sends an instruction to the controller 202 to change the caching criterion, and then the process flow of the method goes back to step 300, or to step 302 if the data access pattern need not be updated. Afterward, the process flow of the method proceeds to steps 304-306.
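The loop formed by steps 300 through 308 can be sketched as follows. Everything here is a toy under stated assumptions: the frequency threshold as the caching criterion, the `analyze` rule that asks for a stricter threshold when the sample is too large, and the instruction format are all invented for illustration and are not taken from the specification.

```python
from collections import Counter


def cache_step(log, threshold):
    """Steps 300-302: derive access counts, then keep keys above the threshold."""
    counts = Counter(log)
    return {key: n for key, n in counts.items() if n > threshold}


def analyze(cache_data, target_size):
    """Steps 306-308: a toy analysis that requests a stricter criterion
    when the sample is larger than wanted; otherwise requests no change."""
    if len(cache_data) > target_size:
        return {"raise_threshold_by": 1}   # hypothetical instruction format
    return None


def run(log, threshold, target_size, max_rounds=10):
    for _ in range(max_rounds):
        cache_data = cache_step(log, threshold)          # steps 300-302
        instruction = analyze(cache_data, target_size)   # steps 304-306
        if instruction is None:                          # step 308 is optional
            break
        threshold += instruction["raise_threshold_by"]   # apply the new criterion
    return threshold, cache_data


log = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]   # made-up access log
final_threshold, sample = run(log, threshold=0, target_size=2)
```

In this run the threshold is raised twice before the sample shrinks to the two most frequently accessed keys.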

SRC=https://www.google.com.hk/patents/US20140068180
