http://www.semantikoz.com/blog/lambda-architecture-velocity-volume-big-data-hadoop-storm/
Big data architectures are commonly divided into two (supposedly) diametrically opposed models: the more traditional batch processing and (near) real-time processing. The most popular technologies representing the two are Hadoop with MapReduce and Storm. However, a hybrid solution, the Lambda Architecture, challenges the idea that these approaches have to be mutually exclusive. The Lambda Architecture combines a slow and a fast lane of data processing to achieve the best of both worlds: fast results and deep, large-scale processing.
Usually one architecture or the other has been implemented because of a business requirement. Commonly, business users or customers eventually arrive at the point where they would like either a more historic view or more real-time insight, neither of which the deployed architecture can provide. At this point a hybrid solution becomes the only realistic option, and one that brings some surprising benefits with it.
Lambda Architecture explained
The Lambda Architecture receives data centrally and does as little processing as possible before copying the data stream and splitting it between the real-time and batch layers. The batch layer collects the data in its raw form in a data sink such as HDFS or S3. Hadoop jobs regularly process the data and write the results to a data store.
Lambda architecture duplicates incoming data and processes them in parallel at different speeds
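The copy-and-split step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Dispatcher` class, the in-memory `raw_log` list standing in for an append-only file in HDFS or S3, and the `speed_inbox` list standing in for a stream processor are all hypothetical names, not part of Hadoop or Storm.

```python
import json
from typing import Callable, Dict, List


class Dispatcher:
    """Hypothetical entry point: receives events and copies each to both layers."""

    def __init__(self, batch_sink: List[str], speed_handler: Callable[[Dict], None]):
        self.batch_sink = batch_sink        # stand-in for an append-only raw data sink
        self.speed_handler = speed_handler  # stand-in for the real-time layer's input

    def receive(self, event: Dict) -> None:
        # As little processing as possible: serialise the raw event unchanged.
        self.batch_sink.append(json.dumps(event, sort_keys=True))
        # Hand the same event to the real-time layer.
        self.speed_handler(event)


raw_log: List[str] = []
speed_inbox: List[Dict] = []
dispatcher = Dispatcher(raw_log, speed_inbox.append)
dispatcher.receive({"ts": 1, "page": "/home"})
dispatcher.receive({"ts": 2, "page": "/about"})
```

Both layers end up with every event, and neither depends on the other having processed it.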
Since this process is fully batched, the data store can be simplified
significantly. It has to support random reads, i.e. it needs some kind
of index, but it can do away with random writes, locking, and
consistency issues. An example of such a system is ElephantDB.
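To make the simplification concrete, here is a minimal sketch of a store that is written once in bulk and then only read. It is not ElephantDB's API; `BatchView` is a hypothetical class, and it assumes keys are unique within one batch run.

```python
from bisect import bisect_left
from typing import Iterable, List, Optional, Tuple


class BatchView:
    """Hypothetical read-only view: built in one pass, never updated in place."""

    def __init__(self, pairs: Iterable[Tuple[str, int]]):
        ordered = sorted(pairs)  # the sorted key list serves as the index
        self._keys: List[str] = [k for k, _ in ordered]
        self._values: List[int] = [v for _, v in ordered]

    def get(self, key: str) -> Optional[int]:
        # Random read via binary search; no locks needed because nothing mutates.
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None


view = BatchView([("/home", 120), ("/about", 45)])
home_hits = view.get("/home")
missing = view.get("/missing")
```

There is deliberately no `put` method: a new batch run produces a whole new `BatchView` instead of modifying the old one.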
The problem with batch processing is the time it takes. For example,
the above process may take hours or days. In the meantime data has been
arriving, and subsequent processes or services continue to work with
information that is hours or days old. The real-time layer solves this
by taking its copy of the data, processing it in seconds or minutes,
and storing the results in a fast random-read and random-write store.
This store is more complex since it has to be constantly updated.
The complexity of the real-time layer and its store is manageable
since it only has to store and serve a sliding window of data, which
needs to be roughly as long as the batch process takes. The results of
both layers are merged, and real-time information is replaced by batch
layer data as soon as the latter is available. In many cases this
allows the real-time process to work with good approximations, since
its results are replaced by highly precise data within a short period.
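The merge can be illustrated with a small page-count example. This is a sketch under assumptions, not a prescribed implementation: `SpeedView`, `merged_count`, and the integer timestamps are hypothetical, and `horizon` denotes the timestamp up to which the latest batch run has processed data.

```python
from typing import Dict, List, Tuple


class SpeedView:
    """Hypothetical real-time view holding only a sliding window of events."""

    def __init__(self) -> None:
        self._events: List[Tuple[int, str]] = []

    def update(self, ts: int, key: str) -> None:
        self._events.append((ts, key))

    def expire(self, horizon: int) -> None:
        # Drop everything the batch layer has already covered.
        self._events = [(ts, k) for ts, k in self._events if ts > horizon]

    def count(self, key: str, horizon: int) -> int:
        return sum(1 for ts, k in self._events if k == key and ts > horizon)


def merged_count(key: str, batch_view: Dict[str, int], horizon: int, speed: SpeedView) -> int:
    # Batch data wins up to the horizon; real-time data fills the gap after it.
    return batch_view.get(key, 0) + speed.count(key, horizon)


speed = SpeedView()
for ts, page in [(1, "/home"), (2, "/about"), (3, "/home"), (4, "/home"), (5, "/about")]:
    speed.update(ts, page)

batch_view = {"/home": 2, "/about": 1}  # result of a batch run covering ts <= 3
home_total = merged_count("/home", batch_view, 3, speed)
speed.expire(3)  # the window shrinks once the batch result is live
home_total_after_expire = merged_count("/home", batch_view, 3, speed)
```

Expiring the covered events does not change the merged answer, which is why the real-time store only ever needs to hold about one batch cycle's worth of data.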
Lambda Architecture benefits
The addition of another layer to an architecture has major
advantages. Firstly, historic data can be processed with high
precision and involved algorithms without losing the short-term
information, alerts, and insights provided by the real-time layer.
Secondly, the cost of the additional layer is offset by dramatically
reduced random-write storage requirements. The batch-written storage
also provides the option to switch data at predefined times and to
version data.
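Switching and versioning of batch-written data might look like the following sketch. `VersionedBatchStore` and the date-style version labels are hypothetical and chosen only for illustration; real stores expose this differently.

```python
from typing import Dict, Optional


class VersionedBatchStore:
    """Hypothetical store that keeps whole batch views side by side."""

    def __init__(self) -> None:
        self._versions: Dict[str, Dict[str, int]] = {}
        self._live: Optional[str] = None

    def publish(self, version: str, view: Dict[str, int]) -> None:
        # Each batch run writes a complete new view; older versions stay untouched.
        self._versions[version] = dict(view)

    def switch_to(self, version: str) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown version: {version}")
        self._live = version  # readers see the new view from this point on

    def get(self, key: str) -> Optional[int]:
        if self._live is None:
            return None
        return self._versions[self._live].get(key)


store = VersionedBatchStore()
store.publish("2024-01-01", {"/home": 100})
store.switch_to("2024-01-01")
store.publish("2024-01-02", {"/home": 130})  # published, but not yet live
before_switch = store.get("/home")
store.switch_to("2024-01-02")
after_switch = store.get("/home")
store.switch_to("2024-01-01")  # rolling back is just another switch
rolled_back = store.get("/home")
```

Because a view is never edited after publication, the switch can happen at a time of the operator's choosing, and an earlier version remains available for rollback.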
Lastly and importantly, the data sink of raw data offers the option to
recover from human mistakes, e.g. deployed bugs that write erroneous
aggregated data, from which other architectures cannot recover.
Another option is to retrospectively enhance data extraction or
learning algorithms and apply them to the whole of the historic
dataset. This is extremely helpful in agile and startup environments
where MVPs postpone whatever can be done later.
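Recovery from a buggy aggregation can be sketched as a recomputation over the raw data. Everything here is assumed for illustration: the `rebuild_view` helper, the event fields `user` and `amount_cents`, and the particular bug (integer division that drops the cents).

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable

Event = Dict[str, Any]


def rebuild_view(raw_events: Iterable[Event], extract: Callable[[Event], int]) -> Dict[str, int]:
    """Recompute the aggregated view from scratch using the given extraction logic."""
    totals: Dict[str, int] = defaultdict(int)
    for event in raw_events:
        totals[event["user"]] += extract(event)
    return dict(totals)


# The raw sink keeps every event exactly as it arrived.
raw_events = [
    {"user": "a", "amount_cents": 250},
    {"user": "a", "amount_cents": 199},
    {"user": "b", "amount_cents": 1000},
]


def buggy_extract(event: Event) -> int:
    return event["amount_cents"] // 100  # bug: silently drops the cents


def fixed_extract(event: Event) -> int:
    return event["amount_cents"]


bad_view = rebuild_view(raw_events, buggy_extract)   # what the buggy deployment produced
good_view = rebuild_view(raw_events, fixed_extract)  # recomputed after the fix
```

The same pattern covers improved extraction or learning algorithms: swap in the new function and rerun it over the full history.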
Original article: https://www.cnblogs.com/dadadechengzi/p/12639176.html