Why we love Cassandra

Posted on April 22, 2015 by Ajay Tiwari - App42 Backend as a Service

App42 provides many ready-made APIs for developers, and each API solves a different app development problem. Different problems need different solutions, so the App42 architecture takes a hybrid approach at the database layer: some services are good candidates for an RDBMS, others for NoSQL, and some require in-memory persistence.

App42 performs a lot of analytics on this data and also offers Marketing Automation as a service to app developers. An analytics solution needs a different kind of persistence at the DB layer. We chose Cassandra for this and fell deeply in love with it. HBase and MongoDB were also candidates, but we decided to go ahead with Cassandra, and here are the reasons why.

1. Cassandra scales linearly under massive write load.

App42 analytics generates quite a lot of data every time an event occurs. Events from a single app may result in thousands of insertions into the database. We process billions of events, so we wanted storage that can withstand very heavy write operations and still scale. That left us with two options: Cassandra and HBase. MongoDB was also a candidate, but because of its database-level write lock and the poor insertion performance that follows from it, it dropped off the list at the very beginning of our selection process. Cassandra and HBase both handle heavy write operations well, but we opted for Cassandra after looking at the available benchmarks (http://planetcassandra.org/nosql-performance-benchmarks/) and considering how easy the cluster is to manage. For us Cassandra was the perfect choice for heavy write loads, and it scales linearly as new machines are added to the ring.
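
The idea of adding machines to the ring can be illustrated with a toy consistent-hashing ring. This is only a sketch under stated assumptions, not Cassandra's actual partitioner or token allocation: the `Ring` class, the node names, the MD5 hash and the 64 tokens per node are all made up for illustration. It shows that when a fourth node joins, only a fraction of the keys change owner, and every key that moves goes to the new node.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is only an illustrative hash here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: each node owns several tokens; a key belongs to the
    node owning the next token clockwise from the key's position."""

    def __init__(self, tokens_per_node: int = 64):
        self.tokens_per_node = tokens_per_node
        self._tokens = []  # sorted token positions
        self._owner = {}   # token position -> node name

    def add_node(self, node: str) -> None:
        for i in range(self.tokens_per_node):
            t = token(f"{node}#{i}")
            self._owner[t] = node
            bisect.insort(self._tokens, t)

    def node_for(self, key: str) -> str:
        # Wrap around to the first token when the key hashes past the last one.
        idx = bisect.bisect_right(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


ring = Ring()
for name in ("node-a", "node-b", "node-c"):  # hypothetical node names
    ring.add_node(name)

keys = [f"event-{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-d")
```

Because existing nodes keep the keys they already owned, the write load is spread over one more machine without reshuffling the whole data set.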

2. Cassandra is an excellent choice for real-time analytic workloads

Because it supports heavy write operations, Cassandra is a natural choice for real-time analytics. The rule of thumb for real-time analytics is that your data should already be calculated and persisted in the database before anyone asks for it. If you know which reports you want to show in real time, you can define your schema accordingly and generate the report data as events arrive. Batch mutations and distributed global counters are features we really liked while using Cassandra. If you are looking for a similar kind of solution, Cassandra will most likely meet your needs.
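
The rule of thumb above, compute at write time and look up at read time, can be sketched in a few lines. This is an in-memory illustration only, with no Cassandra client involved: the `hourly_counts` store, the function names and the sample app and event names are all hypothetical. In Cassandra, the same role could be played by a counter column that each event increments.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical report store: (app_id, event_name, hour_bucket) -> count.
# In Cassandra a counter column could play this role, incremented with an
# UPDATE ... SET total = total + 1 style statement.
hourly_counts = Counter()


def hour_bucket(ts: datetime) -> str:
    """Truncate a timestamp to the hour so all events in that hour share one entry."""
    return ts.strftime("%Y-%m-%d %H:00")


def record_event(app_id: str, event_name: str, ts: datetime) -> None:
    """Write path: bump the pre-aggregated counter as the event arrives."""
    hourly_counts[(app_id, event_name, hour_bucket(ts))] += 1


def events_in_hour(app_id: str, event_name: str, ts: datetime) -> int:
    """Read path: the report is a single lookup, with no scan or aggregation."""
    return hourly_counts[(app_id, event_name, hour_bucket(ts))]


t1 = datetime(2015, 4, 22, 10, 5, tzinfo=timezone.utc)
t2 = datetime(2015, 4, 22, 10, 40, tzinfo=timezone.utc)
t3 = datetime(2015, 4, 22, 11, 1, tzinfo=timezone.utc)
for ts in (t1, t2, t3):
    record_event("demo-app", "level_complete", ts)

print(events_in_hour("demo-app", "level_complete", t1))  # 2
print(events_in_hour("demo-app", "level_complete", t3))  # 1
```

The trade-off is that every report has to be known in advance; a question nobody planned for needs a batch job over the raw data instead.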

3. Cassandra can be integrated with Hadoop, Hive and Apache Spark for batch processing

As described above, Cassandra is a good candidate for real-time analytics, but there may be scenarios where you have to run batch processing over the stored data. Cassandra can be easily integrated with Hadoop and Hive to achieve this. On-demand in-memory analytics can also be done through Apache Spark integration.

4. Tunable Consistency and CAP parameters.

According to the CAP theorem (http://en.wikipedia.org/wiki/CAP_theorem), a distributed database can guarantee at most two of Consistency (C), Availability (A) and Partition tolerance (P) at a time; it is impossible to achieve all three together. Cassandra lets you tune this trade-off based on your priorities, for example through the consistency level chosen for reads and writes. By default it falls in the AP category.
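
One common way to reason about this tuning is the overlap rule: with a replication factor RF, a read that contacts R replicas is guaranteed to reach at least one replica that acknowledged a write to W replicas whenever R + W > RF. The sketch below is plain arithmetic rather than driver code, and the function names are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
# Write to one replica, read from one replica: fast and available, may be stale.
print(reads_see_latest_write(rf, write_replicas=1, read_replicas=1))   # False
# Majority writes and majority reads always overlap.
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))              # True
# Write to all replicas, read from one: consistent, but a write fails if any replica is down.
print(reads_see_latest_write(rf, write_replicas=rf, read_replicas=1))  # True
```

Lower values of R and W favor availability and latency, while higher values favor consistency, which is the dial the section above refers to.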

Cassandra has many other features, but these were the main considerations for us, and we chose it on that basis.

We hope this post helps others who are architecting products that need analytics over large amounts of data and must keep scaling as that data grows.

If you need Big Data analytics under heavy write load, Cassandra can be a perfect choice for you. Your feedback and suggestions on this post are heartily welcome, and you are free to reach out to us at [email protected] with any further queries or feedback.

src: http://blogs.shephertz.com/2015/04/22/love-cassandra/

Time: 2024-12-20 02:27:04
