Friday, July 18, 2014

Deploying Cassandra Across Multiple Data Centers with Replication

Cassandra provides highly scalable key/value storage that can be used for many applications. When Cassandra is used in production, you might consider deploying it across multiple data centers for various reasons. For example, your architecture may be such that you update data in one data center and all the other data centers should hold a replica of the same data, with eventual consistency being acceptable.

In this post I discuss how to deploy Cassandra across two data centers (one node each), making sure every data center contains a full copy of the complete data set.

Scenario:
My setup contains two nodes, each with a valid public IP address. (For the sake of explanation, I assign the following addresses: node1: 129.97.74.12, node2: 193.166.167.5.) Now follow the steps below.

Steps
Note that all these steps, except Step 4, must be followed on EACH AND EVERY node of the cluster. These steps were tested on Cassandra version 2.0.4.

Step 1: Configure cassandra.yaml
Open up $CASSANDRA_HOME/conf/cassandra.yaml in your favorite text editor.
I assume you have already downloaded and configured Cassandra on each of the boxes in your data centers. Since most of the steps here must be repeated for each node in every data center, I encourage you to first look at my previous post about deploying a single-node Cassandra.

1.1) Find the following lines and change the seeds to the public IP addresses of the other nodes (at least one node per data center), e.g.:

      # Ex: "<ip1>,<ip2>,<ip3>"  
      - seeds: "129.97.74.12,193.166.167.5"  


1.2) Change the listen_address and rpc_address to the public IP address of the node you are setting up. Since the nodes communicate over the public network, also make sure the inter-node storage_port (7000 by default) and the rpc_port (9160 by default) are open between the data centers. e.g.:
 # Setting this to 0.0.0.0 is always wrong.  
 listen_address: 193.166.167.5  
 rpc_address: 193.166.167.5  
 # port for Thrift to listen for clients on  
 rpc_port: 9160  


1.3) Change the endpoint_snitch line to PropertyFileSnitch. This tells Cassandra to consult cassandra-topology.properties for the topology of the nodes.
# of the snitch, which will be assumed to be on your classpath.
endpoint_snitch: PropertyFileSnitch

Step 2: Configure the Snitch File
Open up $CASSANDRA_HOME/conf/cassandra-topology.properties and add all nodes' IP addresses, like this:

 # Cassandra Node IP=Data Center:Rack  
 129.97.74.12=DC1:RAC1  
 193.166.167.5=DC2:RAC2 

 #default for unknown nodes  
 default=DC1:RAC1  

Note that this file must be kept identical on every node in the cluster.

Step 3: Start Your Cluster
Go to $CASSANDRA_HOME on each node and type ./bin/cassandra -f to bring up the node. Once you have done this on all the nodes, type ./bin/nodetool -h localhost ring to make sure all the nodes are up and running.
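For reference, the commands on each node look like this (nodetool status, also available in this version, additionally shows which data center and rack each node was assigned to):

 cd $CASSANDRA_HOME  
 ./bin/cassandra -f  
 # in a second terminal, once every node has joined:  
 ./bin/nodetool -h localhost ring  
 ./bin/nodetool status  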

Step 4: Create the Data Model with Replication
We are almost there. Now you need to tell Cassandra to use this topology for your data model. The easiest way to do this is through cassandra-cli. Go to $CASSANDRA_HOME/bin and type cassandra-cli -p 9160 -h ip_address.
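For example, to connect to node1:

 cd $CASSANDRA_HOME/bin  
 ./cassandra-cli -h 129.97.74.12 -p 9160  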

Now you need to create the keyspace with the proper replication settings. Assuming your keyspace is named usertable, type the following (the strategy_options place one replica in each data center):

 create keyspace usertable with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options = [{DC1:1,DC2:1}];  
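cassandra-cli is deprecated in the 2.x series, so if you prefer cqlsh, a roughly equivalent keyspace definition (assuming the same data center names from the snitch file) is:

 CREATE KEYSPACE usertable  
   WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 1, 'DC2': 1};  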

Now create your table, or column family, in Cassandra with the name data and insert some test data:

 use usertable;  
 create column family data with column_type = 'Standard' and comparator = 'UTF8Type';  
 ASSUME data validator AS UTF8Type;  
 ASSUME data keys AS UTF8Type;  
 ASSUME data comparator AS UTF8Type;  
 set data[2][1]='test';  
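Before checking the replicas, you can read the value back on the same node; the cli's get command mirrors the set syntax used above:

 get data[2][1];  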


Since you defined a replication policy of one copy per data center when creating usertable, you should now be able to see the inserted item ('test') on every node. Go to each node and type in the cli:
 use usertable;  
 list data;  

Your data should look like this:

 [default@hsntest] list data;  
 Using default limit of 100  
 Using default cell limit of 100  
 -------------------  
 RowKey: 32  
 => (name=1, value=74657374, timestamp=1405689037312000)  
 1 Row Returned.  
 Elapsed time: 84 msec(s).  

All done! You now have a two-node cluster with the full data set replicated on each node. Enjoy :)
