Wednesday, January 9, 2013

NoSQL – a simple & comprehensive guide


Well, let's start with the name: it stands for "Not Only SQL". Explaining it in one phrase is simple: "An attempt to handle big amounts of data using non-relational solutions". A better name would be "Not only relational".
So let's be more specific & a bit more formal: what are we trying to solve? What are the problems that today's RDBMSs cannot help with?


Today's RDBMS

  • Relational DBs cannot handle web scale: the amounts of data are just too big for them. Try to imagine how many posts Twitter handles every day. Facebook? How many sites Google scans & indexes every day?
  • They are not distributed and were never designed to be, so they are not fault tolerant. You usually have several DB servers, and if one goes down, you're in BIG trouble.
  • RDBMS gives us ACID operations:
o   Atomic – All of the work in a transaction completes (commit) or none of it completes.
o   Consistent – A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
o   Isolated – The results of any changes made during a transaction are not visible until the transaction has committed.
o   Durable – The results of a committed transaction survive failures
That is great, but what if we don't really need all of those? What if most of our CRUD operations are reads? Then we don't really benefit from ACID (unless you're a developer who lives in Goa :) )
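To make the "all or nothing" part of ACID a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names and the transfer helper are made up for illustration; the point is only that a failed transaction leaves no half-finished work behind.

```python
import sqlite3

# In-memory database with a constraint: balances may never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts atomically (hypothetical helper)."""
    try:
        # "with conn" commits if the block succeeds and rolls back on error.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was undone too.
        return False


def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer credits bob first and only then fails on alice's balance, yet bob's balance is unchanged afterwards: that is atomicity and consistency working together.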

BASE & CAP

The BASE acronym was defined by Eric Brewer, who is also known for formulating the CAP theorem.
The CAP theorem states that a distributed computer system cannot guarantee all of the following three properties at the same time:
  • Consistency - all nodes see the same data at the same time
  • Availability - a guarantee that every request receives a response about whether it was successful or failed
  • Partition tolerance - the system continues to operate despite arbitrary message loss or failure of part of the system
A BASE system gives up on consistency:
  • Basically available indicates that the system does guarantee availability, in terms of the CAP theorem.
  • Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
  • Eventual consistency indicates that the system will become consistent over time, given that the system doesn't receive input during that time.
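Eventual consistency is easier to see in a toy model. The sketch below is an assumption-heavy simulation, not how any particular product works: the Replica and Cluster classes are hypothetical, writes go to one replica right away and reach the others through a queue, and conflicts are resolved with a simple "highest version wins" rule.

```python
class Replica:
    """One node holding key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        # Keep the update only if it is newer than what we already have.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        current = self.data.get(key)
        return current[1] if current else None


class Cluster:
    """Toy cluster: replica 0 takes writes, the rest catch up later."""

    def __init__(self, size):
        self.replicas = [Replica() for _ in range(size)]
        self.pending = []   # replication messages not yet delivered
        self.clock = 0      # simple version counter

    def write(self, key, value):
        self.clock += 1
        self.replicas[0].apply(key, self.clock, value)
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, self.clock, value))

    def sync(self):
        # Deliver queued messages, newest first, to show that
        # out-of-order delivery still converges.
        while self.pending:
            replica, key, version, value = self.pending.pop()
            replica.apply(key, version, value)


cluster = Cluster(3)
cluster.write("color", "blue")
cluster.write("color", "red")
before = [r.read("color") for r in cluster.replicas]
cluster.sync()
after = [r.read("color") for r in cluster.replicas]
print(before)
print(after)
```

Before sync the replicas disagree (soft state); once the input stops and the messages are delivered, every replica returns the same value.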

NoSQL Database System characteristics

  • Scalable replication and distribution - potentially thousands of machines distributed around the world
  • Queries need to return answers quickly
  • Mostly query, few updates
  • Asynchronous Inserts & Updates
  • NoSQL does not use SQL as its query language.
  • NoSQL does not necessarily follow a fixed schema.
  • NoSQL cannot necessarily give full ACID guarantees; instead it gives us BASE.
  • NoSQL has a distributed, fault-tolerant architecture. 
  • Open source development
Wow… now that we've passed that boring definition part, let's dive into the goodies: types & implementations
There are many types of NoSQL DBs; let's talk about 4 of the most common ones:
  • Column Store (Tabular) – Each storage block contains data from only one column
  • Document Store – stores documents made up of tagged elements
  • Key-Value Store – Hash table of keys and their values
  • Graph DB – designed for data whose relations are well represented as a graph.

Column Store

A column-oriented DBMS stores data tables as sections of columns of data rather than as rows of data, as most relational DBMSs do. This has advantages for data warehouses, CRM systems, library card catalogs, and other ad-hoc inquiry systems where aggregates are computed over large numbers of similar data items.
A relational database management system must show its data as two-dimensional tables, of columns and rows, but store it as one-dimensional strings. For example, a database might have this table.

EmpId  Lastname  Firstname  Salary
1      Smith     Joe        40000
2      Jones     Mary       50000
3      Johnson   Cathy      44000

This table exists in the computer's memory (RAM) and storage (hard drive). A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on.

      1,Smith,Joe,40000;
      2,Jones,Mary,50000;
      3,Johnson,Cathy,44000;

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.

      1,2,3;
      Smith,Jones,Johnson;
      Joe,Mary,Cathy;
      40000,50000,44000;
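The two layouts above can be reproduced with a few lines of Python. This is only a sketch of the idea using the example table; real engines use binary formats, and the function names here are made up.

```python
# The example table, one tuple per row: (EmpId, Lastname, Firstname, Salary).
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def serialize_rows(rows):
    """Row-oriented: all values of a row together, then the next row."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)


def serialize_columns(rows):
    """Column-oriented: all values of a column together, then the next."""
    columns = zip(*rows)  # transpose rows into columns
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)


print(serialize_rows(rows))
# 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
print(serialize_columns(rows))
# 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
```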

1.     Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
2.       Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
3.       Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
4.       Row-oriented organizations are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.
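A rough way to see the first point is to count how many stored values each layout has to touch in order to sum the salaries. The counting below is a simplification (an assumption for illustration: a row store fetches whole rows, a column store fetches only the column it needs), not a benchmark.

```python
# Same example table in both layouts.
row_store = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]
column_store = {
    "EmpId": [1, 2, 3],
    "Lastname": ["Smith", "Jones", "Johnson"],
    "Firstname": ["Joe", "Mary", "Cathy"],
    "Salary": [40000, 50000, 44000],
}


def total_salary_from_rows(rows):
    """Sum salaries from a row layout; returns (total, values_read)."""
    total = 0
    values_read = 0
    for row in rows:
        values_read += len(row)  # the whole row is fetched
        total += row[3]          # but only Salary is used
    return total, values_read


def total_salary_from_columns(columns):
    """Sum salaries from a column layout; returns (total, values_read)."""
    salaries = columns["Salary"]  # only this column is fetched
    return sum(salaries), len(salaries)


print(total_salary_from_rows(row_store))        # (134000, 12)
print(total_salary_from_columns(column_store))  # (134000, 3)
```

Both give the same total, but the column layout reads a quarter of the values here, and the gap grows with the number of columns in the table.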
Column store examples: Apache HBase, Google BigTable

Document Store

This kind of DB stores the "document" itself. Each document-oriented database implementation differs on the details of this definition, but in general they all assume that documents are encoded in some standard format such as XML, YAML, JSON, or BSON, or in binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on).
There are different ways to organize those documents in the DB:
  • Collections
  • Tags
  • Non-visible Metadata
  • Directory hierarchies

Document Store characteristics:
  • Documents in a collection may have fields that are completely different.
  • Documents are addressed in the database via a unique key that represents that document.
  • Beyond the simple key-document (or key–value) lookup that you can use to retrieve a document, the database will offer an API or query language that will allow retrieval of documents based on their contents.
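The three characteristics above fit in a very small in-memory sketch. The DocumentStore class and its insert/get/find methods are hypothetical names for illustration only; they do not mirror the API of MongoDB, CouchDB or any other product.

```python
import uuid


class DocumentStore:
    """Toy document store: documents are plain dicts addressed by _id."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        # Use the caller's _id if given, otherwise generate a unique key.
        doc_id = doc.get("_id") or str(uuid.uuid4())
        self._docs[doc_id] = {**doc, "_id": doc_id}
        return doc_id

    def get(self, doc_id):
        """Simple key -> document lookup."""
        return self._docs.get(doc_id)

    def find(self, **criteria):
        """Content-based lookup: every given field must match exactly."""
        return [
            doc for doc in self._docs.values()
            if all(k in doc and doc[k] == v for k, v in criteria.items())
        ]


store = DocumentStore()
abstract_id = store.insert({
    "type": "abstract",
    "author": "Keith W. Hare",
    "title": "SQL Standard and NoSQL Databases",
})
# A document with completely different fields in the same store.
store.insert({"type": "comment", "text": "Nice post"})

print(store.get(abstract_id)["title"])
print(store.find(type="comment"))
```

Note that nothing forces the two documents to share fields; the query simply skips documents that lack the field being matched.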

{
  "_id": "guid goes here",
  "_rev": "314159",
  "type": "abstract",
  "author": "Keith W. Hare",
  "title": "SQL Standard and NoSQL Databases",
  "body": "NoSQL databases (either no-SQL or Not Only SQL) are currently a hot topic in some parts of computing.",
  "creation_timestamp": "2011/05/10 13:30:00 +0004"
}

Document Store examples: MongoDB, Apache CouchDB, Oracle NoSQL DB

Key-Value Store

Key–value stores allow the application to store its data in a schema-less way. The data could be stored in a datatype of a programming language or an object. Because of this, there is no need for a fixed data model.
It is a single table with two columns: one being the (Primary) Key, and the second being the Value. And that's it, that's all the NoSQL magic.
user3371_color         Blue
user4344_color         Brackish
user1923_height        6' 0"
user3371_age           34
error_msg_457          There is no file %1 here
error_message_1        There is no user with %1 name
1923_name              Jim
user1923_name          Jim Smith
user1923_lname         Smith
Application_Installed  true
log_errors             1
install_path           C:\Windows\System32\Restricted
ServerName             localhost
test                   test
test1                  test
test123                Brackish
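In code, the table above is little more than a hash table. Here is a minimal in-memory sketch; the KeyValueStore class and its method names are made up for illustration, and a real store would add persistence, replication and so on.

```python
class KeyValueStore:
    """Toy key-value store: one hash table, no schema."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # Values can be of any type; the store does not care.
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)

    def delete(self, key):
        """Remove a key; returns True if it existed."""
        if key in self._table:
            del self._table[key]
            return True
        return False


kv = KeyValueStore()
kv.put("user3371_color", "Blue")
kv.put("user3371_age", 34)
kv.put("Application_Installed", True)

print(kv.get("user3371_color"))
print(kv.get("no_such_key", "n/a"))
```

The only way in is the key: there are no joins and no queries on the value, which is exactly what keeps this model so easy to distribute.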

Key-Value Store examples: Redis, Riak, Oracle Coherence (Apache Cassandra is sometimes listed here too, but it is usually classed as a column store)

Graph DB

This kind of database is designed for data whose relations are well represented as a graph (elements interconnected with an undetermined number of relations between them). The kind of data could be social relations, public transport links, road maps or network topologies, for example.
One of the common languages for querying this kind of DB is SPARQL:

PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}
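To see what this query is asking for, the same pattern matching can be sketched in plain Python over a list of (subject, predicate, object) triples. The sample triples and identifiers such as "city:cairo" are invented for illustration, and the nested loops are a naive stand-in for what a real graph engine does with indexes.

```python
# Hypothetical sample data as (subject, predicate, object) triples.
triples = [
    ("city:cairo", "cityname", "Cairo"),
    ("city:cairo", "isCapitalOf", "country:egypt"),
    ("country:egypt", "countryname", "Egypt"),
    ("country:egypt", "isInContinent", "Africa"),
    ("city:paris", "cityname", "Paris"),
    ("city:paris", "isCapitalOf", "country:france"),
    ("country:france", "countryname", "France"),
    ("country:france", "isInContinent", "Europe"),
]


def objects(triples, subject, predicate):
    """All objects linked to a subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def african_capitals(triples):
    """Return (capital, country) pairs, like the SPARQL query above."""
    results = []
    for x, predicate, y in triples:
        if predicate != "isCapitalOf":
            continue
        # ?y must be a country located in Africa.
        if "Africa" not in objects(triples, y, "isInContinent"):
            continue
        for capital in objects(triples, x, "cityname"):
            for country in objects(triples, y, "countryname"):
                results.append((capital, country))
    return results


print(african_capitals(triples))  # [('Cairo', 'Egypt')]
```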

Graph DB examples: Neo4j, IBM DB2, AllegroGraph


The Big Picture
This post was just a glimpse into the big world of big data.


