Bigger Than Life - Big Data Blog

Well, let's start with the name: NoSQL means "Not Only SQL". Explaining it in one phrase is simple: "An attempt to handle big amounts of data using non-relational solutions". A better name would be "Not only relational".

So let's be more specific and a bit more formal: what are we trying to solve? What are the problems that an RDBMS cannot help with today?

Today's RDBMS limitations:

Relational DBs cannot handle web scale; the amounts of data are just too big for them. Try to imagine how many posts Twitter handles every day. Facebook? How many sites Google scans and indexes every day?
They are not distributed and were never designed to be, so they are not fault tolerant: you usually have several DB servers, and if one goes down, you're in BIG trouble.
RDBMS gives us ACID operations:

Atomic – All of the work in a transaction completes (commit) or none of it completes.
Consistent – A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
Isolated – The results of any changes made during a transaction are not visible until the transaction has committed.
Durable – The results of a committed transaction survive failures.
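To make the "all or nothing" part concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names and the amounts are made up for illustration; the point is only that a failed transaction leaves no partial update behind.

```python
import sqlite3

# In-memory database; the table and account names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # the consistency constraint
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit to dst was rolled back too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


failed = transfer(conn, "alice", "bob", 500)  # alice cannot cover this
after_failed = balances(conn)                 # unchanged: nothing was applied
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```

The first transfer credits bob and then fails on alice's debit, yet bob's balance is untouched afterwards: that is atomicity and consistency working together.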

That is great, but what if we don't need all of those guarantees? What if most of our CRUD operations are reads? Then we don't really benefit from ACID (unless you're a developer living in Goa :))

BASE & CAP

The BASE acronym was defined by Eric Brewer, who is also known for formulating the CAP theorem.

The CAP theorem states that a distributed computer system cannot guarantee all of the following three properties at the same time:

Consistency - all nodes see the same data at the same time
Availability - a guarantee that every request receives a response about whether it was successful or failed
Partition tolerance - the system continues to operate despite arbitrary message loss or failure of part of the system

A BASE system gives up on strong consistency in favor of availability:

Basically available indicates that the system does guarantee availability, in terms of the CAP theorem.

Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.

Eventual consistency indicates that the system will become consistent over time, given that the system doesn't receive input during that time.
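Eventual consistency is easier to see than to define, so here is a toy simulation. It is only a sketch under simplifying assumptions: the class name, the three replicas and the explicit sync() step (standing in for background replication) are all invented for illustration and do not describe any particular product.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy store: a write lands on one replica and reaches the others later."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = deque()  # queued (replica_index, key, value) updates

    def write(self, key, value):
        # The write is acknowledged as soon as replica 0 has it.
        self.replicas[0][key] = value
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key, replica_index):
        # Any replica may answer, possibly with stale data.
        return self.replicas[replica_index].get(key)

    def sync(self):
        """Deliver all queued updates (stands in for background replication)."""
        while self.pending:
            i, key, value = self.pending.popleft()
            self.replicas[i][key] = value


store = EventuallyConsistentStore()
store.write("user3371_color", "Blue")
stale = store.read("user3371_color", replica_index=2)  # replica 2 has not seen it yet
store.sync()
fresh = store.read("user3371_color", replica_index=2)  # now every replica agrees
```

Between write() and sync() the replicas disagree (soft state); once no new input arrives and the queue drains, they converge.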

NoSQL Database System Characteristics

Scalable replication and distribution - potentially thousands of machines distributed around the world
Queries need to return answers quickly
Mostly query, few updates
Asynchronous Inserts & Updates
NoSQL systems typically do not use SQL as their query language.
They do not necessarily follow a fixed schema.
They cannot necessarily give full ACID guarantees; instead they give us BASE.
They have a distributed, fault-tolerant architecture.
Open source development

Wow… now that we've passed that boring definition part, let's dive into the goodies: types & implementations

There are many types of NoSQL DBs; let's talk about four of the most common ones:

Column Store (Tabular) – Each storage block contains data from only one column
Document Store – stores documents made up of tagged elements
Key-Value Store – a hash table of keys and their values
Graph DB - designed for data whose relations are well represented as a graph.

Column Store

A column-oriented DBMS stores data tables as sections of columns of data rather than as rows of data, like most relational DBMSs. This has advantages for data warehouses, CRM systems, and library card catalogs, and other ad-hoc inquiry systems where aggregates are computed over large numbers of similar data items.

A relational database management system must show its data as two-dimensional tables, of columns and rows, but store it as one-dimensional strings. For example, a database might have this table.

EmpId	Lastname	Firstname	Salary
1	Smith	Joe	40000
2	Jones	Mary	50000
3	Johnson	Cathy	44000

This table exists in the computer's memory (RAM) and storage (hard drive). A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on.

1,Smith,Joe,40000;

2,Jones,Mary,50000;

3,Johnson,Cathy,44000;

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.

1,2,3;

Smith,Jones,Johnson;

Joe,Mary,Cathy;

40000,50000,44000;
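The two layouts above can be reproduced with a few lines of Python. This is just a sketch of the idea; the function names are made up, and real engines use binary formats, compression and blocks rather than comma-separated strings.

```python
# The example employee table, one tuple per row.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def serialize_row_oriented(rows):
    """Write all values of row 1, then all values of row 2, and so on."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)


def serialize_column_oriented(rows):
    """Write all values of column 1, then all values of column 2, and so on."""
    columns = zip(*rows)  # transpose rows into columns
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)


row_layout = serialize_row_oriented(rows)
col_layout = serialize_column_oriented(rows)
```

The data is identical in both cases; only the order in which the values hit the disk changes.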

1. Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.

2. Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.

3. Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.

4. Row-oriented organizations are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.
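Point 1 can be illustrated by counting how many stored values each layout has to touch in order to sum the salaries. This is a rough sketch: "values read" is a made-up stand-in for I/O cost, and the function names are hypothetical.

```python
# Row-oriented layout: one tuple per employee.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

# Column-oriented layout: one list per column.
columns = {
    "EmpId": [1, 2, 3],
    "Lastname": ["Smith", "Jones", "Johnson"],
    "Firstname": ["Joe", "Mary", "Cathy"],
    "Salary": [40000, 50000, 44000],
}


def total_salary_row_store(rows):
    """Sum salaries; every row is fetched whole to reach one field."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row[3]  # Salary is the fourth field
    return total, values_read


def total_salary_column_store(columns):
    """Sum salaries; only the Salary column is read."""
    salaries = columns["Salary"]
    return sum(salaries), len(salaries)


row_result = total_salary_row_store(rows)
col_result = total_salary_column_store(columns)
```

Both return the same total, but the row layout touches 12 values and the column layout only 3. With millions of rows and dozens of columns, that gap is why column stores do well on aggregates.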

Column store examples: Apache HBase, Google Bigtable

Document Store

This kind of DB stores the "document" itself. Each document-oriented database implementation differs on the details of this definition, but in general they all assume that documents are encoded in some standard format such as XML, YAML, JSON, or BSON, or in binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on).

There are different ways to organize those documents in the DB:

Collections
Tags
Non-visible Metadata
Directory hierarchies

Document Store characteristics:

Documents in a collection may have fields that are completely different.
Documents are addressed in the database via a unique key that represents that document.
Beyond the simple key-document (or key–value) lookup that you can use to retrieve a document, the database will offer an API or query language that will allow retrieval of documents based on their contents.

{
  "_id": "guid goes here",
  "_rev": "314159",
  "type": "abstract",
  "author": "Keith W. Hare",
  "title": "SQL Standard and NoSQL Databases",
  "body": "NoSQL databases (either no-SQL or Not Only SQL) are currently a hot topic in some parts of computing.",
  "creation_timestamp": "2011/05/10 13:30:00 +0004"
}
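The characteristics listed above (free-form fields, lookup by unique key, retrieval by content) fit in a very small in-memory sketch. The DocumentStore class and its methods are hypothetical and only mimic the idea; real document stores add indexes, persistence and much richer query languages.

```python
import uuid


class DocumentStore:
    """Toy document collection: dict-shaped documents addressed by a unique key."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        # Use the caller's _id if present, otherwise generate one.
        doc_id = doc.get("_id") or str(uuid.uuid4())
        self._docs[doc_id] = {**doc, "_id": doc_id}
        return doc_id

    def get(self, doc_id):
        """Simple key-document lookup."""
        return self._docs.get(doc_id)

    def find(self, **criteria):
        """Return documents whose fields equal all the given criteria."""
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


store = DocumentStore()
abstract_id = store.insert({
    "type": "abstract",
    "author": "Keith W. Hare",
    "title": "SQL Standard and NoSQL Databases",
})
# A second document with completely different fields is fine.
store.insert({"type": "photo", "width": 800, "height": 600})

by_key = store.get(abstract_id)
abstracts = store.find(type="abstract")
wide_photos = store.find(type="photo", width=800)
```

Note that nothing had to be declared up front: the "photo" document simply brings its own fields.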

Document Store examples: MongoDB, Apache CouchDB, Oracle NoSQL Database

Key-Value Store

Key–value stores allow the application to store its data in a schema-less way. The data could be stored in a datatype of a programming language or an object. Because of this, there is no need for a fixed data model.

It is a single table with two columns: the first is the (primary) key and the second is the value. And that's it, that's all the NoSQL magic.

user3371_color Blue

user4344_color Brackish

user1923_height 6' 0"

user3371_age 34

error_msg_457 There is no file %1 here

error_message_1 There is no user with %1 name

1923_name Jim

user1923_name Jim Smith

user1923_lname Smith

Application_Installed true

log_errors 1

install_path C:\Windows\System32\Restricted

ServerName localhost

test test

test1 test

test123 Brackish
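In code, the "single table with two columns" is essentially a hash table. Here is a minimal sketch; the KeyValueStore class is hypothetical and just wraps a Python dict, using a few of the keys from the table above.

```python
class KeyValueStore:
    """Toy key-value store: one table mapping a key to an opaque value."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # No schema: the value can be a string, a number, a flag, anything.
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)


kv = KeyValueStore()
kv.put("user3371_color", "Blue")
kv.put("user3371_age", 34)
kv.put("Application_Installed", True)

color = kv.get("user3371_color")
missing = kv.get("user9999_color")  # unknown key, falls back to the default
kv.delete("user3371_age")
age_after_delete = kv.get("user3371_age")
```

The store knows nothing about what a "user" is; any structure lives in the key naming convention chosen by the application.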

Key-Value Store examples: Apache Cassandra, Oracle Coherence, FreeBase

Graph DB

This kind of database is designed for data whose relations are well represented as a graph (elements interconnected with an undetermined number of relations between them). The kind of data could be social relations, public transport links, road maps or network topologies, for example.

One of the common languages for querying this kind of DB is SPARQL:

PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}
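To get a feeling for what such a query does, here is a rough Python sketch that answers the same question over a handful of (subject, predicate, object) triples. The sample triples and the helper names are made up for illustration, and a real graph DB would use indexes rather than scanning a list.

```python
# Sample graph as (subject, predicate, object) triples; illustrative data only.
triples = [
    ("city:Cairo", "cityname", "Cairo"),
    ("city:Cairo", "isCapitalOf", "country:Egypt"),
    ("country:Egypt", "countryname", "Egypt"),
    ("country:Egypt", "isInContinent", "Africa"),
    ("city:Paris", "cityname", "Paris"),
    ("city:Paris", "isCapitalOf", "country:France"),
    ("country:France", "countryname", "France"),
    ("country:France", "isInContinent", "Europe"),
]


def objects(subject, predicate):
    """All objects reachable from `subject` over `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def african_capitals():
    """Pairs (capital, country) where the country is in Africa."""
    results = []
    for x, predicate, y in triples:
        if predicate != "isCapitalOf":
            continue  # only edges ?x isCapitalOf ?y are of interest
        if "Africa" not in objects(y, "isInContinent"):
            continue  # ?y must be in Africa
        for capital in objects(x, "cityname"):
            for country in objects(y, "countryname"):
                results.append((capital, country))
    return results


answer = african_capitals()
```

Each variable in the SPARQL query (?x, ?y, ?capital, ?country) corresponds to a loop variable here; the graph engine's job is to do this matching efficiently.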

Graph DB examples: Neo4j, IBM DB2, AllegroGraph

The Big Picture

This post was just a glimpse of the big world of big data.

References:

Wikipedia
“Scalable SQL”, Michael Rys, ACM Queue, April 19, 2011, http://queue.acm.org/detail.cfm?id=1971597
“A practical guide to NoSQL”, Denise Miura, March 17, 2011, http://blogs.marklogic.com/2011/03/17/a-practical-guide-to-nosql/
NoSQL News websites: http://nosql.mypopescu.com, http://www.nosqldatabases.com
http://dba.stackexchange.com/questions/607/what-is-a-key-value-store-database
http://stackoverflow.com/questions/3342497/explanation-of-base-terminology
http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772