Tuesday, November 25, 2014

Big Data... Is Hadoop the good way to start?

In the past two years, I have met many developers and architects working on “big data” projects. This sounds amazing, but quite often the truth is not that amazing.

TL;DR

You believe that you have a big data project?
  • Do not start with the installation of a Hadoop cluster -- the "how"
  • Start by talking to business people to understand their problem -- the "why"
  • Understand the data you must process
  • Look at the volume -- very often it is not "that" big
  • Then implement it, taking a simple approach; for example, start with MongoDB + Apache Spark

The infamous "big data project"

A typical discussion would look like:

Me: “Can you tell me more about this project, what do you do with your data?”

Mr. Big Bytes: “Sure, we have a 40-node Hadoop cluster..."

Me: “This is cool but which type of data do you store, and what is the use case, business value?"

Mr. Big Bytes: “We store all the logs of our applications, we have hundreds of gigabytes…"
After a long blank: “We have not yet started to analyze these data. For now it is just us, the IT team; we store the data so that soon we will be able to do interesting things with them."

You can meet the same person a few months later; the cluster is still sitting there, with no activity on it. I have even met consultants telling me they received calls from their customers asking the following:
“Hmmm, we have a Hadoop cluster installed, can you help us find what to do with it?"

Wrong! That is wrong!!!!! This means that the IT team has spent a lot of time for nothing, at least for the business; and I am not even sure the team has learned something technically.

Start with the "Why" not with the "How"!

The solution to this could be obvious: start your “big data project” by answering the “why/what” questions first! The “how”, the implementation, will come later.

I am sure that most enterprises would benefit from a so-called “big data project”, but it is really important to understand the problems first. And these problems are not technical… at least at the beginning. So you must spend time with the business people to understand what could help them. Let's take some examples.

If you are working in a bank or an insurance company, business people will be more than happy to predict when/why a customer will leave the company by doing some churn analysis; or it will be nice to be able to see when it makes a lot of sense to sell new contracts or services to existing customers.

If you are working in retail/commerce, your business will be happy to see if they can adjust prices to the market, or provide precise recommendations to a user from an analysis of other customers' behavior.

We can find many other examples. But as you can see, we are not talking about technology, just business and possible benefits. In fact nothing new: compared with the applications you are building, you first need some requirements/ideas to build a product. Here we just need some "data input" to see how we can enrich the data with some business value.

Once you have started to ask all these questions you will start to see some input, and possible processing around them:
  • You are an insurance company, your customer has no contact with your representative, or the customer satisfaction is medium/bad; you start to see some customer names in quotes coming from price comparison websites…. hmm, you can guess that they are looking for a new insurance.
  • Still in insurance, when your customers are close to retirement age, or have teenagers learning how to drive or moving to college, you know that you have some opportunity to sell new contracts, or adapt existing ones to the new needs.
  • In retail, you may want to look at all customers and what they have ordered, and based on this be able to recommend some products to a customer that "looks" the same.
  • Another very common use case these days: you want to do some sentiment analysis of social networks to see how your brand is perceived by your community.
As you can see now, we can start to think about the data we have to use and the type of processing we have to do on them.

Let's now talk about the "How"

Now that you have a better idea about what you want to do, it does not mean that you should dive into a large cluster installation.

Before that, you should continue to analyze the data:
  • What is the structure of the data that I have to analyze?
  • How big is my dataset?
  • How much data do I have to ingest over a period of time (minute, hour, day, ...)?
All these questions will help you understand your application better. This is often where it gets interesting, too: we realize that for most of us the "big data" is not that big!

I was working the other day with a telco company in Belgium, and we were talking about a possible new project. I simply said:
  • Belgium is what, 11+ million people?
  • If you store a 50 KB object for each person, this represents roughly 11,000,000 x 50 KB = 550,000,000 KB
  • Your full dataset will be about 524 GB, not even a terabyte!
Do you need a large Hadoop cluster to store and process this? You can use one, but you do not need to! You can find something smaller and easier to start with.

Any database will do the job, starting with MongoDB. I think it is really interesting to start this kind of project with a MongoDB cluster, not only because it will allow you to scale out as much as you need, but also because you will leverage the flexibility of the document model. This allows you to store any type of data, and easily adapt the structure to new data or requirements.
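To make this concrete, here is a minimal mongo shell sketch (the events collection and its field names are just hypothetical examples): documents with different structures live side by side in the same collection, and you can add new attributes later without any schema migration.

// Two "customer event" documents with different shapes stored in the same collection
db.events.insert({ type: "quote_request", customerId: 1234, source: "price-comparison-site", date: new Date() });
db.events.insert({ type: "support_call",  customerId: 1234, durationSeconds: 420, satisfaction: "bad" });

// Later, query across both shapes without touching any schema
db.events.find({ customerId: 1234 });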

Storing the data is only one part of the equation. The other part is how you achieve the data processing. Lately I have been playing a lot with Apache Spark. Spark provides a very powerful engine for large-scale data processing, and it is a lot simpler than MapReduce jobs. In addition to this, you can run Spark without Hadoop! This means you can connect Spark to your MongoDB, with the MongoDB Hadoop Connector and other data sources, and directly execute jobs on your main database.

What I also like about this approach: when your dataset starts to grow and it becomes harder to process all the data on your operational database, you can easily add Hadoop and keep most of your data processing layer intact, only changing the data source information. In this case you will connect MongoDB and Hadoop to get/push the data into HDFS, once again using the MongoDB Hadoop Connector.

Conclusion

Too many times, projects are driven by technology instead of focusing on the business value. This is particularly true around big data projects. So be sure you start by understanding the business problem, and find the data that could help to solve it.

Once you have the business problem and the data, select the right technology; that could be very simple -- plain files and Python scripts -- or, more often, a database like MongoDB with a data processing layer like Spark. And move to Hadoop only when it is really mandatory... a very, very, very large dataset.





Thursday, August 21, 2014

Introduction to MongoDB Geospatial feature


This post is a quick and simple introduction to the geospatial features of MongoDB 2.6, using a simple dataset and queries.


Storing Geospatial Information

As you know, you can store any type of data, but if you want to query it geographically you need to use coordinates, and create an index on them. MongoDB supports three types of indexes for geospatial queries:
  • 2d Index : uses simple coordinate pairs (longitude, latitude). As stated in the documentation: The 2d index is intended for legacy coordinate pairs used in MongoDB 2.2 and earlier. For this reason, I won't detail it further in this post. Just for the record, 2d indexes are used to query data stored as points on a two-dimensional plane.

  • 2d Sphere Index : supports queries of any geometries on an earth-like sphere; the data can be stored as GeoJSON or legacy coordinate pairs (longitude, latitude). For the rest of the article I will use this type of index and focus on GeoJSON.

  • Geo Haystack : used to query very small areas. It is less used by applications today and I will not describe it in this post.
So this article will now focus on the 2d Sphere index with the GeoJSON format to store and query documents.

So what is GeoJSON?

You can look at the http://geojson.org/ site for the details; here let's do a very short explanation. GeoJSON is a format for encoding, in JSON, a variety of geographic data structures, and it supports the following types: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon and GeometryCollection.

The GeoJSON format is quite straightforward; for the simple geometries it is based on two attributes: type and coordinates. Let's take some examples:

The city where I spent all my childhood, Pleneuf Val-André, France, has the following coordinates (from Wikipedia):
 48° 35′ 30.12″ N, 2° 32′ 48.84″ W
This notation is a point, based on a latitude & longitude in the WGS 84 (degrees, minutes, seconds) system. It is not very easy to use in application code, which is why it is also possible to represent the exact same point using the following values for latitude & longitude:
48.5917, -2.5469
This one uses the WGS 84 (decimal degrees) system. These are the coordinates you see used in most of the applications/APIs you work with as a developer (e.g. Google Maps/Earth).

By default, GeoJSON and MongoDB use these values, but the coordinates must be stored in longitude, latitude order, so this point in GeoJSON will look like:

{
  "type": "Point",
  "coordinates": [
    -2.5469,  
    48.5917 
  ]
}


This is a simple "Point"; let's now look at a line, for example a very nice walk on the beach:

{
  "type": "LineString",
  "coordinates": [
    [-2.551082,48.5955632],
    [-2.551229,48.594312],
    [-2.551550,48.593312],
    [-2.552400,48.592312],
    [-2.553677, 48.590898]
  ]
}


Using the same approach you will be able to create MultiPoint, MultiLineString, Polygon and MultiPolygon. It is also possible to mix all of these in a single document using a GeometryCollection. The following example is a GeometryCollection of a MultiLineString and a Polygon over Central Park:

{
  "type" : "GeometryCollection",
  "geometries" : [
    {
      "type" : "Polygon",
      "coordinates" : [
        [
          [ -73.9580, 40.8003 ],
          [ -73.9498, 40.7968 ],
          [ -73.9737, 40.7648 ],
          [ -73.9814, 40.7681 ],
          [ -73.9580, 40.8003 ]
        ]
      ]
    },
    {
      "type" : "MultiLineString",
      "coordinates" : [
        [ [ -73.96943, 40.78519 ], [ -73.96082, 40.78095 ] ],
        [ [ -73.96415, 40.79229 ], [ -73.95544, 40.78854 ] ],
        [ [ -73.97162, 40.78205 ], [ -73.96374, 40.77715 ] ],
        [ [ -73.97880, 40.77247 ], [ -73.97036, 40.76811 ] ]
      ]
    }
  ]
}

Note: You can, if you want, test/visualize these JSON documents using the http://geojsonlint.com/ service.


Now what? Let's store data!

Once you have a GeoJSON document you just need to store it inside your document. For example, if you want to store a document about JFK Airport with its location, you can run the following command:

db.airports.insert(
  {
    "name" : "John F Kennedy Intl",
    "type" : "International",
    "code" : "JFK",
    "loc" : {
      "type" : "Point",
      "coordinates" : [ -73.778889, 40.639722 ]
    }
  }
)

Yes, it is that simple! You just save the GeoJSON as one of the attributes of the document (loc in this example).

Querying Geospatial Information

Now that we have the data stored in MongoDB, we can use the geospatial information to do some interesting queries.

For this we need a sample dataset. I have created one using some open data found in various places. This dataset contains the following information:
  • airports collection with the list of US airports (Point)
  • states collection with the list of US states (MultiPolygon)
I have created this dataset from various open data sources ( http://geocommons.com/ , http://catalog.data.gov/dataset ) and used toGeoJSON to convert them into the proper format.

Let's install the dataset:
  1. Download it from here
  2. Unzip geo.zip file
  3. Restore the data into your MongoDB instance, using the following command:
    mongorestore geo.zip
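To quickly check that the restore worked, you can count the documents of the two collections from the mongo shell:

use geo
db.airports.count()
db.states.count()
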
MongoDB allows applications to do the following types of queries on geospatial data:

  • inclusion
  • intersection
  • proximity
Obviously, you will be able to use all the other operators in addition to the geospatial ones. Let's now look at some concrete examples.

Inclusion

Find all the airports in California. For this you need to get the California location (a MultiPolygon) and use the $geoWithin operator in the query. From the shell it will look like:

use geo

var cal = db.states.findOne(  {code : "CA"}  );

db.airports.find( 
  { 
    loc : { $geoWithin : { $geometry : cal.loc } } 
  },
  { name : 1 , type : 1, code : 1, _id: 0 } 
);

Result:

{ "name" : "Modesto City - County", "type" : "", "code" : "MOD" }
...
{ "name" : "San Francisco Intl", "type" : "International", "code" : "SFO" }
{ "name" : "San Jose International", "type" : "International", "code" : "SJC" }
...

So the query uses the "California MultiPolygon" and looks in the airports collection to find all the airports that fall within these polygons. This looks like the following image on a map:



You can use any other query features or criteria; for example, you can limit the query to international airports only, sorted by name:

db.airports.find( 
  { 
    loc : { $geoWithin : { $geometry : cal.loc } },
    type : "International" 
  },
  { name : 1 , type : 1, code : 1, _id: 0 } 
).sort({ name : 1 });

Result:

{ "name" : "Los Angeles Intl", "type" : "International", "code" : "LAX" }
{ "name" : "Metropolitan Oakland Intl", "type" : "International", "code" : "OAK" }
{ "name" : "Ontario Intl", "type" : "International", "code" : "ONT" }
{ "name" : "San Diego Intl", "type" : "International", "code" : "SAN" }
{ "name" : "San Francisco Intl", "type" : "International", "code" : "SFO" }
{ "name" : "San Jose International", "type" : "International", "code" : "SJC" }
{ "name" : "Southern California International", "type" : "International", "code" : "VCV" }


I do not know if you have looked in detail, but we are querying these documents with no index. You can run a query with explain() to see what's going on. The $geoWithin operator does not need an index, but your queries will be more efficient with one, so let's create the index:

db.airports.ensureIndex( { "loc" : "2dsphere" } );

Run the explain and you will see the difference.
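For example, you can run the same query with explain() before and after creating the index and compare the plans (number of scanned documents, cursor type); the exact output depends on your MongoDB version:

db.airports.find(
  { loc : { $geoWithin : { $geometry : cal.loc } } }
).explain();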


Intersection

    Suppose you want to know which states are adjacent to California. For this we just need to search for all the states that have coordinates that "intersect" with California. This is done with the following query:

    var cal = db.states.findOne(  {code : "CA"}  );
    
    
    
    db.states.find(
      {
        loc : { $geoIntersects : { $geometry : cal.loc } },
        code : { $ne : "CA" }
      },
      { name : 1, code : 1 , _id : 0 }
    );


    Result:

    { "name" : "Oregon", "code" : "OR" }
    { "name" : "Nevada", "code" : "NV" }
    { "name" : "Arizona", "code" : "AZ" }
    
    
    
    
    Same as before, the $geoIntersects operator does not need an index to work, but it will be more efficient with the following index:

    db.states.ensureIndex( { loc : "2dsphere" } );

    Proximity

    The last feature that I want to highlight in this post is related to queries with proximity criteria. Let's find all the international airports that are located less than 20 km from the reservoir in NYC's Central Park. For this you will use the $near operator.

    db.airports.find(
      {
        loc : {
          $near : {
            $geometry : {
              type : "Point" ,
              coordinates : [ -73.965355, 40.782865 ]
            },
            $maxDistance : 20000
          }
        },
        type : "International"
      },
      {
        name : 1,
        code : 1,
        _id : 0
      }
    );


    Results:

    { "name" : "La Guardia", "code" : "LGA" } { "name" : "Newark Intl", "code" : "EWR"


    So this query returns 2 airports, the closest being La Guardia, since the $near operator sorts the results by distance. It is also important to note that the $near operator requires an index.

    Conclusion

    In this first post about the geospatial features you have learned:
    • the basics of GeoJSON
    • how to query documents with inclusion, intersection and proximity criteria.
    You can now play more with this, for example by integrating it into an application that exposes the data in some UI, or by looking at how you can use the geospatial operators in an aggregation pipeline, as sketched below.
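
    As a starting point, here is a sketch of the earlier proximity query rewritten as an aggregation pipeline with the $geoNear stage; it must be the first stage of the pipeline, it uses the 2dsphere index, and with GeoJSON points the distances are expressed in meters:

    db.airports.aggregate([
      {
        $geoNear : {
          near : { type : "Point", coordinates : [ -73.965355, 40.782865 ] },
          distanceField : "distance",
          maxDistance : 20000,
          query : { type : "International" },
          spherical : true
        }
      },
      { $project : { name : 1, code : 1, distance : 1, _id : 0 } }
    ]);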




    Friday, March 28, 2014

    db.person.find( { "role" : "DBA" } )

    Wow! It has been a while since I posted something on my blog. I have been very busy, moving to MongoDB, learning, learning, learning… finally I can breathe a little and answer some questions.

    Last week I helped my colleague Norberto deliver a MongoDB Essentials Training in Paris. It was a very nice experience, and I am eager to deliver it on my own. I was happy to see that the audience was well balanced between developers and operations, mostly DBAs.

    What! I still need a DBA?



    This is a good opportunity to raise a point, or comment on a wrong idea: the fact that you are using MongoDB, or any other NoSQL datastore, does not mean that you do not need a DBA… As on any project, an administrator is not mandatory, but if you have one it is better. So even when MongoDB is pushed by the development team, it is very important to understand the way the database works, and how to administer and monitor it.

    If you are lucky enough to have real operations teams, with good system and database administrators use them! They are very important for your application.

    Most DBAs/system administrators have been maintaining systems in production for many years. They know how to keep your application up and running. They have also, most of the time, experienced many “disasters” and recovered from them (I hope).

    Who knows, you may encounter big issues with your application and you will be happy to have them on your side at this moment.

    "Great, but the DBA is slowing down my development!"

    I hear this sometimes, and I had this feeling too in the past as a developer in a large organization. Is it true?

    Developers and DBAs do not live in the same world today:

    • Developers want to integrate new technologies as soon as possible, not only because it is fun and they can brag about it during meetups/conferences, but because these technologies, most of the time, make them more productive and offer a better service/experience to the consumer.
    • DBAs are here to keep the applications up and running! So every time they do not feel confident about a technology they will push back. I think this is natural and I would probably be the same in their position. Like all geeks, they would love to adopt new technologies, but they need to understand and trust them first.

    System administrators and DBAs look at technology from a different angle than developers.

    Based on this, it is important to bring the operations team in as early as possible when the development team wants to integrate MongoDB or any new data store. Having the operations team in the loop early will ease the global adoption of MongoDB in the company.

    Personally, and this will show my age, I have seen a big change in the way developers and DBAs are working together.

    Back in the 90's, when the main architecture was client/server, developers and DBAs were working pretty well together, probably because they were speaking the same language: SQL was everywhere. I had regular meetings wit

    Then, since the mid-2000s, most applications have moved to a web-based architecture, with for example Java middleware, and the developers stopped working with DBAs. Probably because the data abstraction layer provided by the ORM exposed the database as a "commodity" service that is supposed to just work: "Hey Mr DBA, my application has been written with the best middleware technology on the market, so now deal with the performance and scalability! I am done!"

    Yes it is a cliché, but I am sure that some of you will recognize that.

    Nevertheless, each time I can, I push developers to talk more to administrators and to look closely at their database!

    A new era for operations and development teams

    The fast adoption of MongoDB by developers is a great opportunity to fix what we broke 10 years ago in large information systems:

    • Let's talk again!

    MongoDB has been built first for developers. The document-oriented approach gives a lot of flexibility to quickly adapt to change. So anytime your business users need a new feature you can implement it, even if this change impacts the data structure. Your data model is now driven and controlled by the application, not the database engine.

    However, the applications still need to be available 24x7 and perform well. These topics are managed - and shared - by administrators and developers! This has always been the case but, as I described earlier, it looks like some of us have forgotten that.

    Schema design and change velocity are driven by the application, so by the business and development teams, but all this impacts the database, for example:

    • How will storage grow?
    • Which indexes must be created to speed up my application?
    • How to organize my cluster to leverage the infrastructure properly:
      • Replica-Set organization (and related write concerns, managed by developer)
      • Sharding options
    • And the most important of them: backup/recovery strategies

    So many things could be managed by the project team alone, but if you have an operations team with you, it is better to do all of this as a single team.

    You, the developer, are convinced that MongoDB is the best database for your projects! Now it is time to work with the ops team and convince them too. So you should, for sure, explain why MongoDB is good for you as a developer, but you should also highlight all the benefits for operations, starting with built-in high availability with replica sets, and easy scalability with sharding. MongoDB is also here to make the life of the administrator easier! I have shared in the next paragraph a list of resources that are interesting for operations people.

    Let’s repeat it one more time: try to involve the operations team as soon as possible, and use that as an opportunity to build/rebuild the relationship between developers and system administrators!

    Resources

    You can find many good resources on the site to help operations teams or to learn more about this:




    Tuesday, October 1, 2013

    Pagination with Couchbase

    If you have to deal with a large number of documents when doing queries against a Couchbase cluster, it is important to use pagination to get rows by page. You can find some information in the documentation in the chapter "Pagination", but I want to go into more detail and provide sample code in this article.


    For this example I will start by creating a simple view based on the beer-sample dataset; the view is used to find breweries by country:

    function (doc, meta) {
      if (doc.type == "brewery" && doc.country){
       emit(doc.country);
      } 
    }
    


    This view lists all the breweries by country; the index looks like:

    Doc id                                 Key             Value
    bersaglier                             Argentina       null
    cervecera_jerome                       Argentina       null
    brouwerij_nacional_balashi             Aruba           null
    australian_brewing_corporation         Australia       null
    carlton_and_united_breweries           Australia       null
    coopers_brewery                        Australia       null
    foster_s_australia_ltd                 Australia       null
    gold_coast_brewery                     Australia       null
    lion_nathan_australia_hunter_street    Australia       null
    little_creatures_brewery               Australia       null
    malt_shovel_brewery                    Australia       null
    matilda_bay_brewing                    Australia       null
    ...                                    ...             ...
    yellowstone_valley_brewing             United States   null
    yuengling_son_brewing                  United States   null
    zea_rotisserie_and_brewery             United States   null
    fosters_tien_gang                      Viet Nam        null
    hue_brewery                            Viet Nam        null


    So now you want to navigate in this index with a page size of 5 rows.
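
    As a quick sketch, here are the two classic options expressed as view query parameters (the design document and view names, brewery/by_country, are just examples; the client SDKs expose the same parameters on their view query objects). The simple approach uses limit and skip; the more efficient one restarts from the last row of the previous page using startkey and startkey_docid, with skip=1 so that this row is not returned twice:

    # Page 1 then page 2 with limit/skip (the cost grows with the number of skipped rows)
    /beer-sample/_design/brewery/_view/by_country?limit=5&skip=0
    /beer-sample/_design/brewery/_view/by_country?limit=5&skip=5

    # Page 2 again, restarting from the last row of page 1 ("Australia" / carlton_and_united_breweries)
    /beer-sample/_design/brewery/_view/by_country?limit=5&skip=1&startkey="Australia"&startkey_docid=carlton_and_united_breweries

    Note that startkey must be JSON encoded (hence the quotes around the country name).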

    Thursday, July 18, 2013

    How to implement Document Versioning with Couchbase

    Introduction

    Developers often ask me how to "version" documents with Couchbase 2.0. The short answer is: the clients and server do not expose such a feature, but it is quite easy to implement.

    In this article I will use a basic approach, and you will be able to extend it depending on your business requirements.

    Design

    The first thing to do is to select how to "store/organize" the versions of your document, and for this you have different designs:
    • copy the versions of the document into new documents
    • copy the versions of the document into a list of embedded documents
    • store the list of attributes that have been changed into an embedded element (or new documents)
    • store the "delta"
    You will have to choose the design based on your application requirements (business logic, size of the dataset, ...). For this article, let's use one of the most simplistic approaches: create a new document for each version with the following rules for the keys:
    1. The current version is a simple key/document; the key does not change.
    2. Each version is a copy of the document, with the version number appended to the key.
    This looks like:
    Current Version    mykey
    Version 1          mykey::v1
    Version 2          mykey::v2
    ...                ...

    With this approach, existing applications will always use the current version of the document, since the key is not changed. But this approach creates new documents that will be indexed by the existing views.

    For example, in the Beer Sample application, the following view is used to list the beer by name:

    function (doc, meta) {
        if(doc.type && doc.type == "beer") {
            emit(doc.name);
        }
    }
    


    It is quite simple to "support" versioning without impacting the existing code, except for the view itself. The new view needs to emit keys and values only for the current version of the document. This is the new view code:

    function (doc, meta) {
        if(doc.type && doc.type == "beer" && (meta.id).indexOf("::v") == -1   ) {
            emit(doc.name);
        }
    }
    

    With this change, the existing applications that are using this view will continue to work with the same behavior.

    Implementing the versioning

    Based on this design, when the application needs to version the document, the following logic should happen:
    1. Get the current version of the document
    2. Increment the version number (for example using another key that maintains the version number for each document)
    3. Create the version with the new key, for example "mykey::v1"
    4. Save the new current version of the document
    Let's look at the code in Java

      Object obj = client.get(key);
      if (obj != null) {
        // get the next version, create or use the key: mykey_version
        long version = client.incr(key + "_version", 1, 1); 
        String keyForVersion = key + "::v" + version; // mykey::v1
        try {
            client.set(keyForVersion, obj).get();
        } catch (Exception e) {
            logger.severe("Cannot save version "+ version + " for key "+ key +" - Error:"+ e.getMessage() );
        }
       }
       client.set(key, value);
    

    Quite simple, isn't it?

    The application can access the document using its key, but it can also get one version or the list of all versions; this is one of the reasons why it is interesting to create a key (mykey_version), and to also use it when deleting documents and their related versions.

    Based on the previous comment, the delete operation looks like:

      Object obj = client.get(key);
      // need to delete all the version first
      Object vObject = this.get(key + "_version");
      if (vObject != null) {
        long biggerVersion = Long.parseLong((String) vObject);
        try {
            // delete all the versions
            for (int i = 1; i <= biggerVersion; i++) {
                String versionKey = key + "::v" + i;
                client.delete(versionKey).get();
            }
            // delete the counter
            client.delete(key + "_version").get();
        } catch (InterruptedException e) {
          e.printStackTrace();
        } catch (ExecutionException e) {
          e.printStackTrace();
        }
      }
      client.delete(key);
    

    Use versioning

    As an example, I have created a small library available on GitHub https://github.com/tgrall/couchbase-how-to-versioning; this library extends the Couchbase Client and overrides some of the operations: set, replace and delete (the basic ones: no TTL, no durability). As I said before, this is just an example.

    Build and Install 

    git clone https://github.com/tgrall/couchbase-how-to-versioning.git
    cd couchbase-how-to-versioning
    mvn clean install
    

    Then add this library to your project in addition to Couchbase Java Client, for example in your pom.xml
    ...
    <dependency>
        <groupId>com.couchbase.howtos</groupId>
        <artifactId>couchbase-how-to-versioning</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>couchbase</groupId>
        <artifactId>couchbase-client</artifactId>
        <version>1.1.8</version>
    </dependency>
    ...

    Code your application

    Create a document and version it:

     List<URI> uris = new LinkedList<URI>();
     uris.add(URI.create("http://127.0.0.1:8091/pools"));
     CouchbaseClientWithVersioning client = null;
     try {
      client = new CouchbaseClientWithVersioning(uris, "default", "");
      String key = "key-001";
      client.set(key, "This is the original version");
      System.out.printf("Original '%s' .\n", client.get(key));
      client.set(key, "This is a new version", true); // create a new version
      System.out.printf("Current Version '%s' .\n", client.get(key));
      System.out.printf("Version 1 '%s' .\n", client.get(key, 1));
      client.set(key, "This is another version", true); // create another version
      System.out.printf("All versions %s .\n", client.getAllVersions(key));
      client.deleteVersion(key, 1); // delete version 1
      System.out.printf("All versions %s (after delete 1 version).\n", client.getAllVersions(key));
      client.delete(key); // delete the key and all its versions
      System.out.printf("All versions %s (after delete the main key).\n", client.getAllVersions(key));
     } catch (Exception e) {
      e.printStackTrace();
     }
     if (client != null) {
      client.shutdown();
     }
    

    Quick explanation:
    • Line 5: instead of using the CouchbaseClient, the application uses the extended CouchbaseClientWithVersioning class.
    • Line 7: create a new entry.
    • Line 9: create a new version; the boolean value set to "true" forces the versioning of the document.
    • The application uses other methods such as getting a specific version (line 11), getting all versions (line 13), deleting a specific version (line 14), and finally deleting the key and all versions (line 16).
    So using this approach the developer explicitly controls when to create a version, since he has to add the boolean parameter to the set operation. In this small sample library it is also possible to do automatic versioning, in which case all set and replace calls will create a version; to achieve that, the developer just needs to call the setAutomaticVersionning(true) method. Something like:

        client = new CouchbaseClientWithVersioning(uris, "default", "");
        client.setAutomaticVersionning(true);

    With this approach you can provide versioning to your application with minimal code change. You can test it in the Beer Sample application; just do not forget to change the views as documented above to only return the current version of the documents.

    Conclusion

    As you can see, doing versioning in Couchbase is not that complicated, but it is something that must be done by your application based on its requirements and constraints. You have many different solutions and none of these options is perfect for all use cases.

    In this specific sample code, I am working with a simple design where I create a copy of the document for each version. With this approach it is also interesting to mention that you can version "anything", not only JSON documents but also any values. As I said before, this is one possible approach, and like any design, it has some impact on the application or the database, in this case mostly the database:
    • It increases the number of keys and documents
    • It doubles - or more - the number of operations; for example when updating a document, the application needs to get the current value, create a version, then save the new current version.
    • Consistency must be managed when adding a new version and incrementing the version number (need to deal with errors when creating a new version, deleting the versions and counter....)
    Many features could easily be added to this, for example:
    • Limit to a specific number of versions
    • Enable the versioning only for the replace() operation
    • Add specific attributes about the versions in the JSON document (for example the date of the version)
    • ....

    If you are using versioning in your Couchbase application, feel free to comment or write a small article that describes the way you are doing it.

    Thursday, July 11, 2013

    Deploy your Node/Couchbase application to the cloud with Clever Cloud



    Introduction

    Clever Cloud is the first PaaS to provide Couchbase as a service allowing developers to run applications in a fully managed environment. This article shows how to deploy an existing application to Clever Cloud.




    I am using a very simple Node application that I have documented in a previous article: “Easy application development with Couchbase, Angular and Node”.

    Clever Cloud provides support for various databases, MySQL, PostgreSQL, but also, and this is most important for me, Couchbase. Not only does Clever Cloud allow you to use database services, but you can also deploy and host your application, developed in the language/technology of your choice: Java, Node, Scala, Python, PHP, … and all this in a secure, scalable and managed environment.

    Setting up your Clever Cloud environment

    Create your account

    1. Go to the Clever Cloud site : http://www.clever-cloud.com/
    2. Click on “Login” link and follow the steps to create your account.
    3. After a few seconds you will receive an email and be redirected to the Clever Cloud Console.

    Create a Couchbase instance

    The Clever Cloud Console allows you to create your Couchbase bucket in a few clicks:

    1. Click on “Services” in the left menu

    2.  Click on “Add a Service” in the left menu 
      3. Click on “Couchbase” button.

      4. Select the size of the RAM quota for your bucket

      The size of the RAM Quota for your bucket will have an impact on performance but also on the pricing.

      5. Click “Add this Service”


      You are done, you should receive an email with all the information to access your newly created bucket.

      The mail from Clever Cloud contains the following information:

      db_host     = xxxxxxxx.couchbase.clvrcld.net   (location of the database, this is where the endpoint is located)
      db_name     = yyyyyyyy                         (name of the Couchbase bucket)
      db_username = xxxxxxxx                         (not used in the Couchbase context)
      db_password = zzzzzzzz                         (password to connect to the Couchbase bucket)

      So you are now ready to use your bucket.

      Note: In the current version of the Clever Cloud Couchbase Service you do not have access to a management console. If you want to get some information about the database or create views, you need to do it from your application code.

      Connect your Application to Couchbase@Clever-Cloud

      The first step is to get some code, so let’s clone the “Couchbase Ideas Sample Application”, and install the dependencies, using the following commands:

      git clone -b 03-vote-with-value https://github.com/tgrall/couchbase-node-ideas.git
      cd couchbase-node-ideas
      git branch mybranch
      
      git checkout mybranch
      
      npm install
      

      Open app.js and edit the connection info to point your application to the Couchbase instance, and modify the HTTP port of your application to 8080 - this is a mandatory step documented here:

      dbConfiguration = {
       "hosts": ["xxxxxxxxxxx.couchbase.clvrcld.net:8091"],
       "bucket": "xxxxxxxxxxx",
       "user": "xxxxxxxxxx",
       "password": "yyyyyyyyyyyyyyyyyyyyyyyyy"
      };
      ...
      ...
      
        appServer = app.listen(8080, function() {
       console.log("Express server listening on port %d in %s mode", appServer.address().port, app.settings.env);
        });
      
      

      Launch your application using
      
      
      node app.js
      

      Go to http://localhost:8080

      Your application is now using Couchbase in the cloud, powered by Clever Cloud. Let’s now deploy the application itself on Clever Cloud.

      Deploy your application on Clever Cloud

      The easiest way to deploy an application to Clever Cloud is using Git. The first thing to do is to add your SSH public key into the Clever Cloud Console. If you do not have any SSH key yet, follow the steps described on GitHub: “Generating SSH Keys”.

      Add your SSH key

      Note: As you can guess this should be done only once
      Open the id_rsa.pub file with a text editor. This is your SSH key. Select all and copy to your clipboard.
      1. Go to the Clever Cloud Console
      2. Click on “Profile” entry in the left menu
      3. Click on “SSH Keys”
      4. Click on “Add a SSH Key”
      5. Enter a name (anything you want) and paste your key
      6. Click “Add” button
      You are now ready to deploy applications to Clever Cloud. The next thing to do is to create a new Node application in Clever Cloud.

      Create your Application

      1. Click “Add an app” in the Application menu in the top menu.
      2. Give a name and description to this application
      3. Select the Instance type, in this case “Node.js”
      4. Configure your instances, you can keep the default values for now, click “Next”
      5. Check the configuration, and click “Create”
      Your application is created, you are redirected to the generic information page, where you can find a Git URL that we will use to deploy the application.
      You can navigate into the entries in the left menu to see more information about your application. In addition to the Information page, you can look at the following entries:
      1. “Domain Names” to configure the URL to access your application
      2. “Logs” to view the application logs

      Deploy the Application

      So we are almost there!
      The deployment to Clever Cloud is done using a Git push, so you need to add the deployment URL as a remote repository to your application and push your branch, using the following commands:
      
      
      git remote add clever git+ssh://git@push.clever-cloud.com/app_[your_app_id].git
      
      
      git commit -a -m “Couchbase on Clever Cloud connection”
      
      
      git push clever mybranch:master
      
      

      Once you have added the application as a remote repository you can commit and push your application.

      The last command pushes the application to Clever Cloud. It is important to note that Clever Cloud always deploys the application from the “master” branch of the remote repository. The notation mybranch:master is used for that reason. If you work locally on your master branch, just use “master”.

      You can now go to the Clever Cloud console, look at the logs, and click on the URL in the “Domain Names” section to test your application.

      You should be able to see your application running on the Clever Cloud PaaS.

      When you update your application, you just need to do a git commit and a git push.

      Conclusion

      In this tutorial you have learned how to:
      • Create your Clever Cloud account
      • Create a Couchbase instance
      • Create and deploy a Node.js application

      Feel free to test this yourself, with Node or another technology; as you can see, it is quite easy to set up.

      Wednesday, July 3, 2013

      SQL to NoSQL : Copy your data from MySQL to Couchbase


      TL;DR: Look at the project on Github.

      Introduction

      During my last interactions with the Couchbase community, I often had the question: how can I easily import my data from my current database into Couchbase? And my answer was always the same:
      • Take an ETL such as Talend to do it
      • Just write a small program to copy the data from your RDBMS to Couchbase...
      So I have written a small program that allows you to import the content of an RDBMS into Couchbase. This tool can be used as is, or you can look at the code and adapt it to your application.



      The Tool: Couchbase SQL Importer

      The Couchbase SQL Importer, available here, allows you, with a simple command line, to copy all - or part of - your SQL schema into Couchbase. Before explaining how to run this command, let's see how the data are stored in Couchbase when they are imported:
      • Each table row is imported as a single JSON document
        • where each table column becomes a JSON attribute
      • Each document has a key made of the name of the table and a counter (increment)
      The following concrete example, based on the MySQL World sample database, will help you to understand how it works. This database contains 3 tables : City, Country, CountryLanguage. The City table looks like:
      +-------------+----------+------+-----+---------+----------------+
      | Field       | Type     | Null | Key | Default | Extra          |
      +-------------+----------+------+-----+---------+----------------+
      | ID          | int(11)  | NO   | PRI | NULL    | auto_increment |
      | Name        | char(35) | NO   |     |         |                |
      | CountryCode | char(3)  | NO   |     |         |                |
      | District    | char(20) | NO   |     |         |                |
      | Population  | int(11)  | NO   |     | 0       |                |
      +-------------+----------+------+-----+---------+----------------+
      
      
      The JSON document that matches this table looks like the following:

      city:3805
      { 
        "Name": "San Francisco",
        "District": "California",
        "ID": 3805,
        "Population": 776733,
        "CountryCode": "USA"
      }
      
      

      You see that here I am simply taking all the rows and "moving" them into Couchbase. This is a good first step to play with your dataset in Couchbase, but it is probably not the final model you want to use for your application; most of the time you will have to see when to use embedded documents, lists of values, ... in your JSON documents.

      In addition to the JSON documents, the tool creates views based on the following logic:
      • a view that lists all imported documents with the name of the "table" (aka type) as key
      • a view for each table with the primary key columns
      View: all/by_type
      {
        "rows": [
          {"key": "city", "value": 4079}, 
          {"key": "country", "value": 239}, 
          {"key": "countrylanguage", "value": 984}
         ]
      }
      

      As you can see, this view allows you to get, with a single Couchbase query, the number of documents by type.

      Also for each table/document type, a view is created where the key of the index is built from the table primary key. Let's for example query the "City" documents.

      View: city/by_pk?reduce=false&limit=5
      {
        "total_rows": 4079,
        "rows": [
          {"id": "city:1", "key": 1, "value": null}, 
          {"id": "city:2", "key": 2, "value": null}, 
          {"id": "city:3", "key": 3, "value": null}, 
          {"id": "city:4", "key": 4, "value": null},
          {"id": "city:5", "key": 5, "value": null}
        ]
      }
      

      The index key matches the value of the City.ID column.  When the primary key is made of multiple columns the key looks like:

      View: CountryLanguage/by_pk?reduce=false&limit=5
      {
        "total_rows": 984,
        "rows": [
          {"id": "countrylanguage:1", "key": ["ABW", "Dutch"], "value": null}, 
          {"id": "countrylanguage:2", "key": ["ABW", "English"], "value": null}, 
          {"id": "countrylanguage:3", "key": ["ABW", "Papiamento"], "value": null},
          {"id": "countrylanguage:4", "key": ["ABW", "Spanish"], "value": null},
          {"id": "countrylanguage:5", "key": ["AFG", "Balochi"], "value": null}
        ]
      }
      


      This view is built from the CountryLanguage table primary key made of CountryLanguage.CountryCode and CountryLanguage.Language columns.

      +-------------+---------------+------+-----+---------+-------+
      | Field       | Type          | Null | Key | Default | Extra |
      +-------------+---------------+------+-----+---------+-------+
      | CountryCode | char(3)       | NO   | PRI |         |       |
      | Language    | char(30)      | NO   | PRI |         |       |
      | IsOfficial  | enum('T','F') | NO   |     | F       |       |
      | Percentage  | float(4,1)    | NO   |     | 0.0     |       |
      +-------------+---------------+------+-----+---------+-------+
      


      How to use Couchbase SQL Importer tool? 

      The importer is a simple Java based command line utility, quite simple to use:

      1. Download the CouchbaseSqlImporter.jar file from here. This file contains all the dependencies needed to work with Couchbase: the Couchbase Java Client and GSON.

      2. Download the JDBC driver for the database you are using as the data source. For this example I am using MySQL, so I have downloaded the driver from the MySQL site.

      3. Configure the import using a properties file.
      ## SQL Information ##
      sql.connection=jdbc:mysql://192.168.99.19:3306/world
      sql.username=root
      sql.password=password
      
      ## Couchbase Information ##
      cb.uris=http://localhost:8091/pools
      cb.bucket=default
      cb.password=
      
      ## Import information
      import.tables=ALL
      import.createViews=true
      import.typefield=type
      import.fieldcase=lower
      

      This sample properties file contains three sections:

      • The first two sections are used to configure the connections to your SQL database and your Couchbase cluster (note that the bucket must be created first)
      • The third section allows you to configure the import itself
        • import.tables : ALL to import all tables, or the list of tables you want to import, for example City, Country
        • import.createViews : true or false, to force the creation of the views.
        • import.typefield : this is used to add a new attribute to all documents that contains the "type".
        • import.fieldcase : null, lower or upper : this will force the case of the attribute names and of the type value (City, city or CITY for example).
      4. Run the tool !

      java -cp "./CouchbaseSqlImporter.jar:./mysql-connector-java-5.1.25-bin.jar" com.couchbase.util.SqlImporter import.properties 

      So you run the Java command with the proper classpath (-cp parameter).

      And you are done, you can get your data from your SQL database into Couchbase.

      If you are interested in seeing how it works internally, you can take a look at the next paragraph.

      The Code: How does it work?


      The main class of the tool is really simple, com.couchbase.util.SqlImporter; the process is:

      1. Connect to the SQL database

      2. Connect to Couchbase

      3. Get the list of tables

      4. For each table, execute a "select * from table"

        4.1. Analyze the ResultSetMetaData to get the list of columns

        4.2. Create a Java map for each row where the key is the name of the column and the value… is the value

        4.3. Serialize this map into a JSON document with GSON and save it into Couchbase

      The code is available in the ImportTable(String table) Java method.

      One interesting point is that you can use and extend the code to adapt it to your application.

      Conclusion

      I created this tool quickly to help some people in the community; if you are using it and need new features, let me know with a comment or a pull request.