Wednesday, February 4, 2015

Introduction to MongoDB Security


Last week at the Paris MUG, I had a quick chat about security and MongoDB, and I have decided to create this post that explains how to configure out of the box security available in MongoDB.

You can find all information about MongoDB Security in following documentation chapter:


In this post, I won't go into the detail about how to deploy your database in a secured environment (DMZ/Network/IP/Location/...)

I will focus on Authentication and Authorization, and provide you the steps to secure the access to your database and data.

I have to mention that by default, when you install and start MongoDB, security is not enabled. Just to make it easier to work with.

The first part of the security is the Authentication, you have multiple choices documented here. Let's focus on "MONGODB-CR" mechanism.

The second part is Authorization to select what a user can do or not once he is connected to the database. The documentation about authorization is available here.

Let's now document how-to:
  1. Create an Administrator User
  2. Create Application Users
For each type of users I will show how to grant specific permissions.

Sunday, February 1, 2015

Moving My Beers From Couchbase to MongoDB

See it on my new blog : here

Few days ago I have posted a joke on Twitter

So I decided to move it from a simple picture to a real project. Let’s look at the two phases of this so called project:
  • Moving the data from Couchbase to MongoDB
  • Updating the application code to use MongoDB
Look at this screencast to see it in action:



Friday, January 23, 2015

Everybody Says “Hackathon”!

TLTR:

  • MongoDB & Sage organized an internal Hackathon
  • We use the new X3 Platform based on MongoDB, Node.js and HTML to add cool features to the ERP
  • This shows that “any” enterprise can (should) do it to:
    • look differently at software development
    • build strong team spirit
    • have fun!

Introduction

I have like many of you participated to multiple Hackathons where developers, designer and entrepreneurs are working together to build applications in few hours/days. As you probably know more and more companies are running such events internally, it is the case for example at Facebook, Google, but also ING (bank), AXA (Insurance), and many more.

Last week, I have participated to the first Sage Hackathon!

In case you do not know Sage is a 30+ years old ERP vendor. I have to say that I could not imagine that coming from such company… Let me tell me more about it.



Tuesday, January 20, 2015

Nantes MUG : Event #2

Last night the Nantes MUG (MongoDB Users Group) had its second event. More than 45 people signed up and joined us at the Epitech school (thanks for this!).  We were lucky to have 2 talks from local community members:

How “MyScript Cloud” uses MongoDB

First of all, if you do not know MyScript I invite you to play with the online demonstration. I am pretty sure that you are already using this technology without noticing it, since it is embedded in many devices/applications including: your car look at the Audi Touchpad!

That said Mathieu was not here to talk about the cool features and applications of MyScript but to explain how MongoDB is used to run their cloud product. 

Mathieu explained how you can use MyScript SDK online. You just need to call a REST API to add Handwriting Recognition to your application. Let's make the long story short, and see how MongoDB was chosen and how it is used today:
  • The prototype was done with a single RDBMS instance
  • With the success of the project MyScript Cloud the team had to move to a more flexible solution:
    • Flexible schema to support heterogeneous structures,
    • Highly available solution with automatic failover,
    • Multi datacenter supports with localized read,
  • This is when Mathieu looked at different solution and selected MongoDB and deployed it on AWS.
Mathieu highlighted the following points:
  • Deploy and Manage a Replica Set is really easy, and it is done on multiple AWS data centers,
  • Use the proper read preference  (nearest in this case) to deliver the data as fast as possible,
  • Develop with JSON Documents provides lot of flexibility to the developers, that can add new features faster.





Aggregation Framework

Sebastien "Seb" is software engineering at SERLI and working with MongoDB for more than 2 years now. Seb introduced the reasons why aggregations are needed in applications and the various ways of doing it with MongoDB: simple queries, map reduce, and aggregation pipeline; with a focus on a Aggregation Pipeline.

Using cool demonstrations, Seb explained in a step by step approach the key features and capabilities of MongoDB Aggregation Pipeline:
  • $match, $group, ...
  • $unwind arrays
  • $sort and $limit
  • $geonear
To close his presentation, Seb talked about aggregation best practices, and behavior in a sharded cluster.




And...

As usual the event ended with some drinks and a late dinner!

This event was really great and I am very happy to see what people are doing with MongoDB, including storing ink like MyScript, thanks again to the speakers!

This brings me to the last point : MUGs are driven by the community. You are using MongoDB and want to talk about what you, do not hesitate to reach out the organizers they will be more than happy to have you.

You can find a MUG near you, look here.





Monday, January 12, 2015

How to create a pub/sub application with MongoDB ? Introduction

In this article we will see how to create a pub/sub application (messaging, chat, notification), and this fully based on MongoDB (without any message broker like RabbitMQ, JMS, ... ).

So, what needs to be done to achieve such thing:

  • an application "publish" a message. In our case, we simply save a document into MongoDB
  • another application, or thread, subscribe to these events and will received message automatically. In our case this means that the application should automatically receive newly created document out of MongoDB
All this is possible with some very cool MongoDB features : capped collections and tailable cursors

Capped Collections and Tailable Cursors

As you can see in the documentation, Capped Collections are fixed sized collections, that work in a way similar to circular buffers: once a collection fills its allocated space, it makes room for new documents by overwriting the oldest documents.

MongoDB Capped Collections can be queried using Tailable Cursors, that are similar to the unix tail -f command.  Your application continue to retrieve documents as they are inserted into the collection. I also like to call this a "continuous query".

Now that we have seen the basics, let's implement it.

Building a very basic application 

Create the collection

The first thing to do is to create a new capped collection :


For simplicity, I am using the MongoDB Shell to create the messages collection in the chat database.

You can see on line #7 how to create a capped collection, with 2 options:
  • capped : true : this one is obvious
  • size : 10000 :  this is a mandatory option when you create a capped collection. This is the maximum size in bytes. (will be raised to a multiple of 256)
Finally, on line #9, I insert a dummy document, this is also mandatory to be able to get the tailable cursor to work. 

Write an application

Now that we have the collection, let's write some code.  First in node.js:


From lines #1 to 5 I just connect to my local MongoDB instance.

Then on line #7, I get the messages collection.

And on line #10, I execute a find, using a tailable cursor, using specific options:

  • {} : no filter, so all documents will be returned
  • tailable : true : this one is clear, to say that we want to create a tailable cursor
  • awaitdata : true : to say that we wait for data before returning no data to the client
  • numberOfRetries : -1 :  The number of times to retry on time out, -1 is infinite, so the application will keep trying
Line #11 just force the sort to the natural order,.

Then on line #12, the cursor returns the data, and the document is printed in the console each time it is inserted.

Test the Application

Start the application

node app.js

Insert documents in the messages collection, from the shell or any other tool. 

You can find below a screencast showing this very basic application working:


The source code of this sample application in this Github repository, take the step-01 branch; clone this branch using:

git clone -b step-01 https://github.com/tgrall/mongodb-realtime-pubsub.git


I have also created a gist showing the same behavior in Java:


Mathieu Ancelin has written it in Scala:

Add some user interface

We have the basics of a publish subscribe based application:
  • publish by inserting document into MongoDB
  • subscribe by reading document using a tailable cursor
Let's now push the messages to a user using for example socket.io. For this we need to:
  • add socket.io dependency to our node project
  • add HTML page to show messages
The following gists shows the updated version of the app.js and index.html, let's take a look:

The node application has been updated with the following features:

  • lines #4-7: import of http, file system and socket.io
  • lines #10-21: configure and start the http server. You can see that I have created a simple handler to serve static html file
  • lines #28-39: I have added support to Web socket using socket.io where I open the tailable cursor, and push/emit the messages on the socket.
As you can see, the code that I have added is simple. I do not use any advanced framework, nor manage exceptions, this for simplicity and readability.

Let's now look at the client (html page).

Same as the server, it is really simple and does not use any advanced libraries except socket.io client (line #18) and JQuery (line #19), and used:

  • on line #22 to received messages ans print them in the page using JQuery on line #23
I have created a screencast of this version of the application:




You can find the source code in this Github repository, take the step-02 branch; clone this branch using:

git clone -b step-02 https://github.com/tgrall/mongodb-realtime-pubsub.git


Conclusion

In this first post, we have:

  • learned about tailable cursor and capped collection
  • see how it can be used to develop a pub/sub application
  • expose this into a basic web socket based application
In the next article we will continue to develop a bigger application using these features.


Tuesday, November 25, 2014

Big Data... Is Hadoop the good way to start?

In the past 2 years, I have met many developers, architects that are working on “big data” projects. This sounds amazing, but quite often the truth is not that amazing.

TL;TR

You believe that you have a big data project?
  • Do not start with the installation of an Hadoop Cluster -- the "how"
  • Start to talk to business people to understand their problem -- the "why"
  • Understand the data you must process
  • Look at the volume -- very often it is not "that" big
  • Then implement it, and take a simple approach, for example start with MongoDB + Apache Spark

The infamous "big data project"

A typical discussion would look like:

Me: “Can you tell me more about this project, what do you do with your data?”

Mr. Big Bytes: “Sure, we have a 40 nodes Hadoop cluster..."

Me: “This is cool but which type of data do you store, and what is the use case, business value?"

Mr. Big Bytes: “We store, all the logs of our applications, we have hundreds of gigabits…"
After a long blank:“We have not yet started to analyze these data. For now it is jut  'us, the IT team,' we store the data, like that soon we will be able to do interesting things with them"

You can meet the same person few months later; the cluster is still sitting here, with no activity on it. I even met some consultants telling me they received calls from their customer asking the following:
“Hmmm, we have an Hadoop cluster installed, can you help us to find what to do with it?"

Wrong! That is wrong!!!!! This means that the IT Team has spent lot of time for nothing, at least for the business; and I am not even sure the team has learned something technically.

Start with the "Why" not with the "How"!

The solution to this, could be obvious, start your “big data project” answering the “why/what” questions first! The “how”, the implementation, will come later.

I am sure that most of the enterprises will benefit of a so called “big data project”, but it is really important to understand the problems first. And these problems are not technical… at least at the beginning. So you must spend time with the business persons to understand what could help them. Let's take some examples.

You are working in a bank or insurance, business people will be more than happy to predict when/why customer will be leaving the company by doing some churn analysis; or it will be nice to be able to see when it makes lot of sense to sell new contracts, service to existing customers. 

You are working in retail/commerce, your business will be happy to see if they can adjust the price to the market, or provide precise recommendations to a user from an analysis of other customer behavior.

We can find many other examples. But as you can see we are not talking about technology, just business and possible benefits. In fact nothing new, compare with the applications you are building, you need first to have some requirements/ideas to build a product. Here we just need to have some "data input" to see how we can enrich the data with some business value.

Once you have started to ask all these questions you will start to see some input, and possible processing around them:
  • You are an insurance, you customers has no contact with your representative, or the customer satisfaction is medium/bad; you start to see some customer name in quotes coming from price comparison website…. hmm you can guess that they are looking for a new insurance. 
  • Still in the insurance, when your customer are close to the requirement age, or has teenagers learning how to drive, moving to college, you know that you have some opportunity to sell new contract, or adapt existing ones to the new needs
  • In retail, you may want to look to all customers and what they have ordered, and based on this be able to recommend some products to a customer that "looks" the same.
  • Another very common use case these days, you want to do some sentiment analysis of social networks to see how your brand is perceived by your community
As you can see now, we can start to think about the data we have to use and the type of processing we have to do on them.

Let's now talk about the "How"

Now that you have a better idea about what you want to do, it does not mean that you should dive into a large cluster installation.

Before that, you should continue to analyze the data:
  • What is the structure of the data that I have to analyze?
  • How big is my dataset?
  • How much data I have to ingest on a period of time (minute, hour, day, ...)
All these questions will help you to understand better your application. This is where it is often interesting too, and we realize that for most of us the "big data" is not that big!

I was working the other day with a Telco company in Belgium, and I was talking about possible new project. I simply said:
  • Belgium is what, 11+ millions of people
  • If you store a 50kb object for each person this represent:
  • Your full dataset will be 524Gb, yes not even a Terabyte!
Do you need a large Hadoop cluster to store and process this? You can use it, but you do not need to! You can find something smaller, and easier to start with.

Any database will do the job, starting with MongoDB. I think it is really interesting to start this project with a MongoDB cluster, not only because it will allow you to scale out as much as you need, but also because you will leverage the flexibility of the document model. This will allow you to store any type of data, and easily adapt the structure to the new data, or requirements.

Storing the data is only one part of the equation. The other part is how you achieve the data processing. Lately I am playing a lot with Apache Spark. Spark provides a very powerful engine for large scale data processing, and it is a lot simpler than Map Reduce jobs. In addition to this, you can run Spark without Hadoop! This means you can connect you Spark to your MongoDB, with the MongoDB Hadoop Connector and other data sources and directly execute job on your main database.

What I like also about this approach, you can when you dataset starts to grow, and it become harder to process all the data on your operational database, you can easily add Hadoop; and keep most of your data processing layer intact, and only change the data sources information. In this case you will connect MongoDB and Hadoop to get/push the data into HDFS, once again using the MongoDB Hadoop Connector.

Conclusion

Too many times, projects are driven by technology instead of focusing on the business value. This is particularly true around big data projects. So be sure you start by understanding the business problem, and find the data that could help to solve it.

Once you have the business problem and the data, select the good technology, that could be very simple, simple files and python scripts, or more often a database like MongoDB with a data processing layer like Spark. And start to move to Hadoop when it is really mandatory... a very, very, very, large dataset. 





Thursday, August 21, 2014

Introduction to MongoDB Geospatial feature


This post is a quick and simple introduction to Geospatial feature of MongoDB 2.6 using simple dataset and queries.


Storing Geospatial Informations

As you know you can store any type of data, but if you want to query them you need to use some coordinates, and create index on them. MongoDB supports three types of indexes for GeoSpatial queries:
  • 2d Index : uses simple coordinate (longitude, latitude). As stated in the documentation: The 2d index is intended for legacy coordinate pairs used in MongoDB 2.2 and earlier. For this reason, I won't detail anything about this in this post. Just for the record 2d Index are used to query data stored as points on a two-dimensional plane
     
  • 2d Sphere Index : support queries of any geometries on an-earth-like sphere, the data can be stored as GeoJSON and legacy coordinate pairs (longitude, latitude). For the rest of the article I will use this type of index and focusing on GeoJSON.
     
  • Geo Haystack : that are used to query on very small area. It is today less used by applications and I will not describe it in this post.
So this article will focus now on the 2d Sphere index with GeoJSON format to store and query documents.

So what is GeoJSON?

You can look at the http://geojson.org/ site, let's do a very short explanation. GeoJSON is a format for encoding, in JSON, a variety of geographic data structures, and support the following types:  Point , LineString , Polygon , MultiPoint , MultiLineString , MultiPolygon and Geometry.

The GeoJSON format  is quite straightforward based, for the simple geometries, on two attributes: type and coordinates. Let's take some examples:

The city where I spend all my childhood, Pleneuf Val-André, France, has the following coordinates (from Wikipedia)
 48° 35′ 30.12″ N, 2° 32′ 48.84″ W
This notation is a point, based on a latitude & longitude using the WGS 84 (Degrees, Minutes, Seconds) system. Not very easy to use by application/code, this is why it is also possible to represent the exact same point using the following values for latitude & longitude:
48.5917, -2.5469
This one uses the WGS 84 (Decimal Degrees) system. This is the coordinates you see use in most of the application/API you are using as developer (eg: Google Maps/Earth for example)

By default GeoJSON, and MongoDB use these values but the coordinates must be stored in the longitude, latitude order, so this point in GeoJSON will look like:

{
  "type": "Point",
  "coordinates": [
    -2.5469,  
    48.5917 
  ]
}


This is a simple "Point", let's now for example look at a line, a very nice walk on the beach :

{
  "type": "LineString",
  "coordinates": [
    [-2.551082,48.5955632],
    [-2.551229,48.594312],
    [-2.551550,48.593312],
    [-2.552400,48.592312],
    [-2.553677, 48.590898]
  ]
}


So using the same approach you will be able to create MultiPoint, MultiLineString, Polygon, MultiPolygon. It is also possible to mix all these in a single document using a GeometryCollection. The following example is a Geometry Collection of MultiLineString and Polygon over Central Park:

{
  "type" : "GeometryCollection",
  "geometries" : [
    {
      "type" : "Polygon",
      "coordinates" : [
         [
	  [ -73.9580, 40.8003 ],
          [ -73.9498, 40.7968 ],
	  [ -73.9737, 40.7648 ],
	  [ -73.9814, 40.7681 ],
	  [ -73.9580, 40.8003  ]
	 ]
       ]
    },
    {
      "type" : "MultiLineString",
      "coordinates" : [
         [ [ -73.96943, 40.78519 ], [ -73.96082, 40.78095 ] ],
 	 [ [ -73.96415, 40.79229 ], [ -73.95544, 40.78854 ] ],
         [ [ -73.97162, 40.78205 ], [ -73.96374, 40.77715 ] ],
         [ [ -73.97880, 40.77247 ], [ -73.97036, 40.76811 ] ]
       ]
     }
  ]
}

Note: You can if you want test/visualize these JSON documents using the http://geojsonlint.com/ service. 


Now what? Let's store data!

Once you have a GeoJSON document you just need to store it into your document. For example if you want to store a document about JFK Airport with its location you can run the following command:

db.airports.insert(
  {
    "name" : "John F Kennedy Intl",
    "type" : "International",
    "code" : "JFK",
    "loc" : {
      "type" : "Point",
      "coordinates" : [ -73.778889, 40.639722 ]
    }
}

Yes this is that simple! You just save the GeoJSON as one of the attribute of the document, (loc in this example)

Querying Geospatial Informations

Now that we have the data stored in MongoDB, it is now possible to use the geospatial information to do some interesting queries. 

For this we need a sample dataset. I have created one using some open data found in various places. This dataset contains the following informations:
  • airports collection with the list of US airport (Point)
  • states collection with the list of US states (MultiPolygon)
I have created this dataset from various OpenData sources ( http://geocommons.com/http://catalog.data.gov/dataset ) and use toGeoJSON to convert them into the proper format.

Let's install the dataset:
  1. Download it from here
  2. Unzip geo.zip file
  3. Restore the data into your mongoDB instance, using the following command
    mongorestore geo.zip
MongoDB allows applications to do the following types of query on geospatial data:

  • inclusion
  • intersection
  • proximity
Obviously, you will be able to use all the other operator in addition to the geospatial ones. Let's now look at some concrete examples. 

Inclusion

Find all the airports in California. For this you need to get the California location (Polygon) and use the command $geoWithin in the query. From the shell it will look like :

use geo

var cal = db.states.findOne(  {code : "CA"}  );

db.airports.find( 
  { 
    loc : { $geoWithin : { $geometry : cal.loc } } 
  },
  { name : 1 , type : 1, code : 1, _id: 0 } 
);

Result:

{ "name" : "Modesto City - County", "type" : "", "code" : "MOD" }
...
{ "name" : "San Francisco Intl", "type" : "International", "code" : "SFO" }
{ "name" : "San Jose International", "type" : "International", "code" : "SJC" }
...

So the query is using the "California MultiPolygon" and looks in the airports collection to find all the airports that are in these polygons. This looks like the following image on a map:



You can use any other query features or criteria, for example you can limit the query to international airport only sorted by name :

db.airports.find( 
  { 
    loc : { $geoWithin : { $geometry : cal.loc } },
    type : "International" 
  },
  { name : 1 , type : 1, code : 1, _id: 0 } 
).sort({ name : 1 });

Result:

{ "name" : "Los Angeles Intl", "type" : "International", "code" : "LAX" }
{ "name" : "Metropolitan Oakland Intl", "type" : "International", "code" : "OAK" }
{ "name" : "Ontario Intl", "type" : "International", "code" : "ONT" }
{ "name" : "San Diego Intl", "type" : "International", "code" : "SAN" }
{ "name" : "San Francisco Intl", "type" : "International", "code" : "SFO" }
{ "name" : "San Jose International", "type" : "International", "code" : "SJC" }
{ "name" : "Southern California International", "type" : "International", "code" : "VCV" }


I do not know if you have looked in detail, but we are querying these documents with no index. You can run a query with the explain()to see what's going on. The $geoWithin operator does not need index but your queries will be more efficient with one so let's create the index:

db.airports.ensureIndex( { "loc" : "2dsphere" } );

Run the explain and you will se the difference.


Intersection

    Suppose you want to know what are all the adjacent states to California, for this we just need to search for all the states that have coordinates that "intersects" with California. This is done with the following query:

    var cal = db.states.findOne(  {code : "CA"}  );
    
    
    
    db.states.find(
        loc : { $geoIntersects : { $geometry : cal.loc  }  } ,
        code : { $ne : "CA"  }  
      }, 
      { name : 1, code : 1 , _id : 0 } 
    );


    Result:

    { "name" : "Oregon", "code" : "OR" }
    { "name" : "Nevada", "code" : "NV" }
    { "name" : "Arizona", "code" : "AZ" }
    
    
    
    
    Same as before $geoIntersect operator does not need an index to work, but it will be more efficient with the following index:

    db.states.ensureIndex( { loc : "2dsphere" } );

    Proximity

    The last feature that I want to highlight in this post is related to query with proximity criteria. Let's find all the international airports that are located at less than 20km from the reservoir in NYC Central Park. For this you will be using the $near operator.

    db.airports.find(
    {
    loc : {
    $near : {
    $geometry : { 
    type : "Point" , 
    coordinates : [-73.965355,40.782865]  
    }, 
    $maxDistance : 20000
    }
    }, 
    type : "International"
    },
    {
    name : 1,
    code : 1,
    _id : 0
    }
    );


    Results:

    { "name" : "La Guardia", "code" : "LGA" } { "name" : "Newark Intl", "code" : "EWR"


    So this query returns 2 airports, the closest being La Guardia, since the $near operator sorts the results by distance. Also it is important to raise here that the $near operator requires an index.

    Conclusion

    In this first post about geospatial feature you have learned:
    • the basic of GeoJSON
    • how to query documents with inclusion, intersection and proximity criteria.
    You can now play more with this for example integrate this into an application that expose data into some UI, or see how you can use the geospatial operators into a aggregation pipeline.