BigData – it seems to be a technology area that is in vogue. It encompasses technologies such as Hadoop, Hive, Pig and Sqoop – yes they all seem to have funny names.
Every quarter in the Engineering dept. at DNN Corp, each of us reviews the previous quarter’s objectives with our manager and sets objectives for the upcoming quarter. At the beginning of this year one of my objectives was to become more familiar with these BigData technologies and determine which, if any, we need to be aware of.
Last year I blogged a few times on NoSQL Databases in a series subtitled “Look Mom NoSQL”. NoSQL Databases are a part of the BigData story – but they are not the whole story. And in many ways NoSQL databases are important regardless of BigData.
As part of my research on BigData technologies I will be continuing my blogging on NoSQL Databases, but in this blog I want to step back and look at BigData as a whole and why BigData might be important for us at DNN.
What is BigData
So let’s start by establishing what we mean by BigData. That’s a challenge in itself, as many people have different definitions, but for the most part BigData means working with very large amounts of data. As Web Developers, when we think of BigData we think about sites like Google, Facebook and Twitter. If you have over 500 million users (as Facebook claims), and all those users are posting frequent status updates and uploading pictures from their smart phones, that’s a huge amount of data. Facebook has developed technologies to turn all that data into a highly personalized, ad-based revenue engine.
It’s no surprise that most of the BigData technologies have been developed by high-profile, large-scale web properties such as Google, Facebook, Yahoo and Amazon.
But BigData is more than status updates, photo albums, tweets and search indexes.
In the not too distant past data storage was costly, so once an Enterprise was finished with its data it would either delete it or, if regulatory compliance required it, back the data up onto a cheaper medium like tape and put it into storage. But now, in the age of $100 / TB hard drives, data storage is a commodity. When we need more storage we can just add more, so we no longer need to ration our storage media.
Enterprises are finding they have a lot of accumulated data, which they are now realizing can reveal hidden insights into their businesses.
The 3 V’s
To clarify the meaning of the catch-all term BigData the three V’s of Volume, Velocity and Variety are often used to characterize the three main aspects.
The first V is Volume. There is a huge benefit to be gained by being able to analyze larger amounts of data. Having more data often beats developing better models. Let us assume that I have a marketing model that analyses 5 factors from a small sub-sample of my customer data. I am going to have a significant error in my predictions.
But what if I could use all the data for a year and analyze 200 factors? The accuracy of my predictions would be much better. Let’s assume I can reduce my prediction error from 5% to 1%. This may not seem like a big deal, but it may end up generating millions in extra revenue.
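To make the sample-size half of that argument concrete, here is a minimal sketch in Python. It uses the textbook formula for the standard error of a proportion estimated from n independent samples. The 10% conversion rate and the sample sizes are made-up numbers for illustration, not figures from any real model, and the sketch says nothing about the effect of adding more factors.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical: a 10% conversion rate measured on samples of different sizes.
for n in (1_000, 100_000, 10_000_000):
    print(f"n={n:>10,}  standard error={standard_error(0.10, n):.5f}")
```

The error shrinks with the square root of the sample size, so 100 times more data gives roughly a 10 times tighter estimate.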
The second V is Velocity: data is also being gathered at an ever faster rate. One extreme example is the Large Hadron Collider, which has about 150 million sensors delivering data at 40 million data points per second. There are about 600 million collisions a second, of which only about 100 are of importance. Because a complete analysis of the Large Hadron Collider data would take weeks or months, scientists approached the problem by throwing out 99.99% of the data and hoping they still had a representative enough sample.
In many scenarios it’s important that the “collect, analyze, implement” feedback loop is as tight as possible. The tighter the feedback loop, the greater the competitive advantage.
The third V stands for Variety. Often in BigData the data is diverse. Enterprises have different sets of data which on their own would not provide much insight, but which can provide that insight when combined with all the other data sets. These data sets have been collected in many different ways: they may be Web Server logs, e-Commerce orders from a relational database, or images uploaded to the file system. We need to be able to combine these diverse sets of data to gain the insight we need.
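As a small illustration of combining two such sources, the sketch below counts product page views from web server log lines and sets them beside order counts from a relational table. Everything in it is hypothetical: the log lines, the /products/<id> URL scheme, the orders table and its columns are invented for the example, and an in-memory SQLite database stands in for a real e-Commerce database.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical web server log lines in Common Log Format.
LOG_LINES = [
    '203.0.113.5 - - [10/Mar/2013:10:00:01 +0000] "GET /products/1 HTTP/1.1" 200 512',
    '203.0.113.6 - - [10/Mar/2013:10:00:05 +0000] "GET /products/1 HTTP/1.1" 200 512',
    '203.0.113.7 - - [10/Mar/2013:10:00:09 +0000] "GET /products/2 HTTP/1.1" 200 734',
    '203.0.113.5 - - [10/Mar/2013:10:01:00 +0000] "GET /about HTTP/1.1" 200 128',
]

# Assumed URL scheme: product pages live at /products/<numeric id>.
PRODUCT_VIEW = re.compile(r'"GET /products/(\d+) HTTP')


def count_product_views(lines):
    """Count page views per product id found in the log lines."""
    views = Counter()
    for line in lines:
        match = PRODUCT_VIEW.search(line)
        if match:
            views[int(match.group(1))] += 1
    return views


def count_orders(conn):
    """Count orders per product id from the (hypothetical) orders table."""
    cursor = conn.execute(
        "SELECT product_id, COUNT(*) FROM orders GROUP BY product_id"
    )
    return dict(cursor.fetchall())


def views_and_orders(lines, conn):
    """Combine both sources into {product_id: (views, orders)}."""
    views = count_product_views(lines)
    orders = count_orders(conn)
    return {pid: (views[pid], orders.get(pid, 0)) for pid in sorted(views)}


# Stand-in for the relational side: one order for product 1, none for product 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product_id INTEGER)")
conn.executemany("INSERT INTO orders (product_id) VALUES (?)", [(1,)])

for pid, (views, orders) in views_and_orders(LOG_LINES, conn).items():
    print(f"product {pid}: {views} views, {orders} orders")
```

Neither source alone says which products are viewed but not bought. That only shows up once the two are lined up on a shared key, here the product id.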
BigData and DNN
So why do we need to think about BigData with DNN? As we move into the Social arena there is a lot of potential to provide good analytics, as is demonstrated by the Community Dashboard Manager in our DNN Social product.
But even in a simple content-based website, combining webserver logs with data from the website can provide important insights. Google Analytics is often used in this regard, but when you use Google Analytics you don’t own the data. More and more website owners want to be able to manage their own analytics.
In the next blog we will start to look at BigData technologies and what problems they are designed to solve.