Did you know that 90% of all the data in the digital universe is unstructured? All the online data generated from social media, mobiles, smart metering, etc. is what we call Big Data. Wouldn't it be nice if we could put all that data to profitable use? That's where predictive analysis comes in. Let's dig a little further.
What Is Big Data?
Every day we collectively produce 2.5 quintillion bytes of data through social media activity, mobile phones, digital content, emails, sensors, etc. Over the last two years we've created 90% of the data in the world today. This is what Big Data is.
From an engineering perspective, Big Data is a process whereby large quantities of data are structured, analyzed and modeled in order to make predictions, i.e. to build predictive models.
3Vs, i.e. 3D Growth Challenges
- Volume. An ever-growing amount of data of all types.
- Velocity. Inbound and outbound data speed.
- Variety. Different types of data.
Tools & Technologies
- Fast Data Capture & Analysis. Real-time data processing, i.e. the data is processed as soon as it is generated.
- Automated & Smart Decision Making. Algorithms that make decisions based on specific scenarios. Genetic Algorithms can also be involved for unpredictable ones.
- Distributed Storage. Increased storage capacity without having to store everything in a single location.
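The first two items above can be illustrated with a minimal sketch: events are handled one by one as they arrive, and a decision rule fires immediately. The event fields, types and threshold here are all hypothetical, chosen just to show the pattern.

```python
# A minimal sketch (hypothetical event names and thresholds) of the
# "process data as soon as it is generated, decide automatically" pattern.

def decide(event):
    """Apply a hard-coded decision rule to a single incoming event."""
    if event["type"] == "purchase" and event["amount"] > 100:
        return "offer_discount"
    return "no_action"

def process_stream(events):
    """Handle events one by one, as they arrive, instead of in a batch."""
    return [decide(e) for e in events]

stream = [
    {"type": "purchase", "amount": 150},
    {"type": "page_view", "amount": 0},
]
print(process_stream(stream))  # ['offer_discount', 'no_action']
```

In a production system the decision rules would be learned or configured rather than hard-coded, but the shape stays the same: capture, decide, act, without waiting for a batch window.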
Hadoop is an open source software framework for distributing computing problems across a number of servers; it implements the MapReduce approach pioneered by Google in compiling its search indexes. Hadoop's MapReduce involves two stages.
- Map Stage. Distributing a dataset among multiple servers and operating on the data.
- Reduce Stage. The partial results are recombined.
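The two stages above can be sketched in pure Python using word counting, the classic MapReduce example. This only illustrates the model; real Hadoop jobs are typically written in Java against Hadoop's own API.

```python
from collections import defaultdict
from itertools import chain

def map_stage(document):
    """Map: emit (key, value) pairs from one chunk of the dataset."""
    return [(word, 1) for word in document.split()]

def reduce_stage(pairs):
    """Reduce: recombine the partial results by key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Each "server" maps its own chunk; the pairs are then reduced together.
chunks = ["big data big ideas", "data at scale"]
mapped = chain.from_iterable(map_stage(c) for c in chunks)
print(reduce_stage(mapped))  # {'big': 2, 'data': 2, 'ideas': 1, 'at': 1, 'scale': 1}
```

The point of the split is that every `map_stage` call is independent, so the chunks can live on different servers and be processed in parallel before the reduce recombines them.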
Each server offers local computation and storage. As such, Hadoop provides a programming model which allows software engineers to process large datasets easily and effectively, getting results faster on a reliable and scalable architecture.
Hadoop stores data utilizing its own distributed file system – Hadoop Distributed File System (HDFS) – which makes data available across multiple computing nodes.
Hadoop's typical usage pattern – which Facebook's model follows – involves a three-phase batch process:
- loading data into HDFS
- MapReduce operations
- retrieving results from HDFS
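The three phases can be sketched as follows, with a plain dict standing in for HDFS. The paths and the toy per-user count computation are hypothetical, chosen only to make the load/compute/retrieve cycle concrete.

```python
hdfs = {}  # stand-in for the Hadoop Distributed File System

def load_into_hdfs(path, records):
    """Phase 1: load raw data into HDFS."""
    hdfs[path] = records

def run_mapreduce(in_path, out_path):
    """Phase 2: run a MapReduce job over the stored data
    (here: count records per user, as a toy computation)."""
    counts = {}
    for user in hdfs[in_path]:
        counts[user] = counts.get(user, 0) + 1
    hdfs[out_path] = counts

def retrieve_results(path):
    """Phase 3: retrieve the results from HDFS."""
    return hdfs[path]

load_into_hdfs("/input/clicks", ["alice", "bob", "alice"])
run_mapreduce("/input/clicks", "/output/click_counts")
print(retrieve_results("/output/click_counts"))  # {'alice': 2, 'bob': 1}
```

In a real deployment phases 1 and 3 would go through the HDFS client rather than an in-memory dict, but the batch rhythm – load, compute, retrieve – is the same.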
One example of a social media model following the above pattern: the core data is stored in a database and reflected into Hadoop, where computations occur (to create interest-based recommendations, for instance); the results are then transferred back into the database for use in pages served to users.
Cloud Or In-House Data Center?
Hybrid Solution. Using on-demand cloud resources to supplement in-house deployments is the option many organizations go for.
Predictive Analysis & Opportunities
The availability of massive amounts of data on current customers and prospects gives online retailers huge value, provided they can gather and analyse this information efficiently and in time. Combined with social media integration, online consumers' browsing and purchasing habits can allow content to be targeted effectively, which requires making decisions in real time.
While part of the data collected from hundreds of thousands of customers can be categorised – gender, location, etc. – the majority remains unstructured. The key is to find the best representation(s) of the collected data in order to process 100% of the information and make predictions that deliver solutions tailored to customer profiles. Such predictions could, for instance, take the shape of special offers for services complementary to products that specific customer types have been established to be buying in the near future.
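A toy sketch of that last idea: once a model has established what each customer segment tends to buy, the structured attributes of a profile can be mapped to a complementary-service offer. All segment, product and service names below are hypothetical.

```python
SEGMENT_PURCHASES = {
    # segment -> product a model predicts this segment will buy soon
    ("female", "urban"): "smartphone",
    ("male", "rural"): "camera",
}

COMPLEMENTARY_SERVICE = {
    # product -> complementary service worth offering alongside it
    "smartphone": "extended data plan",
    "camera": "photo printing subscription",
}

def tailored_offer(gender, location):
    """Return a special offer matching the customer's predicted purchase."""
    product = SEGMENT_PURCHASES.get((gender, location))
    if product is None:
        return None  # no prediction for this segment
    return COMPLEMENTARY_SERVICE[product]

print(tailored_offer("female", "urban"))  # extended data plan
```

The hard part in practice is building `SEGMENT_PURCHASES` from the unstructured majority of the data; the lookup itself is trivial once the predictive model exists.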
Data Science is an emerging discipline that aims to make Big Data "talk", i.e. to collect data and use it successfully. To the best of our understanding this is a statistical job consisting of collecting the data, sampling it, determining the best representations of the data and eventually coming up with a predictive model. Finding ways to automate such a process – to re-inject the findings into a real-time environment, for instance – could be part of it too, which would require good numerical analysis, algorithmic and coding skills as well.
Here's O'Reilly's Director of Research's take on Data Science in a short video.
Here we go then. If you want to make business out of data, you need data science. But here's the thing: how much data would you say a business needs in order to exploit the Big Data opportunity? Where would that data come from? Website traffic? Tweets, Facebook pages, LinkedIn Groups, etc.? It quickly adds up, doesn't it? So the question is: are you using data science?