The cloud and big data
When the cloud came into being, it brought immense storage capacity at lower cost and ushered in the era of big data. It also raised expectations among statisticians - and the decision makers who depended on them - that this would do wonders for their decision-making processes.
Boon or bane?
The available sample had grown drastically thanks to social media and IoT, so far more data was on hand. Applying statistical models to this huge volume of data would surely improve the probability estimates for a predicted event occurring (or not occurring), or improve the reliability of forecasts by pushing the R-squared value to near unity. Right? Wrong. The data deluge added more noise than dependable signal.
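The R-squared claim is worth making concrete. Below is a minimal Python sketch (my own illustration, not part of the session) showing that when the extra observations carry mostly noise, a larger sample does not push R-squared toward unity; it only pins down the same weak relationship more precisely.
import numpy as np
# Illustration only: with a weak signal buried in heavy noise, R-squared
# stays low no matter how many observations are collected.
rng = np.random.default_rng(42)
def r_squared(x, y):
    # R-squared of a simple least-squares line fitted to (x, y)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    y = 0.5 * x + rng.normal(scale=3.0, size=n)  # weak signal, heavy noise
    print(f"n={n:>9,}  R^2={r_squared(x, y):.3f}")
# R^2 hovers around 0.03 for every n: more noisy data does not make the
# forecast more reliable, it only re-confirms how weak the signal is.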
Illusion or disillusion?
As time went by, people became disillusioned by the system's failure to give them reliable information for decision-making. When their hyped-up expectations were not met, they simply dropped off quickly without pursuing the journey further.
The signal and the noise
It was now the turn of the experts to come up with reasons why such huge volumes of data could not help people decide better. One significant reason is that while there is enough data - and more - for the model, it requires a great deal of cleaning - removing the noise that could distort results and predictions - before it can be put to any use at all.
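To make "cleaning" less abstract, here is a minimal pandas sketch; the DataFrame and its column names ("timestamp", "user_id", "revenue") are hypothetical, and the steps are only the typical kind of de-duplication, type coercion and outlier handling that has to happen before any model sees the data.
import pandas as pd
def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative cleaning pass over a hypothetical events table.
    out = df.copy()
    out = out.drop_duplicates()                      # double-ingested records
    out = out.dropna(subset=["user_id", "revenue"])  # rows the model cannot use
    out["timestamp"] = pd.to_datetime(out["timestamp"], errors="coerce")
    out = out.dropna(subset=["timestamp"])           # unparseable timestamps
    lo, hi = out["revenue"].quantile([0.01, 0.99])
    out["revenue"] = out["revenue"].clip(lo, hi)     # cap extreme outliers
    return out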
Persistence pays!
Early adopters of the technology gained over the long run. Microsoft and Amazon are examples of winners who persisted in their vision of making big data the fuel for their decision-making engines. They soon pulled themselves out of the trough of disillusionment and onto the slope of enlightenment by applying scientific methods to the data they gathered and adopting newer techniques to strip noise and false signals from it. That is how they arrived at real signals that helped them build reliable data models. They have now climbed to the plateau of productivity, with their data models supporting better, information-based decision making.
Here are a few points to ponder:
People expect a lot from technology today, but while we have a lot of data, there are not enough people with the skills to make this big data useful, and not enough training and skill-building effort is going into turning this huge population of technology experts into data scientists.
Cleaning up data is the first big problem in predictive analysis - many external factors can distort the data that has been collected.
If we see a correlation between two variables and don't know what causes it, it is better not to rely on that correlation at all (a starfish predicting the FIFA World Cup winner, or a baseball team's win or loss determining the movement of the share market). The sketch after these points shows how easily such coincidental correlations arise.
Desperately seeking signals, people end up with more noise than signal - so they make decisions with instinct, gut feeling and experience playing an 80% part and statistics the remaining 20%. Instead, we should be guided by statistics 80% of the way and leave the remaining 20% to our instincts, and even that only when there is a drastically negative indicator in the statistical model.
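The point about unexplained correlations can be illustrated with a short simulation (mine, not from the session): scan enough unrelated random series and a strong correlation will appear somewhere purely by chance.
import numpy as np
# 200 independent random "indicators", 20 observations each; no series
# has any causal link to any other.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 20))
best = 0.0
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        best = max(best, abs(np.corrcoef(data[i], data[j])[0, 1]))
print(f"strongest correlation among unrelated series: {best:.2f}")
# A pair with |r| above 0.7 usually turns up; the starfish and the
# baseball-team "signals" are the same accident of searching enough pairs.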
Here are some suggestions to reduce the noise
and arrive at signals:
Start with a hypothesis or instinct and keep refining it as you go ahead with the analysis - this might sometimes lead you to reverse your hypothesis altogether, as the sketch below illustrates.
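One way to read "refine the hypothesis as you go" is as Bayesian updating. The sketch below uses made-up numbers to show a prior belief being revised batch by batch until the evidence reverses the original hypothesis.
def update(prior, lik_if_true, lik_if_false):
    # Posterior P(hypothesis | evidence) via Bayes' rule.
    num = lik_if_true * prior
    return num / (num + lik_if_false * (1 - prior))
belief = 0.70  # initial instinct: 70% sure a new feature lifts conversion
# Each tuple is (P(batch | hypothesis true), P(batch | hypothesis false));
# the figures are invented for illustration.
for lik_true, lik_false in [(0.4, 0.6), (0.3, 0.7), (0.35, 0.65)]:
    belief = update(belief, lik_true, lik_false)
    print(f"belief after this batch: {belief:.2f}")
# 0.70 -> 0.61 -> 0.40 -> 0.26: the accumulating evidence reverses the
# starting hypothesis rather than confirming it.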
Think probabilistically
When predicting, consider the margin of error (uncertainty) in the historic data and include it in the prediction before making a decision. The person who discloses the greatest uncertainty is doing a better job than the one who conceals the uncertainty in his prediction. Carry three things with you while predicting: data models, the scientific theories that influence the situation, and experience (learn from the forecasts you have made and the feedback you received on them).
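A small sketch (illustrative figures, normal approximation assumed) of what folding the margin of error into a forecast looks like, rather than reporting a bare point estimate:
import numpy as np
rng = np.random.default_rng(7)
historic = rng.normal(loc=100.0, scale=12.0, size=36)  # e.g. 36 months of demand
forecast = historic.mean()            # naive point forecast: the historic average
spread = 1.96 * historic.std(ddof=1)  # rough 95% band for one future month,
                                      # assuming stable, roughly normal demand
print(f"next month: about {forecast:.1f}, plausibly {forecast - spread:.1f} to {forecast + spread:.1f}")
# Reporting the band tells the decision maker how much weight the number
# can bear; hiding it makes the forecast look more certain than it is.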
Know where you come from
Consider the background and existing biases of the prospective forecaster or decision maker, and the situation in which the data is being collected and considered.
Try and err
Companies need to put in the 80% of effort that yields the last 20% of results to retain their competitive advantage - real statistics from a few customers are better than hypothetical data about a huge number of customers.
Notes:
Large and smart companies, especially technology firms, should dare to take risks in their areas of competitive advantage. Most of the risk-taking will pay off. Being big, they can absorb failures, unlike small firms and individuals, for whom the same bets might amount to gambling.
People make better inferences from visuals than from raw data. Charts must show simple, essential information. Unless it is required for greater clarity, we should avoid adding information that crowds the chart and creates more noise.
People must become bias detectors - raise business questions and be wary of magic-bullet solutions.
Analysts should disclose the limitations of their analyses.
- Insights from a session by Nate Silver