Almost every regular internet user has become accustomed to personalized recommendations. Everyone is familiar with the Recommender Systems on e-commerce websites like Amazon and Flipkart, but there are also more sophisticated systems in this space. Netflix suggests videos to watch. TiVo records programs on its own, just in case we’re interested. Pandora builds personalized music streams by predicting which song a user wants to hear next. All these enterprises use Recommender Systems to enhance the customer experience whenever a user uses their service.
DynamoDB, offered by Amazon, pioneered the idea of Eventual Consistency as a way to achieve higher availability and scalability.
A big dataset is broken into chunks, and these chunks are then sent to different machines. Replicas of these chunks are also sent to other machines for fault tolerance. So the two requirements we need to deal with here are:
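A minimal sketch of this chunk-and-replicate idea in Python. The chunk size, replication factor, and machine names below are made-up for illustration, not from any particular system:

```python
# Sketch: split a dataset into chunks and place each chunk on
# several machines, so the loss of one machine loses no data.

def chunk(data, chunk_size):
    """Break a dataset into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def assign(chunks, machines, replicas=2):
    """Place each chunk on `replicas` distinct machines (round-robin)."""
    placement = {}
    for i in range(len(chunks)):
        placement[i] = [machines[(i + r) % len(machines)] for r in range(replicas)]
    return placement

data = list(range(10))
chunks = chunk(data, 3)                               # [[0,1,2], [3,4,5], [6,7,8], [9]]
placement = assign(chunks, ["m1", "m2", "m3"], replicas=2)
# Chunk 0 lives on m1 and m2, chunk 1 on m2 and m3, and so on.
```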
The Box and Whisker plot, commonly known as the Box Plot, is one of the most useful and informative data visualizations for showing a distribution at a glance, which makes it a good tool for data scientists to use in their analysis.
Let us understand how to read a Box Plot. Suppose you are conducting a survey for a product X, asking users how many times they have used it in the past week, and you then plot the collected data as a Box Plot.
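The numbers a Box Plot draws — the quartiles, the box, the whiskers, and the outliers — can be computed directly. A small Python sketch, using made-up survey responses:

```python
# Sketch: the five-number summary behind a box plot, in pure Python.
# The usage counts below are invented survey responses.
from statistics import quantiles

usage = [0, 1, 1, 2, 2, 3, 3, 3, 4, 5, 12]   # times product X was used last week

q1, med, q3 = quantiles(usage, n=4, method="inclusive")
iqr = q3 - q1                        # height of the box
low_fence = q1 - 1.5 * iqr           # whiskers extend to the most extreme
high_fence = q3 + 1.5 * iqr          # data points inside these fences
outliers = [x for x in usage if x < low_fence or x > high_fence]
# Here q1 = 1.5, median = 3.0, q3 = 3.5, and 12 is flagged as an outlier.
```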
Suppose there is a farmer who wants to plow his farm. He has only a limited sum of money to buy livestock to get the job done: he can buy either 1024 chickens or 2 strong oxen. The smart choice is the 1024 chickens. The essential idea is parallelism: we can solve large problems by breaking them into smaller pieces and running those pieces at the same time.
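The chicken approach can be sketched in a few lines of Python: split the work into chunks and farm them out to a pool of workers. The chunk size and worker count here are arbitrary:

```python
# Sketch: the "1024 chickens" idea - break a big job into small
# pieces and run them concurrently, then combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """One worker's share of the job."""
    return sum(chunk)

numbers = list(range(1_000_000))
chunks = [numbers[i:i + 100_000] for i in range(0, len(numbers), 100_000)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))
# total equals sum(numbers)
```

For CPU-bound work in CPython you would typically use a process pool rather than threads; a thread pool just keeps the sketch self-contained.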
In Data Science, a skill which distinguishes a Data Scientist from statisticians, programmers, or database engineers is how well you are able to communicate your results and findings. Most of the time, the people to whom you have to explain your results have no background in analytics, so delivering mere numbers is not a good idea. Apart from that, understanding the data is imperative for data scientists before doing any analytics on it. Hence, Data Visualization is essential not just for communicating results but also for getting insights from data.
19th-century statistics was Bayesian, while the 20th century was Frequentist, at least from the point of view of most scientific practitioners. The Bayesian-Frequentist debate reflects two different attitudes to the process of modeling, both of which look quite legitimate.
In simple terms, Bayesian statisticians are individual researchers, or a research group, trying to use all the information they have to make the quickest possible progress, while Frequentist statisticians draw conclusions from sample data by emphasizing only the frequency or proportion of the data, without any prior knowledge about it. Hence, in the Bayesian approach we have some prior knowledge, while in the Frequentist approach we do not. You can find a more intuitive example of the difference between the two in layman’s terms here.
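The contrast can be made concrete with a coin-flip experiment. The data (7 heads in 10 flips) and the Beta(2, 2) prior below are illustrative assumptions, not from the text:

```python
# Sketch: the same coin-flip data analyzed both ways.

heads, flips = 7, 10
tails = flips - heads

# Frequentist: the estimate comes from the observed frequency alone.
freq_estimate = heads / flips                 # 0.7

# Bayesian: combine a Beta(2, 2) prior (mild belief the coin is fair)
# with the data. The posterior is Beta(2 + heads, 2 + tails), and its
# mean pulls the frequentist estimate toward the prior.
a, b = 2, 2
post_mean = (a + heads) / (a + b + flips)     # 9/14, about 0.643
```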
Neural Networks are getting so popular due to their ability to approximate any function by learning features when enough data is provided. Features are the information you give to the network: the larger the feature set, the more information you provide. They are primarily used to solve classification problems, and research continues on making them work better for regression problems as well.
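As a toy illustration of why this works, here is a hand-wired network — weights chosen by hand rather than learned — showing that two hidden units with a nonlinearity can compute XOR, a function no single linear layer can represent:

```python
# Sketch: a two-hidden-unit ReLU network that computes XOR.
# Weights are fixed by hand for illustration, not learned from data.

def relu(x):
    return max(0.0, x)

def xor_net(x1, x2):
    # Hidden layer: two ReLU units.
    h1 = relu(x1 + x2)          # fires on any active input
    h2 = relu(x1 + x2 - 1)      # fires only when both inputs are 1
    # Output layer: a linear combination of the hidden units.
    return h1 - 2 * h2

# xor_net(0,0)=0, xor_net(0,1)=1, xor_net(1,0)=1, xor_net(1,1)=0
```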
The Introduction to R course is a good way to get started with R. It gives a good practical introduction to working with data in the form of Vectors, Matrices, Factors, Data Frames, and Lists. Every topic comes with a set of exercises and an interesting story associated with them. Things you’ll be doing throughout this 4-hour course are:
MapReduce is a high-level programming model for large-scale parallel data processing, introduced by Google in 2004; its open-source implementation, Apache Hadoop, followed in 2008, led by researchers at Yahoo. This set off a big bang of large-scale parallel processing systems coming out of the database community that provide more features (Pig, HBase, Hive).
Some notable and lasting contributions of MapReduce were:
- Fault tolerance - When you are working on 1000 computers at a time, the probability that one of them fails is extremely high. Fault tolerance during query processing, so that you do not lose the work already done, was something the MapReduce and Hadoop papers really emphasized.
- Schema on read - Relational databases are built on a fixed schema, i.e. a structure is defined first and you then need to fit your data to that schema. But data almost always comes from various sources that do not provide a schema up front. MapReduce allows you to load the data as-is and work on it directly.
- User-defined functions - The experience of writing, managing, and maintaining code in a traditional database with SQL queries was not good, because many people put their logic inside the application layer as opposed to the database layer. MapReduce allows you to define your functions in the application layer.
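The model itself can be sketched in plain Python with the canonical word-count example. The map and reduce functions here are user-defined, exactly as the model intends; a real framework would distribute these calls across machines rather than run them sequentially:

```python
# Sketch: MapReduce word count, run sequentially in one process.
from collections import defaultdict

def map_fn(line):
    """User-defined map: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """User-defined reduce: sum all counts for one word."""
    return (word, sum(counts))

lines = ["the quick fox", "the lazy dog", "the fox"]

# Map phase: apply map_fn to every input record.
pairs = [kv for line in lines for kv in map_fn(line)]

# Shuffle phase: group all values by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: apply reduce_fn to each group.
result = dict(reduce_fn(word, counts) for word, counts in groups.items())
# result["the"] == 3, result["fox"] == 2
```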
R is a great tool for quick visualization and playing around with data. Here I’ll use the Boston dataset, do some visualization on it using the ggplot2 library, and then apply one machine learning algorithm.
library(ggplot2)
library(MASS)
data(Boston)
You can find a detailed description of the dataset in its documentation using:
?Boston