Today I started my internship at DecisionStats. Ajay Ohri, the founder, briefed me that this internship is going to be all about learning, and that I’ll be writing about things that are useful for others as well. During this two-month internship I’ll be assisting him in solving problems at “https://github.com/decisionstats/pythonfordatascience” and developing open-source tutorials for them. He explained that the purpose of doing this is that everyone can learn from it, and it will also help elevate my data science brand value.
Continue reading “First day Internship at Decisionstats”
MapReduce is a high-level programming model for large-scale parallel data processing, introduced by Google in 2004; its open-source implementation, Apache Hadoop, arrived in 2008, led by researchers at Yahoo. This set off a wave of large-scale parallel processing tools from the database community that added further features (Pig, HBase, Hive).
Some notable and lasting contributions of MapReduce were:
- Fault tolerance: when you are running a job across 1000 computers at a time, the probability that at least one of them fails is extremely high. Fault tolerance during query processing, so that you don’t lose the work already done, was something the MapReduce and Hadoop papers really emphasized.
- Schema on read: relational databases are built on a fixed schema, i.e. a structure is defined first and you then need to fit your data to that schema. But data almost always comes from various sources, and it does not arrive with a schema attached. MapReduce allows you to load the raw data first and impose structure only when you work over it.
- User-defined functions: the experience of writing, managing, and maintaining code in a conventional database with SQL queries was not good, because a lot of people put their logic in the application layer as opposed to the database layer. MapReduce allows you to define your own functions in the application layer.
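To make the map/reduce pattern concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. This is only an illustration of the programming model (mapper emits key–value pairs, the framework shuffles them by key, the reducer aggregates each group); the function names are my own, and a real Hadoop job would distribute these phases across machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

The user-defined logic lives entirely in `map_phase` and `reduce_phase`; fault tolerance comes from the framework being able to re-run any failed mapper or reducer on another machine.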
Continue reading “MapReduce explained”
The other day I was reading an interviewer’s answer on Quora, “What-is-a-typical-data-scientist-interview-like“, where he wrote: “What is a P-value? – I expect candidates to be able to explain to me what a P-value is and what a P-value means (even at 4am…)”. This pretty much justifies the importance of understanding the P-value.
There are many definitions already available on the web, but I still always had difficulty understanding its significance. I believe many others from a non-statistical background would empathize with this. So let me give a somewhat intuitive understanding of the p-value.
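As a quick taste of the intuition, here is a small hypothetical example (my own, not from the post): the p-value is the probability, assuming the null hypothesis is true, of seeing a result at least as extreme as the one observed. For a coin that lands heads 9 times in 12 flips, we can compute the one-sided p-value exactly under the fair-coin null.

```python
from math import comb

# Null hypothesis: the coin is fair (p = 0.5).
# Observed: 9 heads out of 12 flips.
n, k = 12, 9

# One-sided p-value: P(X >= 9) for X ~ Binomial(12, 0.5),
# i.e. the chance a fair coin gives a result at least this extreme.
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(round(p_value, 3))  # 0.073
```

A small p-value means the observed data would be surprising if the null hypothesis were true; here 0.073 is not quite below the conventional 0.05 cutoff, so 9 heads in 12 flips alone would not be strong evidence of a biased coin.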
Continue reading “P-value explained”
Right now there is a lot of hype around machine learning and big data in the tech world. This is not surprising, as they have played a significant role in automation, business advancement, and prediction. Alongside them, deep learning has also become a popular term in recent times. One interesting fact about deep learning is that it was largely abandoned in the late 1980s, but in 2006 Geoffrey Hinton introduced an algorithm which reignited research in the field.
Continue reading “Machine Learning’s Evolution to Deep Learning”