Author: anshulkgupta93

Internship Experience of working in DecisionStats

I have completed almost two months of my internship at DecisionStats. It has been a great learning experience. The internship helped me improve a wide range of skills, including programming, analytics and statistics. Every morning Ajay Ohri, my guide during the internship, gave me an assignment that I was supposed to finish by evening. A typical assignment involved understanding a concept and writing code.

Some links to the work I did during the internship:

Performing Data analytics with Jupyter (formerly IPython)

A Jupyter Notebook contains both computer code (e.g. Python, MySQL) and rich text elements (paragraphs, equations, figures, links, etc.). Notebooks are both human-readable documents containing the analysis description and the results (figures, tables, etc.) and executable documents that can be run to perform data analysis. They are really helpful if you want to communicate your code or results to others, and they provide a great development environment.

To get started, install Anaconda on your machine.

First day Internship at Decisionstats

Today I started my internship at Decisionstats. Ajay Ohri, the founder, briefed me that this internship is going to be all about learning and that I’ll be writing about things that are useful for others as well. During this two-month internship I’ll be assisting him in solving the problems at https://github.com/decisionstats/pythonfordatascience and developing open source tutorials for them. He explained that the purpose is for everyone to be able to learn from this work, and that it will also help raise my data science brand value.

Decision Tree Explained

Decision trees are a common technique used in data mining to predict a target value from several input features. Predicting the output involves testing the input sample against a set of rules. Each terminal node (leaf) of the tree represents an output, i.e. the class to which a sample belongs. To figure out the output, we start at the root node of the tree and ask a sequence of questions about the features. The interior nodes are labeled with questions, and the edges or branches between them are labeled with the answers. Following the answers that match a sample's attributes, you eventually end up in a particular leaf.
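The traversal described above can be sketched in a few lines of Python. This is only an illustration: the tree is built by hand rather than learned from data, and the questions, features ("outlook", "humidity"), the threshold of 70 and the names Node, Leaf and predict are all made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    label: str  # the output assigned to every sample that ends up here


@dataclass
class Node:
    question: str                           # human-readable question for this node
    ask: Callable[[dict], str]              # maps a sample to an answer
    branches: Dict[str, Union["Node", Leaf]]  # answer -> child node or leaf


def predict(tree, sample):
    """Start at the root and follow the answers until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        answer = node.ask(sample)
        node = node.branches[answer]
    return node.label


# Hand-built toy tree (hypothetical rules, not learned from data).
toy_tree = Node(
    question="Is the outlook sunny?",
    ask=lambda s: "yes" if s["outlook"] == "sunny" else "no",
    branches={
        "yes": Node(
            question="Is the humidity above 70?",
            ask=lambda s: "yes" if s["humidity"] > 70 else "no",
            branches={"yes": Leaf("stay in"), "no": Leaf("play")},
        ),
        "no": Leaf("play"),
    },
)

print(predict(toy_tree, {"outlook": "sunny", "humidity": 85}))  # stay in
```

A real decision tree learner would choose the questions and thresholds from training data; here only the prediction step is shown.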

Matrix Multiplication with MapReduce

Big Data has possibly become the most used term in the tech world this decade. Everybody is talking about it, and everybody has their own understanding of it, which has made its definition quite ambiguous. Let us deconstruct the term with a situation. Suppose you are creating a database of movie ratings where rows indicate user IDs, columns indicate movies, and the value of each cell is the rating (0-5) given by the user to the corresponding movie. This data is likely to be sparse, since you can’t expect all users to have rated all movies. In a real-world situation you can imagine how sparse this database would be and what it would cost to store such a huge database/matrix.
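To make the sparsity point concrete, here is a small Python sketch that stores only the cells that actually hold a rating. The ratings, the matrix size and the helper names (to_dense, density) are invented for illustration; a real ratings matrix would be far larger and far sparser.

```python
# Hypothetical ratings: (user_id, movie_id) -> rating.
# Only the rated cells are stored; everything else is simply absent.
ratings = {
    (0, 2): 4,
    (1, 0): 5,
    (3, 1): 2,
}
n_users, n_movies = 4, 3


def to_dense(sparse, n_rows, n_cols, missing=None):
    """Expand the sparse mapping into a full rows x columns list of lists."""
    dense = [[missing] * n_cols for _ in range(n_rows)]
    for (row, col), value in sparse.items():
        dense[row][col] = value
    return dense


def density(sparse, n_rows, n_cols):
    """Fraction of cells that actually hold a rating."""
    return len(sparse) / (n_rows * n_cols)


dense = to_dense(ratings, n_users, n_movies)
print(len(ratings), "stored cells instead of", n_users * n_movies)
print(density(ratings, n_users, n_movies))  # 0.25
```

Using None for missing cells keeps "not rated" distinct from a rating of 0, which matters because 0 is a valid rating in the 0-5 scale described above.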

Randomly picking equal number of samples for each label in Matlab

‘no_of_samples’ is the number of samples that will remain in your data set for each label after you execute the following code:

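classes = unique(labels);
for i = 1:numel(classes)
    % Indices of all samples that currently carry this label
    cur_class_ind = find(labels == classes(i));
    % Shuffle them, then keep only the surplus beyond no_of_samples for removal
    ind_to_remove = cur_class_ind(randperm(numel(cur_class_ind)));
    ind_to_remove = ind_to_remove(1:(numel(cur_class_ind) - no_of_samples));
    % Drop the same rows from labels and data so that they stay aligned
    labels(ind_to_remove,:) = [];
    data(ind_to_remove,:) = [];
end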

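Here ‘data’ is your input dataset of dimension m x n (m = number of samples, which we are trying to crop, and n = number of features) and ‘labels’ is the m x 1 column vector containing the output class of each corresponding input sample. A class that has fewer than ‘no_of_samples’ samples is left untouched, because the removal range is then empty.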
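For readers working outside Matlab, the same idea can be sketched in plain Python. This is an analogous illustration under stated assumptions, not part of the original Matlab routine: the function name balance_classes and the seed parameter are hypothetical, and data and labels are assumed to be ordinary lists of equal length.

```python
import random
from collections import defaultdict


def balance_classes(data, labels, no_of_samples, seed=None):
    """Keep at most no_of_samples randomly chosen rows per label."""
    rng = random.Random(seed)

    # Group row indices by their label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    keep = []
    for indices in by_class.values():
        # Classes with fewer rows than requested are kept whole.
        k = min(no_of_samples, len(indices))
        keep.extend(rng.sample(indices, k))

    keep.sort()  # preserve the original row order of the survivors
    return [data[i] for i in keep], [labels[i] for i in keep]


# Toy example: 6 rows of class "a" and 4 rows of class "b".
toy_data = [[i] for i in range(10)]
toy_labels = ["a"] * 6 + ["b"] * 4
new_data, new_labels = balance_classes(toy_data, toy_labels, no_of_samples=3, seed=0)
print(new_labels.count("a"), new_labels.count("b"))  # 3 3
```

Unlike the Matlab loop, this version returns new lists instead of deleting rows in place, which avoids recomputing indices after each removal.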

Loading files iteratively for processing in Matlab

If you are a researcher working in Machine Learning, your work will very likely involve data processing in Matlab. Feature engineering involves extracting features from a large number of files (usually CSV), and these files need to be parsed so that they can be loaded iteratively and processed.

Tips and Tricks for training Neural Network in Theano

Theano is a popular Python meta-programming framework used for Deep Learning on either CPU or GPU. The purpose of this blog post is to suggest some tips you can incorporate if you are having trouble performing Deep Learning on your problem.

  • Constant Validation Error– If you have just started with Theano and are applying a logistic regression model to your problem (MNIST digit recognition is not considered a problem here), you are likely to see a constant validation error while training. If that happens, you need to fix your learning rate by finding a better one. Start with 0.1 and keep reducing it by a factor of 10 after every epoch until you see the validation error fall, then use that learning rate for training. Tip– Whenever you start training, always begin with a smaller dataset, say 500-1000 samples, and try to overfit your model. Give the same dataset to training, validation and test. You should get close to 0% test error (100% accuracy), since the model is being tested on the data it was trained on. Your network should have more nodes than your input so that it has enough capacity to fit the data. If this is not happening, there is almost certainly a bug in your implementation.
  • Gaussian Initialization– By default, the Theano developers have set the weights to be initialized from a random uniform distribution. Change this to a Gaussian (normal) distribution and you are likely to get improved results.
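The tips above can be tried out without Theano. The following is a minimal pure-Python sketch of a logistic regression that uses Gaussian weight initialization, a learning rate search that divides by 10, and the overfitting sanity check. Everything here is an assumption made for illustration: the function names, the standard deviation of 0.01, the toy data, and the reading of the learning rate tip as "restart from the same weights with each smaller rate".

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gaussian_init(n_inputs, std=0.01, seed=0):
    """Draw initial weights from a normal distribution (std is an assumed value)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n_inputs)]


def error_rate(w, b, X, y):
    """Fraction of misclassified samples."""
    wrong = 0
    for x, target in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        wrong += int((1 if z >= 0 else 0) != target)
    return wrong / len(y)


def sgd_epoch(w, b, X, y, lr):
    """One pass of stochastic gradient descent on the log loss."""
    w = list(w)
    for x, target in zip(X, y):
        grad = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
        w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
        b -= lr * grad
    return w, b


def find_learning_rate(w0, b0, X_train, y_train, X_val, y_val, start=0.1, max_tries=6):
    """Try start, start/10, ... and return the first rate whose single epoch
    lowers the validation error below its starting value (None if none does)."""
    baseline = error_rate(w0, b0, X_val, y_val)
    lr = start
    for _ in range(max_tries):
        w, b = sgd_epoch(w0, b0, X_train, y_train, lr)
        if error_rate(w, b, X_val, y_val) < baseline:
            return lr
        lr /= 10.0
    return None


# Made-up, linearly separable toy data for the overfitting sanity check.
X = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5], [1.0, 2.0], [2.0, 1.0], [1.5, 1.5]]
y = [0, 0, 0, 1, 1, 1]

w, b = gaussian_init(n_inputs=2), 0.0
for _ in range(20):
    w, b = sgd_epoch(w, b, X, y, lr=0.1)

# Same data used for training and evaluation: a correct implementation should fit it.
print(error_rate(w, b, X, y))  # 0.0
```

If the final error on such a tiny, separable dataset is not zero, the bug is more likely in the training code than in the choice of model.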

Installing Theano and integrating it with GPU on Ubuntu.

If you have working experience with Theano, you probably haven’t forgotten what a pain in the ass installing it was. So I felt it was really worth blogging about for people aspiring to get into Deep Learning. The installation instructions on the official website are enough to break the morale of any newbie who wants to get started. Follow the instructions mentioned below to set up Theano on Ubuntu.
