Best Data Science GitHub Repos

By Yaniv Goldenberg

As most software developers and data scientists know, GitHub is essentially a social platform where developers collaborate on code.

Today, GitHub is the largest online host of collaborative software projects in the world. Since its launch, companies and open source projects have published their work there so that others can expand and build on top of these technological feats.

GitHub has greatly influenced tech innovation, and it now shapes data science and machine learning innovation as well. A Git repo (short for repository) is a directory or storage space where your projects can live.

Inside a repository you’ll find code files, text files, image files, you name it.
Some of the best open source machine learning and deep learning frameworks, libraries, and learning resources live in repositories that are constantly evolving.

Exploring these machine learning GitHub repos is like opening the door to some of the greatest data science minds out there and digging into their work. With that said, here is a list of the best data science GitHub repos out there today:

PyTorch is an open source machine learning library based on the Torch library and used for applications such as computer vision and natural language processing (NLP). It is primarily developed by Facebook’s AI Research lab. PyTorch provides two high-level features: tensor computing (like NumPy) with strong acceleration via graphics processing units (GPUs), and deep neural networks built on a tape-based autodiff system.
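The two features named above can be sketched in a few lines. This is a minimal illustration that assumes the `torch` package is installed; the variable names are made up for the example.

```python
import torch

# A small tensor that tracks gradients, much like a NumPy array with autodiff.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: y = sum(x^2). PyTorch records the operations as they run.
y = (x ** 2).sum()

# Backward pass: walk the recorded operations to compute dy/dx = 2x.
y.backward()
grad = x.grad

# Move data to a GPU only if one is available; otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x_on_device = x.detach().to(device)
```

Here `grad` holds the gradient of `y` with respect to each element of `x`, computed from the operations recorded during the forward pass.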

TensorFlow is an open source library and framework used by data scientists to design, build, and train deep learning models and large-scale machine learning systems. The software library is used for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. It is used for both research and production at Google.
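As a rough sketch of what differentiable programming looks like in practice, the snippet below differentiates a simple expression. It assumes TensorFlow 2.x is installed; the variable names are illustrative only.

```python
import tensorflow as tf

# A trainable scalar variable.
x = tf.Variable(3.0)

# GradientTape records operations so they can be differentiated afterwards.
with tf.GradientTape() as tape:
    y = x ** 2

# dy/dx = 2x, evaluated at x = 3.
grad = tape.gradient(y, x)
grad_value = float(grad)
```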

Scikit-learn, also known as sklearn, is a free Python machine learning library featuring various classification, regression, and clustering algorithms, including support vector machines, random forests, and gradient boosting. It is designed to interoperate with the libraries NumPy and SciPy.
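A minimal sketch of the typical workflow, assuming scikit-learn is installed. It uses the small iris dataset that ships with the library and a random forest, one of the algorithms mentioned above; the split ratio and seed are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset as NumPy arrays.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a random forest and measure accuracy on the held-out samples.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Most scikit-learn estimators follow this same `fit` and `score` pattern, so swapping in a different algorithm usually means changing a single line.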

NLTK, the Natural Language Toolkit, is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries. NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike.
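Tokenization and stemming can be tried with a short sketch like the one below. It assumes the `nltk` package is installed and deliberately uses `wordpunct_tokenize` and `PorterStemmer`, which work without downloading any corpora; the sample sentence is made up.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

sentence = "Researchers are running experiments"

# Split the text into word tokens with a regex-based tokenizer.
tokens = wordpunct_tokenize(sentence)

# Reduce each token to its stem with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
```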

Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics in the style of D3.js, and affords high-performance interactivity over large or streaming datasets. Bokeh can help anyone who would like to quickly and easily make interactive plots, dashboards, and data applications.

Matplotlib is a comprehensive, multi-platform library for creating static, animated, and interactive visualizations in Python. It is built on NumPy arrays and designed to work with the broader SciPy stack. One of its most important features is its ability to play well with many operating systems and graphics backends. Matplotlib supports dozens of backends and output types, which means you can count on it to work regardless of which operating system you are using or which output format you want.
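The point about backends and output formats can be illustrated with a small sketch. It assumes Matplotlib and NumPy are installed and selects the non-interactive `Agg` backend so it runs without a display; the same figure is then written out as both PNG and SVG.

```python
import io

import matplotlib

matplotlib.use("Agg")  # Non-interactive backend, works on any OS without a display.
import matplotlib.pyplot as plt
import numpy as np

# Plot one period of a sine wave.
x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.legend()

# Render the same figure to two different output formats in memory.
outputs = {}
for fmt in ("png", "svg"):
    buffer = io.BytesIO()
    fig.savefig(buffer, format=fmt)
    outputs[fmt] = buffer.getvalue()

plt.close(fig)
```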

Pandas is an open source Python package that provides numerous tools for data analysis. The package comes with several data structures that can be used for many different data manipulation tasks. It also has a variety of methods that can be invoked for data analysis, which comes in handy when working on data science and machine learning problems in Python.
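A minimal sketch of the core data structure, the DataFrame, and one common analysis method. It assumes pandas is installed; the table contents are made-up numbers used only for illustration.

```python
import pandas as pd

# Made-up experiment results, used only for illustration.
runs = pd.DataFrame(
    {
        "model": ["forest", "forest", "boosting", "boosting"],
        "accuracy": [0.80, 0.90, 0.70, 0.75],
    }
)

# Group rows by model and average the accuracy within each group.
mean_accuracy = runs.groupby("model")["accuracy"].mean()

# Label of the group with the highest average.
best_model = mean_accuracy.idxmax()
```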

Caffe is an open source deep learning framework specializing in language, machine vision, and multimedia. It handles many different types of deep learning architectures geared towards image classification and image segmentation, including CNN, RCNN, LSTM, and fully connected neural network designs. Caffe supports GPU- and CPU-based acceleration through computational kernel libraries such as NVIDIA cuDNN and Intel MKL.

Theano is a Python library for fast numerical computation that can be run on the CPU or GPU. It is a key foundational library for deep learning in Python that you can use directly to create deep learning models, or through wrapper libraries that greatly simplify the process. At its core, Theano is a compiler for mathematical expressions in Python. It takes your expressions and turns them into very efficient code that uses NumPy, efficient native libraries like BLAS, and native code (C++) to run as fast as possible on CPUs or GPUs. It applies a host of code optimizations to squeeze as much performance as possible from your hardware.

SciPy is an open-source Python library used to solve scientific and mathematical problems. It is built on NumPy and allows the user to manipulate and visualize data with a wide range of high-level commands. Because SciPy builds on NumPy, its functions accept and return NumPy arrays; you still import NumPy itself when you need to create or manipulate arrays directly.
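As a small sketch of one such high-level command, the snippet below minimizes a simple quadratic with `scipy.optimize.minimize`. It assumes SciPy and NumPy are installed; NumPy is imported explicitly to build the starting point, and the objective function is a made-up example.

```python
import numpy as np
from scipy import optimize


def objective(v):
    """Quadratic bowl with its minimum at v = 3."""
    return np.sum((v - 3.0) ** 2)


# Start the search from zero; SciPy works directly on NumPy arrays.
result = optimize.minimize(objective, x0=np.array([0.0]))
best_x = result.x[0]
```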

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.
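The dataset-oriented interface can be sketched as follows. This assumes seaborn, pandas, and Matplotlib are installed; the scores are made-up numbers, and the `Agg` backend is selected so the example runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # Render off-screen.
import pandas as pd
import seaborn as sns

# Made-up scores, several observations per model.
scores = pd.DataFrame(
    {
        "model": ["forest"] * 3 + ["boosting"] * 3,
        "accuracy": [0.80, 0.85, 0.90, 0.70, 0.75, 0.80],
    }
)

# Pass the whole DataFrame; seaborn maps columns to axes and
# aggregates the observations per model (the mean by default).
ax = sns.barplot(data=scores, x="model", y="accuracy")

# Save the resulting Matplotlib figure to an in-memory PNG.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

Note that the code never computes the per-model averages itself; the plotting function does the aggregation from the raw rows.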

XGBoost is an open-source software library which provides a gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It is an implementation of gradient boosted decision trees designed for speed and performance, and it has been dominating applied machine learning and Kaggle competitions for structured or tabular data.

Keras is an open-source library written in Python for fast experimentation with deep neural networks. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML, and an R interface is also available. It focuses on being user-friendly, modular, and extensible.
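To give a feel for how little code an experiment takes, here is a minimal sketch that assumes TensorFlow 2.x is installed and uses the Keras API bundled with it. The data is random and the layer sizes are arbitrary.

```python
import numpy as np
from tensorflow import keras

# Random toy data: the target is the sum of the four input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4)).astype("float32")
y = X.sum(axis=1, keepdims=True)

# Stack layers into a small fully connected network.
model = keras.Sequential(
    [
        keras.Input(shape=(4,)),
        keras.layers.Dense(8, activation="relu"),
        keras.layers.Dense(1),
    ]
)
model.compile(optimizer="adam", loss="mse")

# Train briefly and predict on the same inputs.
history = model.fit(X, y, epochs=5, batch_size=16, verbose=0)
predictions = model.predict(X, verbose=0)
```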

spaCy is a free and open-source library for Natural Language Processing (NLP) in Python with many built-in capabilities. It’s becoming increasingly popular for processing and analyzing data in NLP. Unstructured textual data is produced at a large scale, and it’s important to process and derive insights from it. To do that, you need to represent the data in a format that can be understood by computers.
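The first step in that representation is usually tokenization. The sketch below assumes the `spacy` package is installed and uses a blank English pipeline, which provides the tokenizer without requiring a trained model download; the sample sentence is made up.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Turn raw text into a Doc, a sequence of Token objects with attributes.
doc = nlp("spaCy turns raw text into tokens.")
tokens = [token.text for token in doc]
is_alpha = [token.is_alpha for token in doc]
```

Tagging, parsing, and named entity recognition require loading a trained pipeline instead of the blank one.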

Gensim is a Natural Language Processing package that does ‘Topic Modeling for Humans’. It is a great package for processing texts, working with word vector models (such as Word2Vec and FastText), and building topic models. It also lets you handle large text files without having to load the entire file into memory.

Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet made by Uber. The goal of Horovod is to make distributed deep learning fast and easy to use. It relies on ring-allreduce and requires only a few lines of modification to user code.

Spark Deep Learning Pipelines provides high-level, easy-to-use APIs for scalable deep learning in Python with Apache Spark. These APIs enable deep learning in very few lines of code and use Spark’s powerful distributed engine to scale out deep learning on massive datasets.

MXNet is a powerful open-source deep learning framework used to define, train, and deploy deep neural networks. It is lean, flexible, and ultra-scalable: it allows fast model training and supports a flexible programming model and multiple languages.

Now, take these brilliant frameworks and build your own high-impact machine learning and deep learning innovations.