I have been using Git for quite a while but never really understood what it does under the hood. The GitHub Universe 2015 videos that were made available on YouTube this February reminded me of that fact, so I spent a couple of hours trying to understand how it does all that magic.
With a Version Control System (VCS), you can create snapshots of files at different times and jump back and forth between them. Git achieves this more or less by reading and writing to files in a directory called .git/. This directory is created for each project when running git init or git clone.
You can run Git like any other shell program:
git <command>
Each version of a file is stored as a blob (Binary Large Object) in the local filesystem, and each snapshot of one or more files is stored in what's called a commit. A commit holds references to the blobs included in the snapshot (along with other information such as the author, date, commit message and a reference to the parent commit(s)). Again, you can see how these files are managed by looking at the contents of the files in .git/; they often contain human-readable plain text.
This is the initial content of the .git/ directory after running the init command.
.git
├── HEAD            // Contains the path to the current branch
├── branches
├── config          // Contains repository configuration, e.g. the remote origin URL
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── ...
│   └── ...
├── info
│   └── exclude
├── objects         // Contains all object files such as blobs and commits
│   ├── info
│   └── pack
└── refs            // Contains the master copy of all refs, e.g. the branches and stashes
    ├── heads
    └── tags
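If you want to see the objects for yourself, you can create a throwaway repository and poke at them with Git's plumbing command cat-file, roughly like this (the repository and file names are just examples):

git init demo && cd demo
echo "hello" > greeting.txt
git add greeting.txt
git commit -m "Initial commit"
ls .git/objects                # the blob, tree and commit now live in here
git cat-file -p HEAD           # prints the commit: tree, author, date and message
git cat-file -p 'HEAD^{tree}'  # prints the tree: a reference to the blob for greeting.txt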
If you decide to share the files with other people, you could of course use Dropbox, but GitHub is probably a smarter idea. The Git project itself, for instance, is hosted on GitHub. Note that Git and GitHub are two completely different things: Git is a piece of software and GitHub is a hosting service.
Recommended Reads
If you are interested, the book Pro Git is probably all you need. I did not read every chapter, maybe half of them, but that was enough to feel more confident working with Git and dealing with what used to be its horrifying terror.
A more detailed explanation of the contents of the Git project directory can be found here, including the files that aren't initially created. The article also addresses the deprecation of .git/branches.
Slides
Last month, after reading Pro Git, I gave a short talk at my workplace introducing the basic concepts of Git. It was really about the basics, similar to what I have written above in this post, and covers the following topics.
Version Control (VC)
Git
Working directory / Index / Repository
Commits and what they consist of
Branches and other pointers to commits
Files, blobs and references
Other
The motivation behind the development of Git was that the Linux kernel team couldn't continue to use BitKeeper, another Version Control System (VCS), due to some trouble with the owners (according to Wikipedia, the BitKeeper protocols had been reverse engineered). They simply decided to create their own.
As part of a networking class at my university, I had the opportunity to give a presentation surveying the paper Large Scale Distributed Deep Networks by Jeff Dean et al., researchers at Google.
In this post I will try to summarize what I learned from the survey. The presentation slides are included below.
Motivation
Limitations of the current GPU approach
Training neural networks using GPUs has significantly improved training speed. There are, however, limitations to this approach. The biggest concern is the CPU-GPU data transfer overhead, which slows down the whole training process. The usual way to get around this issue is to design the model so that it fits in GPU memory, allowing training to be carried out without the need to transfer data. Obviously, this limits the capacity of the network.
What can we do instead?
We could develop efficient algorithms for designing compact yet accurate networks, or design hardware dedicated specifically to training neural networks, such as Facebook's Big Sur.
The surveyed paper proposes a new approach based on distributed computing.
The Paper
The paper Large Scale Distributed Deep Networks was published at NIPS in 2012 and introduces a framework called DistBelief for training large neural networks with hundreds of millions of parameters in a distributed manner.
DistBelief
I will not get into the details of the DistBelief framework since you can read about it in the paper, but please continue reading for a short summary.
The framework consists of two parts:
1. Model Parallelism
The units in the network are divided into multiple subsets, both horizontally and vertically, and each subset of units is assigned to its own machine.
2. Model Replication
Asynchrony is achieved by creating multiple instances (replicas) of the parallelized model and letting them run independently, communicating with a central service that stores the parameters of the network. Two alternatives are introduced to achieve this asynchrony:
Downpour SGD (Online Asynchronous Stochastic Gradient Descent) - The training data is split into smaller data shards and a model replica is created for each shard. All model replicas process their data independently and communicate with a sharded central parameter server (a toy sketch follows below).
Sandblaster L-BFGS (Batch Distributed Parameter Storage and Manipulation) - Replaces the central parameter server with a distributed one coordinated by a separate process. For each batch, the work is split into smaller subtasks that the coordinator dynamically assigns to an appropriate model replica, balancing the workload and maintaining high hardware utilization.
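To make the Downpour SGD idea more concrete, here is a toy sketch in plain Python and NumPy. It is my own illustration under simplifying assumptions (threads and a lock standing in for machines and a sharded parameter server), not code from the paper: each replica trains on its own data shard and asynchronously pulls and pushes a shared parameter vector.

import threading
import numpy as np

# Synthetic linear regression problem.
rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
X = rng.normal(size=(2000, 5))
y = X @ true_w

server = {"w": np.zeros(5)}                     # stands in for the parameter server
lock = threading.Lock()                         # guards pulls and pushes
shards = np.array_split(np.arange(len(X)), 4)   # one data shard per model replica

def replica(shard, seed, steps=300, lr=0.05, batch=32):
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = local_rng.choice(shard, size=batch, replace=False)
        with lock:
            w = server["w"].copy()                            # pull the current parameters
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch       # local gradient on this shard
        with lock:
            server["w"] -= lr * grad                          # push the (possibly stale) update

threads = [threading.Thread(target=replica, args=(shard, seed))
           for seed, shard in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance to the true weights:", np.linalg.norm(server["w"] - true_w))

In the real system the pulls and pushes travel over the network to parameter server shards and are not globally synchronized, which is where the stochasticity discussed below comes from.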
Thoughts
The framework presented in the paper improves the training time for deep neural networks and allows the training to scale with the number of parameters. It manages to do this despite the stochasticity it introduces through the asynchrony of the weight updates. The drawback of this approach is that it limits unit connectivity, since too many cross-machine connections add a significant overhead to the training, leaving room for further improvements.
As stated by Jeff Dean on the GitHub issue board for TensorFlow, open sourcing the distributed version is a high priority, meaning that we might see more light shed on this topic in the near future.
This post assumes that you're somewhat familiar with navigating the GitHub homepage and reading Stack Overflow.
Stack Overflow, But Then?
Stack Overflow helps you tackle all kinds of programming tasks, but it doesn't necessarily solve them for you. You might need to implement that poorly documented library only to find out that there are just too many issues that are hard to debug. Answers can't be found in any forum, and you can't afford to create a new thread and wait for a reply either. The next stop could be GitHub; more precisely, its search bar.
GitHub Search Bar
GitHub has a powerful search tool that lets you find repositories, pull requests, users, issues and even individual lines of source code. All this before the bats get out of hell [1].
Let's talk about this search bar. Learning how to use it properly might help you implement that poorly documented library. What's really neat is that it allows you to search directly in the source code of all the repositories, to see how others have written their code in situations similar to the one you are stuck in.
How To Use It
Simple Queries
Simply write a code snippet or any free text in the search bar and press Enter. A multi-word query needs to be wrapped in quotes to force an exact word-sequence match, e.g. "if let" to find binding operations. For instance, write "import tensorflow", press Enter, and then select Code as the category and Python as the language on the results page to search for files using TensorFlow, the machine learning framework from Google.
As previously mentioned, it is also possible to search for repositories by name, as well as for pull requests, users and issues. Let's look at more examples.
Advanced Queries
More sophisticated queries can be composed using GitHub's query syntax, for example:
stars:>1000 - Queries for all repositories with more than 1000 stars.
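A few more qualifiers that the syntax supports, which can be freely combined (the example values here are arbitrary):
language:python stars:>1000 - Queries for Python repositories with more than 1000 stars.
user:torvalds language:c - Queries for repositories owned by the user torvalds that are written in C.
filename:.travis.yml rvm - Queries for the word rvm inside files named .travis.yml (select Code as the category).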
For more advanced queries, search for an empty string to get to the main search page of GitHub. You can then create queries using the advanced search page or view a more comprehensive list of examples. There is also a help section explaining the syntax in more detail.
Notes
It is worth mentioning that a single query cannot be longer than 256 characters and cannot contain more than five logical AND, OR or NOT operators. Also, queries are case-insensitive by default.
How It Works
There are on average 300 search requests per minute to the GitHub servers, and the handling of all of them is backed by the cloud-based RESTful search engine Elasticsearch. It serves the 4 million users and the over 8 million repositories, comprising over 2 billion documents that need to be indexed for performance, i.e. a short time from query request to response. As soon as a repository is pushed to, the new data needs to be indexed so that it is immediately available for search. This is handled by an index of 128 shards, 120 GB in size. For more details on how this is carried out, please refer to Elasticsearch's GitHub page or this interview with the GitHub developers discussing the scaling of the data and the performance of the system (there is also a link to the recording).
Conclusions
While going through trending repositories on GitHub might be a good read, you can also use the search tool to find specific pieces of code, querying all of the 2 billion documents that are available on GitHub; it can be used to study libraries, or it could help you find answers to problems that aren't addressed on Stack Overflow (or any other forum). Adding GitHub search to your box of tools might make you a better coder. Let's do it!
There are many tutorials out there, but most of them only use the MNIST dataset available on Yann LeCun's homepage. Let's try to be different, and down the rabbit hole we go.
1-channel grayscale, MNIST
MNIST is a dataset derived from data collected by the National Institute of Standards and Technology (NIST), containing handwritten single digits: 60,000 for training and another 10,000 for testing (testing the trained network model).
The MNIST dataset is well suited for explaining CNNs since it only contains a single channel, i.e. grayscale, whereas a regular image usually contains three channels defined by the RGB model. We will proceed mainly with the RGB model.
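To make the channel difference concrete, here is a minimal NumPy sketch (the array names and zero-filled data are just for illustration):

import numpy as np

# MNIST-style batch: 60000 grayscale images of 28x28 pixels, one channel each.
gray_batch = np.zeros((60000, 28, 28, 1), dtype=np.uint8)

# A batch of ordinary RGB images of the same size: three channels per pixel.
rgb_batch = np.zeros((60000, 28, 28, 3), dtype=np.uint8)

print(gray_batch.shape)  # (60000, 28, 28, 1)
print(rgb_batch.shape)   # (60000, 28, 28, 3)

# An RGB image can be reduced to grayscale with a weighted sum over its
# channels, using the common luminance weights.
weights = np.array([0.299, 0.587, 0.114])
gray_from_rgb = rgb_batch @ weights      # shape (60000, 28, 28)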