GSoC/2010-Network
Abstract
The goal of this project is to make getting darcs repositories over network faster. Currently, getting large repositories is unacceptably slow. There are numerous complaints from darcs users about poor network performance. To address this issue, I will make the following improvements:
- introduce feature to optimize a darcs repository for fetching over a network,
- develop a smart server that can provide clients with only files they need in one request.
Content
About the project
The simplest way to share Darcs repository is to place it in web server’s directory. However, as the repository grows, it fills with a large number of small files. This leads to performance problems when getting repository of a large project over HTTP, since for each file, a separate HTTP connection is usually established.
Another way is to connect to the repository is via SSH. This way, you can both get and put repositories from/to remote host. In order not to create a separate SSH connection for each file (which takes much more time than HTTP, and also requires a password), the Darcs team implemented a special Darcs command, transfer-mode. Darcs, invoked with this command, repeatedly takes the file names to get and returns their content without dropping the connection. Thus, darks works as a simple file server. Though this eliminates the problem of redundant connections, this method still not optimal, as darcs may perform better using smarter protocol.
In my project, I will address these issues, creating fast and reliable ways to share Darcs repositories.
Benefits for the community
Darcs is the most popular VCS for most Haskell projects. Because of the Darcs’ being slow over networks, big Haskell projects suffer, including, for example, the Glasgow Haskell Compiler. Currently, it takes several minutes to get GHC’s lazy repository without history, and several hours to get the full repository.
Since Darcs is commonly used in Haskell projects, the significant part of Haskell community has to deal with its drawbacks. With improvements to Darcs I will make, Haskell developers will spend less time on boring things like waiting for repository to fetch, and will spend more time on fun things like coding in Haskell.
Project design
To make Darcs over HTTP faster, number of transferred files should be reduced to a minimum. To do so, I will implement the command “darcs optimize –http”, which will pack lots of small files into several large archives, suitable for network transmission. The first such archive is a pristine cache snapshot, that will make getting lazy repositories faster. The second one is patches archive that stores all the patches up to the last tag (exploiting the fact that patches before tag don’t get reordered). This will speed up getting history of repository.
The next improvement I will make is implementation of a smart server. As a request it will take:
- darcs command client wants to perform,
- command options and arguments,
- context file, defining client’s repository.
Using this information, the server will determine the set of files that must be sent to the client, and will send them in optimal order as a response. Thus, getting or pulling will require only one network round-trip, making it the fastest of all methods of sharing the repository.
Also, smart server could be extended to support put/pull commands from trusted clients. This would save users from need to provide access to a host over SSH. Although its out of scope of project, I’ll consider this in server’s protocol design.
Finally, the last improvement I will make is the implementation of transfer-mode based on smart server. This will give SSH transfers the advantages of being smart I described above. The only difference is that transfer-mode will trust every user, because the transfer already takes place in a safe context.
Deliverables
To summarize, my project will result in debugged and documented implementation of the following commands:
- darcs optimize –http: repository packing for network transmission,
- darcs serve: smart server, sending to the client only files that it need,
- darcs transfer-mode: variation of the smart server using stdin/stdout for I/O.
Timeline
weeks 1-3: implementation of repository packing for HTTP transfers. Actually I think that 2 weeks would be enough, but I allocate additional time to solve unforeseen problems that I may encounter in early development stage. If all goes smoothly, the remaining time of this period I will spent on debugging and polishing the code.
weeks 4-9 : implementation of a smart server. Much work has to be done here, so I allocate for this task the half of the entire program duration. By the time of mid-term evaluations the server will support all the basic commands and options.
week 10: implementation of transfer-mode on top of smart server.
weeks 11-12: no new features here; this period is dedicated to working with code written so far, including: debugging, testing, documenting and polishing. (These activities will take place during other periods too, along with code writing.) By the end of this period I will make the code ready for use in the real world.
About me
I am a 22 year old student from town of Makeevka, Ukraine. I’ve learned about Haskell 3 years ago, and was excited about its power. It made me functional programming fan, and since then I became interested in theory behind it. For most of my programming tasks Haskell is my language of choice.
I’m using Darcs as VCS, and it too has powerful theory behind it, the theory of patches. Unfortunately, currently there are issues with its implementation that restrict its usage in real world. To help with situation, I’ve started with writing some patches that fix minor issues, and I see this summer as a great opportunity to make a big change that will make life of Darcs users easier. Darcs has a great theory, and misses a great implementation to support it. Darcs has the potential of being the greatest VCS ever, and thinking of that potential inspires me to put my best effort into this project.