Basic Concepts
This document introduces Helix, a secure, scalable, and highly available version control system that supports parallel development. You should read this document before you start working with Helix.
This document:
- introduces the basic concepts and tasks of version control
- explains how you can configure Helix to improve performance, to scale the system, and to provide automated failover
- suggests some ways you can extend and customize Helix
- explains how you can use Helix with other products to get additional functionality
- discusses use cases for Helix
- introduces the many support resources that can help you use Helix.
If you are familiar with version control systems, you can skip the first section and begin by reading “Helix as a version control implementation.”
The basics of version control
When you work alone on a document, the latest is usually the greatest: you successively open the document, make changes, and save the document. Each time you save, you overwrite the existing copy. The situation is different when you are working with a large, globally distributed team on a project consisting of hundreds or even thousands of files. In this case, it is important to track authorship and changes, and to resolve conflicts when several users make contending changes to the same file. Version control systems allow you to do this; you can track and manage changes to any large collection of digital assets: documents, source code, web sites, audio files, and so on.
Version control systems address these issues using a variety of techniques; one of the most basic is versioning. Rather than overwriting earlier versions of a file when it is saved, under version control, each saved copy of the file is versioned and assigned a number or letter that reflects the order in which it was saved.
In addition to identifying a file version within a sequence of versions, a version control system automatically associates certain information with each version: it records who made the change, when the change was made, and why the change was made. Taken together this information provides an audit trail that you can always consult to understand how a project developed and when specific changes were made. Because no version of a file is overwritten, when bugs arise, it is possible to look back in time to identify the point at which the bug was introduced. This can be critical in fixing bugs that cannot be reproduced. Equally, looking at file history and understanding why certain decisions were made can help project participants stay on track or find appropriate options for future directions.
Sharing data under version control requires a certain amount of gatekeeping to determine who can access the data and how conflicts are resolved when two users make changes to the same file. We have explained so far about editing and saving files; but to support this gatekeeping function, version control systems introduce the additional step of checking out a file and checking in or submitting a file.
The basic version control workflow looks like this:
- Assets under version control are placed in a specified repository.
- Assets are associated with specific permissions that users must have in order to read or modify them.
- A user checks out a working copy of an asset and makes changes.
- Another user checks out a working copy of the same asset and makes changes.
- The first user saves changes to their working copy and checks in that copy.
- The second user saves changes to their working copy and checks in that copy.
- The version control system detects the fact that the same asset was changed in parallel, and it asks the second user to merge changes with those of the first user before the second user’s changes can be checked in. The work of comparing and merging changes is called resolving.
In this way, the version control system makes sure that changes are predictable, manageable, and auditable.
Version control systems are traditionally either centralized or distributed:
- Centralized version control systems use a single repository from which users check out one or more files to work on locally.
- Distributed version control systems allow users to host repositories locally, check out entire repositories with all history, work independently of one another, and combine their work through merging when necessary.
Helix supports either model, as well as a hybrid of the two.
Version control systems can be used as stand-alone applications or they can be incorporated into development or authoring tools as a means of managing the assets produced by these tools.
Helix as a version control implementation
Helix uses a client-server architecture to implement version control management.
- The Helix server (also known as the Helix Versioning Engine or
p4d
) manages shared file repositories, or depots, that contain every revision of every file under version management. Files are organized into directory trees. The server also maintains a database to track data associated with files and client activity: logs, user permissions, metadata, configuration values, and so on. -
Helix clients provide an interface that allows you to check files in and out of the depot, resolve conflicts, detect threats, track change requests, and more.
Helix clients include a command-line client, a graphical user interface client, and various plugins that work with commercial IDEs and productivity software. A Helix server can provide services to a mix of Helix clients.
You also use Helix clients to manage a special area of your workstation called a workspace. Directories in the depot are mapped to directories in your workspace, which contain local copies of managed files. You always work on managed files in your workspace:
- You check the files out of the depot (and into your workspace).
- You modify the files.
- You check them back into the depot.
- If the changes you try to submit conflict with changes that other users, working in parallel with you, have already submitted, you must resolve conflicts as needed.
The next figure shows the mapping between depot files (shown on the left) and workspace files (shown on the right). Until files are checked out from the depot, they remain as read-only in the workspace. To have Helix update your workspace so that it reflects current work on the depot, synchronize your workspace to the depot by getting the latest revision of the files.
We have explained about checking files in and out of the depot, suggesting that single files may be checked in and out. In fact, the means we use to check files in and out of the depot is the changelist. A changelist must contain at least one file and may contain tens of thousands. A changelist is numbered and allows you to track all changes with respect to the contents of the depot: file modifications, the addition of a file, or the deletion of a file.
The changelist is the simplest way to organize your work. A changelists also represents the atomic unit of work in Helix: if a changelist includes several files, changes for all the files are committed to the depot or none of the changes are. For example, if a network connection between the client and the server fails during changelist submission, the entire submit fails.
Multiple user access to a set of files
Version control systems must address the fundamental need for multiple users to work on the same project simultaneously. Helix offers two ways to do this:
- File locking: Helix locks a file while someone is working in it. This means that no one else can alter the file until the work is done and the file is unlocked.
- Branching and merging: By branching streams and then merging them later, multiple users can work on the same files simultaneously. See Balancing stability and innovation: the mainline model.
Balancing stability and innovation: the mainline model
So far, we have explained how version control systems handle the problem of different users working on the same file at the same time: when the different copies of the file are checked in, all but the first user must resolve changes by deciding which changes will be preserved in the latest version of the file.
This is the simplest use case for parallel development. More complicated cases arise when large development projects require many people to work in parallel both because they must support multiple releases and because they involve multiple functional teams working together to create the desired product.
For example, a game development company depends on the combined efforts of artists, musicians, programmers, testers, and build engineers to create a release. One way to organize this effort is to split the set of files that make up the project into multiple parallel branches, allowing development and testing to occur along each branch. Integration then occurs across branches to promote everyone’s work into a releasable product.
Consider the following cases:
- Extremely short release cycles, such as occur with fast changing web content, require overlapping cycles of testing and release.
To handle this case, we move code through the branches shown above. Development occurs along the development line. When a milestone is reached, code is copied up to the QA line where it is thoroughly tested. After it clears all tests, it is copied up to the Beta test line where it is subjected to real-world use. Having satisfied Beta users, it can then be copied up to a release line. Note the purple dashed lines, which indicate the flow of code as it is copied up through the branch hierarchy.
-
Unforeseen delays cause some features under development to miss deadlines. Not wanting to imperil the project as a whole, the project is released while development continues for the laggard components.
In the model shown above, projects X, Y, and Z have been branched off from the development mainline and worked on independently. When work has completed and projects X and Y have passed development tests, they are copied up to the mainline. Project Z however could not be completed and work continues on that project without putting the larger release at risk.
What does branching have to offer? It allows us to balance the need for stability with the need for innovation. On the one hand, we have release branches that hold the most tested and stable code; on the other hand, we have development branches that allow for experimentation and exploration without putting the release at risk.
Branches can be organized in a variety of ways: you can create branches for different platforms, you can create branches along organizational lines, and so on. One common model used for product development is the Mainline model shown below:
The Mainline forms the trunk from which release branches and development branches are created. Each branch normally contains a subset of the Mainline files. Release branches might contain fewer files because files needed for testing are excluded; development branches might contain a different subset of files because the projects they represent focus on discrete product features. In the Mainline model, the “up” direction indicates increased stability or confidence.
When you create branches, you are free to integrate changes in any direction you like. Unfortunately, this can lead to big problems if you inadvertently integrate untested changes into an otherwise stable branch. For this reason, in addition to defining branches that isolate changes, the Mainline model is most useful when it can implement some protocols that limit what changes can be made and in what direction.
Streams
Helix streams implement the Mainline model, adding intelligence that determines what changes can be made and in what order they must be made.
Let’s look at the Mainline example again and add some information to indicate flow of change:
- Project Y has been branched from Mainline; work and testing continues until it is complete. It is then ready to be copied back up to the mainline. However, while development has taken place in Y, Mainline has continued to change. Before we can copy the contents up to the Mainline, we need to make sure that Project Y files reflect changes that have been made in Mainline; we merge those changes into Project Y before we copy Project Y files to Mainline.
- A bug is found in Release 1.x. The bug is fixed and tested. We now want the bug fix in Mainline, so we merge files from Release 1.x down to Mainline. We do not copy anything up because Release 1.x should not include any features added after it has been branched.
The Mainline model arranges branches in terms of stability: the most stable branches are at the top; the least stable branches are at the bottom. The flow of change needs to support this model by merging changes down and copying up.
Typically, when you work with streams, you define and populate the mainline first. You then create development streams and release streams as children of the Mainline stream. The type of a stream and its relationship to other streams determines what sort of changes can be made and in what order they are made.
Rather than using a timeline, the streams GUI—found in the Perforce Visual Client (P4V)--represents related streams as shown below:
The children of Main are shown both above and below Main. Release type streams are at the top; development and task streams are at the bottom. Stability grows as streams near the top of the diagram. The direction and color of arrows linking streams indicate both the direction of flow and the order of flow.
When you create a stream, you specify its type, its relationship to other streams, and how files are to be treated for merging and branching. The information you provide is then used by the streams application to encourage good behavior.
Streams provide visual clues for where and how to branch and merge. They guide behavior that supports stability and innovation. Using streams eliminates much of the work needed to define branches, to create workspaces, and to manage integrations.
An additional advantage of using streams is that when you switch from one stream to another, the contents of your workspace are updated automatically to reflect the contents of the current stream.
Streams automate branching, but you do not have to use them. You can create your own branches and manage them as you see fit. Custom branching gives you finer grained control but you lose the convenience of built-in flow control and workspace updating.
For information on streams, see the P4 User Guide.
Organizing your work: jobs and labels
In addition to using changelists and streams to organize your work, you can use two other methods: jobs and labels.
-
Jobs provide lightweight issue tracking that integrates well with third party defect tracking and workflow systems. They allow you to track the status of a bug or an enhancement request. Jobs have a status and a creator and are associated with changelists that implement the bug fix or the enhancement. An administrator can customize the type of information tracked by jobs add more fine grained status values, or define additional fields for information to be tracked: which customer the enhancement is for; what was done to test the fix, and so on.
You can integrate the jobs function with third-party defect tracking and workflow systems. For more information, see the Defect Tracking Gateway page.
- Labels are sets of tagged file revisions that allow you to handle a heterogeneous group of files as one unit. While a changelist refers only to the contents of a given set of files at the time they were submitted, a label can refer to a group of file revisions from different points in time. You might want to use labels to define the group of files contained in a particular release, to sync a set of files, to populate a workspace, or to specify a set of file revisions to be branched. You can also use a label as an alias for a changelist number, which makes it easier to remember the changelist and easier to refer to it in issuing commands.
For information about jobs and labels from a user’s perspective see the P4 User Guide.
For information about managing jobs and labels, see the Perforce Server Administrator’s Guide: Fundamentals.
Working together and working apart: centralized and distributed development
We mentioned earlier that version control systems can implement either a centralized model or a distributed model:
- Centralized version control systems use a single repository from which users check out one or more files to work on in their local directories.
- Distributed version control systems allow users to host repositories locally, check out entire repositories with history, work independently of one another, and combine their work through merging when necessary.
Helix supports either model, as well as a hybrid of the two.
In the centralized model, clients work with a depot on a shared server. A mapping of files from the depot to their workspace determines which files they are able to work with in their workspace. The following figure illustrates the centralized model:
Users check files out of the same depot, work on them, and check in their changed files. If multiple users work on the same file, they use merging and conflict resolution to make sure the resulting version is satisfactory to all authors. Although users can disconnect from the shared server and continue to work on the files in their workspace, some manual work is required to sync back to the server and to check in files when the user reconnects. For information on working with Helix using this model, see the P4 User Guide.
In the distributed model, users work with a depot on personal servers that are then connected to a shared server. The depot on their personal server might contain a subset or the entire set of files on the shared server. Each user can work disconnected from the shared shared server but still be able to access all the files in their workspace and place these under version control using their personal server. Each user can access the entire history of a file locally, can rewrite and revise history, and can manage the files and streams on their local machine without interacting with the shared server.
The following figure illustrates this model:
When users decide to share their code or digital assets with other users, they connect to and then push their content to the shared server. This allows other users to fetch that content from the shared server and work with it on their own personal server. Users might need to merge content before pushing if their changes conflict with changes made by others.
In addition to supporting these two models, Helix also allows for a hybrid architecture in which some users connect directly to the shared server while others connect to personal servers that are connected to the shared server.
For more information about distributed development and file management, see Using Helix for Distributed Versioning.
Collaborating within a Git Ecosystem
Git users who want to take advantage of Helix enterprise-level features can work seamlessly with Helix Versioning Engines using Helix Git Fusion.
Helix Git Fusion is a Git remote repository service that uses the Helix Versioning Engine as its backend. Git users interact with Git Fusion as they do with any other Git remote respository (repo), issuing standard Git commands to clone repos and transfer data. When a Git user pushes changes, Git Fusion translates those Git changes into Helix changes and submits them to the Helix depot. When a Git user pulls changes, Git Fusion translates the pull request into p4d commands to download changes from the Helix depot.
Performance, scaling, and high availability
Version control systems are key to managing large projects: with Helix, “large” can be large indeed. With enterprise-level features that you can use to fine tune and improve performance, Helix lets you scale your system to accommodate a global workforce, and to automate failover for a highly available system. For example, Helix can accommodate the needs of a gaming development company whose files might take up hundreds of terabytes or even petabytes of data; or it can support the work of a software company, whose activity level includes massive automated testing as well as focused, analytic bug fixing and tracking work.
To support these tasks, Helix uses the following additional server types:
- Proxies are used where bandwidth to remote sites is limited; they mediate between remote clients and the versioning service. By caching frequently used files, the proxy reduces demand on the server and keeps network traffic to a minimum.
- Brokers mediate between clients and servers to implement policies that solve routing or security problems. For an example of how brokers can be used, see Data center scaling and high availability. The cluster router is a broker used to filter and load-balance incoming client requests.
- Replicas duplicate server data. They can be used to provide a warm standby server or to reduce load on a primary server.
The following sections explain how these servers are used singly or how they are combined to provide enterprise-level performance. For complete information about using proxies, brokers, and replicas, see Perforce Server Administrator’s Guide: Multi-site Deployment.
Using proxies to improve performance
To improve performance for users accessing a shared Helix repository across a WAN, you can configure a proxy on the side of the network close to the users and configure the users to access the service through the proxy; then configure the proxy to access the master Helix service.
The following diagram illustrates a typical proxy configuration:
In this configuration, file revisions requested by users at a remote development
site are fetched first from a shared server (p4d
running on Central) and
transferred over a relatively slow WAN. Subsequent requests for that same
revision, however, are delivered from the Helix Proxy, (p4p
running on
Outpost), over the remote development site’s LAN; this architecture reduces both
network traffic across the WAN and CPU load on the shared server.
Using a replica for disaster recovery
Replication is the duplication of server data from one Helix server to another. The replicated server is called the master server; its replica is called a replica server. You can use replication to provide a warm standby server, or to reduce load and downtime on a primary server when performing builds. The following figure shows how you set up a replica to provide a warm standby to aid in disaster recovery.
When you create the replica, you specify which server it should get its data from. The replica then periodically updates itself by copying files and metadata from the master. If the master fails, all you need do is reconfigure the replica to be the new master and then reconnect clients to communicate with it.
Edge servers and workspace servers, described in the following sections, are special examples of replica servers.
Commit-edge architecture
This architecture supports optimal performance for geographically distributed work groups. At a minimum it is made up of the following kinds of servers:
- A commit server that stores archives and metadata.
- An edge server that contains a replicated copy of the commit server data and a unique, local copy of some workspace and work-in-progress information. This server can handle read-only operations and operations that only write to the local data. You can connect multiple edge servers to a commit server as shown in the next figure.
Since an edge server can handle most routine operations locally, the edge-commit architecture offloads a significant amount of processing work from the commit server and reduces data transmission between commit and edge servers. This greatly improves performance.
For more information about these options, see Perforce Server Administrator’s Guide: Multi-site Deployment.
Data center scaling and high availability
Cluster architecture offers superior performance and automated failover.
A Helix cluster includes the following elements:
- A depot master that records changes to revision data.
- A depot standby that is automatically synchronized to the depot master and that assumes the role of the depot master if the depot master fails.
- A fault tolerant shared file system that is accessible to all cluster nodes.
- Optional workspace servers that manage client data and help distribute the load of clients.
- A router that dispatches user requests to the workspace servers or directly to the depot master if no workspace servers are used.
A Helix cluster implements a highly available, horizontally scalable commit-edge architecture.
A server cluster uses an Apache ZooKeeper service to automate the process of
failure detection and of role migration from the failing depot master to the
depot standby. The fault-tolerant ZooKeeper service maintains cluster state
information and monitors the health of the cluster nodes. The cluster uses the
p4zk
process, which supports each server in the cluster, to register the
existence of the server and to respond to cluster state changes.
Server cluster topology is shown in the following figure:
For more information about Helix clusters, Perforce Server Administrator Guide: Cluster Management.
Securing the system
Independent of the Helix architecture you use, secure communication is guaranteed both with respect to communication between clients and servers as well as communication between servers.
- User authentications can be done using passwords or tickets, and the strength of the password can be defined by an administrator. Users can be authenticated against an Active Directory or LDAP server, or against an internal Helix user database.
- Communication between clients and servers can be secured using the SSL protocol, which you specify when connecting to the server.
- Communication between servers in a distributed environment can be secured using a trust file and by setting permissions for the service users that own the different servers in the environment.
In addition to user authentication, digital assets are further secured by a protection scheme to prevent unauthorized access. Protections determine which Helix commands can be run, on which files, by whom, and from which host. This scheme provides the finest grained control possible.
Finally, Perforce offers Helix Threat Detection, which quickly and accurately identifies internal and external security risks to intellectual property managed in Helix. See “Clients, Collaboration, and Security” for details.
For more information on security see the Perforce Server Administrator’s Guide: Fundamentals and the Perforce Server Administrator’s Guide: Multi-site Deployment.