
Wednesday, June 22, 2011

Sorting included files by importance


In this context an included file is considered more important if more files depend on it.

Attempt 1: straightforward but slow approach


This first attempt assumes that only header files are included, which is not always true. This assumption is made mainly to halve the time required to get a result. The main predicate is important_headers(H). It starts by collecting all header files in a list R. It then calls annotate_headers_with_importance on each header in the list to find out how many times that file is included. To find out how many times a header file is included, it finds all files that depend on that header file and counts how many there are.

Finding all files F that depend on a given file is an expensive query, and it has to run once for each header file, leading to a slow program. On the code base pycdep is normally run on (around 19000 include dependencies), generating the list of important header files with this program takes almost an hour, which is clearly unacceptable.
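A minimal sketch of this first approach is shown below. The fact names header_file/1 and depends_on/2 are placeholders for whatever pycdep actually generates (depends_on(F, H) meaning that file F depends on file H), and the Count-Header pair format is an assumption.

% Sketch of attempt 1: count, for every header, the files that depend on it.
:- dynamic header_file/1, depends_on/2.

% important_headers(-Sorted): Count-Header pairs, most included headers first.
important_headers(Sorted) :-
    findall(H, header_file(H), Headers),
    annotate_headers_with_importance(Headers, Annotated),
    keysort(Annotated, Ascending),
    reverse(Ascending, Sorted).

% Pair each header with the number of files that depend on it.
annotate_headers_with_importance([], []).
annotate_headers_with_importance([H|Hs], [Count-H|Rest]) :-
    % The expensive part: for every single header the whole dependency
    % relation is searched for files that depend on it.
    findall(F, depends_on(F, H), Dependents),
    length(Dependents, Count),
    annotate_headers_with_importance(Hs, Rest).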



Attempt 2: a faster approach


Considering attempt 1 to be somewhat of a failure because of its slowness, I thought about a different way to generate the list of included files. Luckily, I remembered that SWI-Prolog has an association list library (based on AVL trees), and I realized I could use it to maintain a set of counters: one for each file that other files depend upon.

The algorithm I implemented works as follows:

First find a list List of all files. For each file in List, find the list R of files it depends on (this is the reversed and much cheaper query compared to the one used in the previous approach). Then, for each file in R, increase that file's counter. After completion, the counters indicate how often each file is depended upon. Whether this can be considered good Prolog style I don't really know, but it reduces the time needed to get the desired result from around 50 minutes to around 6 seconds.

Note that in this approach, files that are not depended upon do not appear in the end result; in the previous approach, those files would be listed with an importance of 0. A second difference is that this approach also takes included .cpp files into account, whereas the previous approach assumed that only header files would be included (mostly to halve the time required to complete the query).

The swap_elements predicate is used only to get the results into the same format as the one produced by the first approach. This is needed because pycdep has a predicate that generates a report from the results and expects the data in that format.
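A minimal sketch of this second approach, using library(assoc), is shown below. As before, file/1 and depends_on/2 are placeholders for pycdep's generated facts, and the Count-File pair format is an assumption.

% Sketch of attempt 2: one counter per depended-upon file, kept in an AVL tree.
:- use_module(library(assoc)).
:- dynamic file/1, depends_on/2.

important_files(Sorted) :-
    findall(F, file(F), List),
    empty_assoc(Empty),
    count_dependencies(List, Empty, Counters),
    assoc_to_list(Counters, Pairs),      % File-Count pairs
    swap_elements(Pairs, Swapped),       % Count-File, the format of attempt 1
    keysort(Swapped, Ascending),
    reverse(Ascending, Sorted).

% For each file, run the cheap "reversed" query (which files R does it
% depend on?) and bump the counter of every file in R.
count_dependencies([], Counters, Counters).
count_dependencies([F|Fs], Counters0, Counters) :-
    findall(D, depends_on(F, D), R),
    increment_all(R, Counters0, Counters1),
    count_dependencies(Fs, Counters1, Counters).

increment_all([], Counters, Counters).
increment_all([F|Fs], Counters0, Counters) :-
    (   get_assoc(F, Counters0, N0) -> N is N0 + 1 ; N = 1 ),
    put_assoc(F, Counters0, N, Counters1),
    increment_all(Fs, Counters1, Counters).

% Only needed to match the report format of the first approach.
swap_elements([], []).
swap_elements([A-B|T], [B-A|T2]) :- swap_elements(T, T2).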

Thursday, May 19, 2011

Double inclusion and circular dependencies


If you thought we were out of neat tricks that can be performed with pycdep, here are some more :)

Detecting header files that are included twice (or more)


Here's a simple Prolog program to keep only those elements in a list that occur more than once.
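A sketch of such a predicate (the name duplicates/2 is a placeholder and need not match the one used in pycdep):

% duplicates(+List, -Dups): keep only those elements of List that occur
% more than once, each listed once, in order of first appearance.
duplicates([], []).
duplicates([H|T], [H|Dups]) :-
    memberchk(H, T), !,          % H occurs again further on: it is a duplicate
    exclude(==(H), T, T1),       % drop the remaining occurrences of H
    duplicates(T1, Dups).
duplicates([_|T], Dups) :-
    duplicates(T, Dups).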



You could use it in a query as follows:
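For example, with a literal list (in pycdep you would of course run it on the list of files included by some translation unit):

?- duplicates(['foo.h', 'bar.h', 'foo.h', 'baz.h'], D).
D = ['foo.h'].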


Detecting circular dependencies


Circular dependencies are a code maintainer's nightmare. They can cause all kinds of subtle and less subtle problems (if you manage to get the code to compile at all).

To get rid of circular dependencies one can do several things, depending on the situation at hand. Sometimes it suffices to move shared data structures into a file of their own, which both of the circularly dependent files can then include. Another technique is to avoid #includes and use forward declarations as much as possible.

Before you can solve circular dependencies, you have to be aware of them. Especially when the cycles are big (a includes b, which includes c, which includes d, which includes a), it may not be obvious at first sight that there is a circular dependency. Luckily, pycdep is able to tell us about circular dependencies. More precisely, we will calculate the dependency graph's strongly connected components (SCCs). A strongly connected component is a subgraph in which each node is reachable from every other node; a component with more than one node therefore necessarily contains cycles. Note that an SCC can encompass multiple elementary cycles. In the code shown below, no attempt is made to further decompose the SCCs into elementary cycles.

The algorithm shown here is not very efficient (Tarjan's SCC detection algorithm would be more suitable, but it is not as easy to write and understand as the one shown here; maybe one day... :-) ). I have used it successfully on a code base of 2800+ files with around 19000 include dependencies, finding all cycles in a few minutes on a reasonably modern PC.
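A sketch of the approach is shown below. The fact includes/2 (a direct include edge between two files) is a placeholder for pycdep's generated fact base; the predicates scc/2, all_scc/1 and all_scc_helper/2 are explained below the code.

% Sketch of the SCC detection. includes(X, Y) means "file X directly
% includes file Y".
:- dynamic includes/2.

% A node is any file occurring in the include graph.
node(F) :- includes(F, _).
node(F) :- includes(_, F).

% depends_on(X, Y): Y is reachable from X via one or more includes.
depends_on(X, Y) :- path(X, Y, [X]).

path(X, Y, _) :- includes(X, Y).
path(X, Y, Visited) :-
    includes(X, Z),
    \+ memberchk(Z, Visited),     % avoid running around cycles forever
    path(Z, Y, [Z|Visited]).

% scc(+X, -Result): all nodes Y such that there is a path from X to Y
% and a path from Y back to X, i.e. the SCC that X belongs to.
scc(X, Result) :-
    setof(N, node(N), Nodes),
    include(same_scc(X), Nodes, Others),
    sort([X|Others], Result).

same_scc(X, Y) :- depends_on(X, Y), depends_on(Y, X).

% all_scc(-R): R is the list of all SCCs in the include graph.
all_scc(R) :-
    setof(N, node(N), Eligible),
    all_scc_helper(Eligible, R).

% Take a node, compute its SCC, remove those nodes from the eligible
% list and continue with the remaining nodes.
all_scc_helper([], []).
all_scc_helper([N|Ns], [Component|Rest]) :-
    scc(N, Component),
    subtract(Ns, Component, Remaining),
    all_scc_helper(Remaining, Rest).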



You can use it as follows:
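For example:

?- all_scc(R).

Every component in R that contains more than one file corresponds to a circular dependency; singleton components are files that are not part of any cycle and can simply be ignored.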



The program works by noting that two nodes X and Y belong to the same SCC if there is a path from X to Y, and vice versa. The scc(X, Result) predicate calculates the list Result, which contains all nodes that belong to the same SCC as X. The all_scc(R) predicate returns a list of all the SCCs found in the include file graph. To perform its magic, it uses a helper predicate all_scc_helper that takes a node from the list of eligible nodes and finds all the nodes belonging to the same SCC. The nodes of that SCC are then removed from the list of eligible nodes, and the same steps are repeated on the remaining nodes.

Note that the program can be sped up by memoizing the path calculation in the implementation of "X depends on Y". Memoization is a technique in which a result is calculated once and the outcome is then stored as a new fact in the Prolog database, so the same calculation does not have to be redone multiple times. Using a better algorithm (like the ones by Tarjan, Kosaraju or Gabow) would of course be an even better approach.

Memoization of the path calculation can be achieved by changing the definition of "X depends on Y".
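In terms of the sketch above, that could look roughly like this (only successful results are cached, and depends_on/2 is assumed to be called with both arguments bound, as it is in scc/2):

:- dynamic depends_on_cached/2.

% Memoized "X depends on Y": once a dependency has been proven, it is
% stored as a fact, so later checks become a plain database lookup
% instead of a new path search. This replaces the earlier definition of
% depends_on/2; path/3 stays the same.
depends_on(X, Y) :-
    (   depends_on_cached(X, Y)
    ->  true
    ;   once(path(X, Y, [X])),        % the original, expensive search
        assertz(depends_on_cached(X, Y))
    ).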