Wednesday, June 22, 2011

Sorting included files by importance


In this context an included file is considered more important if more files depend on it.

Attempt 1: straightforward but slow approach


This first attempt assumes that only header files are included, which is not always true. This is done mainly in an attempt to halve the required time to get a result. The main predicate is important_headers(H). It starts by collecting all header files in a list R. Then it calls annotate_headers_with_importance on each header in the list to find out how many times that file is included. In order to find out how many times a header file is included, it finds all files that depend on that header file, and counts how many there are.

Finding out all F that depend on a given file is an expensive query, and it has to run for each header file, leading to a slow program. On the code base pycdep is normally used with (around 19000 dependencies), generating the list of important header files using this program takes almost an hour - clearly unacceptable.



Attempt 2: a faster approach


Considering attempt 1 to be somewhat of a failure due to its slow result, I thought about a different way to go about generating the list of includes. Luckily, I remembered that swi-prolog has an associative list library (based on AVL trees), and I realized I could use it to maintain a list of counters: one for each file that another file depends upon.

The algorithm I implemented works as follows:

First find a list List of all files. For each file in List, find out which files R depend on it (this is a reversed and much cheaper query than the one used in the previous approach). For each file in R, increase a counter. After completion, the counters indicate how often each file is depended upon. Whether or not this can be considered good prolog style, I don't really know, but it increases speed from around 50 minutes to around 6 seconds to get the desired result.

Note that in this approach files that are not depended upon are not mentioned in the end result. In the previous approach, those files would be mentioned with an importance of 0. A second difference is that in this second approach, also .cpp files which are included are taken into account. In the previous approach we assumed that only header files would be included (mostly to halve the time required to complete the query).

The swap_elements predicate is used only to get the results in the same format as was produced by the first approach. This is because in pycdep there's a predicate to generate a report from the results, which expects the data to be in this format.

No comments:

Post a Comment