Sunday, April 10, 2011

Dependency analysis using prolog


According to Wikipedia (emphasis is mine):

Prolog is a general purpose logic programming language associated with artificial intelligence and computational linguistics.

Prolog has its roots in first-order logic, a formal logic, and unlike many other programming languages, Prolog is declarative: The program logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations.

It sounds ideal for our purpose: we can express dependencies between files as relations, and examine these relations using queries. If you hear the word query and automatically think "relational database and SQL" I will have to disappoint you. Prolog is much more powerful than SQL. Contrary to SQL, prolog can be used a general purpose language. Prolog especially shines when it comes to recursive queries, something that SQL is not exactly famous for. Dependencies between include files are recursive in nature, and prolog will be a natural fit.

Representing facts and rules in prolog

To represent the fact that John is the father of Pete, in prolog one would write:
father(john, pete).

This single fact already forms a database which we can run a few queries on. For example, we could ask the system who is the father of Pete:
father(X, pete).

Note that facts are represented in lowercase, whereas queries are defined by using capitals. By asking father(X, pete), we ask prolog: what value should X have so that the relation father(X, pete) is true according to the information in our database.

As we expect, prolog answers:
X = john.

Now suppose that we extend our database with another fact, namely that Pete is also the father of Mary.
father(pete, john).
father(pete, mary).

We can ask prolog who are the children of pete using the query:
father(pete, X).

This query has two possible answers: according to the database both john and mary are children of pete. Prolog therefore will return the first possible answer, and then ask you what to do.
If you press the ";" key, prolog will go on a search another answer. If you press the ".", prolog stops searching.
X = john ;
X = mary

So far we've only represented and queried facts. This could still just as well be done in SQL.
We turn to the more interesting question about finding siblings. Our database doesn't directly represent facts about siblings, but we can add knowledge by defining rules. E.g. we could add the rule that X and Y are siblings provided that they both have the same father. Our prolog program becomes:
father(pete, john).
father(pete, mary).
sibling(X, Y) :- father(Z, X), father(Z, Y).

You can read the last line as: X is a sibling of Y provided that there's a Z that is the father of X, and Z is also the father of Y. We could use the following query to ask prolog who are the siblings of mary:
sibling(mary, X).

And prolog searches its database to return the following answers:
X = john ;
X = mary

It may come as a surprise that Mary is considered a sibling of Mary, but that's exactly what we taught our system: X and Y are siblings if they have the same parent. If you substitute Mary for X and Mary for Y, then both X and Y have the same parent, so according to our rule, they are siblings. We could extend the rule, to demand that X and Y cannot be equal:
father(pete, john).
father(pete, mary).
sibling(X, Y) :- father(Z, X), father(Z, Y), X \= Y.

If we now ask who's the sibling of Mary
sibling(mary, X).

We get only John as an answer
X = john

This is, in a nutshell, how prolog can be taught to reason about facts. Of course here you only see the tip of the iceberg that are the possibilities of prolog. To find out more, you should consider reading a book.

From siblings to dependencies

Now that we understand the basics of prolog, we can define some relations that will come in handy for our dependency analysis. We could e.g. represent the fact that a file XXX.cpp includes a file XXX.h as follows:
includes('XXX.cpp', 'XXX.h').

Note that normally words with capitals are considered Variables that are to be filled in by prolog, but if you want to use capitalized words in facts, you can put them in between apostrophes (').

For users who are not familiar with prolog, interpreting a relation like "includes('XXX.cpp', 'XXX.h')" is not very intuitive. Wouldn't it be nice if we could write instead:
'XXX.cpp' includes 'XXX.h'.

Thanks to user defined operators in prolog this becomes possible. Since this post isn't meant to be a prolog tutorial, without further delay here's some code to accomplish this. Hardcore prolog users will not always appreciate user defined operators, since they introduce a domain specific language which may make the code harder to read. For the uninitiated, I think the DSL we introduce here will make prolog look less daunting.
:- op(970, xfx, includes).

After defining the user defined operator "includes" we can now write:
'src/XXX.cpp' includes 'src/XXX.h'.
'src/XXX.h' includes 'lib/YYY.h'.
to express the fact that file 'src/XXX.cpp' includes 'src/XXX.h', and to express that 'src/XXX.h' includes 'lib/YYY.h'. Who said that prolog was difficult and mind-stretching? ;-) (Trust me, it can be, just not yet at this level.)

Knowledge representation for dependency analysis

So we already understand that we can write facts and rules to reason about the facts. Now the question arises what facts do we need in order to do meaningful include file analysis? We will make a distinction between cppfiles and include files. In a typical programming language, you might be tempted to examine the file name to decide whether or not you are dealing with a cppfile or a headerfile. In prolog this is not impossible, but awkward. Better is to directly represent such knowledge using facts. We define two user defined operators as follows:
:- op(990, fy, headerfile).
:- op(990, fy, cppfile).
Those operators allow us to represent that 'XXX.cpp' is a cppfile and 'YYY.h' is a headerfile as follows:
cppfile 'src/XXX.cpp'.
headerfile 'src/XXX.h'.
headerfile 'lib/YYY.h'.

You may be afraid that by representing such facts explicitly in your prolog database, the prolog database will grow to monstrous sizes. Don't worry! The prolog engine is incredibly fast, parsing hundreds of thousands of lines of prolog code in a few seconds. (Who said prolog was slow? ;-) Trust me, it can be, if you work with an inadequate knowledge representation, or if you write bad or unoptimized queries.)

Note also that we used the name 'src/XXX.cpp' instead of 'XXX.cpp'. The idea is to use enough path prefix so that there can be no duplicate file names.

What other facts will we use? We need a concept of project, to check for hierarchy violations. With the following definitions
:- op(990, fy, project).
:- op(990, fy, to).
:- op(990, xfy, belongs).
we can write facts like:
'src/XXX.cpp' belongs to project 'src'.
'src/XXX.h' belongs to project 'src'.
'lib/YYY.h' belongs to project 'lib'.

We want to express constraints between projects:
:- op(800, fy, from).
:- op(800, fy, include).
:- op(800, xfy, cannot).
'src' cannot include from 'lib'.

Finally, we can also represent parsing cost (number of statements) for each file:
:- op(800, xfx, costs).
'src/XXX.cpp' costs 123.
'src/XXX.h' costs 12.
'lib/YYY.h' costs 23.

We will define some rules to reason about these facts. An interesting rule is the "path" rule, which calculates a Path from file A to file B. If one or more Paths exist, they represent how file A eventually includes file B. So suppose that 'src/XXX.cpp' includes 'src/XXX.h' and 'src/XXX.h' in turn includes 'lib/YYY.h', then Path('src/XXX.cpp', 'lib/YYY.h', P) would return a P = ['src/XXX.cpp', 'src/XXX.h', 'lib/YYY.h']. Often there will be multiple paths from one file to another file, and prolog will find all of them for you. There's nothing wrong with you if you don't understand how it works without reading a prolog book. You also don't need to understand it in depth to understand what follows.
path(A,B,Path) :-
travel(A,B,P,[B|P]) :-
    A includes B.
travel(A,B,Visited,Path) :-
    A includes C,
    C \== B,

Using our path rule we can now define less mind melting queries, e.g. we define a rule that can tell us if file A depends on file B:
:- op(650, fy, on).
:- op(650, xfy, depends).
:- op(670, yfx, via).

X depends on Y :- Path(X, Y, _).
Here we say: file X depends on file Y if there's a path _ from X to Y. The underscore denotes that we are not really interested in the actual Path.
X depends on Y via P :- Path(X, Y, P).
With the above rule we can write a query like:
'src/XXX.cpp' depends on 'lib/YYY.h' via P.
and prolog will answer:
P = ['src/XXX.cpp', 'src/XXX.h', 'lib/YYY.h'].

Some useful queries

You may not realize it, but by now (even if we haven't fully developed everything we will eventually use) we already have a lot of power at our finger tips.

Find all header files

headerfile X.

Find all cpp files

cppfile X.

Find out which files are directly included by file 'src/XXX.h'

'src/XXX.h' includes F.

Find out which files directly include 'lib/YYY.h'

F includes 'src/YYY.h'.

Find out which files recursively are included by 'src/XXX.cpp'

'src/XXX.cpp' depends on F.

Find out which files eventually include 'lib/YYY.h' (recursively)

F depends on 'lib/YYY.h'.

Find out if someone includes cpp files

cppfile F, F2 includes F.

Find out if there are header files that are included by no one

(note: \+ is the prolog way of writing "not" )
headerfile F, \+(F2 includes F).

Some more rules

We add a rule to detect include file violations, taking into account constraints between projects:
violations(Project, X, Y) :-
    X includes Y, 
    X belongs to project Project,
    Y belongs to project Py,
    Project cannot include from Py.

Now we can ask to get all include file violations:
violations(P, X, Y).

If you're not convinced about the power of prolog for dependency analysis, you'd better stop reading here.

The next question that arises is: how will we extract all those facts from our actual source code? Are we supposed to manually enter all information in a prolog database?
The answer, of course, is no. In a next post, we will introduce a python program that extracts all these facts from our source code and dumps them into a prolog database similar to the one we developed here. I used python instead of prolog to extract those facts from c/cpp files because I'm much more familiar with python. Prolog is great for reasoning about facts, but text processing probably is better left to a language like python.

Once everything is ready, I'll make the full code available for download.

No comments:

Post a Comment