## DIGDAG, a first algorithm to mine closed frequent embedded sub-DAGs
Systems Pharmacology Research Institute, GNI Ltd

**Abstract**

*Although tree and graph mining have attracted a lot of attention, there are nearly no algorithms devoted to DAG mining, whereas many applications are in dire need of such algorithms. We present in this paper DIGDAG, the first algorithm capable of mining closed frequent embedded sub-DAGs. This algorithm combines efficient closed frequent itemset algorithms with novel techniques in order to scale [...]*

**1. Introduction**

A growing percentage of the data available today is *semi-structured data*, i.e. data that can be represented as a labeled graph. This kind of data can be found in various domains such as bioinformatics, chemistry or computer networks. To analyze this data, there is an important need for specific data mining algorithms, especially for the discovery of frequent patterns. In recent years, the research community has produced numerous algorithms to find frequent patterns in general graphs [5] and in trees [2, 7, 10]. However, there are nearly no algorithms for finding frequent patterns in data structured as DAGs. This is a problem to be explored, as a lot of interesting data exhibit this structure. An important body of this data is the result of algorithms based on Bayesian networks. This is for example the case of most current studies about gene interaction networks [4]. Such data exhibit large structural variations, meaning that ancestor-descendant relationships are more often preserved than parent-child relationships. Hence current DAG mining [1] and graph mining [5, 9] algorithms, which discover *induced patterns* based only on the parent-child relationship, are not adapted to analyze this data. Algorithms discovering *embedded patterns*, based on the ancestor-descendant relationship, are needed for this kind of data.

In this paper, we present DIGDAG, the first algorithm capable of mining closed frequent embedded sub-DAGs (*c.f.e.-DAGs*) in DAG data. The input data are DAGs in which all the labels must be distinct. This assumption of distinct labels has long led to the belief that the solution to this problem was trivial. However, we will show that naive approaches are unable to provide results on real data, which motivates the more elaborate method used in DIGDAG. Experiments will show that DIGDAG significantly improves on a naive approach both in terms of memory consumption and in terms of computation time.

The paper is organized as follows: Section 2 gives the necessary definitions, Section 3 presents the DIGDAG algorithm, Section 4 shows some preliminary experiments, and Section 5 concludes the paper and gives perspectives for future work.

**2. Definitions and data-mining problem**

A **labeled graph** is a tuple G = (V, E, ϕ), where V is the set of vertices, E ⊆ V × V is the set of edges, and ϕ : V → L is a labeling function with L a finite set of labels. For an edge (u, v) ∈ E, u is the **parent** of v and v is the **child** of u. If there is a set of vertices {u₁, ..., uₙ} ⊆ V such that (u₁, u₂) ∈ E, ..., (uₙ₋₁, uₙ) ∈ E, then {u₁, ..., uₙ} is called a **path**, u₁ is an **ancestor** of uₙ and uₙ is a **descendant** of u₁. There is a **cycle** in the graph if a path can be found from a vertex to itself. An edge (u, v) ∈ E of the graph is said to be a **transitive edge** if besides the edge (u, v), there also
exists another path from u to v in G. A **labeled DAG** is a labeled graph without cycles.

Let P₁ = (V₁, E₁, ϕ₁) and P₂ = (V₂, E₂, ϕ₂) be two DAGs. P₁ is an **embedded** sub-DAG of P₂, written P₁ ⊑ P₂, if there exists an injective homomorphism µ : P₁ → P₂ such that: 1) for two vertices u ∈ V₁ and v ∈ V₂ such that v = µ(u), ϕ₂(v) = ϕ₁(u) holds, and 2) for (u, v) ∈ E₁, µ(u) is an ancestor of µ(v) in P₂. In short, the homomorphism must preserve the labels and the ancestor relationship. If in 2) the homomorphism only preserves the parent relationship (i.e. the condition becomes: for (u, v) ∈ E₁, µ(u) is a parent of µ(v) in P₂), then P₁ is an **induced** sub-DAG of P₂. Note that an induced sub-DAG is also an embedded sub-DAG, but not the opposite.

Let 𝒟 = {D₁, ..., Dₙ} be a set of labeled DAGs and ε ≥ 0 be an absolute frequency threshold. A DAG P is a **frequent embedded (*induced*)** sub-DAG of 𝒟 if it is embedded (*induced*) in at least ε DAGs of 𝒟. The set of DAGs of 𝒟 in which P is embedded (*induced*) is called the **tidlist** of P, denoted tidlist(P). The **support** of P is defined by support(P) = |tidlist(P)|.

A frequent embedded (*induced*) sub-DAG P of 𝒟 is **closed** if it is maximal for its support, i.e. if there is no frequent embedded (*induced*) sub-DAG P′ of 𝒟 such that P is embedded (*induced*) into P′ and tidlist(P) = tidlist(P′).

The exact data mining problem at hand is to find closed frequent embedded sub-DAGs (c.f.e.-DAGs) in a collection of DAG data, where for each input DAG all its labels are distinct.

**3. Algorithm**

This section presents the DIGDAG algorithm for discovering c.f.e.-DAGs in DAG data where, for each input DAG, all the labels are distinct. The fact that all the labels are distinct in the input DAGs is convenient because it means that there are no ambiguities, i.e., if two edges (A, B) and (B, C) are found to appear frequently together in the DAGs {D₁, ..., Dₚ}, then the pattern A → B → C is necessarily frequent and appears in {D₁, ..., Dₚ}. This can be further extended, and any c.f.e.-DAG P is uniquely defined by its set of edges E_P.

This means that instead of mining c.f.e.-DAGs directly, the c.f.e.-DAGs are mined by deriving closed frequent sets of edges. Though there are no algorithms to mine c.f.e.-DAGs, many closed frequent itemset mining algorithms, capable of efficiently mining sets of any kind of items [6, 8], can then be used to mine the c.f.e.-DAGs.

As we are interested in finding closed frequent *embedded* sub-DAGs, the ancestor-descendant relations among vertices must be introduced as input data. For each DAG D we already have its set of edges E_D. We define the saturated set E_D⁺ = {(u, v) | u is an ancestor of v in D} containing all the ancestor-descendant edges of a DAG D. Each input DAG D ∈ 𝒟 will be represented by its E_D⁺ set. The set 𝒟⁺ = {E_D⁺ | D ∈ 𝒟} regroups all the saturated versions of the input DAGs.

Now the question is how to correctly use a closed frequent itemset mining algorithm to discover the sets of edges of the c.f.e.-DAGs. Briefly stated, it can be summed up as a two-step process:

1. Using 𝒟⁺ as input, apply a closed frequent itemset mining algorithm to discover closed frequent ancestor-descendant edge sets.
2. Process the closed frequent ancestor-descendant edge sets to discover the c.f.e.-DAGs.

Both of these steps seem quite straightforward and very easy to implement, i.e., step 1 should be the simple application of any well-known closed frequent itemset mining algorithm, and step 2 should only be connecting the edges together to discover the c.f.e.-DAGs. However this simplicity is only apparent, and in practice numerous problems arise, which prevent any “naive” approach from succeeding.

The main problem is that step 1 does not scale up to complex data. We have tried to analyze real data made of 100 DAGs, each having 300 vertices, coming from a bioinformatics problem. The closed frequent itemset algorithm was the FIMI 2003 winner, LCM2 [8]. This algorithm uses a lot of techniques to reduce the size of the dataset in memory. But even on a machine with 8 GB of RAM, the algorithm could not complete due to memory saturation. This is due to two reasons. First, the item space is the space of all the ancestor-descendant edges, which is huge. This is difficult for any closed frequent itemset mining algorithm. But the main reason lies in the very nature of the closed frequent ancestor-descendant edge sets that are searched for. In the ideal case, ancestor-descendant edges are regrouped together because they belong to the same c.f.e.-DAG. But two c.f.e.-DAGs can also appear frequently together in any D ∈ 𝒟; in this case all their ancestor-descendant edges will be regrouped in the same closed frequent ancestor-descendant edge set. So in fact, the closed frequent itemset algorithm does not look for the c.f.e.-DAGs, but for all the *closed frequent sets of c.f.e.-DAGs*. This problem is much more difficult than the original problem, and in complex real data the sheer number of possible c.f.e.-DAG combinations gives too many results for efficient handling in memory.

In the following, we will explain the methods we developed to avoid these problems in the DIGDAG algorithm¹.

¹ NB: We will always refer to the vertices of the DAGs simply by their labels.
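To make the two-step naive approach concrete, the following Python sketch saturates a few toy DAGs and mines closed frequent ancestor-descendant edge sets by brute force. It is only an illustration under stated assumptions: the names `saturate` and `closed_frequent_itemsets` and the toy DAGs are hypothetical, vertices are identified by their distinct labels, and the exhaustive enumeration merely stands in for a real closed itemset miner such as LCM2, whose interface is not reproduced here.

```python
from itertools import combinations


def saturate(edges):
    """Return the saturated set of all ancestor-descendant pairs of a DAG."""
    children = {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)

    memo = {}

    def descendants(u):
        # Memoised depth-first search; it terminates because the graph is acyclic.
        if u not in memo:
            found = set()
            for child in children.get(u, ()):
                found.add(child)
                found |= descendants(child)
            memo[u] = found
        return memo[u]

    return frozenset((u, v) for u in children for v in descendants(u))


def closed_frequent_itemsets(transactions, min_support):
    """Brute-force closed frequent itemsets (toy sizes only).

    Every closed itemset is the intersection of the transactions that
    contain it, so intersecting each group of at least min_support
    transactions enumerates all closed frequent itemsets.
    Returns a dict {itemset: tidlist}.
    """
    closed = {}
    n = len(transactions)
    for size in range(max(min_support, 1), n + 1):
        for group in combinations(range(n), size):
            common = frozenset.intersection(*(transactions[i] for i in group))
            if common:
                tidlist = frozenset(
                    i for i, t in enumerate(transactions) if common <= t
                )
                closed[common] = tidlist
    return closed


# Hypothetical input DAGs with distinct labels (not data from the paper).
dags = [
    {("A", "B"), ("B", "C")},
    {("A", "X"), ("X", "B"), ("B", "C")},
    {("A", "C")},
]
saturated = [saturate(d) for d in dags]
patterns = closed_frequent_itemsets(saturated, min_support=2)
for edge_set, tidlist in patterns.items():
    print(sorted(edge_set), sorted(tidlist))
```

With a threshold of 2, the edge set {(A, B), (A, C), (B, C)} is reported for the first two DAGs even though A → B is not an edge of the second one, which is the embedded (ancestor-descendant) semantics. The enumeration is exponential in the number of input DAGs, so it says nothing about the scalability issues discussed above.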

3.2. Item space reduction using the tiles

Experiments show that the successful use of the ancestor-descendant edges as items for a closed frequent itemset algorithm is not easy on real data, because the item space they make is too big for efficient handling. Our solution to reduce the item space size is to use instead special groups of ancestor-descendant edges named *tiles* as items. The tiles are the closed frequent sets of ancestor-descendant edges that share the same initial label, like {(a, b₁), ..., (a, bₖ)} (a, b₁, ..., bₖ ∈ L) for example. They can be represented as depth 1 trees. Let T denote the set of all tiles. Figure 1 shows a simple example of tile discovery. The input DAGs are D₁ and D₂. For each label a matrix is constructed, whose lines are the vertices of the given label and whose columns indicate the labels of the descendants of these vertices. For example for label A there are two lines corresponding to the vertices A₁ and A₆. A₁ has {C, D, E} as descendants, whereas A₆ has [...] as descendants. Applying a closed frequent itemset mining algorithm on this matrix gives the tiles whose root has the given label, as shown on the right of the figure. Note that in the figure, only the matrices for labels A, B and D are shown; other labels, having no descendants, do not produce tiles.

Figure 1: Example of tile discovery, frequency threshold ε = 2. Vertex identifiers are subscripts of vertex labels.

3.3. Further item space reduction by tiles

The input DAGs can be represented by the tiles they contain. When, for each DAG D ∈ 𝒟, Ti_D is the set of tiles included in D, the new input data is 𝒟_Ti = {Ti_D | D ∈ 𝒟}. Applying a closed frequent itemset mining algorithm on 𝒟_Ti gives closed frequent tilesets, from which the c.f.e.-DAGs can be deduced. However in this case also the results are closed frequent sets of c.f.e.-DAGs, which are too numerous for successful computation in complex cases.

Our solution to this combinatorial problem is to make computations where the item space is reduced to small subsets of the whole tile set T. The ideas are that each c.f.e.-DAG is likely to contain only a few tiles of T, and that two disconnected c.f.e.-DAGs that appear frequently in the same input DAGs will necessarily contain different tiles (thanks to the distinct labels hypothesis). A further observation is the following: given a c.f.e.-DAG D and a vertex l of D, let T_l be the biggest tile included in D whose root is l. Then the leaves of T_l represent all the descendants of l in D. Let l′ be one of these descendants; then the set of all the vertices of T_l′, the biggest tile of root l′ included in D, will be included in T_l.

We have used this monotonic property to create a simple heuristic method grouping the tiles based on their labels. The result is a set of tile groups 𝒯𝒢, satisfying the following properties:

**Property 1** *For any c.f.e.-DAG D, there exists a tile group TG ∈ 𝒯𝒢 such that TG contains all the tiles contained in D.*

**Property 2** *For all TG ∈ 𝒯𝒢, TG is unlikely to contain the tiles of two c.f.e.-DAGs D₁ and D₂ that appear frequently in the same DAGs, even if this can happen in rare cases.*

For each tile group TG ∈ 𝒯𝒢 we run a closed frequent itemset algorithm on 𝒟_Ti reduced to the tiles of TG. A closed frequent tileset obtained here corresponds to the tiles of an ancestor-descendant relationship saturated c.f.e.-DAG. Even if a closed frequent tileset contains multiple c.f.e.-DAGs, a simple safety check is performed to further decompose it into each of the c.f.e.-DAGs it contains.

3.4. Discovering the c.f.e.-DAGs from their [...]

Having determined each tileset that corresponds exactly to the tiles of an ancestor-descendant relationship saturated c.f.e.-DAG D⁺, we know that for each c.f.e.-DAG D we have all its ancestor-descendant edges, i.e. we have E_D⁺. What remains is to find the non-saturated c.f.e.-DAGs, which are the expected final result. For this we divide the edges of a c.f.e.-DAG into two categories, i.e., the *direct edges* and the *transitive edges*. An edge (u, v) is a direct edge if the only way from u to v in the c.f.e.-DAG is the edge (u, v). An edge (u, v) is a transitive edge if it is not a direct edge, i.e. if there are several paths from u to v. These are shown in Figure 2.
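The distinction between direct and transitive edges can be illustrated with a short Python sketch. It assumes the input is already a saturated (transitively closed) edge set over distinct labels; the function name `split_edges` and the example DAG are hypothetical. It only shows the classification itself, not the additional check that DIGDAG applies to transitive edges against the input DAGs.

```python
def split_edges(saturated_edges):
    """Classify the edges of a saturated DAG as direct or transitive."""
    successors = {}
    for u, v in saturated_edges:
        successors.setdefault(u, set()).add(v)

    direct, transitive = set(), set()
    for u, v in saturated_edges:
        # In a saturated edge set, another path from u to v exists exactly
        # when some intermediate vertex w satisfies u -> w and w -> v.
        has_detour = any(v in successors.get(w, ()) for w in successors[u])
        (transitive if has_detour else direct).add((u, v))
    return direct, transitive


# Hypothetical saturated DAG: A -> B -> C plus the implied edge A -> C.
direct, transitive = split_edges({("A", "B"), ("B", "C"), ("A", "C")})
print(sorted(direct))      # [('A', 'B'), ('B', 'C')]
print(sorted(transitive))  # [('A', 'C')]
```

The test on an intermediate vertex is sufficient only because the edge set is saturated; on a non-saturated DAG a longer detour would require a reachability search instead.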

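Looking back at the tile discovery of Section 3.2, the per-label matrices and the resulting tiles can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the input DAGs are given as saturated edge sets over distinct labels (so each label contributes at most one row per input DAG), the names `discover_tiles` and `closed_frequent_itemsets` are hypothetical, the data is not that of Figure 1, and a brute-force enumeration replaces the closed frequent itemset miner actually used.

```python
from itertools import combinations


def closed_frequent_itemsets(rows, min_support):
    """Brute-force closed frequent itemsets over a dict {tid: set of items}."""
    tids = sorted(rows)
    closed = {}
    for size in range(max(min_support, 1), len(tids) + 1):
        for group in combinations(tids, size):
            common = frozenset.intersection(*(frozenset(rows[t]) for t in group))
            if common:
                closed[common] = frozenset(t for t in tids if common <= rows[t])
    return closed


def discover_tiles(saturated_dags, min_support):
    """Return the tiles as (root label, leaf labels, tidlist) triples."""
    # One "matrix" per label: a row per input DAG, columns = descendant labels.
    matrices = {}
    for tid, edges in enumerate(saturated_dags):
        for ancestor, descendant in edges:
            matrices.setdefault(ancestor, {}).setdefault(tid, set()).add(descendant)

    tiles = set()
    for root, rows in matrices.items():
        for leaves, tidlist in closed_frequent_itemsets(rows, min_support).items():
            tiles.add((root, leaves, tidlist))
    return tiles


# Hypothetical saturated input DAGs (ancestor-descendant pairs of labels).
saturated_dags = [
    {("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")},
    {("A", "C"), ("A", "D"), ("A", "E"), ("B", "C")},
]
tiles = discover_tiles(saturated_dags, min_support=2)
for root, leaves, tidlist in sorted(tiles, key=lambda t: t[0]):
    print(root, sorted(leaves), sorted(tidlist))
```

With a threshold of 2, this yields one tile rooted at A with leaves {C, D} and one tile rooted at B with leaf {C}. The items handed to the next mining stage are then these two tiles rather than the seven distinct ancestor-descendant edges of the input.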
Figure 2: Simple example of direct and transitive edges

The direct edges of the c.f.e.-DAGs are easily found from the saturated c.f.e.-DAGs. The transitive edges are the tricky part. For every transitive edge of each saturated c.f.e.-DAG D⁺, we check if the transitive edge is actually contained in all the input DAGs containing D⁺. This check is efficiently done by applying a closed frequent itemset mining algorithm to the edge sets of all the input DAGs containing D⁺. An edge of each closed frequent edge set found here represents a transitive edge of a non-saturated c.f.e.-DAG in the case where the edge is not a direct edge of D⁺.

We have proved the following property.

**Property 3** *The set of DAGs found by DIGDAG is the sound and complete set of the c.f.e.-DAGs.*

**4. Experiments**

This section is divided into two parts. First, we show that the techniques used in the DIGDAG algorithm drastically reduce computation time and memory usage over the naive method of the beginning of Section 3. Then, we give an application example with the analysis of the action of the terbinafine drug.

The dataset used is a real-world dataset of gene network DAGs, as output by the algorithm of [4] when analyzing microarray data from [3]. This algorithm models gene networks with Bayesian networks. From the input microarray data, a greedy search is performed, whose results are local optima, the *candidate gene networks*. They are of mixed quality: they contain both correct and incorrect parts. The goal of our data mining algorithm is to extract closed frequent gene network patterns from the candidate gene networks, which should be more credible and easier to analyze than the whole candidate networks.

To keep reasonable runtimes for all algorithms, we used a simplified dataset reduced to 100 candidate gene networks of 50 genes. The DAGs of this dataset have on average 284.5 ancestor-descendant edges, with a maximum of 352 edges.

The machine used is a Xeon 2.8 GHz with 8 GB of RAM running under Linux. The implementations of DIGDAG and of the naive algorithm are our own C++ implementations, internally using the LCM2 closed frequent itemset mining algorithm³.

³ http://research.nii.ac.jp/~uno/codes.htm

We have presented in Section 3 a naive method for extracting closed frequent embedded DAGs, and claimed that this method was not efficient enough for real data analysis. We have implemented this method in a way very similar to DIGDAG, and show the performance difference with DIGDAG in Figures 3 and 4.

Figure 3: Naive vs DIGDAG, computation time

Figure 3 shows the difference in computation time. DIGDAG is at least two orders of magnitude faster than the naive method: the improvement made is significant. As a consequence, DIGDAG can find results quickly for low support values, while the naive method’s computation time exceeds the allocated 2 hours for support values below 30%.

The memory consumption difference is shown in Figure 4. For each program we give two values, as reported by the Linux memusage command: the *heap total*, which corresponds to the total amount of memory allocated, and the *heap peak*, which corresponds to the maximum amount of system memory used by the program during its execution. Due to problems with the memusage command, this experiment was made with a machine different from the previous one, a dual-core Xeon at 2.2 GHz with 4 GB of RAM running under Linux. The naive method saturates the available memory for support values under 35%. For the lowest support value, the figure shows that the naive method consumes much more memory than DIGDAG: for a support value of 35%, the heap peak of DIGDAG is three orders of magnitude lower than the heap peak of the naive method. Starting from a support value of 65%, the heap peak of the naive method is even higher than the heap total of DIGDAG. These results show that DIGDAG is much less prone to saturating the system’s memory during computation. This is very important, because if the memory is saturated the program will not give any results, and computation time will have been wasted.

From these experiments, we can see that the improvements made by DIGDAG over the naive method are indeed efficient at reducing the memory consumption of the program. These improvements also allow an important gain in computation time.

In the previous dataset, the 50 genes have not been chosen at random; they have been carefully selected as being the most probably affected by the *terbinafine* drug, in order to [...]

Figure 5: A significant c.f.e.-DAG. (Legible node descriptions: ERG6, S-adenosyl-methionine delta-24-sterol-C-methyltransferase; 3-hydroxy-3-methylglutaryl-coenzyme A reductase 1.)

One of the c.f.e.-DAGs found is shown in Figure 5. This gene network clearly shows relations of the gene ERG1 (the real target of terbinafine) with the genes ERG6, 11, 13, 25 and 26 from the steroid metabolism, and HMG1 and 2 from the lipid metabolism. This suggests that terbinafine works on steroid metabolism or lipid metabolism related genes, which is consistent with existing knowledge about this drug. Our c.f.e.-DAG proposes a precise interaction structure between these genes, which could serve as a basis to design [...]

**5. Conclusion and future works**

We have presented in this paper the DIGDAG algorithm, the first algorithm capable of mining c.f.e.-DAGs. Our first experiments on real data have shown that this algorithm exhibits much better mining performance than a naive approach for the task at hand, and makes it possible to find interesting [...]

As a topic for future research, we would like to be able to handle even more complex data (more vertices and more input DAGs) through the use of parallelism. We would also like to research what modifications would be needed for DIGDAG to handle general graphs instead of DAGs.

**Acknowledgements:** This research was, in part, supported by the Function and Induction project of ROIS/TRIC.

**References**

[1] Y.-L. Chen, H.-P. Kao, and M.-T. Ko. Mining DAG patterns from DAG databases. In *Web-Age Information Management (WAIM), Dalian, China*, 2004.

[2] Y. Chi, Y. Yang, Y. Xia, and R. R. Muntz. CMTreeMiner: Mining both closed and maximal frequent subtrees. In *The Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’04)*, 2004.

[3] T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M. R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend. Functional discovery via a compendium of expression profiles. *Cell*, 102(1):109–126, July 2000.

[4] S. Imoto, T. Goto, and S. Miyano. Estimation of genetic networks and functional structures between genes by using Bayesian network and nonparametric regression. In *Pacific Symposium on Biocomputing*, pages 175–186, 2002.

[5] A. Inokuchi, T. Washio, and H. Motoda. Complete mining of frequent patterns from graphs: Mining graph data. *Machine Learning*, 50(3):321–354, 2003.

[6] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In *Database Theory - ICDT ’99, 7th International Conference*, [...]

[7] A. Termier, M. Rousset, and M. Sebag. Dryade: a new approach for discovering closed frequent trees in heterogeneous tree databases. In *International Conference on Data Mining ICDM’04, Brighton, England*, pages 543–546, 2004.

[8] T. Uno, M. Kiyomi, and H. Arimura. LCM v.2: Efficient mining algorithms for frequent/closed/maximal itemsets. In *2nd Workshop on Frequent Itemset Mining Implementations* [...]

[9] [...] graph patterns. In *KDD*, pages 286–295, 2003.

[10] [...] *Fundamenta Informaticae*, 65(1-2):33–52, [...]

Source: http://mlg07.dsi.unifi.it/pdf/01_Termier.pdf
