Mining Software Engineering Data
Ahmed E. Hassan
North Carolina State Univ.
Simon Fraser Univ.
Univ. of Victoria
Software engineering data (such as code bases, execution traces, historical code changes, mailing lists, and bug databases) contains a wealth of information about a project’s status, progress, and evolution. Using well-established data mining techniques, practitioners and researchers can explore the potential of this valuable data in order to better manage their projects and to produce higher-quality software systems that are delivered on time and within budget.
This tutorial presents the latest research in mining Software Engineering (SE) data, discusses challenges associated with mining SE data, highlights SE data mining success stories, and outlines future research directions. Participants will acquire knowledge and skills needed to perform research or conduct practice in the field and to integrate data mining techniques in their own research or practice.
Software engineering data (such as code bases, execution traces, historical code changes, mailing lists, and bug databases) contains a wealth of information about a software project’s status, progress, and evolution. Many studies have emerged that use this data to support various aspects of software development within industrial and open source settings. Working with Nokia, Gall et al. have shown that software repositories can help developers change legacy systems by pointing out hidden code dependencies. Working with Bell Labs and Avaya, Graves et al. and Mockus et al. demonstrated that historical change information can support management in building reliable software systems by predicting bugs and effort. Working on open source projects, Chen et al. have shown that historical information can assist developers in understanding large systems.
Although the idea of applying data mining techniques to software engineering data has existed since the mid-1990s, it has attracted a large amount of interest lately within software engineering. The workshop on Mining Software Repositories (MSR) has been recognized as the most attended workshop at ICSE since 2001. MSR 2006 was oversubscribed. As a reflection of the great interest in the area and the importance of the MSR work within the context of software engineering, the best papers for three of the major conferences within SE (ICSE, ASE, and ICSM) for 2006 are on applying data mining techniques to SE data. A recent issue of IEEE Transactions on Software Engineering (TSE) on the MSR topic received over 15% of all the submissions to the TSE in 2005.
The tutorial will provide participants with an overview of the field of mining software engineering data, as shown in Figure 1. In particular, the tutorial will cover the following topics along three dimensions (software engineering, data mining, and future directions):
What types of SE data are available to be mined?
Which SE tasks can be helped using data mining?
How are data mining techniques used in SE?
What are the challenges in applying data mining techniques to SE data?
Which data mining techniques are most suitable for specific types of SE data?
What are freely available data mining and analysis tools (e.g., R and WEKA)?
Future Directions: What are the challenges and opportunities for the data mining and software engineering communities?
The tutorial will cover these topics through case studies from recent software engineering conferences. Participants will gain the knowledge needed to accomplish the following tasks: