6117CIT: Adv Topics in Computing Science

Knowledge Discovery and Data Mining

(Course outline Semester 1, 2004)

Identifying Information

 

Course catalogue no:

 6117CIT

Course title:

 Adv Topics in Computing Science

Field of Education Code

 Computer Science

Program/s

2011 Bachelor of Information Technology with Honours

Program Convenor: V. Estivill-Castro

5107 Master of Information and Communication Technology

Program Convenor: J. Gasston

School:

 Computing and Information Technology

Faculty:

 Engineering and Information Technology

Status of Course within program/s or academic plan/s

Elective, honours 

 

Credit point value

 10

Prerequisites:

 Enrolment in Honours Program or MIT

Year and semester:

Semester 1, 2003 and 2004

Course convenor

Assoc. Prof. Vladimir Estivill-Castro

Office: Room 1.14
Technology Building
Nathan Campus
Phone: (+61 7) 387 55402
Extension: 55402
Fax: (+61 7) 387 55051
Email:
V.Estivill-Castro@cit.gu.edu.au

Teaching team members:

Same as Course convenor

Date course outline was last modified

 Feb. 26th, 2003.

 

Background

 

Computer technology and databases have given many companies, institutions, government agencies and corporations extraordinary power to collect and manipulate data about almost every aspect of their functions and activities. Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules. While the interpretation of discovered patterns demands their presentation in visual form, statistics is probably the most familiar approach to summarizing many observations into a few measures of tendency and spread that translate raw data into information for decision-making. Machine learning techniques can be regarded as exploring more flexible non-parametric models as well as a wider range of knowledge representations. Many of the statistical or machine learning approaches translate into large and difficult optimisation and search problems that demand the use of heuristics developed in artificial intelligence.

 

Objectives

 

The major aims and objectives of the subject are to:

  1. Expand on your understanding of why Data Mining technology has emerged and what Data Mining can do.
  2. Expand on your understanding of Data Mining Techniques (Statistical Methods), Machine Learning (Symbolic and Inductive Learning) and Market Basket Analysis (Association Rules).
  3. Provide you with a clear introduction to and understanding of Modelling methods, Supervised Learning (Neural Networks) and Analogical Methods like Support Vector Machines.
  4. Expand on your understanding of Unsupervised Learning and Clustering.
  5. Provide you with an understanding of current issues like Data Mining in Geo-referenced Data, Privacy, User Web Mining and Mining Multimedia Data.
  6. Advance your core research skills; in particular, comprehension of recent research publications, summarization and analysis of literature on a very specific topic, and application of techniques to a practical problem.

Interrelationship of the Course with other Courses and the Program

 

Because it uses techniques from Machine Learning, this course has close links to 3146CIT: Machine Learning. Students who complete the Machine Learning course may find that many useful links can be established between Data Mining and Machine Learning. Also, because of the issues involved in managing large datasets, the content of 3166CIT: Database Management Systems is closely related to Data Mining. In particular, topics like On-Line Analytical Processing can be studied from the perspective of Databases or from the perspective of Knowledge Discovery and Data Mining.

 

Brief Description

 

This subject introduces a selection of current research topics in computing science that are not covered elsewhere in the Honours course. The particular topics are Knowledge Discovery and Data Mining, Inductive Machine Learning (supervised and unsupervised), and applications like Privacy, Spatial Data Mining, Web Usage Mining and Multimedia Data Mining.

 

Content

 

Lecture 1. - Introduction to Knowledge Discovery and Data Mining

Lecture 2. - On-line Analytical Processing (OLAP)

Lecture 3. - Association Rule Mining.

Lecture 4. - Relationship to Machine Learning, Classification and Evaluation of Classifiers

Lecture 5. - Relation to Statistics, illustration with linear discriminants.

Lecture 6. - Clustering and illustration with spatial and categorical data

Lecture 7. - Representation Based Clustering

 

Lecture 8. - Support Vector Machines

Lecture 9. - Data Mining and Privacy

Lecture 10. - Competitive Learning and Kohonen Networks

Lecture 11. - Spatial Data Mining

Lecture 12. - Web usage Mining

 

 

Lectures 1 to 8 cover fundamental concepts and techniques in the current body of knowledge for this field. Lectures 9 to 12 highlight current research topics and may be adjusted according to the interests of participants.

 

Generic Skills Development

 

Emphasis will be placed on generic research skills. This course will teach written communication of summaries, executive reports and literature reviews of research articles in Knowledge Discovery and Data Mining. Students will practise these writing skills. This course will also teach analysis and critical evaluation. There will be guided practice in analysing the contribution and assessing the merit of research papers. Another aspect that will be emphasized is the analysis and critical evaluation of different paradigms in Machine Learning or in the techniques of Knowledge Discovery and Data Mining. One issue on which students will be asked to practise such evaluation is the comparison of traditional statistics with data mining.

 

Problem solving and decision-making will be further developed by practical problems in Data Mining where machine-learning techniques must be selected for their solution. Skills leading to professional effectiveness will be fostered by the debate of issues like Privacy in Data Mining and some links to the ethics of Data Mining Research.

 

Flexible Learning

 

This subject is Mode A - Web Supplemented. A complementary WEB site will make available lecture notes, WEB resources and reading lists with materials to complement lectures; participation on-line is therefore optional for the student. Enrolled students will have access to information additional to that available in the University's calendar or handbook. This information includes the course descriptions and study guides, examination information, assessment overview and readings. It is used to supplement traditional forms of delivery.

 

There is flexibility in choosing 5 out of 11 packages of readings, and students can propose their own package. Students can also propose the application topic for their programming assignment and the programming language in which to implement the algorithm.

 

Rationale for Content

 

The course will develop from an initial introduction to Knowledge Discovery and Data Mining. After the first lecture, the student should be able to understand the spirit of and motivation for the field. This multidisciplinary field has adopted and developed its own methods and techniques. These constitute core material for appreciating tools, algorithms and methods, and for being in a position to apply them in practical settings. The course will present and illustrate these core techniques and their fundamental algorithms. The impact of the techniques on applications and contemporary issues is explored in the later section of the subject.

Organisation and Teaching Methods

 

The course will have 2-hour weekly lectures. Material as described in the course content will be presented and discussed. Students will have to complete suggested weekly readings to complement the material from the lectures. A package of weekly readings consists of 2 to 3 research papers and a chapter in a textbook. Practical activities will be assessed as items of assessment. These will consist of 1) the composition of summaries, 2) the comparison of learning methods and 3) the implementation and programming of a sample of algorithms and techniques towards a potential application.

 

Rationale for Teaching Methods

 

Lectures will provide the material and subject matter towards objectives 1 to 5. The selected readings will contribute towards objective 6. The practical activities will be assessed and will contribute towards all objectives. The material of the first lecture and the first group of readings will address objective 1. Similarly, the remaining readings are grouped by topic and in relation to objectives 1 to 4.

 

Assessment (confirm details in this semester's course outline)

 

1.      5 executive summaries or research surveys. (These are assignments that involve a package of reading and summarizing activities. Reading consists of reading 3-5 research articles; analysing and evaluating them means preparing an executive summary or survey.) Each must be between 1500 and 2000 words, excluding references. Each is worth 4% (a total of 20% for the subject). You can submit up to 8 summaries from different packages of readings; the best 5 will be used to compute your grade. A list of readings and their packaging will be available on the WEB site. You must add to each package a reading of your own choice with a publication date of 2000 or later. You may choose to build a package of your own. In that case, all papers must be from proceedings of the ICDM or KDD Conferences after 2000 or from the journal Knowledge Discovery and Data Mining.

·        DUE DATE: Weeks 2, 4, 6, 8 and 10

2.      One programming assignment for 20%. You are required to program an algorithm of your choice yourself, in the programming language of your choice. However, it must clearly be an algorithm for a core technique in Data Mining and Knowledge Discovery. You must provide a report on your implementation, including testing that validates its correctness to some extent.

·        DUE DATE: Week 12

3.      One research project for 40%. You are to propose a topic or problem that can be addressed by Data Mining techniques. You must obtain a dataset for your problem (perhaps data available on the WEB, or you may postulate an industrial partner who would supply the data). You must use at least two data mining techniques to attempt to solve the problem. You may use public domain software or demo versions of commercial systems. You must produce a report analysing and critically evaluating your experience.

·        DUE DATE: Week 7

4.       Exams. There will be 1 Final exam worth 20%. (sample final)

·         EXAM Period

 

 

Rationale for Assessment

Packages of readings are directly linked to specific objectives. For instance, the first package contains articles that debate the nature and role of Data Mining technology. Reading this particular package will expand on your understanding of why Data Mining technology has emerged and what Data Mining can do. The analysis and critical evaluation skills put into practice by writing an executive summary of the readings in the package will confirm such understanding.

Readings and executive summaries will also advance your core research skills; in particular, comprehension of recent research publications, and summarization and analysis of literature on a very specific topic.

 

The practical implementation of an algorithm for a core task of data mining will confirm the understanding from lectures and readings. Arguing about its correctness will ensure that the inner workings and the subtle aspects of the data structures are fully understood.

 

The comparison of two data mining techniques and their application to a particular problem will reinforce objective 6.

 

 

Texts and Supporting Materials

There is no prescribed textbook. However, the following constitute excellent references for expanding on the material in Lectures or in the readings. Readings will be made available through the course WEB site. 

 

Title

Data mining: concepts and techniques / Jiawei Han and Micheline Kamber.

Author

Han, Jiawei.

Publication

San Francisco: Morgan Kaufmann Publishers, 2001.

Description

xxiv, 550 p. : ill. ; 24 cm.

Series

Morgan Kaufmann series in data management systems

QGU Nathan

QA76.9.D343 H36 2001

 

Title

Data mining : practical machine learning tools and techniques with Java implementations / Ian H. Witten, Eibe Frank.

Author

Witten, I. H. (Ian H.)

Publication

San Francisco, Calif. : Morgan Kaufmann, 2000.

QGU Logan

QA76.9.D343 W58 2000

 

 

Title

Data mining techniques : for marketing, sales, and customer support / Michael J.A. Berry, Gordon Linoff.

Author

Berry, Michael J. A.

Publication

New York ; Chichester, [England] : Wiley Computer Pub., c1997.

QGU Nathan

HF5415.125 .B47 1997

 

 

 

Scope of Course Evaluation

This course will be evaluated using a student questionnaire with some open questions.

 

Administration

Because of the close links between Machine Learning and Data Mining, an effort will be made to coordinate the two subjects so that they can provide some learning support to each other. In particular, lecture timetabling may be adjusted to facilitate attending both sets of lectures, and some topics may be rearranged in their sequence.

 

Course Communications

The course convenor should be contacted by e-mail in the first instance regarding any difficulties with the course. A weekly schedule of the course convenor's activities is available on his personal WEB page and can be used to arrange an appointment in case of an urgent matter.