Civil-Comp Proceedings
ISSN 1759-3433 CCP: 82
Proceedings of the Eighth International Conference on the Application of Artificial Intelligence to Civil, Structural and Environmental Engineering
Edited by: B.H.V. Topping
Paper 17
Data Mining Techniques for Analysing Geotechnical Data
I.E.G. Davey-Wilson
Department of Computing, School of Technology, Oxford Brookes University, Oxford, United Kingdom

I.E.G. Davey-Wilson, "Data Mining Techniques for Analysing Geotechnical Data", in B.H.V. Topping, (Editor), "Proceedings of the Eighth International Conference on the Application of Artificial Intelligence to Civil, Structural and Environmental Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 17, 2005. doi:10.4203/ccp.82.17
Keywords: data mining, missing data, neural network, decision trees, nearest neighbour, naïve Bayes, algorithm analysis, geotechnical data.
Summary
As geotechnical materials are naturally occurring, a database of parameters derived from geotechnical test results will both be noisy and contain missing parameter values. Relationships exist between some of the parameters in the data which can be generated from empirical mathematical correlations [1] or recognised by statistical methods. These relationships can then be exploited to make limited predictions of other parameters [2] from a few known values. However, as the databases would be multidimensional and at least one of the parameters would be categorical rather than numerical (i.e. a class), some hidden relationships within the data can only be retrieved using techniques such as data mining.
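As a rough illustration of the kind of parameter relationship referred to above, the sketch below fits a simple least-squares line over the complete records of two related attributes and uses it to estimate a missing entry. The attribute names and numbers are hypothetical and are not taken from the paper.

```python
# Minimal sketch: estimate a missing geotechnical parameter from a
# correlated one using a least-squares fit over the complete records.
# Column names and values are hypothetical, for illustration only.
import numpy as np

liquid_limit     = np.array([35.0, 48.0, 52.0, 61.0, 44.0, 57.0])
plasticity_index = np.array([14.0, 25.0, np.nan, 35.0, 21.0, 30.0])  # one gap

complete = ~np.isnan(plasticity_index)
slope, intercept = np.polyfit(liquid_limit[complete],
                              plasticity_index[complete], deg=1)

missing = np.isnan(plasticity_index)
estimate = slope * liquid_limit[missing] + intercept
print(f"estimated plasticity index: {estimate[0]:.1f}")
```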
Data mining employs algorithms that are a mixture of statistics, logic, mathematics and artificial intelligence. There is a large number of algorithms (described in [3]) that seek relationships within datasets from which rules of some kind can be derived and subsequently used for prediction, classification or other functions, but selecting the most effective algorithm is not an intuitive process. The algorithms fall into a number of groups of methods, of which four of the most widely used are neural networks, decision trees, nearest neighbour and Bayesian logic. Many of the algorithms have been refined and augmented to show improvements over the originals, e.g. [4] and [5], but the improvements are often marginal. This work has centred on experiments with algorithms from these four groups of methods, and in particular on algorithms that are amongst the simplest examples of each group, namely the multilayer perceptron, J48, IBk and naïve Bayes respectively. The work concentrates on analysing algorithm performance with respect to varying degrees of missing data and consequently derives methodologies for selecting the most appropriate data mining technique for data sets with large amounts of missing data.

The particular nature of geotechnical data dictates that there are numerous gaps in any set of test data, for a variety of reasons. To simulate real data performance, a series of synthetic geotechnical data sets was created (geoSynth1 to 4). These were created initially from a complete data set, i.e. with zero percent missing. Experiments were carried out on the data with increasing percentages of data missing, up to 50%. The distribution of missing records can be totally random throughout the dataset or can display a bias, with more values missing for particular attributes or for particular records. This variability was mirrored in the geoSynth sets, where some attributes could be weighted to show more missing data than others. Within the original data set a test set is put aside, and the rules derived from the remaining training set are used to predict values in the test set. The effectiveness of an algorithm is measured by its ability to establish rules on the training data set and apply them to correctly predict values in the test set.

Experiments with each data mining algorithm for many combinations of missing records produced results that demonstrated the competence of the algorithms as the percentage of missing data was increased from zero to 50%. The results indicated that the algorithms differed in effectiveness at various degrees of missing data. For example, the neural network algorithm generally showed the best result at zero percent missing but also showed the highest rate of decline in performance as the percentage increased. Conversely, the naïve Bayes algorithm was only moderately effective at zero percent missing but showed very little decline in effectiveness with increasing percentage missing, and was generally the best performer at 50% missing. A subsidiary investigation found that missing data could be categorised in relation to proportions of attribute and record weighting, and that the nature of the missing data has some effect on the performance of the algorithms.
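The following sketch outlines a comparable experimental protocol, assuming scikit-learn stand-ins for the Weka-style algorithms named above (MLPClassifier for the multilayer perceptron, DecisionTreeClassifier for J48, KNeighborsClassifier for IBk, GaussianNB for naïve Bayes). The synthetic data, the attribute weights and the imputation step are illustrative assumptions, not the geoSynth sets or the procedure used in the paper.

```python
# Sketch of the protocol: build a complete synthetic data set, blank out an
# increasing fraction of values (optionally biased per attribute), then score
# four classifiers on a held-out test set at each missing-data level.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic complete data set: 8 numeric attributes and one class attribute.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

def inject_missing(X, fraction, attribute_weights=None, rng=rng):
    """Blank out roughly `fraction` of the entries, optionally biased per attribute."""
    X = X.copy()
    n_cols = X.shape[1]
    if attribute_weights is None:
        attribute_weights = np.ones(n_cols)
    p = fraction * attribute_weights / attribute_weights.mean()
    mask = rng.random(X.shape) < p          # per-column missing probability
    X[mask] = np.nan
    return X

models = {
    "multilayer perceptron": MLPClassifier(max_iter=2000, random_state=0),
    "decision tree (J48-like)": DecisionTreeClassifier(random_state=0),
    "nearest neighbour (IBk-like)": KNeighborsClassifier(n_neighbors=5),
    "naive Bayes": GaussianNB(),
}

for fraction in (0.0, 0.1, 0.25, 0.5):      # zero to 50% missing
    X_tr = inject_missing(X_train, fraction)
    X_te = inject_missing(X_test, fraction)
    scores = []
    for name, model in models.items():
        # Weka handles missing values natively; a mean imputer stands in here.
        clf = make_pipeline(SimpleImputer(strategy="mean"),
                            StandardScaler(), model)
        clf.fit(X_tr, y_train)
        scores.append(f"{name}: {clf.score(X_te, y_test):.2f}")
    print(f"{int(fraction * 100):>2}% missing -> " + ", ".join(scores))
```

Passing, for example, attribute_weights=np.array([2, 2, 1, 1, 1, 1, 1, 1]) to inject_missing biases the gaps towards the first two attributes, mirroring the attribute-weighted variants of the geoSynth sets described above.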