Gal A. Kaminka: Publications

Mining and Classification of Multivariate Sequential Data

Ariella D. Richardson. Mining and Classification of Multivariate Sequential Data. Ph.D. Thesis, Bar Ilan University, 2011.

Abstract

Multivariate sequence mining and classification are important and challenging tasks. They can be applied to numerous domains including medical diagnosis, handwriting deficiency diagnosis, identification of users for security or personalized TV services, and even transportation and traffic planning. The problem we address in this dissertation is classification of multivariate sequences. Multivariate sequences are sequences that have multiple attributes for each item in the sequence. Several attempts to address this problem exist, but none provide a full solution. One type of solution reduces the problem to a single-attribute or non-sequential problem, losing valuable information in the process. Other solutions address both the multivariate and the sequential aspects of the input but do not scale.

In this dissertation we first present COACH (Cumulative Online Algorithm for Classification of Handwriting deficiencies). COACH is a classification algorithm for multivariate sequences that uses heuristics to combine several single-attribute classifications. COACH is evaluated on real data collected with a digitizer tablet from children with poor handwriting. Results show that COACH successfully differentiates between poor and proficient handwriting. Integrating several single-attribute classifications encouraged us to search for a solution that uses all the attributes together in the classification process.

The second part of the dissertation introduces frequent sequence mining. Frequent sequence mining, as well as being a challenging and interesting task, can be used for classification, as we show in the third part of the dissertation. Many algorithms have been proposed to efficiently address frequent sequence mining. Most of them use support-based mining to achieve this task. However, support-based mining has been shown to suffer from a bias towards mining short sequences. We show how resolving this bias produces better sequences than traditional support-based mining. We present REEF, a frequent sequence mining algorithm that resolves this length bias. We define norm-frequency, based on the statistical z-score of support, and use it to replace support-based frequency. Unfortunately, the use of norm-frequency hinders pruning. We address this issue and introduce a bound to perform pruning. Calculating the norm-frequency requires a preprocessing stage performed on a sample of the database. Values acquired from the sample suffer from a distortion. We analyze this distortion and correct it.

Experimental results on synthetic and real-world data presented in this dissertation establish that REEF overcomes the short sequence bias successfully. Mining performed with REEF on textual data is used to demonstrate that the sequences mined with REEF are more meaningful than those mined with support-based algorithms, indicating that REEF is better than traditional algorithms for producing interesting sequences.

Finally, in the third part of the dissertation we use the new mining algorithm REEF to develop CUBS (Classification Using Bounded Z-Score with Sampling), a classification algorithm for multivariate sequences. CUBS uses REEF mining to produce frequent subsequences, and then selects among them the statistically significant subsequences to compose a classification model. We evaluate the accuracy of CUBS on a synthetic dataset and on two real-world datasets. CUBS provides a scalable classification algorithm for multivariate sequence classification that makes use of both the multiple attributes and the sequential nature of the data.
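
The abstract describes norm-frequency only as being based on the statistical z-score of support. The following is a minimal sketch of that idea, not the REEF implementation from the thesis. It assumes that each sequence's support is standardized against the mean and standard deviation of support among sequences of the same length. That grouping, the function name `norm_frequency`, and the toy support counts are all illustrative assumptions, and the sketch omits the pruning bound and the sampling correction mentioned above.

```python
from collections import defaultdict
from statistics import mean, pstdev


def norm_frequency(support_by_sequence):
    """Map each sequence to the z-score of its support.

    Assumption (not stated in the abstract): the z-score is taken
    relative to all sequences of the same length.
    """
    by_length = defaultdict(list)
    for seq, support in support_by_sequence.items():
        by_length[len(seq)].append(support)

    # Per-length mean and population standard deviation of support.
    stats = {n: (mean(v), pstdev(v)) for n, v in by_length.items()}

    scores = {}
    for seq, support in support_by_sequence.items():
        mu, sigma = stats[len(seq)]
        # A length with no spread carries no signal: score 0.0
        # instead of dividing by zero.
        scores[seq] = 0.0 if sigma == 0 else (support - mu) / sigma
    return scores


# Hypothetical support counts: short sequences have much higher raw support.
supports = {
    ("a",): 90, ("b",): 80, ("c",): 70,
    ("a", "b"): 30, ("b", "c"): 10, ("a", "c"): 8,
}
scores = norm_frequency(supports)
```

In this toy data the sequence `("a", "b")` has lower raw support than `("a",)` but a higher standardized score, which is the kind of length-bias correction the abstract attributes to norm-frequency.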

Additional Information

BibTeX

@PhdThesis{ariella-phd,
author = {Ariella D. Richardson},
title = {Mining and Classification of Multivariate Sequential Data},
school = {{B}ar {I}lan {U}niversity},
year = {2011},
wwwnote = {},
OPTannote = {} ,
abstract = { Multivariate sequence mining and classification are important and challenging tasks.
They can be applied to numerous domains including medical diagnosis, handwriting
deficiency diagnosis, identification of users for security or personalized TV services,
and even transportation and traffic planning. The problem we address in this dissertation is
classification of multivariate sequences. Multivariate sequences are sequences
that have multiple attributes for each item in the sequence. Several attempts to address
this problem exist, but none provide a full solution. One type of solution reduces
the problem to a single-attribute or non-sequential problem, losing valuable
information in the process. Other solutions address both the multivariate and the
sequential aspects of the input but do not scale.
In this dissertation we first present COACH (Cumulative Online Algorithm for
Classification of Handwriting deficiencies). COACH is a classification algorithm
for multivariate sequences that uses heuristics to combine several single-attribute
classifications. COACH is evaluated on real data collected with a digitizer tablet
from children with poor handwriting. Results show that COACH successfully
differentiates between poor and proficient handwriting. Integrating several
single-attribute classifications encouraged us to search for a solution that uses all the
attributes together in the classification process.
The second part of the dissertation introduces frequent sequence mining. Frequent
sequence mining, as well as being a challenging and interesting task, can be used for
classification, as we show in the third part of the dissertation. Many algorithms
have been proposed to efficiently address frequent sequence mining. Most of them
use support-based mining to achieve this task. However, support-based mining has
been shown to suffer from a bias towards mining short sequences. We show how
resolving this bias produces better sequences than traditional support-based mining.
We present REEF, a frequent sequence mining algorithm that resolves this length
bias. We define norm-frequency, based on the statistical z-score of support, and
use it to replace support-based frequency. Unfortunately, the use of norm-frequency
hinders pruning. We address this issue and introduce a bound to perform pruning.
Calculating the norm-frequency requires a preprocessing stage performed on a sample
of the database. Values acquired from the sample suffer from a distortion. We analyze
this distortion and correct it.
Experimental results on synthetic and real-world data presented in this dissertation
establish that REEF overcomes the short sequence bias successfully. Mining
performed with REEF on textual data is used to demonstrate that the sequences
mined with REEF are more meaningful than those mined with support-based algorithms,
indicating that REEF is better than traditional algorithms for producing
interesting sequences.
Finally, in the third part of the dissertation we use the new mining algorithm
REEF to develop CUBS (Classification Using Bounded Z-Score with Sampling), a
classification algorithm for multivariate sequences. CUBS uses REEF mining to
produce frequent subsequences, and then selects among them the statistically significant
subsequences to compose a classification model. We evaluate the accuracy of
CUBS on a synthetic dataset and on two real-world datasets. CUBS provides a scalable
classification algorithm for multivariate sequence classification that makes use of
both the multiple attributes and the sequential nature of the data.
},
}
