@COMMENT This file was generated by bib2html.pl <http://www.cs.cmu.edu/~pfr/misc_software/index.html#bib2html> version 0.91
@COMMENT written by Patrick Riley <http://www.cs.cmu.edu/~pfr>
@COMMENT This file came from Gal A. Kaminka's publication pages at
@COMMENT http://www.cs.biu.ac.il/~galk/Publications/
@PhdThesis{ariella-phd, 
author = {Ariella D. Richardson}, 
title = {Mining and Classification of Multivariate Sequential Data}, 
school = {{B}ar {I}lan {U}niversity}, 
year = {2011}, 
  wwwnote = {}, 
OPTannote = {} ,
  abstract = { Multivariate sequence mining and classification are important and challenging tasks. 
They can be applied to numerous domains including medical diagnosis, handwriting 
deficiency diagnosis, identification of users for security or personalized TV services, 
and even transportation and traffic planning. The problem we address in this dissertation is 
classification of multivariate sequences. Multivariate sequences are sequences 
that have multiple attributes for each item in the sequence. Several attempts to address 
this problem exist, but none provide a full solution. One type of solution 
reduces the problem to a single-attribute or non-sequential problem, 
losing valuable information in the process. Other solutions address both the multivariate and the 
sequential aspects of the input but are not scalable. 
    In this dissertation we first present COACH (Cumulative Online Algorithm for 
Classification of Handwriting deficiencies). COACH is a classification algorithm 
for multivariate sequences that uses heuristics to combine several single-attribute 
classifications. COACH is evaluated on real data collected with a digitizer tablet 
from children with poor handwriting. Results show that COACH successfully 
differentiates between poor and proficient handwriting. Integrating several 
single-attribute classifications encouraged us to search for a solution that uses all the 
attributes together in the classification process. 
    The second part of the dissertation introduces frequent sequence mining. Frequent 
sequence mining, as well as being a challenging and interesting task, can be used for 
classification, as we show in the third part of the dissertation. Many algorithms 
have been proposed to efficiently address frequent sequence mining. Most of them 
use support-based mining to achieve this task. However, support-based mining has 
been shown to suffer from a bias towards mining short sequences. We show how 
resolving this bias produces better sequences than traditional support-based mining. 
    We present REEF, a frequent sequence mining algorithm that resolves this length 
bias. We define norm-frequency, based on the statistical z-score of support, and 
use it to replace support-based frequency. Unfortunately, the use of norm-frequency 
hinders pruning. We address this issue by introducing a bound that enables pruning. 
Calculating the norm-frequency requires a preprocessing stage performed on a sample 
of the database. Values acquired from the sample are distorted. We analyze 
this distortion and correct it. 
    Experimental results on synthetic and real-world data presented in this dissertation 
establish that REEF successfully overcomes the short sequence bias. Mining 
performed with REEF on textual data demonstrates that the sequences 
mined with REEF are more meaningful than those mined with support-based algorithms, 
indicating that REEF is better than traditional algorithms at producing 
interesting sequences. 
    Finally, in the third part of the dissertation we use the new mining algorithm 
REEF to develop CUBS (Classification Using Bounded Z-Score with Sampling), a 
classification algorithm for multivariate sequences. CUBS uses REEF to 
mine frequent subsequences, and then selects among them the statistically significant 
subsequences to compose a classification model. We evaluate the accuracy of 
CUBS on a synthetic dataset and on two real-world datasets. CUBS provides a scalable 
algorithm for multivariate sequence classification that makes use of 
both the multiple attributes and the sequential nature of the data. 
 },
}

