Virtual Course on Sequence Data Analysis

Gilbert Ritschard and the TraMineR team, LIVES and IDEMO, University of Geneva

This virtual course will be given in 12 weekly two-hour lectures starting September 4, 2012. Lectures start at 10:30 AM EST (15:30 GMT).

$280 student, $325 faculty, $380 practitioner (No more registrations will be taken after August 21st)

Content

The course focuses on methods for exploring and analyzing categorical longitudinal data describing life courses, such as family trajectories or professional careers. The aims are: (i) to explain the whole process of sequence analysis, from the preparation of longitudinal data and the exploration of sequences to the use of more advanced explanatory analyses, and (ii) to train participants in the practice of sequence analysis by means of the TraMineR package for the R graphical and statistical environment.

Topics covered for state sequences:

  • the visual rendering of sequence data,
  • transversal and longitudinal sequence descriptive statistics,
  • optimal matching and other ways of measuring the dissimilarity between sequences,
  • clustering individual sequences,
  • identifying representative trajectories,
  • discrepancy analysis and regression trees for sequence data;

Topics covered for event sequences:

  • rendering the sequencing,
  • mining typical subsequences and associations between those subsequences,
  • finding the subsequences that best discriminate between groups, for instance between women and men,
  • measuring the dissimilarity between event sequences and dissimilarity-based analysis of event sequences.

The course is user-oriented and includes an introduction to R that provides the basic knowledge required for using TraMineR. The scope of sequence analysis will be illustrated with real data from the Swiss Household Panel (http://www.swisspanel.ch) and other datasets that come with the TraMineR package. Participants are encouraged to practice the methods on their own data.

About R and TraMineR   

R is free, open-source software available at http://www.r-project.org. TraMineR is distributed through CRAN (http://cran.r-project.org). See http://mephisto.unige.ch/traminer for details about TraMineR.

About the Instructor

Dr. Gilbert Ritschard is a full professor of statistics at the Department of Economics of the University of Geneva, where he is responsible for the program of statistics and quantitative methods for the social sciences. He conducts his research within the Institute for Demographic and Life Course Studies and has served as vice-dean of the Faculty of Economics and Social Sciences since 2007. He graduated in econometrics and obtained his Ph.D. in Econometrics and Statistics at the University of Geneva in 1979. He has taught as an invited professor in Toronto, Montreal, Lyon, Lausanne and Fribourg. He has published on topics such as the mining of event histories, sequence analysis, atypical data detection, and decision trees, as well as on more applied topics in demography, sociology and social science history. He has headed or co-headed several funded applied research projects on, for example, the mining of Swiss life course histories, differential mortality, and life courses in 19th-century Geneva. With his team he developed TraMineR, a toolbox used worldwide for exploring and analyzing sequence data in R. His present research interests are in categorical and numerical longitudinal data analysis and its application to life course analysis. He currently leads a methodological individual project within the Swiss National Center of Competence in Research (NCCR) "LIVES: Overcoming vulnerability, life course perspectives."

Contact

Organization: Aron Lindberg
Course Content: Gilbert Ritschard


Course Prerequisites

Before the course starts, please make sure you have both a GitHub account and a StackExchange account.

We will use StackExchange, with the TraMineR tag, for public discussion of the course, and GitHub to store the course files.

Also, click the classroom link at the top and make sure you can log in as a guest. Use your full name as the guest name.

We will be using R with the TraMineR package. If you can, please install R (available at http://www.r-project.org) and the TraMineR package from CRAN. We will also refer to RStudio, so please install that as well.

Tentative Course Outline (as of May 2012)

Day 1: Introduction - September 4

  • About longitudinal data analysis
  • What is sequence analysis (SA)?
    • How does SA compare with other longitudinal methods?
    • Chronological and non-chronological sequences; states, events, transitions
  • What kinds of questions can SA answer? Sequencing, timing, and duration
  • Preview of what you will learn
  • TraMineR: an R package for sequence analysis
    • About TraMineR and other software for sequence analysis
    • A first run: creating a state sequence object and rendering the sequences

Day 2: Starting with R and TraMineR - September 11

  • About the R statistical and graphical environment
  • A short introduction to R
  • TraMineR and other useful packages: installing a package and exploring its content and documentation
  • Importing data from other software and checking the content of data sets
  • Basic statistical analysis in R (tabulating data, linear and logistic regression, ANOVA, ...)

Day 3: Rendering and Describing State Sequences - September 18

  • The seqdef() function and its options
  • Transversal and individual longitudinal characteristics
  • Rendering sequences: three basic plots
  • Comparing groups and controlling the plots
  • Aggregated views of a set of sequences
    • Sequence of transversal indicators (modal state, entropy, ...)
    • Mean time spent in each state, transition rates.
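
To give a feel for the aggregated views listed above, here is a minimal sketch of how transition rates and the sequence of modal states can be computed. It is written in plain Python purely as a language-neutral illustration and is not TraMineR code; the toy sequences, the state labels (S, M, D) and the function names are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one string per individual, one character per time point.
# S = single, M = married, D = divorced (labels invented for this illustration).
sequences = ["SSMM", "SMMM", "SSSM", "SMDD", "SSMD"]


def transition_rates(seqs):
    """Return {(a, b): rate}, the share of moves to b among all moves out of a."""
    counts = defaultdict(Counter)
    for seq in seqs:
        # Pair each position with the next one to enumerate observed transitions.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    rates = {}
    for current, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        for nxt, n in nxt_counts.items():
            rates[(current, nxt)] = n / total
    return rates


def modal_states(seqs):
    """Return the most frequent state at each time position (equal lengths assumed)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*seqs)]


rates = transition_rates(sequences)
modes = modal_states(sequences)
```

In this sketch "staying" in a state (for example S followed by S) counts as a transition from the state to itself, so the rates out of each state sum to one.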

Day 4: Longitudinal Characteristics of Individual Sequences - September 25

  • Basic attributes: sequence length, number of transitions, state duration
  • Composite characteristics: within entropy, complexity, turbulence
  • Studying the relationship between sequence characteristics and covariates
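
As a rough illustration of the individual characteristics listed above, the following plain-Python sketch computes the number of transitions, the spells (state and duration), and a within-sequence entropy for a single sequence. It is not TraMineR code: the function names are hypothetical, and scaling the entropy by the logarithm of the alphabet size is an assumption chosen here so that values fall between 0 and 1.

```python
import math
from collections import Counter
from itertools import groupby


def n_transitions(seq):
    """Count state changes between consecutive positions."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def spells(seq):
    """Collapse a sequence into (state, duration) pairs, one per spell."""
    return [(state, len(list(run))) for state, run in groupby(seq)]


def within_entropy(seq, alphabet):
    """Shannon entropy of the state distribution within one sequence.

    Divided by log(len(alphabet)) so the result lies in [0, 1];
    the alphabet must contain at least two states.
    """
    n = len(seq)
    h = -sum((c / n) * math.log(c / n) for c in Counter(seq).values())
    return h / math.log(len(alphabet))


# Hypothetical example: S = single, M = married, D = divorced.
example = "SSMD"
example_spells = spells(example)
example_transitions = n_transitions(example)
```

A sequence that stays in one state has entropy 0, while one that spends equal time in every state of the alphabet reaches the maximum of 1. Note that entropy ignores the order of states, which is why composite measures such as complexity and turbulence also take the sequencing into account.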

Day 5: Handling Sequence Data - October 2

  • Formal representations of sequences.
  • Retrieving spell and person-period data
  • Building state sequences from panel data

Day 6: Issues with Sequential Data - October 9

  • Missing data, time alignment, unequal sequence lengths
  • Weights
  • State definition, time granularity
  • What are the main limitations of sequence analysis?

Day 7: Measuring Pairwise Dissimilarities - October 16

  • Dissimilarity measures
    • Measures based on count of common attributes
    • Optimal matching and other edit distances
    • Other measures
  • Choosing a measure and defining costs
  • Missing data
  • Normalization
  • Multichannel dissimilarities
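
To make the idea of an edit distance concrete, here is a minimal sketch of optimal matching with a constant substitution cost and a constant insertion/deletion (indel) cost, computed by the usual dynamic programming recursion. It is a plain-Python illustration, not TraMineR's implementation; the function name and the default costs (indel 1, substitution 2) are assumptions chosen for this example.

```python
def om_distance(a, b, indel=1.0, sub=2.0):
    """Minimal total cost of turning sequence a into sequence b.

    Allowed operations: insert or delete one state (cost `indel`)
    and substitute one state for another (cost `sub`).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = cheapest way to turn the first i states of a into the first j of b.
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel  # delete everything
    for j in range(1, cols):
        d[0][j] = j * indel  # insert everything
    for i in range(1, rows):
        for j in range(1, cols):
            match_cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(
                d[i - 1][j] + indel,           # delete a[i-1]
                d[i][j - 1] + indel,           # insert b[j-1]
                d[i - 1][j - 1] + match_cost,  # keep or substitute
            )
    return d[-1][-1]


# Hypothetical example: S = single, M = married, D = divorced.
dist_same_length = om_distance("SSMM", "SMMM")
dist_unequal_length = om_distance("SM", "SMD")
```

With these costs a substitution is exactly as expensive as one deletion plus one insertion, so the first pair is at distance 2 either way. Lowering the substitution cost relative to the indel cost makes the measure more sensitive to timing, which is one reason the choice of costs deserves attention.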

Day 8: Dissimilarity-Based Analysis of State Sequences I - October 23

  • Cluster analysis of sequences
    • What is clustering and which method should we use?
    • Validation: How do we determine the number of groups?
    • Cluster interpretation and relationship with covariates
    • Scope and limits of cluster-based analysis
  • Representation on principal coordinates (multidimensional scaling, MDS)

Day 9: Dissimilarity-Based Analysis of State Sequences II - November 6

  • Measuring the discrepancy of a set of sequences
  • Neighborhood and coverage of a sequence
  • Extracting representative sequences
    • Centrality versus density criteria
    • Quality measures of representatives

Day 10: Dissimilarity-Based Analysis of State Sequences III - November 13

  • ANOVA-like discrepancy analysis of sequences
  • Time evolution of sequence-covariate associations
  • Regression trees of sequence data

Day 11: Mining Event Sequences I - November 20

  • Event sequences
    • Sequences of time stamped events: definition and representation.
    • Converting to and from state sequences
    • How does the analysis of event sequences compare with that of state sequences?
  • Rendering the sequencing
  • Searching for frequent subsequences
    • Counting algorithm: Sequence mining versus itemset mining
    • Counting methods
    • Time constraints
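
The notion of a frequent subsequence can be sketched as follows. This plain-Python illustration counts, for each ordered pair of events, the number of sequences in which the first event occurs before the second (gaps allowed), and keeps the pairs that reach a minimum support. It is not TraMineR code: the event names, the function names and the one-count-per-sequence rule are assumptions for this example, and time stamps, simultaneous events and time constraints are deliberately ignored.

```python
from collections import Counter
from itertools import combinations

# Hypothetical event sequences, each listing one individual's events in order.
event_sequences = [
    ["leave_home", "union", "child"],
    ["union", "leave_home", "child"],
    ["leave_home", "child"],
    ["leave_home", "union"],
]


def subsequences_of_length(events, k):
    """Distinct ordered k-event subsequences of one sequence (gaps allowed)."""
    return {
        tuple(events[i] for i in positions)
        for positions in combinations(range(len(events)), k)
    }


def frequent_subsequences(seqs, k, min_support):
    """Return {subsequence: support} for subsequences found in >= min_support sequences."""
    support = Counter()
    for events in seqs:
        # Using a set per sequence means each sequence contributes at most once.
        support.update(subsequences_of_length(events, k))
    return {s: n for s, n in support.items() if n >= min_support}


frequent_pairs = frequent_subsequences(event_sequences, k=2, min_support=2)
```

Enumerating every subsequence as done here grows quickly with sequence length; dedicated sequence mining algorithms avoid this by pruning candidates that cannot reach the minimum support.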

Day 12: Mining Event Sequences II - November 27

  • Determining the most discriminating subsequences
  • Sequential association rules
  • Measuring pairwise dissimilarities among event sequences
  • Dissimilarity-based analysis of event sequences

Recommended readings

Abbott, A. and A. Tsay (2000). Sequence analysis and optimal matching methods in sociology: Review and prospect. Sociological Methods and Research 29(1), 3–33. (With discussion, pp. 34–76).

Aisenbrey, S. and A. E. Fasang (2010). New life for old ideas: The “second wave” of sequence analysis bringing the “course” back into the life course. Sociological Methods and Research 38(3), 430–462.

Billari, F. C. (2001). Sequence analysis in demographic research. Canadian Studies in Population 28(2), 439–458. Special Issue on Longitudinal Methodology.

Billari, F. C. (2005). Life course analysis: Two (complementary) cultures? Some reflections with examples from the analysis of transition to adulthood. In R. Levy, P. Ghisletta, J.-M. Le Goff, D. Spini, and E. Widmer (Eds.), Towards an Interdisciplinary Perspective on the Life Course, Advances in Life Course Research, Vol. 10, pp. 267–288. Amsterdam: Elsevier.

Elzinga, C. H. (2010). Complexity of categorical time series. Sociological Methods & Research 38(3), 463–481.

Elzinga, C. H. and A. C. Liefbroer (2007). De-standardization of family-life trajectories of young adults: A cross-national comparison using sequence analysis. European Journal of Population 23, 225–250.

Gabadinho, A., G. Ritschard, N. S. Müller, and M. Studer (2011a). Analyzing and visualizing state sequences in R with TraMineR. Journal of Statistical Software 40(4), 1–37.

Gabadinho, A., G. Ritschard, M. Studer, and N. S. Müller (2011b). Extracting and rendering representative sequences. In A. Fred, J. L. G. Dietz, K. Liu, and J. Filipe (Eds.), Knowledge Discovery, Knowledge Engineering and Knowledge Management, Volume 128 of Communications in Computer and Information Science (CCIS), pp. 94–106. Springer-Verlag.

Maindonald, J. H. (2008). Using R for data analysis and graphics: Introduction, code and commentary. Manual, Centre for Mathematics and Its Applications, Australian National University.

Piccarreta, R. and F. C. Billari (2007). Clustering work and family trajectories by using a divisive algorithm. Journal of the Royal Statistical Society: Series A (Statistics in Society) 170(4), 1061–1078.

Piccarreta, R. and O. Lior (2010). Exploring sequences: a graphical tool based on multi-dimensional scaling. Journal of the Royal Statistical Society: Series A (Statistics in Society) 173(1), 165–184.

Pollock, G. (2007). Holistic trajectories: A study of combined employment, housing and family careers by using multiple-sequence analysis. Journal of the Royal Statistical Society A 170(1), 167–183.

Ritschard, G., A. Gabadinho, N. S. Müller, and M. Studer (2008). Mining event histories: A social science perspective. International Journal of Data Mining, Modelling and Management 1(1), 68–90.

Ritschard, G., A. Gabadinho, M. Studer, and N. S. Müller (2009). Converting between various sequence representations. In Z. Ras and A. Dardzinska (Eds.), Advances in Data Management, Volume 223 of Studies in Computational Intelligence, pp. 155–175. Berlin: Springer-Verlag.

Studer, M. (2012). Le manuel de la librairie WeightedCluster : un guide pratique pour la création de typologie de séquences avec R. In Étude des inégalités de genre en début de carrière académique à l’aide de méthodes innovatrices d’analyse de données séquentielles, PhD Thesis. Faculté des SES, Université de Genève.

Studer, M., N. S. Müller, G. Ritschard, and A. Gabadinho (2010). Classer, discriminer et visualiser des séquences d’événements. Revue des nouvelles technologies de l’information RNTI E–19, 37–48.

Studer, M., G. Ritschard, A. Gabadinho, and N. S. Müller (2011). Discrepancy analysis of state sequences. Sociological Methods and Research 40(3), 471–510.
