Chapter 1

  1. Classification and regression methods are usually described as supervised learning methods. Explain why, and describe the difference between the two.
  1. What do you need in order to produce a classification model?
  1. How do you know that you have produced a good model?
  1. Why is a classification model also called a generalized model?
  1. Explain how a clustering task differs from a classification task.
  1. What is the core operation/measurement required in a clustering task?
  1. How do you know that you have produced a good clustering outcome?
  1. Explain how an anomaly detection task differs from a classification task and a clustering task.
  1. How do you know that you have produced a good anomaly detection outcome?
  1. How does association pattern mining differ from other data mining tasks?
  1. How does association pattern mining differ from rule-based classification?

Chapter 2

  1. What is the attribute type of each of the following kinds of attributes: (a) Age, (b) Salary, (c) Zip Code, (d) Province of Residence, (e) Height, (f) Weight?
    1. Nominal: arithmetic is meaningless; only distinctness (equal / not equal) applies.
    2. Ordinal: distinctness & order.
    3. Interval: values can be ordered and differences are meaningful (e.g., temperature in Celsius).
    4. Ratio: has an absolute zero point, so arithmetic, including ratios, is meaningful (e.g., height).
  1. An analyst sets up a sensor network in order to measure the temperature of different locations over a period. What is the data type of the data collected?
  1. An analyst processes Web logs in order to create records with the ordering information for Web page accesses from different users. What is the type of this data?
  1. Describe the difference between interval and ratio attribute type.
  1. Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques, such as clustering, classification, association rule mining, and anomaly detection can be applied.
  1. Distinguish between noise and outliers. Be sure to consider the following questions.
    (a) Is noise ever interesting or desirable? Outliers?
    (b) Can noise objects be outliers?
    (c) Are noise objects always outliers?
    (d) Are outliers always noise objects?
    (e) Can noise make a typical value into an unusual one, or vice versa?
  1. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure would you use to compare or group these elephants? Justify your answer and explain any special circumstances.
  1. Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.
  1. Suppose that you had a set of arbitrary objects of different types representing different characteristics of widgets. A domain expert gave you the similarity value between every pair of objects. How would you convert these objects into a multidimensional data set?
  1. Consider the time-series 1, 1, 3, 3, 3, 3, 1, 1. Perform wavelet decomposition of the time series. How many coefficients of the series are non-zero?
  1. Suppose that you had a data set, such that each data point corresponds to sea surface temperatures over a square mile of resolution 10 × 10. In other words, each data record contains a 10 × 10 grid of temperature values with spatial locations. You also have some text associated with each 10 × 10 grid. How would you convert this data into a multidimensional data set?
  1. Suppose that you had a set of discrete biological protein sequences, which are annotated with text describing the properties of the protein. How would you create a multidimensional representation from this heterogeneous data set?
  1. Apart from the assumption that the specified distance matrix is Euclidean, what is the assumption of using Multidimensional Scaling (MDS) in converting a graph to multidimensional representation?
    Hint: The violation of the assumption will not allow MDS to convert a graph.
  1. What is the assumption of spectral methods?
  1. Describe how you would perform clustering using spectral clustering if the dataset is in
    a. Multidimensional representation
    b. Time series
    c. Graphs
  1. MDS methods are designed to preserve global distances, whereas spectral methods are designed to preserve local distances. Explain the meanings of global distances and local distances in this context.
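
For the wavelet question above, here is a minimal sketch of the unnormalized Haar decomposition (pairwise averages with detail coefficients (a − b)/2; sign conventions vary between texts, but the count of non-zero coefficients does not):

```python
def haar_coefficients(series):
    """Unnormalized Haar decomposition: at each level, replace each pair
    (a, b) with its average (a + b) / 2 and emit the detail (a - b) / 2.
    Returns [overall average, coarsest detail, ..., finest details]."""
    coeffs = []
    level = list(series)
    while len(level) > 1:
        averages = [(a + b) / 2 for a, b in zip(level[::2], level[1::2])]
        details = [(a - b) / 2 for a, b in zip(level[::2], level[1::2])]
        coeffs = details + coeffs          # coarser levels go in front
        level = averages
    return level + coeffs

print(haar_coefficients([1, 1, 3, 3, 3, 3, 1, 1]))
# → [2.0, 0.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

Under this convention the series 1, 1, 3, 3, 3, 3, 1, 1 has three non-zero coefficients: the overall average 2 and the two second-level details −1 and 1.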

Chapter 3

  1. Explain the reason why existing distance measures and kernels are affected by certain types of data distributions, as explained in the textbook.
  1. Yet many algorithms that employ these measures have no issues with these data distributions, e.g., k-nearest neighbor classifier, SVM and DBSCAN. Explain the reason why.
  1. A generic method for local distance computation is given as follows:
    a. Partition the data into a set of local regions (use a clustering method)
    b. For any pair of objects, determine the most relevant region for the pair, and compute the pairwise distances using the local statistics of that region. For example, the local Mahalanobis distance may be used in each local region.
    1. Discuss the issues with this method.
    2. Suggest one or more alternatives. Explain the reason why these alternatives are better.
  1. How would you choose an algorithm (or design an algorithm) for a data mining task at hand? The task could be any of the following: classification, clustering or anomaly detection. Discuss the issues you would consider in a sequential order.
  1. Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 length of 1.
  1. Explain the similarity measure you will use to measure flight paths of many migratory birds in a dataset in order to find the most similar flight paths. Name the data type of this dataset.
  1. Explain the criteria you would use to choose a distance/similarity measure and provide the reason(s) of these choices.
  1. What makes a complex data object complex?
  1. As the complexity of a data object increases, the number of possible distance/similarity measures to choose from increases. Explain the reason for this phenomenon.
  1. When do we need to measure similarity between (a) two nodes; and (b) two graphs?
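
For the cosine/Euclidean question above, the identity to derive is ‖x − y‖² = ‖x‖² + ‖y‖² − 2 x·y = 2(1 − cos(x, y)) when both vectors have unit L2 length. A quick numeric check in pure Python (the random 5-dimensional vectors are illustrative only):

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def unit(v):
    """Scale a vector to unit L2 length."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

random.seed(0)
x = unit([random.gauss(0, 1) for _ in range(5)])
y = unit([random.gauss(0, 1) for _ in range(5)])

sq_euclidean = sum((a - b) ** 2 for a, b in zip(x, y))
cosine = dot(x, y)              # both norms are 1, so cosine = dot product
# Identity for unit vectors: ||x - y||^2 = 2 (1 - cos(x, y))
assert abs(sq_euclidean - 2 * (1 - cosine)) < 1e-12
```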

Isolation Kernel

  1. Compare with the generic method for local distance computation, described on page 73 in Aggarwal, Chapter 3:
    a. Partition the data into a set of local regions (use a clustering method)
    b. For any pair of objects, determine the most relevant region for the pair, and compute the pairwise distances using the local statistics of that region. For example, the local Mahalanobis distance may be used in each local region.
    Describe its similarities and differences with Isolation Kernel.
  1. Discuss how Isolation Kernel deals with the issues of data distribution described on pages 69-73 in Aggarwal, Chapter 3.

Chapter 4

  1. Explain the similarities and differences between the main classification methods.
  1. Explain why the divide-and-conquer algorithm used to generate a decision rule is called a greedy algorithm.
  1. The set-covering algorithm used to generate decision rules is sometimes called the separate-and-conquer algorithm. How does this differ conceptually from the divide-and-conquer algorithm?
  1. As the k-nearest neighbor algorithm does nothing in training, does it produce a generalized model that enables it to classify well on unseen data points?
  1. Explain the distinct features of SVM with respect to other classification methods.
  1. Which of the four classification algorithms require a distance/similarity function?
  1. Describe the two key steps in k-means clustering.
  1. Describe the two key steps in DBSCAN.
  1. Describe the three key steps in Spectral clustering.
  1. Describe the operational principle of
    1. Distance-based anomaly detector.
    2. Isolation-based anomaly detector.
    3. IDK-based anomaly detector.
  1. Is an Isolation tree a variant of decision tree? Explain your answer.
  1. Explain another possible way to perform isolation which is different from using isolation trees. Hint: the principle is to isolate one point from the rest of the points in a sample.
  1. Describe the important concepts in enumeration tree algorithms.
  1. The aim of an enumeration tree algorithm is to have a systematic exploration of the candidate patterns in a non-repetitive way.
    1. Why is “non-repetitive” an important criterion?
    2. Why is “systematic” an important criterion?
  1. Given a multidimensional dataset, you are consulted as a data mining consultant to explore ways to extract interesting information from this dataset. Explain which algorithm(s) you will use and why.
  1. Can we use the data mining algorithms we have studied for a spatial (or graph) dataset? Explain your answer.
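
The two key steps of k-means asked about above (assignment and centroid update) can be sketched as follows; the 2-D toy data and the fixed seed are illustrative assumptions, not from the text:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means on 2-D points, alternating the two key steps:
    (1) assign each point to its nearest centroid;
    (2) recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 1: assignment to the nearest centroid (squared distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[j].append(p)
        # Step 2: centroid update (skip empty clusters)
        for j, pts in enumerate(clusters):
            if pts:
                centroids[j] = (sum(p[0] for p in pts) / len(pts),
                                sum(p[1] for p in pts) / len(pts))
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the two groups around (0, 0) and (10, 10) are recovered after a few iterations, which is what the alternation of the two steps is designed to achieve.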

Chapter 5

  1. What is the insight and its relation with IDK?
  2. What are the assumptions of the insight?
  3. What are the advantages of using a distributional treatment for time series?
    A: There is no need to consider the alignment issue.
  4. Can you provide any disadvantages of using a distributional treatment for time series? Hint: relate to the assumption(s) of the insight.
  5. Describe the difference between the time domain approach and the R domain approach.
    A: Time domain: point-to-point distance, sliding window.
    A: R domain: i.i.d. assumption.
  6. What are the similarity and difference between group anomaly detection and anomalous subsequence detection?
  1. What else can be done with the proposed distributional treatment in time series?
  2. Discuss any other opportunities in using the distributional treatment. What is the key in any of these opportunities?

Chapter 6

  1. Describe the key difference between traditional measures of trajectories and the distributional measures.
  1. Explain the importance of the uniqueness property of a measure for trajectories.
  1. Describe how you would use a distributional kernel to perform clustering of a dataset of trajectories. Explain in what ways the suggested method is similar to and different from 𝒦AT (in detecting anomalous trajectories from a dataset of trajectories).
  1. Can any distributional measures (e.g., Wasserstein distance) be used in the same way as described in (3)? If not, describe how you would use such a distributional measure to perform clustering of a dataset of trajectories.
  2. Similarity of trajectories & sub-trajectory anomaly detection (aggregate).

Chapter 7

  1. Consider two graphs, which are cliques containing an even number 2·n of nodes. Let exactly half the nodes in each graph belong to labels A and B. What is the total number of isomorphic matchings between the two graphs?
  2. Consider two graphs, containing 2 · n nodes and n distinct labels, each occurring twice. What is the maximum number of isomorphic matchings between the two graphs?
  3. Compute the Morgan Indices of order 1 and 2, for each node of the Acetaminophen graph. How does the Morgan Index vary with the label (corresponding to chemical element)?
  4. In general, a kernel-based method is preferred over a distance-based method (e.g., in the context of classification). Explain why this is so.
  1. Describe how a single large neighborhood graph can be constructed from a dataset of graphs in order to use spectral clustering. (Hint: the first issue you need to resolve is: how to represent a graph.)
  1. Describe the difference between labeled (non-attributed) graphs and attributed graphs.
  2. Graph kernels can be categorized into two kinds of feature mapping: implicit and explicit. Which category does each of the Random-walk kernel, shortest path kernel and WL kernel belong to? Explain your answer in relation to K(G_i, G_j) = Φ(G_i) · Φ(G_j).
  1. Consider the three graph kernels and the generic transformational (GT) approach we have studied. Can these graph kernels be used with GT? Explain the reason why.
  2. Why is k-medoids used instead of k-means in the graph domain?
  1. Would kernel k-means be a valid method when a graph kernel is used to measure similarity between two graphs?
  1. Given what you have learned about graph kernels, explain how you will use them to perform data mining (i.e., anomaly detection, classification or clustering) in each of the following problems:
    1. A dataset of small graphs
    2. A large network
      Explain the necessary ingredients in order to perform a data mining task, e.g., which graph kernel (or any graph kernel), algorithm to use, etc.
  2. The graph kernels we have studied do not take the distribution of node vectors in an attributed network into consideration. Discuss how the distribution may be incorporated in the WL kernel.
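
For question 1 of this chapter, a brute-force check is possible for small n (illustrative only; it suggests the closed form n! · n!, since in a clique every pair of nodes is adjacent, so any label-preserving bijection is automatically an isomorphism):

```python
from itertools import permutations
from math import factorial

def clique_matchings(n):
    """Count label-preserving bijections between two cliques on 2n nodes,
    each with n nodes labeled 'A' and n labeled 'B'. All edges exist in a
    clique, so the edge constraint is vacuous and only labels matter."""
    labels = ['A'] * n + ['B'] * n
    return sum(
        all(labels[i] == labels[p[i]] for i in range(2 * n))
        for p in permutations(range(2 * n))
    )

# Matches n! * n! for the small cases we can enumerate
for n in (1, 2, 3):
    assert clique_matchings(n) == factorial(n) ** 2
```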

Chapter 8


  1. Explain what concept drift is. How does it relate to iid (independent and identically distributed)?
  1. Describe the difference between a data stream and a time series. How does concept drift in a data stream relate to time series?
  1. Out of the four constraints in data streams, explain which constraints the synopsis approach addresses.
  1. Explain how one can use microclusters to perform clustering.
  1. Specify whether k-means, k-medoids or DBSCAN can use microclusters to perform clustering.
  1. To deal with concept drift, a fixed length sliding window is often used. Discuss the merits and limitation of this approach.
  1. Discuss how you may use a sketch for anomaly detection in data streams.
  2. Discuss the disadvantage and advantage of using a sketch for anomaly detection.
  1. Explain the nature of the SENC (Classification under streaming emerging new classes) problem: is it a classification problem only? Explain your answer.
  1. Clustering is sometimes suggested as a method for outlier detection. Discuss the advantage and disadvantage of this method in comparison to outlier detection methods such as LOF (Local Outlier Factor) and Isolation Forest.
  1. The problem of streaming classification is especially challenging because of the impact of concept drift. One simple approach is to use a reservoir sample to create a concise representation of the training data. This concise representation can be used to create an offline model. If desired, a decay-based reservoir sample can be used to handle concept drift. Such an approach has the advantage that any conventional classification algorithm can be used since the challenges associated with the streaming paradigm have already been addressed at the sampling stage. Discuss the issues in using this approach.
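
The reservoir-based approach described in the last question can be sketched as follows (plain Algorithm R; a decay-based variant would bias the keep-probability toward recent items to handle concept drift):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: maintain a uniform random sample of size k over a
    stream of unknown length using O(k) memory. The retained sample can
    then be handed to any conventional (offline) classifier."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randrange(i + 1)   # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10000), k=100)
```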

Chapter 9

  1. Explain what a 2-way minimum cut problem is.
  1. Why is the balancing constraint more important in community detection algorithms, as compared to multidimensional clustering algorithms? What would the unconstrained minimum 2-way cut look like in a typical real network?
  1. Explain why the Kernighan-Lin algorithm is a local method. Does it produce balanced clusters?
  2. Explain why the Spectral Clustering is a global method. Does it produce balanced clusters?
  3. The last step of the Spectral Clustering procedure often employs k-means clustering. Explain why this is the clustering algorithm of choice (and not other clustering algorithms).
  4. Must we use k-means as the final step in the Spectral Clustering procedure? Explain the impact (if any) of using a different clustering algorithm.
  5. Describe the differences in using graph kernels to deal with a large graph versus a database of graphs.
  6. Describe the differences in using spectral clustering to deal with a large graph versus a database of graphs.
  7. One of the drawbacks of Kernighan-Lin is random initialization. Does spectral clustering have the same drawback?
  8. The term “spectral clustering” is a misnomer because the “spectral” component performs feature transformation only and a separate clustering algorithm must be used to perform the actual clustering. Discuss this statement (whether it is true or false, and explain your reasoning).
  9. Can we use graph kernels for collective classification? How can we do it?
  10. How does the graph kernel approach differ from the spectral embedding based approach for collective classification?

Chapter 10

  1. There are two types of web data. Name them and describe which type of data is used for (a) ranking systems, and (b) recommender systems.
  1. “Content information biases the recommendation towards items described by similar keywords to what the user has seen in the past. Collaborative filtering methods work directly with the utility matrix, and can therefore avoid such bias.”
    Discuss why the utility matrix can avoid such bias.
  1. Describe how you would use PageRank to measure the similarity between two nodes in a graph.
  1. In the bipartite graph on slide#24, what is the SimRank value between a user node and an item node? In this light, explain the weakness of the SimRank model. How would you measure node similarity in such cases?
  1. What is the advantage of using clustering methods in recommender systems compared with methods such as collaborative filtering and content-based recommendation?
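
For the question on PageRank-based node similarity, one common construction is random walk with restart (personalized PageRank): restart the walk at node i, and read off the stationary probability of node j as the similarity of j to i. A small pure-Python sketch (the 4-node path graph and the parameter values are illustrative assumptions):

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Random walk with restart: at each step, jump back to `source` with
    probability alpha, else follow a uniformly random out-edge. The
    stationary probability of node j serves as an (asymmetric) similarity
    of j to `source`."""
    n = len(adj)
    p = [0.0] * n
    p[source] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[source] = alpha            # restart mass
        for i, out in enumerate(adj):
            if out:
                share = (1 - alpha) * p[i] / len(out)
                for j in out:
                    nxt[j] += share
            else:
                nxt[source] += (1 - alpha) * p[i]   # dangling mass restarts
        p = nxt
    return p

# Tiny undirected path graph 0 - 1 - 2 - 3 as adjacency lists
adj = [[1], [0, 2], [1, 3], [2]]
scores = personalized_pagerank(adj, source=0)
# Nodes closer to the source get higher restart-walk probability
assert scores[1] > scores[2] > scores[3]
```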