Label data using semisupervised graphbased method
fitsemigraph
creates a semisupervised graphbased model
given labeled data, labels, and unlabeled data. The returned model contains the fitted labels
for the unlabeled data and the corresponding scores. This model can also predict labels for
unseen data using the predict
object function. For more information on
the different labeling algorithms, see Algorithms.
uses the labeled data in Mdl
= fitsemigraph(Tbl
,ResponseVarName
,UnlabeledTbl
)Tbl
, where
Tbl.ResponseVarName
contains the labels for the labeled data, and
returns fitted labels for the unlabeled data in UnlabeledTbl
. The
function stores the fitted labels and the corresponding scores in the
FittedLabels
and LabelScores
properties of the
object Mdl
, respectively.
uses Mdl
= fitsemigraph(Tbl
,formula
,UnlabeledTbl
)formula
to specify the response variable (vector of labels) and
the predictor variables to use among the variables in Tbl
. The
function uses these variables to label the data in
UnlabeledTbl
.
uses the predictor data in Mdl
= fitsemigraph(Tbl
,Y
,UnlabeledTbl
)Tbl
and the labels in
Y
to label the data in UnlabeledTbl
.
uses the predictor data in Mdl
= fitsemigraph(X
,Y
,UnlabeledX
)X
and the labels in Y
to label the data in UnlabeledX
.
specifies options using one or more namevalue pair arguments in addition to any of the
input argument combinations in previous syntaxes. For example, you can specify the
labeling method, number of iterations, and score threshold to use in the labeling
algorithm.Mdl
= fitsemigraph(___,Name,Value
)
Fit labels to unlabeled data by using a semisupervised graphbased method.
Randomly generate 60 observations of labeled data, with 20 observations in each of three classes.
rng('default') % For reproducibility labeledX = [randn(20,2)*0.25 + ones(20,2); randn(20,2)*0.25  ones(20,2); randn(20,2)*0.5]; Y = [ones(20,1); ones(20,1)*2; ones(20,1)*3];
Visualize the labeled data by using a scatter plot. Observations in the same class have the same color. Notice that the data is split into three clusters with very little overlap.
scatter(labeledX(:,1),labeledX(:,2),[],Y,'filled') title('Labeled Data')
Randomly generate 300 additional observations of unlabeled data, with 100 observations per class. For the purposes of validation, keep track of the true labels for the unlabeled data.
unlabeledX = [randn(100,2)*0.25 + ones(100,2); randn(100,2)*0.25  ones(100,2); randn(100,2)*0.5]; trueLabels = [ones(100,1); ones(100,1)*2; ones(100,1)*3];
Fit labels to the unlabeled data by using a semisupervised graphbased method. The function fitsemigraph
returns a SemiSupervisedGraphModel
object whose FittedLabels
property contains the fitted labels for the unlabeled data and whose LabelScores
property contains the associated label scores.
Mdl = fitsemigraph(labeledX,Y,unlabeledX)
Mdl = SemiSupervisedGraphModel with properties: FittedLabels: [300x1 double] LabelScores: [300x3 double] ClassNames: [1 2 3] ResponseName: 'Y' CategoricalPredictors: [] Method: 'labelpropagation' Properties, Methods
Visualize the fitted label results by using a scatter plot. Use the fitted labels to set the color of the observations, and use the maximum label scores to set the transparency of the observations. Observations with less transparency are labeled with greater confidence. Notice that observations that lie closer to the cluster boundaries are labeled with more uncertainty.
maxLabelScores = max(Mdl.LabelScores,[],2); rescaledScores = rescale(maxLabelScores,0.05,0.95); scatter(unlabeledX(:,1),unlabeledX(:,2),[],Mdl.FittedLabels,'filled', ... 'MarkerFaceAlpha','flat','AlphaData',rescaledScores); title('Fitted Labels for Unlabeled Data')
Determine the accuracy of the labeling by using the true labels for the unlabeled data.
numWrongLabels = sum(trueLabels ~= Mdl.FittedLabels)
numWrongLabels = 10
Only 10 of the 300 observations in unlabeledX
are mislabeled.
Fit labels to unlabeled data by using a semisupervised graphbased method. Specify the type of nearest neighbor graph.
Load the patients
data set. Create a table from the variables Distolic
, Gender
, and so on. For each observation, or row in the table, treat the Smoker
value as the label for that observation.
load patients
Tbl = table(Diastolic,Gender,Height,Systolic,Weight,Smoker);
Suppose only 20% of the observations are labeled. To recreate this scenario, randomly sample 20 labeled observations and store them in the table unlabeledTbl
. Remove the label from the rest of the observations and store them in the table unlabeledTbl
. To verify the accuracy of the label fitting at the end of the example, retain the true labels for the unlabeled data in the variable trueLabels
.
rng('default') % For reproducibility of the sampling [labeledTbl,Idx] = datasample(Tbl,20,'Replace',false); unlabeledTbl = Tbl; unlabeledTbl(Idx,:) = []; trueLabels = unlabeledTbl.Smoker; unlabeledTbl.Smoker = [];
Fit labels to the unlabeled data by using a semisupervised graphbased method. Use a mutual type of nearest neighbor graph, where two points are connected when they are nearest neighbors of each other. Specify to standardize the numeric predictors. The function fitsemigraph
returns an object whose FittedLabels
property contains the fitted labels for the unlabeled data.
Mdl = fitsemigraph(labeledTbl,'Smoker',unlabeledTbl,'KNNGraphType','mutual', ... 'Standardize',true); fittedLabels = Mdl.FittedLabels;
Identify the observations that are incorrectly labeled by comparing the stored true labels for the unlabeled data to the fitted labels returned by the semisupervised graphbased method.
wrongIdx = (trueLabels ~= fittedLabels); wrongTbl = unlabeledTbl(wrongIdx,:);
Visualize the fitted label results for the unlabeled data. Mislabeled observations are circled in the plot.
gscatter(unlabeledTbl.Diastolic,unlabeledTbl.Systolic, ... fittedLabels) hold on plot(wrongTbl.Diastolic,wrongTbl.Systolic, ... 'ko','MarkerSize',8) xlabel('Diastolic') ylabel('Systolic') legend('Nonsmoker','Smoker','Mislabeled') title('Fitted Labels for Unlabeled Data')
Tbl
— Labeled sample dataLabeled sample data, specified as a table. Each row of Tbl
corresponds to one observation, and each column corresponds to one predictor.
Optionally, Tbl
can contain one additional column for the response
variable (vector of labels). Multicolumn variables, cell arrays other than cell arrays
of character vectors, and ordinal categorical variables are not supported.
If Tbl
contains the response variable, and you want to use all
remaining variables in Tbl
as predictors, then specify the response
variable using ResponseVarName
.
If Tbl
contains the response variable, and you want to use only
a subset of the remaining variables in Tbl
as predictors, specify a
formula using formula
.
If Tbl
does not contain the response variable, specify a
response variable using Y
. The length of the response variable and
the number of rows in Tbl
must be equal.
Data Types: table
UnlabeledTbl
— Unlabeled sample dataUnlabeled sample data, specified as a table. Each row of
UnlabeledTbl
corresponds to one observation, and each column
corresponds to one predictor. UnlabeledTbl
must contain the same
predictors as those contained in Tbl
.
Data Types: table
ResponseVarName
— Response variable nameTbl
Response variable name, specified as the name of a variable in
Tbl
. The response variable contains the class labels for the
sample data in Tbl
.
You must specify ResponseVarName
as a character vector or string
scalar. For example, if the response variable Y
is stored as
Tbl.Y
, then specify it as 'Y'
. Otherwise, the
software treats all columns of Tbl
, including Y
,
as predictors.
The response variable must be a categorical, character, or string array, a logical
or numeric vector, or a cell array of character vectors. If Y
is a
character array, then each element of the response variable must correspond to one row
of the array.
A good practice is to specify the order of the classes by using the
ClassNames
namevalue pair argument.
Data Types: char
 string
formula
— Explanatory model of response variable and subset of predictor variablesExplanatory model of the response variable and a subset of the predictor variables,
specified as a character vector or string scalar in the form
'Y~X1+X2+X3'
. In this form, Y
represents the
response variable, and X1
, X2
, and
X3
represent the predictor variables.
To specify a subset of variables in Tbl
as predictors, use a
formula. If you specify a formula, then the software does not use any variables in
Tbl
that do not appear in formula
.
The variable names in the formula must be both variable names in Tbl
(Tbl.Properties.VariableNames
) and valid MATLAB^{®} identifiers. You can verify the variable names in Tbl
by
using the isvarname
function. If the variable names
are not valid, then you can convert them by using the matlab.lang.makeValidName
function.
Data Types: char
 string
Y
— Class labelsClass labels, specified as a numeric, categorical, or logical vector, a character or string array, or a cell array of character vectors.
If Y
is a character array, then each element of the class
labels must correspond to one row of the array.
The length of Y
must be equal to the number of rows in
Tbl
or X
.
A good practice is to specify the class order by using the
ClassNames
namevalue pair argument.
Data Types: single
 double
 categorical
 logical
 char
 string
 cell
X
— Labeled predictor dataLabeled predictor data, specified as a numeric matrix.
Each row of X
corresponds to one observation, and each column
corresponds to one predictor.
The length of Y
and the number of rows in
X
must be equal.
To specify the names of the predictors in the order of their appearance in
X
, use the PredictorNames
namevalue pair
argument.
Data Types: single
 double
UnlabeledX
— Unlabeled predictor dataUnlabeled predictor data, specified as a numeric matrix. Each row of
UnlabeledX
corresponds to one observation, and each column
corresponds to one predictor. UnlabeledX
must have the same
predictors as X
, in the same order.
Data Types: single
 double
Note
The software treats NaN
, empty character vector
(''
), empty string (""
),
<missing>
, and <undefined>
elements as
missing data. The software removes rows of the predictor data (observations) with missing
values.
Specify optional
commaseparated pairs of Name,Value
arguments. Name
is
the argument name and Value
is the corresponding value.
Name
must appear inside quotes. You can specify several name and value
pair arguments in any order as
Name1,Value1,...,NameN,ValueN
.
fitsemigraph(Tbl,'Y',UnlabeledTbl,'Method','labelspreading','IterationLimit',2e3)
specifies to use a label spreading labeling technique and run a maximum of 2000
iterations.Method
— Labeling technique'labelpropagation'
(default)  'labelpropagationexact'
 'labelspreading'
 'labelspreadingexact'
Labeling technique, specified as the commaseparated pair consisting of
'Method'
and one of these values.
Value  Description  MethodSpecific NameValue Pair Arguments 

'labelpropagation'  Iteratively propagate labels across the nodes in the similarity graph. For more information, see Label Propagation. 

'labelpropagationexact'  Use an exact formula to propagate labels. For more information, see Label Propagation.  None 
'labelspreading'  Iteratively spread labels across the nodes in the similarity graph. For more information, see Label Spreading. 

'labelspreadingexact'  Use an exact formula to spread labels. For more information, see Label Spreading. 

Example: 'Method','labelspreading'
Data Types: char
 string
Alpha
— Relative weight of neighboring labels to initial label0.01
(default)  scalar value in the range (0,1)Relative weight of neighboring labels to the initial label for labeled
observations in X
or Tbl
, specified as the
commaseparated pair consisting of 'Alpha'
and a scalar value in
the range (0,1). A value close to 0 indicates that fitsemigraph
treats labels of initially labeled observations almost like ground truth. A value
close to 1 indicates that fitsemigraph
treats labels of initially
labeled observations almost like noise.
Note
This argument is valid only when the Method
value is
'labelspreading'
or
'labelspreadingexact'
.
Example: 'Alpha',0.05
Data Types: single
 double
IterationLimit
— Maximum number of iterations1e3
(default)  positive integer scalarMaximum number of iterations, specified as the commaseparated pair consisting of
'IterationLimit'
and a positive integer scalar. The
fitsemigraph
function returns Mdl
, which
contains the fitted labels and scores, when this limit is reached, even if the
algorithm does not converge.
Note
This argument is valid only when the Method
value is
'labelpropagation'
or
'labelspreading'
.
Example: 'IterationLimit',2e3
Data Types: single
 double
Tolerance
— Tolerance for score absolute difference in subsequent iterations1e3
(default)  nonnegative scalarTolerance for score absolute difference in subsequent iterations, specified as the
commaseparated pair consisting of 'Tolerance'
and a nonnegative
scalar.
Note
This argument is valid only when the Method
value is
'labelpropagation'
or
'labelspreading'
.
Example: 'Tolerance',1e4
Data Types: single
 double
CategoricalPredictors
— Categorical predictors list'all'
Categorical predictors list, specified as one of the values in this table.
Value  Description 

Vector of positive integers 
Each entry in the vector is an index value indicating that the corresponding predictor is
categorical. The index values are between 1 and If 
Logical vector 
A 
Character matrix  Each row of the matrix is the name of a predictor variable. The names must match the entries in PredictorNames . Pad the names with extra blanks so each row of the character matrix has the same length. 
String array or cell array of character vectors  Each element in the array is the name of a predictor variable. The names must match the entries in PredictorNames . 
"all"  All predictors are categorical. 
By default, if the predictor data is in a table, fitsemigraph
assumes that a variable is categorical if it is a logical vector, categorical vector,
character array, string array, or cell array of character vectors. Ordinal categorical
variables are not supported. If the predictor data is a matrix,
fitsemigraph
assumes that all predictors are continuous. To
identify any other predictors as categorical predictors, specify them by using the
'CategoricalPredictors'
namevalue pair argument.
fitsemigraph
encodes categorical variables as numeric
variables by assigning a positive integer value to each category. When you use
categorical predictors, ensure that you use an appropriate distance metric
(Distance
).
Example: 'CategoricalPredictors','all'
Data Types: single
 double
 logical
 char
 string
 cell
ClassNames
— Names of classes to use for labelingNames of the classes to use for labeling, specified as the commaseparated pair
consisting of 'ClassNames'
and a categorical, character, or string
array, a logical or numeric vector, or a cell array of character vectors.
ClassNames
must have the same data type as
Y
.
If ClassNames
is a character array, then each element must
correspond to one row of the array.
Use 'ClassNames'
to:
Order the classes.
Specify the order of any input or output argument dimension that corresponds
to the class order. For example, use 'ClassNames'
to specify
the column order of classification scores in
Mdl.LabelScores
.
Select a subset of classes for labeling. For example, suppose that the set
of all distinct class names in Y
is
{'a','b','c'}
. To use observations from classes
'a'
and 'c'
only, specify
'ClassNames',{'a','c'}
.
The default value for ClassNames
is the set of all distinct
class names in Y
.
Example: 'ClassNames',{'b','g'}
Data Types: categorical
 char
 string
 logical
 single
 double
 cell
PredictorNames
— Predictor variable namesPredictor variable names, specified as the commaseparated pair consisting of
'PredictorNames'
and a string array of unique names or cell array
of unique character vectors. The functionality of 'PredictorNames'
depends on the way you supply the predictor data.
If you supply X
, Y
, and
UnlabeledX
, then you can use
'PredictorNames'
to assign names to the predictor variables
in X
and UnlabeledX
.
The order of the names in PredictorNames
must
correspond to the column order of X
. That is,
PredictorNames{1}
is the name of
X(:,1)
, PredictorNames{2}
is the
name of X(:,2)
, and so on. Also,
size(X,2)
and
numel(PredictorNames)
must be equal.
By default, PredictorNames
is
{'x1','x2',...}
.
If you supply Tbl
and
UnlabeledTbl
, then you can use
'PredictorNames'
to choose which predictor variables to
use. That is, fitsemigraph
uses only the predictor variables
in PredictorNames
and the response variable to label the
unlabeled data.
PredictorNames
must be a subset of
Tbl.Properties.VariableNames
and cannot include the
name of the response variable.
By default, PredictorNames
contains the names of
all predictor variables.
A good practice is to specify the predictors using either
'PredictorNames'
or formula
, but
not both.
Example: 'PredictorNames',{'SepalLength','SepalWidth','PetalLength','PetalWidth'}
Data Types: string
 cell
ResponseName
— Response variable name'Y'
(default)  character vector  string scalarResponse variable name, specified as the commaseparated pair consisting of
'ResponseName'
and a character vector or string scalar.
If you supply Y
, then you can use
'ResponseName'
to specify a name for the response
variable.
If you supply ResponseVarName
or
formula
, then you cannot use
'ResponseName'
.
Example: 'ResponseName','response'
Data Types: char
 string
Standardize
— Flag to standardize predictor datafalse
or 0
(default)  true
or 1
Flag to standardize the predictor data, specified as the commaseparated pair
consisting of 'Standardize'
and a numeric or logical
0
(false
) or 1
(true
). If you set 'Standardize',true
, the
software combines the labeled and unlabeled predictor data, and then centers and
scales each numeric predictor variable by the corresponding column mean and standard
deviation. The software does not standardize the categorical predictors.
Example: 'Standardize',true
Data Types: double
 logical
Distance
— Distance metricDistance metric, specified as the commaseparated pair consisting of
'Distance'
and a character vector or string scalar.
If all the predictor variables are continuous (numeric) variables, then you can specify one of these distance metrics.
Value  Description 

'euclidean'  Euclidean distance 
'seuclidean'  Standardized Euclidean distance — Each coordinate difference
between observations is scaled by dividing by the corresponding
element of the standard deviation 
'mahalanobis'  Mahalanobis distance — By default, the distance is computed
using 
'minkowski'  Minkowski distance — The default exponent is 2. To specify a
different exponent, use the 
'chebychev'  Chebychev distance (maximum coordinate difference) 
'cityblock'  City block distance 
'correlation'  One minus the sample correlation between observations (treated as sequences of values) 
'cosine'  One minus the cosine of the included angle between observations (treated as vectors) 
'spearman'  One minus the sample Spearman's rank correlation between observations (treated as sequences of values) 
Note
If you specify one of these distance metrics and the predictor data includes categorical predictors, then the software treats each categorical predictor as a numeric variable, with each category represented by a positive integer.
If all the predictor variables are categorical variables, then you can specify one of these distance metrics.
Value  Description 

'hamming'  Hamming distance, which is the percentage of coordinates that differ 
'jaccard'  One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ 
Note
If you specify one of these distance metrics and the predictor data includes continuous (numeric) predictors, then the software treats each continuous predictor as a categorical variable.
If the predictor variables are a mix of continuous (numeric) and categorical variables, then you can specify one of these distance metrics.
Value  Description 

'goodall3'  Modified Goodall distance 
'ofd'  Occurrence frequency distance 
The default value is 'euclidean'
if all the predictor variables
are continuous, and 'goodall3'
if any of the predictor variables
are categorical. For more information on the various distance metrics, see Distance Metrics.
Example: 'Distance','ofd'
Data Types: char
 string
Scale
— Scale parameter for standardized Euclidean distance metricScale parameter for the standardized Euclidean distance metric, specified as the
commaseparated pair consisting of 'Scale'
and a nonnegative
vector. The length of Scale
is equal to the number of predictors.
Each coordinate difference between two observations is scaled by the corresponding
element of Scale
.
The default scale parameter is std(PD,'omitnan')
, where
PD
is the predictor data, both labeled and unlabeled.
Note
This argument is valid only if Distance
is
'seuclidean'
.
Example: 'Scale',iqr(X)
Data Types: single
 double
Cov
— Covariance matrix for Mahalanobis distance metricCovariance matrix for the Mahalanobis distance metric, specified as the
commaseparated pair consisting of 'Cov'
and a
pbyp positive definite matrix, where
p is the number of predictors.
The default covariance matrix is cov(PD,'omitrows')
, where
PD
is the predictor data, both labeled and unlabeled.
Note
This argument is valid only if Distance
is
'mahalanobis'
.
Example: 'Cov',eye(3)
Data Types: single
 double
P
— Exponent for Minkowski distance metric2
(default)  positive scalarExponent for the Minkowski distance metric, specified as the commaseparated pair
consisting of 'P'
and a positive scalar.
Note
This argument is valid only if Distance
is
'minkowski'
.
Example: 'P',3
Data Types: single
 double
SimilarityGraph
— Type of similarity graph'knn'
(default)  'epsilon'
Type of similarity graph used in the labeling algorithm, specified as the
commaseparated pair consisting of 'SimilarityGraph'
and one of
these values.
Value  Description  GraphSpecific NameValue Pair Arguments 

'knn'  Construct the graph using nearest neighbors. 

'epsilon'  Construct the graph by using radius search. You must specify a value
for Radius if you use this option.  'Radius' — Search radius for the nearest neighbors
used to construct the similarity graph 
For more information, see Similarity Graph.
Example: 'SimilarityGraph','epsilon','Radius',2
Data Types: char
 string
NumNeighbors
— Number of nearest neighborsNumber of nearest neighbors used to construct the similarity graph, specified as
the commaseparated pair consisting of 'NumNeighbors'
and a
positive integer scalar.
The default number of neighbors is log(n)
, where
n
is the number of observations in the predictor data, both
labeled and unlabeled.
Note
This argument is valid only if SimilarityGraph
is
'knn'
.
Example: 'NumNeighbors',10
Data Types: single
 double
KNNGraphType
— Type of nearest neighbor graph'complete'
(default)  'mutual'
Type of nearest neighbor graph, specified as the commaseparated pair consisting
of 'KNNGraphType'
and one of these values.
Value  Description 

'complete'  Connects two points i and j, when either i is a nearest neighbor of j or j is a nearest neighbor of i. This option leads to a denser representation of the similarity matrix. 
'mutual'  Connects two points i and j, when i is a nearest neighbor of j and j is a nearest neighbor of i. This option leads to a sparser representation of the similarity matrix. 
Note
This argument is valid only if SimilarityGraph
is
'knn'
.
Example: 'KNNGraphType','mutual'
Data Types: char
 string
Radius
— Search radiusSearch radius for the nearest neighbors used to construct the similarity graph,
specified as the commaseparated pair consisting of 'Radius'
and a
nonnegative scalar.
Note
You must specify this argument if SimilarityGraph
is
'epsilon'
.
Example: 'Radius',5
Data Types: single
 double
KernelScale
— Scale factor1
(default)  'auto'
 positive scalarScale factor for the kernel, specified as the commaseparated pair consisting of
'KernelScale'
and 'auto'
or a positive scalar.
The software uses the scale factor to transform distances to similarity
measures.
The 'auto'
option is supported only for the
'euclidean'
and 'seuclidean'
distance
metrics.
If you specify 'auto'
, then the software selects an
appropriate scale factor using a heuristic procedure. This heuristic procedure
uses subsampling, so estimates can vary from one call to another. To reproduce
results, set a random number seed using rng
before calling
fitsemigraph
.
Example: 'KernelScale','auto'
Data Types: single
 double
 char
 string
Mdl
— Semisupervised graphbased classifierSemiSupervisedGraphModel
objectSemisupervised graphbased classifier, returned as a SemiSupervisedGraphModel
object. Use dot notation to access the object
properties. For example, to get the fitted labels for the unlabeled data and their
corresponding scores, enter Mdl.FittedLabels
and
Mdl.LabelScores
, respectively.
A distance metric is a function that defines a distance between two
observations. fitsemigraph
supports various distance metrics for
continuous (numeric) predictors, categorical predictors, and a mix of continuous and
categorical predictors.
Given an mxbyn data matrix X, which is treated as mx (1byn) row vectors x_{1}, x_{2}, ..., x_{mx}, and an mybyn data matrix Y, which is treated as my (1byn) row vectors y_{1}, y_{2}, ...,y_{my}, the various distances between the vector x_{s} and y_{t} are defined as follows:
Distance metrics for continuous (numeric) variables
Euclidean distance
$${d}_{st}^{2}=({x}_{s}{y}_{t})({x}_{s}{y}_{t}{)}^{\prime}.$$
The Euclidean distance is a special case of the Minkowski distance, where p = 2.
Standardized Euclidean distance
$${d}_{st}^{2}=({x}_{s}{y}_{t}){V}^{1}({x}_{s}{y}_{t}{)}^{\prime},$$
where V is the nbyn diagonal matrix whose jth diagonal element is (S(j))^{2}, where S is a vector of scaling factors for each dimension.
Mahalanobis distance
$${d}_{st}^{2}=({x}_{s}{y}_{t}){C}^{1}({x}_{s}{y}_{t}{)}^{\prime},$$
where C is the covariance matrix.
Minkowski distance
$${d}_{st}=\sqrt[p]{{\displaystyle \sum _{j=1}^{n}{\left{x}_{sj}{y}_{tj}\right}^{p}}}.$$
For the special case of p = 1, the Minkowski distance gives the city block distance. For the special case of p = 2, the Minkowski distance gives the Euclidean distance. For the special case of p = ∞, the Minkowski distance gives the Chebychev distance.
Chebychev distance
$${d}_{st}={\mathrm{max}}_{j}\left\{\left{x}_{sj}{y}_{tj}\right\right\}.$$
The Chebychev distance is a special case of the Minkowski distance, where p = ∞.
City block distance
$${d}_{st}={\displaystyle \sum _{j=1}^{n}\left{x}_{sj}{y}_{tj}\right}.$$
The city block distance is a special case of the Minkowski distance, where p = 1.
Correlation distance
$${d}_{st}=1\frac{\left({x}_{s}{\overline{x}}_{s}\right){\left({y}_{t}{\overline{y}}_{t}\right)}^{\prime}}{\sqrt{\left({x}_{s}{\overline{x}}_{s}\right){\left({x}_{s}{\overline{x}}_{s}\right)}^{\prime}}\sqrt{\left({y}_{t}{\overline{y}}_{t}\right){\left({y}_{t}{\overline{y}}_{t}\right)}^{\prime}}},$$
where
$${\overline{x}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{x}_{sj}}$$
and
$${\overline{y}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{y}_{tj}}.$$
Cosine distance
$${d}_{st}=\left(1\frac{{x}_{s}{{y}^{\prime}}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime}}_{s}\right)\left({y}_{t}{{y}^{\prime}}_{t}\right)}}\right).$$
Spearman distance
$${d}_{st}=1\frac{\left({r}_{s}{\overline{r}}_{s}\right){\left({r}_{t}{\overline{r}}_{t}\right)}^{\prime}}{\sqrt{\left({r}_{s}{\overline{r}}_{s}\right){\left({r}_{s}{\overline{r}}_{s}\right)}^{\prime}}\sqrt{\left({r}_{t}{\overline{r}}_{t}\right){\left({r}_{t}{\overline{r}}_{t}\right)}^{\prime}}},$$
where
r_{sj} is the rank of x_{sj} taken over x_{1j}, x_{2j}, ...x_{mx,j}, as computed by tiedrank
.
r_{tj} is the rank of y_{tj} taken over y_{1j}, y_{2j}, ...y_{my,j}, as computed by tiedrank
.
r_{s} and r_{t} are the coordinatewise rank vectors of x_{s} and y_{t}, that is, r_{s} = (r_{s}_{1}, r_{s}_{2}, ... r_{sn}) and r_{t} = (r_{t1}, r_{t2}, ... r_{tn}).
$${\overline{r}}_{s}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{sj}}=\frac{\left(n+1\right)}{2}$$.
$${\overline{r}}_{t}=\frac{1}{n}{\displaystyle \sum _{j}{r}_{tj}}=\frac{\left(n+1\right)}{2}$$.
Distance metrics for categorical variables
Hamming distance
$${d}_{st}=(\#({x}_{sj}\ne {y}_{tj})/n).$$
Jaccard distance
$${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {y}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right]}.$$
Distance metrics for a mix of continuous (numeric) and categorical variables
Modified Goodall distance
This distance is a variant of the Goodall distance, which assigns a small distance if the matching values are infrequent regardless of the frequencies of the other values. For mismatches, the distance contribution of the predictor is 1/(number of variables).
Occurrence frequency distance
For a match, the occurrence frequency distance assigns zero distance. For a mismatch, the occurrence frequency distance assigns a higher distance on a less frequent value and a lower distance on a more frequent value.
A similarity graph models the local neighborhood relationships between observations in the predictor data, both labeled and unlabeled, as an undirected graph. The nodes in the graph represent observations, and the edges, which are directionless, represent the connections between the observations.
If the pairwise distance Dist_{i,j} between any two nodes i and j is positive (or larger than a certain threshold), then the similarity graph connects the two nodes using an edge. The edge between the two nodes is weighted by the pairwise similarity S_{i,j}, where $${S}_{i,j}=\mathrm{exp}\left({\left(\frac{Dis{t}_{i,j}}{\sigma}\right)}^{2}\right)$$, for a specified kernel scale σ value.
fitsemigraph
supports these two methods of constructing a similarity graph:
Nearest neighbor method (if
SimilarityGraph
is 'knn'
(default)):
fitsemigraph
connects observations in the predictor data, both
labeled and unlabeled, that are nearest neighbors.
Use the 'NumNeighbors'
namevalue pair argument to
specify the number of nearest neighbors.
Use the 'KNNGraphType'
namevalue pair argument to
specify whether to make a 'complete'
or
'mutual'
connection of points.
Radius search method (if SimilarityGraph
is 'epsilon'
): fitsemigraph
connects
observations whose pairwise distances are smaller than the specified search radius.
You must specify the search radius for nearest neighbors used to construct the
similarity graph by using the 'Radius'
namevalue pair
argument.
A similarity matrix is a matrix representation of a similarity graph. The nbyn matrix $$S={({S}_{i,j})}_{i,j=1,\text{\hspace{0.17em}}\dots ,\text{\hspace{0.17em}}n}$$ contains pairwise similarity values between connected nodes in the similarity graph. The similarity matrix of a graph is also called an adjacency matrix.
The similarity matrix is symmetric because the edges of the similarity graph are
directionless. A value of S_{i,j} =
0
means that nodes i and j of the
similarity graph are not connected.
The software constructs a similarity graph (SimilarityGraph
) with
both labeled and unlabeled observations as nodes, and distributes label information from
labeled observations to unlabeled observations by using either label propagation or label
spreading.
To propagate labels across the nodes in the similarity graph, the iterative label
propagation algorithm (where Method
is
'labelpropagation'
) follows these steps:
Initialize an nbyK matrix F(0), where n is the number of nodes (or observations) and K is the number of classes.
The first l rows correspond to labeled observations. Each row contains a 1 in the column corresponding to the true class label for that observation, and a 0 in every other column.
The last u rows correspond to the unlabeled observations, and contain a 0 in all columns.
At iteration t (starting with t = 1), update the F matrix by using the probabilistic transition matrix P, so that $$F(t)=PF(t1)$$, where $${P}_{i,j}=\frac{{S}_{i,j}}{{\displaystyle \sum _{k=1}^{n}{S}_{i,k}}}$$.
P_{i,j} is the probability of transmitting label information from node i to node j.
S_{i,j} is the weight of the edge between node i and node j. For the definition of S_{i,j}, see Similarity Graph.
To complete iteration t, clamp the labels for the labeled observations. That is, keep the first l rows of F(t) equal to their initial values in F(0).
Repeat the second and third steps until the F values converge.
You can use the IterationLimit
and Tolerance
values to control the convergence.
The final F matrix corresponds to the scores for the labeled
data and the unlabeled data (LabelScores
). For each observation, or row in F, the
column with the maximum score corresponds to the fitted class label (FittedLabels
).
For more details, see [2].
Instead of using an iterative label propagation algorithm, you can use an exact method
for label propagation (where Method
is
'labelpropagationexact'
). In this case, the
ubyK matrix of label information for the unlabeled
data is F_{U} = (I –
P_{UU})^{1}P_{UL}F_{L} where:
I is the identity matrix.
P_{UU} and P_{UL} are the labeled (L) and unlabeled (U) submatrices of P such that $$P=\left[\begin{array}{cc}{P}_{LL}& {P}_{LU}\\ {P}_{UL}& {P}_{UU}\end{array}\right]$$.
F_{L} is the lbyK matrix of label information for the labeled data. Each row contains a 1 in the column corresponding to the true class label for that observation, and a 0 in every other column.
The F_{U} matrix corresponds to the
scores for the unlabeled data (LabelScores
). For
each observation, or row in F_{U}, the column with
the maximum score corresponds to the fitted class label (FittedLabels
).
For more details, see [3].
To spread labels across the nodes in the similarity graph, the iterative spreading
propagation algorithm (where Method
is
'labelspreading'
) follows these steps:
Create an nbyK matrix Y, where n is the number of nodes (or observations) and K is the number of classes.
The first l rows correspond to labeled observations. Each row contains a 1 in the column corresponding to the true class label for that observation, and a 0 in every other column.
The last u rows correspond to the unlabeled observations, and contain a 0 in all columns.
Create a matrix A that is a normalized version of the nbyn similarity matrix S, with pairwise similarity values S_{i,j} as defined in Similarity Graph. Let A = D^{1/2}SD^{1/2}, where D is the nbyn diagonal matrix $$D=\left[\begin{array}{ccc}{\displaystyle \sum _{j=1}^{n}{S}_{1,n}}& \cdots & 0\\ \vdots & \ddots & \vdots \\ 0& \dots & {\displaystyle \sum _{j=1}^{n}{S}_{n,n}}\end{array}\right]$$.
At iteration t (starting with t = 1), update the F matrix by using the matrix
A and the neighboring label weight parameter α
(Alpha
), so that F(t) =
αAF(t
– 1) + (1 – α)Y. Let F(0) equal Y.
Repeat the third step until the F values converge. You can use
the IterationLimit
and Tolerance
values to
control the convergence.
Take the limit of the sequence $${\{F(t)\}}_{t=1,\mathrm{..},T}$$. This final matrix corresponds to the scores for the labeled data
and the unlabeled data (LabelScores
). For each observation, or row in the matrix, the column with
the maximum score corresponds to the fitted class label (FittedLabels
).
Instead of using an iterative label spreading algorithm, you can use an exact method for
label spreading (where Method
is
'labelspreadingexact'
). In this case, the
nbyK matrix of label information for the labeled
and unlabeled data is F = (I –
αA)^{1}Y, where I is the identity matrix. The
F matrix corresponds to the scores for the labeled data and the
unlabeled data (LabelScores
). For
each observation, or row in F, the column with the maximum score
corresponds to the fitted class label (FittedLabels
).
For more details, see [1].
[1] Zhou, Dengyong, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. “Learning with Local and Global Consistency.” Advances in Neural Information Processing Systems 16 (NIPS). 2003.
[2] Zhu, Xiaojin, and Zoubin Ghahramani. “Learning from Labeled and Unlabeled Data with Label Propagation.” CMU CALD tech report CMUCALD02107. 2002.
[3] Zhu, Xiaojin, Zoubin Ghahramani, and John Lafferty. “SemiSupervised Learning Using Gaussian Fields and Harmonic Functions.” The Twentieth International Conference on Machine Learning (ICML). 2003.
You have a modified version of this example. Do you want to open this example with your edits?
You clicked a link that corresponds to this MATLAB command:
Run the command by entering it in the MATLAB Command Window. Web browsers do not support MATLAB commands.
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
Select web siteYou can also select a web site from the following list:
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.