1 Introduction and Preliminaries
Graph matching (GM) refers to establishing node correspondences between two or among multiple graphs. Graph matching incorporates both the unary similarity between nodes and the pairwise [7, 14] (or even higher-order [21, 29, 43]) similarity between edges from separate graphs, to find a matching such that the similarity between the matched graphs is maximized. By encoding high-order geometric information in the matching procedure, graph matching is in general more robust to deformation noise and outliers. For its expressiveness and robustness, graph matching lies at the heart of many computer vision applications, e.g. visual tracking, action recognition, robotics, and weak-perspective 3D reconstruction; refer to [40] for a more comprehensive survey of graph matching applications.
Due to its high-order combinatorial nature, graph matching is in general NP-complete [13], such that researchers employ approximate techniques to seek inexact solutions. For the classic setting of two-graph matching between graphs $\mathcal{G}_1 = (V_1, E_1)$ and $\mathcal{G}_2 = (V_2, E_2)$ with $|V_1| = n_1$, $|V_2| = n_2$, the problem can be written as the following general quadratic assignment programming (QAP) problem [25]:
$$\max_{\mathbf{X}} \; J(\mathbf{X}) = \mathrm{vec}(\mathbf{X})^\top \mathbf{K}\, \mathrm{vec}(\mathbf{X}) \quad \text{s.t.} \; \mathbf{X} \in \{0,1\}^{n_1 \times n_2},\; \mathbf{X}\mathbf{1} = \mathbf{1},\; \mathbf{X}^\top \mathbf{1} \le \mathbf{1} \tag{1}$$
where $\mathbf{X}$ is a (partial) permutation matrix indicating the node correspondence, and $\mathbf{K} \in \mathbb{R}^{n_1 n_2 \times n_1 n_2}$ is the so-called affinity matrix [22], whose diagonal elements and off-diagonal elements encode the node-to-node and edge-to-edge affinities between the two graphs, respectively. One popular embodiment of $\mathbf{K}$ in the literature is $\mathbf{K}_{ia,jb} = \exp\big(-\|\mathbf{f}_{ij} - \mathbf{f}_{ab}\|^2 / \sigma^2\big)$, where $\mathbf{f}_{ij}$ is the feature vector of edge $(i, j)$; it can also incorporate the node similarity when the node indices coincide, i.e. $i = j$ and $a = b$.
Eq. (1) is called Lawler's QAP [20]. It can incorporate other forms, e.g. Koopmans-Beckmann's QAP [25]:
$$\max_{\mathbf{X}} \; J(\mathbf{X}) = \mathrm{tr}\big(\mathbf{X}^\top \mathbf{F}_1 \mathbf{X} \mathbf{F}_2\big) + \mathrm{tr}\big(\mathbf{K}_p^\top \mathbf{X}\big) \tag{2}$$
where $\mathbf{F}_1$, $\mathbf{F}_2$ are the weighted adjacency matrices of graphs $\mathcal{G}_1$, $\mathcal{G}_2$, respectively, and $\mathbf{K}_p$ is the node-to-node affinity matrix. Its connection to Lawler's QAP can be established by setting $\mathbf{K} = \mathbf{F}_2 \otimes \mathbf{F}_1$.
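To make this connection concrete, the following toy snippet (our own sketch, not from the paper; it assumes a recent PyTorch providing `torch.kron` and ignores the linear term $\mathrm{tr}(\mathbf{K}_p^\top \mathbf{X})$) numerically verifies that Lawler's objective with $\mathbf{K} = \mathbf{F}_2 \otimes \mathbf{F}_1$ reproduces the Koopmans-Beckmann score:

```python
import torch

# Toy check (ours): with K = F2 kron F1, Lawler's vec(X)^T K vec(X)
# equals the Koopmans-Beckmann score tr(X^T F1 X F2).
n = 5
F1 = torch.rand(n, n); F1 = (F1 + F1.T) / 2   # weighted adjacency of G1
F2 = torch.rand(n, n); F2 = (F2 + F2.T) / 2   # weighted adjacency of G2
X = torch.eye(n)[torch.randperm(n)]           # a random permutation matrix

K = torch.kron(F2, F1)                        # Lawler affinity matrix
vec_x = X.T.reshape(-1)                       # column-major vec(X)
lawler = vec_x @ K @ vec_x
kb = torch.trace(X.T @ F1 @ X @ F2)
print(torch.allclose(lawler, kb))             # True (up to float error)
```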
Beyond second-order affinity modeling, recent methods also explore ways of utilizing higher-order affinity information, based on the tensor marginalization adopted by several hypergraph matching works [5, 9, 43, 46]:
$$\max_{\mathbf{x}} \; \mathbf{H} \otimes_1 \mathbf{x} \otimes_2 \mathbf{x} \cdots \otimes_d \mathbf{x}, \quad \text{s.t.} \; \mathbf{x} = \mathrm{vec}(\mathbf{X}) \tag{3}$$
where $d$ is the affinity order and $\mathbf{H}$ is the $d$-th order affinity tensor, whose elements encode the affinity between two hyperedges from the graphs; $\otimes_k$ is the tensor product along mode $k$ [21]. Readers are referred to Sec. 3.1 in [9] for details of tensor multiplication. The above works all assume that the affinity tensor is invariant w.r.t. the index of the hyperedge pairs.
All the above studies show the generality and importance of the affinity model, for both two-graph matching and multi-graph matching. However, traditional methods mostly rely on a predefined affinity function (or distance), e.g. a Gaussian kernel with Euclidean distance in the node and edge feature space. We believe that such a predefined parametric affinity model has limited flexibility for capturing the structure of a real-world matching task, where the affinity metric can be arbitrary and calls for models with sufficiently high capacity to approximate it. This challenge is more pronounced in the presence of noise and outliers, which are ubiquitous in practical settings. With an inappropriate affinity model, the matching solver struggles further, since the global optimum with respect to the affinity model may not even correspond to the ground truth matching solution, owing to the biased affinity objective that serves as input to the combinatorial optimization.
Hence effective affinity modeling across graphs is called for. This direction is orthogonal to the major line of previous efforts on devising combinatorial solvers with predefined affinity models [7, 9, 14, 21]. The contributions of this paper are:
i) We develop a novel supervised deep network based pipeline for graph matching, whereby the objective involves a permutation loss based on a Sinkhorn net, rather than the structured max-margin loss [6] or the pixel offset loss [45]. We argue that the permutation loss is a more inherent choice for the combinatorial nature of graph matching (by relaxing it into linear assignment). Meanwhile, the permutation loss allows for flexible handling of an arbitrary number of nodes per graph. In contrast, the number of nodes for matching is fixed and predefined in the problem structure of [6]. To the best of our knowledge, this is the first time a permutation loss is adopted for learning graph matching, a natural choice given its combinatorial nature.
ii) Our graph matching nets learn the node-wise features (extracted from images in this paper) and the implicit structure information (including hyperedges) by employing a graph convolutional network, together with a node-to-node cross-graph affinity function realized by additional layers. As such, the intra-graph information and the cross-graph affinity are jointly learned given ground truth correspondences. Our network embeds both the node (image patch) feature and the structure into node-wise vectors, and the node-to-node affinity layers are shared among all nodes. Such a design also allows for different numbers of nodes in different graph pairs for training and testing. To the best of our knowledge, this is the first time a graph neural network is adopted for learning graph matching (at least in computer vision).
iii) Experimental results including ablation studies show the effectiveness of our devised components, including the permutation loss, the node-wise feature extraction layers, the graph convolutional network based node embedding, and the cross-graph affinity component. In particular, our method outperforms the deep learning peer method [45] in matching accuracy. Our method also outperforms [6] in accuracy while being more flexible, as the method in [6] requires a constant number of nodes for matching in both training and testing sets. We also show the learning capability of our approach even when the training set and test set are from different object categories, where it again outperforms [45].
2 Related Work
This paper focuses on the learning of graph matching. Readers are referred to [42] for a more comprehensive review.
2.1 Modeling and Learning Affinity
Recently, a number of studies have presented various techniques for affinity function learning. Based on the extent to which ground truth correspondence information is used for training, methods are either unsupervised [23], semi-supervised [24], or supervised [4, 6, 45].
Previous graph matching affinity learning methods are mostly based on simple and shallow parametric models, which use popular distances (typically a weighted Euclidean distance) in the node and edge feature space plus a similarity kernel function (e.g. a Gaussian kernel) to derive the final affinity score. In particular, a unified (shallow) parametric graph structure learning model between two graphs is devised in vector form in [6]. The authors of [6] observe that this simple model can incorporate most previous shallow learning models, including [4, 23, 39]; therefore, this method is compared in our experiments.
There is a seminal work [45] presenting a method for adopting deep neural networks to learn the affinity matrix for graph matching. However, as will be shown later in this paper, their pixel-offset based loss function does not fit well with the combinatorial nature of matching, which we address in this paper. In addition, node embedding is not considered in [45], although it has been shown to effectively capture the local structure of a node and can go beyond second order for more effective affinity modeling.
2.2 Graph Neural Networks and Embedding
Deep neural networks have proven effective on spatial and sequential data, with CNNs and RNNs respectively. Recently, a number of techniques have emerged for extracting high-order node embeddings via deep networks, whose input, i.e. a graph, is non-Euclidean data. Specifically, graph neural networks (GNN) [34] have been proposed, whereby node features are aggregated from adjacent neighbors and different nodes can share the same transfer function. The output of a GNN is invariant to permutations of graph elements. Many variants of GNN have been developed since [34], as comprehensively discussed in [48]. In particular, the SDNE model [41] performs deep node embedding by exploiting the first-order and second-order proximity jointly. Differing from the above deep embedding models, there are also shallow embedding models which are scalable to large networks, including DeepWalk [32] based on random walks and node2vec [15] inspired by the skip-gram language model [28]. In particular, LINE [38] explicitly defines first-order and second-order proximities and builds heuristic models for the two proximities. However, these methods, including the SDNE model, cannot be used for end-to-end learning of graph matching. For this reason, we adopt the Graph Convolutional Network (GCN) framework [17] to aggregate graph structure information, whose parameters can be learned by back-propagation for graph matching.
2.3 Learning of Combinatorial Optimization
Graph matching bears a combinatorial nature. There is an emerging thread of work that uses learning to seek efficient solutions, especially with deep networks. In [16], the well-known NP-hard problem of coloring very large graphs is addressed using deep reinforcement learning, and the resulting algorithm learns new state-of-the-art heuristics for graph coloring. The Travelling Salesman Problem (TSP) is studied in [18], where the authors propose a graph attention network based method which learns a heuristic algorithm employing a neural network policy to find a tour. Deep learning on node sets is also explored in [44], which seeks permutation-invariant objective functions over a set of nodes.
In particular, [30] shows a network based approach for solving the quadratic assignment problem. Their work focuses on learning the solver given a previously defined affinity matrix. In contrast, this paper presents an end-to-end pipeline for learning the affinity function; in this sense, the two methods can be further integrated for practical applications. Moreover, for the less challenging linear assignment problem, which can in fact be solved with polynomial complexity, e.g. by the Hungarian algorithm [19], network based methods have also been proposed recently. The Sinkhorn Network [1] is developed for learning linear assignment given a predefined assignment cost, and is designed to enforce a doubly-stochastic constraint on any non-negative square matrix. It has been shown that the Sinkhorn algorithm [37] is an approximate and differentiable version of the Hungarian algorithm [26]. More recently, the Sinkhorn AutoEncoder is proposed in [31] to minimize the Wasserstein distance in AutoEncoders, and the work [10] adopts reinforcement learning to learn a linear assignment solver. The Sinkhorn layer is also adopted on top of a deep convolutional network in DeepPermNet [33], which solves a permutation prediction problem. However, DeepPermNet is not invariant to input permutations and needs a predefined node permutation as reference; it is thus unstable for two-graph matching.
In comparison, our model consists of an affinity learning component which encodes the structure affinity into node-wise embeddings. As such, graph matching is relaxed into linear assignment solved by the Sinkhorn layer, which is also sometimes called permutation learning in the literature.
Table 1: Main symbols used in this paper.
$I_s$  input image $s$
$n$  number of nodes in one graph
$\mathbf{A}_s$  adjacency matrix of graph $\mathcal{G}_s$
$V_s$  vertex set of graph $\mathcal{G}_s$
$E_s$  edge set of graph $\mathcal{G}_s$
$P_i^{(s)}$  coordinate of keypoint $i$ in image $s$
$\mathbf{h}_i^{(k)}$  feature vector of keypoint $i$, layer $k$, in a graph
$\mathbf{m}_i^{(k)}$  message vector of keypoint $i$, layer $k$, in a graph
$\mathbf{n}_i^{(k)}$  node feature of keypoint $i$, layer $k$, in a graph
$\mathbf{S}^{(k)}$  affinity matrix at the $k$-th Sinkhorn iteration
$\mathbf{X}$  matrix representing the permutation
3 Proposed Approach
We present two models for matching graphs $\mathcal{G}_1$ and $\mathcal{G}_2$: i) permutation loss and intra-graph affinity based graph matching learning (PIA), and ii) a permutation loss and cross-graph affinity based counterpart (PCA). Both models are built upon a deep network which exploits image features and structure jointly, and a Sinkhorn network enabling differentiable permutation prediction and loss back-propagation. PCA adopts an extra cross-graph component which aggregates cross-graph features, while PIA only embeds intra-graph features. Alg. 1 summarizes both PIA and PCA. Symbols are listed in Tab. 1.
The two proposed models consist of a CNN image feature extractor, a graph embedding component, an affinity metric function, and a permutation prediction component. Image features are extracted by a CNN (VGG16 in this paper) as graph node features and aggregated through the (cross-graph) node embedding component. The networks predict a permutation for node-to-node correspondence from raw pixel inputs.
3.1 Feature Extraction
We adopt a CNN for keypoint feature extraction; the keypoint features are constructed by interpolating on the CNN's feature map. For image $I_s$, the extracted feature of keypoint $i$ is:
$$\mathbf{h}_i^{(0)} = \mathrm{Interp}\big(P_i^{(s)}, \mathrm{CNN}(I_s)\big) \tag{4}$$
where $\mathrm{Interp}(P, T)$ interpolates at point $P$ from tensor $T$ via bilinear interpolation, and $\mathrm{CNN}(\cdot)$ performs CNN inference on the image and outputs a feature tensor. Taking the idea of the Siamese Network [3], the two input images share the same CNN structure and weights. To fuse both local structure and global semantic information, feature vectors from different layers of the CNN are extracted. We choose VGG16 pretrained on ImageNet
[8] as the CNN embodiment, in line with [45].
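A minimal PyTorch sketch of Eq. (4) follows; the function and argument names are ours, and it assumes keypoint coordinates given in input-image pixels, using `torch.nn.functional.grid_sample` for the bilinear interpolation:

```python
import torch
import torch.nn.functional as F

def extract_keypoint_features(feat_map, pts, img_size):
    """Bilinearly interpolate CNN features at keypoint locations (Eq. 4).
    feat_map: (1, C, H, W) CNN feature tensor; pts: (n, 2) keypoint
    (x, y) pixel coordinates; img_size: (width, height) of the input."""
    w, h = img_size
    grid = torch.empty_like(pts)
    grid[:, 0] = 2 * pts[:, 0] / (w - 1) - 1   # normalize x to [-1, 1]
    grid[:, 1] = 2 * pts[:, 1] / (h - 1) - 1   # normalize y to [-1, 1]
    grid = grid.view(1, 1, -1, 2)              # (1, 1, n, 2) sampling grid
    out = F.grid_sample(feat_map, grid, mode='bilinear', align_corners=True)
    return out.view(feat_map.shape[1], -1).T   # (n, C) per-keypoint features
```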
3.2 Intra-graph Node Embedding
It has been shown that methods exploiting graph structure can produce more robust matching [42] compared to point-based methods [12, 47]. In PIA, the graph affinity is constructed by a multi-layer embedding component which models higher-order information. The message passing scheme is inspired by GCN [17], where features are effectively aggregated from adjacent nodes and the node itself:
$$\mathbf{m}_i^{(k)} = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} f_{msg}\big(\mathbf{h}_j^{(k-1)}\big) \tag{5}$$
$$\mathbf{n}_i^{(k)} = f_{node}\big(\mathbf{h}_i^{(k-1)}\big) \tag{6}$$
$$\mathbf{h}_i^{(k)} = f_{update}\big(\mathbf{m}_i^{(k)}, \mathbf{n}_i^{(k)}\big) \tag{7}$$
Eq. (5) is the message passing along edges, where $f_{msg}$ is the message passing function and $\mathcal{N}_i$ is the set of nodes adjacent to node $i$. The aggregated features from adjacent nodes are normalized by the total number of adjacent nodes, as is common practice in GCN, in order to avoid bias due to the different numbers of neighbors owned by different nodes. Eq. (6) is the message passing function for the node itself, containing a node's self-passing function $f_{node}$. With $\mathbf{m}_i^{(k)}$ and $\mathbf{n}_i^{(k)}$, Eq. (7) accumulates information to update the state of node $i$, where $f_{update}$ may be any differentiable mapping from vectors to a vector. Here we implement $f_{msg}$ and $f_{node}$ as neural networks with ReLU activation, and $f_{update}$ as a summation function. We denote Eq. (7) as a graph convolution (GConv) between layers $k-1$ and $k$:
$$\mathbf{H}^{(k)} = \mathrm{GConv}\big(\mathbf{A}, \mathbf{H}^{(k-1)}\big) \tag{8}$$
which denotes one layer of our node embedding net, where $\mathbf{H}^{(k)} = [\mathbf{h}_1^{(k)}, \dots, \mathbf{h}_n^{(k)}]$ stacks all node features. Message passing paths are encoded by the adjacency matrix $\mathbf{A}$. By iterating Eq. (8), we build a multi-layer net:
$$\mathbf{H}^{(K)} = \mathrm{GConv}_K\Big(\mathbf{A}, \cdots \mathrm{GConv}_1\big(\mathbf{A}, \mathbf{H}^{(0)}\big)\Big) \tag{9}$$
where $\mathbf{h}_i^{(0)}$ is the CNN feature extracted for node $i$ by Eq. (4).
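A minimal PyTorch sketch of one GConv layer, Eqs. (5)-(8), is given below; the module and variable names are ours, and batching is omitted for clarity:

```python
import torch
import torch.nn as nn

class GConv(nn.Module):
    """One intra-graph convolution layer, Eqs. (5)-(8): average messages
    from adjacent nodes, add a self-passing term, and sum (f_update = sum)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.f_msg = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        self.f_node = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, A, H):
        # A: (n, n) adjacency matrix; H: (n, d) node features h^{(k-1)}
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)  # neighbor counts
        m = (A @ self.f_msg(H)) / deg                  # Eq. (5): avg. messages
        return m + self.f_node(H)                      # Eqs. (6)-(7)
```

Stacking several such layers, each with its own weights, realizes the multi-layer net of Eq. (9).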
3.3 Cross-graph Node Embedding
We explore an improvement over intra-graph embedding via a cross-graph aggregation step, whereby features are aggregated from nodes with similar features in the other graph. First, we utilize graph affinity features from shallower embedding layers to predict a doubly-stochastic similarity matrix $\tilde{\mathbf{S}}$ via affinity composition followed by Sinkhorn iteration (see details in Sec. 3.5). The predicted similarity matrix encodes the similarity between the nodes of the two graphs. The message passing scheme is similar to the intra-graph convolution in Eq. (8), with the adjacency matrix replaced by $\tilde{\mathbf{S}}$ and with features aggregated from the other graph. In our experiments, we will show that this simple scheme works more effectively than a more complex iterative procedure.
$$\mathbf{m}_i^{(k)} = \sum_{j \in V_2} \tilde{\mathbf{S}}_{ij} \, f_{msg}\big(\mathbf{h}_j^{(k-1)}\big) \tag{10}$$
$$\mathbf{n}_i^{(k)} = f_{node}\big(\mathbf{h}_i^{(k-1)}\big) \tag{11}$$
$$\mathbf{h}_i^{(k)} = f_{update}\big(\mathbf{m}_i^{(k)}, \mathbf{n}_i^{(k)}\big) \tag{12}$$
where node $i$ belongs to one graph and the messages in Eq. (10) are aggregated from nodes $j$ of the other graph; $f_{msg}$ and $f_{node}$ are taken as identity mappings, and $f_{update}$ is a concatenation of the two input feature tensors followed by a fully-connected layer. For a pair of graphs $(\mathcal{G}_1, \mathcal{G}_2)$, the cross-graph aggregation scheme in Eqs. (10)-(12) is denoted as
$$\mathbf{H}_1^{(k)} = \mathrm{CrossConv}\big(\tilde{\mathbf{S}}, \mathbf{H}_1^{(k-1)}, \mathbf{H}_2^{(k-1)}\big), \qquad \mathbf{H}_2^{(k)} = \mathrm{CrossConv}\big(\tilde{\mathbf{S}}^\top, \mathbf{H}_2^{(k-1)}, \mathbf{H}_1^{(k-1)}\big) \tag{13}$$
where $\tilde{\mathbf{S}}$ denotes the predicted correspondence from $\mathcal{G}_1$ to $\mathcal{G}_2$ and $\tilde{\mathbf{S}}^\top$ denotes the corresponding relation from $\mathcal{G}_2$ to $\mathcal{G}_1$.
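A sketch of the cross-graph aggregation of Eqs. (10)-(13) for one of the two graphs is shown below (our illustration; `fc` stands for the concatenation-plus-fully-connected $f_{update}$):

```python
import torch

def cross_conv(S12, H1, H2, fc):
    """Cross-graph convolution for graph 1, Eqs. (10)-(12): f_msg and
    f_node are identity; messages come from the *other* graph, weighted
    by the predicted similarity S12, then concatenated + fully-connected."""
    m1 = S12 @ H2                            # (n1, d) messages from graph 2
    return fc(torch.cat([H1, m1], dim=-1))   # f_update: concat + FC
```

The counterpart for the other graph uses `S12.T` with the roles of `H1` and `H2` swapped, as in Eq. (13).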
3.4 Affinity Metric Learning
With the above embedding model, the structure affinity between two graphs has been encoded into node-to-node affinities in the embedding space. As such, the traditional second-order affinity matrix in Eq. (1) can be reduced to a linear one. Let $\mathbf{h}_i^{(1)}$ be the feature of node $i$ from the first graph and $\mathbf{h}_j^{(2)}$ the feature of node $j$ from the other graph:
$$\mathbf{M}_{ij} = f_{aff}\big(\mathbf{h}_i^{(1)}, \mathbf{h}_j^{(2)}\big), \quad \mathbf{M} \in \mathbb{R}^{n_1 \times n_2} \tag{14}$$
The affinity matrix $\mathbf{M}$ contains the affinity scores between the two graphs: $\mathbf{M}_{ij}$ is the similarity between node $i$ in the first graph and node $j$ in the second, taking the higher-order information in the graphs into account.
One can set $f_{aff}$ as a bilinear mapping followed by an exponential function, which ensures all elements are positive. (We have also tried other, more flexible fully-connected layers, but we find the exponential function simple and more stable for learning.)
$$\mathbf{M}_{ij} = \exp\left(\frac{\mathbf{h}_i^{(1)\top} \mathbf{\Lambda}\, \mathbf{h}_j^{(2)}}{\tau}\right) \tag{15}$$
Assume the feature vectors have $d$ dimensions, i.e. $\mathbf{h}_i^{(1)}, \mathbf{h}_j^{(2)} \in \mathbb{R}^d$. $\mathbf{\Lambda} \in \mathbb{R}^{d \times d}$ contains the learnable weights of this affinity function, and $\tau > 0$ is a hyperparameter introduced for numerical concerns. With smaller $\tau$, Eq. (15) becomes more discriminative.
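A minimal PyTorch sketch of Eqs. (14)-(15) follows; the class name is ours, and the near-identity initialization of $\mathbf{\Lambda}$ mirrors the ablation setup described in Sec. 4.5:

```python
import torch
import torch.nn as nn

class Affinity(nn.Module):
    """Bilinear affinity of Eq. (15): M_ij = exp(h_i^T Lambda h_j / tau)."""
    def __init__(self, d, tau=1.0):
        super().__init__()
        # Lambda initialized as identity plus small noise (cf. Sec. 4.5)
        self.lam = nn.Parameter(torch.eye(d) + 1e-2 * torch.randn(d, d))
        self.tau = tau

    def forward(self, H1, H2):
        # H1: (n1, d), H2: (n2, d) -> (n1, n2) positive affinity matrix M
        return torch.exp(H1 @ self.lam @ H2.T / self.tau)
```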
3.5 Sinkhorn Layer for Linear Assignment
Given the linear assignment affinity matrix $\mathbf{M}$ in Eq. (15), we adopt a Sinkhorn net to address the linear assignment task. The Sinkhorn operation takes any non-negative square matrix and outputs a doubly-stochastic matrix, which is a convex relaxation of the permutation matrix. We adopt this technique as it has been shown effective for network-based permutation prediction [1, 33]. For a square matrix $\mathbf{S}^{(k)}$, the Sinkhorn operator is
$$\mathbf{S}^{(k)\prime} = \mathbf{S}^{(k)} \oslash \big(\mathbf{S}^{(k)} \mathbf{1}\mathbf{1}^\top\big) \tag{16}$$
$$\mathbf{S}^{(k+1)} = \mathbf{S}^{(k)\prime} \oslash \big(\mathbf{1}\mathbf{1}^\top \mathbf{S}^{(k)\prime}\big) \tag{17}$$
where $\oslash$ means element-wise division and $\mathbf{1}$ is a column vector whose elements are all ones. The Sinkhorn algorithm works iteratively by taking the row-normalization of Eq. (16) and the column-normalization of Eq. (17) alternately.
By iterating Eqs. (16)-(17) until convergence, we obtain a doubly-stochastic matrix, which is treated as our model's prediction during training:
$$\mathbf{S} = \mathrm{Sinkhorn}(\mathbf{M}) \tag{18}$$
For testing, the Hungarian algorithm [19] is applied to $\mathbf{S}$ as a post-processing step to discretize the output into a permutation matrix. The Sinkhorn step involves only matrix multiplication and element-wise division, both of which are differentiable; therefore, the Sinkhorn operation is fully differentiable and can be efficiently implemented with the help of PyTorch's automatic differentiation feature [35].
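A minimal sketch of the Sinkhorn layer of Eqs. (16)-(18) in PyTorch (our own; the iteration count and the small `eps` guarding zero rows or columns are illustrative choices):

```python
import torch

def sinkhorn(M, n_iters=10, eps=1e-8):
    """Differentiable Sinkhorn normalization, Eqs. (16)-(17):
    alternately row- and column-normalize a non-negative matrix."""
    S = M
    for _ in range(n_iters):
        S = S / (S.sum(dim=1, keepdim=True) + eps)  # Eq. (16): rows
        S = S / (S.sum(dim=0, keepdim=True) + eps)  # Eq. (17): columns
    return S  # approximately doubly-stochastic
```

At test time, the resulting doubly-stochastic matrix can be discretized with a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` (applied to the negated scores, since it minimizes cost).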
3.6 Permutation Cross-Entropy Loss
Our method directly utilizes the ground truth node-to-node correspondence, i.e. the permutation matrix, as the supervision for end-to-end training. Since the Sinkhorn layer in Eq. (18) is capable of transforming any non-negative matrix into a doubly-stochastic matrix, we propose a linear assignment based permutation loss that evaluates the difference between the predicted doubly-stochastic matrix $\mathbf{S}$ and the ground truth permutation matrix $\mathbf{S}^{gt}$ during training.
The cross-entropy loss is adopted to train our model end-to-end. We take the ground truth permutation matrix $\mathbf{S}^{gt}$ and compute the cross-entropy between $\mathbf{S}$ and $\mathbf{S}^{gt}$. We denote it as the permutation loss; it is the main method adopted to train our deep graph matching models:
$$L_{perm} = -\sum_{i \in V_1,\, j \in V_2} \Big(\mathbf{S}^{gt}_{ij} \log \mathbf{S}_{ij} + \big(1 - \mathbf{S}^{gt}_{ij}\big) \log\big(1 - \mathbf{S}_{ij}\big)\Big) \tag{19}$$
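A direct PyTorch transcription of Eq. (19) (our sketch; the `eps` term guarding $\log 0$ is our addition):

```python
import torch

def permutation_loss(S, S_gt, eps=1e-8):
    """Permutation cross-entropy loss, Eq. (19): element-wise binary
    cross-entropy between the doubly-stochastic prediction S and the
    ground-truth permutation matrix S_gt."""
    return -(S_gt * torch.log(S + eps)
             + (1 - S_gt) * torch.log(1 - S + eps)).sum()
```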
Note that the competing method GMN [45] applies a pixel offset based loss, namely a "displacement loss". Specifically, it computes an offset vector by a weighted sum of all matching candidates, and the loss is given by the difference between the predicted location and the ground truth location:
$$\mathbf{d}_i = \sum_{j \in V_2} \mathbf{S}_{ij} P_j^{(2)} - P_i^{(1)} \tag{20}$$
$$L_{off} = \sum_{i \in V_1} \sqrt{\big\|\mathbf{d}_i - \mathbf{d}_i^{gt}\big\|^2 + \epsilon} \tag{21}$$
where $P_i^{(1)}$, $P_j^{(2)}$ are the coordinates of the keypoints in the first and second image, respectively, and $\epsilon$ is a small value ensuring numerical robustness. In comparison, our cross-entropy loss directly learns a linear assignment based permutation loss in an end-to-end fashion.
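For comparison, the displacement loss of Eqs. (20)-(21) can be sketched as follows (our own transcription; tensor shapes and names are assumptions):

```python
import torch

def offset_loss(S, P1, P2, P2_gt, eps=1e-5):
    """GMN-style displacement loss, Eqs. (20)-(21).
    S: (n, n) matching prediction; P1, P2: (n, 2) keypoint coordinates;
    P2_gt: (n, 2) ground-truth corresponding coordinates for P1."""
    d = S @ P2 - P1          # Eq. (20): predicted offset vectors
    d_gt = P2_gt - P1        # ground-truth offset vectors
    return torch.sqrt(((d - d_gt) ** 2).sum(dim=-1) + eps).sum()  # Eq. (21)
```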
3.7 Further Discussion
Pairwise affinity matrix vs. embedding. Existing graph matching methods focus on modeling second-order [7, 22] and higher-order [21, 43] features with an explicitly predefined affinity matrix or tensor, in which the affinity information is encoded; optimization techniques are then applied to maximize the graph affinity.
In contrast, we resort to the node embedding technique, with two merits. First, the space complexity is reduced from $O(n^4)$ for the pairwise affinity matrix to $O(n^2)$ for the node-to-node affinity. Second, the pairwise affinity matrix in Eq. (1) can only encode edge information, while the embedding model can implicitly encode higher-order information.
Sinkhorn net vs. spectral matching. GMN [45] adopts spectral matching (SM) [22], which is differentiable for back-propagation, while we adopt the Sinkhorn net instead. In fact, the input of Sinkhorn is of size $O(n^2)$ while it is $O(n^4)$ for spectral matching. Moreover, we observe that SM requires more iterations to converge, and such iteration may negatively affect the back-propagation of gradients. In essence, spectral matching addresses graph matching, while the Sinkhorn net addresses linear assignment, which is relaxed from the graph matching task by our embedding component.
Pixel offset loss vs. permutation loss. The loss function adopted by GMN [45] is an offset loss named "displacement loss". It takes the weighted sum of all candidate points and computes the offset vectors from the source keypoints to the predicted locations. In training, GMN minimizes the difference between the predicted offset vectors and the ground truth offset vectors. In comparison, with the help of the Sinkhorn net, we adopt a combinatorial permutation loss computed as the cross-entropy between the predicted result and the ground truth permutation. Such a permutation loss takes the ground truth permutation directly as supervision and utilizes this information for end-to-end training.
Figure 1 gives an example of a failure case of the offset loss: the offset loss is unreasonably low, while the permutation loss provides the correct information. Experiments also show that models trained with our permutation loss exceed offset-loss models in matching accuracy.
4 Experiments
4.1 Metrics and Peer Methods
We evaluate the matching accuracy between two given graphs. In the evaluation phase, two graphs are given with the same number of nodes $n$, and each node in one graph is labeled with a corresponding node in the other graph. The model predicts a correspondence between the two graphs, represented by a permutation matrix. The matching accuracy is computed from the predicted permutation matrix and the ground truth permutation matrix.
The matching accuracy is the number of correctly matched keypoint pairs divided by the total number of keypoint pairs. For a predicted permutation matrix $\mathbf{X}$ and a ground truth permutation $\mathbf{X}^{gt}$, the matching accuracy is computed as
$$\mathrm{acc} = \frac{1}{n} \sum_{i,j} \mathrm{AND}\big(\mathbf{X}_{ij}, \mathbf{X}^{gt}_{ij}\big) \tag{22}$$
where $\mathrm{AND}$ is the logical and function.
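A sketch of Eq. (22) (ours): since both matrices are 0/1-valued, the logical AND reduces to an element-wise product:

```python
import torch

def matching_accuracy(X, X_gt):
    """Eq. (22): fraction of correctly matched keypoints; the
    element-wise product of two 0/1 permutation matrices counts
    the agreed assignments, averaged over the n nodes."""
    return (X * X_gt).sum() / X_gt.shape[0]
```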
The evaluation involves the following peer methods:
GMN. The Graph Matching Network (GMN) is the seminal model proposed in [45]. GMN adopts the VGG16 [36] network to extract image features; first-order and second-order features are extracted from a shallower layer (relu4_2) and a deeper layer (relu5_1) of VGG16, respectively. GMN models the graph matching affinity via an unlearnable graph matching solver, namely spectral matching (SM) [22]. This model is class-agnostic, meaning that it learns a universal model for all instance classes. The two input graphs are constructed by Delaunay triangulation and fully-connected topology, respectively. GMN is the first end-to-end deep learning method for graph matching. Note that the major difference from our method is that its loss function is the offset-based loss of Eq. (21). We follow [45] and re-implement GMN in PyTorch, as the source code is not publicly available.
HARG-SSVM. This is the structured-SVM based graph matching learning method [6], serving as a baseline for learning graph matching without deep learning techniques. HARG-SSVM is a class-specific method, where a graph model is learned for each class. We use the source code released by the authors upon their approval. The original setting in [6] assumes that the keypoints of the object to be matched are unknown and that keypoint candidates are proposed by a Hessian detector [27]. In our setting, however, all candidate keypoints are known to the model; therefore, we slightly modified the released code. From all candidate points found by the Hessian detector, we assign the nearest neighbor of each ground truth point as the matching candidate, a practice originally adopted in the training process of HARG-SSVM. The graph is created with handcrafted edge features named HARG.
PIA/PCA. Our methods adopt VGG16 [36] as the backbone CNN and extract features from relu4_2 and relu5_1 for fair comparison with [45]. These two feature vectors are concatenated to fuse local and global features. In PIA, affinity is modeled by a 3-layer GNN, while in PCA it is a stack of a 1-layer GNN, a cross-graph layer, and another 1-layer GNN, both followed by the affinity mapping in Eq. (15). Each GNN layer has a feature dimension of 2048. The permutation loss in Eq. (19) is used. Input graphs are both constructed by Delaunay triangulation, and the hyperparameter $\tau$ in Eq. (15) is set empirically. Our models are implemented in PyTorch.
GMN-PL & PIA/PCA-OL. GMN-PL and PIA/PCA-OL are variants of GMN [45] and our proposed PIA/PCA, respectively. GMN-PL changes the offset loss in GMN to the permutation loss, with all other configurations unchanged, while PIA/PCA-OL switch the permutation loss to the offset loss, leaving all other components unchanged.
For natural image experiments, we draw two images from the dataset and build two graphs containing the same number of nodes. The graph structure is agnostic and is constructed according to each method's configuration (see the discussions above). The CNN weights are initialized by a model pretrained on the ImageNet [8] classification dataset.
Table 2: Matching accuracy (%) on Pascal VOC Keypoint.
method  aero  bike  bird  boat  bottle  bus  car  cat  chair  cow  table  dog  horse  mbike  person  plant  sheep  sofa  train  tv  mean
GMN  31.9  47.2  51.9  40.8  68.7  72.2  53.6  52.8  34.6  48.6  72.3  47.7  54.8  51.0  38.6  75.1  49.5  45.0  83.0  86.3  55.3
GMN-PL  31.1  46.2  58.2  45.9  70.6  76.4  61.2  61.7  35.5  53.7  58.9  57.5  56.9  49.3  34.1  77.5  57.1  53.6  83.2  88.6  57.9
PIA-OL  39.7  57.7  58.6  47.2  74.0  74.5  62.1  66.6  33.6  61.7  65.4  58.0  67.1  58.9  41.9  77.7  64.7  50.5  81.8  89.9  61.6
PIA  41.5  55.8  60.9  51.9  75.0  75.8  59.6  65.2  33.3  65.9  62.8  62.7  67.7  62.1  42.9  80.2  64.3  59.5  82.7  90.1  63.0
PCA  40.9  55.0  65.8  47.9  76.9  77.9  63.5  67.4  33.7  65.5  63.6  61.3  68.9  62.8  44.9  77.5  67.4  57.5  86.7  90.9  63.8
4.2 Synthetic Graphs
Evaluation is first performed on synthetic graphs generated in line with the protocol of [7]. Ground truth graphs are generated with a given number of keypoints, each with a 1024-dimensional random feature (512 dimensions for nodes and 512 for edges, simulating CNN features) and a random 2D coordinate. During training and testing, we draw disturbed graphs, with Gaussian noise added to the features and with keypoint coordinates blurred by a random affine transform plus further random noise. Note that no CNN feature extractor is adopted here; only the graph modeling approaches and loss metrics are compared. The matching accuracy of PCA, PCA-OL, GMN-PL, GMN and the learning-free SM is evaluated with respect to the deformation level and the number of keypoints. For each trial, 10 different graphs are generated and the accuracy is averaged. The experimental results in Fig. 2 show the robustness of PCA against feature deformation and complicated graph structure.
4.3 Pascal VOC Keypoints
We perform experiments on the Pascal VOC dataset [11] with Berkeley keypoint annotations [2], which contains 20 classes of instances with labeled keypoint locations. Following the practice of peer methods [45], the original dataset is filtered to 7,020 annotated images for training and 1,682 for testing. Each instance is cropped around its bounding box and resized before being passed to the network. Pascal VOC Keypoint is a difficult dataset, because instances vary in scale, pose and illumination, and the number of inliers differs across images.
We test on Pascal VOC Keypoint [2] and evaluate on the 20 Pascal categories. We compare GMN, GMN-PL, PIA-OL, PIA and PCA, with detailed results in Tab. 2. Our proposed models PIA-OL, PIA and PCA lead in most categories, including the mean accuracy over the 20 categories. Our implementation of PCA is trained on dual RTX 2080Ti GPUs. The results show the superiority of the linear assignment loss over the offset loss in training, of the GNN over the fixed SM [22] in affinity modeling, and of cross-graph embedding over intra-graph embedding in the GNN component.
Table 3: Matching accuracy (%) on the Willow ObjectClass dataset.
method  face  mbike  car  duck  wbottle
HARG-SSVM [6]  91.2  44.4  58.4  55.2  66.6
GMN-VOC [45]  98.1  65.0  72.9  74.3  70.5
GMN-Willow [45]  99.3  71.4  74.3  82.8  76.7
PCA-VOC  100.0  69.8  78.6  82.4  95.1
PCA-Willow  100.0  76.7  84.0  93.5  96.9
4.4 Willow ObjectClass
The Willow ObjectClass dataset is collected by [6] from real images. It consists of 5 categories, face, duck and wine bottle from Caltech-256, and car and motorbike from Pascal VOC 2007, each with at least 40 images. Images are resized before being passed to the CNN. This dataset is considered easier than Pascal VOC Keypoint, because all images inside the same category are aligned in pose, and it lacks scale, background and illumination changes.
We follow the protocol established by the authors of [6] for fair evaluation. HARG-SSVM is trained and evaluated on the Willow dataset. For the other competing methods, we initialize their weights on the Pascal VOC Keypoint dataset with all VOC 2007 car and motorbike images removed, denoting them GMN-VOC and PCA-VOC. We then fine-tune the models' weights on the Willow dataset, denoted GMN-Willow and PCA-Willow, reaching even higher results in evaluation. Note that HARG-SSVM is a class-specific model, while GMN and PCA are both class-agnostic. Tab. 3 shows that our proposed PCA surpasses all competing methods in nearly all categories of the Willow ObjectClass dataset.
4.5 Further Study
Table 4: Ablation study on PCA components (✓ marks a learned component, – a component left at initialization).
VGG16 feature  intra-graph embedding  cross-graph embedding  distance metric  accuracy
✓  ✓  ✓  ✓  63.8
✓  ✓  ✓  –  63.6
✓  ✓  –  –  62.1
✓  –  –  –  54.8
–  –  –  –  41.9
PCA components. An ablation study with different PCA components trained or left untrained is reported in Tab. 4. It shows the usefulness of all our components. VGG16 is initialized with weights pretrained on ImageNet, the GNN layers are randomly initialized, and the weights of the affinity mapping are initialized as an identity matrix plus random noise.
Cross-graph component design. Our cross-graph affinity component in Alg. 1 is relatively simple. We also explored a more complex design of the cross-graph module, where the similarity matrix $\tilde{\mathbf{S}}$ is updated by iterative prediction rather than predicted from the shallower GNN layer as in Alg. 1. In this alternative design, $\tilde{\mathbf{S}}$ is initialized as a zero matrix and iteratively re-predicted from the features input to the cross-graph component. The results in Tab. 5 reveal that PCA's performance degrades as $\tilde{\mathbf{S}}$ is iteratively predicted, and we further find the training unstable under this iterative design; hence we stick to the simple design in Alg. 1. Due to space limitations, details of this alternative design are given in the supplementary material.
Table 5: PCA accuracy (%) of the alternative iterative cross-graph design vs. Alg. 1.
# of iters  1  2  3  4  5  6  7  Alg. 1
PCA accuracy  63.1  61.3  60.9  54.7  45.9  46.7  46.2  63.8
Confusion matrix. To test the generalization of a model trained on one category to other categories, we train PCA, PCA-OL, GMN-PL and GMN on eight categories of Pascal VOC Keypoint and report the testing result on each category, as shown in Fig. 3, where the results are plotted as confusion matrices (one axis for the training category and the other for the testing category). It shows that the GNN adopted in PCA works well, and that the permutation loss offers better supervision than the offset loss.
5 Conclusion
This paper has presented a novel deep learning framework for graph matching, which parameterizes the graph affinity with deep networks and whose learning objective involves a permutation loss to account for the arbitrary transformation between two graphs. Extensive experimental results, including an ablation study of the presented components and comparisons with peer methods, show the state-of-the-art performance of our method.
Appendix A Further Discussions on PCA
Details of the alternative cross-graph design. Our cross-graph affinity component of PCA in this paper is relatively simple. We also experimented with a more complex alternative design of the cross-graph module, where the similarity matrix $\tilde{\mathbf{S}}$ is updated by iterative prediction rather than predicted from shallower GNN layers as in the original PCA. Details of this alternative design are shown in Alg. 2, which is written in line with Alg. 1 in the main paper. Experimental results reveal a degradation in performance with this alternative design compared to our simple but effective cross-graph module. Such a phenomenon may be caused by the heavy iteration adopted in this design, which may affect the stability of the backward gradients during training.
Discussion on the accuracy of the "table" category. One may notice that the results listed in Tab. 2 of the main paper show that the methods trained with the offset loss (GMN, PIA-OL) outperform the methods trained with the permutation loss (GMN-PL, PIA, PCA) on the category "table". This is caused by the symmetry of tables, where the absolute pose of a table is always non-deterministic, so the learned correspondence does not generalize to unseen cases. As shown in Fig. 4, overfitting is witnessed for models trained with the permutation loss (PCA), whose supervision is stronger than the offset loss (GMN).
References
 [1] R. P. Adams and R. S. Zemel. Ranking via sinkhorn propagation. arXiv:1106.1925, 2011.
 [2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, 2009.
 [3] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a “siamese” time delay neural network. In NIPS, 1994.
 [4] T. Caetano, J. McAuley, L. Cheng, Q. Le, and A. J. Smola. Learning graph matching. TPAMI, 31(6):1048–1058, 2009.
 [5] M. Chertok and Y. Keller. Efficient high order matching. TPAMI, 2010.
 [6] M. Cho, K. Alahari, and J. Ponce. Learning graphs to match. In ICCV, 2013.
 [7] M. Cho, J. Lee, and K. M. Lee. Reweighted random walks for graph matching. In ECCV, 2010.
 [8] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. Imagenet: A largescale hierarchical image database. In CVPR, 2009.
 [9] O. Duchenne, F. Bach, I. Kweon, and J. Ponce. A tensor-based algorithm for high-order graph matching. TPAMI, 2011.
 [10] P. Emami and S. Ranka. Learning permutations with sinkhorn policy gradient. arXiv:1805.07010, 2018.
 [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
 [12] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
 [13] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. 1990.
 [14] S. Gold and A. Rangarajan. A graduated assignment algorithm for graph matching. TPAMI, 1996.
 [15] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In SIGKDD, 2016.
 [16] J. Huang, M. Patwary, and G. Diamos. Coloring big graphs with alphagozero. arXiv:1902.10162, 2019.
 [17] T. N. Kipf and M. Welling. Semisupervised classification with graph convolutional networks. ICLR, 2017.
 [18] W. Kool and M. Welling. Attention solves your TSP. arXiv:1803.08475, 2018.
 [19] H. W. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 1955.
 [20] E. L. Lawler. The quadratic assignment problem. Management Science, 1963.
 [21] J. Lee, M. Cho, and K. M. Lee. Hypergraph matching via reweighted random walks. In CVPR, 2011.
 [22] M. Leordeanu and M. Hebert. A spectral technique for correspondence problems using pairwise constraints. In ICCV, 2005.
 [23] M. Leordeanu, R. Sukthankar, and M. Hebert. Unsupervised learning for graph matching. IJCV, 2012.
 [24] M. Leordeanu, A. Zanfir, and C. Sminchisescu. Semisupervised learning and optimization for hypergraph matching. In ICCV, 2011.
 [25] E. M. Loiola, N. M. de Abreu, P. O. BoaventuraNetto, P. Hahn, and T. Querido. A survey for the quadratic assignment problem. EJOR, 2007.

 [26] G. Mena, D. Belanger, G. Muñoz, and J. Snoek. Sinkhorn networks: Using optimal transport techniques to learn permutations. NIPS Workshop in Optimal Transport and Machine Learning, 2017.
 [27] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. IJCV, 2004.
 [28] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. Computer Science, 2013.
 [29] Q. Ngoc, A. Gautier, and M. Hein. A flexible tensor block coordinate ascent scheme for hypergraph matching. In CVPR, 2015.
 [30] A. Nowak, S. Villar, A. S. Bandeira, and J. Bruna. Revised note on learning quadratic assignment with graph neural networks. In DSW, 2018.
 [31] G. Patrini, M. Carioni, P. Forre, S. Bhargav, M. Welling, R. v. d. Berg, T. Genewein, and F. Nielsen. Sinkhorn autoencoders. arXiv:1810.01118, 2018.
 [32] B. Perozzi, R. AlRfou, and S. Skiena. Deepwalk: Online learning of social representations. In SIGKDD, 2014.
 [33] R. Santa Cruz, B. Fernando, A. Cherian, and S. Gould. Visual permutation learning. TPAMI, 2018.
 [34] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. TNN, 2009.
 [35] F. Serratosa, A. SoléRibalta, and X. Cortés. Automatic learning of edit costs based on interactive and adaptive graph recognition. In GbR. 2011.
 [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2014.
 [37] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. AoMS, 1964.
 [38] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Largescale information network embedding. In WWW, 2015.
 [39] L. Torresani, V. Kolmogorov, and C. Rother. Feature correspondence via graph matching: Models and global optimization. In ECCV, 2008.
 [40] M. Vento and P. Foggia. Graph matching techniques for computer vision. Graph-Based Methods in Computer Vision: Developments and Applications, 2012.
 [41] D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In SIGKDD, 2016.
 [42] J. Yan, X. Yin, W. Lin, C. Deng, H. Zha, and X. Yang. A short survey of recent advances in graph matching. In ICMR, 2016.
 [43] J. Yan, C. Zhang, H. Zha, W. Liu, X. Yang, and S. Chu. Discrete hypergraph matching. In CVPR, 2015.
 [44] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In NIPS, 2017.
 [45] A. Zanfir and C. Sminchisescu. Deep learning of graph matching. In CVPR, 2018.
 [46] R. Zass and A. Shashua. Probabilistic graph and hypergraph matching. In CVPR, 2008.
 [47] Z. Zhang. Iterative point matching for registration of freeform curves and surfaces. IJCV, 1994.
 [48] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun. Graph neural networks: A review of methods and applications. arXiv:1812.08434, 2018.