Integration graph attention network and multi-centre constrained loss for cross-modality person re-identification

Cross-modality person re-identification is a challenging task due to the large visual appearance difference between RGB and infrared images. Existing studies mainly focus on learning local features and ignore the correlation between local features. In this paper, the Integration Graph Attention Network is proposed to learn the completed correlation between local features via the graph structure. To this end, the authors learn the coarse-fine attention weights to aggregate the local features by considering local detail and global information. Furthermore, the Multi-Centre Constrained Loss is proposed to optimise the feature similarity by constraining the centres of modality and identity. It simultaneously utilises three kinds of centre constraints, that is, the intra-identity centre constraint, the modality centre constraint, and the inter-identity centre constraint, in order to reduce the influence of modality information explicitly. The proposed method is evaluated on two standard benchmark datasets, that is, SYSU-MM01 and RegDB, and the results demonstrate that the authors' method achieves better performance than the state-of-the-art methods.

The visual appearance difference between RGB and IR images is the main challenge for cross-modality person Re-ID, because IR images with one channel only contain the information of invisible electromagnetic radiation while three-channel RGB images include rich colour information of visible light. Furthermore, cross-modality person Re-ID inherits the challenges of single modality person Re-ID, such as the variations in poses and viewpoints. In a word, cross-modality person Re-ID is more challenging than single modality person Re-ID.
To address the above-mentioned issues, the existing cross-modality person Re-ID methods mainly focus on feature learning and metric learning. As for feature learning, some methods design one-stream or two-stream networks to extract global features from RGB and IR images [8,9,11,42]. Furthermore, modality-consistent features or images are usually learnt to reduce the modality gap; they are generated by various modality transformations, such as GAN, convolution operation, grayscale transformation and so on [13-17,44]. Meanwhile, local features of pedestrians are utilised to explore the invariant body shape information for cross-modality person Re-ID [10,18,45]. However, these methods only extract local features from a single region and ignore the correlation with other features, which makes it difficult to learn complementary information between local features.
As for the metric learning, it is applied to reduce the appearance difference between RGB and IR images from the aspect of feature similarity optimisation. The cross-modality triplet loss, the hetero-centre loss and the contrastive loss are proposed to control the distance between cross-modality features [9,10,42]. Some methods map heterogeneous features into a common space so as to learn modality-shared metrics [19,20]. However, these methods mix the modality information and the identity information of features in the process of metric learning, and they do not explicitly consider the influence of the modality information. Hence, the learnt metric functions are suboptimal for cross-modality person Re-ID.
To overcome the above-mentioned limitations, we propose a novel method named Integration Graph Attention Network (IGAT) for cross-modality person Re-ID, where IGAT is designed to learn the correlation between local features via the graph structure. To this end, we first extract the local features of pedestrian images and treat them as the nodes of a graph. In order to model the completed correlation, we not only learn the correlation between local features, but also integrate the global correlation into the feature representation by learning the coarse-fine attention weights. Then, we apply the coarse-fine attention weights to aggregate the local features from the corresponding parts with the same modality. As a result, local detail and global information are injected into the final representation so as to obtain the complementary information.
Furthermore, to relieve the influence of modality information, we propose the Multi-Centre Constrained Loss (MCCL) to optimise the similarity between pedestrian images by constraining the centres of modality and identity. Specifically, as shown in Figure 1, MCCL consists of three components: 1) Intra-identity centre constraint: to increase the feature similarity between pedestrian images with the same identity, we directly pull the centres with the same identity from different modalities together. 2) Modality centre constraint: we also pull the centres of different modalities together to reduce the feature discrepancy caused by cross modality. 3) Inter-identity centre constraint: the centres of different identities are encouraged to be away from each other. It could improve the feature dissimilarity between pedestrian images from different identities in order to obtain discriminative features. In a word, the proposed MCCL explicitly reduces the influence of modality information by constraining different kinds of centres.
The main contributions of this work are summarised as follows:
1) We propose IGAT to obtain the completed correlation between local features by learning the coarse-fine attention weights.
2) We propose MCCL to optimise the similarity between pedestrian images from different aspects by constraining different kinds of centres.
3) Extensive experimental results on the SYSU-MM01 and RegDB datasets show that our method surpasses the state-of-the-art methods, which demonstrates the effectiveness of our method.

| Cross-modality person Re-ID
In order to overcome the visual appearance difference between RGB and IR modalities, many approaches have been proposed to learn discriminative features for cross-modality person Re-ID [8,21,22]. Some of them design specific network structures to obtain global features [8,11]. For example, Ye et al. [11] present a two-stream network with non-local attention to extract global features. Some methods employ a generator module to generate modality alignment information [12,16,26]. Wang et al. [26] apply AlignGAN to transform real RGB images to fake IR images in order to obtain alignment features. Furthermore, local features are introduced into cross-modality person Re-ID to extract the invariant body shape information from the pedestrian images of different modalities [18,45]. Sun et al. [23] propose a whole-individual training (WIT) model to learn local features for VI-ReID, which is based on the idea of pulling the whole images and distinguishing the individuals. Ye et al. [18] exploit the intra-modality part relationship to enhance the feature representation. However, these local feature-based methods for cross-modality person Re-ID learn only the local information or the correlation between local features, so the correlation used in the aggregation process is incomplete. Different from the above methods, the proposed IGAT learns the completed correlation between local features, where the local and global features are both considered in the aggregation process.
In order to learn an accurate similarity measurement between cross-modality features, some methods reduce the modality gap by means of metric learning. Chen et al. [42] employ the contrastive loss, and Ye et al. [9] adopt the cross-modality triplet loss to optimise the deep networks, which helps to extract modality-invariant features. Hao et al. [20] map the pedestrian images from two domains into a hypersphere and constrain the cross-modality variations by the hypersphere. Zhu et al. [10] propose the hetero-centre loss to reduce the intra-identity cross-modality variations by constraining the centres. Although these centre-based losses achieve promising results, they only consider one kind of centre constraint, which makes it difficult to handle the complex distributions of heterogeneous features. Different from them, the proposed MCCL considers three kinds of centre constraints of modality and identity so as to simultaneously reduce the intra-identity cross-modality variations and inter-modality variations, and increase the inter-identity variations.

| Graph Attention Network
Graph Attention Network (GAT) [29] removes the need for prior knowledge of the graph structure required by Graph Neural Networks (GNN) [27,28] and integrates the attention architecture into GNN. It assigns different weights to neighbour nodes for propagating information to centre nodes via masked self-attentional layers.
Recently, GAT has been applied to various tasks to exploit the dependency between nodes [30,31]. Huang et al. [32] propose the target-dependent GAT to utilise the dependency relationship among words for aspect-level sentiment classification. Wang et al. [33] propose the relational GAT to encode syntax information for sentiment prediction. Yang et al. [25] propose HGAT based on a dual-level attention mechanism for short text classification. Chen et al. [24] exploit a heterogeneous graph and node features to learn user profiles from limited labelled data.
As for the field of person Re-ID, Zhang et al. [37] present Heterogeneous Local Graph Attention Networks (HLGAT) to model the inter-local relation and the intra-local relation for person Re-ID. However, HLGAT ignores the aggregation of global information in the learning process of the local features. Different from HLGAT, the proposed IGAT models the dependency between local features from the local and global aspects.

| APPROACH
The framework of the proposed method is shown in Figure 2. It mainly consists of the Feature Extractor Module, the IGAT Module and MCCL. We detail each component in this section.

| Feature Extractor Module
The Feature Extractor Module is designed based on a two-stream network, which adopts ResNet-50 [34] as the backbone. Specifically, we adopt two individual ResNet-50 networks, one for each modality stream, with the last down-sampling operation removed. In each modality stream, average pooling is applied to the feature maps output by the last convolution layer. Meanwhile, the feature maps are divided horizontally into P uniform parts, and average pooling is applied to each part. Finally, a weight-shared fully connected (FC) layer is employed by the two modality streams to obtain the local features $\{f_L^p\}_{p=1}^P$ and the global feature $f_G$.
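The horizontal partition and pooling step described above can be sketched in a few lines. This is a minimal illustration only: it uses plain Python lists in place of tensors, omits the ResNet-50 backbone and the weight-shared FC layer, and the function names `average_pool` and `extract_part_and_global` are hypothetical rather than taken from the authors' code.

```python
def average_pool(feature_map, row_start, row_end):
    """Average every channel over rows [row_start, row_end) and all columns."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel[row_start:row_end] for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def extract_part_and_global(feature_map, num_parts):
    """Split a C x H x W feature map into horizontal stripes and pool each one.

    Returns (local_feats, global_feat): a list of num_parts vectors of length C
    and a single vector of length C pooled over the whole map.
    """
    height = len(feature_map[0])
    if height % num_parts != 0:
        raise ValueError("feature map height must be divisible by num_parts")
    stripe = height // num_parts
    local_feats = [
        average_pool(feature_map, p * stripe, (p + 1) * stripe)
        for p in range(num_parts)
    ]
    global_feat = average_pool(feature_map, 0, height)
    return local_feats, global_feat


# Toy example: 2 channels, 6 rows, 2 columns, split into 3 stripes.
toy_map = [
    [[float(r)] * 2 for r in range(6)],  # channel 0: row r holds the value r
    [[1.0] * 2 for _ in range(6)],       # channel 1: constant
]
local_feats, global_feat = extract_part_and_global(toy_map, num_parts=3)
```

In the paper's setting P is 6 and each pooled vector would then pass through the shared FC layer; both are left out here to keep the sketch self-contained.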

| IGAT module
Local features have been shown to be robust to viewpoint and posture changes [35-37]. Furthermore, the local features corresponding to the same part describe the pedestrian from different aspects, and therefore learning the correlation between local features can propagate useful information between them. As a result, the discrimination of local features is improved. Motivated by this, we design IGAT to model the completed correlation for local features. The IGAT Module is connected after the Feature Extractor Module, and the local features $\{f_L^p\}_{p=1}^P$ and the global feature $f_G$ are the input of the IGAT Module. We utilise the local features to construct a graph where each local feature is treated as a node. Each node is updated by its neighbour nodes with the aggregation operation. The node after updating is formulated as: where $f_L^{p,i}$ and $f_L^{p,j}$ are the p-th local features of the i-th and j-th pedestrian images, respectively, $V^p$ indicates the learnable transformation matrix for the p-th local feature, σ(⋅) denotes the non-linear transformation implemented by the LeakyReLU operation, and $N_i$ is the neighbour node set of i, which contains the nodes with the same modality as the i-th pedestrian image. Here, $\alpha_{ij}^p$ is the attention weight, which reflects the correlation between the p-th local features of the i-th and j-th pedestrian images. It is usually formulated as [18]: where ϕ denotes the cosine similarity function, and $W^p$ represents the learnable transformation matrix for the p-th local feature. From Equations (1) and (2), we can see that this formulation only considers the correlation from the local aspect, and it may obtain some unexpected attention weights without considering global information.
Figure 3 shows some local regions of RGB and IR images. From Figure 3a we can see that $I_1$ and $I_3$ have the same identity, whereas $I_1$ and $I_2$ have different identities. However, the similarity between the local regions of $I_1$ and $I_2$ is higher than that of $I_1$ and $I_3$. Meanwhile, from the aspect of whole images, we can tell that $I_1$ and $I_3$ have the same identity, which is opposite to the judgement from the local aspect. We can draw a similar conclusion from the IR images in Figure 3b.
Based on the observation of Figure 3, we inject the global information when learning the correlation between local features. We expect to utilise the global feature similarity to correct the mismatching caused by only considering the local similarity. Hence, we propose the coarse-fine attention weights to consider the similarity between local features and the similarity between global features. The coarse-fine attention weight between the p-th local features of the i-th and j-th pedestrian images is defined as: where $W_G$ is the learnable transformation matrix for the global feature, λ is the balance parameter, and $f_G^i$ and $f_G^j$ indicate the global features of the i-th and j-th pedestrian images, respectively. Afterwards, we substitute Equation (3) into Equation (1) to obtain the aggregated local features.
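To make the idea of correcting local similarity with global similarity concrete, the following sketch computes attention weights from a local cosine similarity plus a λ-weighted global cosine similarity, normalised with a softmax over the neighbour set. The exact form of the coarse-fine weight is not reproduced in this text, so the additive combination, the softmax normalisation, the omission of the learnable matrices, and the names `coarse_fine_weights` and `aggregate` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def coarse_fine_weights(local_feats, global_feats, i, neighbours, lam=0.2):
    """Softmax over (local similarity + lam * global similarity).

    ASSUMPTION: the additive form and the softmax are illustrative choices;
    the learnable transformation matrices are replaced by the identity.
    """
    scores = [
        cosine(local_feats[i], local_feats[j])
        + lam * cosine(global_feats[i], global_feats[j])
        for j in neighbours
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def aggregate(local_feats, weights, neighbours):
    """Weighted sum of neighbour features followed by LeakyReLU."""
    dim = len(local_feats[0])
    out = [0.0] * dim
    for w, j in zip(weights, neighbours):
        for d in range(dim):
            out[d] += w * local_feats[j][d]
    return [leaky_relu(x) for x in out]


# Toy version of the Figure 3 situation: image 1 looks locally closer to
# image 0 than image 2 does, but only image 2 matches image 0 globally.
local_feats = [[1.0, 0.0], [1.0, 0.1], [0.8, 0.6]]
global_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
local_only = coarse_fine_weights(local_feats, global_feats, 0, [1, 2], lam=0.0)
coarse_fine = coarse_fine_weights(local_feats, global_feats, 0, [1, 2], lam=0.2)
updated = aggregate(local_feats, coarse_fine, [1, 2])
```

With λ = 0 the locally similar but globally different neighbour receives the larger weight; with λ = 0.2 the ordering flips in this toy case, which is the kind of correction the paragraph above describes.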
We employ the cross-entropy (CE) loss to supervise the learning process of IGAT: where $L_{id}^p$ is the CE loss for the p-th aggregated local feature.

| Multi-centre constrained loss
In the field of cross-modality person Re-ID, the similarity between RGB and IR images with the same identity is often not high enough to distinguish them from other identities because of the influence of heterogeneous modalities. Metric learning is effective in reducing the modality gap; however, the existing metric learning methods for cross-modality Re-ID do not explicitly handle the influence of modality information.
Hence, we propose MCCL to simultaneously consider the influence of modality information and identity information by constraining multiple centres of modality and identity. Specifically, MCCL includes three kinds of centre constraints in order to achieve comprehensive similarity optimisation. First, we apply the intra-identity centre constraint to pull the centres with the same identity from different modalities together in order to increase the similarity of cross-modality features with the same identity [10]. It is defined as: where N denotes the number of identities, $\|\cdot\|_2$ denotes the Euclidean distance, and $c_R^{p,i}$ and $c_I^{p,i}$ are the centres (mean vectors) of the p-th local features for the i-th identity of RGB images and IR images, respectively.
In order to further reduce the modality gap, we propose the modality centre constraint from the macro perspective. The modality centre constraint is expected to pull the centres of the two modalities together, which helps to transform the heterogeneous features into homogeneous features. It is defined as: where $c_R^p$ and $c_I^p$ are the centres of the p-th local features of all RGB and IR images, respectively. Unlike the intra-identity centre constraint, which computes multiple centres for the p-th local features, the modality centre constraint only requires computing one centre for each modality.
Finally, we propose the inter-identity centre constraint to push the centres of different identities away from each other so as to increase the discrimination of features. The intra-identity centre constraint and the modality centre constraint mainly focus on improving the similarity between cross-modality pedestrian images. As a complement, the inter-identity centre constraint is designed to increase the dissimilarity between the pedestrian images with different identities. We define two forms of the inter-identity centre constraint. As shown in Figure 4a, the first one, denoted $\tilde{L}_{ter}$, is: where $r_{p,i}$ is the maximum distance between the centre and the features for the p-th local features of the i-th identity, and $r_{p,j}$ is defined likewise for the j-th identity. Here, the margin $d_p^{i,j}$ is the Euclidean distance between the centres of the i-th identity and the j-th identity for the p-th local features. Furthermore, we narrow the margin in Equation (8) as shown in Figure 4b and obtain another form of the inter-identity centre constraint, denoted $L_{ter}$: Equation (9) relaxes the margin restriction and does not introduce any extra parameters. Figure 5 shows the loss trends of $\tilde{L}_{ter}$ and $L_{ter}$ in the training process, where we can see that $L_{ter}$ converges faster than $\tilde{L}_{ter}$. Meanwhile, in the ablation study, we conduct experiments to validate that $L_{ter}$ is more effective than $\tilde{L}_{ter}$.
In short, the proposed MCCL for local features is defined as: where $\beta_1$, $\beta_2$ and $\beta_3$ are the weight parameters. We adopt MCCL not only on the local features using Equation (10) but also on the global features, denoted as $L_{MCC\_G}$. Hence, MCCL on the local and global features is formulated as:
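The three centre constraints described in this subsection can be illustrated with a small pure-Python sketch for a single part index p. The intra-identity and modality terms follow the verbal definitions (distances between mean vectors). The inter-identity term is only a plausible hinge built from the radii r and the centre distance d; it is an assumption and should not be read as Equation (8) or Equation (9). Computing each identity centre over both modalities, summing rather than averaging, and all function names are likewise illustrative choices.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def intra_identity_constraint(rgb, ir):
    """Sum over identities of the distance between the RGB and IR centres.

    rgb and ir map identity -> list of feature vectors for that modality.
    """
    return sum(euclidean(mean_vector(rgb[k]), mean_vector(ir[k])) for k in rgb)


def modality_constraint(rgb, ir):
    """Distance between the centre of all RGB and the centre of all IR features."""
    all_rgb = [f for feats in rgb.values() for f in feats]
    all_ir = [f for feats in ir.values() for f in feats]
    return euclidean(mean_vector(all_rgb), mean_vector(all_ir))


def inter_identity_hinge(rgb, ir):
    """ASSUMED hinge: penalise identity pairs whose radii sum exceeds d."""
    centres, radii = {}, {}
    for k in rgb:
        feats = rgb[k] + ir[k]
        centres[k] = mean_vector(feats)
        radii[k] = max(euclidean(centres[k], f) for f in feats)
    ids = sorted(centres)
    loss = 0.0
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            i, j = ids[a], ids[b]
            d = euclidean(centres[i], centres[j])
            loss += max(0.0, radii[i] + radii[j] - d)
    return loss


# Two well-separated identities with a vertical offset between modalities.
rgb = {0: [[0.0, 0.0], [2.0, 0.0]], 1: [[10.0, 0.0], [12.0, 0.0]]}
ir = {0: [[1.0, 3.0]], 1: [[11.0, 4.0]]}
l_tra = intra_identity_constraint(rgb, ir)
l_m = modality_constraint(rgb, ir)
l_ter = inter_identity_hinge(rgb, ir)
```

In this toy data the modality offset drives the first two terms, while the identities are far enough apart that the assumed hinge contributes nothing.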

| Optimisation
To optimise the proposed framework in an end-to-end way, the overall loss is defined as: where $\mu_1$, $\mu_2$ and $\mu_3$ are the parameters that control the weights of the different components. Here, $L_{id}^T$ denotes the sum of CE losses for the local features and the global features in the Feature Extractor Module. Finally, the loss computed by Equation (12) is back-propagated so as to optimise the model.
FIGURE 4 The schematic diagram of the inter-identity centre constraint. The points with the same shape denote the features belonging to the same identity, and the yellow points and the green points indicate the features belonging to RGB and infrared images, respectively. The red circles represent the centres
HE ET AL.

SYSU-MM01 [8] is a large-scale cross-modality person Re-ID dataset, which contains 301/3010 (single-shot/multi-shot) RGB images and 3803 IR images of 96 identities in the test set, and 22,258 RGB images and 11,909 IR images of 391 identities in the training set.
The RegDB cross-modality person Re-ID dataset [38] includes 8240 images of 412 identities. Each identity has 10 RGB images and 10 IR images. Two hundred and six identities are randomly selected from all 412 identities to construct the training set, and the remaining identities constitute the test set. RegDB provides two evaluation modes according to different modality match settings. One is Visible to Thermal (V-T), in which RGB query images are matched against IR images of the same identity, and the other is Thermal to Visible (T-V), in which IR query images are matched against RGB images of the same identity.
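The RegDB protocol above amounts to a random half split over identities. A minimal sketch follows; the function name `split_identities`, the fixed seed and the use of integer identity labels 0 to 411 are assumptions for illustration, not part of the dataset's official tooling.

```python
import random


def split_identities(identity_ids, num_train, seed=0):
    """Randomly split identities into disjoint training and test sets."""
    rng = random.Random(seed)
    shuffled = list(identity_ids)
    rng.shuffle(shuffled)
    return sorted(shuffled[:num_train]), sorted(shuffled[num_train:])


NUM_IDENTITIES = 412
IMAGES_PER_MODALITY = 10

# 412 identities x (10 RGB + 10 IR) images gives the 8240 images quoted above.
total_images = NUM_IDENTITIES * 2 * IMAGES_PER_MODALITY

train_ids, test_ids = split_identities(range(NUM_IDENTITIES), num_train=206)
```

Because the split is random, results on RegDB are commonly averaged over several such splits; whether and how many repetitions were used here is not stated in this section.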

| Implementation details
All the pedestrian images are resized to 288 × 144 and augmented by random horizontal flipping and random cropping. The batch size is set to 64: each batch contains four identities, and each identity contributes eight RGB images and eight IR images. The weight-shared FC layer in the Feature Extractor Module reduces the dimension of both the local features and the global features from 2048 to 512. The number of local features P is set to 6. Besides, we set the balance parameter λ in Equation (4) to 0.2. The weights of MCCL $\beta_1$, $\beta_2$ and $\beta_3$ in Equation (10) are set to 1, 0.5, and 0.5, respectively. The weights of different losses $\mu_1$, $\mu_2$ and $\mu_3$ in Equation (12) are set to 0.1, 1, and 0.5, respectively. To enhance the stability of graph learning, we adopt the multi-head attention strategy [29] in IGAT, and the number of heads is set to 4.
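The batch composition described above (4 identities × (8 RGB + 8 IR) = 64 images) can be sketched as an identity-balanced sampler. The function name `sample_batch`, the dictionary-based index format and the fallback to sampling with replacement when an identity has fewer than eight images are assumptions for illustration.

```python
import random


def sample_batch(rgb_index, ir_index, num_ids=4, per_modality=8, seed=None):
    """Draw one identity-balanced cross-modality batch.

    rgb_index and ir_index map identity -> list of image names.
    Returns a list of (identity, modality, image) tuples.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(rgb_index), num_ids)
    batch = []
    for pid in chosen:
        for modality, index in (("rgb", rgb_index), ("ir", ir_index)):
            images = index[pid]
            if len(images) >= per_modality:
                picks = rng.sample(images, per_modality)
            else:
                # ASSUMPTION: sample with replacement when images are scarce.
                picks = [rng.choice(images) for _ in range(per_modality)]
            batch.extend((pid, modality, img) for img in picks)
    return batch


# Toy index: 10 identities with 10 images per modality.
rgb_index = {pid: [f"rgb_{pid}_{k}" for k in range(10)] for pid in range(10)}
ir_index = {pid: [f"ir_{pid}_{k}" for k in range(10)] for pid in range(10)}
batch = sample_batch(rgb_index, ir_index, seed=0)
```

Keeping both modalities of every sampled identity in the same batch is what allows per-identity RGB and IR centres to be computed within a batch.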
The proposed network is optimised by the stochastic gradient descent (SGD) scheme [46]. The number of training epochs is set to 60. The initial learning rate is set to 0.01 and kept for the first 30 epochs. Afterwards, the learning rate is changed to 0.001 for the remaining epochs.
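The step schedule above can be written as a single function of the epoch index. This sketch assumes zero-based epoch numbering, so that epochs 0 to 29 use the initial rate; the function name `learning_rate` is illustrative.

```python
def learning_rate(epoch, base_lr=0.01, decay_epoch=30, decayed_lr=0.001):
    """Step schedule: base_lr before decay_epoch, decayed_lr from then on."""
    return base_lr if epoch < decay_epoch else decayed_lr


# Learning rate for each of the 60 training epochs (zero-based).
schedule = [learning_rate(epoch) for epoch in range(60)]
```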

| Ablation study
In this subsection, we design the ablation study to validate the effectiveness of each component of our method. We choose the most challenging single-shot setting on SYSU-MM01 and the V-T mode on RegDB to evaluate the performance. The results of the ablation study are shown in Table 1. BS refers to the baseline, which adopts the Feature Extractor Module supervised by the CE losses. BS + GAT denotes modelling the local correlation without considering the similarity between the global features, with attention weights computed by Equation (2).
For SYSU-MM01, it is obvious that the performance of our method (Ours) achieves the best results and the following conclusions can be drawn.
Effectiveness of IGAT. The performance of BS + GAT surpasses BS by 2.2% rank-1 accuracy and 2.4% mAP, which illustrates the importance of learning the correlation between local features. The performance of BS + IGAT further brings 2.3% and 1.7% increments on rank-1 accuracy and mAP compared with BS + GAT. It is because the proposed IGAT models the dependency between local features from the local and global aspects. Specifically, the IGAT module not only considers the correlation between local features but also injects global information when learning the correlation so as to obtain more precise attention weights, namely, the coarse-fine attention weights. Meanwhile, it further proves the effectiveness of adding global information to attention weights of the local features.
Effectiveness of MCCL. The performance of BS + $L_{tra}$, BS + $L_m$, BS + $L_{ter}$ (Equation (9)) and BS + $\tilde{L}_{ter}$ (Equation (8)) is in every case better than that of BS owing to the added centre constraints. The performance improves further when two or three different kinds of centre constraints are used. Hence, each component in MCCL helps the network to obtain higher performance, which demonstrates the effectiveness of MCCL.
Effectiveness of the margin for the inter-identity centre constraint. As shown in Figure 5, the loss curve of Equation (9) is smoother and decreases faster than that of Equation (8) because Equation (9) relaxes the margin restriction, which makes convergence in the training process more stable. Furthermore, in Table 1, BS + $L_{ter}$ (Equation (9)) improves Rank-1 and mAP compared with BS + $\tilde{L}_{ter}$ (Equation (8)), and so does BS + MCCL compared with BS + $L_{tra}$ + $L_m$ + $\tilde{L}_{ter}$. It can be concluded that narrowing the inter-identity margin, that is, using Equation (9), is more effective.

FIGURE 5 The loss trend of $L_{ter}$ (Equation (9)) and $\tilde{L}_{ter}$ (Equation (8)) in the training process
Through the above analysis, we can further confirm the effectiveness of our proposed method: the proposed IGAT learns the completed correlation between local features by considering both local detail and global information, and the proposed MCCL constrains the centres of modality and identity to optimise the similarity of features so as to explicitly overcome the influence of modality information. Note that similar conclusions can be drawn for the RegDB dataset.
Results on SYSU-MM01. As shown in Table 2, on SYSU-MM01 our method achieves 60.6% rank-1 accuracy and 60.3% mAP under the single-shot setting, and 65.5% rank-1 accuracy and 53.8% mAP under the multi-shot setting, which outperforms the compared state-of-the-art methods. Among the three methods that use local features (DDAG [18], TSLFN + HC [10] and FBP-AL [50]), DDAG mines intra-modality part-level context cues using local features and FBP-AL learns more fine-grained information by part representations, while our method learns the correlation between local features from the local and global aspects. TSLFN + HC adopts the intra-identity centre constraint to reduce the modality gap, while our method simultaneously utilises three different kinds of centre constraints. Hence, the performance of our method is better than that of the other local feature learning methods.
Results on RegDB. From Table 3, it can be seen that our method achieves 84.1% rank-1 accuracy and 75.4% mAP in the V-T mode and 83.1% rank-1 accuracy and 76.0% mAP in the T-V mode, which exceeds the second best method by a large margin. It demonstrates that our method possesses high generalisation ability to different cross-modality person Re-ID datasets.
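For readers unfamiliar with the two reported metrics, the sketch below computes rank-1 accuracy and mAP from a query-by-gallery distance matrix. It is a simplified illustration: dataset-specific rules (for example camera-based gallery filtering on SYSU-MM01) are deliberately omitted, and the function name `rank1_and_map` is hypothetical.

```python
def rank1_and_map(dist, query_ids, gallery_ids):
    """Return (rank-1 accuracy, mean average precision).

    dist[q][g] is the distance between query q and gallery item g.
    Queries without any true match in the gallery are skipped.
    """
    hits, average_precisions = 0, []
    for q, row in enumerate(dist):
        order = sorted(range(len(row)), key=lambda g: row[g])
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        if not any(matches):
            continue
        hits += int(matches[0])
        found, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        average_precisions.append(sum(precisions) / len(precisions))
    if not average_precisions:
        raise ValueError("no query has a matching gallery item")
    n = len(average_precisions)
    return hits / n, sum(average_precisions) / n


# Toy example: two queries against a gallery of three images.
dist = [
    [0.1, 0.5, 0.3],  # query 0 (identity 1): both true matches ranked first
    [0.2, 0.9, 0.4],  # query 1 (identity 2): true match ranked last
]
rank1, mean_ap = rank1_and_map(dist, query_ids=[1, 2], gallery_ids=[1, 2, 1])
```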
Recently, MPANet [48] modified the network backbone to build a new baseline, embedding attention mechanisms into the ResNet50 network and utilising mutual learning to enable different modalities to interact with each other. It achieves state-of-the-art performance for cross-modality person Re-ID. In contrast, our method does not change the backbone and only utilises the original ResNet50 network; furthermore, it does not focus on the interaction of the two modality streams. GLMC [47] applies the cross-entropy loss and the triplet loss to supervise the whole global branch, treating cross-modality person Re-ID as both a classification task and a ranking task, and focuses on learning the global and local features of pedestrians. Compared to Ref. [47], we only treat cross-modality person Re-ID as a classification task.
Since building the correlations among pedestrian features is beneficial for learning completed information, we focus on the learning of the correlations of pedestrian features. To this end, we propose IGAT to consider the correlation between local features via the graph structure. The IGAT module injects global information when learning the correlation between local features so as to obtain more precise attention weights, namely, the coarse-fine attention weights. Moreover, we propose MCCL to optimise the similarity between pedestrian images from different aspects by constraining different kinds of centres, so as to reduce the discrepancies among different modalities and make the features with the same identity compact and the features with the different identity far away.

| Parameter analysis
In this subsection, we conduct a series of experiments to study the influence of several key parameters of the proposed method, including the balance parameter of local detail and global information λ in Equation (4), the weights of the three components in MCCL $\beta_1$, $\beta_2$ and $\beta_3$ in Equation (10), and the weights of different losses $\mu_1$, $\mu_2$ and $\mu_3$ in Equation (12). The experiments are conducted under the single-shot setting on SYSU-MM01. The experimental results can be generalised to the multi-shot setting of SYSU-MM01 and to RegDB. The weight parameter λ controls the amount of global information in the aggregation process of local features. Figure 6 shows how the accuracy varies with the balance parameter λ. On the one hand, the performance improves as λ gradually increases, which indicates that global information is beneficial for the attention weights of the local features. On the other hand, the performance decreases when λ is larger than 0.2, which indicates that too much global information introduces erroneous interference into the aggregation of local features. In short, neither too much nor too little global information yields an accurate completed correlation, and both lead to suboptimal performance. Thus, we obtain the best results when λ is set to 0.2.
As shown in Figure 7 and Figure 8, for $\beta_1$, $\beta_2$ and $\beta_3$ in Equation (10) we fix two parameters to their optimal values and investigate the impact of the remaining one for the convenience of display. We apply the same procedure to investigate the impact of $\mu_1$, $\mu_2$ and $\mu_3$ in Equation (12), as shown in Table 4. The network performs best when $\beta_1$, $\beta_2$ and $\beta_3$ are set to 1, 0.5 and 0.5, respectively. The optimal values of $\mu_1$, $\mu_2$ and $\mu_3$ are 0.1, 1 and 0.5, respectively.

| Visualisation
In this subsection, we first visualise the similarity scores of RGB-IR positive and negative pairs, as shown in Figure 9. The difference between the distributions of RGB-IR positive and negative pairs for BS + IGAT is larger than that of BS, and therefore correct matching is more likely to occur. This demonstrates that IGAT is beneficial for learning discriminative features.
We also report the t-SNE [43] visualisation of 10 randomly selected identities on RegDB. The feature distributions of BS, BS + $L_{tra}$ and BS + MCCL are shown in Figure 10. Comparing BS + $L_{tra}$ with BS, we can see that the modality gap is largely alleviated. After adding the modality centre constraint and the inter-identity centre constraint, the distance between cross-modality features is reduced and the features with the same identity become more compact, which validates that the influence of the modality information is relieved by MCCL.
Note: R1, R10 and R20 denote Rank-1, Rank-10 and Rank-20 accuracies (%), respectively. Here, * means the multi-task learning is used, and ‡ indicates that the attention mechanism module is added to the backbone network and the mutual learning method is used.

| CONCLUSION
In this paper, we have proposed IGAT and MCCL for cross-modality person Re-ID. The proposed IGAT considers both local detail and global information to construct the completed correlation between local features. To explicitly overcome the influence of modality information, we propose MCCL, which constrains the centres of modality and identity to optimise the similarity of features. Extensive experimental results on two standard datasets have demonstrated that the proposed method surpasses the state-of-the-art methods. Moreover, the proposed method is good at handling heterogeneous data, and therefore we believe that our method has great potential to generalise to other related research fields, such as cross-modality image retrieval, domain adaptation image classification, and so on.

FIGURE 9 The visualisation of similarity score of RGB-IR positive and negative pairs on SYSU-MM01. The x axis is the similarity score of cross-modality image pair (the image pair with the same identity is called RGB-IR positive and the image pair with different identities is called RGB-IR negative). The y axis is the statistical normalisation value for each similarity score. IR, infrared
FIGURE 10 Visualisations of the feature distribution generated from different models. The colour and shape of points indicate the identity and modality, respectively. The figure is best viewed in colour with PDF magnification