MAC-I2P is composed of two components: modality approximation and cone-block-point matching. First, modality discrepancies are mitigated by introducing image depth estimation and point cloud voxelization. Cross-modality feature embedding is then applied to extract cross-modality features from both the image and the point cloud. Given these features, pixel-point correspondences are established by cone-block-point matching. Finally, the relative pose is inferred through pose estimation. A simplified sketch of this data flow is given below.
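To make the pipeline concrete, the following is a minimal structural sketch of the four stages (modality approximation, cross-modality embedding, matching, pose estimation) in PyTorch. All module names, layer choices, and tensor shapes here are illustrative assumptions, not the authors' implementation; in particular, cone-block-point matching is reduced to a plain nearest-neighbour feature match as a stand-in for the actual hierarchical scheme.

```python
# Illustrative sketch of the MAC-I2P data flow; all internals are placeholders.
import torch
import torch.nn as nn


class ModalityApproximation(nn.Module):
    """Narrow the modality gap: estimate depth for the image, voxelize the point cloud."""

    def __init__(self, voxel_size: float = 0.3):
        super().__init__()
        self.voxel_size = voxel_size
        # Placeholder monocular depth head (a real system would use a trained depth network).
        self.depth_head = nn.Conv2d(3, 1, kernel_size=3, padding=1)

    def forward(self, image, points):
        depth = self.depth_head(image)                    # (B, 1, H, W) pseudo depth
        voxels = torch.round(points / self.voxel_size)    # (B, N, 3) voxel indices
        return depth, voxels


class CrossModalityEmbedding(nn.Module):
    """Embed pixels and points into a shared feature space."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.img_encoder = nn.Conv2d(4, dim, kernel_size=3, padding=1)  # RGB + depth
        self.pc_encoder = nn.Linear(6, dim)                             # xyz + voxel index

    def forward(self, image, depth, points, voxels):
        img_feat = self.img_encoder(torch.cat([image, depth], dim=1))   # (B, D, H, W)
        pc_feat = self.pc_encoder(torch.cat([points, voxels], dim=-1))  # (B, N, D)
        return img_feat, pc_feat


def match_pixels_to_points(img_feat, pc_feat, k: int = 1):
    """Stand-in for cone-block-point matching: nearest-neighbour feature matching."""
    pix = img_feat.flatten(2).transpose(1, 2)        # (B, H*W, D) per-pixel features
    sim = pix @ pc_feat.transpose(1, 2)              # (B, H*W, N) similarity scores
    return sim.topk(k, dim=-1).indices               # matched point index per pixel


if __name__ == "__main__":
    image = torch.rand(1, 3, 32, 32)
    points = torch.rand(1, 1024, 3) * 10.0
    depth, voxels = ModalityApproximation()(image, points)
    img_feat, pc_feat = CrossModalityEmbedding()(image, depth, points, voxels)
    matches = match_pixels_to_points(img_feat, pc_feat)
    # The resulting pixel-point correspondences would then be passed to a pose
    # estimator (e.g., a PnP solver with RANSAC) to recover the relative pose.
    print(matches.shape)  # (1, 1024, 1)
```

The sketch only illustrates how the outputs of each stage feed the next; the actual networks, voxelization scheme, and matching hierarchy are defined in the corresponding subsections.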