Accurate acquisition of two-dimensional building contours is of great significance for three-dimensional building reconstruction, urban change detection, and disaster emergency response. As the number of high-resolution remote sensing satellites grows, the high-spatial-resolution imagery they provide expresses the texture differences between ground features more fully, offering strong data support for building extraction. However, building extraction from remote sensing images often suffers from the loss of edge information and discontinuous contours for large buildings, the "hollow" phenomenon, and missed or false detections of small buildings. To address these problems, this study proposes CBAM VGG16-UNet, a building extraction network incorporating the CBAM dual-attention mechanism. The network is based on the U-Net architecture. In the downsampling path, the U-Net encoder is replaced with the first five convolutional blocks of VGG16 to deepen the network while reducing the number of parameters. The CBAM dual-attention mechanism is introduced at each feature-fusion step in the upsampling path, and the transposed convolutions of U-Net are replaced with bilinear interpolation to improve the network's feature-extraction ability. The model was validated on the WHU building dataset and a self-made Guiyang building dataset, with three common building extraction networks, Mobile-UNet, U-Net, and VGG16-UNet, as comparison models, yielding eight sets of experimental results from the four networks on the two datasets.
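The CBAM module applied at each fusion step can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: channel attention uses a shared two-layer MLP on global average- and max-pooled descriptors, while the spatial branch is simplified here to a weighted mix of the channel-wise average and max maps (the original CBAM uses a 7x7 convolution); all weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feature, w1, w2, w_spatial):
    """CBAM-style attention on a feature map of shape (C, H, W).

    Channel attention first, then spatial attention, each producing
    weights in (0, 1) that rescale the feature map.
    """
    # --- channel attention: shared MLP on pooled channel descriptors ---
    avg_desc = feature.mean(axis=(1, 2))              # (C,)
    max_desc = feature.max(axis=(1, 2))               # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)      # shared 2-layer MLP
    ch_att = sigmoid(mlp(avg_desc) + mlp(max_desc))   # (C,)
    feature = feature * ch_att[:, None, None]
    # --- spatial attention: mix of channel-wise average and max maps ---
    avg_map = feature.mean(axis=0)                    # (H, W)
    max_map = feature.max(axis=0)                     # (H, W)
    sp_att = sigmoid(w_spatial[0] * avg_map + w_spatial[1] * max_map)
    return feature * sp_att[None, :, :]

rng = np.random.default_rng(0)
C, r = 8, 2                                 # channels, reduction ratio
x = rng.standard_normal((C, 4, 4))
w1 = rng.standard_normal((C // r, C))       # reduction layer
w2 = rng.standard_normal((C, C // r))       # expansion layer
out = cbam(x, w1, w2, np.array([0.5, 0.5]))
print(out.shape)                            # (8, 4, 4)
```

Because both attention maps lie in (0, 1), the module can only rescale (never amplify) activations, which is what lets the network emphasize informative channels and locations during feature fusion.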
Four common semantic segmentation metrics, Precision, Recall, F1-score, and IoU, were used to quantitatively analyze the experimental results, and visual interpretation was used to compare the extraction maps produced by the four networks for five buildings of widely differing scales selected from each dataset. The experiments show that CBAM VGG16-UNet achieves a precision, recall, F1-score, and IoU of 94.90%, 95.46%, 95.18%, and 90.80% on the WHU building dataset, and 77.53%, 84.46%, 80.85%, and 67.85% on the Guiyang building dataset, outperforming the three comparison models on both datasets. This study provides a new approach to the common problems of building extraction and has practical engineering value.
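The four metrics above are all derived from the per-pixel confusion counts between a binary prediction and its ground-truth mask. A short sketch (the function name and toy masks are illustrative, not from the paper):

```python
import numpy as np

def building_metrics(pred, gt):
    """Precision, Recall, F1-score, and IoU from binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # building pixels found
    fp = np.logical_and(pred, ~gt).sum()   # false alarms
    fn = np.logical_and(~pred, gt).sum()   # missed building pixels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
p, r, f1, iou = building_metrics(pred, gt)
# tp=2, fp=1, fn=1 -> precision = recall = f1 = 2/3, iou = 0.5
```

Note that F1 and IoU are linked by IoU = F1 / (2 - F1), which is why the reported 95.18% F1 on WHU corresponds to 90.80% IoU, and 80.85% F1 on Guiyang to 67.85% IoU.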