This paper examines the visual differentiation of city districts and quantifies the legibility of the cityscape. Much of the research on district legibility has relied on interviewing subjects and having them draw maps of how they understand the city. The question we ask here is whether the same ends can be achieved by quantitatively identifying the visual features that make a district unique. For this purpose, we apply a deep convolutional neural network (DCNN) to a large-scale dataset collected through Google Street View (GSV). The DCNN segments the urban elements in each image. Comparing the segmentation results elucidates the degree of visual heterogeneity, and an unsupervised clustering analysis identifies the optimal number of visual groups across the city. The results are broadly consistent with those previously obtained by other methods, indicating that the machine's eye can capture the visual similarities among territories.
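To make the described pipeline concrete, the following is a minimal sketch of one plausible implementation: each GSV image is reduced to a vector of per-class pixel fractions from a semantic segmentation model, and the number of district clusters is chosen by silhouette analysis. The model choice (torchvision's DeepLabV3, used here as a stand-in; the paper's network would be trained on urban scene classes), the `gsv_images` directory, and the range of candidate k values are illustrative assumptions, not the authors' actual setup.

```python
# Sketch: class-fraction features from semantic segmentation,
# then k-means with silhouette analysis to pick the cluster count.
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

# Stand-in segmentation network (21 Pascal VOC classes); an urban
# pipeline would use a model trained on street-scene classes instead.
model = deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.Resize((520, 520)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def class_fractions(image_path: Path, n_classes: int = 21) -> np.ndarray:
    """Fraction of pixels assigned to each class: one feature vector per image."""
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        logits = model(preprocess(img).unsqueeze(0))["out"][0]
    labels = logits.argmax(dim=0).flatten()
    counts = torch.bincount(labels, minlength=n_classes).float()
    return (counts / counts.sum()).numpy()

# One feature vector per GSV image (hypothetical directory layout).
features = np.stack([class_fractions(p)
                     for p in sorted(Path("gsv_images").glob("*.jpg"))])

# Choose the number of visual groups by silhouette score.
best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    score = silhouette_score(features, labels)
    if score > best_score:
        best_k, best_score = k, score
print(f"best k = {best_k} (silhouette = {best_score:.3f})")
```

In this reading, images whose class-fraction vectors cluster together are visually homogeneous, and the silhouette-optimal k plays the role of the "optimized number" of district groups mentioned in the abstract.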