Abstract:
In this thesis we propose CSRNet, a data-driven, deep-learning-based model for understanding highly congested scenes. It consists of two major components: a convolutional neural network (CNN) front-end for 2D feature extraction and a dilated CNN back-end, which uses dilated kernels to deliver larger receptive fields and to replace pooling operations. CSRNet is easy to train because of its purely convolutional structure.
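As a quick illustration of the dilated kernels mentioned above, the sketch below (PyTorch is assumed here; the abstract does not fix a framework) shows that a 3x3 convolution with dilation rate 2 covers a 5x5 neighborhood while preserving the spatial resolution that a pooling layer would halve.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 128, 128)  # dummy feature map

    # 3x3 kernel with dilation rate 2: 5x5 receptive field, no resolution loss
    dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

    # pooling also enlarges the receptive field, but halves the resolution
    pooled = nn.MaxPool2d(kernel_size=2, stride=2)

    print(dilated(x).shape)  # torch.Size([1, 64, 128, 128]) -- resolution preserved
    print(pooled(x).shape)   # torch.Size([1, 64, 64, 64])   -- resolution halved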
The ShanghaiTech dataset is used for training and testing. We choose VGG-16 as the front-end of CSRNet because of its strong transfer-learning ability and its flexible architecture, which makes it easy to concatenate the back-end for density map generation. The output size of this front-end network is 1/8 of the original input size; stacking more convolutional and pooling layers would shrink the output further and make it hard to generate high-quality density maps.
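A minimal sketch of this architecture in PyTorch is given below. The front-end reuses the first ten convolutional layers of VGG-16 (three max-pooling layers, hence the 1/8 output size); the back-end channel configuration and dilation rate 2 follow the commonly reported CSRNet configuration and are assumptions of this sketch rather than a definitive implementation.

    import torch
    import torch.nn as nn
    from torchvision import models

    class CSRNet(nn.Module):
        def __init__(self):
            super().__init__()
            # Front-end: first 10 conv layers of VGG-16 (up to conv4_3),
            # containing 3 max-pooling layers -> output is 1/8 of the input size.
            vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
            self.frontend = nn.Sequential(*list(vgg.features.children())[:23])

            # Back-end: dilated 3x3 convolutions (dilation rate 2) enlarge the
            # receptive field without any further pooling.
            def block(in_c, out_c):
                return [nn.Conv2d(in_c, out_c, 3, padding=2, dilation=2),
                        nn.ReLU(inplace=True)]
            self.backend = nn.Sequential(
                *block(512, 512), *block(512, 512), *block(512, 512),
                *block(512, 256), *block(256, 128), *block(128, 64))

            # 1x1 convolution maps the features to a one-channel density map.
            self.output_layer = nn.Conv2d(64, 1, kernel_size=1)

        def forward(self, x):
            x = self.frontend(x)          # (N, 512, H/8, W/8)
            x = self.backend(x)           # (N, 64, H/8, W/8)
            return self.output_layer(x)   # (N, 1, H/8, W/8) density map

    model = CSRNet()
    print(model(torch.randn(1, 3, 384, 512)).shape)  # torch.Size([1, 1, 48, 64])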
Ground-truth density maps are generated by blurring every head annotation with a Gaussian kernel adapted to the average head size, and PSNR and SSIM are used to evaluate the quality of the output density map on the ShanghaiTech Part A dataset; this evaluation includes resizing the density maps with interpolation and normalizing both the ground-truth and the predicted density map.
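The sketch below shows how such a ground-truth density map and the PSNR/SSIM evaluation fit together, assuming NumPy, SciPy, and scikit-image; the fixed sigma and the head_points coordinates are illustrative placeholders, since the actual kernel width is adapted to the average head size.

    import numpy as np
    from scipy.ndimage import gaussian_filter, zoom
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def make_density_map(shape, head_points, sigma=4.0):
        # Place a unit impulse at every annotated head and blur it with a
        # Gaussian kernel; sigma stands in for the adaptive kernel width.
        density = np.zeros(shape, dtype=np.float32)
        for x, y in head_points:
            density[int(y), int(x)] += 1.0
        return gaussian_filter(density, sigma)  # integral ~ number of heads

    def evaluate_quality(pred, gt):
        # Resize the prediction to the ground-truth size with interpolation,
        # normalize both maps to [0, 1], then report PSNR and SSIM.
        pred = zoom(pred, (gt.shape[0] / pred.shape[0], gt.shape[1] / pred.shape[1]))
        pred = (pred - pred.min()) / (pred.max() - pred.min() + 1e-8)
        gt = (gt - gt.min()) / (gt.max() - gt.min() + 1e-8)
        return (peak_signal_noise_ratio(gt, pred, data_range=1.0),
                structural_similarity(gt, pred, data_range=1.0))

    gt = make_density_map((384, 512), [(100.5, 40.2), (250.0, 300.7)])
    pred = make_density_map((48, 64), [(12.0, 5.0), (31.0, 37.0)])
    psnr, ssim = evaluate_quality(pred, gt)
    print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.3f}")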
The perspectives of the images are not fixed, and the images are collected from very different scenarios. The Grid Average Mean Absolute Error (GAME) metric is used for evaluation in this test, and CSRNet achieves a significant improvement on four different crowd counting datasets, delivering state-of-the-art performance.
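To make the GAME metric concrete: GAME(L) splits an image into 4^L non-overlapping regions (a 2^L by 2^L grid) and sums the absolute counting error per region, so GAME(0) reduces to the ordinary absolute counting error. A minimal NumPy sketch, assuming density maps whose integrals give the counts:

    import numpy as np

    def game(pred, gt, L):
        # Split the maps into a 2^L x 2^L grid and sum per-region count errors;
        # the dataset-level score averages this value over all test images.
        cells = 2 ** L
        h, w = gt.shape
        error = 0.0
        for i in range(cells):
            for j in range(cells):
                ys = slice(i * h // cells, (i + 1) * h // cells)
                xs = slice(j * w // cells, (j + 1) * w // cells)
                error += abs(pred[ys, xs].sum() - gt[ys, xs].sum())
        return error

    pred = np.random.rand(48, 64)
    gt = np.random.rand(48, 64)
    for L in range(4):
        # Larger L penalizes predictions that get the total count right
        # but put the density in the wrong place.
        print(f"GAME({L}) = {game(pred, gt, L):.2f}")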