Conventional microscopy focusing methods perform a time-consuming sweep through the Z-axis in order to estimate the focal plane. As an alternative, we developed a deep learning model that predicts in one shot the distance offset to the focal plane from any initial position, using as input only two images taken a set distance apart. The difference of these two images is processed by a regression CNN model, which was trained to learn a direct mapping between the amount of defocus aberration and the distance from the focal plane. A training dataset was acquired from a semiconductor sample at different surface locations and at different distances from focus. The ground-truth focal plane was determined using a parabolic autofocus algorithm with the Tenengrad scoring metric. The CNN model was tested on a bare semiconductor sample using the projected shape of the F-stop. The model determined the in-focus position with high reliability, and was also significantly faster than conventional methods that rely on classical computer vision. Furthermore, the rare cases where our algorithm does not find the focal plane can be detected, and a fine-focus algorithm can be applied to correct the result. Given sufficient training data, our deep learning focusing model provides a significantly faster alternative to conventional focusing methods.

The video captioning problem consists of describing a short video clip with natural language. Existing solutions tend to rely on extracting features from frames, or sets of frames, with pretrained and fixed Convolutional Neural Networks (CNNs). Traditionally, these CNNs are pretrained on the ImageNet-1K (IN1K) classification task.
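The ground-truth procedure described above (a parabolic autofocus fit over Tenengrad scores) can be sketched as follows. This is a minimal illustration, assuming grayscale images as NumPy arrays; the function names and the 3x3 Sobel kernels are standard choices, not taken from the original implementation.

```python
import numpy as np

def _conv2d(x, k):
    """Valid-mode 2D convolution of a single-channel image with kernel k."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def tenengrad(img):
    """Tenengrad focus score: mean squared Sobel gradient magnitude.
    Sharper (in-focus) images yield higher scores."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    gx = _conv2d(img, kx)
    gy = _conv2d(img, ky)
    return np.mean(gx ** 2 + gy ** 2)

def parabolic_focus(z_positions, scores):
    """Fit a parabola to (z, score) samples from a Z-sweep and return the
    vertex z, i.e. the estimated in-focus position where the score peaks."""
    a, b, _ = np.polyfit(z_positions, scores, 2)
    return -b / (2.0 * a)
```

In use, one would score a handful of images taken at known Z positions around the expected focus and feed the (position, score) pairs to `parabolic_focus`; the parabolic fit interpolates the peak between sampled positions.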
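The one-shot regression step, where the difference of two images taken a set distance apart is mapped to a defocus offset, might be structured as below. This is a forward-pass sketch only, with random untrained weights standing in for the learned ones; the filter count, kernel size, and pooling choice are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative, untrained parameters. A real model would learn these from the
# (difference image, distance-from-focus) training pairs described above.
KERNELS = rng.standard_normal((4, 5, 5)) * 0.1   # 4 convolutional filters
W_OUT = rng.standard_normal(4) * 0.1             # linear regression head
B_OUT = 0.0

def _conv2d(x, k):
    """Valid-mode 2D convolution of a single-channel image with kernel k."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def predict_defocus(img_a, img_b):
    """One-shot defocus regression: conv -> ReLU -> global average pooling
    -> linear head, applied to the difference of the two input images."""
    diff = img_a.astype(float) - img_b.astype(float)
    pooled = np.array([np.maximum(_conv2d(diff, k), 0.0).mean()
                       for k in KERNELS])
    return float(pooled @ W_OUT + B_OUT)
```

The key design point is that the model sees only the *difference* of the two images, so the defocus signal (how blur changes between the two Z positions) dominates over scene content, letting a single prediction replace the full Z-sweep.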
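The fixed-pretrained-CNN feature extraction used by existing captioning pipelines can be sketched with PyTorch/torchvision. Note the hedges: `weights=None` keeps this sketch self-contained and offline, whereas in practice one would load the ImageNet-1K pretrained weights; ResNet-18 and the 224x224 input size are illustrative choices, and `extract_frame_features` is a hypothetical helper name.

```python
import torch
import torchvision

# Frozen CNN backbone. weights=None avoids a download here; in practice the
# ImageNet-1K pretrained weights would be loaded instead.
backbone = torchvision.models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep 512-d features
backbone.eval()                    # the CNN stays fixed (no fine-tuning)

def extract_frame_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 224, 224) normalized clip -> (T, 512) feature matrix,
    to be consumed by a downstream captioning decoder."""
    with torch.no_grad():
        return backbone(frames)
```

Because the backbone is fixed, these features can be precomputed once per clip and cached, which is why this design remains common in video captioning systems.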