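Building an Image Caption Generator with CNN and LSTM

Thanks to the original article for reference: http://bjbsair.com/2020-04-01/tech-info/18508.html

When you look at an image, your brain can easily tell what it shows, but can a computer do the same? Computer vision researchers worked on this problem for a long time, and it was long considered out of reach. With advances in deep learning, the availability of large datasets, and more computing power, we can now build models that generate captions for images.

That is what we will implement in this project, using two deep learning techniques together: a convolutional neural network (CNN) and a type of recurrent neural network, the LSTM.

What is an image caption generator?

Image caption generation is a task that draws on both computer vision and natural language processing: the model has to recognize the content of an image and describe it in natural language.

The goal of this project is to learn the concepts behind CNN and LSTM models and to build a working image caption generator by combining a CNN with an LSTM.

In this project we implement the caption generator with a CNN (convolutional neural network) and an LSTM (long short-term memory). Image features are extracted with Xception, a CNN model trained on the ImageNet dataset, and then fed into the LSTM model, which is responsible for generating the caption.

The dataset

For the image caption generator we will use the Flickr_8K dataset. There are larger datasets, such as Flickr_30K and MSCOCO, but training a network on them can take weeks, so we use the small Flickr8k dataset. The advantage of a larger dataset is that it lets us build a better model.

Prerequisites

We will need the following libraries: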

tensorflow
keras
pillow
numpy
tqdm
jupyterlab

1. First, we import all the required libraries

import string
import os
from pickle import dump, load
import numpy as np
from PIL import Image
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
# add is exported from keras.layers; the keras.layers.merge module path is version-dependent
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from keras.models import Model, load_model
# small library for showing the progress of loops (notebook-friendly progress bar)
from tqdm.notebook import tqdm

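2. Loading and cleaning the data

Each line of the caption file holds an image name and a caption separated by a tab; lines are separated by a newline ("\n").

Each image has 5 captions, and each caption is tagged with a number from #0 to #4 appended to the image name.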
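To make the file layout concrete, here is a minimal sketch that splits one line into its parts. The sample line and the helper name parse_caption_line are made up for illustration; only the "image name, #index, tab, caption" layout comes from the description above.

```python
# Made-up sample line in the "<image name>#<index><TAB><caption>" layout described above.
sample_line = "example_image.jpg#3\tA dog runs across the grass ."

def parse_caption_line(line):
    """Split one line into (image name, caption index, caption text)."""
    image_tag, caption = line.split("\t", 1)
    # rsplit so that only the final "#<index>" suffix is cut off
    image_name, index = image_tag.rsplit("#", 1)
    return image_name, int(index), caption

print(parse_caption_line(sample_line))
```

The all_img_captions() listing gets the same image name with img[:-2], which works because the index is always a single digit.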

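We will define 5 functions:

load_doc(filename) – loads the document file and reads its contents into a string.
all_img_captions(filename) – creates a descriptions dictionary that maps each image to a list of its 5 captions.
cleaning_text(descriptions) – takes all the descriptions and cleans them. This is an important step when working with text data, and the kind of cleaning depends on the goal. In our case we remove punctuation, convert all text to lowercase, and remove words that contain numbers.
text_vocabulary(descriptions) – a simple function that collects all unique words from the descriptions to build the vocabulary.
save_descriptions(descriptions, filename) – builds a list of all the preprocessed descriptions and stores them in a file. We will create a descriptions.txt file to hold all the captions.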
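As a small illustration of the cleaning rules, the sketch below applies them to a single caption. The helper name clean_caption and the sample sentence are assumptions for this example; the steps (split hyphens, lowercase, strip punctuation, drop one-letter tokens and tokens with digits) follow the description of cleaning_text above.

```python
import string

def clean_caption(caption):
    """Apply the cleaning steps listed above to a single caption string."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.replace("-", " ").split()
    words = [w.lower().translate(table) for w in words]
    # drop one-letter leftovers (such as "a") and tokens that contain digits or are empty
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return " ".join(words)

print(clean_caption("A well-dressed man's 2 dogs run !"))
# prints: well dressed mans dogs run
```

Note that the apostrophe is simply deleted, so "man's" becomes "mans" and is not split into two words.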

# Loading a text file into memory
def load_doc(filename):
    # open the file read-only; the with block closes it afterwards
    with open(filename, 'r') as file:
        text = file.read()
    return text
# get all imgs with their captions
def all_img_captions(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        # img looks like "name.jpg#0"; img[:-2] drops the "#n" suffix
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions
# Data cleaning: lowercasing, removing punctuation and words containing numbers
def cleaning_text(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            # str.replace returns a new string, so assign the result
            img_caption = img_caption.replace("-", " ")
            desc = img_caption.split()
            # converts to lowercase
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [word.translate(table) for word in desc]
            # remove hanging 's and a
            desc = [word for word in desc if len(word) > 1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # convert back to string
            img_caption = ' '.join(desc)
            captions[img][i] = img_caption
    return captions
def text_vocabulary(descriptions):
    # build vocabulary of all unique words
    vocab = set()
    for key in descriptions.keys():
        for d in descriptions[key]:
            vocab.update(d.split())
    return vocab
# All descriptions in one file
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    with open(filename, "w") as file:
        file.write(data)
# Set these paths according to the project folder on your system
# (raw strings keep the Windows backslashes from being read as escape sequences)
dataset_text = r"D:\dataflair projects\Project - Image Caption Generator\Flickr_8k_text"
dataset_images = r"D:\dataflair projects\Project - Image Caption Generator\Flicker8k_Dataset"
# we prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
# loading the file that contains all data
# mapping them into descriptions dictionary img to 5 captions
descriptions = all_img_captions(filename)
print("Length of descriptions =", len(descriptions))
# cleaning the descriptions
clean_descriptions = cleaning_text(descriptions)
# building vocabulary
vocabulary = text_vocabulary(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))
# saving each description to file
save_descriptions(clean_descriptions, "descriptions.txt")

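3. Extracting the feature vector from all images

This technique is also called transfer learning: instead of training everything ourselves, we take a model that has already been trained on a large dataset, extract features with it, and use those features for our task. We use the Xception model, which was trained on the ImageNet dataset with its 1000 classes. We can import this model directly from keras.applications. Because Xception was originally built for ImageNet classification, we make a small change before integrating it with our model: we remove the final classification layer and get a 2048-element feature vector for each image. One thing to note is that Xception takes images of size 299×299×3 as input.

model = Xception(include_top=False, pooling='avg')

The function extract_features() extracts the features of all images and maps each image name to its feature array. We then dump the features dictionary into the "features.p" pickle file.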
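Before the image goes into Xception, the extract_features() listing divides every pixel by 127.5 and subtracts 1. The short sketch below works through that arithmetic; the helper name scale_pixel is hypothetical and only illustrates the mapping from the 8-bit range [0, 255] to [-1, 1]. To the best of my knowledge this is the same range that the commented-out preprocess_input call targets.

```python
def scale_pixel(value):
    """Map an 8-bit channel value from [0, 255] to [-1.0, 1.0]."""
    return value / 127.5 - 1.0

for v in (0, 127.5, 255):
    print(v, "->", scale_pixel(v))
```

Black (0) maps to -1.0, the midpoint 127.5 maps to 0.0, and white (255) maps to 1.0.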

def extract_features(directory):
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for img in tqdm(os.listdir(directory)):
        filename = os.path.join(directory, img)
        image = Image.open(filename)
        image = image.resize((299, 299))
        # add a batch dimension: (299, 299, 3) -> (1, 299, 299, 3)
        image = np.expand_dims(image, axis=0)
        # scale pixel values from [0, 255] to [-1, 1]
        #image = preprocess_input(image)
        image = image / 127.5
        image = image - 1.0
        feature = model.predict(image)
        features[img] = feature
    return features
# 2048-element feature vector for every image
features = extract_features(dataset_images)
dump(features, open("features.p", "wb"))

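Depending on your system, this process can take a long time. Once features.p has been written, later runs can load it directly instead of extracting the features again: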

features = load(open("features.p","rb"))

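4. Loading the dataset for training the model

In the Flickr_8k_text folder we have the Flickr_8k.trainImages.txt file, which contains a list of 6000 image names used for training.

To load the training dataset we need a few more functions:

load_photos(filename) – loads the text file as a string and returns the list of image names.
load_clean_descriptions(filename, photos) – creates a dictionary containing the captions for each photo in the photo list. We also wrap each caption in <start> and <end> identifiers so that our LSTM model can recognize where a caption begins and ends.
load_features(photos) – returns a dictionary of image names and their feature vectors, previously extracted with the Xception model.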

#load the data
def load_photos(filename):
    file = load_doc(filename)
    photos = file.split("\n")[:-1]
    return photos
def load_clean_descriptions(filename, photos):
    #loading clean_descriptions
    file = load_doc(filename)
    descriptions = {}
    for line in file.split("\n"):
        words = line.split()
        if len(words)<1 :
            continue
        image, image_caption = words[0], words[1:]
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)
    return descriptions
def load_features(photos):
    #loading all features
    all_features = load(open("features.p","rb"))
    #selecting only needed features
    features = {k:all_features[k] for k in photos}
    return features
filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"
#train = loading_data(filename)
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)

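5. Tokenizing the vocabulary

We map each word in the vocabulary to a unique index value. The Keras library provides the Tokenizer class for this; we use it to create tokens from our vocabulary and save it to the "tokenizer.p" pickle file.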
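To show what "mapping each word to a unique index" means, here is a plain-Python stand-in. It is not the Keras Tokenizer itself: build_word_index and the two sample captions are made up for illustration. It does follow the same numbering idea, where the most frequent word gets index 1 and index 0 is left unused.

```python
from collections import Counter

def build_word_index(captions):
    """Assign each word an integer index, most frequent word first, starting at 1."""
    counts = Counter(word for caption in captions for word in caption.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

captions = ["start dog runs end", "start dog sleeps end"]
word_index = build_word_index(captions)
# index 0 is never assigned to a word, so it stays free for padding
vocab_size = len(word_index) + 1
encoded = [word_index[w] for w in captions[0].split()]
print(word_index, vocab_size, encoded)
```

Because index 0 is reserved, the vocabulary size passed to the model is the number of distinct words plus one.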

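# flatten the descriptions dictionary into a plain list of captions
def dict_to_list(descriptions):
    return [d for caps in descriptions.values() for d in caps]
# this copy omits the tokenizer listing; a minimal reconstruction with Keras Tokenizer defaults
tokenizer = Tokenizer()
tokenizer.fit_on_texts(dict_to_list(train_descriptions))
dump(tokenizer, open("tokenizer.p", "wb"))
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
# calculate maximum length of descriptions (named so it does not clash with the max_length variable)
def calc_max_length(descriptions):
    return max(len(d.split()) for d in dict_to_list(descriptions))

max_length = calc_max_length(descriptions)
max_length

Our vocabulary size (vocab_size) is 7577.

We compute the maximum length of the descriptions, which is needed to fix the shape of the model's inputs. The maximum description length is 32.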

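6. Creating the data generator

Let us first look at what the model's inputs and outputs look like. To turn this into a supervised learning task, we have to give the model inputs and target outputs to train on. We train the model on 6000 images; each image is represented by a 2048-element feature vector, and each caption is also encoded as numbers. The training pairs built from these 6000 images are too large to hold in memory all at once, so we use a generator that yields them in batches.

The generator yields the input and output sequences.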
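The core idea is that one caption is expanded into several training pairs: every prefix of the caption is an input, and the word that follows it is the target. The sketch below shows this in plain Python, with a made-up encoded caption and a hypothetical helper make_pairs. It pads on the left with zeros, which is what pad_sequences does by default.

```python
def make_pairs(seq, max_length):
    """Turn one encoded caption into (left-padded prefix, next word) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        prefix, target = seq[:i], seq[i]
        padded = [0] * (max_length - len(prefix)) + prefix
        pairs.append((padded, target))
    return pairs

# made-up encoding of a caption such as "start dog runs end"
for padded, target in make_pairs([1, 2, 4, 3], max_length=5):
    print(padded, "->", target)
```

A caption of n tokens yields n - 1 pairs. The 47 in the shape comment of the listing is presumably the number of such pairs across the five captions of the first image.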

#create input-output sequence pairs from the image description.
#data generator, used by model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length):
    while 1:
        for key, description_list in descriptions.items():
            #retrieve photo features
            feature = features[key][0]
            input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, feature)
            # yield an (inputs, target) tuple; some TensorFlow 2.x versions misread a plain list here
            yield ([input_image, input_sequence], output_word)
def create_sequences(tokenizer, max_length, desc_list, feature):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)
#You can check the shape of the input and output for your model
[a,b],c = next(data_generator(train_descriptions, features, tokenizer, max_length))
a.shape, b.shape, c.shape
#((47, 2048), (47, 32), (47, 7577))

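7. Defining the CNN-RNN model

To define the structure of the model we use the Model class from the Keras Functional API. It consists of three main parts:

Feature Extractor – the features extracted from the image have size 2048; a dense layer reduces them to 256 nodes.
Sequence Processor – an embedding layer handles the text input, followed by the LSTM layer.
Decoder – we merge the outputs of the two parts above and pass them through a dense layer to make the final prediction. The final layer has as many nodes as our vocabulary size.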
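To get a feel for the size of these three parts, the sketch below counts trainable parameters by hand. It assumes the layer sizes used in define_model (256 units throughout, a 2048-element image feature) and the vocabulary size of 7577 reported earlier; the helper names are made up, and the formulas are the standard ones for dense and LSTM layers.

```python
def dense_params(n_in, n_out):
    # one weight per input/output pair plus one bias per output
    return n_in * n_out + n_out

def lstm_params(n_in, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((n_in + units) * units + units)

vocab_size, feat_dim, hidden = 7577, 2048, 256
counts = {
    "feature dense": dense_params(feat_dim, hidden),
    "embedding": vocab_size * hidden,
    "lstm": lstm_params(hidden, hidden),
    "decoder dense": dense_params(hidden, hidden),
    "output dense": dense_params(hidden, vocab_size),
}
total = sum(counts.values())
print(counts, total)
```

Under these assumptions the total comes to 5,002,649 parameters, and most of them sit in the embedding and the output layer, both of which scale with the vocabulary size.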

The final model is defined as follows; plot_model also saves a diagram of it to model.png:

from keras.utils import plot_model
# define the captioning model
def define_model(vocab_size, max_length):
    # features from the CNN model squeezed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # LSTM sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Merging both models
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model (summary() prints by itself and returns None)
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

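8. Training the model

To train the model we use the 6000 training images, generating the input and output sequences in batches and fitting them to the model with model.fit_generator(). We also save the model to our models folder after each epoch.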

# train our model
print('Dataset: ', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length: ', max_length)
model = define_model(vocab_size, max_length)
epochs = 10
# the generator yields one image's sequences per step, so one epoch is one pass over all images
steps = len(train_descriptions)
# making a directory models to save our models (no error if it already exists)
os.makedirs("models", exist_ok=True)
for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    # note: fit_generator is deprecated in recent TensorFlow releases, where model.fit accepts generators
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models/model_" + str(i) + ".h5")

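9. Testing the model

The model has been trained. Now we create a separate file, testing_caption_generator.py, which loads the model and generates predictions. Each prediction is an index value, produced one word at a time up to the maximum length, so we use the same tokenizer.p pickle file to map the index values back to words.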
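The caption is built greedily: at every step the most likely next word is appended, until the end marker appears or the maximum length is reached. The sketch below shows that loop in isolation. FAKE_SCORES is a hypothetical stand-in for the trained model and looks only at the last word, whereas the real model conditions on the image features and the whole prefix.

```python
# Hypothetical stand-in for the trained model: maps the last word to next-word scores.
FAKE_SCORES = {
    "start": {"two": 0.7, "girls": 0.2, "end": 0.1},
    "two": {"girls": 0.8, "end": 0.2},
    "girls": {"playing": 0.6, "end": 0.4},
    "playing": {"end": 0.9, "girls": 0.1},
}

def greedy_decode(scores, max_length):
    """Repeatedly append the highest-scoring next word until 'end' or max_length."""
    words = ["start"]
    for _ in range(max_length):
        candidates = scores.get(words[-1])
        if not candidates:
            break
        best = max(candidates, key=candidates.get)
        words.append(best)
        if best == "end":
            break
    return " ".join(words)

print(greedy_decode(FAKE_SCORES, max_length=32))
# prints: start two girls playing end
```

Greedy decoding is simple and fast, but it never revisits an earlier choice, so a poor early word cannot be corrected later.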

import argparse
from pickle import load
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from keras.applications.xception import Xception
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']
def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except OSError:
        # stop here: without an image there is nothing to describe
        raise SystemExit("ERROR: Couldn't open image! Make sure the image path and extension are correct")
    image = image.resize((299, 299))
    image = np.array(image)
    # for images that have 4 channels, we keep only the first 3
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    # scale pixel values from [0, 255] to [-1, 1]
    image = image / 127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature
def word_for_id(integer, tokenizer):
    # reverse lookup: find the word whose index equals integer
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
def generate_desc(model, tokenizer, photo, max_length):
    # the Tokenizer's default filters strip '<' and '>', so the markers are stored as 'start' and 'end'
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo, sequence], verbose=0)
        # pick the index of the most likely next word
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text
#path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'
max_length = 32
tokenizer = load(open("tokenizer.p", "rb"))
model = load_model('models/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")
photo = extract_features(img_path, xception_model)
img = Image.open(img_path)
description = generate_desc(model, tokenizer, photo, max_length)
print("\n\n")
print(description)
plt.imshow(img)
# when run as a script, show() is needed for the image window to appear
plt.show()

two girls are playing in the grass

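Conclusion

In this project we implemented a CNN-RNN model by building an image caption generator. A few key points to note: our model depends on its training data, so it cannot predict words outside its vocabulary. We used a small dataset of 8000 images. For a production-level model we would need to train on a dataset of more than 100,000 images to get a more accurate model.

Original article: https://blog.51cto.com/14744108/2484182

Time: 2024-11-13 11:25:40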
