Thanks to the original article used as a reference: http://bjbsair.com/2020-04-01/tech-info/18508.html
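When you look at an image, your brain easily works out what it shows, but can a computer do the same? Computer vision researchers worked on this problem for a long time, and until recently it was considered out of reach. With advances in deep learning, the availability of large datasets, and more powerful hardware, we can now build models that generate captions for images.
That is what we will implement in this project, combining a convolutional neural network (CNN) with a type of recurrent neural network, the LSTM.
What is an image caption generator?
Image caption generation is a task that combines computer vision and natural language processing: the model has to recognize the context of an image and describe it in natural language.
The goal of the project is to learn the concepts behind CNN and LSTM models and to build a working image caption generator by combining a CNN with an LSTM.
We will implement the caption generator with a CNN (convolutional neural network) and an LSTM (long short-term memory network). Image features are extracted with Xception, a CNN trained on the ImageNet dataset, and then fed into the LSTM model, which generates the caption.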
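Preparing the dataset
For the image caption generator we use the Flickr_8K dataset. Larger datasets such as Flickr_30K and MSCOCO exist, but training a network on them can take weeks, so we use the small Flickr8k dataset. The advantage of a larger dataset is that it lets us build a better model.
Prerequisites
We need the following libraries: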
- tensorflow
- keras
- pillow
- numpy
- tqdm
- jupyterlab
1. First, we import all the required libraries
import os
import string
from pickle import dump, load

import numpy as np
from PIL import Image
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model, load_model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
# small library for showing the progress of loops (notebook-friendly progress bar)
from tqdm.notebook import tqdm
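2. Loading the data and cleaning it
Each line of the file holds an image name and one caption; lines are separated by a newline ("\n").
Each image has 5 captions, and each caption is tagged with a number from #0 to #4.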
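To make the line layout concrete, here is a minimal sketch of how one such line can be split. The sample line and the helper name split_token_line are made up for illustration; the only assumptions taken from the project code are that the image name and caption are separated by a tab and that the name ends in a two-character "#n" suffix.

```python
# Hypothetical sample line in the "<image>#<n>\t<caption>" layout the parser expects.
sample_line = "example_image.jpg#0\tA child in a pink dress is climbing up stairs ."

def split_token_line(line):
    """Split one line into (image name, caption index, caption text)."""
    tagged_name, caption = line.split("\t")
    # The last two characters are "#" plus a single digit.
    image_name, caption_index = tagged_name[:-2], int(tagged_name[-1])
    return image_name, caption_index, caption

image_name, caption_index, caption = split_token_line(sample_line)
print(image_name, caption_index, caption)
```

Slicing off the last two characters only works while the caption number is a single digit, which holds for five captions per image.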
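We will define 5 functions:
- load_doc(filename) – loads the document file and reads its contents into a string.
- all_img_captions(filename) – builds a descriptions dictionary that maps each image to a list of its 5 captions.
- cleaning_text(descriptions) – takes all descriptions and cleans them. This is an important step when working with text data; which kind of cleaning to apply depends on the goal. In our case we remove punctuation, convert all text to lowercase, and remove words that contain numbers.
- text_vocabulary(descriptions) – a simple function that collects all unique words across the descriptions into a vocabulary.
- save_descriptions(descriptions, filename) – builds a list of all preprocessed descriptions and stores it in a file. We will create a descriptions.txt file to hold all the captions.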
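As a quick illustration of the cleaning step, the sketch below applies the same rules to a single caption. The helper name clean_caption and the sample sentence are invented for this example, and it assumes hyphens are meant to be turned into spaces before punctuation is stripped.

```python
import string

def clean_caption(caption):
    """Apply the cleaning rules described above to one caption string."""
    table = str.maketrans("", "", string.punctuation)
    caption = caption.replace("-", " ")
    # Lowercase every token and strip punctuation from it.
    words = [word.lower().translate(table) for word in caption.split()]
    # Drop one-character leftovers and tokens that contain digits.
    words = [word for word in words if len(word) > 1 and word.isalpha()]
    return " ".join(words)

print(clean_caption("A little girl's dog-bed has 2 toys ."))
```

Note that single-letter words such as "a" are dropped along with stray punctuation, so cleaned captions are slightly shorter than the originals.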
# Load a text file into memory
def load_doc(filename):
    # Open the file read-only; the context manager closes it for us
    with open(filename, 'r') as file:
        text = file.read()
    return text

# Map every image to the list of its captions
def all_img_captions(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        # img looks like "<name>.jpg#0"; img[:-2] drops the "#n" suffix
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions

# Data cleaning: lowercasing, removing punctuation and words containing numbers
def cleaning_text(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            # str.replace returns a new string, so the result has to be assigned
            # (the original call discarded it, so reported counts may differ slightly)
            img_caption = img_caption.replace("-", " ")
            desc = img_caption.split()
            # convert to lowercase
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [word.translate(table) for word in desc]
            # remove hanging 's and a
            desc = [word for word in desc if len(word) > 1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # convert back to string
            img_caption = ' '.join(desc)
            captions[img][i] = img_caption
    return captions

def text_vocabulary(descriptions):
    # build vocabulary of all unique words
    vocab = set()
    for key in descriptions.keys():
        for d in descriptions[key]:
            vocab.update(d.split())
    return vocab

# Save all descriptions in one file, one "<image>\t<caption>" pair per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    with open(filename, "w") as file:
        file.write(data)

# Set these paths according to the project folder on your system
# (raw strings keep the backslashes from being read as escape sequences)
dataset_text = r"D:\dataflair projects\Project - Image Caption Generator\Flickr_8k_text"
dataset_images = r"D:\dataflair projects\Project - Image Caption Generator\Flicker8k_Dataset"

# prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
# load the file that contains all data
# and map it into the descriptions dictionary: image -> 5 captions
descriptions = all_img_captions(filename)
print("Length of descriptions =", len(descriptions))

# clean the descriptions
clean_descriptions = cleaning_text(descriptions)

# build the vocabulary
vocabulary = text_vocabulary(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))

# save each description to file
save_descriptions(clean_descriptions, "descriptions.txt")
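3. Extracting the feature vector from all images
This technique is also called transfer learning: rather than training everything ourselves, we take a model already trained on a large dataset, extract features from it, and use them for our task. We use the Xception model, which was trained on the ImageNet dataset with 1000 different classes. We can import it directly from keras.applications. Because Xception was originally built for ImageNet classification, we make a few small changes to integrate it with our model. One thing to note is that Xception takes 299×299×3 images as input. We remove the final classification layer and obtain a 2048-dimensional feature vector.
model = Xception(include_top=False, pooling='avg')
The extract_features() function extracts features for all images and maps each image name to its feature array. We then dump the features dictionary into the "features.p" pickle file.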
def extract_features(directory):
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for img in tqdm(os.listdir(directory)):
        filename = directory + "/" + img
        image = Image.open(filename)
        image = image.resize((299, 299))
        # add a batch dimension: (299, 299, 3) -> (1, 299, 299, 3)
        image = np.expand_dims(image, axis=0)
        # scale pixel values from [0, 255] to [-1, 1]
        #image = preprocess_input(image)
        image = image / 127.5
        image = image - 1.0
        feature = model.predict(image)
        features[img] = feature
    return features

# 2048-dimensional feature vector for every image
features = extract_features(dataset_images)
dump(features, open("features.p", "wb"))
Depending on your system, this process can take a long time. Once the pickle file exists, the features can be loaded back without recomputing them:
features = load(open("features.p", "rb"))
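The two arithmetic lines in extract_features rescale pixel values before they reach the network. As a small worked example, the sketch below applies the same formula to single values; scale_pixel is a name made up for this illustration.

```python
def scale_pixel(value):
    """Map an 8-bit pixel value from [0, 255] to [-1.0, 1.0]."""
    return value / 127.5 - 1.0

# Black, mid-grey and white end up at the two ends and the centre of the range.
for pixel in (0, 127.5, 255):
    print(pixel, "->", scale_pixel(pixel))
```

Dividing by 127.5 first maps the range to [0, 2], and subtracting 1 centres it on zero.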
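4. Loading the dataset for training the model
In the Flickr_8k_text folder we have the Flickr_8k.trainImages.txt file, which lists the names of the 6000 images used for training.
To load the training dataset we need a few more functions:
- load_photos(filename) – loads the text file as a string and returns the list of image names.
- load_clean_descriptions(filename, photos) – builds a dictionary holding the captions for every photo in the photos list. We also wrap each caption in <start> and <end> markers so that the LSTM model can recognize where a caption begins and ends.
- load_features(photos) – returns a dictionary of image names and the feature vectors previously extracted with the Xception model.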
# load the data
def load_photos(filename):
    file = load_doc(filename)
    photos = file.split("\n")[:-1]
    return photos

def load_clean_descriptions(filename, photos):
    # load the cleaned descriptions
    file = load_doc(filename)
    descriptions = {}
    for line in file.split("\n"):
        words = line.split()
        if len(words) < 1:
            continue
        image, image_caption = words[0], words[1:]
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)
    return descriptions

def load_features(photos):
    # load all features
    all_features = load(open("features.p", "rb"))
    # select only the features we need
    features = {k: all_features[k] for k in photos}
    return features

filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)
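5. Tokenizing the vocabulary
We map every word in the vocabulary to a unique index value. The Keras library provides the Tokenizer class for this; we use it to create tokens from the vocabulary and save it to the "tokenizer.p" pickle file.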
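To show what such a word-to-index mapping looks like, here is a plain-Python sketch. It is not Keras's implementation: build_word_index and FILTER_CHARS are names made up for this illustration, and the filter string is an assumption modelled on the punctuation filtering a text tokenizer typically applies. Words are ranked by frequency, indices start at 1, and index 0 is left free for padding.

```python
from collections import Counter

# Characters replaced by spaces before splitting (an assumption for this sketch).
FILTER_CHARS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts):
    """Return {word: index}, most frequent word first, indices starting at 1."""
    table = str.maketrans({ch: " " for ch in FILTER_CHARS})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

captions = ["<start> dog runs <end>", "<start> dog sleeps <end>"]
word_index = build_word_index(captions)
vocab_size = len(word_index) + 1  # index 0 is reserved for padding
print(word_index, vocab_size)
```

Because the angle brackets are filtered out, the markers end up in the vocabulary as the plain words start and end, which is consistent with the test script seeding generation with 'start' and stopping at 'end'.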
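# dict_to_list, tokenizer and vocab_size are used in the rest of the code but were missing from the listing; reconstructed here
def dict_to_list(descriptions):
    return [d for caps in descriptions.values() for d in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(dict_to_list(train_descriptions))
dump(tokenizer, open("tokenizer.p", "wb"))
vocab_size = len(tokenizer.word_index) + 1
Our vocabulary contains 7577 words.
Next we compute the maximum length of the descriptions, which matters for deciding the model's structure parameters. The maximum description length is 32.
# calculate the maximum length of the descriptions
def max_caption_length(descriptions):
    return max(len(d.split()) for d in dict_to_list(descriptions))
max_length = max_caption_length(descriptions)
max_length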
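6. Creating the data generator
First let us look at what the model's inputs and outputs look like. To turn this into a supervised learning task, we have to give the model inputs and outputs to train on. We train on 6000 images; each image is represented by a 2048-length feature vector, and each caption is also encoded as numbers. The data for all 6000 images cannot be held in memory at once, so we use a generator that yields it in batches.
The generator yields the input and output sequences.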
# create input-output sequence pairs from the image description
# data generator, used by model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length):
    while 1:
        for key, description_list in descriptions.items():
            # retrieve photo features
            feature = features[key][0]
            input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, feature)
            yield [[input_image, input_sequence], output_word]

def create_sequences(tokenizer, max_length, desc_list, feature):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)

# You can check the shape of the input and output for your model
[a, b], c = next(data_generator(train_descriptions, train_features, tokenizer, max_length))
a.shape, b.shape, c.shape
# ((47, 2048), (47, 32), (47, 7577))
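To see where the first dimension in that shape check comes from, the following plain-Python sketch expands one encoded caption into its training pairs. make_pairs and the word indices are hypothetical; the left padding mirrors the default behaviour of pad_sequences, and the sketch assumes the prefix is never longer than max_length.

```python
def make_pairs(seq, max_length):
    """Expand one encoded caption into (left-padded prefix, next word) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        prefix, target = seq[:i], seq[i]
        # Pad on the left with zeros up to max_length.
        padded = [0] * (max_length - len(prefix)) + prefix
        pairs.append((padded, target))
    return pairs

# Hypothetical word indices for the caption "start dog runs end".
encoded = [1, 2, 4, 3]
for padded, target in make_pairs(encoded, max_length=6):
    print(padded, "->", target)
```

A caption of n tokens yields n - 1 pairs, so each item from the generator contains as many rows as the five captions of one image produce together. That is why the row count (47 in the run above) varies from image to image rather than being a fixed batch size.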
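7. Defining the CNN-RNN model
To define the structure of the model we use the Keras Model class from the Functional API. It consists of three main parts:
- Feature Extractor – the feature extracted from the image has size 2048; a dense layer reduces it to 256 nodes.
- Sequence Processor – an embedding layer handles the text input, followed by the LSTM layer.
- Decoder – the outputs of the two parts above are merged and passed through a dense layer to make the final prediction. The final layer has as many nodes as our vocabulary size.
The final model is defined below; plot_model also writes a diagram of it to model.png: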
from keras.utils import plot_model

# define the captioning model
def define_model(vocab_size, max_length):
    # features from the CNN model squeezed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    # LSTM sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # merging both models
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    # summarize model
    print(model.summary())
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
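As a sanity check on the architecture, the trainable parameter counts can be worked out by hand. The sketch below uses the standard formulas for dense and LSTM layers with biases and assumes the vocabulary size of 7577 reported in the text; the helper names are made up for this example. Dropout and the add merge contribute no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((n_in + units) * units + units)

vocab_size = 7577  # value reported in the text
counts = {
    "image dense 2048->256": dense_params(2048, 256),
    "embedding vocab->256": vocab_size * 256,
    "lstm 256->256": lstm_params(256, 256),
    "decoder dense 256->256": dense_params(256, 256),
    "output dense 256->vocab": dense_params(256, vocab_size),
}
total = sum(counts.values())
for name, n in counts.items():
    print(name, n)
print("total", total)
```

If the layers are built as defined above, the total printed by model.summary() should agree with this figure. Most of the parameters sit in the embedding and the output layer, both of which scale with the vocabulary size.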
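8. Training the model
To train the model we use the 6000 training images, generating the input and output sequences in batches and fitting them to the model with the model.fit_generator() method. We also save the model to our models folder.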
# train our model
print('Dataset: ', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length: ', max_length)

model = define_model(vocab_size, max_length)
epochs = 10
# one step per training image, since the generator yields one image at a time
steps = len(train_descriptions)
# make a models directory to save our models (no error if it already exists)
os.makedirs("models", exist_ok=True)
for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    # fit_generator is deprecated in newer TensorFlow/Keras releases, where model.fit accepts a generator directly
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models/model_" + str(i) + ".h5")
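9. Testing the model
The model has now been trained. Next we create a separate file, testing_caption_generator.py, which loads the model and generates predictions. The predictions contain up to max_length index values, so we use the same tokenizer.p pickle file to map the index values back to words.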
import argparse
from pickle import load

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from keras.applications.xception import Xception
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']

def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except OSError:
        # stop here: without an image there is nothing to caption
        raise SystemExit("ERROR: Couldn't open image! Make sure the image path and extension are correct")
    image = image.resize((299, 299))
    image = np.array(image)
    # for images that have 4 channels, keep only the first 3
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    # scale pixel values from [0, 255] to [-1, 1]
    image = image / 127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature

def word_for_id(integer, tokenizer):
    # reverse lookup: find the word that was assigned this index
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo, sequence], verbose=0)
        # pick the most probable next word
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text

#path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'
max_length = 32
tokenizer = load(open("tokenizer.p", "rb"))
model = load_model('models/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")

photo = extract_features(img_path, xception_model)
img = Image.open(img_path)

description = generate_desc(model, tokenizer, photo, max_length)
print("\n\n")
print(description)
plt.imshow(img)
# needed to display the image when running as a script
plt.show()
two girls are playing in the grass
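The generate_desc function is a greedy decoder: at every step it keeps only the single most probable next word and stops when it sees the end marker or reaches max_length. The sketch below isolates that loop. NEXT_WORD_SCORES and predict_scores are a hypothetical stand-in for model.predict with made-up scores; a real model also conditions on the photo features and on the whole prefix, not just the last word.

```python
# Hypothetical stand-in for the trained model: scores for the next word
# given only the last word generated so far.
NEXT_WORD_SCORES = {
    "start": {"two": 0.7, "girls": 0.2, "end": 0.1},
    "two": {"girls": 0.8, "end": 0.2},
    "girls": {"end": 0.9, "two": 0.1},
}

def predict_scores(words):
    return NEXT_WORD_SCORES[words[-1]]

def greedy_caption(max_length):
    """Repeatedly append the highest-scoring next word until 'end'."""
    words = ["start"]
    for _ in range(max_length):
        scores = predict_scores(words)
        best = max(scores, key=scores.get)  # argmax over the vocabulary
        words.append(best)
        if best == "end":
            break
    return " ".join(words)

print(greedy_caption(max_length=32))
```

Greedy decoding is simple and fast, but it never revisits an earlier choice. In the real script, building an index-to-word dictionary once would also avoid the linear scan that word_for_id performs at every step.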
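Conclusion
In this project we implemented a CNN-RNN model by building an image caption generator. A key point to note is that the model depends on its data, so it cannot predict words outside its vocabulary. We used a small dataset of 8000 images. A production-level model would need to be trained on a dataset of more than 100,000 images to reach better accuracy.
Original article: https://blog.51cto.com/14744108/2484182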