DevilKing's blog

Spelling Translate

Original article link

Preparation

An RNN-based network is chosen, operating mainly at the word level (character by character within each word).

Xdata = []
Ydata = []
MAX_LENGTH_WORD = 10


feature_dict = dict()
feature_list = list()

PADDING_CHARACTER = '~'
feature_dict[PADDING_CHARACTER] = 0
feature_list.append(PADDING_CHARACTER)
max_features = 1

def get_vector_from_string(input_s):
    global max_features
    vector_x = []
    for i in input_s:
        if i not in feature_dict:
            feature_dict[i] = max_features
            feature_list.append(i)
            max_features += 1
        vector_x.append(feature_dict[i])
    return vector_x

def add_to_data(input_s, output_s):
    if len(input_s) < MAX_LENGTH_WORD and len(output_s) < MAX_LENGTH_WORD:
        vector_x = get_vector_from_string(input_s)
        vector_y = get_vector_from_string(output_s)
        Xdata.append(vector_x)
        Ydata.append(vector_y)


def print_vector(vector, end_token='\n'):
    print(''.join([feature_list[i] for i in vector]), end=end_token)

with open("dictionary_old_new_dutch.csv") as in_file:
    for line in in_file:
        in_s, out_s = line.strip().split(",")
        add_to_data(in_s, out_s)

for i in range(10):
    print_vector(Xdata[i], end_token='')
    print(' -> ', end='')
    print_vector(Ydata[i])

The code above converts each word into a vector of character indices. A few points to note:

  • Words longer than MAX_LENGTH_WORD are discarded
  • Unlike the earlier LSTM post, there is no one-hot encoding based on recorded frequencies
  • feature_list records every character seen, while feature_dict records each character's position in feature_list (a short sketch follows below)
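
A minimal sketch of what these lookups produce; the word "aerde" and the index values are assumed for illustration, since the real indices depend on the order in which characters first appear in the CSV:

# Hypothetical example of the character-index mapping built above.
# On a fresh feature_dict containing only '~', "aerde" would map as below;
# after reading the real CSV the numbers will differ.
vector = get_vector_from_string("aerde")
print(vector)             # [1, 2, 3, 4, 2]  ('a'=1, 'e'=2, 'r'=3, 'd'=4)
print(feature_list)       # ['~', 'a', 'e', 'r', 'd']
print(feature_dict['e'])  # 2, the position of 'e' in feature_list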

data preprocessing

As mentioned above I would like to use a sequence to sequence approach. Important for this approach is having a fixed word length. Words that are longer than that length have been discarded in the data-reading step above. Now we will add padding to the words that are not long enough.

Another important step is creating a train and a test set. We only show the network examples from the train set. At the end I will manually evaluate some of the examples in the test set and discuss what the network learned. During training we train in batches with a small amount of data. With a random data splitter we get a different train set every run.

import random
from keras.preprocessing import sequence

before_padding = Xdata[0]
Xdata = sequence.pad_sequences(Xdata, maxlen=MAX_LENGTH_WORD)
Ydata = sequence.pad_sequences(Ydata, maxlen=MAX_LENGTH_WORD)
after_padding = Xdata[0]

print_vector(before_padding, end_token='')
print(" -> after padding: ", end='')
print_vector(after_padding)

class DataSplitter:
    def __init__(self, percentage):
        self.percentage = percentage

    def split_data(self, data):
        splitpoint = int(len(data) * self.percentage)
        return data[:splitpoint], data[splitpoint:]

splitter = DataSplitter(0.8)
Xdata, Xtest = splitter.split_data(Xdata)
Ydata, Ytest = splitter.split_data(Ydata)

def get_random_reversed_dataset(Xdata, Ydata, batch_size):
    # Draw batch_size random samples; the target word is reversed character-wise
    newX = []
    newY = []
    for _ in range(batch_size):
        index_taken = random.randint(0, len(Xdata) - 1)
        newX.append(Xdata[index_taken])
        newY.append(Ydata[index_taken][::-1])
    return newX, newY

Notes here:

  • sequence.pad_sequences is used to pad every word to MAX_LENGTH_WORD (see the example below)
  • get_random_reversed_dataset draws a random batch and reverses each target word character-wise
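
A quick sketch of what the padding does, assuming index 0 stands for the padding character '~'; by default pad_sequences pads on the left with 0:

from keras.preprocessing import sequence

# A hypothetical 3-character word encoded as [1, 2, 3] is left-padded
# with 0 (the index of '~') up to a length of 10.
example = [[1, 2, 3]]
padded = sequence.pad_sequences(example, maxlen=10)
print(padded)  # [[0 0 0 0 0 0 0 1 2 3]]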

The network

  • embeds our characters
  • has an encoder which returns a sequence of outputs
  • has an attention model which uses this sequence to generate output characters
import numpy as np
import tensorflow as tf

batch_size = 64
memory_dim = 256      # size of the GRU hidden state
embedding_dim = 32    # size of the character embeddings

enc_input = [tf.placeholder(tf.int32, shape=(None,)) for i in range(MAX_LENGTH_WORD)]
dec_output = [tf.placeholder(tf.int32, shape=(None,)) for t in range(MAX_LENGTH_WORD)]

weights = [tf.ones_like(labels_t, dtype=tf.float32) for labels_t in enc_input]

# Decoder input during training: a "go" symbol (zeros) followed by the target shifted by one step
dec_inp = ([tf.zeros_like(enc_input[0], dtype=np.int32)]
           + [dec_output[t] for t in range(MAX_LENGTH_WORD - 1)])
# Decoder input at test time: all zeros, so the decoder has to feed back its own predictions
empty_dec_inp = [tf.zeros_like(enc_input[0], dtype=np.int32, name="empty_dec_input")
                 for t in range(MAX_LENGTH_WORD)]

cell = tf.nn.rnn_cell.GRUCell(memory_dim)

# Create a train version of encoder-decoder, and a test version which does not feed the previous input
with tf.variable_scope("decoder1") as scope:
    outputs, _ = tf.nn.seq2seq.embedding_attention_seq2seq(enc_input, dec_inp,
                                                           cell, max_features, max_features,
                                                           embedding_dim, feed_previous=False)
with tf.variable_scope("decoder1", reuse=True) as scope:
    runtime_outputs, _ = tf.nn.seq2seq.embedding_attention_seq2seq(enc_input, empty_dec_inp,
                                                                   cell, max_features, max_features,
                                                                   embedding_dim, feed_previous=True)

loss = tf.nn.seq2seq.sequence_loss(outputs, dec_output, weights, max_features)

optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss)

# Init everything
sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())

Code walkthrough:

 1. enc_input and dec_output are each lists of MAX_LENGTH_WORD (10) placeholders, one per time step
 2. weights gives every time step a loss weight of 1 (tf.ones_like)
 3. memory_dim is the size of the GRU hidden state, 256 here
 4. A GRUCell is used as the recurrent unit
 5. feed_previous=False feeds the ground-truth target at each decoding step during training (teacher forcing), while feed_previous=True makes the decoder feed its own previous prediction back in at test time; see the sketch below
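
A conceptual sketch, in plain Python rather than the TensorFlow API, of what the two decoder modes do; decode_step is a hypothetical function that maps the previous token to the next prediction:

# Conceptual illustration only, not the real seq2seq implementation.
def run_decoder(targets, feed_previous, decode_step, go_token=0):
    predictions = []
    prev = go_token
    for t in range(len(targets)):
        pred = decode_step(prev)
        predictions.append(pred)
        if feed_previous:
            prev = pred        # test time: feed back the model's own guess
        else:
            prev = targets[t]  # training: teacher forcing with the true target
    return predictions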

Training

for index_now in range(1002):
    Xin, Yin = get_random_reversed_dataset(Xdata, Ydata, batch_size)
    # Transpose from (batch, time) to (time, batch) to match the per-step placeholders
    Xin = np.array(Xin).T
    Yin = np.array(Yin).T
    feed_dict = {enc_input[t]: Xin[t] for t in range(MAX_LENGTH_WORD)}
    feed_dict.update({dec_output[t]: Yin[t] for t in range(MAX_LENGTH_WORD)})
    _, l = sess.run([train_op, loss], feed_dict)
    if index_now % 100 == 1:
        print(l)

Each iteration draws a different random batch from the training set, presumably to add some randomness to what the network sees during training.
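
The transpose in the loop turns the batch from (batch, time) into (time, batch), because each placeholder in enc_input expects one time step for the whole batch. A small sketch with made-up values:

import numpy as np

# Hypothetical mini-batch of two padded words of length 3
batch = np.array([[0, 1, 2],
                  [0, 3, 4]])   # shape (batch, time) = (2, 3)
time_major = batch.T            # shape (time, batch) = (3, 2)
print(time_major[0])            # [0 0] -> fed to enc_input[0]
print(time_major[1])            # [1 3] -> fed to enc_input[1]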

Train analysis

def get_reversed_max_string_logits(logits):
    # The decoder was trained on reversed targets, so reverse back before decoding
    string_logits = logits[::-1]
    concatenated_string = ""
    for logit in string_logits:
        val_here = np.argmax(logit)
        concatenated_string += feature_list[val_here]
    return concatenated_string

def print_out(out):
    out = list(zip(*out))
    out = out[:10]  # only show the first 10 samples

    for index, string_logits in enumerate(out):
        print("input: ", end='')
        print_vector(Xin[index])
        print("expected: ", end='')
        expected = Yin[index][::-1]
        print_vector(expected)

        output = get_reversed_max_string_logits(string_logits)
        print("output: " + output)

        print("==============")


# Now run a small test to see what our network does with words
RANDOM_TESTSIZE = 5
Xin, Yin = get_random_reversed_dataset(Xtest, Ytest, RANDOM_TESTSIZE)
Xin_transposed = np.array(Xin).T
Yin_transposed = np.array(Yin).T
feed_dict = {enc_input[t]: Xin_transposed[t] for t in range(MAX_LENGTH_WORD)}
out = sess.run(runtime_outputs, feed_dict)
print_out(out)


def translate_single_word(word):
    Xin = [get_vector_from_string(word)]
    Xin = sequence.pad_sequences(Xin, maxlen=MAX_LENGTH_WORD)
    Xin_transposed = np.array(Xin).T
    feed_dict = {enc_input[t]: Xin_transposed[t] for t in range(MAX_LENGTH_WORD)}
    out = sess.run(runtime_outputs, feed_dict)
    return get_reversed_max_string_logits(out)

interesting_words = ["aerde", "duyster", "salfde", "ontstondt", "tusschen", "wacker", "voorraet", "gevreeset", "cleopatra"]
for word in interesting_words:
    print(word + " becomes: " + translate_single_word(word).replace("~", ""))

Code walkthrough:

  • First draw a few random test cases and run them through the trained network (the test-time graph built with feed_previous=True)
  • Then translate_single_word runs a single word through the same graph via sess.run to get its translation
  • The outputs come back reversed (the targets were reversed during training), so they are flipped and mapped back from indices to characters via feature_list, as in the sketch below
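
To make the last step concrete, a tiny sketch of how reversed logits are turned back into a word; the feature_list and logits below are made-up values, not real model output:

import numpy as np

# Made-up vocabulary and per-step logits for a 3-step decoder output.
feature_list = ['~', 'a', 'e', 'r']
logits = [np.array([0.1, 0.2, 0.1, 0.9]),   # argmax -> 'r'
          np.array([0.1, 0.2, 0.9, 0.1]),   # argmax -> 'e'
          np.array([0.1, 0.9, 0.1, 0.2])]   # argmax -> 'a'

# Reverse first (the targets were reversed during training), then take the
# argmax at every step and look the index up in feature_list.
word = ''.join(feature_list[np.argmax(l)] for l in logits[::-1])
print(word)  # 'aer'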