Q: Word embeddings and data formatting: unsure how to get them to work together

Stack Overflow user

Asked on 2020-10-07 15:04:09

Answers: 1, Views: 206, Followers: 0, Votes: 0

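I am working through a tutorial and trying to reapply it to a different problem for practice. The original tutorial is here: https://www.tensorflow.org/tutorials/text/word_embeddings

I am using the dataset below instead, with the `profitable` column as the target I ultimately want to optimize for: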

id - product - url - html_snippet - html_blob - profitable
1 - toygun - "https://toyguns.com" - "a place for toy guns" - "<!DOCTYPE html><ht.." - 1
1 - umbrella - "https://umbrellas.com" - "a place for umbrellas" - "<!DOCTYPE ...." - 0

Here is an example of how I fetch the data:

import requests
from bs4 import BeautifulSoup

url = 'https://moodmap.app'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
s = s.replace('null','"placeholder"')

print(soup)

Now, when I try to put this data into the format below, I get errors, and I am unsure whether it can be processed in the same format as in this example:

...
batch_size = 1024
seed = 123
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2, 
    subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2, 
    subset='validation', seed=seed)
..

Given that this part loads from a directory, can I parse my data with the values above?

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

In this section, with its 1,000-word vocabulary, is that correct for my input?

# Embed a 1,000 word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)

result = embedding_layer(tf.constant([1,2,3]))
result.numpy()

result = embedding_layer(tf.constant([[0,1,2],[3,4,5]]))
result.shape

# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')

# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to 
# integers. Note that the layer uses the custom standardization defined above. 
# Set maximum_sequence length as all samples are not of the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

I think the rest should work.

I would appreciate some help here. Maybe I am approaching this the wrong way, but it seems like a similar workflow?

Thanks!

python

tensorflow

keras

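1 Answer

Stack Overflow user

Accepted answer

Posted on 2020-10-08 12:34:51

I am not entirely sure what you are trying to achieve, but assuming you have some kind of text data that you want to feed into a neural network with an embedding layer for classification, this is how I would approach it: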

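First, while you are experimenting, I would work with the raw data rather than tf.keras.preprocessing.text_dataset_from_directory, to avoid any confusion about the nature and structure of the data. It also means you are not tied to a static train/test split, which in my experience is needed for more robust solutions.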
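
To make the point about not relying on one fixed split concrete, here is a minimal sketch in plain Python that draws a fresh train/test partition per seed. The helper name `split_indices` and the 80/20 ratio are assumptions chosen for illustration; they are not part of the tutorial or of scikit-learn.

```python
import random

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) for a shuffled split of range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

# Drawing several splits shows how much a result depends on the partition.
for seed in (1, 2, 3):
    train_idx, test_idx = split_indices(10, test_fraction=0.2, seed=seed)
    print(seed, sorted(test_idx))
```

Evaluating the same model on a few such splits gives a rough idea of how stable the reported accuracy is.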

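This means I would load the whole dataset, preprocess it, and then split it into training, validation, and test sets. I prefer doing it this way during the exploratory phase because I believe it makes your data easier to understand. I would also use a few packages that I recommend learning if you are not using them already (pandas, numpy, sklearn). Something like this, assuming your data is in a CSV file: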

# ----- Import needed packages -----
import pandas as pd
import numpy as np
import tensorflow as tf

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from collections import Counter

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# ----- Load data -----
file_path = "/PATH/TO/DATA/data.csv"
df = pd.read_csv(file_path)

# ----- Get labels -----
y = np.int32(df.NAME_OF_LABEL_COLUMN.astype('category').cat.codes.to_numpy())

# ----- Get number of classes -----
num_classes = np.unique(y).shape[0]

# ----- Remove HTML tags from your text -----
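def custom_standardization(text):
    # Placeholder: rewrite this function so it cleans each text in your data
    # (for example by stripping HTML tags). As written it returns the text unchanged.
    return text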

df['Cleaned_Text'] = df.NAME_OF_TEXT_COLUMN.apply(custom_standardization)

# ----- Prepare text for embedding -----
# Define these values so they fit your project
max_features = 10000
output_dim = 16

# ----- Get the top 10000 most frequently occurring words as a list -----
results = Counter()
df['Cleaned_Text'].str.split().apply(results.update)
vocabulary = [key[0] for key in results.most_common(max_features)]

# ----- Create tokenizer based on your top 10000 words -----
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(vocabulary)

# ----- Convert words to ints and pad -----
X = tokenizer.texts_to_sequences(df['Cleaned_Text'].values)
X = pad_sequences(X)

max_input_lenght = X.shape[1]

# ----- Split into Train, Test, Validation sets -----
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
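
If it is unclear what the tokenizing and padding steps produce, the following plain-Python sketch mimics roughly what `Tokenizer` and `pad_sequences` do with their default settings: the most frequent words get integer ids starting at 1, unknown words are dropped, and shorter sequences are padded with 0 on the left. The helper names `build_vocabulary` and `encode_and_pad` are made up for this illustration, and the sketch skips the lowercasing and punctuation filtering that the Keras tokenizer applies.

```python
from collections import Counter

def build_vocabulary(texts, max_features):
    """Map the most frequent words to integer ids, starting at 1 (0 is padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # Keep max_features - 1 words so that every id stays below max_features.
    most_common = counts.most_common(max_features - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}

def encode_and_pad(texts, vocabulary):
    """Turn texts into equal-length id lists, left-padded with 0."""
    sequences = [[vocabulary[w] for w in t.split() if w in vocabulary]
                 for t in texts]
    max_len = max(len(s) for s in sequences)
    return [[0] * (max_len - len(s)) + s for s in sequences]

texts = ["a place for toy guns", "a place for umbrellas"]
vocab = build_vocabulary(texts, max_features=10000)
X_demo = encode_and_pad(texts, vocab)
print(X_demo)
```

Every row has the same length, which is what the embedding layer needs as input.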

Then, once the preprocessing is done, we can build and train the model.

# ----- Define model -----
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=max_features, output_dim=output_dim, input_length=max_input_lenght))
model.add(tf.keras.layers.GlobalAveragePooling1D())
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))

# ----- Compile model -----
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(), optimizer=tf.keras.optimizers.Adam(1e-4), metrics=["accuracy"])

# ----- Train model -----
history = model.fit(X_train, y_train, batch_size=8,epochs=20, validation_data=(X_val, y_val))

# ----- Evaluate model -----
probabilities = model.predict(X_test)
pred = np.argmax(probabilities, axis=1)

print(" ")
print("Results")

accuracy = accuracy_score(y_test, pred)

print('Accuracy: {:.4f}'.format(accuracy))
print(" ")
print(classification_report(y_test, pred))
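
On the question of whether a 1,000-word vocabulary is right for your input: the first argument of the embedding layer is the number of distinct integer ids it can look up, so it has to be larger than the largest id your tokenizer produces. The plain-Python sketch below imitates the lookup and the averaging step to show the shapes involved. It is only an illustration with made-up helper names and a random table; the real Keras layers learn the table during training and may treat out-of-range ids differently.

```python
import random

def make_embedding_table(input_dim, output_dim, seed=0):
    """One row of output_dim floats per token id in [0, input_dim)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]

def embed(table, token_ids):
    """Look up one vector per token id; an id >= len(table) raises IndexError."""
    return [table[i] for i in token_ids]

def average_pool(vectors):
    """Average over the sequence axis, leaving one output_dim-sized vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

table = make_embedding_table(input_dim=1000, output_dim=5)
sequence = [0, 1, 2, 999]  # valid: every id is below input_dim
embedded = embed(table, sequence)
pooled = average_pool(embedded)
print(len(embedded), len(pooled))
```

A sequence of four ids becomes four 5-dimensional vectors, and the pooling step reduces them to a single 5-dimensional vector that the dense layers can consume.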

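This code can be improved, but I think it makes it easier to test whatever you want to test, because it gives you more transparency at each step. Of course, you will need to modify it to fit your specific case.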
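
As one example of such a modification, here is a possible way to fill in the `custom_standardization` placeholder for HTML input using only the standard library. This is a sketch under the assumption that you want the visible text only, lowercased and without ASCII punctuation; the class name `_TextExtractor` is hypothetical, and real pages may need more careful handling.

```python
import string
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def custom_standardization(text):
    """Strip tags, lowercase, remove ASCII punctuation, collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(str(text))
    parser.close()
    plain = " ".join(parser.parts).lower()
    plain = plain.translate(str.maketrans("", "", string.punctuation))
    return " ".join(plain.split())

sample = ("<!DOCTYPE html><html><body><h1>Toy Guns!</h1><br />"
          "A place for toy guns.</body></html>")
print(custom_standardization(sample))
```

The function takes one string and returns one string, so it can be applied to a pandas column row by row.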

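If you are keen on basing your work on the TensorFlow example, I suggest comparing the data format in your own case with the one in the tutorial, since that is probably where the problem lies.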
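
For reference, the tutorial's loader reads a directory in which every class has its own subfolder of text files (the IMDB data uses `aclImdb/train/pos` and `aclImdb/train/neg`), whereas your data is one table with a label column. A minimal sketch of writing tabular rows into that kind of layout could look like this; the helper name `write_class_directories`, the `label_` folder prefix, and the file naming are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def write_class_directories(rows, root):
    """Write each (text, label) row to root/label_<label>/<n>.txt; return root."""
    root = Path(root)
    for n, (text, label) in enumerate(rows):
        class_dir = root / f"label_{label}"
        class_dir.mkdir(parents=True, exist_ok=True)
        (class_dir / f"{n}.txt").write_text(text, encoding="utf-8")
    return root

rows = [("a place for toy guns", 1), ("a place for umbrellas", 0)]
root = write_class_directories(rows, tempfile.mkdtemp())
written = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.txt"))
print(written)
```

The resulting root directory could then be passed to the loader in place of 'aclImdb/train', although for exploration I would still start from the raw table as described above.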

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/64246893
