
Q: Optimal ETL process and platform

Stack Overflow user

Asked 2020-05-15 14:09:21

Answers: 1 · Views: 54 · Followers: 0 · Votes: 0

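I am facing the following problem, and I am new to cloud computing and databases. I want to set up a simple dashboard for an application. Basically, I want to replicate this website, which shows air pollution data: https://airtube.info/

As I see it, what I need to do is:

1. Download the data from the API: https://github.com/opendata-stuttgart/meta/wiki/EN-APIs. I had this link in mind in particular: "https://data.sensor.community/static/v2/data.1h.json - average of all measurements per sensor of the last hour". (Technology: Python)
2. Set up a bot to transform the data to fit our needs. (Technology: Python)
3. Upload the data to a database. (Technology: Google or AWS)
4. Connect the database to a visualization tool so that everyone can see it on our web page. (Technology: probably Dash in Python)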
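
The first two steps above could be sketched roughly as follows. This is only an illustration: the function names `fetch_measurements` and `flatten` are made up here, and the record layout (a `location` object, a `sensor` object, and a `sensordatavalues` list of `value`/`value_type` pairs) is assumed from the table schema given in the answer, not verified against the live API.

```python
import json
import urllib.request

DATA_URL = "https://data.sensor.community/static/v2/data.1h.json"


def fetch_measurements(url=DATA_URL):
    """Download the hourly averages and return them as a list of dicts."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def flatten(record):
    """Turn one nested record into a flat dict with one key per value_type."""
    location = record.get("location", {})
    row = {
        "sensor_id": record.get("sensor", {}).get("id"),
        "timestamp": record.get("timestamp"),
        "country": location.get("country"),
        "latitude": location.get("latitude"),
        "longitude": location.get("longitude"),
    }
    # Each measurement (assumed shape: {"value": ..., "value_type": ...})
    # becomes its own column in the flat row.
    for item in record.get("sensordatavalues", []):
        row[item["value_type"]] = item["value"]
    return row


# Usage (performs a network request):
# rows = [flatten(r) for r in fetch_measurements()]
```

Keeping the download and the transformation in separate functions means the transformation can be tried on a saved sample file without calling the API each time.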

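My questions are as follows:
1. Do you agree with my approach, or would you change some elements to make it more efficient?
2. What do you think about running a Python script to transform the data? Is there a simpler idea?
3. Which technology would you suggest for setting up the database?

Thanks for your comments! Best regards, Bartek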

amazon-web-services

google-bigquery

etl

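1 Answer

Stack Overflow user

Accepted answer

Posted 2020-05-19 09:06:22

If you want to run some analytics on your data, I recommend uploading it to BigQuery; once that is done, you can write queries there and get the results you want to analyze. I looked at the dataset "data.1h.json", and I would create a table in BigQuery with a schema like this: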

CREATE TABLE dataset.pollution
(
  id NUMERIC,
  sampling_rate STRING,
  timestamp TIMESTAMP,
  location STRUCT<
    id NUMERIC,
    latitude FLOAT64,
    longitude FLOAT64,
    altitude FLOAT64,
    country STRING,
    exact_location INT64,
    indoor INT64
  >,
  sensor STRUCT<
    id NUMERIC,
    pin STRING,
    sensor_type STRUCT<
      id INT64,
      name STRING,
      manufacturer STRING
    >
  >,
  sensordatavalues ARRAY<STRUCT<
    id NUMERIC,
    value FLOAT64,
    value_type STRING
  >>
)
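
If you prefer to create the table from Python instead of running the DDL, the same layout can be written in BigQuery's JSON schema form. This is a sketch, not tested against a real project: `field`, `POLLUTION_SCHEMA` and `create_pollution_table` are names invented here, and the default table ID is only a placeholder.

```python
def field(name, field_type, mode="NULLABLE", fields=None):
    """Build one column description in BigQuery's JSON schema format."""
    column = {"name": name, "type": field_type, "mode": mode}
    if fields:
        column["fields"] = fields
    return column


# FLOAT and INTEGER are the JSON-schema names for FLOAT64 and INT64;
# a STRUCT is written as a RECORD with nested fields.
POLLUTION_SCHEMA = [
    field("id", "NUMERIC"),
    field("sampling_rate", "STRING"),
    field("timestamp", "TIMESTAMP"),
    field("location", "RECORD", fields=[
        field("id", "NUMERIC"),
        field("latitude", "FLOAT"),
        field("longitude", "FLOAT"),
        field("altitude", "FLOAT"),
        field("country", "STRING"),
        field("exact_location", "INTEGER"),
        field("indoor", "INTEGER"),
    ]),
    field("sensor", "RECORD", fields=[
        field("id", "NUMERIC"),
        field("pin", "STRING"),
        field("sensor_type", "RECORD", fields=[
            field("id", "INTEGER"),
            field("name", "STRING"),
            field("manufacturer", "STRING"),
        ]),
    ]),
    # ARRAY<STRUCT<...>> is a RECORD column with mode REPEATED.
    field("sensordatavalues", "RECORD", mode="REPEATED", fields=[
        field("id", "NUMERIC"),
        field("value", "FLOAT"),
        field("value_type", "STRING"),
    ]),
]


def create_pollution_table(table_id="project.dataset.pollution"):
    """Create the table; needs google-cloud-bigquery and credentials."""
    # Imported here so the schema above can be inspected without the library.
    from google.cloud import bigquery

    client = bigquery.Client()
    schema = [bigquery.SchemaField.from_api_repr(c) for c in POLLUTION_SCHEMA]
    return client.create_table(bigquery.Table(table_id, schema=schema))
```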

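OK, we have created our table, so now we need to insert all the data from the JSON file into it. Since you want to use BigQuery, I would use the Python client library [1] to read the data from a bucket in Google Cloud Storage [2], where the file has to be stored, transform it, and upload it to the BigQuery table.

The code should look like this: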

import json

from google.cloud import bigquery, storage

# Placeholder identifiers: replace with your own project, dataset, bucket and path.
table_id = "project.dataset.pollution"

client = bigquery.Client()
table = client.get_table(table_id)

# Instantiate a Google Cloud Storage client and point it at the bucket and file.
storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket')
blob = bucket.blob('folder/data.1h.json')

# Download the contents of the blob and parse it with json.loads().
data = json.loads(blob.download_as_string())

# Split the upload into roughly four requests to avoid hitting request-size quotas.
partition = max(1, len(data) // 4)


def insert_batch(rows):
    errors = client.insert_rows(table, rows)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
    else:
        print(errors)


data_aux = []
for part in data:
    # Avoid empty values (clean data). Compare with ==, not "is".
    for key in ('altitude', 'latitude', 'longitude'):
        if part['location'][key] == "":
            part['location'][key] = 0
    data_aux.append(part)
    if len(data_aux) >= partition:
        insert_batch(data_aux)
        data_aux = []

# Send whatever is left over after the last full batch.
if data_aux:
    insert_batch(data_aux)

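As you can see above, I had to split the upload into partitions to avoid hitting the quota on request size. The quota limits to stay under are listed here [3].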
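
The batching idea can also be pulled out into a small helper that yields fixed-size chunks, including the final partial one. A minimal sketch: `chunked` and the batch size of 500 in the usage comment are illustrative choices, not values taken from the quota page.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Usage with the client and table from the script above:
# for batch in chunked(data, 500):
#     errors = client.insert_rows(table, batch)
```

Slicing by index means the last chunk is emitted even when the row count is not a multiple of the batch size.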

In addition, some of the values in the location fields appear to be empty, so they need to be handled to avoid errors.
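
A possible variation is to send NULL for a missing value instead of 0, so that a missing coordinate is not confused with a real one. This is a sketch under that assumption: `clean_location` is a hypothetical helper, and it relies on the `location` columns being nullable, which is the default for the table definition above.

```python
def clean_location(record, keys=("altitude", "latitude", "longitude")):
    """Return a copy of `record` whose empty location values are None."""
    location = dict(record.get("location", {}))
    for key in keys:
        # Missing keys and empty strings both end up as None (NULL on insert).
        if location.get(key) in ("", None):
            location[key] = None
    cleaned = dict(record)
    cleaned["location"] = location
    return cleaned


# Usage before inserting:
# data = [clean_location(part) for part in data]
```

Because the helper works on copies of the record and of its `location` dict, the parsed JSON is left untouched for debugging.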

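Since your data is now stored in BigQuery, to create a new dashboard I would use [4] to visualize your BigQuery data and write queries over the columns you want to display.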
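
Since `sensordatavalues` is an array of structs, a dashboard query first has to unnest it. The sketch below only builds the SQL text; `average_by_country_sql` is a made-up name, the table ID is a placeholder, and `P1` as a `value_type` is an assumption about the data, so adjust it to whatever your table actually contains.

```python
def average_by_country_sql(table_id="project.dataset.pollution", value_type="P1"):
    """Build a query that averages one measurement type per country."""
    # Both arguments are interpolated directly, so pass trusted constants only.
    return f"""
        SELECT
          location.country AS country,
          AVG(v.value) AS avg_value
        FROM `{table_id}`, UNNEST(sensordatavalues) AS v
        WHERE v.value_type = '{value_type}'
        GROUP BY country
        ORDER BY avg_value DESC
    """.strip()


# Usage with the BigQuery client from the script above:
# for row in client.query(average_by_country_sql()).result():
#     print(row["country"], row["avg_value"])
```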

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/61821269
