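I'm using a combination of the GCS Python SDK and the Google API client to iterate over a versioning-enabled bucket and download specific objects based on their metadata.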
from google.cloud import storage
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
def downloadepoch_objects():
    request = service.objects().list(
        bucket=bucket_name,
        versions=True
    )
    response = request.execute()
    for item in response['items']:
        if item['metadata']['epoch'] == restore_epoch:
            print(item['bucket'])
            print(item['name'])
            print(item['metadata']['epoch'])
            print(item['updated'])
            blob = source_bucket.blob(item['name'])
            # Format the object name, not the whole item dict, into the path.
            blob.download_to_filename(
                '/Users/admin/git/data-processing/{}'.format(item['name']))
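downloadepoch_objects()
The function above works fine for blobs that are not inside a directory (gs://bucketname/test1.txt), because the item passed in is just test1.txt. The problem I'm running into is downloading files from a complex directory tree (gs://bucketname/nfs/media/docs/test1.txt), where the item passed in is nfs/media/docs/test1.txt. Is it possible to have the .download_to_file() method create the directories if they don't exist?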
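Posted on 2018-02-22 08:58:45
Here is a working solution. I ended up splitting the path off the object name and creating the directory structure on the fly. A better approach might be what @Brandon Yarbrough suggested, using prefix + response['prefixes'], but I couldn't quite figure that out. Hope this helps others.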
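The path handling described here can be looked at separately from the GCS calls. The following is a minimal sketch, assuming a hypothetical helper named `local_path_for` (it is not part of either Google library): it joins the restore location with the object name and creates any missing parent directories before a download would write to that path.

```python
import os
import pathlib
import tempfile


def local_path_for(object_name, restore_root):
    """Return the local destination for an object name, creating parent dirs.

    Hypothetical helper for illustration; not part of any Google library.
    """
    destination = pathlib.Path(restore_root) / object_name
    # makedirs creates every missing level; exist_ok avoids an error on reruns.
    os.makedirs(destination.parent, exist_ok=True)
    return destination


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as root:
        target = local_path_for('nfs/media/docs/test1.txt', root)
        print(target.parent.is_dir())  # True: the nested directories now exist
```

The returned path could then be passed, as a string, to `blob.download_to_filename()`.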
#!/usr/local/bin/python3
from google.cloud import storage
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json
import os
import pathlib
bucket_name = 'test-bucket'
restore_epoch = '1519189202'
restore_location = '/Users/admin/data/'
credentials = GoogleCredentials.get_application_default()
service = discovery.build('storage', 'v1', credentials=credentials)
storage_client = storage.Client()
source_bucket = storage_client.get_bucket(bucket_name)
def listall_objects():
    request = service.objects().list(
        bucket=bucket_name,
        versions=True
    )
    response = request.execute()
    print(json.dumps(response, indent=2))
def listname_objects():
    request = service.objects().list(
        bucket=bucket_name,
        versions=True
    )
    response = request.execute()
    for item in response['items']:
        print(item['name'] + ' Uploaded on: ' + item['updated'] +
              ' Epoch: ' + item['metadata']['epoch'])
def downloadepoch_objects():
    request = service.objects().list(
        bucket=bucket_name,
        versions=True
    )
    response = request.execute()
    # 'items' may be missing from the response when nothing is listed.
    for item in response.get('items', []):
        # Skip objects without custom metadata or with a different epoch.
        if item.get('metadata', {}).get('epoch') != restore_epoch:
            continue
        print('Downloading ' + item['name'] + ' from ' +
              item['bucket'] + '; Epoch= ' + item['metadata']['epoch'])
        destination = pathlib.Path(restore_location) / item['name']
        print('Saving to: ' + str(destination))
        # os.mkdir fails on nested paths; makedirs creates every missing level.
        os.makedirs(destination.parent, exist_ok=True)
        blob = source_bucket.blob(item['name'])
        blob.download_to_filename(str(destination))
        print('Download complete')
# listall_objects()
# listname_objects()
downloadepoch_objects()
Posted on 2018-01-26 03:20:04
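GCS has no notion of "directories", although tools like gsutil do a good job of pretending for convenience. If you want all of the objects under the "nfs/media/docs/" path, you can specify it as a prefix, like so: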
request = service.objects().list(
    bucket=bucket_name,
    versions=True,
    prefix='nfs/media/docs/',  # Only show objects beginning like this
    delimiter='/'  # Consider this character a directory marker.
)
response = request.execute()
subdirectories = response['prefixes']
objects = response['items']
Because of the prefix parameter, only objects whose names begin with 'nfs/media/docs/' are returned in response['items']. Because of the delimiter parameter, "subdirectories" are returned in response['prefixes'], and objects below them are rolled up into those entries instead of being listed individually. You can find more details in the Python documentation of the objects.list method.
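To make the effect of the two parameters concrete, here is a small local simulation. It is only a sketch of the listing behaviour described above: `list_with_prefix` is a hypothetical function, it makes no network calls, and the object names are made up for the example.

```python
def list_with_prefix(names, prefix='', delimiter=None):
    """Split object names into (items, prefixes) the way a listing would.

    Hypothetical local simulation for illustration; not a GCS API call.
    """
    items = []
    prefixes = set()
    for name in sorted(names):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll everything below the first delimiter into one "subdirectory".
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            items.append(name)
    return items, sorted(prefixes)


names = [
    'test1.txt',
    'nfs/media/docs/test1.txt',
    'nfs/media/docs/archive/old.txt',
    'nfs/media/images/logo.png',
]
items, prefixes = list_with_prefix(names, prefix='nfs/media/docs/', delimiter='/')
print(items)     # ['nfs/media/docs/test1.txt']
print(prefixes)  # ['nfs/media/docs/archive/']
```

Without a delimiter, every name that starts with the prefix comes back as an item and no prefixes are reported.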
If you were to use the newer google-cloud Python library, which I'd recommend for new code, the same call would look pretty similar:
from google.cloud import storage
client = storage.Client()
bucket = client.bucket(bucket_name)
iterator = bucket.list_blobs(
    versions=True,
    prefix='nfs/media/docs/',
    delimiter='/'
)
# Consume the iterator first; prefixes are filled in as pages are fetched.
objects = list(iterator)
subdirectories = iterator.prefixes
https://stackoverflow.com/questions/48449299