How to handle a BigTable scan InvalidChunk exception?

Posted 2024-10-03 19:21:41


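I am trying to scan BigTable data in which some rows are "dirty". Depending on the scan, this fails with a (serialization?) InvalidChunk exception. The code is as follows: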

from google.cloud import bigtable
from google.cloud import happybase

# project_id, instance_id and table_name are defined elsewhere.
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
connection = happybase.Connection(instance=instance)
table = connection.table(table_name)

# Iterating the scan generator is where InvalidChunk is raised.
for key, row in table.scan(limit=5000):  #BOOM!
    pass

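Omitting some columns, limiting the scan to fewer rows, or specifying start and stop keys lets the scan succeed. I cannot tell from the stack trace which values are at fault, since it varies across columns; the scan simply fails. This makes it difficult to clean the data at the source.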

When I step through with the Python debugger, I see that the chunk (of type google.bigtable.v2.bigtable_pb2.CellChunk) has no value (it is NULL/undefined).

I can confirm this from the HBase shell using the row key (which I obtained from self._row.row_key).

So the question becomes: how can a BigTable scan filter out columns with undefined/empty/null values?

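I get the same problem from both Google Cloud APIs, each of which returns a generator that internally streams the data as chunks over gRPC:

google.cloud.happybase.table.Table#scan()
google.cloud.bigtable.table.Table#read_rows()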

The abbreviated stack trace is as follows:

---------------------------------------------------------------------------
InvalidChunk                              Traceback (most recent call last)
<ipython-input-48-922c8127f43b> in <module>()
      1 row_gen = table.scan(limit=n) 
      2 rows = []
----> 3 for kvp in row_gen:
      4     pass
.../site-packages/google/cloud/happybase/table.py in scan(self, row_start, row_stop, row_prefix, columns, timestamp, include_timestamp, limit, **kwargs)
    391         while True:
    392             try:
--> 393                 partial_rows_data.consume_next()
    394                 for row_key in sorted(rows_dict):
    395                     curr_row_data = rows_dict.pop(row_key)

.../site-packages/google/cloud/bigtable/row_data.py in consume_next(self)
    273         for chunk in response.chunks:
    274 
--> 275             self._validate_chunk(chunk)
    276 
    277             if chunk.reset_row:

.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk(self, chunk)
    388             self._validate_chunk_new_row(chunk)
    389         if self.state == self.ROW_IN_PROGRESS:
--> 390             self._validate_chunk_row_in_progress(chunk)
    391         if self.state == self.CELL_IN_PROGRESS:
    392             self._validate_chunk_cell_in_progress(chunk)

.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk_row_in_progress(self, chunk)
    368         self._validate_chunk_status(chunk)
    369         if not chunk.HasField('commit_row') and not chunk.reset_row:
--> 370             _raise_if(not chunk.timestamp_micros or not chunk.value)
    371         _raise_if(chunk.row_key and
    372                   chunk.row_key != self._row.row_key)

.../site-packages/google/cloud/bigtable/row_data.py in _raise_if(predicate, *args)
    439     """Helper for validation methods."""
    440     if predicate:
--> 441         raise InvalidChunk(*args)

InvalidChunk:

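Can you tell me how to scan BigTable from Python while ignoring/logging the dirty rows that raise InvalidChunk? (try ... except does not work around the generator, which lives in the Google Cloud API's row_data PartialRowsData class.)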

Also, can you show me code for a chunked, streaming table scan in BigTable? HappyBase's batch_size and scan_batching do not appear to be supported.


1 answer

User

#1 · Posted 2024-10-03 19:21:41

This may be caused by this bug: https://github.com/googleapis/google-cloud-python/issues/2980

That bug has since been fixed, so this should no longer be an issue.
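
For anyone stuck on an affected client version, here is a minimal sketch of one possible workaround, not something taken from the library itself. The question notes that narrowing the scan can make it succeed and that a try/except around the loop cannot resume a generator once it has raised. The idea below is to scan each row-key prefix with its own generator and its own try/except, so a failure only costs that one prefix and the failing prefix gets logged. The helper name scan_by_prefix, the prefix list, and the fake scan function are hypothetical; the only thing assumed about the real API is the row_prefix keyword that appears in the scan signature in the stack trace.

```python
import logging

logger = logging.getLogger(__name__)


def scan_by_prefix(scan, prefixes, error_types=(Exception,)):
    """Scan each row-key prefix separately; return (rows, failed_prefixes).

    `scan` is any callable that accepts a `row_prefix` keyword and returns an
    iterable of (key, row) pairs, for example the `table.scan` method from
    the question (assumption: it is called with `row_prefix` only).
    """
    rows = []
    failed = []
    for prefix in prefixes:
        batch = []
        try:
            # A generator cannot be resumed after it raises, so every prefix
            # gets a fresh generator wrapped in its own try/except.
            for key, row in scan(row_prefix=prefix):
                batch.append((key, row))
        except error_types as exc:
            logger.warning("scan failed for prefix %r: %r", prefix, exc)
            failed.append(prefix)
            continue  # discard the partial batch for the failing prefix
        rows.extend(batch)
    return rows, failed


# Stand-ins so the sketch runs without a Bigtable instance (hypothetical).
class FakeInvalidChunk(Exception):
    pass


def fake_scan(row_prefix):
    data = {b"a": [(b"a1", {}), (b"a2", {})], b"c": [(b"c1", {})]}
    if row_prefix == b"b":
        yield (b"b1", {})
        raise FakeInvalidChunk()  # simulate a dirty row mid-stream
    for item in data.get(row_prefix, []):
        yield item


rows, failed = scan_by_prefix(fake_scan, [b"a", b"b", b"c"], (FakeInvalidChunk,))
print(len(rows), failed)
```

With a real table you would pass table.scan as scan and the InvalidChunk class as error_types (according to the stack trace it is raised from google/cloud/bigtable/row_data.py). A failing prefix can then be split into longer prefixes and rescanned to narrow down the offending rows.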
