I have rows from this dataset (https://raw.githubusercontent.com/alvations/stasis/master/sts.csv):
dataset Domain Score Sent1 Sent2
STS2012-gold surprise.OnWN 5.000 render one language in another language restate (words) from one language into another language.
STS2012-gold surprise.OnWN 3.250 nations unified by shared interests, history or institutions a group of nations having common interests.
STS2012-gold surprise.OnWN 3.250 convert into absorbable substances, (as if) with heat or chemical process soften or disintegrate by means of chemical action, heat, or moisture.
STS2012-gold surprise.OnWN 4.000 devote or adapt exclusively to an skill, study, or work devote oneself to a special area of work.
STS2012-gold surprise.OnWN 3.250 elevated wooden porch of a house a porch that resembles the deck on a ship.
I read it into a graphlab.SFrame with the read_csv() function, but some rows were not parsed. The log output is as follows:
PROGRESS: Unable to parse line "STS2012-gold MSRpar 3.800 "She was crying and scared,' said Isa Yasin, the owner of the store. "She was crying and she was really scared," said Yasin."
PROGRESS: Unable to parse line "STS2012-gold MSRpar 2.200 "And about eight to 10 seconds down, I hit. "I was in the water for about eight seconds."
PROGRESS: Unable to parse line "STS2012-gold MSRpar 2.800 "It's a major victory for Maine, and it's a major victory for other states. The Maine program could be a model for other states."
PROGRESS: Unable to parse line "STS2012-gold MSRpar 4.000 "Right from the beginning, we didn't want to see anyone take a cut in pay. But Mr. Crosby told The Associated Press: "Right from the beginning, we didn't want to see anyone take a cut in pay."
PROGRESS: Unable to parse line "STS2014-gold deft-forum 0.8 "Then the captain was gone. Then the captain came back."
PROGRESS: Unable to parse line "STS2014-gold deft-forum 1.8 "Oh, you're such a good person! You're such a bad person!""
PROGRESS: Unable to parse line "STS2012-train MSRpar 3.750 "We put a lot of effort and energy into improving our patching process, probably later than we should have and now we're just gaining incredible speed. "We've put a lot of effort and energy into improving our patching progress, p..."
PROGRESS: Unable to parse line "STS2012-train MSRpar 4.000 "Tomorrow at the Mission Inn, I have the opportunity to congratulate the governor-elect of the great state of California. "I have the opportunity to congratulate the governor-elect of the great state of California, and I'm lookin..."
PROGRESS: Unable to parse line "STS2012-train MSRpar 3.600 "Unlike many early-stage Internet firms, Google is believed to be profitable. The privately held Google is believed to be profitable."
PROGRESS: Unable to parse line "STS2012-train MSRpar 4.000 "It was a final test before delivering the missile to the armed forces. State radio said it was the last test before the missile was delivered to the armed forces."
PROGRESS: 22 lines failed to parse correctly
PROGRESS: Finished parsing file /home/alvas/git/stasis/sts.csv
PROGRESS: Parsing completed. Parsed 19075 lines in 0.069578 secs.
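Looking at these lines, the problem seems to arise when the Sent1 or Sent2 column contains an odd number of double quotes. Using error_bad_lines to track down the problematic rows: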
sts = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', column_type_hints=[str, str, float, str, str],
error_bad_lines=True)
It raises the following traceback:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-15-a1ec53597af9> in <module>()
1 sts = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', column_type_hints=[str, str, float, str, str],
----> 2 error_bad_lines=True)
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in read_csv(cls, url, delimiter, header, error_bad_lines, comment_char, escape_char, double_quote, quote_char, skip_initial_space, column_type_hints, na_values, line_terminator, usecols, nrows, skiprows, verbose, **kwargs)
1537 verbose=verbose,
1538 store_errors=False,
-> 1539 **kwargs)[0]
1540
1541
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in _read_csv_impl(cls, url, delimiter, header, error_bad_lines, comment_char, escape_char, double_quote, quote_char, skip_initial_space, column_type_hints, na_values, line_terminator, usecols, nrows, skiprows, verbose, store_errors, **kwargs)
1097 glconnect.get_client().set_log_progress(False)
1098 with cython_context():
-> 1099 errors = proxy.load_from_csvs(internal_url, parsing_config, type_hints)
1100 except Exception as e:
1101 if type(e) == RuntimeError and "CSV parsing cancelled" in e.message:
/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Unable to parse line "STS2012-gold MSRpar 3.800 "She was crying and scared,' said Isa Yasin, the owner of the store. "She was crying and she was really scared," said Yasin."
Set error_bad_lines=False to skip bad lines
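If my rows contain an odd number of double quotes, is there a way to solve this?
Is there a way to do this without cleaning the data (for example, identifying the problematic rows and then cleaning/correcting them, while keeping another SFrame to track what was cleaned/corrected)?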
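As a sanity check, if we search for \t in the original CSV file, the problematic lines do contain a tab, but it disappears when graphlab parses them:
As another sanity check, reading the file line by line and splitting on \t returns 5 columns for every line in the file: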
alvas@ubi:~/git/stasis$ head sts.csv
Dataset Domain Score Sent1 Sent2
STS2012-gold surprise.OnWN 5.000 render one language in another language restate (words) from one language into another language.
STS2012-gold surprise.OnWN 3.250 nations unified by shared interests, history or institutions a group of nations having common interests.
STS2012-gold surprise.OnWN 3.250 convert into absorbable substances, (as if) with heat or chemical process soften or disintegrate by means of chemical action, heat, or moisture.
STS2012-gold surprise.OnWN 4.000 devote or adapt exclusively to an skill, study, or work devote oneself to a special area of work.
STS2012-gold surprise.OnWN 3.250 elevated wooden porch of a house a porch that resembles the deck on a ship.
STS2012-gold surprise.OnWN 4.000 either half of an archery bow either of the two halves of a bow from handle to tip.
STS2012-gold surprise.OnWN 3.333 a removable device that is an accessory to larger object a supplementary part or accessory.
STS2012-gold surprise.OnWN 4.750 restrict or confine place limits on (extent or access).
STS2012-gold surprise.OnWN 0.500 orient, be positioned be opposite.
alvas@ubi:~/git/stasis$ python
Python 2.7.10 (default, Jun 30 2015, 15:30:23)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('sts.csv') as fin:
... for line in fin:
... print len(line.split('\t'))
... break
...
5
>>> with open('sts.csv') as fin:
... for line in fin:
... assert len(line.split('\t')) == 5
...
>>>
In my version of graphlab, @papayawarrior's example with the 4-column row is parsed correctly:
I manually checked the problematic lines; they are:
STS2012-gold MSRpar 3.800 "She was crying and scared,' said Isa Yasin, the owner of the store. "She was crying and she was really scared," said Yasin.
STS2012-gold MSRpar 2.200 "And about eight to 10 seconds down, I hit. "I was in the water for about eight seconds.
STS2012-gold MSRpar 2.800 "It's a major victory for Maine, and it's a major victory for other states. The Maine program could be a model for other states.
STS2012-gold MSRpar 4.000 "Right from the beginning, we didn't want to see anyone take a cut in pay. But Mr. Crosby told The Associated Press: "Right from the beginning, we didn't want to see anyone take a cut in pay.
STS2012-train MSRpar 3.750 "We put a lot of effort and energy into improving our patching process, probably later than we should have and now we're just gaining incredible speed. "We've put a lot of effort and energy into improving our patching progress, probably later than we should have.
STS2012-train MSRpar 4.000 "Tomorrow at the Mission Inn, I have the opportunity to congratulate the governor-elect of the great state of California. "I have the opportunity to congratulate the governor-elect of the great state of California, and I'm looking forward to it."
STS2012-train MSRpar 3.600 "Unlike many early-stage Internet firms, Google is believed to be profitable. The privately held Google is believed to be profitable.
STS2012-train MSRpar 4.000 "It was a final test before delivering the missile to the armed forces. State radio said it was the last test before the missile was delivered to the armed forces.
STS2012-train MSRpar 4.750 "The economy, nonetheless, has yet to exhibit sustainable growth. But the economy hasn't shown signs of sustainable growth.
STS2014-gold deft-forum 0.8 "Then the captain was gone. Then the captain came back.
STS2014-gold deft-forum 1.8 "Oh, you're such a good person! You're such a bad person!"
STS2015-gold answers-forums "Normal, healthy (physically, nutritionally and mentally) individuals have little reason to worry about accidentally consuming too much water. It's fine to skip arm specific exercises if you are already happy with how they are progressing without direct exercises.
STS2015-gold answers-forums 1.40 "The grass family is one of the most widely distributed and abundant groups of plants on Earth. As noted on the Wiki page, grass seed was imported to the new world to improve pasturage for livestock.
STS2015-gold answers-forums "God is exactly this Substance underlying who supports, exist independently of, and persist through time changes in material nature. I'd argue that matter and energy are substances in the category of empirical scientific knowledge.
STS2015-gold belief "watching the first fight i saw that manny pacquiao was getting tired, and i wasn't. at the same time, an asian summit is being held in a tourist resort.
STS2015-gold belief "global warming doesn't mean every year will be warmer than the last. doesn't matter, that will just be obama's fault as well.
STS2015-gold belief "the only reason i'm not as confident that there's something about the birth certificate... the conventional view is that the us and ussr fought it out in the body of vietnam.
STS2015-gold belief "im not playing these bullshit games... if not get the hell out of there.
STS2015-gold belief "that oil is already contaminating our shoreline. what point are you trying to relay?
STS2015-gold belief "we cannot write history with laws. "she's not sitting here" he said.
STS2015-gold belief the protest is going well so far. our request is the same.
STS2015-gold belief "for over 20 years, i have illustrated the absurd with absurdity, three hours a day, five days a week. for the first 1-2 years he hated me going out with my friends.
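Instead of finding these lines manually by repeatedly scraping them from the verbose PROGRESS: ... messages, is there a way to dump these lines while loading the file into a Graphlab SFrame?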
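One workaround that avoids scraping the log is a plain-Python pre-scan of the file. The sketch below is only an illustration of the odd-number-of-quotes hypothesis stated above; the helper name find_unbalanced and the column positions (Sent1 and Sent2 at indexes 3 and 4) are assumptions, and it does not use GraphLab at all.

```python
def find_unbalanced(lines, text_columns=(3, 4)):
    """Return (line_number, line) pairs where a text field holds an odd number of '"'."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        fields = stripped.split("\t")
        # Check only the sentence columns; skip indexes a short row does not have.
        if any(fields[i].count('"') % 2 for i in text_columns if i < len(fields)):
            flagged.append((number, stripped))
    return flagged


# Three rows taken from the dataset, rebuilt with explicit tabs.
sample = [
    "STS2012-gold\tsurprise.OnWN\t4.750\trestrict or confine\tplace limits on (extent or access).",
    "STS2014-gold\tdeft-forum\t0.8\t\"Then the captain was gone.\tThen the captain came back.",
    "STS2014-gold\tdeft-forum\t1.8\t\"Oh, you're such a good person!\tYou're such a bad person!\"",
]

flagged = find_unbalanced(sample)
for number, line in flagged:
    print(number, line)
```

On the real file the same function could be called as find_unbalanced(open('sts.csv')). It is only a heuristic: a row with no quotes at all, such as the "the protest is going well so far." row listed above, would not be flagged.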
Updated answer
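Apologies to @alvas, I did not see the link to the full dataset in the original post. All rows do indeed have five columns, and the problem appears to be mismatched quotes. When a column contains an unmatched quote, the SFrame CSV parser gets confused, so the simple answer is to change the quote character to one that does not occur in the dataset. Doing so read all 19097 rows successfully for me (the 19075 parsed earlier plus the 22 that failed).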
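The effect can be reproduced outside GraphLab. The following is a minimal sketch that uses Python's standard csv module as a stand-in, so it shows the general mechanism rather than the SFrame parser itself, whose details may differ. The replacement quote character '\x1f' is an assumption (any character absent from the data would do). In GraphLab the change would presumably go through the quote_char parameter that appears in the read_csv signature in the traceback; that call is not shown or tested here.

```python
import csv

# Two of the rows that failed to parse, rebuilt with explicit tabs.
rows = [
    "\t".join(["STS2012-gold", "MSRpar", "2.200",
               '"And about eight to 10 seconds down, I hit.',
               '"I was in the water for about eight seconds.']),
    "\t".join(["STS2014-gold", "deft-forum", "0.8",
               '"Then the captain was gone.',
               "Then the captain came back."]),
]


def parse(line, **kwargs):
    """Parse a single tab-separated line with the standard csv module."""
    return next(csv.reader([line], delimiter="\t", **kwargs))


# Default quote character '"': the opening quote makes the parser treat the
# following tab as part of the field, so Sent1 and Sent2 merge into one column.
default_counts = [len(parse(row)) for row in rows]

# Assumption: '\x1f' never occurs in the data, so '"' is now an ordinary character.
fixed_counts = [len(parse(row, quotechar="\x1f")) for row in rows]

print(default_counts, fixed_counts)
```

This is also consistent with the sanity check in the question: the tab is present in the raw line, but a parser that honours the opening quote no longer treats it as a delimiter.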
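In addition, there is an SFrame.read_csv_with_errors method, which reads the "good" rows into an SFrame and collects the "bad" rows in an SArray of unparsed lines. That lets you track down the problematic rows programmatically.
Original answer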
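Your data rows do not appear to contain any quotes, so that is not the problem. The problem is that some data rows (and the header) have 5 columns, while other data rows have only 4.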
The first row has four columns:
The second row has five:
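To work around this, I would call the SFrame CSV parser twice, once for the four-column data and once for the five-column data. Because the first row has four columns, that case is the simpler one:
For the five-column data, we have to skip the header and the first row, then rename the columns: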
Then sts4 looks like:
and sts5 is: