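I am interested in mining the scientific literature, PubMed in particular. I want to identify the modifiers to the left and right of a keyword. My plan is: (1) query my database of hearing and hearing-aid literature for the word "AID"; (2) strip punctuation, double spaces, and so on from the field containing title + abstract, which is all uppercase for historical reasons; (3) split the text on spaces; (4) remove stopwords using a list I obtained from MySQL (in hindsight, this list should live in a class somewhere); (5) look for the keyword "AID" and collect the keys (words) before and after it. As I am new to Python and SQLite, the code is pieced together from StackOverflow and many other sites. The problem area of the code is below.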
my_stopwords = '''['A','ABLE','ABOUT','ABOVE','ACCORDING','ACCORDINGLY','ACROSS','ACTUALLY','AFTER','AFTERWARDS','AGAIN','AGAINST','ALL','ALLOW','ALLOWS','ALMOST','ALONE','ALONG','ALREADY','ALSO','ALTHOUGH','ALWAYS','AM','AMONG','AMONGST','AN','ANOTHER',
'ANY','ANYBODY','ANYHOW','ANYONE','ANYTHING','ANYWAY','ANYWAYS','ANYWHERE','APART','APPEAR','APPRECIATE','APPROPRIATE','ARE',
'AROUND','AS','ASIDE','ASK','ASKING','ASSOCIATED','AT','AVAILABLE','AWAY','AWFULLY','BE','BECAME','BECAUSE','BECOME','BECOMES',
'BECOMING','BEEN','BEFORE','BEFOREHAND','BEHIND','BEING','BELIEVE','BELOW','BESIDE','BESIDES','BEST','BETTER','BETWEEN','BEYOND',
'BOTH','BRIEF','BUT','BY','CAME','CAN','CANNOT','CANT','CAUSE','CAUSES','CERTAIN','CERTAINLY','CHANGES','CLEARLY','CO','COM','COME',
'COMES','CONCERNING','CONSEQUENTLY','CONSIDER','CONSIDERING','CONTAIN','CONTAINING','CONTAINS','CORRESPONDING','COULD','COURSE',
'CURRENTLY','DEFINITELY','DESCRIBED','DESPITE','DETERMINE','DETERMINED','DID','DIFFERENT','DO','DOES','DOING','DONE','DOWN','DOWNWARDS','DURING','EACH','EDU',
'EFFECT','EFFECTS','EG','EIGHT','EITHER','ELSE','ELSEWHERE','ENOUGH','ENTIRELY','ESPECIALLY','ET','ETC','EVEN','EVER','EVERY','EVERYBODY','EVERYONE',
'EVERYTHING','EVERYWHERE','EX','EXACTLY','EXAMPLE','EXCEPT','FAR','FEW','FIFTH','FIRST','FIVE','FOLLOWED','FOLLOWING','FOLLOWS',
'FOR','FORMER','FORMERLY','FORTH','FOUR','FROM','FURTHER','FURTHERMORE','GET','GETS','GETTING','GIVEN','GIVES','GO','GOES','GOING',
'GONE','GOT','GOTTEN','GREETINGS','HAD','HAPPENS','HARDLY','HAS','HAVE','HAVING','HE','HELLO','HELP','HENCE','HER','HERE','HEREAFTER',
'HEREBY','HEREIN','HEREUPON','HERS','HERSELF','HI','HIM','HIMSELF','HIS','HITHER','HOPEFULLY','HOW','HOWBEIT','HOWEVER','IE','IF',
'IGNORED','IMMEDIATE','IN','INASMUCH','INC','INDEED','INDICATE','INDICATED','INDICATES','INNER','INSOFAR','INSTEAD','INTO','INWARD',
'IS','IT','ITS','ITSELF','JUST','KEEP','KEEPS','KEPT','KNOW','KNOWN','KNOWS','LAST','LATELY','LATER','LATTER','LATTERLY','LEAST','LESS',
'LEST','LET','LIKED','LIKELY','LITTLE','LOOK','LOOKING','LOOKS','LTD','MAINLY','MANY','MAY','MAYBE','ME','MEAN','MEANWHILE','MERELY',
'MIGHT','MORE','MOREOVER','MOST','MOSTLY','MUCH','MUST','MY','MYSELF','NAME','NAMELY','ND','NEAR','NEARLY','NECESSARY','NEED','NEEDS',
'NEITHER','NEVER','NEVERTHELESS','NEW','NEXT','NINE','NO','NOBODY','NON','NONE','NOONE','NOR','NORMALLY','NOT','NOTHING','NOVEL','NOW',
'NOWHERE','OBVIOUSLY','OF','OFF','OFTEN','OH','OK','OKAY','OLD','ON','ONCE','ONE','ONES','ONLY','ONTO','OTHER','OTHERS','OTHERWISE',
'OUGHT','OUR','OURS','OURSELVES','OUT','OUTSIDE','OVER','OVERALL','OWN','PARTICULAR','PARTICULARLY','PER','PERHAPS','PLACED','PLEASE',
'PLUS','POSSIBLE','PRESUMABLY','PROBABLY','PROVIDES','QUE','QUITE','QV','RATHER','RD','RE','REALLY','REASONABLY','REGARDING',
'REGARDLESS','REGARDS','RELATIVELY','RESPECTIVELY','RIGHT','SAID','SAME','SAW','SAY','SAYING','SAYS','SECOND','SECONDLY','SEE','SEEING',
'SEEM','SEEMED','SEEMING','SEEMS','SEEN','SELF','SELVES','SENSIBLE','SENT','SERIOUS','SERIOUSLY','SEVEN','SEVERAL','SHALL','SHE','SHOULD',
'SHOWED','SHOWS','SINCE','SIGNIFICANTLY','SIX','SO','SOME','SOMEBODY','SOMEHOW','SOMEONE','SOMETHING','SOMETIME','SOMETIMES','SOMEWHAT','SOMEWHERE','SOON','SORRY',
'SPECIFIED','SPECIFY','SPECIFYING','STILL','STUDY','SUB','SUCH','SUP','SURE','TAKE','TAKEN','TELL','TENDS','TH','THAN','THANK','THANKS',
'THANX','THAT','THATS','THE','THEIR','THEIRS','THEM','THEMSELVES','THEN','THENCE','THERE','THEREAFTER','THEREBY','THEREFORE',
'THEREIN','THERES','THEREUPON','THESE','THEY','THINK','THIRD','THIS','THOROUGH','THOROUGHLY','THOSE','THOUGH','THREE','THROUGH',
'THROUGHOUT','THRU','THUS','TO','TOGETHER','TOO','TOOK','TOWARD','TOWARDS','TRIED','TRIES','TRULY','TRY','TRYING','TWICE','TWO',
'UN','UNDER','UNFORTUNATELY','UNLESS','UNLIKELY','UNTIL','UNTO','UP','UPON','US','USE','USED','USEFUL','USES','USING','USUALLY',
'VALUE','VARIOUS','VERY','VIA','VIZ','VS','WANT','WANTS','WAS','WAY','WE','WELCOME','WELL','WENT','WERE','WHAT','WHATEVER','WHEN',
'WHENCE','WHENEVER','WHERE','WHEREAFTER','WHEREAS','WHEREBY','WHEREIN','WHEREUPON','WHEREVER','WHETHER','WHICH','WHILE','WHITHER',
'WHO','WHOEVER','WHOLE','WHOM','WHOSE','WHY','WILL','WILLING','WISH','WITH','WITHIN','WITHOUT','WONDER','WOULD','YES','YET','YOU',
'YOUR','YOURS','YOURSELF','YOURSELVES, 'zzz', 'ZZZ', zzSTOPzz']'''
str_split = string.split(' ')
keys = [word for word in str_split if word.upper() not in my_stopwords]
print ("Split Input: ", keys)
num_wds = len(keys)
print("Number of words = ", num_wds, "\n")
This basically works, but the keyword "AID" has presented me with a dilemma. Sample output follows.
After the initial query (code not shown), I get the following.
Input Abstract: PMID21839526zzz BONE-ANCHORED HEARING **AID** (BAHA) IN PATIENTS WITH TREACHER COLLINS SYNDROME: ....
After I clean out punctuation and so on, I get the following.
Cleaned Input: PMID21839526zzz BONE-ANCHORED HEARING **AID** BAHA IN PATIENTS WITH TREACHER COLLINS SYNDROME....
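After I run the code above to split on spaces and remove stopwords, using a list that does not contain the word AID, I get the following. Note that the word "AID" has been removed from the list, which defeats my purpose.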
Split Input: ['PMID21839526zzz', 'BONE-ANCHORED', 'HEARING', 'BAHA', 'PATIENTS', 'TREACHER', 'COLLINS', 'SYNDROME',....
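This code works as expected with other keywords, including "AIDS", "MAGNETIC", and so on. The problem occurs with the three-letter keyword "AID". I would greatly appreciate an explanation or any thoughts on why this happens in this specific case. I hope this is clear enough. Thanks for your help.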
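I do not fully understand your algorithm, but your stopword list needs to be a list (or better, a set), not a string: otherwise the in operator performs a substring match rather than an exact match against the entries of a list.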
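A minimal sketch of that change, assuming a small illustrative subset of the stopwords rather than the full list from the question. The names STOPWORDS and remove_stopwords are hypothetical and do not appear in the original code.

```python
# Illustrative subset only; the real list in the question is much longer.
# A set gives exact, whole-word membership tests.
STOPWORDS = {"A", "IN", "SAID", "THE", "WITH"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Split on whitespace and drop words that exactly match a stopword."""
    return [word for word in text.split() if word.upper() not in stopwords]


cleaned = "BONE-ANCHORED HEARING AID BAHA IN PATIENTS WITH TREACHER COLLINS SYNDROME"
keys = remove_stopwords(cleaned)
print("Split Input: ", keys)
print("Number of words = ", len(keys))
```

With exact matching, "AID" is kept even though "SAID" is a stopword, while "IN" and "WITH" are still removed.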
For example, with s = "['THEY', 'THEM']", 'HE' in s is True. With s = ['THEY', 'THEM'], 'HE' in s is False. The former is a string whose contents merely look like Python list syntax; the latter is an actual Python list.
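Applied to the question, the likely culprit is the stopword 'SAID', which contains 'AID' as a substring, whereas 'AIDS' is not a substring of anything in the list. The sketch below illustrates this with a three-word excerpt of the stopwords and then collects the words on either side of the keyword. The function neighbours and the variable names are hypothetical, introduced only for illustration.

```python
stopwords_as_string = "['SAID', 'SAME', 'SAW']"  # a str that only looks like a list
stopwords_as_set = {"SAID", "SAME", "SAW"}       # a real collection of words

# On a str, `in` is a substring test: 'AID' occurs inside 'SAID'.
print("AID" in stopwords_as_string)   # True
# 'AIDS' is not a substring of the text, so it survives the filter.
print("AIDS" in stopwords_as_string)  # False
# On a set, `in` is an exact membership test.
print("AID" in stopwords_as_set)      # False


def neighbours(words, keyword):
    """Return a (previous, next) pair for each exact occurrence of keyword."""
    pairs = []
    for i, word in enumerate(words):
        if word == keyword:
            before = words[i - 1] if i > 0 else None
            after = words[i + 1] if i + 1 < len(words) else None
            pairs.append((before, after))
    return pairs


keys = ["BONE-ANCHORED", "HEARING", "AID", "BAHA", "PATIENTS"]
print(neighbours(keys, "AID"))        # [('HEARING', 'BAHA')]
```

Missing neighbours at the start or end of the text are reported as None, so the caller can decide how to treat a keyword at a boundary.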