Skipping duplicates in EMR

1 answer

User
Reply #1 · posted 2024-09-27 01:26:22

If I read this right, you want to know whether bi is not in A.
I do not code in Python, but I see it like this (in a C++-like language):
    // the dictionaries are: a[m], b[n]
    bool untranslated(int j, int m, int n, string *a, string *b)
    {
        for (int i = 0; i < m; i++)   // inspect all tokens of A
            if (b[j] == a[i])         // if b[j] is present in A
                return false;
        return true;                  // b[j] was not found in A
    }
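If the dictionary is fairly large, you should replace this linear search with a binary search (which requires the dictionary to be sorted). To speed things up further (if the words are long), use hashing (a hash map) for the matching. Also, depending on your language, you cannot naively compare words with ==. You should instead implement a function that converts a word to its simple grammatical form and store that form in the dictionary. That can be rather complex to implement.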
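A minimal sketch of the hash-based lookup mentioned above, assuming the dictionary fits in memory and that exact string equality is good enough. The helper names `build_lookup` and `not_in_dictionary` are made up for this illustration, and any conversion of words to their base form would have to happen before they are inserted or looked up:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: build a hash set from dictionary A once,
// so that each later lookup is O(1) on average instead of O(m).
std::unordered_set<std::string> build_lookup(const std::vector<std::string>& a)
{
    return std::unordered_set<std::string>(a.begin(), a.end());
}

// Hypothetical helper: returns true when the word is not present in the set.
// The comparison is exact; no grammatical normalization is done here.
bool not_in_dictionary(const std::string& word,
                       const std::unordered_set<std::string>& dict)
{
    return dict.find(word) == dict.end();
}
```

The set is built once and reused for every word of B, which is where the speedup over the repeated linear scan comes from.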
Now the probability for the whole sentence is:
    // your dictionaries:
    const int m = ?, n = ?;
    string A[m], B[n];
    // code:
    int j;
    float p;
    for (p = 0.0, j = 0; j < n; j++)    // test all words of B
        if (untranslated(j, m, n, A, B))
            p++;                        // and count how many are untranslated
    p /= float(n);                      // normalize p to <0,1>: the probability that sentence B is not in A
The resulting probability p is in the range <0,1>, so if you want a percentage, just multiply it by 100.
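For a quick check of the arithmetic, here is a self-contained sketch of the same computation using `std::vector` instead of raw arrays. The function name `untranslated_fraction` and the guard for an empty sentence are assumptions added for this example, not part of the code above:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: fraction of the words in sentence b that do not occur in dictionary a.
// Uses a linear search per word, like the untranslated() function above.
double untranslated_fraction(const std::vector<std::string>& a,
                             const std::vector<std::string>& b)
{
    if (b.empty())
        return 0.0;                       // assumption: an empty sentence yields 0, avoiding division by zero
    int count = 0;
    for (const std::string& w : b)        // test all words of B
        if (std::find(a.begin(), a.end(), w) == a.end())
            ++count;                      // w is not present in A
    return static_cast<double>(count) / static_cast<double>(b.size());
}
```

With A = {"the", "cat"} and B = {"the", "dog", "cat", "bird"}, two of the four words are missing from A, so the function returns 0.5, which is 50 percent.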
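[Edit1] occurrence of bi
This is a completely different problem, but it is also relatively easy to solve. It is the same as computing a histogram, so:
Add a counter to each word in the A dictionary
Each record of A then looks like this: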
    struct A_record
    {
        string word;
        int cnt;
    };
    int m = 0;        // number of records currently in use
    A_record a[];     // dictionary storage (size left unspecified)
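Process the B sentence
Look up each word in the dictionary. If it is not there, add it to the dictionary and set its counter to 1. If it is there, just increment its counter by 1.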
    const int n = ?;           // input sentence word count
    string b[n] = {...};       // input sentence words
    int i, j;
    for (i = 0; i < n; i++)    // process B
    {
        for (j = 0; j < m; j++)    // search in A (should be binary search or hash-map search)
            if (b[i] == a[j].word)
            {
                // here a[j].cnt is the occurrence count of bi you wanted;
                // divide it by the total number of words counted to get a relative frequency in <0,1>
                a[j].cnt++;
                break;
            }
        if (j >= m)                // the search found no match: no previous occurrence of bi
        {
            a[m].word = b[i];
            a[m].cnt = 1;
            m++;
        }
    }
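Now, if you only want the previous occurrences of bi, look at the matched a[j].cnt during the search. If you want the occurrences of any b[i] word across the whole text, look at the same counter after the whole text has been processed.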
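Both readings of the counter can be sketched with a hash map, assuming `std::unordered_map` is an acceptable stand-in for the record array. The function name `previous_occurrences` is invented for this example:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: processes sentence b against the running histogram `counts`.
// Returns, for each word b[i], how many times it had already been seen
// before position i (including anything counted earlier in `counts`).
// After the call, `counts` holds the updated totals for the whole text so far.
std::vector<int> previous_occurrences(std::unordered_map<std::string, int>& counts,
                                      const std::vector<std::string>& b)
{
    std::vector<int> prev;
    prev.reserve(b.size());
    for (const std::string& w : b)
    {
        int& c = counts[w];    // operator[] inserts the word with counter 0 if it is absent
        prev.push_back(c);     // counter value before this occurrence
        ++c;                   // count this occurrence
    }
    return prev;
}
```

For the sentence {"a", "b", "a", "a"} and an empty histogram, the returned values are 0, 0, 1, 2, and afterwards the histogram holds 3 for "a" and 1 for "b".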
