Improving Elasticsearch indexing performance

Posted 2024-06-26 13:48:16


I am working with a system called commonsearch. This post is specifically about its backend, which is written in Python.

The backend streams WARC files and indexes their content into two Elasticsearch clusters: 1) the Text Elasticsearch cluster and 2) the Document Elasticsearch cluster.

Before my changes, indexing took about 0.02 seconds per document on average.

After my changes, it takes about 1.00 seconds (0.4 seconds on AWS).

Here is what I changed.

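I use html2text to strip the HTML from each WARC body. That does not actually take much time (maybe +0.02 seconds), but it does make the timings spikier: the more content a page has, the longer stripping its HTML takes.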

I added two TextBlob text classifier (NaiveBayes) checks for every indexed value. The trained classifiers are serialized (pickle) and loaded before the loop.

The first training set contains 33,000 entries; the second contains a few hundred (I will add more to the second one).
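
To see which of these changes costs the most, it can help to time each stage separately instead of timing the whole indexing step. The following is a minimal sketch of that idea, not the commonsearch code: `strip_html` is a standard-library stand-in for html2text, `classify` and the pickled dictionary are hypothetical placeholders for the TextBlob classifiers, and the sample documents are invented.

```python
import pickle
import time
from collections import defaultdict
from contextlib import contextmanager
from html.parser import HTMLParser

stage_seconds = defaultdict(float)  # total wall-clock seconds per stage


@contextmanager
def timed(stage):
    """Add the wall-clock time of the enclosed block to `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


class _TextOnly(HTMLParser):
    """Stand-in for html2text (assumption): keeps text nodes, drops tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(" ".join(parser.parts).split())


def classify(model, text):
    """Toy classifier (assumption): label of the first known word."""
    for word in text.lower().split():
        if word in model:
            return model[word]
    return "unknown"


# Hypothetical trained model, pickled once and loaded once before the loop.
model_blob = pickle.dumps({"bikes": "sports", "zombie": "entertainment"})
with timed("load_model"):
    model = pickle.loads(model_blob)

documents = [
    "<html><body><h1>Cyclocross</h1><p>Specialized bikes video</p></body></html>",
    "<html><body><p>Zombie crawl aftermath</p></body></html>",
]

labels = []
for html in documents:
    with timed("strip_html"):
        text = strip_html(html)
    with timed("classify"):
        labels.append(classify(model, text))

for stage, seconds in stage_seconds.items():
    print(f"{stage}: {seconds:.6f} s")
```

Swapping the stand-ins for the real html2text call and the real classifiers would show directly how the per-document time splits between stripping and classification.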

Performance analysis

About 10 samples for each configuration.

Before the changes:

Indexing http://2sao.vn/p1004c1007n20110413113841718/mau-vay-du-tiec-cho-quy-co-hoan-hao.vnn
--- 0.0224668979645 seconds ---
Indexing http://2sidesoftheocean.blogspot.com/2012/04/my-first-family-in-1940-us-census_02.html
--- 0.0367019176483 seconds ---
Indexing http://3.pulsitemeter.com/exbii/exbii-photos-aunties-bath-.html
--- 0.00342702865601 seconds ---
Indexing http://303cycling.com/Meredith-Miller-USGP-Cyclocross-Video-Specialized-bikes
--- 0.0187289714813 seconds ---
Indexing http://303magazine.com/2012/10/undead-mans-party-casselmans-hosts-zombie-crawl-aftermath-featuring-celldweller/
--- 0.0460560321808 seconds ---
Indexing http://38-avg.blogspot.com/2008/05/birdheart.html
--- 0.0178949832916 seconds ---
Indexing http://3docean.net/item/motorola-droid-razr-low-poly-/3712487?sso
--- 0.0468878746033 seconds ---
Indexing http://4.bp.blogspot.com/_hZs38tqNXns/StdbQyR_zGI/AAAAAAAAEyw/VvNCalngDbY/s1600-h/Vanderwood
--- 0.00142908096313 seconds ---
Indexing http://411mania.com/sports/young-firpo-the-best-light-heavywieght-to-never-win-a-title/
--- 0.0295450687408 seconds ---

After adding html2text:

Indexing http://17hmr.net/index.php?action=profile;area=showposts;u=994
--- 0.0240960121155 seconds ---
Indexing http://17hmr.net/index.php?board=1.3060;sort=last_post
--- 0.0262401103973 seconds ---
Indexing http://17hmr.net/index.php?topic=12827.msg177073
--- 0.0259499549866 seconds ---
Indexing http://17hmr.net/index.php?topic=6751.45
--- 0.0249440670013 seconds ---
Indexing http://1889.ca/2012/11/interview-with-horror-author-mike-kearby/
--- 0.0152020454407 seconds ---
Indexing http://1980s.fm/modules.php?name=Forums&file=profile&mode=viewprofile&u=94
--- 0.151058912277 seconds ---
Indexing http://1n73r.net/category/microsoft/windows-microsoft/xp/
--- 0.0693669319153 seconds ---
Indexing http://2013missworld.com/
--- 0.0448951721191 seconds ---
Indexing http://24demayito.blogspot.com/
--- 0.111493110657 seconds ---
Indexing http://24kadra.com/2009/03/04/serial-bratany/
--- 0.145864963531 seconds ---

After adding html2text and one classifier (the small one):

Indexing http://102theriver.iheart.com/articles
--- 0.333050012589 seconds ---
Indexing http://1035kissfm.iheart.com/articles/trending-104650/reading-rainbow-campaign-nets-1-million-12410738
--- 0.334407091141 seconds ---
Indexing http://1037theq.iheart.com/articles/trending-465498/tiesto-celebrates-a-town-called-paradise-12478486/
--- 0.34556388855 seconds ---
Indexing http://1065ctq.iheart.com/articles/national-news-104668/new-electronic-license-plates-could-be-11383289/
--- 0.330471038818 seconds ---
Indexing http://10kbullets.com/reviews/neon-nights/
--- 0.328196048737 seconds ---
Indexing http://12160.info/group/gunsandtactics/forum/topic/show?id=2649739%3ATopic%3A1105218&xg_source=msg
--- 0.353976011276 seconds ---
Indexing http://12under12under2012.blogspot.com/2012/04/aprils-forsta-vinnare-blev.html
--- 0.363568067551 seconds ---
Indexing http://1350kman.com/settlement-reached-in-salina-contamination-cleanup/
--- 0.367321968079 seconds ---
Indexing http://14ers.com/php14ers/loginviaforum.php?prgm=peakstatus_main
--- 0.309129953384 seconds ---
Indexing http://16sarkisozleri.blogspot.com/2012/12/nasip-degilmis-demet-akaln-ftozcan-deniz.html
--- 0.361335992813 seconds ---

After adding html2text and one classifier (the large one):

Indexing http://10000birds.com/white-crested-laughingthrush.htm
--- 2.16983008385 seconds ---
Indexing http://1012lounge.com/
--- 1.48357391357 seconds ---
Indexing http://1015store.com/dresses-by-colors/coral-dresses.html
--- 1.85999703407 seconds ---
Indexing http://1019ampradio.cbslocal.com/tag/happy-holidays/
--- 1.24361300468 seconds ---
Indexing http://102theriver.iheart.com/articles
--- 1.25308895111 seconds ---
Indexing http://1035kissfm.iheart.com/articles/trending-104650/reading-rainbow-campaign-nets-1-million-12410738
--- 1.19226098061 seconds ---
Indexing http://1037theq.iheart.com/articles/trending-465498/tiesto-celebrates-a-town-called-paradise-12478486/
--- 1.14514183998 seconds ---
Indexing http://1065ctq.iheart.com/articles/national-news-104668/new-electronic-license-plates-could-be-11383289/
--- 1.09987902641 seconds ---
Indexing http://10kbullets.com/reviews/neon-nights/
--- 1.07253599167 seconds ---
Indexing http://12160.info/group/gunsandtactics/forum/topic/show?id=2649739%3ATopic%3A1105218&xg_source=msg
--- 1.1537129879 seconds ---

After adding html2text and both classifiers:

Indexing http://12under12under2012.blogspot.com/2012/04/aprils-forsta-vinnare-blev.html
--- 1.43961000443 seconds ---
Indexing http://1350kman.com/settlement-reached-in-salina-contamination-cleanup/
--- 1.37341785431 seconds ---
Indexing http://14ers.com/php14ers/loginviaforum.php?prgm=peakstatus_main
--- 1.26939201355 seconds ---
Indexing http://16sarkisozleri.blogspot.com/2012/12/nasip-degilmis-demet-akaln-ftozcan-deniz.html
--- 1.36402606964 seconds ---
Indexing http://17hmr.net/index.php?action=profile;area=showposts;u=994
--- 1.23323822021 seconds ---
Indexing http://17hmr.net/index.php?board=1.3060;sort=last_post
--- 1.22554993629 seconds ---
Indexing http://17hmr.net/index.php?topic=12827.msg177073
--- 1.23036003113 seconds ---
Indexing http://17hmr.net/index.php?topic=6751.45
--- 1.20131611824 seconds ---
Indexing http://1889.ca/2012/11/interview-with-horror-author-mike-kearby/
--- 1.1732749939 seconds ---
Indexing http://1980s.fm/modules.php?name=Forums&file=profile&mode=viewprofile&u=94
--- 1.36015105247 seconds ---
Indexing http://1n73r.net/category/microsoft/windows-microsoft/xp/
--- 1.2988049984 seconds ---
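
The per-configuration averages can be worked out from the samples above. The sketch below copies the measured times, rounded to four decimals, and prints the mean, the median, and the slowdown relative to the baseline; the configuration names are labels chosen here for illustration.

```python
from statistics import mean, median

# Measured seconds per indexed document, rounded to 4 decimals.
timings = {
    "baseline": [0.0225, 0.0367, 0.0034, 0.0187, 0.0461,
                 0.0179, 0.0469, 0.0014, 0.0295],
    "html2text": [0.0241, 0.0262, 0.0259, 0.0249, 0.0152,
                  0.1511, 0.0694, 0.0449, 0.1115, 0.1459],
    "html2text + small classifier": [0.3331, 0.3344, 0.3456, 0.3305, 0.3282,
                                     0.3540, 0.3636, 0.3673, 0.3091, 0.3613],
    "html2text + large classifier": [2.1698, 1.4836, 1.8600, 1.2436, 1.2531,
                                     1.1923, 1.1451, 1.0999, 1.0725, 1.1537],
    "html2text + both classifiers": [1.4396, 1.3734, 1.2694, 1.3640, 1.2332,
                                     1.2255, 1.2304, 1.2013, 1.1733, 1.3602,
                                     1.2988],
}

# (mean, median) per configuration.
summary = {name: (mean(values), median(values))
           for name, values in timings.items()}
baseline_mean = summary["baseline"][0]

for name, (avg, med) in summary.items():
    print(f"{name:<30} n={len(timings[name]):>2} "
          f"mean={avg:.4f}s median={med:.4f}s "
          f"slowdown=x{avg / baseline_mean:.1f}")
```

On these samples the large classifier on its own averages slightly more than both classifiers together, which suggests that run-to-run noise (the first few large-classifier samples are noticeably slower) is comparable to the cost of the small classifier. The large classifier is clearly the dominant cost.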

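A few notes

The project is also deployed on AWS. When I run it there, each index operation takes about 0.4 seconds (on my own machine it takes about 1.3 seconds).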

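Questions

How can I improve the performance of all this? Should I make the classifiers' training sets lighter but more precise? Why is there such a big difference between AWS and my computer? Do you need the code to understand the problem? I can add it if needed.

Any suggestions are welcome!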


1 answer
User
#1 · Posted 2024-06-26 13:48:16

Taking each question in turn:

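How can I improve the performance of all this? There are several approaches. You can select features for the texts and classes according to the model used for training (for example, bag of words), and you can also try LSA and LSI; see: Text classification performance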
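
As an illustration of feature selection for a bag-of-words model, the sketch below restricts the feature extractor to the k words that appear in the most training documents, so each classification checks k words instead of the whole training vocabulary. It is a sketch under assumptions: the helper names and the tiny training set are invented, and plain Python is used instead of TextBlob. TextBlob's classifiers do take a `feature_extractor` argument, but check the expected signature for your version before plugging in something like this.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def top_k_vocabulary(train_set, k):
    """Return the k words that occur in the most training documents."""
    counts = Counter()
    for text, _label in train_set:
        counts.update(set(tokenize(text)))  # count each word once per document
    return [word for word, _ in counts.most_common(k)]


def make_extractor(vocabulary):
    """Build a bag-of-words feature extractor limited to `vocabulary`."""
    def extract(document, train_set=None):
        words = set(tokenize(document))
        return {f"contains({w})": (w in words) for w in vocabulary}
    return extract


# Invented training data, for illustration only.
train_set = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting notes for monday", "ham"),
    ("buy tickets for the meeting", "ham"),
]

vocabulary = top_k_vocabulary(train_set, k=5)
extract = make_extractor(vocabulary)
features = extract("buy now")
print(sorted(vocabulary))
print(features)
```

With 33,000 training entries the full vocabulary is likely to be large, so capping it is one plausible way to cut per-document classification time; how much accuracy it costs would need to be measured.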

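Should I make the classifiers' training sets lighter but more precise? Depending on what you mean by precise, almost certainly yes. Some text representations are high-dimensional, so the curse of dimensionality can set in; feature selection helps with that. You can also use a sampling method to reduce the number of training tuples; see: http://searchbusinessanalytics.techtarget.com/definition/data-sampling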
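
One simple sampling method is stratified random sampling, which shrinks the training set while keeping the proportion of each class roughly unchanged. The following is a minimal sketch; the function name, the `fraction` and `seed` parameters, and the example data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(train_set, fraction, seed=0):
    """Keep about `fraction` of the (text, label) rows of every label."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for text, label in train_set:
        by_label[label].append((text, label))

    sample = []
    for rows in by_label.values():
        keep = max(1, round(len(rows) * fraction))  # at least one per label
        sample.extend(rng.sample(rows, keep))
    return sample


# Invented data: 30 "pos" rows and 10 "neg" rows.
train_set = [(f"positive text {i}", "pos") for i in range(30)]
train_set += [(f"negative text {i}", "neg") for i in range(10)]

reduced = stratified_sample(train_set, fraction=0.2)
label_counts = Counter(label for _text, label in reduced)
print(label_counts)
```

Retraining on the reduced set and comparing accuracy on held-out data would show whether the smaller model is still good enough.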

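Why is there such a big difference between AWS and my computer? Quite simply, AWS has more advanced algorithms and more powerful resources.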
