在 Lucene 中获取词频
有没有一种快速简便的方法从 Lucene 索引中获取词频,而无需通过 TermVectorFrequencies 类来完成,因为对于大型集合来说这需要大量时间?
Is there a fast and easy way of getting term frequencies from a Lucene index, without doing it through the TermVectorFrequencies class, since that takes an awful lot of time for large collections?
我的意思是,有没有像 TermEnum 这样的东西,它不仅有文档频率,还有词频?
What I mean is, is there something like TermEnum which has not just the document frequency but term frequency as well?
更新:使用 TermDocs 太慢了.
UPDATE: Using TermDocs is way too slow.
推荐答案
使用TermDocs 获取给定文档的词频.与文档频率一样,您可以使用感兴趣的术语从 IndexReader 获取术语文档.
您不会找到比 TermDocs 更快的方法而不失一些通用性.TermDocs 直接从索引段中的.frq"文件中读取,其中每个术语频率按文档顺序列出.
You won't find a faster method than TermDocs without losing some generality. TermDocs reads directly from the ".frq" file in an index segment, where each term frequency is listed in document order.
如果这太慢",请确保您已优化索引以将多个段合并为一个段.按顺序遍历文档(跳过没问题,但不能高效地在文档列表中来回跳转).
If that's "too slow", make sure that you've optimized your index to merge multiple segments into a single segment. Iterate over the documents in order (skips are alright, but you can't jump back and forth in the document list efficiently).
您的下一步可能是进行额外处理,以创建一个更专业的文件结构,省略 SkipData.就我个人而言,我会寻找更好的算法来实现我的目标,或者提供更好的硬件——大量内存,或者保存 RAMDirectory,或者提供给操作系统以在其自己的文件缓存系统上使用.
Your next step might be additional processing to create an even more specialized file structure that leaves out the SkipData. Personally I would look for a better algorithm to achieve my objective, or provide better hardware—lots of memory, either to hold a RAMDirectory, or to give to the OS for use on its own file-caching system.
相关文章