4 releases
Version | Date |
---|---|
0.1.3 | Jun 20, 2022 |
0.1.2 | Jun 20, 2022 |
0.1.1 | Jun 20, 2022 |
0.1.0 | Jun 20, 2022 |
tantivy-pinyin
As the name suggests, this is a pinyin tokenizer for tantivy.
Usage
Add the dependency to your Cargo.toml:
tantivy_pinyin = "0.1.0"
The following example indexes one document with the pinyin tokenizer and then queries it:
use tantivy::collector::{Count, TopDocs};
use tantivy::query::TermQuery;
use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy};
use tantivy::tokenizer::{PreTokenizedString, Token, Tokenizer};
use tempfile::TempDir;
use tantivy_pinyin::PinyinTokenizer;
// Run the pinyin tokenizer over `text` and collect every token it emits.
fn pre_tokenize_text(text: &str) -> Vec<Token> {
    let mut token_stream = PinyinTokenizer.token_stream(text);
    let mut tokens = vec![];
    while token_stream.advance() {
        tokens.push(token_stream.token().clone());
    }
    tokens
}
pub fn main() -> tantivy::Result<()> {
    let index_path = TempDir::new()?;

    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("title", TEXT | STORED);
    schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    let index = Index::create_in_dir(&index_path, schema.clone())?;
    let mut index_writer = index.writer(50_000_000)?;

    // Look up the field handles needed to build the document and the query.
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();

    let title_text = "大多数知识,不需要我们记住";
    let body_text = "大多数知识,只需要认知即可";

    // Content of our first document.
    // A `PreTokenizedString` holds the original text together with its tokens.
    let title_tok = PreTokenizedString {
        text: String::from(title_text),
        tokens: pre_tokenize_text(title_text),
    };
    println!(
        "Original text: \"{}\" and tokens: {:?}",
        title_tok.text, title_tok.tokens
    );
    let body_tok = PreTokenizedString {
        text: String::from(body_text),
        tokens: pre_tokenize_text(body_text),
    };

    // Now let's create a document from the two `PreTokenizedString` values ...
    let pinyin_doc = doc!(title => title_tok, body => body_tok);
    // ... add it to the IndexWriter ...
    index_writer.add_document(pinyin_doc)?;
    // ... and commit the changes.
    index_writer.commit()?;

    // Now it is time to query the index.
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;
    let searcher = reader.searcher();

    // We look for documents whose body contains the token "xin" with a TermQuery.
    // With `PreTokenizedString` the supplied tokens are indexed as is, bypassing the
    // field's own tokenizer, so the query term must match a pinyin token exactly.
    // Note: neither sample sentence appears to contain a character read as "xin"
    // or "liu", so expect "Found 0 documents" unless you change the term to one of
    // the tokens printed above.
    let query = TermQuery::new(
        // Alternative: query the title field instead.
        // Term::from_field_text(title, "liu"),
        Term::from_field_text(body, "xin"),
        IndexRecordOption::Basic,
    );

    let (top_docs, count) = searcher.search(&query, &(TopDocs::with_limit(2), Count))?;
    println!("Found {} documents", count);

    // Print the results. Only the original text is kept in the document store;
    // the tokens are not stored alongside it.
    for (_score, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("Document: {}", schema.to_json(&retrieved_doc));
    }
    Ok(())
}
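To make the pre-tokenization step in the example more concrete, here is a minimal, dependency-free sketch of what a pinyin token stream might look like and how an exact term lookup behaves against it. This is not the implementation used by tantivy_pinyin: `SketchToken`, `toy_pinyin`, `pinyin_tokens`, and `contains_term` are hypothetical names, the lookup table covers only a few characters, and the real tokenizer may emit different forms (for example, with tone marks).

```rust
/// Hypothetical stand-in for a tokenizer token: byte offsets into the
/// original text, an ordinal position, and the emitted term.
#[derive(Debug, Clone, PartialEq)]
struct SketchToken {
    offset_from: usize,
    offset_to: usize,
    position: usize,
    text: String,
}

/// Toy lookup table (assumption): toneless pinyin for a handful of characters.
fn toy_pinyin(c: char) -> Option<&'static str> {
    match c {
        '大' => Some("da"),
        '多' => Some("duo"),
        '数' => Some("shu"),
        '知' => Some("zhi"),
        '识' => Some("shi"),
        '需' => Some("xu"),
        '要' => Some("yao"),
        _ => None,
    }
}

/// Emit one token per character that has an entry in the table;
/// punctuation and unknown characters are skipped.
fn pinyin_tokens(text: &str) -> Vec<SketchToken> {
    let mut tokens = Vec::new();
    for (offset, c) in text.char_indices() {
        if let Some(py) = toy_pinyin(c) {
            let position = tokens.len();
            tokens.push(SketchToken {
                offset_from: offset,
                offset_to: offset + c.len_utf8(),
                position,
                text: py.to_string(),
            });
        }
    }
    tokens
}

/// Exact-match check, analogous in spirit to a term query on a single field.
fn contains_term(tokens: &[SketchToken], term: &str) -> bool {
    tokens.iter().any(|t| t.text == term)
}

fn main() {
    let tokens = pinyin_tokens("大多数知识,不需要");
    for t in &tokens {
        println!("{:?}", t);
    }
    println!("contains \"xu\": {}", contains_term(&tokens, "xu"));
    println!("contains \"xin\": {}", contains_term(&tokens, "xin"));
}
```

The point of the sketch is that a term lookup only succeeds when the query string equals an emitted token, which is why it helps to print the tokens before choosing a query term.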
Features
stop_words: Chinese stop words
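As a rough illustration of what stop-word filtering means for a pinyin tokenizer, the sketch below drops a few common function characters before they would be turned into tokens. The stop list and the `toy_stop_words` and `remove_stop_words` helpers are hypothetical, and the sketch makes no claim about how the `stop_words` feature is actually implemented in the crate.

```rust
use std::collections::HashSet;

/// Hypothetical stop list: a few very common Chinese function characters.
fn toy_stop_words() -> HashSet<char> {
    ['的', '了', '是', '在', '和'].iter().copied().collect()
}

/// Keep only the characters that are not in the stop list.
fn remove_stop_words(text: &str, stop_words: &HashSet<char>) -> String {
    text.chars().filter(|c| !stop_words.contains(c)).collect()
}

fn main() {
    let stop_words = toy_stop_words();
    let filtered = remove_stop_words("我的知识是有限的", &stop_words);
    println!("{}", filtered);
}
```

Filtering before tokenization keeps very frequent syllables such as "de" out of the index, which would otherwise match almost every document.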
Test
cargo test
Postscript
This is a small project. If it has helped you, please consider giving it a star to encourage the author.