13 Commits

Author SHA1 Message Date
Gea-Suan Lin
4a1e2b9c5e Fix test function naming and the actual test. 2024-02-19 10:42:23 +08:00
Gea-Suan Lin
14e54f2393 Fix test function naming. 2024-02-19 10:41:38 +08:00
Gea-Suan Lin
3dcd171227 Rename. 2024-02-16 21:02:40 +08:00
Gea-Suan Lin
1de46569e8 Add a simple test case for tokenizer. 2024-02-16 20:59:15 +08:00
Gea-Suan Lin
57c153a6c3 Add more tests for bigram. 2024-02-16 20:55:23 +08:00
Gea-Suan Lin
55ad14e790 Add test cases for bigram. 2024-02-16 20:54:28 +08:00
Gea-Suan Lin
8c3985c386 Use testify. 2024-02-16 20:49:00 +08:00
Gea-Suan Lin
6247ed36cd Add a simple test case. 2024-02-16 20:43:26 +08:00
Gea-Suan Lin
18fbfa7292 Rename tokenize to tokenizer. 2024-02-09 11:47:13 +08:00
Gea-Suan Lin
ce79d2b245 Implement tokenize(). 2024-02-09 11:46:19 +08:00
Gea-Suan Lin
a5b6a3c7a1 Rewrite splitter.
Merge all English characters into one token (like "apple", not "ap" "pp" "pl" "le"),
but keep splitting Chinese words.
2024-02-09 11:25:26 +08:00
Gea-Suan Lin
9e455bb15a Implement gram-related functions. 2024-01-31 09:43:04 +08:00
Gea-Suan Lin
86bf78c762 Read artifact. 2024-01-31 09:42:42 +08:00