Michael McCandless: Lucene's new BlockPostingsFormat, thanks to Google Summer of Code

Another summer wraps up, another successful Google Summer of Codeproject finished in Apache Lucene!

We had two projects this summer; here I'll describe Han Jiang's project (I was his mentor). The second project, separating StorableFieldand IndexableField, is also very exciting and wrapping up now.

The goal of Han's project was to create a compelling replacement int-block postings format, improving on the aging variable-int postings format that Lucene has used forever. The project was definitely a success! This morning I merged over the feature branch we had been using all summer, committing to both Lucene's trunk (future 5.x) and Lucene 4.x branch (to be 4.0 GA). This means Lucene 4.0 GA will have the new BlockPostingsFormatavailable.

The new postings format shows incredible search performance gains versus the current default postings format. I tested with luceneutilon the full May 2012 English Wikipedia index (33.3 million documents, 15 GB index):

Task	QPS base	StdDev base	QPS packed	StdDev packed	% change
AndHighLow	353.09	9.16	317.06	2.53	13%-7%
Fuzzy2	53.08	1.02	49.58	3.41	14%-1%
Respell	41.23	0.56	39.67	1.37	8%-0%
PKLookup	169.41	1.94	166.69	1.57	3%-0%
Fuzzy1	33.07	0.82	35.24	0.93	1%-12%
IntNRQ	3.82	0.01	4.09	0.28	0%-14%
MedSpanNear	1.46	0.03	1.60	0.03	5%-13%
HighSpanNear	0.55	0.01	0.62	0.01	9%-17%
LowSpanNear	2.65	0.06	3.08	0.03	12%-19%
HighSloppyPhrase	0.63	0.02	0.74	0.02	9%-23%
OrHighHigh	3.04	0.17	3.55	0.35	0%-35%
HighPhrase	0.65	0.00	0.76	0.04	9%-24%
OrHighMed	6.62	0.39	7.75	0.80	0%-37%
LowSloppyPhrase	2.34	0.04	2.78	0.05	14%-23%
Wildcard	17.20	0.20	20.42	0.76	12%-24%
OrHighLow	7.29	0.44	8.67	0.92	0%-39%
AndHighMed	23.11	0.61	27.55	0.34	14%-23%
LowPhrase	7.44	0.10	9.01	0.44	13%-28%
Prefix3	24.65	0.35	29.94	1.23	14%-28%
MedPhrase	3.47	0.04	4.24	0.18	15%-28%
MedSloppyPhrase	2.03	0.04	2.51	0.04	19%-28%
LowTerm	145.81	0.36	182.14	2.02	23%-26%
MedTerm	51.64	0.25	65.57	1.14	24%-29%
HighTerm	8.49	0.03	10.94	0.23	25%-32%
AndHighHigh	6.52	0.19	8.70	0.19	26%-40%

The gains are awesome! After this change has baked for a while I expect we'll make it the default postings format, maybe in time for 4.1.

This effort has been a long time coming: ever since the creation of flexible indexing, allowing for full pluggability of all files in the index, we have wanted to improve on Lucene's postings format. Early int-block prototypes showed great performance.

The new format stores postings data in fixed blocks of 128 ints. Each block is encoded using a simple packed ints format. The .doc file interleaves docID delta and term frequency blocks, and also includes skip data, while the .pos file holds position deltas. Offsets and payloads, if present, are stored in the .pay file. On the full English Wikipedia index used in the above test this resulted in a nice reduction of disk usage: 13.7 GB (new) vs 14.7 GB (current). That's ~7% smaller.

The format only uses packed int blocks for high-frequency terms: low frequency terms (that appear in less than 128 documents), and the final partial-block of a high frequency term, are stored in the same variable-int format as today's default. Skip points are stored at the start of each block (except the first block).

Han built and tested a number of fun ideas, including skipping within blocks (with partial block decoding), using PForDeltainstead of simple packed ints, wrapping with PulsingPostingsFormat, all sorts of different ways to decode packed ints, different values for blockSizeand skipMultiplier, encoding all-single-values very compactly, etc. We spent lots of electricity running all sorts of benchmarks. Some ideas worked and some didn't, and things did not go according to the original plan! Such is the healthy unpredictable, zig-zag, iterative nature of open-source development/exploration, yet in the end we find ourselves with a great new postings format.

Thank you Han Jiang for such an important improvement to Lucene, and thank you Google for sponsoring this effort through Google Summer of Code.

Michael McCandless: Lucene's new BlockPostingsFormat, thanks to Google Summer of Code

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112