Word-separation algorithm doesn't consider languages that are not space-separated

As far as I can tell, words are currently separated by spaces and English punctuation such as commas. This makes it almost impossible to search for pages in languages like Chinese or Japanese, where words are written together without spaces.

Take this page for example: https://movie.douban.com/subject/26897009/ In the .subject div, there is some information about a TV show. If you search for "制片国家" (country of production), the page shows up in WorldBrain's search. However, if you search for "制片" (production) or "国家" (country), the page does not show up in the results, since I believe WorldBrain considers "制片国家" to be a single word.
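A minimal sketch of the behaviour described above, and of one common workaround. This is not WorldBrain's actual code: the function names (naive_tokens, bigram_tokens, matches), the separator list, and the rough CJK character range are all assumptions made for illustration. The workaround shown is overlapping character bigrams, a technique some search engines use for Chinese and Japanese text; a dictionary-based word segmenter would be another option.

```python
import re

# Rough, illustrative range: hiragana/katakana plus common CJK ideographs.
CJK = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff"
# Whitespace plus a few ASCII and full-width punctuation marks (assumed list).
SEPARATORS = r"\s,.;:!?,。:;、"


def naive_tokens(text):
    """Split only on whitespace and punctuation, as the post suspects."""
    return [t for t in re.split(f"[{SEPARATORS}]+", text) if t]


def bigram_tokens(text):
    """Keep non-CJK runs whole, but break CJK runs into overlapping bigrams."""
    tokens = []
    for run in re.findall(f"[{CJK}]+|[^{CJK}{SEPARATORS}]+", text):
        if re.match(f"[{CJK}]", run) and len(run) > 1:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.append(run)
    return tokens


def matches(query, page_text, tokenizer):
    """A page matches if every query token appears in the page's token set."""
    index = set(tokenizer(page_text))
    return all(token in index for token in tokenizer(query))


page = "制片国家: 中国大陆"  # made-up sample text, not copied from the page
print(naive_tokens(page))                     # ['制片国家', '中国大陆']
print(matches("国家", page, naive_tokens))     # False
print(bigram_tokens("制片国家"))               # ['制片', '片国', '国家']
print(matches("国家", page, bigram_tokens))    # True
```

With bigrams, both "制片" and "国家" are found, at the cost of a larger index and occasional false matches across word boundaries (for example "片国"). Single-character queries would still need separate handling in this sketch.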

All the similar services I have tried behave the same way, apparently because they were built with Latin-script (or similar) languages in mind. I really hope this can be solved somehow.

3 replies

Hello @Onlyqmqy2

Yeah, we still run into some problems with non-Latin characters, especially Chinese, Japanese, etc.

We are currently working hard on integrating the new indexing software (which will bring a huge performance/speed boost), and afterwards we will tackle these issues.

Sorry for the problems. It will hopefully be over soon.


Why doesn't WorldBrain support Chinese?

I have tried searching for Chinese words, but nothing appeared :(

Hello Kazekara, 

thanks for dropping by and taking the time to report the issue you have with Memex.
I think this is something fixable.

As you can see here, I posted it as an issue on GitHub, where you can follow the progress: https://github.com/WorldBrain/Memex/issues/277

I'll post an update here as well once it is fixed.

I was intending to subscribe, but then I found that Chinese search doesn't work...

Then I found this web page.
Please fix it ASAP.