Last month I worked at Zhihu as an intern, spending about 9 hours a day programming.

I'm in the anti-spam tech group; our duty is to prevent and delete spam content on Zhihu. Currently, our main strategy is data-based behavior discrimination. That is, we collect behavior data of spammers, find their characteristics, then write programs to filter spammers out. We may develop a more intelligent anti-spam system later, but for now no advanced techniques are involved. Nevertheless, I learned a lot from my work.

The first thing is how to work. Before, I learned and coded all by myself; working in a company is quite different. The project I work on is much larger: the anti-spam system is one part of the Zhihu website, which is a much larger system consisting of many parts. I need to collaborate with other programmers frequently to get my job done. Most of the time I have several tasks at once, so I have to assess their priorities constantly and adjust my schedule dynamically.

The second thing is getting familiar with the Unix shell and Git. These are two basic tools for programmers. I had spent some time learning them before, but some of their features seemed useless to me back then. My job pushed me to explore much more of the power of the Unix shell and Git; they are fantastic tools!

I also got to know many open-source projects used at Zhihu, such as Tornado, SQLAlchemy, Buildout and so on. I didn't know there were so many wonderful open-source projects in the Python world; I hope I can give something back to the open-source community in the future.

I'm a lazy person: although I like my current job, I don't want to do anything repetitive, so when I run into similar tasks, I try to automate them. I was assigned to write tests for several projects, so I wrote a script that generates code templates for tests. Product managers frequently asked me for user data, so I wrote a program that receives commands from a web page and fetches the data automatically. These programs reduced my workload and made me love coding even more.



During the National Day holiday I kept reading Core Java during the day and did LeetCode problems at night, finishing the 80 easiest ones. When the holiday ended I started sending out resumes. I applied to 15 companies in total, and 6 of them replied: Sohu, Mingda Qiwei, Dolphin Browser, Zuoye Hezi, Zhihu, and Kingsoft Cloud. Thanks to these companies for giving me a chance. After the interviews, the latter 5 companies made offers. Sohu didn't, because Sohu's interview process hasn't finished yet!





1. Do my internship work well. Not just finish the tasks, but think and summarize more: Why am I doing this? Which part of the whole project does my work belong to? Is there a better way to get it done?

2. Finish Computer Networking. How can you do web development without a solid understanding of computer networking? This book must be finished.

3. Finish the parts of Core Java Volume II that are directly related to web development, and build an Android app. Although I've used five or six languages, Python is the only one I can really claim to know, which is clearly not enough, so I plan to keep learning Java. Skimming the table of contents of Core Java Volume II, only about half the chapters are directly related to web development. Building a mobile app will probably also require some graphics programming, which I'll pick up as I go.

4. Finish Programming Pearls and the remaining LeetCode problems. This is both study and recreation. After you submit an answer, LeetCode shows where its running time ranks among all submissions (as a percentile), and it feels pretty good to see your answer beat ninety-some percent of the others. The timing isn't very accurate at the moment, though; it seems strongly affected by network speed and how many people are submitting at the same time. If you're confident in your answer but got a long running time, try submitting again at another time.



This month I learned the Java programming language, spending an average of 6 hours a day on programming.

I've been reading Core Java by Horstmann and Cornell, and I've almost finished Volume I (I skipped chapters 7 to 10, which mainly cover GUI development with Swing and are a bit outdated). Core Java is a great introduction to Java: plain and complete. The thing I like most is the notes scattered throughout the book, which I found very helpful while learning. Since the book is clear enough on its own, I'd like to talk about Java itself.

Previously I used Python most of the time, so Java felt really verbose at first. I didn't like the access modifiers: in Python everything is public, no modifiers needed. Then I realized the reason is that Java follows a different design philosophy from Python. Python tends to trust programmers, so it gives them more freedom and believes they'll do the right thing. Java tends not to trust programmers, so it imposes more constraints to keep them from doing the wrong thing. Which is better? It depends. If programmers collaborate well, Python is quicker; otherwise, Java is safer.

I accept modifiers because they exist for decent reasons. However, one feature of Java really makes me uncomfortable: you can't write free-standing functions. Classes are the only top-level construct in Java; you must put everything inside classes. I don't like this design. Yes, OOP is a good idea for many problems, but OOP is not a silver bullet. In many cases we just want plain procedures, that is, functions that belong to no class. Wrapping such functions in classes not only adds overhead at runtime but also adds hierarchy to programs. Besides, Java already keeps primitive types around for performance, so you have to deal with non-OOP things anyway. I see no reason to forbid free-standing functions.

Other drawbacks of Java mostly arise from compatibility. Since Java chose backward compatibility, it has to evolve in an awkward way. For instance, Java uses for (variable : collection) instead of an "in" keyword because some legacy code already used "in" as a variable or method name. The inconsistency between "==" and equals also causes much trouble, but Java can't introduce a new keyword to simplify comparisons. Compatibility issues are common in programming languages; Python suffers from them too, in a different way.

As a programming language, Java is ordinary. It has no striking feature (e.g. indentation in Python) or disastrous design (e.g. the global namespace in JavaScript) that makes it distinct. It's not the fastest language, nor the simplest, nor the most advanced. Yet it became the most popular one. I'm not going to analyse how Java got popular. The point is that because Java is so popular, many people have devoted themselves to the Java world and contributed to it; Java is now not just a language but a whole platform, with huge libraries, rich frameworks and a cross-platform execution environment. Many programmers on Zhihu despise Java for being cumbersome, and then despise Java programmers for using a cumbersome language. That's shallow and unnecessary. There are many imperfect things in the computing world; live with them, and try to improve them.

You can’t master a language without using it seriously. Next I plan to do some real work with Java, maybe a mobile app.



For algorithms, I had previously read the first five chapters of Algorithms by Dasgupta. That book grew out of lecture notes, so its treatment is fairly terse and it omits some common algorithms; I found I didn't even know insertion sort. So I switched to a more elementary book, Algorithms by Sedgewick. It really is elementary, sometimes to the point of being long-winded, so I skipped around while reading. I wrote two articles on algorithms, covering the basic sorting and searching algorithms.



DRY is short for Don't Repeat Yourself. The principle is well known; it was first stated in The Pragmatic Programmer, whose authors define DRY as: every piece of knowledge must have a single, unambiguous, authoritative representation within a system. While reviewing the code of 小白导航 I found an obvious DRY violation: I stored the search-engine information in both the frontend and the backend. If that data changes (say, Baidu switching from http to https), I have to edit two places, which is inconvenient. So I changed it to store the search-engine information only on the backend, and the frontend fetches it via AJAX.
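The fix can be sketched in a few lines. This is a minimal, hypothetical illustration of the single-source-of-truth idea, not the site's actual code: the engine names and URLs below are made up, and the handler stands in for whatever AJAX endpoint the backend exposes.

```python
import json

# The single authoritative copy of the search-engine data lives on the
# backend only. (Names and URLs here are illustrative placeholders.)
SEARCH_ENGINES = {
    "baidu": "https://www.baidu.com/s?wd=",
    "google": "https://www.google.com/search?q=",
}

def search_engines_handler():
    """A hypothetical AJAX endpoint: the frontend fetches this JSON
    instead of keeping its own duplicate copy of the data."""
    return json.dumps(SEARCH_ENGINES)
```

Now a change like http-to-https happens in exactly one place, and the frontend picks it up automatically on the next fetch.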









Algorithms(2): Search–Part I

Search is a wide-ranging problem; many problems can be viewed as some form of search. For instance, sorting can be viewed as searching all permutations of the input for the one in non-decreasing order. The naive sort we discussed before is based on this idea.
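To make the sort-as-search idea concrete, here is a sketch of such a naive sort (a plausible reconstruction, not necessarily the exact code from the earlier article). It literally enumerates permutations until it finds a non-decreasing one, so it runs in O(n * n!) time:

```python
from itertools import permutations

def naive_sort(items):
    """Sort by searching all permutations of the input for the one
    that is in non-decreasing order. Hopelessly slow, but it shows
    that sorting is a search problem."""
    for perm in permutations(items):
        # Check whether this permutation is non-decreasing.
        if all(perm[i] <= perm[i + 1] for i in range(len(perm) - 1)):
            return list(perm)

print(naive_sort([3, 1, 2]))  # → [1, 2, 3]
```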

So let's give an exact definition of our problem first:
Input: a collection of objects, each with a unique key and some optional data, plus a target key.
Output: the object whose key matches the target key exactly.

Linear search

A straightforward algorithm is linear search, which has a time complexity of O(n):
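The code block for this section appears to be missing, so here is a minimal sketch (objects are represented as (key, data) pairs, an assumption of this reconstruction):

```python
def linear_search(items, key):
    """Scan the items one by one; O(n) comparisons in the worst case.
    Each item is a (key, data) pair; returns the data or None."""
    for k, data in items:
        if k == key:
            return data
    return None

records = [(5, "five"), (2, "two"), (9, "nine")]
print(linear_search(records, 9))  # → nine
```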

Binary search

O(n) is an acceptable time complexity for many problems, and this algorithm works well for small data sets. In practice, however, we often need to search huge data sets, where O(n) becomes unacceptable. If we have a sorted array, searching gets easier:
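Again the original code block seems to have been lost; a standard binary search sketch over a sorted list of keys:

```python
def binary_search(sorted_keys, key):
    """Return the index of key in a sorted list, or -1 if absent.
    Each step halves the search range, so the cost is O(log n)."""
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_keys[mid] == key:
            return mid
        elif sorted_keys[mid] < key:
            lo = mid + 1   # key, if present, is in the right half
        else:
            hi = mid - 1   # key, if present, is in the left half
    return -1

print(binary_search([1, 3, 5, 7, 9], 7))  # → 3
```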

Binary search tree

Now we have an O(log n) search algorithm. It works well if the data set to be searched rarely changes. However (I hate this word), that's not true in practical applications, and inserting an item into a large sorted array is rather inefficient. Arrays seem unsuitable for our problem, so we can organize the data as a binary tree instead:
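The tree code is missing from this copy of the article, so here is a plausible sketch of the Node and BinarySearchTree classes it refers to:

```python
class Node:
    def __init__(self, key, data=None):
        self.key = key
        self.data = data
        self.left = None
        self.right = None

class BinarySearchTree:
    """Keys smaller than a node go left, larger go right."""

    def __init__(self):
        self.root = None

    def insert(self, key, data=None):
        """Walk down from the root and attach a new leaf."""
        if self.root is None:
            self.root = Node(key, data)
            return
        node = self.root
        while True:
            if key < node.key:
                if node.left is None:
                    node.left = Node(key, data)
                    return
                node = node.left
            elif key > node.key:
                if node.right is None:
                    node.right = Node(key, data)
                    return
                node = node.right
            else:
                node.data = data  # duplicate key: overwrite the data
                return

    def search(self, key):
        """O(log n) on a balanced tree, O(n) in the worst case."""
        node = self.root
        while node is not None:
            if key == node.key:
                return node.data
            node = node.left if key < node.key else node.right
        return None

tree = BinarySearchTree()
for k in [5, 2, 9]:
    tree.insert(k, str(k))
print(tree.search(9))  # → 9
```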


Another problem arises: if the input is almost sorted, the time complexity of the binary search tree degrades to O(n). We need a self-balancing tree, that is, a tree that reshapes itself when it becomes unbalanced. There are four fundamental unbalanced cases (see this picture from Wikipedia). If we carry out a bottom-up examination after each insertion or deletion and fix each imbalance along the way, we get a strictly balanced binary search tree. This tree is called an AVL tree, named after its inventors. Here is the code; note that we altered Node and BinarySearchTree first:

The AVL tree has average and worst-case time complexity of O(log n). That's pretty cool! Can we do better?

Hash table

If we want even better, O(1), search performance, we have to go back to arrays. An array is the fastest way to index a large data set: accessing an array element translates directly into an address computation in machine code. Since arrays use consecutive natural numbers to index their elements, if we can build a one-to-one map between the keys of our data and a certain range of numbers, we can build an O(1) search algorithm!

Now the problem becomes finding a function that maps keys to a certain range of numbers; we call such functions hash functions. Unfortunately, one-to-one maps (perfect hash functions) are extremely difficult to find for large data sets. Nevertheless, we can usually build hash functions with very few collisions (different keys getting the same hash value). So what do we do when a collision happens? We can build a linked list at that slot, or we can use open addressing:
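The HashTable code is missing here, so the following is a minimal open-addressing sketch using linear probing (a reconstruction of the idea, not necessarily the article's original implementation). Deleted slots are marked with a sentinel so that probe chains are not broken:

```python
class HashTable:
    """Open addressing with linear probing: on a collision, try the
    next slot, wrapping around the end of the array."""
    _EMPTY, _DELETED = object(), object()

    def __init__(self, capacity=8):
        self._slots = [self._EMPTY] * capacity

    def _probe(self, key):
        """Yield slot indices starting from hash(key) % capacity."""
        start = hash(key) % len(self._slots)
        for i in range(len(self._slots)):
            yield (start + i) % len(self._slots)

    def put(self, key, data):
        # Resize (rehashing everything, O(n)) when over half full.
        used = sum(1 for s in self._slots if s is not self._EMPTY)
        if used * 2 >= len(self._slots):
            self._resize()
        first_free = None
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is self._DELETED:
                if first_free is None:
                    first_free = i  # remember, but keep probing for key
            elif slot is self._EMPTY:
                self._slots[first_free if first_free is not None else i] = (key, data)
                return
            elif slot[0] == key:
                self._slots[i] = (key, data)  # overwrite existing key
                return

    def get(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is self._EMPTY:
                return None  # an empty slot ends the chain: key absent
            if slot is not self._DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is self._EMPTY:
                return
            if slot is not self._DELETED and slot[0] == key:
                self._slots[i] = self._DELETED
                return

    def _resize(self):
        """Reinsert every live entry into a table twice as large."""
        old = self._slots
        self._slots = [self._EMPTY] * (len(old) * 2)
        for slot in old:
            if slot is not self._EMPTY and slot is not self._DELETED:
                self.put(slot[0], slot[1])
```

Keeping the table under half full keeps probe chains short, which is what makes the average-case cost of get and put O(1).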

Since the hash functions we choose are not perfect, the time complexity of this HashTable is not perfect either: if an insertion or deletion triggers a resize operation, the cost is O(n)! We can ease that pain with a more sophisticated technique called incremental resizing. Next time we'll discuss more advanced search structures such as red-black trees and B+ trees.
