Summary-of-November-2015

Last month I was working at Zhihu as an intern. I spent 9 hours per day programming.

I am on the anti-spam tech team; our duty is to prevent and delete spam content on zhihu.com. Currently, our main strategy is data-based behavior discrimination: we collect behavior data from spammers, identify their characteristics, then write programs to filter them out. We may develop a more intelligent anti-spam system later, but for now no advanced techniques are involved. Nevertheless, I learned a lot from my work.

The first thing is how to work. Before, I learned and coded all by myself; working in a company is quite different. The project I work on is much larger: the anti-spam system is one part of the Zhihu website, a much larger system consisting of many components. I need to collaborate with other programmers frequently to get my job done. Most of the time I have several tasks at once, so I need to assess their priorities constantly and adjust my schedule dynamically.

The second thing is getting familiar with the Unix shell and Git. These are two basic tools for programmers. I had spent some time learning them before, but many features seemed useless to me when I first encountered them. My job pushed me to explore much more of the power of the Unix shell and Git; they are fantastic tools!

I also got to know many open-source projects used at Zhihu, such as Tornado, SQLAlchemy, and Buildout. I didn't know there were so many wonderful open-source projects in the Python world; I hope I can give something back to the open-source community in the future.

I'm a lazy person: although I like my current job, I don't want to do anything repeatedly. So when I run into similar tasks, I try to automate them. I was assigned to write tests for several projects, so I wrote a script to generate code templates for tests. Product managers frequently asked me for user data, so I wrote a program that receives commands from a web page and fetches the data automatically. These programs reduced my workload and made me love coding even more.

Summary-of-October-2015: Finding an Internship

This month I found an internship and worked at it for a week. The average time I spent on programming was 6 hours per day.

During the National Day holiday I kept reading Core Java in the daytime and did LeetCode problems at night, finishing the 80 easiest ones. After the holiday I started sending out résumés. I applied to 15 companies, and 6 replied: Sohu, Mingda Qiwei, Dolphin Browser, Zuoyehezi, Zhihu, and Kingsoft Cloud. I'm grateful to these companies for giving me a chance. After the interviews, the latter 5 gave me offers. Sohu didn't, because Sohu's interview process hadn't finished yet!

Here's what happened. Sohu was the first to reply, asking me to solve a problem online and put it on GitHub. I finished it and sent it back the next day. A few days later I got an email: "We've received your solution and will reply after reviewing it." The next week, another email: "Are you sure this is the final version?" Me: "Yes." Then another week passed: "There are two small problems in your code; please explain them." (By then 15 days had passed since I submitted the code.) Me: "I'm sorry, but since your interview cycle is so long, I've already accepted another company's offer and can't intern at your company for now." I actually think Sohu's interview approach is good; maybe they just weren't in a hurry to hire interns.

In the end I accepted Zhihu's offer, mainly because Zhihu is close to my school, the work environment is comfortable, the atmosphere is good, working hours are flexible, there's no overtime, and the HR staff are professional. In short, everything is great. After a week of seeing it firsthand, I believe I made the right choice.

A few lessons from the internship hunt. First, read the job requirements carefully and assess yourself accurately. Among the résumés I sent out, the ones submitted just to try my luck got no replies, while the ones where I felt genuinely qualified mostly did. Second, value the fundamentals, mainly algorithms and computer systems. Programming languages and software engineering can be learned on the job, but nobody will have the patience to teach you algorithms and systems knowledge (which is why interviews use those topics to screen out unqualified candidates).

Now, a plan for the next five months:

1. Do my internship well. Not just finish the work, but also think and summarize: Why am I doing this? Which part of the whole project does my work belong to? Is there a better way to accomplish it?

2. Finish Computer Networking. How can one do web development without a deep understanding of computer networks? I must finish this book.

3. Finish the parts of Core Java Volume II directly related to web development, and build an Android app. Although I've used five or six languages, Python is the only one I can claim familiarity with, which is clearly not enough, so I'll continue learning Java. Glancing through the table of contents of Core Java Volume II, only about half the chapters are directly related to web development. Building a mobile app will probably also require some graphics programming, which I'll pick up along the way.

4. Finish Programming Pearls and the remaining LeetCode problems. This is both study and recreation. After you submit an answer on LeetCode, it shows your solution's runtime percentile among all submissions, and it's quite satisfying to see your answer beat ninety-some percent of the others. That said, the timing isn't very accurate at the moment; it seems heavily affected by network speed and the number of concurrent submissions. If you're confident in your solution but get a long runtime, try resubmitting at another time.

Honestly, this plan may well not get finished, because my graduation project is also coming up (I graduate the year after next, so it's starting early). I'll complete as much of it as I can. Go for it!

Summary-of-September-2015

This month I learned the Java programming language. The average time I spent on programming was 6 hours per day.

I've been reading Core Java by Horstmann and Cornell, and I've almost finished Volume I (I skipped chapters 7 through 10, which are mainly about GUI development with Swing and a bit outdated). Core Java is a great introduction to Java: plain and complete. The thing I like most is the notes throughout the book, which I found very helpful while learning. Since the book is clear enough, I'd rather talk about Java itself.

Previously I used Python most of the time, so Java felt really verbose at first. I didn't like the access modifiers in Java; in Python everything is public, with no need for modifiers. Then I realized the reason: Java follows a different design philosophy from Python. Python tends to trust programmers, giving them more freedom and believing they'll do the right thing. Java tends not to trust programmers, imposing more constraints to keep them from doing the wrong thing. Which is better? It depends. If programmers collaborate well, Python is quicker; otherwise, Java is safer.

I accept modifiers because they exist for decent reasons. However, there is one feature of Java that really makes me uncomfortable: you can't write free-standing functions in Java! Classes are the only top-level construct, so you must put everything inside classes. I don't like this design. Yes, OOP is a good idea for many problems, but it is not a silver bullet. In many cases we just want plain procedures, that is, standalone functions. Forcing these functions into classes not only causes extra overhead during execution but also adds more hierarchy to programs. In fact, Java keeps primitive types for performance reasons, so you have to deal with non-OOP constructs anyway. I see no reason to forbid standalone functions.

Java's other drawbacks mostly arise from compatibility. Since Java chose backward compatibility, it has to evolve in awkward ways. For instance, Java uses for (variable : collection) instead of an "in" keyword because some legacy code already used "in" as a variable or method name. The inconsistency between "==" and equals also causes much trouble, but Java can't introduce a new keyword to simplify comparisons. Compatibility issues are common in programming languages; Python suffers from them too, in a different way.

As a programming language, Java is ordinary. It has no striking feature (e.g. indentation in Python) or disastrous design (e.g. the global namespace in JavaScript) that makes it distinctive. It's not the fastest language, nor the simplest, nor the most advanced. Yet it has become the most popular one. I'm not going to analyze how Java got popular. The point is that because Java is so popular, many people have devoted themselves to it and contributed to the Java world; Java is no longer just a language but a whole platform, with huge libraries, rich frameworks, and a cross-platform execution environment. Many programmers on Zhihu despise Java for being cumbersome, and then despise Java programmers for using a cumbersome language. That's immature and unnecessary. There are many imperfect things in the computing world; live with them, and try to improve them.

You can't master a language without using it seriously. Next, I plan to do some real work with Java, maybe a mobile app.

Summary-of-August-2015

This month I studied some algorithms and software development knowledge. The average time I spent on programming was 5 hours per day.

For algorithms, I had previously read the first five chapters of Algorithms by Dasgupta. That book grew out of lecture notes, so its treatment is rather terse and omits some common algorithms. As a result, I discovered I didn't even know insertion sort. So I switched to a more basic book: Algorithms by Sedgewick. It really is basic, sometimes to the point of being long-winded, so I read it selectively. I wrote two articles about algorithms, introducing the basic sorting and searching algorithms.

I also studied some software development knowledge, reading Code Complete and The Pragmatic Programmer: about a third of the former and half of the latter so far. Both books are quite old but still sell well, which suggests the engineering practices they describe apply to programming in general. While reading, I reviewed code I had written before, especially Xiaobai Navigation's, and did some refactoring. Let me briefly discuss two common principles.

DRY

DRY stands for Don't Repeat Yourself. The principle is famous and was first proposed in The Pragmatic Programmer, where the authors define it as: every piece of knowledge must have a single, unambiguous, authoritative representation within a system. While reviewing Xiaobai Navigation's code, I found an obvious DRY violation: I stored the search-engine information in both the frontend and the backend. If the data changed (say, Baidu switching from http to https), I had to modify two places, which was inconvenient. So I changed it to store the search-engine information only in the backend, with the frontend fetching it via AJAX.
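A minimal sketch of that fix: the data lives in exactly one place on the backend and is served as JSON for the frontend to fetch. The `SEARCH_ENGINES` name, the dict contents, and the bare-WSGI endpoint below are illustrative assumptions, not Xiaobai Navigation's actual code.

```python
import json

# The single authoritative copy of the search-engine data.
SEARCH_ENGINES = {
    "baidu": "https://www.baidu.com/s?wd=",
    "google": "https://www.google.com/search?q=",
}

def search_engines_view(environ, start_response):
    """A tiny WSGI endpoint. The frontend fetches this via AJAX, so a
    change like http -> https is made in exactly one place."""
    body = json.dumps(SEARCH_ENGINES).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]
```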

Besides duplicated information, duplicated code should also be minimized. Concretely: if a function repeatedly computes a value that doesn't change, store it in a local variable; if several repeated blocks of code implement one piece of logic, there is usually a more concise way (such as a loop) to express it; and if you repeatedly call one function (or several similar functions) to accomplish a single task, you can usually refactor your functions to do it directly.
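The first two cases can be shown with a toy example (the function names and the pricing scenario are hypothetical):

```python
def total_price_repetitive(prices, tax_rate):
    # Repeats the same tax expression for each hard-coded slot.
    return (prices[0] * (1 + tax_rate)
            + prices[1] * (1 + tax_rate)
            + prices[2] * (1 + tax_rate))

def total_price_dry(prices, tax_rate):
    # The invariant factor is computed once, and a loop replaces
    # the copy-pasted per-item expressions.
    factor = 1 + tax_rate
    return sum(p * factor for p in prices)
```

The DRY version also handles any number of items, which the copy-pasted version cannot.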

Modularity

Modularity is less famous than DRY because it doesn't sound as cool. In fact it is applied more widely: in theory almost every project is modularized, since otherwise it would be one impossibly complex function. But some modularizations are better than others.

I think good modularity has two dimensions. Vertically, the system is divided top-down into layers, isolated from one another. Horizontally, each layer is divided into modules that are independent of one another. Ordinary projects don't have many vertical layers, perhaps only one or two; most of the work is horizontal modularization, which is subdivided repeatedly until you reach the functions that implement concrete features.

The key to modularity is that at every layer, each module should be as independent of the others as possible. This has two aspects. First, each module's functionality is single and non-duplicated, and every use of that functionality goes through the same module. Second, each module neither depends on other modules nor shares data with them, and it exposes the smallest possible interface.

Take Xiaobai Navigation as an example. Vertically it can be divided into two layers. The lower layer is WSGI, responsible for wrapping HTTP requests and responses; the upper layer is Django, responsible for handling dynamic HTTP requests. The lower layer consists of Nginx and uWSGI: Nginx receives user requests directly, serves static requests by returning files, and forwards dynamic requests to uWSGI, which passes them on to Django. In that sense Nginx is lower-level than uWSGI, but both sit below our code, so we can treat them as one layer. The upper layer consists of Django and our own code. Django is split into several modules: models store data, views process data, templates present data, and urls map requests to handler functions. This division reduces coupling between unrelated code and lets us focus on one aspect of web development at a time, lowering the complexity of development. These big modules can be modularized further, especially the most complex part, the views.

Fully modular design is usually hard to achieve, but modularizing as much as possible should be our goal. Good design comes from proper abstraction of the real problem: the deeper and more complete your understanding of the problem, the more likely you are to produce a well-modularized design.

Next month's goal is to learn Java. Er, I originally planned to learn C++, but I found that C++ is rarely used in web development while Java is used a lot, so I decided to learn Java. This is actually an obvious decision. The reason I picked C++ over Java in my earlier plan was mainly that many people on Zhihu praise C++ and bash Java... I almost fell into that trap.

Algorithms (2): Search – Part I

Search is a wide-ranging problem; many problems can be viewed as a certain type of search. For instance, sorting can be viewed as searching, among all permutations of the input, for the one in non-decreasing order. The naive sort we discussed before is based on this idea.
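As an illustration (a sketch, not the earlier post's actual listing), that search-over-permutations idea looks like:

```python
from itertools import permutations

def permutation_sort(xs):
    """Naive sort as search: scan all permutations of the input until
    one in non-decreasing order is found. O(n * n!) time in the worst
    case, so this is for illustration only."""
    for perm in permutations(xs):
        if all(perm[i] <= perm[i + 1] for i in range(len(perm) - 1)):
            return list(perm)
```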

So let's give an exact definition of our problem first:
Search
Input: a number of objects, each with a unique key and optionally some data; and a key to look for.
Output: the object whose key exactly matches the given key.

Linear search

A straightforward algorithm is linear search, which has a time complexity of O(n):
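A sketch in Python (representing each object as a `(key, data)` pair is my assumption, not the original listing):

```python
def linear_search(items, key):
    """Scan the sequence front to back: O(n) comparisons in the worst
    case. Each item is a (key, data) pair; returns the matching pair,
    or None if the key is absent."""
    for k, data in items:
        if k == key:
            return (k, data)
    return None
```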

Binary search

O(n) is a desirable time complexity for many problems, and this algorithm works well for small data sets. In practice, however, we often need to search huge data sets, where O(n) becomes unacceptable. If we have a sorted array, searching gets easier:
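A sketch of binary search over a sorted array (a reconstruction, not the original listing): each step halves the candidate range, giving O(log n) comparisons.

```python
def binary_search(sorted_keys, key):
    """Binary search on a sorted array. Returns the index of the key,
    or None if it is absent."""
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # midpoint of the current range
        if sorted_keys[mid] == key:
            return mid
        elif sorted_keys[mid] < key:
            lo = mid + 1              # discard the lower half
        else:
            hi = mid - 1              # discard the upper half
    return None
```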

Binary search tree

Now we have an O(log n) search algorithm. It works well if the data set rarely changes. However (I hate this word), that's not true in practical applications, and inserting an item into a large sorted array is rather inefficient. Arrays appear unsuitable for our problem, so we can organize the data as a binary tree instead:
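A sketch of such a tree (a reconstruction; the `Node` and `BinarySearchTree` names follow the later section's mention of them, and storing optional data alongside each key is my assumption):

```python
class Node:
    def __init__(self, key, data=None):
        self.key = key
        self.data = data
        self.left = None
        self.right = None

class BinarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, key, data=None):
        # Walk down from the root, going left for smaller keys and
        # right for larger ones, and attach a new leaf.
        if self.root is None:
            self.root = Node(key, data)
            return
        node = self.root
        while True:
            if key < node.key:
                if node.left is None:
                    node.left = Node(key, data)
                    return
                node = node.left
            elif key > node.key:
                if node.right is None:
                    node.right = Node(key, data)
                    return
                node = node.right
            else:
                node.data = data  # duplicate key: update in place
                return

    def search(self, key):
        # O(h) where h is the tree height: O(log n) when balanced.
        node = self.root
        while node is not None:
            if key == node.key:
                return node
            node = node.left if key < node.key else node.right
        return None
```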

AVL-Tree

Another problem arises: if the input is almost sorted, the time complexity of the binary search tree degrades to O(n). We need a self-balancing tree, that is, a tree that reshapes itself when it becomes unbalanced. There are four fundamental unbalanced cases (see the picture on Wikipedia). If we carry out a bottom-up examination after each insertion or deletion, fixing each imbalance along the way, we get a strictly balanced binary search tree. This tree is called an AVL tree, named after its inventors. Here is the code; note that we altered Node and BinarySearchTree first:

The AVL tree has average and worst-case time complexity of O(log n). That's pretty cool! Can we do better?

Hash table

If we want even better (O(1)) search performance, we have to go back to arrays. An array is the fastest way to index a large data set: accessing an array element translates directly into an address computation in machine code. Since arrays use consecutive natural numbers to index their elements, if we can build a one-to-one map between the keys of our data and a certain range of numbers, we can build an O(1) search algorithm!

The problem now becomes finding a function that maps keys to a certain range of numbers; we call such functions hash functions. Unfortunately, one-to-one maps (perfect hashes) are extremely difficult to find for large data sets. Nevertheless, we can usually build hash functions with very few collisions (different keys mapping to the same hash value). When collisions do happen, we can chain a linked list at that slot, or we can use open addressing:
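A sketch of open addressing with linear probing and doubling-based resizing (a reconstruction, not the original listing; the half-full load-factor threshold is an illustrative choice):

```python
class HashTable:
    """Open addressing with linear probing: on collision, step forward
    until an empty slot or the matching key is found. The table resizes
    when it is half full, so probing always terminates."""
    _EMPTY = object()  # sentinel marking unused slots

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.slots = [self._EMPTY] * capacity

    def _probe(self, key):
        # Index of the slot holding this key, or of the first empty slot.
        i = hash(key) % self.capacity
        while self.slots[i] is not self._EMPTY and self.slots[i][0] != key:
            i = (i + 1) % self.capacity
        return i

    def put(self, key, value):
        i = self._probe(key)
        if self.slots[i] is self._EMPTY:
            self.size += 1
        self.slots[i] = (key, value)
        if self.size * 2 > self.capacity:
            self._resize()  # O(n): rehash everything into a bigger array

    def get(self, key):
        i = self._probe(key)
        return None if self.slots[i] is self._EMPTY else self.slots[i][1]

    def _resize(self):
        old = [s for s in self.slots if s is not self._EMPTY]
        self.capacity *= 2
        self.size = 0
        self.slots = [self._EMPTY] * self.capacity
        for key, value in old:
            self.put(key, value)
```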

Since the hash functions we choose are not perfect, the time complexity of this HashTable is not perfect either. If an insertion or deletion triggers a resize operation, the cost is O(n)! We can ease this pain with a more sophisticated scheme called incremental resizing. Next time we'll discuss other advanced search structures, such as red-black trees and B+ trees.
