airbnb的面试题

创业久了，更多关注大的系统的设计，比如帐号系统呀，图片管理系统呀，商城系统呀之类的，很多算法长期不写，慢慢在淡忘了。网上很多所谓『为业务代码所累』，其实也差不多是这个意思，是时候需要捡一捡。

最近自找没趣地去做了一下airbnb的面试题，时间一个小时，只一道题目，我没做出来，有点受打击。当然题目是有一定难度的，同时大段的英文描述题意，也理解了好一会。自己最开始想错了，等调整时，发现时间已经来不及了。事后还是把自己的答案整理出来，时不时提醒一下自己。

先看输入和输出样例，再来说明题意：


perPage = 5
input = [
  "1,28,300.6,San Francisco",
  "4,5,209.1,San Francisco",
  "20,7,203.4,Oakland",
  "6,8,202.9,San Francisco",
  "6,10,199.8,San Francisco",
  "1,16,190.5,San Francisco",
  "6,29,185.3,San Francisco",
  "7,20,180.0,Oakland",
  "6,21,162.2,San Francisco",
  "2,18,161.7,San Jose",
  "2,30,149.8,San Jose",
  "3,76,146.7,San Francisco",
  "2,14,141.8,San Jose"]

output = [
  "1,28,300.6,San Francisco",
  "4,5,209.1,San Francisco",
  "20,7,203.4,Oakland",
  "6,8,202.9,San Francisco",
  "7,20,180.0,Oakland",
  "", # this is a page separator
  "6,10,199.8,San Francisco",
  "1,16,190.5,San Francisco",
  "2,18,161.7,San Jose",
  "3,76,146.7,San Francisco",
  "6,29,185.3,San Francisco",
  "", # this is a page separator
  "6,21,162.2,San Francisco",
  "2,30,149.8,San Jose",
  "2,14,141.8,San Jose"]

就是有好多字符串，每一串逗号分隔分四个字段，其实只需要理解第一个字段就行，题目里叫host_id，就是说用户搜索出一系列结果时，会有很多来自同一个网站的内容，这时候在分页展现给用户的时候，需要按照一定的规则来返回。比如perPage=5，就表示一页展示5条，每页尽量展示不一样的网站（即不同host_id），当一页内无法避免出现重复时，就依次展示余下的字符串即可。同时越在前的，越优先展示。

题意描述起来有些麻烦，还是看输入输出样例比较来得快。对了，从output也可以看出，每页末尾要用一个空字符串来分隔。总体来说，这是一道比较有现实意义的、业务逻辑性的算法题目，不知道是不是airbnb内哪个工程师曾经遇到过这样的需求，从而出了这么个题目。

很显然算法一定要提前考虑好复杂度，以常规搜索结果来看，算法所要针对的数据量不会小，搞不好背后有几百MB的测试数据等着我咧。

以我能想到的最优方案来处理一下这题，未必是标答，甚至是不正确的。大体分三步完成吧，第一步，对所有字符串进行按host_id进行聚类，同时记录数据的次序值，复杂度跟host_id的种类数量有关，最差情况下是个n*log(n)的复杂度。第二步，将聚类的各列表按列表头数据的次序值进行排序，然后取出前perPage个列表的头数据，剔除掉为空的列表，接着再排序，再取，再剔除，如此循环直到列表数量不足perPage。这一步复杂度也跟host_id的数量有关，最差也是个n*log(n)的复杂度。第三步，不足perPage个的列表，进行多个有序序列的归并操作，最坏复杂度为(n/perPage)*perPage*log(perPage)，即n*log(perPage)。

实现的Python代码如下：


import heapq
def paginate(perPage, input):
    #step1 - group by the host_id, store the seq number
    group = {}
    for i in xrange(len(input)):
        host_id, _ = input[i].split(',', 1)
        group.setdefault(int(host_id), []).append([i, input[i]])

    #step2 - consume the groups
    res = []
    vals = group.values()
    while len(vals) >= perPage:
        vals.sort()
        for i in xrange(perPage):
            res.append(vals[i][0][1])
            del vals[i][0]
        res.append('')
        for i in xrange(perPage-1, 0, -1):
            if len(vals[i]) == 0:
                del vals[i]
    vals.sort()
    cnt = len(vals)
    for i in xrange(cnt):
        res.append(vals[i][0][1])
        del vals[i][0]
    for i in xrange(cnt-1, 0, -1):
        if len(vals[i]) == 0:
            del vals[i]

    #step3 - merge multi list with heap
    heapq.heapify(vals)
    while len(vals) > 0:
        res.append(vals[0][0][1])
        cnt += 1
        if cnt == perPage:
            cnt = 0
            res.append('')
        del vals[0][0]
        if len(vals[0]) == 0:
            heapq.heappop(vals)

    if res[-1] == '':
        res.pop()

    return res

#测试
res = paginate(perPage, input)
print(res == output)

特别注意一些点：

1. 第二步中，在循环消耗列表头时，采用的是读取加del的方式，如果使用c/c++语言来写，考虑性能问题，比较适合用链表。


while len(arr) > 0:
    x = arr[0]
    del arr[0]

2. 第三步中，拉链归并，如果长期不写，或者选择c/c++来写，写起来还是很费劲的。这里用c++实现一遍。


#include <vector>
#include <iostream>
#include <queue>
using namespace std;

struct Node
{
    int val;
    struct Node *next;
};

Node *create(const vector<int>& l) {
    Node *head = NULL;
    Node *cur = head;
    for (auto i : l) {
        Node *tmp = new Node{i, NULL};
        if (NULL == head) {
            cur = head = tmp;
        } else {
            cur->next = tmp;
            cur = tmp;
        }
    }
    return head;
}
void output(Node *t) {
    while (NULL != t) {
        cout << t->val << "->";
        t = t->next;
    }
    cout << "NULL\n";
}

struct Cmp
{
    inline bool operator()(Node* x, Node *y) const {
        return x->val > y->val;
    }
};

Node *merge(const vector<Node*>& lists) {
    priority_queue<Node*, vector<Node*>, Cmp> pq;
    for (auto ptr : lists) {
        pq.push(ptr);
    }
    Node *head = NULL;
    Node *cur = NULL;
    while (!pq.empty()) {
        if (NULL == head) {
            cur = head = pq.top();
        } else {
            cur->next = pq.top();
            cur = cur->next;
        }
        pq.pop();
        if (NULL != cur->next) {
            pq.push(cur->next);
            cur->next = NULL;
        }
    }
    return head;
}

int main()
{
    Node *a = create({1,4,6,8,12});
    Node *b = create({5,7,13,14,15});
    Node *c = create({2,3,9,10,11});
    output(a);
    output(b);
    output(c);
    Node *res = merge({a,b,c});
    output(res);

    return 0;
}

发表于 2018年06月12日 17:01 评论：0 阅读：3016

上一篇：json编码为啥要转义斜杠?
下一篇：python实现字典树