[Repost] Some third-party libraries

In praise of open source and sharing

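Relying on third-party open-source libraries is common in both work and study. "Standing on the shoulders of giants" is surely familiar to everyone, and it is a reminder of how great open source and sharing are.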

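Partly as a summary, and partly because good things should be shared, I maintain a page on GitHub, http://ift.tt/1QxQddX, listing the third-party code libraries I follow most closely, as shown below (continuously updated).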

Google open-source libraries

  • zh-google-styleguide - Style guides for Google open-source projects.
  • protobuf - Protocol Buffers - Google’s data interchange format.
  • gflags - Commandline flags module for C++.
  • glog - Logging library for C++.
  • gtest - Google C++ Testing Framework.
  • googlemock - Google C++ Mocking Framework.
  • leveldb - A fast and lightweight key/value database library by Google.
    cpy-leveldb - Python bindings for LevelDB using the LevelDB C API.
  • The Chromium Projects - The Chromium projects include Chromium and Chromium OS, the open-source projects behind the Google Chrome browser and Google Chrome OS, respectively.

C++ base libraries

  • toft - C++ Base Library for Linux server side development.
  • thirdparty - Put third-party libraries here for toft and foxy.
    chen3feng
  • folly - Folly is an open-source C++ library developed and used at Facebook.

Algorithms and data structures

  • darts-clone - A clone of the Darts (Double-ARray Trie System).
  • Darts - Double-ARray Trie System. Documentation translated into Chinese
  • sparsehash - An extremely memory-efficient hash_map implementation.
  • cityhash - The CityHash family of hash functions.
  • stringencoders - A collection of high performance c-string transformations, frequently 2x faster than standard implementations (if they exist at all).
  • Numpy - NumPy is the fundamental package for scientific computing with Python.

Natural language processing libraries

  • NLTK - NLTK – the Natural Language Toolkit – is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing.
    NLTK Book
  • jieba - "Jieba" Chinese word segmentation.
  • gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
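  • LTP - The Language Technology Platform (LTP) is a complete, open Chinese natural language processing system developed over ten years by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology.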
  • Stanford CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities.
  • openNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
  • SRILM - SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.
  • IRSTLM - The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs.
  • KenLM - KenLM estimates unpruned language models with modified Kneser-Ney smoothing.
  • Moses - Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair.
  • GIZA++ - GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model.
  • genius - Genius Chinese word segmentation, a segmentation component based on conditional random fields (CRF).
  • pinyin - A Go tool for converting Chinese characters to pinyin.
  • ReVerb - ReVerb is a program that automatically identifies and extracts binary relationships from English sentences. ReVerb is designed for Web-scale information extraction, where the target relations cannot be specified in advance and speed is important.
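  • Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources - The Stanford NLP Group's collection of NLP and computational linguistics resources, with links to and brief descriptions of tools, code, corpora, dictionaries, and courses. http://t.cn/zOfVAzs
  • webdict - The WEBDICT word-list project aims to build, through machine learning algorithms and manual annotation, a copyright-free Chinese lexicon containing a large number of internet terms, in order to improve natural language analysis of Chinese web text and open-source Chinese input methods. http://webdict.info/
  • sego - Chinese word segmentation in Go. The dictionary is implemented as a prefix tree (trie), and the segmenter uses frequency-based shortest-path search with dynamic programming. It supports normal and search-engine segmentation modes, user dictionaries, and part-of-speech tagging, and can run as a JSON RPC service.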
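
The approach described in the sego entry (a trie dictionary plus frequency-based shortest path with dynamic programming) can be sketched roughly as follows. This is a minimal illustration in Python, not sego's actual code or API: the names `TrieNode`, `build_trie`, `words_starting_at`, and `segment`, the cost function `-log(freq / total)`, and the `unknown_cost` fallback for out-of-dictionary characters are all assumptions made for the example.

```python
import math


class TrieNode:
    """One node of a prefix tree; freq > 0 marks the end of a dictionary word."""

    def __init__(self):
        self.children = {}
        self.freq = 0


def build_trie(word_freqs):
    """Build a trie from a {word: frequency} mapping (hypothetical input format)."""
    root = TrieNode()
    for word, freq in word_freqs.items():
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.freq = freq
    return root


def words_starting_at(root, text, start):
    """Yield (end, freq) for every dictionary word that begins at text[start]."""
    node = root
    for end in range(start, len(text)):
        node = node.children.get(text[end])
        if node is None:
            return
        if node.freq > 0:
            yield end + 1, node.freq


def segment(text, root, total_freq, unknown_cost=20.0):
    """Split text along the cheapest path; a word costs -log(freq / total_freq)."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: minimal cost of segmenting text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last word on that path
    best[0] = 0.0
    for i in range(n):
        # Fallback edge: an out-of-dictionary character becomes a one-character word.
        candidates = [(i + 1, unknown_cost)]
        for j, freq in words_starting_at(root, text, i):
            candidates.append((j, -math.log(freq / total_freq)))
        for j, cost in candidates:
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = i
    # Walk the back-pointers from the end of the text to recover the words.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

With a toy dictionary such as `{"中文": 50, "分词": 40, "中": 10, "文": 10, "分": 10, "词": 10}` and `total_freq` set to the sum of those frequencies, `segment("中文分词", ...)` picks the two-word path because frequent words have lower cost than chains of rare single characters.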

Information retrieval libraries

  • Lemur - The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software.
  • Lucene - The Apache Lucene project develops open-source search software.
  • Solr - Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world’s largest internet sites.
  • gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
  • wukong - The Wukong full-text search engine.
  • Scrapy - a fast high-level screen scraping and web crawling framework for Python.
  • distribute_crawler - A distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: a MongoDB cluster provides the underlying storage, Redis handles distribution, and Graphite displays crawler status.

Machine learning libraries

  • LASSO - LASSO is a parallel machine learning system that learns a regression model from large data. It works in either of two modes: IPM-mode and MPI-mode.
  • libsvm - A Library for Support Vector Machines.
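    A Plain Introduction to Support Vector Machines (Three Levels of Understanding SVM), by researcher July. The article presents understanding SVM in three levels:
    Level 1: Getting to know SVM (a rough idea of what SVM is will suffice);
    Level 2: Going deeper into SVM (working through its internal principles with the author, so that you can apply it with ease later);
    Level 3: Proving SVM (once you understand all the principles, you will feel the urge to pick up a pen and try to prove it).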
  • liblinear - A Library for Large Linear Classification.
  • RankLib - RankLib is a library of learning to rank algorithms.
  • svmlight - SVMlight is an implementation of Support Vector Machines (SVMs) in C.
  • plda - A parallel C++ implementation of fast Gibbs sampling of Latent Dirichlet Allocation
  • GibbsLDA++ - A C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling technique for parameter estimation and inference.
  • Yahoo_LDA - Yahoo!’s topic modelling framework using Latent Dirichlet Allocation
  • word2vec - Tool for computing continuous distributed representations of words.
    Parallelizing word2vec in Python
  • Maximum Entropy Modeling Toolkit for Python and C++ - This package provides a (Conditional) Maximum Entropy Modeling Toolkit for Python and C++.
  • maxent - A simple C++ library for maximum entropy classification.
  • easyME - This is a simple implementation of Maximum Entropy model. Algorithms implemented include: GIS, SCGIS, LBFGS, Gaussian smoothing and Exponential smoothing.
  • libLBFGS - This library is a C port of the implementation of Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method written by Jorge Nocedal.
  • OWL-QN - The Orthant-Wise Limited-memory Quasi-Newton algorithm (OWL-QN) is a numerical optimization procedure for finding the optimum of an objective of the form {smooth function} plus {L1-norm of the parameters}. It has been used for training log-linear models (such as logistic regression) with L1-regularization.
  • CRF++ - CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ is designed for generic purpose and will be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.
  • CRFsuite - A fast implementation of Conditional Random Fields (CRFs).
  • Wapiti - Wapiti is a very fast toolkit for segmenting and labeling sequences with discriminative models. It is based on maxent models, maximum entropy Markov models and linear-chain CRF and proposes various optimization and regularization methods to improve both the computational complexity and the prediction performance of standard models.
  • sofia-ml - Suite of Fast Incremental Algorithms for Machine Learning. Includes methods for learning classification and ranking models, using Pegasos SVM, SGD-SVM, ROMMA, Passive-Aggressive Perceptron, Perceptron with Margins, and Logistic Regression.
  • mahout - The Apache Mahout machine learning library’s goal is to build scalable machine learning libraries.
  • MLTK - MLTK – the Machine Learning Toolkit – is a suite of C++ open source modules of Machine Learning.
  • FP-growth - An implementation of the FP-growth algorithm in pure Python.
  • MLcomp - MLcomp is a free website for objectively comparing machine learning programs across various datasets for multiple problem domains.
  • PyBrain - PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.
  • parameter_server - A distributed machine learning framework.
  • vowpal_wabbit - John Langford’s original release of Vowpal Wabbit – a fast online learning algorithm.
  • Theano - Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
  • Caffe - Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe is released under the BSD 2-Clause license.

Data interchange protocols

  • protobuf - Protocol Buffers - Google’s data interchange format.
  • jsoncpp - JSON data format manipulation library.
  • tinyxml2 - TinyXML-2 is a simple, small, efficient, C++ XML parser that can be easily integrated into other programs.
  • thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.

Databases

  • MySQL++ - MySQL++ is a C++ wrapper for MySQL’s C API.
  • MongoDB - MongoDB (from “humongous”) is an open-source document database, and the leading NoSQL database. Written in C++.
  • memcached - Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
  • leveldb - A fast and lightweight key/value database library by Google.
  • SSDB - A fast NoSQL database server with zset data type, an alternative to Redis.
    SSDB is a high-performance key-value (key-string, key-zset, key-hashmap) NoSQL persistent storage server, using Google LevelDB as its storage engine. SSDB is stable, production-ready and is widely used by many Internet companies such as QIHU 360.
  • RocksDB - RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database but our current focus is on embedded workloads.
    RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory and write-once workloads, and to be flexible to allow for innovation.
  • fatcache - Memcache on SSD. Think of fatcache as a cache for your big data.
  • THUIRDB - THUIRDB is a base library implemented in C++ for high-performance persistent key-value storage and fast queries on a single machine. THUIRDB Paper

Network programming

  • thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
  • server1 - A C++ network server/client framework.
  • muduo-protorpc - Google Protobuf RPC based on Muduo.

Web development

  • Flask - Flask is a microframework for Python based on Werkzeug and Jinja2. It’s intended for getting started very quickly and was developed with best intentions in mind.
    Chinese docs
  • Bootstrap - Sleek, intuitive, and powerful front-end framework for faster and easier web development.
  • Django - Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design.

Distributed computing

  • Hadoop - The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
  • ZooKeeper - ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
  • Storm - Distributed and fault-tolerant realtime computation.
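    Storm wiki - Provides extensive, excellent documentation on Storm and its theoretical foundations, along with tutorials on obtaining Storm and setting up new projects. It also has practical documentation on many aspects of Storm, including using Storm in local mode, in cluster mode, and on Amazon.
    A thorough class tree for Storm is available on GitHub, detailing Storm's classes and interfaces.
    Processing real-time big data with Twitter Storm - An introduction to streaming big data. Summary: Storm is an open-source big-data processing system that, unlike other systems, is intended for distributed real-time processing and is language independent. Learn about Twitter Storm, its architecture, and the landscape of batch and stream processing solutions.
    Storm getting-started tutorial - From the official 量子恒道 blog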
    storm-starter - Learn to use Storm!
    StreamCpp - A small C++ wrapper for Storm. Some documentation can be found at http://ift.tt/1zTVmZI
    storm-kafka - storm-kafka provides a regular spout implementation and a TransactionalSpout implementation for Apache Kafka 0.7.
  • Spark - Lightning-Fast Cluster Computing.
  • Puppet - Puppet is IT automation software that helps system administrators manage infrastructure throughout its lifecycle, from provisioning and configuration to orchestration and reporting. Using Puppet, you can easily automate repetitive tasks, quickly deploy critical applications, and proactively manage change, scaling from 10s of servers to 1000s, on-premise or in the cloud.
  • Skynet - Skynet is a framework for distributed services in Go.
  • Kafka - A high-throughput distributed message queue system. Kafka paper: Building LinkedIn’s Real-time Activity Data Pipeline
    Kafka Clients
    librdkafka
    kafka-python
    Kafka papers and presentations
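  • METAQ - METAQ is a pure queue-model message middleware developed by Alibaba. The server is written in Java and can be deployed on a variety of hardware and software platforms. Clients are available for Java and C++. A single server can support more than 10,000 message queues, and the number of queues can be scaled horizontally almost without limit by adding servers. Every queue is persistent, unbounded in length (subject to disk space), and can be consumed from any position.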
  • mapreduce-lite - A C++ implementation of MapReduce without a distributed filesystem.
  • GraphChi - GraphChi[huahua] is a spin-off of the GraphLab[rador’s retriever] project.
    GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in a vertex-centric model similar to GraphLab's. GraphChi runs vertex-centric programs asynchronously (i.e., changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and changing the graph structure while computing.
    GraphChi ppt.
    GraphChi Paper.
    GraphChi Video.
    GraphChi’s C++ version - Disk-based large-scale graph computation. Big Data, small machine.
  • Giraph - Large-scale graph processing on Hadoop.
  • Celery — Distributed Task Queue - Celery is a simple, flexible and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system.
    It’s a task queue with focus on real-time processing, while also supporting task scheduling.
    This framework is close to the ultimate solution for asynchronous messaging architectures in Python.
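
The asynchronous, vertex-centric model mentioned in the GraphChi entry can be illustrated with a small in-memory sketch. This is not GraphChi's API and it does not process the graph from disk: the function name `connected_components` is hypothetical, and values are stored on vertices rather than on edges to keep the example short. What it does show is the key property quoted above: a value written by one update is immediately visible to the updates that run after it in the same sweep.

```python
def connected_components(num_vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = [[] for _ in range(num_vertices)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    label = list(range(num_vertices))  # each vertex starts in its own component

    def update(v):
        """Vertex-centric update: reads only v and its direct neighbors."""
        smallest = min([label[v]] + [label[u] for u in neighbors[v]])
        changed = smallest != label[v]
        # Written in place, so later updates in this same sweep already see it.
        label[v] = smallest
        return changed

    # Sweep over all vertices until no label changes any more.
    changed = True
    while changed:
        changed = False
        for v in range(num_vertices):
            if update(v):
                changed = True
    return label
```

Because updates are applied in place, a small label can travel several hops within a single sweep, which is why asynchronous execution often needs fewer passes than a synchronous scheme that only reads values from the previous sweep.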

Regular expressions

  • re2 - an efficient, principled regular expression library.

Build tools

  • SCons - SCons is an Open Source software construction tool—that is, a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache. In short, SCons is an easier, more reliable and faster way to build software.
  • CMake - the cross-platform, open-source build system.
  • blade - Blade is designed to be a modern build system.
    Mac OS X port of Typhoon Blade
  • bobo - Bobo is an easy-to-use build tool inspired by Blade.

Code Review

  • rietveld - Code Review, hosted on Google App Engine.
  • Review Board - Take the pain out of code review.

vim

  • spf13-vim - spf13-vim is a distribution of vim plugins and resources for Vim, GVim and MacVim. It is a completely cross platform distribution that stays true to the feel of vim while providing modern features like a plugin management system, autocomplete, tags and tons more.
  • Maximum Awesome - Config files for vim and tmux, lovingly tended by a small subculture of peace-loving hippies. Built for Mac OS X.
  • VimClojure - A filetype, syntax and indent plugin for Clojure.

Learning Go

  • glog - Leveled execution logs for Go.
  • groupcache - groupcache is a caching and cache-filling library, intended as a replacement for memcached in many cases.
  • go-slab - A slab allocator library in the Go Programming Language.
  • Go language resource collection

Learning Python

  • pycrumbs - Bits and Bytes of Python from the Internet.

Automated deployment engines

  • Docker - Docker is an open-source project to easily create lightweight, portable, self-sufficient containers from any application. The same container that a developer builds and tests on a laptop can run at scale, in production, on VMs, bare metal, OpenStack clusters, public clouds and more.
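    Docker is an open-source automated deployment engine that can package any application into a simple, portable, self-sufficient container, making it easy to deploy in various virtualized environments for debugging. It keeps applications private while shortening the debug-and-deploy cycle, making the test-package-deploy workflow easier and more convenient. Docker is still under intensive development, however; once it is complete, it should bring unprecedented convenience to developers.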

Other

  • Valgrind - Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. You can also use Valgrind to build new tools.
Author: u012176591. Published 2015/5/10 11:34:37. Original link


from JeffHugh's broadcasted articles on Inoreader http://ift.tt/1QxQiOF
via IFTTT
