
A Collection of Essential Resources for Working with Big Data (Part 1)

2018-3-30 13:00 | Source: Internet


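Relational Database Management Systems (RDBMS)

MySQL - the world's most popular open source database.
PostgreSQL - the world's most advanced open source database.
Oracle Database - object-relational database management system.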

Frameworks

Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
Tigon - high-throughput real-time stream processing framework.

Distributed Programming

AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
AMPLab SIMR - run Spark on Hadoop MapReduce v1.
Apache APEX - a unified, enterprise platform for big data stream and batch processing.
Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
Apache Flink - high-performance runtime, and automatic program optimization.
Apache Gora - framework for in-memory data model and persistence.
Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Apache Pig - high level language to express data analysis programs for Hadoop.
Apache REEF - retainable evaluator execution framework to simplify and unify the lower layers of big data systems.
Apache S4 - framework for stream processing, implementation of S4.
Apache Spark - framework for in-memory cluster computing.
Apache Spark Streaming - framework for stream processing, part of Spark.
Apache Storm - framework for stream processing by Twitter, also on YARN.
Apache Samza - stream processing framework, based on Kafka and YARN.
Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
Cascalog - data processing and querying library.
Cheetah - high-performance, custom data warehouse on top of MapReduce.
Concurrent Cascading - framework for data management/analytics on Hadoop.
Damballa Parkour - MapReduce library for Clojure.
Datasalt Pangool - alternative MapReduce paradigm.
DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
Facebook Corona - Hadoop enhancement which removes single point of failure.
Facebook Peregrine - MapReduce framework.
Facebook Scuba - distributed in-memory datastore.
Google Dataflow - create data pipelines to help ingest, transform and analyze data.
Google MapReduce - MapReduce framework.
Google MillWheel - fault tolerant stream processing framework.
JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
Metamarkets Druid - framework for real-time analysis of large datasets.
Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
Nokia Disco - MapReduce framework developed by Nokia.
Onyx - Distributed computation for the cloud.
Pinterest Pinlater - asynchronous job execution system.
Pydoop - Python MapReduce and HDFS API for Hadoop.
Rackerlabs Blueflood - multi-tenant distributed metric processing system.
Stratosphere - general purpose cluster computing framework.
Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
Tuktu - Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
Twitter Heron - a realtime, distributed, fault-tolerant stream processing engine from Twitter, replacing Storm.
Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
Twitter TSAR - TimeSeries AggregatoR by Twitter.
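
Many of the entries above build on the MapReduce programming model described in the Apache MapReduce entry. As a rough illustration only, the following single-process Python sketch walks through the map, shuffle and reduce phases for a word count. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and do not correspond to the API of Hadoop or any other framework listed here; a real cluster would run the phases in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input document.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework would
    # do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values seen for one key into a single result.
    return key, sum(values)


def word_count(documents):
    # Hypothetical driver: runs the three phases sequentially in one process.
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many workers, which is the property the frameworks in this section exploit.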

Distributed Filesystem

Apache HDFS - a way to store large files across multiple machines.
BeeGFS - formerly FhGFS, parallel distributed file system.
Ceph Filesystem - software storage platform.
Disco DDFS - distributed filesystem.
Facebook Haystack - object storage system.
Google Colossus - distributed filesystem (GFS2).
Google GFS - distributed filesystem.
Google Megastore - scalable, highly available storage.
GridGain - GGFS, Hadoop compliant in-memory file system.
Lustre file system - high-performance distributed filesystem.
Quantcast File System QFS - open-source distributed file system.
Red Hat GlusterFS - scale-out network-attached storage file system.
Seaweed-FS - simple and highly scalable distributed file system.
Alluxio - reliable file sharing at memory speed across cluster frameworks.
Tahoe-LAFS - decentralized cloud storage system.

Document Data Model

Actian Versant - commercial object-oriented database management system.
Crate Data - is an open source massively scalable data store. It requires zero administration.
Facebook Apollo - Facebook’s Paxos-like NoSQL database.
jumboDB - document oriented datastore over Hadoop.
LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
MongoDB - Document-oriented database system.
RavenDB - A transactional, open-source Document Database.
RethinkDB - document database that supports queries like table joins and group by.

Key-Map Data Model

Note: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as "columns").

Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where all column values for a given key are stored next to each other, "row by row", these systems store all column values next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.

The former group is referred to as "key map data model" here. The line between these and the Key-value Data Model stores is fairly blurry.

The latter, being more about the storage format than about the data model, is listed under Columnar Databases.

You can read more about this distinction on Prof. Daniel Abadi's blog: Distinguishing two major types of Column Stores.
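
To make the two meanings of "columnar" in the note above more concrete, here is a minimal Python sketch. The first part models the key-map data model as nested maps (row key, then column family, then column); the second part contrasts row-oriented and column-oriented storage layouts for the same records. All names and sample values (`table`, `put`, `get`, `"profile"`, `"stats"`, `row_store`, `column_store`) are made up for illustration and are not the API of any system listed here.

```python
from collections import defaultdict

# Key-map data model: row key -> column family -> column -> value.
# Hypothetical in-memory stand-in for a BigTable-style table.
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    table[row_key][family][column] = value


def get(row_key, family, column):
    return table[row_key][family][column]


put("user:1", "profile", "name", "Ada")
put("user:1", "stats", "logins", 3)
put("user:2", "profile", "name", "Lin")
put("user:2", "stats", "logins", 7)

# Storage layout: the same kind of records, laid out two ways.
records = [
    {"id": 1, "name": "Ada", "logins": 3},
    {"id": 2, "name": "Lin", "logins": 7},
]

# Row-oriented: all column values of one record are stored together.
row_store = [(r["id"], r["name"], r["logins"]) for r in records]

# Column-oriented: all values of one column are stored together.
column_store = {
    "id": [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "logins": [r["logins"] for r in records],
}


def total_logins_row_store():
    # Has to touch every row to read a single column.
    return sum(row[2] for row in row_store)


def total_logins_column_store():
    # Reads one contiguous column and nothing else.
    return sum(column_store["logins"])


def fetch_record_column_store(index):
    # Has to visit every column to rebuild one full record.
    return {name: values[index] for name, values in column_store.items()}
```

The nested-map half is about what the data looks like to the application, while the row-store versus column-store half is about where the bytes sit, which is why the two groups are listed in separate sections.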

Apache Accumulo - distributed key/value store, built on Hadoop.
Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
Apache HBase - column-oriented distributed datastore, inspired by BigTable.
Facebook HydraBase - evolution of HBase made by Facebook.
Google BigTable - column-oriented distributed datastore.
Google Cloud Datastore - is a fully managed, schemaless database for storing non-relational data over BigTable.
Hypertable - column-oriented distributed datastore, inspired by BigTable.
InfiniDB - accessed through a MySQL interface; uses massively parallel processing to parallelize queries.
Tephra - Transactions for HBase.
Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.

Key-value Data Model

Aerospike - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."
Amazon DynamoDB - distributed key/value store, implementation of the Dynamo paper.
Edis - is a protocol-compatible Server replacement for Redis.
ElephantDB - Distributed database specialized in exporting data from Hadoop.
EventStore - distributed time series database.
GridDB - suitable for sensor data stored in a timeseries.
LinkedIn Krati - a simple persistent data store with very low latency and high throughput.
LinkedIn Voldemort - distributed key/value storage system.
Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
Redis - in memory key value datastore.
Riak - a decentralized datastore.
Storehaus - library to work with asynchronous key value stores, by Twitter.
Tarantool - an efficient NoSQL database and a Lua application server.
TiKV - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.

Graph Data Model

Apache Giraph - implementation of Pregel, based on Hadoop.
Apache Spark Bagel - implementation of Pregel, part of Spark.
ArangoDB - multi model distributed database.
DGraph - a scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to serve real-time user queries over terabytes of structured data.
Facebook TAO - the distributed data store that is widely used at Facebook to store and serve the social graph.
GCHQ Gaffer - Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
Google Cayley - open-source graph database.
Google Pregel - graph processing framework.
GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
GraphX - resilient Distributed Graph System on Spark.
Gremlin - graph traversal Language.
Infovore - RDF-centric Map/Reduce framework.
Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
MapGraph - Massively Parallel Graph processing on GPUs.
Neo4j - graph database written entirely in Java.
OrientDB - document and graph database.
Phoebus - framework for large scale graph processing.
Titan - distributed graph database, built over Cassandra.
Twitter FlockDB - distributed graph database.

Columnar Databases

Note: please read the note in the Key-Map Data Model section.

Columnar Storage - an explanation of what columnar storage is and when you might want it.
Actian Vector - column-oriented analytic database.
C-Store - column oriented DBMS.
MonetDB - column store database.
Parquet - columnar storage format for Hadoop.
Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
Vertica - is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
Google BigQuery - Google's cloud offering backed by their pioneering work on Dremel.
Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.

NewSQL Databases

Actian Ingres - commercially supported, open-source SQL relational database management system.
Amazon Redshift - data warehouse service, based on PostgreSQL.
BayesDB - statistic oriented SQL database.
CitusDB - scales out PostgreSQL through sharding and replication.
Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
FoundationDB - distributed database, inspired by F1.
Google F1 - distributed SQL database built on Spanner.
Google Spanner - globally distributed semi-relational database.
H-Store - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
HandlerSocket - NoSQL plugin for MySQL/MariaDB.
InfiniSQL - infinitely scalable RDBMS.
MemSQL - in-memory SQL database with optimized columnar storage on flash.
NuoDB - SQL/ACID compliant distributed database.
Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
Pivotal GemFire XD - low-latency, in-memory, distributed SQL data store. Provides a SQL interface to in-memory table data, persistable in HDFS.
SAP HANA - is an in-memory, column-oriented, relational database management system.
SenseiDB - distributed, realtime, semi-structured database.
Sky - database used for flexible, high performance analysis of behavioral data.
SymmetricDS - open source software for both file and database synchronization.
Map-D - GPU in-memory database, big data analysis and visualization platform
TiDB - TiDB is a distributed SQL database. Inspired by the design of Google F1.
VoltDB - claims to be fastest in-memory database

Time-Series Databases

Cube - uses MongoDB to store time series data.
Axibase Time Series Database - distributed time series database on top of HBase. Includes built-in Rule Engine, data forecasting and visualization.
Heroic - is a scalable time series database based on Cassandra and Elasticsearch.
InfluxDB - distributed time series database.
Kairosdb - similar to OpenTSDB but allows for Cassandra.
OpenTSDB - distributed time series database on top of HBase.
Prometheus - a time series database and service monitoring system
Newts - a time series database based on Apache Cassandra

SQL-like processing

Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
Apache Drill - framework for interactive analysis, inspired by Dremel.
Apache HCatalog - table and storage management layer for Hadoop.
Apache Hive - SQL-like data warehouse system for Hadoop.
Apache Optiq - framework that allows efficient translation of queries involving heterogeneous and federated data.
Apache Phoenix - SQL skin over HBase.
Cloudera Impala - framework for interactive analysis, Inspired by Dremel.
Concurrent Lingual - SQL-like query language for Cascading.
Datasalt Splout SQL - full SQL query engine for big datasets.
Facebook PrestoDB - distributed SQL query engine.
Google BigQuery - framework for interactive analysis, implementation of Dremel.
Pivotal HAWQ - SQL-like data warehouse system for Hadoop.
RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
Spark Catalyst - is a Query Optimization Framework for Spark and Shark.
SparkSQL - Manipulating Structured Data Using Spark.
Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
Stinger - interactive query for Hive.
Tajo - distributed data warehouse system on Hadoop.
Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.

Data Ingestion

Amazon Kinesis - real-time processing of streaming data at massive scale.
Apache Chukwa - data collection system.
Apache Flume - service to manage large amount of log data.
Apache Kafka - distributed publish-subscribe messaging system.
Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
Cloudera Morphlines - framework that helps with ETL into Solr, HBase and HDFS.
Facebook Scribe - streamed log data aggregator.
Fluentd - tool to collect events and logs.
Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
Heka - open source stream processing software system.
HIHO - framework for connecting disparate data sources with Hadoop.
Kestrel - distributed message queue system.
LinkedIn Databus - stream of change capture events for a database.
LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
LinkedIn White Elephant - log aggregator and dashboard.
Logstash - a tool for managing events and logs.
Netflix Suro - log aggregator like Storm and Samza, based on Chukwa.
Pinterest Secor - a service implementing Kafka log persistence.
LinkedIn Gobblin - LinkedIn's universal data ingestion framework.
Skizze - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
StreamSets Data Collector - continuous big data ingest infrastructure with a simple-to-use IDE.