Top linked arXiv articles

View breakdown by

Markets are efficient if and only if P = NP ( Abstract ↓
I prove that if markets are weak-form efficient, meaning current prices fully reflect all information available in past prices, then P = NP, meaning every computational problem whose solution can be verified in polynomial time can also be solved in polynomial time. I also prove the converse by showing how we can "program" the market to solve NP-complete problems. Since P probably does not equal NP, markets are probably not efficient. Specifically, markets become increasingly inefficient as the time series lengthens or becomes more frequent. An illustration by way of partitioning the excess returns to momentum strategies based on data availability confirms this prediction.
A Neural Algorithm of Artistic Style ( Abstract ↓
In fine art, especially painting, humans have mastered the skill to create unique visual experiences through composing a complex interplay between the content and style of an image. Thus far the algorithmic basis of this process is unknown and there exists no artificial system with similar capabilities. However, in other key areas of visual perception such as object and face recognition near-human performance was recently demonstrated by a class of biologically inspired vision models called Deep Neural Networks. Here we introduce an artificial system based on a Deep Neural Network that creates artistic images of high perceptual quality. The system uses neural representations to separate and recombine content and style of arbitrary images, providing a neural algorithm for the creation of artistic images. Moreover, in light of the striking similarities between performance-optimised artificial neural networks and biological vision, our work offers a path forward to an algorithmic understanding of how humans create and perceive artistic imagery.
Are Black Hole Starships Possible ( Abstract ↓
We investigate whether it is physically possible to build starships or power sources using the Hawking radiation of an artificial black hole as a power source. The proposal seems to be at the edge of possibility, but quantum gravity effects could change the picture.
Dissolving the Fermi Paradox ( Abstract ↓
The Fermi paradox is the conflict between an expectation of a high {\em ex ante} probability of intelligent life elsewhere in the universe and the apparently lifeless universe we in fact observe. The expectation that the universe should be teeming with intelligent life is linked to models like the Drake equation, which suggest that even if the probability of intelligent life developing at a given site is small, the sheer multitude of possible sites should nonetheless yield a large number of potentially observable civilizations. We show that this conflict arises from the use of Drake-like equations, which implicitly assume certainty regarding highly uncertain parameters. We examine these parameters, incorporating models of chemical and genetic transitions on paths to the origin of life, and show that extant scientific knowledge corresponds to uncertainties that span multiple orders of magnitude. This makes a stark difference. When the model is recast to represent realistic distributions of uncertainty, we find a substantial {\em ex ante} probability of there being no other intelligent life in our observable universe, and thus that there should be little surprise when we fail to detect any signs of it. This result dissolves the Fermi paradox, and in doing so removes any need to invoke speculative mechanisms by which civilizations would inevitably fail to have observable effects upon the universe.
Playing Atari with Deep Reinforcement Learning ( Abstract ↓
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
The network of global corporate control ( Abstract ↓
The structure of the control network of transnational corporations affects global market competition and financial stability. So far, only small national samples were studied and there was no appropriate methodology to assess control globally. We present the first investigation of the architecture of the international ownership network, along with the computation of the control held by each global player. We find that transnational corporations form a giant bow-tie structure and that a large portion of control flows to a small tightly-knit core of financial institutions. This core can be seen as an economic "super-entity" that raises new important issues both for researchers and policy makers.
Jewish Problems ( Abstract ↓
This is a special collection of problems that were given to select applicants during oral entrance exams to the math department of Moscow State University. These problems were designed to prevent Jews and other undesirables from getting a passing grade. Among problems that were used by the department to blackball unwanted candidate students, these problems are distinguished by having a simple solution that is difficult to find. Using problems with a simple solution protected the administration from extra complaints and appeals. This collection therefore has mathematical as well as historical value.
Explaining and Harnessing Adversarial Examples ( Abstract ↓
Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images ( Abstract ↓
Deep neural networks (DNNs) have recently been achieving state-of-the-art performance on a variety of pattern-recognition tasks, most notably visual classification problems. Given that DNNs are now able to classify objects in images with near-human-level performance, questions naturally arise as to what differences remain between computer and human vision. A recent study revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion a library). Here we show a related result: it is easy to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion). Specifically, we take convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and then find images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class. It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects, which we call "fooling images" (more generally, fooling examples). Our results shed light on interesting differences between human vision and current DNNs, and raise questions about the generality of DNN computer vision.
Intriguing properties of neural networks ( Abstract ↓
Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties. First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains of the semantic information in the high layers of neural networks. Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extend. We can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.
Generative Adversarial Networks ( Abstract ↓
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
Language Models are Few-Shot Learners ( Abstract ↓
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Neural Turing Machines ( Abstract ↓
We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.
The Case for Learned Index Structures ( Abstract ↓
Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show, that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.
On the genetic architecture of intelligence and other quantitative traits ( Abstract ↓
How do genes affect cognitive ability or other human quantitative traits such as height or disease risk? Progress on this challenging question is likely to be significant in the near future. I begin with a brief review of psychometric measurements of intelligence, introducing the idea of a "general factor" or g score. The main results concern the stability, validity (predictive power), and heritability of adult g. The largest component of genetic variance for both height and intelligence is additive (linear), leading to important simplifications in predictive modeling and statistical estimation. Due mainly to the rapidly decreasing cost of genotyping, it is possible that within the coming decade researchers will identify loci which account for a significant fraction of total g variation. In the case of height analogous efforts are well under way. I describe some unpublished results concerning the genetic architecture of height and cognitive ability, which suggest that roughly 10k moderately rare causal variants of mostly negative effect are responsible for normal population variation. Using results from Compressed Sensing (L1-penalized regression), I estimate the statistical power required to characterize both linear and nonlinear models for quantitative traits. The main unknown parameter s (sparsity) is the number of loci which account for the bulk of the genetic variation. The required sample size is of order 100s, or roughly a million in the case of cognitive ability.
Understanding deep learning requires rethinking generalization ( Abstract ↓
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
Dynamic Routing Between Capsules ( Abstract ↓
A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discrimininatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.
Ultimate physical limits to computation ( Abstract ↓
Computers are physical systems: what they can and cannot do is dictated by the laws of physics. In particular, the speed with which a physical device can process information is limited by its energy and the amount of information that it can process is limited by the number of degrees of freedom it possesses. This paper explores the physical limits of computation as determined by the speed of light $c$, the quantum scale $\hbar$ and the gravitational constant $G$. As an example, quantitative bounds are put to the computational power of an `ultimate laptop' with a mass of one kilogram confined to a volume of one liter.
Spectre is here to stay: An analysis of side-channels and speculative execution ( Abstract ↓
The recent discovery of the Spectre and Meltdown attacks represents a watershed moment not just for the field of Computer Security, but also of Programming Languages. This paper explores speculative side-channel attacks and their implications for programming languages. These attacks leak information through micro-architectural side-channels which we show are not mere bugs, but in fact lie at the foundation of optimization. We identify three open problems, (1) finding side-channels, (2) understanding speculative vulnerabilities, and (3) mitigating them. For (1) we introduce a mathematical meta-model that clarifies the source of side-channels in simulations and CPUs. For (2) we introduce an architectural model with speculative semantics to study recently-discovered vulnerabilities. For (3) we explore and evaluate software mitigations and prove one correct for this model. Our analysis is informed by extensive offensive research and defensive implementation work for V8, the production JavaScript virtual machine in Chrome. Straightforward extensions to model real hardware suggest these vulnerabilities present formidable challenges for effective, efficient mitigation. As a result of our work, we now believe that speculative vulnerabilities on today's hardware defeat all language-enforced confidentiality with no known comprehensive software mitigations, as we have discovered that untrusted code can construct a universal read gadget to read all memory in the same address space through side-channels. In the face of this reality, we have shifted the security model of the Chrome web browser and V8 to process isolation.
The Cellular Automaton Interpretation of Quantum Mechanics ( Abstract ↓
When investigating theories at the tiniest conceivable scales in nature, almost all researchers today revert to the quantum language, accepting the verdict from the Copenhagen doctrine that the only way to describe what is going on will always involve states in Hilbert space, controlled by operator equations. Returning to classical, that is, non quantum mechanical, descriptions will be forever impossible, unless one accepts some extremely contrived theoretical constructions that may or may not reproduce the quantum mechanical phenomena observed in experiments. Dissatisfied, this author investigated how one can look at things differently. This book is an overview of older material, but also contains many new observations and calculations. Quantum mechanics is looked upon as a tool, not as a theory. Examples are displayed of models that are classical in essence, but can be analysed by the use of quantum techniques, and we argue that even the Standard Model, together with gravitational interactions, might be viewed as a quantum mechanical approach to analyse a system that could be classical at its core. We explain how such thoughts can conceivably be reconciled with Bell's theorem, and how the usual objections voiced against the notion of `superdeterminism' can be overcome, at least in principle. Our proposal would eradicate the collapse problem and the measurement problem. Even the existence of an "arrow of time" can perhaps be explained in a more elegant way than usual. Discussions added in v3: the role of the gravitational force, a mathematical physics definition of free will, and an unconventional view on the arrow of time, amongst others.
Java Generics are Turing Complete ( Abstract ↓
This paper describes a reduction from the halting problem of Turing machines to subtype checking in Java. It follows that subtype checking in Java is undecidable, which answers a question posed by Kennedy and Pierce in 2007. It also follows that Java's type checker can recognize any recursive language, which improves a result of Gil and Levy from 2016. The latter point is illustrated by a parser generator for fluent interfaces.
Software Engineering at Google ( Abstract ↓
We catalog and describe Google's key software engineering practices.
Harmony Explained: Progress Towards A Scientific Theory of Music ( Abstract ↓
Most music theory books are like medieval medical textbooks: they contain unjustified superstition, non-reasoning, and funny symbols glorified by Latin phrases. How does music, in particular harmony, actually work, presented as a real, scientific theory of music? The core to our approach is to consider not only the Physical phenomena of nature but also the Computational phenomena of any machine that must make sense of sound, such as the human brain. In particular we derive the following three fundamental phenomena of music: * the Major Scale, * the Standard Chord Dictionary, and * the difference in feeling between the Major and Minor Triads. While the Major Scale has been independently derived before by others in a similar manner [Helmholtz1863, Birkhoff1933], I believe the derivation of the Standard Chord Dictionary as well as the difference in feeling between the Major and Minor Triads to be original. We show to be incomplete the theory of the heretofore agreed-upon authority on this subject, 19th-century Physicist Hermann Helmholtz [Helmholtz1863]: he says notes are in "concord" because the sound playing them together is "less worse" than that of some other notes. But note that, in this theory, more notes can only penalize, some merely less than others, and so the most harmonious sound should be a single note by itself(!) and harmony would not exist as a phenomenon of music at all. I intend this article to be satisfying to scientists as an original contribution to science and art, yet I also intend it to be approachable by musicians and other curious members of the general public who may have long wondered at the curious properties of tonal music and been frustrated by the lack of satisfying, readable exposition on the subject. Therefore I have written in a deliberately plain and conversational style, avoiding unnecessarily formal language.
Responses to Critiques on Machine Learning of Criminality Perceptions (Addendum of arXiv:1611.04135) ( Abstract ↓
In November 2016 we submitted to arXiv our paper "Automated Inference on Criminality Using Face Images". It generated a great deal of discussions in the Internet and some media outlets. Our work is only intended for pure academic discussions; how it has become a media consumption is a total surprise to us. Although in agreement with our critics on the need and importance of policing AI research for the general good of the society, we are deeply baffled by the ways some of them mispresented our work, in particular the motive and objective of our research.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding ( Abstract ↓
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
A Neural Conversational Model ( Abstract ↓
Conversational modeling is an important task in natural language understanding and machine intelligence. Although previous approaches exist, they are often restricted to specific domains (e.g., booking an airline ticket) and require hand-crafted rules. In this paper, we present a simple approach for this task which uses the recently proposed sequence to sequence framework. Our model converses by predicting the next sentence given the previous sentence or sentences in a conversation. The strength of our model is that it can be trained end-to-end and thus requires much fewer hand-crafted rules. We find that this straightforward model can generate simple conversations given a large conversational training dataset. Our preliminary results suggest that, despite optimizing the wrong objective function, the model is able to converse well. It is able extract knowledge from both a domain specific dataset, and from a large, noisy, and general domain dataset of movie subtitles. On a domain-specific IT helpdesk dataset, the model can find a solution to a technical problem via conversations. On a noisy open-domain movie transcript dataset, the model can perform simple forms of common sense reasoning. As expected, we also find that the lack of consistency is a common failure mode of our model.
Why does deep and cheap learning work so well? ( Abstract ↓
We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through "cheap learning" with exponentially fewer parameters than generic ones. We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine-learning, a deep neural network can be more efficient than a shallow one. We formalize these claims using information theory and discuss the relation to the renormalization group. We prove various "no-flattening theorems" showing when efficient linear deep networks cannot be accurately approximated by shallow ones without efficiency loss, for example, we show that $n$ variables cannot be multiplied using fewer than 2^n neurons in a single hidden layer.
Foldscope: Origami-based paper microscope ( Abstract ↓
Here we describe an ultra-low-cost origami-based approach for large-scale manufacturing of microscopes, specifically demonstrating brightfield, darkfield, and fluorescence microscopes. Merging principles of optical design with origami enables high-volume fabrication of microscopes from 2D media. Flexure mechanisms created via folding enable a flat compact design. Structural loops in folded paper provide kinematic constraints as a means for passive self-alignment. This light, rugged instrument can survive harsh field conditions while providing a diversity of imaging capabilities, thus serving wide-ranging applications for cost-effective, portable microscopes in science and education.
One weird trick for parallelizing convolutional neural networks ( Abstract ↓
I present a new way to parallelize the training of convolutional neural networks across multiple GPUs. The method scales significantly better than all alternatives when applied to modern convolutional neural networks.
On proof and progress in mathematics ( Abstract ↓
In response to Jaffe and Quinn [math.HO/9307227], the author discusses forms of progress in mathematics that are not captured by formal proofs of theorems, especially in his own work in the theory of foliations and geometrization of 3-manifolds and dynamical systems.
Universal Intelligence: A Definition of Machine Intelligence ( Abstract ↓
A fundamental problem in artificial intelligence is that nobody really knows what intelligence is. The problem is especially acute when we need to consider artificial systems which are significantly different to humans. In this paper we approach this problem in the following way: We take a number of well known informal definitions of human intelligence that have been given by experts, and extract their essential features. These are then mathematically formalised to produce a general measure of intelligence for arbitrary machines. We believe that this equation formally captures the concept of machine intelligence in the broadest reasonable sense. We then show how this formal definition is related to the theory of universal optimal learning agents. Finally, we survey the many other tests and definitions of intelligence that have been proposed for machines.
The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V ( Abstract ↓
This report makes the case that a well-designed Reduced Instruction Set Computer (RISC) can match, and even exceed, the performance and code density of existing commercial Complex Instruction Set Computers (CISC) while maintaining the simplicity and cost-effectiveness that underpins the original RISC goals. We begin by comparing the dynamic instruction counts and dynamic instruction bytes fetched for the popular proprietary ARMv7, ARMv8, IA-32, and x86-64 Instruction Set Architectures (ISAs) against the free and open RISC-V RV64G and RV64GC ISAs when running the SPEC CINT2006 benchmark suite. RISC-V was designed as a very small ISA to support a wide range of implementations, and has a less mature compiler toolchain. However, we observe that on SPEC CINT2006 RV64G executes on average 16% more instructions than x86-64, 3% more instructions than IA-32, 9% more instructions than ARMv8, but 4% fewer instructions than ARMv7. CISC x86 implementations break up complex instructions into smaller internal RISC-like micro-ops, and the RV64G instruction count is within 2% of the x86-64 retired micro-op count. RV64GC, the compressed variant of RV64G, is the densest ISA studied, fetching 8% fewer dynamic instruction bytes than x86-64. We observed that much of the increased RISC-V instruction count is due to a small set of common multi-instruction idioms. Exploiting this fact, the RV64G and RV64GC effective instruction count can be reduced by 5.4% on average by leveraging macro-op fusion. Combining the compressed RISC-V ISA extension with macro-op fusion provides both the densest ISA and the fewest dynamic operations retired per program, reducing the motivation to add more instructions to the ISA. This approach retains a single simple ISA suitable for both low-end and high-end implementations, where high-end implementations can boost performance through microarchitectural techniques.
Learning to learn by gradient descent by gradient descent ( Abstract ↓
The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.
Reconciling a Reactionless Propulsive Drive with the First Law of Thermodynamics ( Abstract ↓
A "space drive" is a hypothetical device that generates a propulsive force in free space using an input of power without the need for a reaction mass. Any device that generates photons (e.g., a laser) would qualify as a propellantless "photon rocket," but the force generated by emitting photons per power input (3.33 $\mu$N/kW) is too small to be a practical propulsion device. The ability to generate greater force per power input would be highly desirable, but, as demonstrated in this paper, such a device would be able to operate as a perpetual motion machine of the first kind. Since applying a constant force results in a constant acceleration, the kinetic energy of a mass driven by such a device increases quadratically with time, while the energy input increases only linearly with time. Thus, at some point, the kinetic energy of the device-driven mass exceeds the energy input, and if this energy is collected via decelerating the mass (via regenerative electromagnetic braking, for example), then there would be a net gain in energy. For devices with thrust-to-power ratios on the order of 1 N/kW that have been discussed recently in connection with the so-called EM drive, this breakeven occurs at velocities low enough to be feasible with current technology, clearly demonstrating the absurdity of such a device. When relativistic effects are taken into account, it is shown that the photon rocket can only reach energy breakeven as the accelerated mass asymptotically approaches the speed of light. Thus, any device with a thrust-to-power ratio greater than the photon rocket would be able to operate as a perpetual motion machine of the first kind, and thus should be excluded by the First Law of Thermodynamics.
The Loss Surfaces of Multilayer Networks ( Abstract ↓
We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as global minimum often leads to overfitting.
Power-law distributions in empirical data ( Abstract ↓
Power-law distributions occur in many situations of scientific interest and have significant consequences for our understanding of natural and man-made phenomena. Unfortunately, the detection and characterization of power laws is complicated by the large fluctuations that occur in the tail of the distribution -- the part of the distribution representing large but rare events -- and by the difficulty of identifying the range over which power-law behavior holds. Commonly used methods for analyzing power-law data, such as least-squares fitting, can produce substantially inaccurate estimates of parameters for power-law distributions, and even in cases where such methods return accurate answers they are still unsatisfactory because they give no indication of whether the data obey a power law at all. Here we present a principled statistical framework for discerning and quantifying power-law behavior in empirical data. Our approach combines maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov-Smirnov statistic and likelihood ratios. We evaluate the effectiveness of the approach with tests on synthetic data and give critical comparisons to previous approaches. We also apply the proposed methods to twenty-four real-world data sets from a range of different disciplines, each of which has been conjectured to follow a power-law distribution. In some cases we find these conjectures to be consistent with the data while in others the power law is ruled out.
Constraints on the Universe as a Numerical Simulation ( Abstract ↓
Observable consequences of the hypothesis that the observed universe is a numerical simulation performed on a cubic space-time lattice or grid are explored. The simulation scenario is first motivated by extrapolating current trends in computational resource requirements for lattice QCD into the future. Using the historical development of lattice gauge theory technology as a guide, we assume that our universe is an early numerical simulation with unimproved Wilson fermion discretization and investigate potentially-observable consequences. Among the observables that are considered are the muon g-2 and the current differences between determinations of alpha, but the most stringent bound on the inverse lattice spacing of the universe, b^(-1) >~ 10^(11) GeV, is derived from the high-energy cut off of the cosmic ray spectrum. The numerical simulation scenario could reveal itself in the distributions of the highest energy cosmic rays exhibiting a degree of rotational symmetry breaking that reflects the structure of the underlying lattice.
The Mathematical Universe ( Abstract ↓
I explore physics implications of the External Reality Hypothesis (ERH) that there exists an external physical reality completely independent of us humans. I argue that with a sufficiently broad definition of mathematics, it implies the Mathematical Universe Hypothesis (MUH) that our physical world is an abstract mathematical structure. I discuss various implications of the ERH and MUH, ranging from standard physics topics like symmetries, irreducible representations, units, free parameters, randomness and initial conditions to broader issues like consciousness, parallel universes and Godel incompleteness. I hypothesize that only computable and decidable (in Godel's sense) structures exist, which alleviates the cosmological measure problem and help explain why our physical laws appear so simple. I also comment on the intimate relation between mathematical structures, computations, simulations and physical systems.
Majority is not Enough: Bitcoin Mining is Vulnerable ( Abstract ↓
The Bitcoin cryptocurrency records its transactions in a public log called the blockchain. Its security rests critically on the distributed protocol that maintains the blockchain, run by participants called miners. Conventional wisdom asserts that the protocol is incentive-compatible and secure against colluding minority groups, i.e., it incentivizes miners to follow the protocol as prescribed. We show that the Bitcoin protocol is not incentive-compatible. We present an attack with which colluding miners obtain a revenue larger than their fair share. This attack can have significant consequences for Bitcoin: Rational miners will prefer to join the selfish miners, and the colluding group will increase in size until it becomes a majority. At this point, the Bitcoin system ceases to be a decentralized currency. Selfish mining is feasible for any group size of colluding miners. We propose a practical modification to the Bitcoin protocol that protects against selfish mining pools that command less than 1/4 of the resources. This threshold is lower than the wrongly assumed 1/2 bound, but better than the current reality where a group of any size can compromise the system.
Quantum Mechanics of Measurement ( Abstract ↓
An analysis of quantum measurement is presented that relies on an information-theoretic description of quantum entanglement. In a consistent quantum information theory of entanglement, entropies (uncertainties) conditional on measurement outcomes can be negative, implying that measurement can be described via unitary, entropy-conserving, interactions, while still producing randomness in a measurement device. In such a framework, quantum measurement is not accompanied by a wave-function collapse, or a quantum jump. The theory is applied to the measurement of incompatible variables, giving rise to a stronger entropic uncertainty relation than heretofore known. It is also applied to standard quantum measurement situations such as the Stern-Gerlach and double-slit experiments to illustrate how randomness, inherent in the conventional quantum probabilities, arises in a unitary framework. Finally, the present view clarifies the relationship between classical and quantum concepts.
Twitter mood predicts the stock market ( Abstract ↓
Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e., can societies experience mood states that affect their collective decision making? By extension is the public mood correlated or even predictive of economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate the resulting mood time series by comparing their ability to detect the public's response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%.
Machine Learning on Sequential Data Using a Recurrent Weighted Average ( Abstract ↓
Recurrent Neural Networks (RNN) are a type of statistical model designed to handle sequential data. The model reads a sequence one symbol at a time. Each symbol is processed based on information collected from the previous symbols. With existing RNN architectures, each symbol is processed using only information from the previous processing step. To overcome this limitation, we propose a new kind of RNN model that computes a recurrent weighted average (RWA) over every past processing step. Because the RWA can be computed as a running average, the computational overhead scales like that of any other RNN architecture. The approach essentially reformulates the attention mechanism into a stand-alone model. The performance of the RWA model is assessed on the variable copy problem, the adding problem, classification of artificial grammar, classification of sequences by length, and classification of the MNIST images (where the pixels are read sequentially one at a time). On almost every task, the RWA model is found to outperform a standard LSTM model.
DolphinAtack: Inaudible Voice Commands ( Abstract ↓
Speech recognition (SR) systems such as Siri or Google Now have become an increasingly popular human-computer interaction method, and have turned various systems into voice controllable systems(VCS). Prior work on attacking VCS shows that the hidden voice commands that are incomprehensible to people can control the systems. Hidden voice commands, though hidden, are nonetheless audible. In this work, we design a completely inaudible attack, DolphinAttack, that modulates voice commands on ultrasonic carriers (e.g., f > 20 kHz) to achieve inaudibility. By leveraging the nonlinearity of the microphone circuits, the modulated low frequency audio commands can be successfully demodulated, recovered, and more importantly interpreted by the speech recognition systems. We validate DolphinAttack on popular speech recognition systems, including Siri, Google Now, Samsung S Voice, Huawei HiVoice, Cortana and Alexa. By injecting a sequence of inaudible voice commands, we show a few proof-of-concept attacks, which include activating Siri to initiate a FaceTime call on iPhone, activating Google Now to switch the phone to the airplane mode, and even manipulating the navigation system in an Audi automobile. We propose hardware and software defense solutions. We validate that it is feasible to detect DolphinAttack by classifying the audios using supported vector machine (SVM), and suggest to re-design voice controllable systems to be resilient to inaudible voice command attacks.
Attention Is All You Need ( Abstract ↓
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Array Layouts for Comparison-Based Searching ( Abstract ↓
We attempt to determine the best order and search algorithm to store $n$ comparable data items in an array, $A$, of length $n$ so that we can, for any query value, $x$, quickly find the smallest value in $A$ that is greater than or equal to $x$. In particular, we consider the important case where there are many such queries to the same array, $A$, which resides entirely in RAM. In addition to the obvious sorted order/binary search combination we consider the Eytzinger (BFS) layout normally used for heaps, an implicit B-tree layout that generalizes the Eytzinger layout, and the van Emde Boas layout commonly used in the cache-oblivious algorithms literature. After extensive testing and tuning on a wide variety of modern hardware, we arrive at the conclusion that, for small values of $n$, sorted order, combined with a good implementation of binary search is best. For larger values of $n$, we arrive at the surprising conclusion that the Eytzinger layout is usually the fastest. The latter conclusion is unexpected and goes counter to earlier experimental work by Brodal, Fagerberg, and Jacob (SODA~2003), who concluded that both the B-tree and van Emde Boas layouts were faster than the Eytzinger layout for large values of $n$. Our fastest C++ implementations, when compiled, use conditional moves to avoid branch mispredictions and prefetching to reduce cache latency.
Meta Math! The Quest for Omega ( Abstract ↓
This book presents a personal account of the mathematics and metamathematics of the 20th century leading up to the discovery of the halting probability Omega. The emphasis is on history of ideas and philosophical implications.
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks ( Abstract ↓
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
Asynchronous Methods for Deep Reinforcement Learning ( Abstract ↓
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation ( Abstract ↓
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
Magic: The Gathering is Turing Complete ( Abstract ↓
$\textit{Magic: The Gathering}$ is a popular and famously complicated trading card game about magical combat. In this paper we show that optimal play in real-world $\textit{Magic}$ is at least as hard as the Halting Problem, solving a problem that has been open for a decade. To do this, we present a methodology for embedding an arbitrary Turing machine into a game of $\textit{Magic}$ such that the first player is guaranteed to win the game if and only if the Turing machine halts. Our result applies to how real $\textit{Magic}$ is played, can be achieved using standard-size tournament-legal decks, and does not rely on stochasticity or hidden information. Our result is also highly unusual in that all moves of both players are forced in the construction. This shows that even recognising who will win a game in which neither player has a non-trivial decision to make for the rest of the game is undecidable. We conclude with a discussion of the implications for a unified computational theory of games and remarks about the playability of such a board in a tournament setting.
Stratified B-trees and versioning dictionaries ( Abstract ↓
A classic versioned data structure in storage and computer science is the copy-on-write (CoW) B-tree -- it underlies many of today's file systems and databases, including WAFL, ZFS, Btrfs and more. Unfortunately, it doesn't inherit the B-tree's optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. Yet, nothing better has been developed since. We describe the `stratified B-tree', which beats all known semi-external memory versioned B-trees, including the CoW B-tree. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance.
Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace ( Abstract ↓
We perform a comprehensive measurement analysis of Silk Road, an anonymous, international online marketplace that operates as a Tor hidden service and uses Bitcoin as its exchange currency. We gather and analyze data over eight months between the end of 2011 and 2012, including daily crawls of the marketplace for nearly six months in 2012. We obtain a detailed picture of the type of goods being sold on Silk Road, and of the revenues made both by sellers and Silk Road operators. Through examining over 24,400 separate items sold on the site, we show that Silk Road is overwhelmingly used as a market for controlled substances and narcotics, and that most items sold are available for less than three weeks. The majority of sellers disappears within roughly three months of their arrival, but a core of 112 sellers has been present throughout our measurement interval. We evaluate the total revenue made by all sellers, from public listings, to slightly over USD 1.2 million per month; this corresponds to about USD 92,000 per month in commissions for the Silk Road operators. We further show that the marketplace has been operating steadily, with daily sales and number of sellers overall increasing over our measurement interval. We discuss economic and policy implications of our analysis and results, including ethical considerations for future research in this area.
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm ( Abstract ↓
The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.
Distilling the Knowledge in a Neural Network ( Abstract ↓
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Fair anonymity for the Tor network ( Abstract ↓
Current anonymizing networks have become an important tool for guaranteeing users' privacy. However, these platforms can be used to perform illegitimate actions, which sometimes makes service providers see traffic coming from these networks as a probable threat. In order to solve this problem, we propose to add support for fairness mechanisms to the Tor network. Specifically, by introducing a slight modification to the key negotiation process with the entry and exit nodes, in the shape of group signatures. By means of these signatures, we set up an access control method to prevent misbehaving users to make use of the Tor network. Additionally, we establish a predefined method for denouncing illegitimate actions, which impedes the application of the proposed fairness mechanisms as a threat eroding users' privacy. As a direct consequence, traffic coming from Tor would be considered less suspicious by service providers.
ARC: A compact, high-field, fusion nuclear science facility and demonstration power plant with demountable magnets ( Abstract ↓
The affordable, robust, compact (ARC) reactor conceptual design study aims to reduce the size, cost, and complexity of a combined fusion nuclear science facility (FNSF) and demonstration fusion Pilot power plant. ARC is a 200-250 MWe tokamak reactor with a major radius of 3.3 m, a minor radius of 1.1 m, and an on-axis magnetic field of 9.2 T. ARC has rare earth barium copper oxide (REBCO) superconducting toroidal field coils, which have joints to enable disassembly. This allows the vacuum vessel to be replaced quickly, mitigating first wall survivability concerns, and permits a single device to test many vacuum vessel designs and divertor materials. The design point has a plasma fusion gain of Q_p~13.6, yet is fully non-inductive, with a modest bootstrap fraction of only ~63%. Thus ARC offers a high power gain with relatively large external control of the current profile. This highly attractive combination is enabled by the ~23 T peak field on coil with newly available REBCO superconductor technology. External current drive is provided by two innovative inboard RF launchers using 25 MW of lower hybrid and 13.6 MW of ion cyclotron fast wave power. The resulting efficient current drive provides a robust, steady state core plasma far from disruptive limits. ARC uses an all-liquid blanket, consisting of low pressure, slowly flowing fluorine lithium beryllium (FLiBe) molten salt. The liquid blanket is low-risk technology and provides effective neutron moderation and shielding, excellent heat removal, and a tritium breeding ratio >= 1.1. The large temperature range over which FLiBe is liquid permits blanket operation at 900 K with single phase fluid cooling and a high-efficiency Brayton cycle, allowing for net electricity generation when operating ARC as a Pilot power plant.
Many-Worlds and Schroedinger's First Quantum Theory ( Abstract ↓
Schroedinger's first proposal for the interpretation of quantum mechanics was based on a postulate relating the wave function on configuration space to charge density in physical space. Schroedinger apparently later thought that his proposal was empirically wrong. We argue here that this is not the case, at least for a very similar proposal with charge density replaced by mass density. We argue that when analyzed carefully this theory is seen to be an empirically adequate many-worlds theory and not an empirically inadequate theory describing a single world. Moreover, this formulation--Schroedinger's first quantum theory--can be regarded as a formulation of the many-worlds view of quantum mechanics that is ontologically clearer than Everett's.
Measuring the tendency of CNNs to Learn Surface Statistical Regularities ( Abstract ↓
Deep CNNs are known to exhibit the following peculiarity: on the one hand they generalize extremely well to a test set, while on the other hand they are extremely sensitive to so-called adversarial perturbations. The extreme sensitivity of high performance CNNs to adversarial examples casts serious doubt that these networks are learning high level abstractions in the dataset. We are concerned with the following question: How can a deep CNN that does not learn any high level semantics of the dataset manage to generalize so well? The goal of this article is to measure the tendency of CNNs to learn surface statistical regularities of the dataset. To this end, we use Fourier filtering to construct datasets which share the exact same high level abstractions but exhibit qualitatively different surface statistical regularities. For the SVHN and CIFAR-10 datasets, we present two Fourier filtered variants: a low frequency variant and a randomly filtered variant. Each of the Fourier filtering schemes is tuned to preserve the recognizability of the objects. Our main finding is that CNNs exhibit a tendency to latch onto the Fourier image statistics of the training dataset, sometimes exhibiting up to a 28% generalization gap across the various test sets. Moreover, we observe that significantly increasing the depth of a network has a very marginal impact on closing the aforementioned generalization gap. Thus we provide quantitative evidence supporting the hypothesis that deep CNNs tend to learn surface statistical regularities in the dataset rather than higher-level abstract concepts.
Modular implicits ( Abstract ↓
We present modular implicits, an extension to the OCaml language for ad-hoc polymorphism inspired by Scala implicits and modular type classes. Modular implicits are based on type-directed implicit module parameters, and elaborate straightforwardly into OCaml's first-class functors. Basing the design on OCaml's modules leads to a system that naturally supports many features from other languages with systematic ad-hoc overloading, including inheritance, instance constraints, constructor classes and associated types.
Tackling Climate Change with Machine Learning ( Abstract ↓
Climate change is one of the greatest challenges facing humanity, and we, as machine learning experts, may wonder how we can help. Here we describe how machine learning can be a powerful tool in reducing greenhouse gas emissions and helping society adapt to a changing climate. From smart grids to disaster management, we identify high impact problems where existing gaps can be filled by machine learning, in collaboration with other fields. Our recommendations encompass exciting research questions as well as promising business opportunities. We call on the machine learning community to join the global effort against climate change.
Universal Language Model Fine-tuning for Text Classification ( Abstract ↓
Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data. We open-source our pretrained models and code.
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model ( Abstract ↓
Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks ( Abstract ↓
Recognizing arbitrary multi-character text in unconstrained natural photographs is a hard problem. In this paper, we address an equally hard sub-problem in this domain viz. recognizing arbitrary multi-digit numbers from Street View imagery. Traditional approaches to solve this problem typically separate out the localization, segmentation, and recognition steps. In this paper we propose a unified approach that integrates these three steps via the use of a deep convolutional neural network that operates directly on the image pixels. We employ the DistBelief implementation of deep neural networks in order to train large, distributed neural networks on high quality images. We find that the performance of this approach increases with the depth of the convolutional network, with the best performance occurring in the deepest architecture we trained, with eleven hidden layers. We evaluate this approach on the publicly available SVHN dataset and achieve over $96\%$ accuracy in recognizing complete street numbers. We show that on a per-digit recognition task, we improve upon the state-of-the-art, achieving $97.84\%$ accuracy. We also evaluate this approach on an even more challenging dataset generated from Street View imagery containing several tens of millions of street number annotations and achieve over $90\%$ accuracy. To further explore the applicability of the proposed system to broader text recognition tasks, we apply it to synthetic distorted text from reCAPTCHA. reCAPTCHA is one of the most secure reverse turing tests that uses distorted text to distinguish humans from bots. We report a $99.8\%$ accuracy on the hardest category of reCAPTCHA. Our evaluations on both tasks indicate that at specific operating thresholds, the performance of the proposed system is comparable to, and in some cases exceeds, that of human operators.
A Practical Quantum Instruction Set Architecture ( Abstract ↓
We introduce an abstract machine architecture for classical/quantum computations---including compilation---along with a quantum instruction language called Quil for explicitly writing these computations. With this formalism, we discuss concrete implementations of the machine and non-trivial algorithms targeting them. The introduction of this machine dovetails with ongoing development of quantum computing technology, and makes possible portable descriptions of recent classical/quantum algorithms.
Stealing Machine Learning Models via Prediction APIs ( Abstract ↓
Machine learning (ML) models may be deemed confidential due to their sensitive training data, commercial value, or use in security applications. Increasingly often, confidential ML models are being deployed with publicly accessible query interfaces. ML-as-a-service ("predictive analytics") systems are an example: Some allow users to train models on potentially sensitive data and charge others for access on a pay-per-query basis. The tension between model confidentiality and public access motivates our investigation of model extraction attacks. In such attacks, an adversary with black-box access, but no prior knowledge of an ML model's parameters or training data, aims to duplicate the functionality of (i.e., "steal") the model. Unlike in classical learning theory settings, ML-as-a-service offerings may accept partial feature vectors as inputs and include confidence values with predictions. Given these practices, we show simple, efficient attacks that extract target ML models with near-perfect fidelity for popular model classes including logistic regression, neural networks, and decision trees. We demonstrate these attacks against the online services of BigML and Amazon Machine Learning. We further show that the natural countermeasure of omitting confidence values from model outputs still admits potentially harmful model extraction attacks. Our results highlight the need for careful ML model deployment and new model extraction countermeasures.
twister - a P2P microblogging platform ( Abstract ↓
This paper proposes a new microblogging architecture based on peer-to-peer networks overlays. The proposed platform is comprised of three mostly independent overlay networks. The first provides distributed user registration and authentication and is based on the Bitcoin protocol. The second one is a Distributed Hash Table (DHT) overlay network providing key/value storage for user resources and tracker location for the third network. The last network is a collection of possibly disjoint "swarms" of followers, based on the Bittorrent protocol, which can be used for efficient near-instant notification delivery to many users. By leveraging from existing and proven technologies, twister provides a new microblogging platform offering security, scalability and privacy features. A mechanism provides incentive for entities that contribute processing time to run the user registration network, rewarding such entities with the privilege of sending a single unsolicited ("promoted") message to the entire network. The number of unsolicited messages per day is defined in order to not upset users.
Aristotle's Physics: a Physicist's Look ( Abstract ↓
I show that Aristotelian physics is a correct and non-intuitive approximation of Newtonian physics in the suitable domain (motion in fluids), in the same technical sense in which Newton theory is an approximation of Einstein's theory. Aristotelian physics lasted long not because it became dogma, but because it is a very good empirically grounded theory. The observation suggests some general considerations on inter-theoretical relations.
Talent vs Luck: the role of randomness in success and failure ( Abstract ↓
The largely dominant meritocratic paradigm of highly competitive Western cultures is rooted on the belief that success is due mainly, if not exclusively, to personal qualities such as talent, intelligence, skills, efforts or risk taking. Sometimes, we are willing to admit that a certain degree of luck could also play a role in achieving significant material success. But, as a matter of fact, it is rather common to underestimate the importance of external forces in individual successful stories. It is very well known that intelligence or talent exhibit a Gaussian distribution among the population, whereas the distribution of wealth - considered a proxy of success - follows typically a power law (Pareto law). Such a discrepancy between a Normal distribution of inputs, with a typical scale, and the scale invariant distribution of outputs, suggests that some hidden ingredient is at work behind the scenes. In this paper, with the help of a very simple agent-based model, we suggest that such an ingredient is just randomness. In particular, we show that, if it is true that some degree of talent is necessary to be successful in life, almost never the most talented people reach the highest peaks of success, being overtaken by mediocre but sensibly luckier individuals. As to our knowledge, this counterintuitive result - although implicitly suggested between the lines in a vast literature - is quantified here for the first time. It sheds new light on the effectiveness of assessing merit on the basis of the reached level of success and underlines the risks of distributing excessive honors or resources to people who, at the end of the day, could have been simply luckier than others. With the help of this model, several policy hypotheses are also addressed and compared to show the most efficient strategies for public funding of research in order to improve meritocracy, diversity and innovation.
One-shot Learning with Memory-Augmented Neural Networks ( Abstract ↓
Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of "one-shot learning." Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location-based focusing mechanisms.
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network ( Abstract ↓
Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.
A naturalist account of the limited, and hence reasonable, effectiveness of mathematics in physics ( Abstract ↓
The aim of this essay is to propose a conception of mathematics that is fully consonant with naturalism. By that I mean the hypothesis that everything that exists is part of the natural world, which makes up a unitary whole.
RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning ( Abstract ↓
Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems.
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings ( Abstract ↓
The blind application of machine learning runs the risk of amplifying biases present in data. Such a danger is facing us with word embedding, a popular framework to represent text data as vectors which has been used in many machine learning and natural language processing tasks. We show that even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent. This raises concerns because their widespread use, as we describe, often tends to amplify these biases. Geometrically, gender bias is first shown to be captured by a direction in the word embedding. Second, gender neutral words are shown to be linearly separable from gender definition words in the word embedding. Using these properties, we provide a methodology for modifying an embedding to remove gender stereotypes, such as the association between between the words receptionist and female, while maintaining desired associations such as between the words queen and female. We define metrics to quantify both direct and indirect gender biases in embeddings, and develop algorithms to "debias" the embedding. Using crowd-worker evaluation as well as standard benchmarks, we empirically demonstrate that our algorithms significantly reduce gender bias in embeddings while preserving the its useful properties such as the ability to cluster related concepts and to solve analogy tasks. The resulting embeddings can be used in applications without amplifying gender bias.
A Categorical Theory of Patches ( Abstract ↓
When working with distant collaborators on the same documents, one often uses a version control system, which is a program tracking the history of files and helping importing modifications brought by others as patches. The implementation of such a system requires to handle lots of situations depending on the operations performed by users on files, and it is thus difficult to ensure that all the corner cases have been correctly addressed. Here, instead of verifying the implementation of such a system, we adopt a complementary approach: we introduce a theoretical model, which is defined abstractly by the universal property that it should satisfy, and work out a concrete description of it. We begin by defining a category of files and patches, where the operation of merging the effect of two coinitial patches is defined by pushout. Since two patches can be incompatible, such a pushout does not necessarily exist in the category, which raises the question of which is the correct category to represent and manipulate files in conflicting state. We provide an answer by investigating the free completion of the category of files under finite colimits, and give an explicit description of this category: its objects are finite sets labeled by lines equipped with a transitive relation and morphisms are partial functions respecting labeling and relations.
Efficient Estimation of Word Representations in Vector Space ( Abstract ↓
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
On the Origin of Gravity and the Laws of Newton ( Abstract ↓
Starting from first principles and general assumptions Newton's law of gravitation is shown to arise naturally and unavoidably in a theory in which space is emergent through a holographic scenario. Gravity is explained as an entropic force caused by changes in the information associated with the positions of material bodies. A relativistic generalization of the presented arguments directly leads to the Einstein equations. When space is emergent even Newton's law of inertia needs to be explained. The equivalence principle leads us to conclude that it is actually this law of inertia whose origin is entropic.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer ( Abstract ↓
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
Physics, Topology, Logic and Computation: A Rosetta Stone ( Abstract ↓
In physics, Feynman diagrams are used to reason about quantum processes. In the 1980s, it became clear that underlying these diagrams is a powerful analogy between quantum physics and topology: namely, a linear operator behaves very much like a "cobordism". Similar diagrams can be used to reason about logic, where they represent proofs, and computation, where they represent programs. With the rise of interest in quantum cryptography and quantum computation, it became clear that there is extensive network of analogies between physics, topology, logic and computation. In this expository paper, we make some of these analogies precise using the concept of "closed symmetric monoidal category". We assume no prior knowledge of category theory, proof theory or computer science.
Indication of anomalous heat energy production in a reactor device ( Abstract ↓
An experimental investigation of possible anomalous heat production in a special type of reactor tube named E-Cat HT is carried out. The reactor tube is charged with a small amount of hydrogen loaded nickel powder plus some additives. The reaction is primarily initiated by heat from resistor coils inside the reactor tube. Measurement of the produced heat was performed with high-resolution thermal imaging cameras, recording data every second from the hot reactor tube. The measurements of electrical power input were performed with a large bandwidth three-phase power analyzer. Data were collected in two experimental runs lasting 96 and 116 hours, respectively. An anomalous heat production was indicated in both experiments. The 116-hour experiment also included a calibration of the experimental set-up without the active charge present in the E-Cat HT. In this case, no extra heat was generated beyond the expected heat from the electric input. Computed volumetric and gravimetric energy densities were found to be far above those of any known chemical source. Even by the most conservative assumptions as to the errors in the measurements, the result is still one order of magnitude greater than conventional energy sources.
The Mythos of Model Interpretability ( Abstract ↓
Supervised machine learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? We want models to be not only good, but interpretable. And yet the task of interpretation appears underspecified. Papers provide diverse and sometimes non-overlapping motivations for interpretability, and offer myriad notions of what attributes render models interpretable. Despite this ambiguity, many papers proclaim interpretability axiomatically, absent further explanation. In this paper, we seek to refine the discourse on interpretability. First, we examine the motivations underlying interest in interpretability, finding them to be diverse and occasionally discordant. Then, we address model properties and techniques thought to confer interpretability, identifying transparency to humans and post-hoc explanations as competing notions. Throughout, we discuss the feasibility and desirability of different notions, and question the oft-made assertions that linear models are interpretable and that deep neural networks are not.
The "Chaotic Ball" model,local realism and the Bell test loopholes ( Abstract ↓
It has long been known that the "detection loophole", present when detector efficiencies are below a critical figure, could open the way for alternative "local realist" explanations for the violation of Bell tests. It has in recent years become common to assume the loophole can be ignored, regardless of which version of the Bell test is employed. A simple model is presented that illustrates that this may not be justified. Two of the versions -- the standard test of form -2 <= S <= 2 and the currently-popular "visibility" test -- are at grave risk of bias. Statements implying that experimental evidence "refutes local realism" or shows that the quantum world really is "weird" should be reviewed. The detection loophole is on its own unlikely to account for more than one or two test violations, but when taken in conjunction with other loopholes (briefly discussed) it is seen that the experiments refute only a narrow class of "local hidden variable" models, applicable to idealised situations, not to the real world. The full class of local realist models provides straightforward explanations not only for the publicised Bell-test violations but also for some lesser-known "anomalies".
Universal adversarial perturbations ( Abstract ↓
Given a state-of-the-art deep neural network classifier, we show the existence of a universal (image-agnostic) and very small perturbation vector that causes natural images to be misclassified with high probability. We propose a systematic algorithm for computing universal perturbations, and show that state-of-the-art deep neural networks are highly vulnerable to such perturbations, albeit being quasi-imperceptible to the human eye. We further empirically analyze these universal perturbations and show, in particular, that they generalize very well across neural networks. The surprising existence of universal perturbations reveals important geometric correlations among the high-dimensional decision boundary of classifiers. It further outlines potential security breaches with the existence of single directions in the input space that adversaries can possibly exploit to break a classifier on most natural images.
An Evolutionary Theory for the Variability Hypothesis ( Abstract ↓
An elementary biostatistical theory based on a selectivity-variability principle is proposed to address a question raised by Charles Darwin, namely, how one sex of a sexually dimorphic species might tend to evolve with greater variability than the other sex. Briefly, the theory says that if one sex is relatively selective then from one generation to the next, more variable subpopulations of the opposite sex will generally tend to prevail over those with lesser variability. Moreover, the perhaps less intuitive converse also holds: if a sex is relatively non-selective, then less variable subpopulations of the opposite sex will prevail over those with greater variability. This theory requires certain regularity conditions on the distributions, but makes no assumptions about differences in means between the sexes, nor does it presume that one sex is selective and the other non-selective. Two mathematical models of the selectivity-variability principle are presented: a discrete-time one-step probabilistic model of short-term behavior with an example using normally distributed perceived fitness values; and a continuous-time deterministic model for the long-term asymptotic behavior of the expected sizes of the subpopulations with an example using exponentially distributed fitness levels.
Measurement of the neutrino velocity with the OPERA detector in the CNGS beam ( Abstract ↓
The OPERA neutrino experiment at the underground Gran Sasso Laboratory has measured the velocity of neutrinos from the CERN CNGS beam over a baseline of about 730 km. The measurement is based on data taken by OPERA in the years 2009, 2010 and 2011. Dedicated upgrades of the CNGS timing system and of the OPERA detector, as well as a high precision geodesy campaign for the measurement of the neutrino baseline, allowed reaching comparable systematic and statistical accuracies. An arrival time of CNGS muon neutrinos with respect to the one computed assuming the speed of light in vacuum of (6.5 +/- 7.4(stat.)((+8.3)(-8.0)sys.))ns was measured corresponding to a relative difference of the muon neutrino velocity with respect to the speed of light (v-c)/c =(2.7 +/-3.1(stat.)((+3.4)(-3.3)(sys.))x10^(-6). The above result, obtained by comparing the time distributions of neutrino interactions and of protons hitting the CNGS target in 10.5 microseconds long extractions, was confirmed by a test performed at the end of 2011 using a short bunch beam allowing to measure the neutrino time of flight at the single interaction level.
European Union regulations on algorithmic decision-making and a "right to explanation" ( Abstract ↓
We summarize the potential impact that the European Union's new General Data Protection Regulation will have on the routine use of machine learning algorithms. Slated to take effect as law across the EU in 2018, it will restrict automated individual decision-making (that is, algorithms that make decisions based on user-level predictors) which "significantly affect" users. The law will also effectively create a "right to explanation," whereby a user can ask for an explanation of an algorithmic decision that was made about them. We argue that while this law will pose large challenges for industry, it highlights opportunities for computer scientists to take the lead in designing algorithms and evaluation frameworks which avoid discrimination and enable explanation.
The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities ( Abstract ↓
Biological evolution provides a creative fount of complex and subtle adaptations, often surprising the scientists who discover them. However, because evolution is an algorithmic process that transcends the substrate in which it occurs, evolution's creativity is not limited to nature. Indeed, many researchers in the field of digital evolution have observed their evolving algorithms and organisms subverting their intentions, exposing unrecognized bugs in their code, producing unexpected adaptations, or exhibiting outcomes uncannily convergent with ones in nature. Such stories routinely reveal creativity by evolution in these digital worlds, but they rarely fit into the standard scientific narrative. Instead they are often treated as mere obstacles to be overcome, rather than results that warrant study in their own right. The stories themselves are traded among researchers through oral tradition, but that mode of information transmission is inefficient and prone to error and outright loss. Moreover, the fact that these stories tend to be shared only among practitioners means that many natural scientists do not realize how interesting and lifelike digital organisms are and how natural their evolution can be. To our knowledge, no collection of such anecdotes has been published before. This paper is the crowd-sourced product of researchers in the fields of artificial life and evolutionary computation who have provided first-hand accounts of such cases. It thus serves as a written, fact-checked collection of scientifically important and even entertaining stories. In doing so we also present here substantial evidence that the existence and importance of evolutionary surprises extends beyond the natural world, and may indeed be a universal property of all complex evolving systems.
Deep Residual Learning for Image Recognition ( Abstract ↓
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
End to End Learning for Self-Driving Cars ( Abstract ↓
We trained a convolutional neural network (CNN) to map raw pixels from a single front-facing camera directly to steering commands. This end-to-end approach proved surprisingly powerful. With minimum training data from humans the system learns to drive in traffic on local roads with or without lane markings and on highways. It also operates in areas with unclear visual guidance such as in parking lots and on unpaved roads. The system automatically learns internal representations of the necessary processing steps such as detecting useful road features with only the human steering angle as the training signal. We never explicitly trained it to detect, for example, the outline of roads. Compared to explicit decomposition of the problem, such as lane marking detection, path planning, and control, our end-to-end system optimizes all processing steps simultaneously. We argue that this will eventually lead to better performance and smaller systems. Better performance will result because the internal components self-optimize to maximize overall system performance, instead of optimizing human-selected intermediate criteria, e.g., lane detection. Such criteria understandably are selected for ease of human interpretation which doesn't automatically guarantee maximum system performance. Smaller networks are possible because the system learns to solve the problem with the minimal number of processing steps. We used an NVIDIA DevBox and Torch 7 for training and an NVIDIA DRIVE(TM) PX self-driving car computer also running Torch 7 for determining where to drive. The system operates at 30 frames per second (FPS).
Antisocial Behavior in Online Discussion Communities ( Abstract ↓
User contributions in the form of posts, comments, and votes are essential to the success of online communities. However, allowing user participation also invites undesirable behavior such as trolling. In this paper, we characterize antisocial behavior in three large online discussion communities by analyzing users who were banned from these communities. We find that such users tend to concentrate their efforts in a small number of threads, are more likely to post irrelevantly, and are more successful at garnering responses from other users. Studying the evolution of these users from the moment they join a community up to when they get banned, we find that not only do they write worse than other users over time, but they also become increasingly less tolerated by the community. Further, we discover that antisocial behavior is exacerbated when community feedback is overly harsh. Our analysis also reveals distinct groups of users with different levels of antisocial behavior that can change over time. We use these insights to identify antisocial users early on, a task of high practical importance to community maintainers.
Engineering Record And Replay For Deployability: Extended Technical Report ( Abstract ↓
The ability to record and replay program executions with low overhead enables many applications, such as reverse-execution debugging, debugging of hard-to-reproduce test failures, and "black box" forensic analysis of failures in deployed systems. Existing record-and-replay approaches limit deployability by recording an entire virtual machine (heavyweight), modifying the OS kernel (adding deployment and maintenance costs), requiring pervasive code instrumentation (imposing significant performance and complexity overhead), or modifying compilers and runtime systems (limiting generality). We investigated whether it is possible to build a practical record-and-replay system avoiding all these issues. The answer turns out to be yes - if the CPU and operating system meet certain non-obvious constraints. Fortunately modern Intel CPUs, Linux kernels and user-space frameworks do meet these constraints, although this has only become true recently. With some novel optimizations, our system 'rr' records and replays real-world low-parallelism workloads with low overhead, with an entirely user-space implementation, using stock hardware, compilers, runtimes and operating systems. "rr" forms the basis of an open-source reverse-execution debugger seeing significant use in practice. We present the design and implementation of 'rr', describe its performance on a variety of workloads, and identify constraints on hardware and operating system design required to support our approach.
Seven Sketches in Compositionality: An Invitation to Applied Category Theory ( Abstract ↓
This book is an invitation to discover advanced topics in category theory through concrete, real-world examples. It aims to give a tour: a gentle, quick introduction to guide later exploration. The tour takes place over seven sketches, each pairing an evocative application, such as databases, electric circuits, or dynamical systems, with the exploration of a categorical structure, such as adjoint functors, enriched categories, or toposes. No prior knowledge of category theory is assumed. A feedback form for typos, comments, questions, and suggestions is available here:
A quantum-inspired classical algorithm for recommendation systems ( Abstract ↓
We give a classical analogue to Kerenidis and Prakash's quantum recommendation system, previously believed to be one of the strongest candidates for provably exponential speedups in quantum machine learning. Our main result is an algorithm that, given an $m \times n$ matrix in a data structure supporting certain $\ell^2$-norm sampling operations, outputs an $\ell^2$-norm sample from a rank-$k$ approximation of that matrix in time $O(\text{poly}(k)\log(mn))$, only polynomially slower than the quantum algorithm. As a consequence, Kerenidis and Prakash's algorithm does not in fact give an exponential speedup over classical algorithms. Further, under strong input assumptions, the classical recommendation system resulting from our algorithm produces recommendations exponentially faster than previous classical systems, which run in time linear in $m$ and $n$. The main insight of this work is the use of simple routines to manipulate $\ell^2$-norm sampling distributions, which play the role of quantum superpositions in the classical setting. This correspondence indicates a potentially fruitful framework for formally comparing quantum machine learning algorithms to classical machine learning algorithms.
Visualizing and Understanding Convolutional Networks ( Abstract ↓
Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky \etal on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
NP-complete Problems and Physical Reality ( Abstract ↓
Can NP-complete problems be solved efficiently in the physical universe? I survey proposals including soap bubbles, protein folding, quantum computing, quantum advice, quantum adiabatic algorithms, quantum-mechanical nonlinearities, hidden variables, relativistic time dilation, analog computing, Malament-Hogarth spacetimes, quantum gravity, closed timelike curves, and "anthropic computing." The section on soap bubbles even includes some "experimental" results. While I do not believe that any of the proposals will let us solve NP-complete problems efficiently, I argue that by studying them, we can learn something not only about computation but also about physics.
A Study of MAC Address Randomization in Mobile Devices and When it Fails ( Abstract ↓
MAC address randomization is a privacy technique whereby mobile devices rotate through random hardware addresses in order to prevent observers from singling out their traffic or physical location from other nearby devices. Adoption of this technology, however, has been sporadic and varied across device manufacturers. In this paper, we present the first wide-scale study of MAC address randomization in the wild, including a detailed breakdown of different randomization techniques by operating system, manufacturer, and model of device. We then identify multiple flaws in these implementations which can be exploited to defeat randomization as performed by existing devices. First, we show that devices commonly make improper use of randomization by sending wireless frames with the true, global address when they should be using a randomized address. We move on to extend the passive identification techniques of Vanhoef et al. to effectively defeat randomization in ~96% of Android phones. Finally, we show a method that can be used to track 100% of devices using randomization, regardless of manufacturer, by exploiting a previously unknown flaw in the way existing wireless chipsets handle low-level control frames.
Learning to Execute ( Abstract ↓
Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are widely used because they are expressive and are easy to train. Our interest lies in empirically evaluating the expressiveness and the learnability of LSTMs in the sequence-to-sequence regime by training them to evaluate short computer programs, a domain that has traditionally been seen as too complex for neural networks. We consider a simple class of programs that can be evaluated with a single left-to-right pass using constant memory. Our main result is that LSTMs can learn to map the character-level representations of such programs to their correct outputs. Notably, it was necessary to use curriculum learning, and while conventional curriculum learning proved ineffective, we developed a new variant of curriculum learning that improved our networks' performance in all experimental conditions. The improved curriculum had a dramatic impact on an addition problem, making it possible to train an LSTM to add two 9-digit numbers with 99% accuracy.
Concrete Problems in AI Safety ( Abstract ↓
Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer ( Abstract ↓
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
Two notes on notation ( Abstract ↓
The author advocates two specific mathematical notations from his popular course and joint textbook, "Concrete Mathematics". The first of these, extending an idea of Iverson, is the notation "[P]" for the function which is 1 when the Boolean condition P is true and 0 otherwise. This notation can encourage and clarify the use of characteristic functions and Kronecker deltas in sums and integrals. The second notation puts Stirling numbers on the same footing as binomial coefficients. Since binomial coefficients are written on two lines in parentheses and read "n choose k", Stirling numbers of the first kind should be written on two lines in brackets and read "n cycle k", while Stirling numbers of the second kind should be written in braces and read "n subset k". (I might say "n partition k".) The written form was first suggested by Imanuel Marx. The virtues of this notation are that Stirling partition numbers frequently appear in combinatorics, and that it more clearly presents functional relations similar to those satisfied by binomial coefficients.
Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding ( Abstract ↓
The modern data compression is mainly based on two approaches to entropy coding: Huffman (HC) and arithmetic/range coding (AC). The former is much faster, but approximates probabilities with powers of 2, usually leading to relatively low compression rates. The latter uses nearly exact probabilities - easily approaching theoretical compression rate limit (Shannon entropy), but at cost of much larger computational cost. Asymmetric numeral systems (ANS) is a new approach to accurate entropy coding, which allows to end this trade-off between speed and rate: the recent implementation [1] provides about $50\%$ faster decoding than HC for 256 size alphabet, with compression rate similar to provided by AC. This advantage is due to being simpler than AC: using single natural number as the state, instead of two to represent a range. Beside simplifying renormalization, it allows to put the entire behavior for given probability distribution into a relatively small table: defining entropy coding automaton. The memory cost of such table for 256 size alphabet is a few kilobytes. There is a large freedom while choosing a specific table - using pseudorandom number generator initialized with cryptographic key for this purpose allows to simultaneously encrypt the data. This article also introduces and discusses many other variants of this new entropy coding approach, which can provide direct alternatives for standard AC, for large alphabet range coding, or for approximated quasi arithmetic coding.

Last Update 2020-08-09 00:11

Not affiliated with Hacker News or YCombinator