]> Pileus Git - ~andy/linux/log
~andy/linux
12 years agobootmem/sparsemem: remove limit constraint in alloc_bootmem_section
Nishanth Aravamudan [Wed, 21 Mar 2012 23:34:07 +0000 (16:34 -0700)]
bootmem/sparsemem: remove limit constraint in alloc_bootmem_section

While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
Overcommit) on powerpc, we tripped the following:

  kernel BUG at mm/bootmem.c:483!
  cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
      pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
      lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
      sp: c000000000c03bc0
     msr: 8000000000021032
    current = 0xc000000000b0cce0
    paca    = 0xc000000001d80000
      pid   = 0, comm = swapper
  kernel BUG at mm/bootmem.c:483!
  enter ? for help
  [c000000000c03c80c000000000a64bcc
  .sparse_early_usemaps_alloc_node+0x84/0x29c
  [c000000000c03d50c000000000a64f10 .sparse_init+0x12c/0x28c
  [c000000000c03e20c000000000a474f4 .setup_arch+0x20c/0x294
  [c000000000c03ee0c000000000a4079c .start_kernel+0xb4/0x460
  [c000000000c03f90c000000000009670 .start_here_common+0x1c/0x2c

This is

        BUG_ON(limit && goal + size > limit);

and after some debugging, it seems that

goal = 0x7ffff000000
limit = 0x80000000000

and sparse_early_usemaps_alloc_node ->
sparse_early_usemaps_alloc_pgdat_section calls

return alloc_bootmem_section(usemap_size() * count, section_nr);

This is on a system with 8TB available via the AMS pool, and as a quirk
of AMS in firmware, all of that memory shows up in node 0.  So, we end
up with an allocation that will fail the goal/limit constraints.

In theory, we could "fall-back" to alloc_bootmem_node() in
sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
defined, we'll BUG_ON() instead.  A simple solution appears to be to
unconditionally remove the limit condition in alloc_bootmem_section,
meaning allocations are allowed to cross section boundaries (necessary
for systems of this size).

Johannes Weiner pointed out that if alloc_bootmem_section() no longer
guarantees section-locality, we need check_usemap_section_nr() to print
possible cross-dependencies between node descriptors and the usemaps
allocated through it.  That makes the two loops in
sparse_early_usemaps_alloc_node() identical, so re-factor the code a
bit.

[akpm@linux-foundation.org: code simplification]
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.3.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: drain percpu lru add/rotate page-vectors on cpu hot-unplug
Konstantin Khlebnikov [Wed, 21 Mar 2012 23:34:06 +0000 (16:34 -0700)]
mm: drain percpu lru add/rotate page-vectors on cpu hot-unplug

This cpu hotplug hook was accidentally removed in commit 00a62ce91e55
("mm: fix Committed_AS underflow on large NR_CPUS environment")

The visible effect of this accident: some pages are borrowed in per-cpu
page-vectors.  Truncate can deal with it, but these pages cannot be
reused while this cpu is offline.  So this is like a temporary memory
leak.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: fix move/migrate_pages() race on task struct
Christoph Lameter [Wed, 21 Mar 2012 23:34:06 +0000 (16:34 -0700)]
mm: fix move/migrate_pages() race on task struct

Migration functions perform the rcu_read_unlock too early.  As a result
the task pointed to may change from under us.  This can result in an oops,
as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.

The following patch extend the period of the rcu_read_lock until after the
permissions checks are done.  We also take a refcount so that the task
reference is stable when calling security check functions and performing
cpuset node validation (which takes a mutex).

The refcount is dropped before actual page migration occurs so there is no
change to the refcounts held during page migration.

Also move the determination of the mm of the task struct to immediately
before the do_migrate*() calls so that it is clear that we switch from
handling the task during permission checks to the mm for the actual
migration.  Since the determination is only done once and we then no
longer use the task_struct we can be sure that we operate on a specific
address space that will not change from under us.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agothp: allow a hwpoisoned head page to be put back to LRU
Dean Nelson [Wed, 21 Mar 2012 23:34:05 +0000 (16:34 -0700)]
thp: allow a hwpoisoned head page to be put back to LRU

Andrea Arcangeli pointed out to me that a check in __memory_failure()
which was intended to prevent THP tail pages from being checked for the
absence of the PG_lru flag (something that is always the case), was also
preventing THP head pages from being checked.

A THP head page could actually benefit from the call to shake_page() by
ending up being put back to a LRU, provided it had been waiting in a
pagevec array.

Andrea suggested that the "!PageTransCompound(p)" in the if-statement
should be replaced by a "!PageTransTail(p)", thus allowing THP head pages
to be checked and possibly shaken.

Signed-off-by: Dean Nelson <dnelson@redhat.com>
Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agotmpfs: security xattr setting on inode creation
Jarkko Sakkinen [Wed, 21 Mar 2012 23:34:05 +0000 (16:34 -0700)]
tmpfs: security xattr setting on inode creation

Adds to generic xattr support introduced in Linux 3.0 by implementing
initxattrs callback.  This enables consulting of security attributes from
LSM and EVM when inode is created.

[hughd@google.com: moved under CONFIG_TMPFS_XATTR, with memcpy in shmem_xattr_alloc]
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm, oom: force oom kill on sysrq+f
David Rientjes [Wed, 21 Mar 2012 23:34:04 +0000 (16:34 -0700)]
mm, oom: force oom kill on sysrq+f

The oom killer chooses not to kill a thread if:

 - an eligible thread has already been oom killed and has yet to exit,
   and

 - an eligible thread is exiting but has yet to free all its memory and
   is not the thread attempting to currently allocate memory.

SysRq+F manually invokes the global oom killer to kill a memory-hogging
task.  This is normally done as a last resort to free memory when no
progress is being made or to test the oom killer itself.

For both uses, we always want to kill a thread and never defer.  This
patch causes SysRq+F to always kill an eligible thread and can be used to
force a kill even if another oom killed thread has failed to exit.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agoprocfs: mark thread stack correctly in proc/<pid>/maps
Siddhesh Poyarekar [Wed, 21 Mar 2012 23:34:04 +0000 (16:34 -0700)]
procfs: mark thread stack correctly in proc/<pid>/maps

Stack for a new thread is mapped by userspace code and passed via
sys_clone.  This memory is currently seen as anonymous in
/proc/<pid>/maps, which makes it difficult to ascertain which mappings
are being used for thread stacks.  This patch uses the individual task
stack pointers to determine which vmas are actually thread stacks.

For a multithreaded program like the following:

#include <pthread.h>

void *thread_main(void *foo)
{
while(1);
}

int main()
{
pthread_t t;
pthread_create(&t, NULL, thread_main, NULL);
pthread_join(t, NULL);
}

proc/PID/maps looks like the following:

    00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
    7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0
    7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
    7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
    7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
    7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
    7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
    7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since
the earlier vma that has no permissions (7f8a44e3d000-7f8a4503d000) but
that is not always a reliable way to find out which vma is a thread
stack.  Also, /proc/PID/maps and /proc/PID/task/TID/maps has the same
content.

With this patch in place, /proc/PID/task/TID/maps are treated as 'maps
as the task would see it' and hence, only the vma that that task uses as
stack is marked as [stack].  All other 'stack' vmas are marked as
anonymous memory.  /proc/PID/maps acts as a thread group level view,
where all thread stack vmas are marked as [stack:TID] where TID is the
process ID of the task that uses that vma as stack, while the process
stack is marked as [stack].

So /proc/PID/maps will look like this:

    00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
    7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack:1442]
    7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
    7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
    7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
    7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
    7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
    7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Thus marking all vmas that are used as stacks by the threads in the
thread group along with the process stack.  The task level maps will
however like this:

    00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
    019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
    7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack]
    7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
    7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
    7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
    7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
    7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
    7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
    7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
    7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0
    7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

where only the vma that is being used as a stack by *that* task is
marked as [stack].

Analogous changes have been made to /proc/PID/smaps,
/proc/PID/numa_maps, /proc/PID/task/TID/smaps and
/proc/PID/task/TID/numa_maps. Relevant snippets from smaps and
numa_maps:

    [siddhesh@localhost ~ ]$ pgrep a.out
    1441
    [siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack"
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack:1442]
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack"
    7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack]
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack"
    7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
    [siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack"
    7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2
    7fff6273a000 default stack anon=3 dirty=3 N0=3
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack"
    7f8a44492000 default stack anon=2 dirty=2 N0=2
    [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack"
    7fff6273a000 default stack anon=3 dirty=3 N0=3

[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jamie Lokier <jamie@shareable.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: hugetlb: bail out unmapping after serving reference page
Hillf Danton [Wed, 21 Mar 2012 23:34:03 +0000 (16:34 -0700)]
mm: hugetlb: bail out unmapping after serving reference page

When unmapping a given VM range, we could bail out if a reference page is
supplied and is unmapped, which is a minor optimization.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agothp: documentation: 'transparent_hugepage=' can also be specified on cmdline
Jiri Kosina [Wed, 21 Mar 2012 23:34:02 +0000 (16:34 -0700)]
thp: documentation: 'transparent_hugepage=' can also be specified on cmdline

The behavior of THP can either be toggled through sysfs in runtime or
using a kernel cmdline parameter 'transparent_hugepage='.  Document the
latter in kernel-parameters.txt

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agovmscan: handle isolated pages with lru lock released
Hillf Danton [Wed, 21 Mar 2012 23:34:02 +0000 (16:34 -0700)]
vmscan: handle isolated pages with lru lock released

When shrinking inactive lru list, isolated pages are queued on locally
private list, so the lock-hold time could be reduced if pages are counted
without lock protection.

To achieve that, firstly updating reclaim stat is delayed until the
putback stage, after reacquiring the lru lock.

Secondly, operations related to vm and zone stats are now proteced with
preemption disabled as they are per-cpu operations.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agormap: remove __anon_vma_link() declaration
Xiao Guangrong [Wed, 21 Mar 2012 23:34:01 +0000 (16:34 -0700)]
rmap: remove __anon_vma_link() declaration

This declaration is not used anymore, remove it.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agormap: anon_vma_prepare: Reduce code duplication by calling anon_vma_chain_link
Kautuk Consul [Wed, 21 Mar 2012 23:34:01 +0000 (16:34 -0700)]
rmap: anon_vma_prepare: Reduce code duplication by calling anon_vma_chain_link

Reduce code duplication by calling anon_vma_chain_link() from
anon_vma_prepare().

Also move anon_vmal_chain_link() to a more suitable location in the file.

Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: KAMEZWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: hugetlb: defer freeing pages when gathering surplus pages
Hillf Danton [Wed, 21 Mar 2012 23:34:00 +0000 (16:34 -0700)]
mm: hugetlb: defer freeing pages when gathering surplus pages

When gathering surplus pages, the number of needed pages is recomputed
after reacquiring hugetlb lock to catch changes in resv_huge_pages and
free_huge_pages.  Plus it is recomputed with the number of newly allocated
pages involved.

Thus freeing pages can be deferred a bit to see if the final page request
is satisfied, though pages could be allocated less than needed.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: vmscan: forcibly scan highmem if there are too many buffer_heads pinning highmem
Mel Gorman [Wed, 21 Mar 2012 23:34:00 +0000 (16:34 -0700)]
mm: vmscan: forcibly scan highmem if there are too many buffer_heads pinning highmem

Stuart Foster reported on bugzilla that copying large amounts of data
from NTFS caused an OOM kill on 32-bit X86 with 16G of memory.  Andrew
Morton correctly identified that the problem was NTFS was using 512
blocks meaning each page had 8 buffer_heads in low memory pinning it.

In the past, direct reclaim used to scan highmem even if the allocating
process did not specify __GFP_HIGHMEM but not any more.  kswapd no longer
will reclaim from zones that are above the high watermark.  The intention
in both cases was to minimise unnecessary reclaim.  The downside is on
machines with large amounts of highmem that lowmem can be fully consumed
by buffer_heads with nothing trying to free them.

The following patch is based on a suggestion by Andrew Morton to extend
the buffer_heads_over_limit case to force kswapd and direct reclaim to
scan the highmem zone regardless of the allocation request or watermarks.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=42578

[hughd@google.com: move buffer_heads_over_limit check up]
[akpm@linux-foundation.org: buffer_heads_over_limit is unlikely]
Reported-by: Stuart Foster <smf.linux@ntlworld.com>
Tested-by: Stuart Foster <smf.linux@ntlworld.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: replace PAGE_MIGRATION with IS_ENABLED(CONFIG_MIGRATION)
Konstantin Khlebnikov [Wed, 21 Mar 2012 23:33:59 +0000 (16:33 -0700)]
mm: replace PAGE_MIGRATION with IS_ENABLED(CONFIG_MIGRATION)

Since commit 2a11c8ea20bf ("kconfig: Introduce IS_ENABLED(),
IS_BUILTIN() and IS_MODULE()") there is a generic grep-friendly method
for checking config options in C expressions.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agopagemap: introduce data structure for pagemap entry
Naoya Horiguchi [Wed, 21 Mar 2012 23:33:59 +0000 (16:33 -0700)]
pagemap: introduce data structure for pagemap entry

Currently a local variable of pagemap entry in pagemap_pte_range() is
named pfn and typed with u64, but it's not correct (pfn should be unsigned
long.)

This patch introduces special type for pagemap entries and replaces code
with it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agopagemap: document KPF_THP and make page-types aware of it
Naoya Horiguchi [Wed, 21 Mar 2012 23:33:58 +0000 (16:33 -0700)]
pagemap: document KPF_THP and make page-types aware of it

page-types, which is a common user of pagemap, gets aware of thp with this
patch.  This helps system admins and kernel hackers know about how thp
works.  Here is a sample output of page-types over a thp:

  $ page-types -p <pid> --raw --list

  voffset offset  len     flags
  ...
  7f9d40200       3f8400  1       ___U_lA____Ma_bH______t____________
  7f9d40201       3f8401  1ff     ________________T_____t____________

               flags      page-count       MB  symbolic-flags                     long-symbolic-flags
  0x0000000000410000             511        1  ________________T_____t____________        compound_tail,thp
  0x000000000040d868               1        0  ___U_lA____Ma_bH______t____________        uptodate,lru,active,mmap,anonymous,swapbacked,compound_head,thp

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agopagemap: export KPF_THP
Naoya Horiguchi [Wed, 21 Mar 2012 23:33:58 +0000 (16:33 -0700)]
pagemap: export KPF_THP

This flag shows that a given page is a subpage of a transparent hugepage.
It helps us debug and test the kernel by showing physical address of thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agothp: optimize away unnecessary page table locking
Naoya Horiguchi [Wed, 21 Mar 2012 23:33:57 +0000 (16:33 -0700)]
thp: optimize away unnecessary page table locking

Currently when we check if we can handle thp as it is or we need to split
it into regular sized pages, we hold page table lock prior to check
whether a given pmd is mapping thp or not.  Because of this, when it's not
"huge pmd" we suffer from unnecessary lock/unlock overhead.  To remove it,
this patch introduces a optimized check function and replace several
similar logics with it.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agopagemap: avoid splitting thp when reading /proc/pid/pagemap
Naoya Horiguchi [Wed, 21 Mar 2012 23:33:57 +0000 (16:33 -0700)]
pagemap: avoid splitting thp when reading /proc/pid/pagemap

Thp split is not necessary if we explicitly check whether pmds are mapping
thps or not.  This patch introduces this check and adds code to generate
pagemap entries for pmds mapping thps, which results in less performance
impact of pagemap on thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: search from free_area_cache for the bigger size
Xiao Guangrong [Wed, 21 Mar 2012 23:33:56 +0000 (16:33 -0700)]
mm: search from free_area_cache for the bigger size

If the required size is bigger than cached_hole_size it is better to
search from free_area_cache - it is easier to get a free region,
specifically for the 64 bit process whose address space is large enough

Do it just as hugetlb_get_unmapped_area_topdown() in arch/x86/mm/hugetlbpage.c

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: do not reset cached_hole_size when vma is unmapped
Xiao Guangrong [Wed, 21 Mar 2012 23:33:56 +0000 (16:33 -0700)]
mm: do not reset cached_hole_size when vma is unmapped

In the current code, cached_hole_size is set to the maximum value if the
unmapped vma is less that free_area_cache so the next search will search
from the base address.

Actually, we can keep cached_hole_size so that if the next required size
is more than cached_hole_size, it can search from free_area_cache.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agohugetlb: try to search again if it is really needed
Xiao Guangrong [Wed, 21 Mar 2012 23:33:55 +0000 (16:33 -0700)]
hugetlb: try to search again if it is really needed

Search again only if some holes may be skipped in the first pass.

[akpm@linux-foundation.org: clean up crazy compound definition]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agohugetlbfs: fix hugetlb_get_unmapped_area()
Xiao Guangrong [Wed, 21 Mar 2012 23:33:54 +0000 (16:33 -0700)]
hugetlbfs: fix hugetlb_get_unmapped_area()

Use/update cached_hole_size and free_area_cache properly to speedup
finding of a free region.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: compaction: make compact_control order signed
Dan Carpenter [Wed, 21 Mar 2012 23:33:54 +0000 (16:33 -0700)]
mm: compaction: make compact_control order signed

"order" is -1 when compacting via /proc/sys/vm/compact_memory.  Making
it unsigned causes a bug in __compact_pgdat() when we test:

if (cc->order < 0 || !compaction_deferred(zone, cc->order))
compact_zone(zone, cc);

[akpm@linux-foundation.org: make __compact_pgdat()'s comparison match other code sites]
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agocompact_pgdat: workaround lockdep warning in kswapd
Hugh Dickins [Wed, 21 Mar 2012 23:33:53 +0000 (16:33 -0700)]
compact_pgdat: workaround lockdep warning in kswapd

I get this lockdep warning from swapping load on linux-next, due to
"vmscan: kswapd carefully call compaction".

=================================
[ INFO: inconsistent lock state ]
3.3.0-rc2-next-20120201 #5 Not tainted
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/28 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (pcpu_alloc_mutex){+.+.?.}, at: [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
{RECLAIM_FS-ON-W} state was registered at:
  [<ffffffff81099b75>] mark_held_locks+0xd7/0x103
  [<ffffffff8109a13c>] lockdep_trace_alloc+0x85/0x9e
  [<ffffffff810f6bdc>] __kmalloc+0x6c/0x14b
  [<ffffffff810d57fd>] pcpu_mem_zalloc+0x59/0x62
  [<ffffffff810d5d16>] pcpu_extend_area_map+0x26/0xb1
  [<ffffffff810d679f>] pcpu_alloc+0x182/0x325
  [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
  [<ffffffff8142ebfd>] snmp_mib_init+0x1e/0x2e
  [<ffffffff8185cd8d>] ipv4_mib_init_net+0x7a/0x184
  [<ffffffff813dc963>] ops_init.clone.0+0x6b/0x73
  [<ffffffff813dc9cc>] register_pernet_operations+0x61/0xa0
  [<ffffffff813dca8e>] register_pernet_subsys+0x29/0x42
  [<ffffffff8185d044>] inet_init+0x1ad/0x252
  [<ffffffff810002e3>] do_one_initcall+0x7a/0x12f
  [<ffffffff81832bc5>] kernel_init+0x9d/0x11e
  [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
irq event stamp: 656613
hardirqs last  enabled at (656613): [<ffffffff814e0ddc>] __mutex_unlock_slowpath+0x104/0x128
hardirqs last disabled at (656612): [<ffffffff814e0d34>] __mutex_unlock_slowpath+0x5c/0x128
softirqs last  enabled at (655568): [<ffffffff8105b4a5>] __do_softirq+0x120/0x136
softirqs last disabled at (654757): [<ffffffff814e52dc>] call_softirq+0x1c/0x30

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(pcpu_alloc_mutex);
  <Interrupt>
    lock(pcpu_alloc_mutex);

 *** DEADLOCK ***

no locks held by kswapd0/28.

stack backtrace:
Pid: 28, comm: kswapd0 Not tainted 3.3.0-rc2-next-20120201 #5
Call Trace:
 [<ffffffff810981f4>] print_usage_bug+0x1bf/0x1d0
 [<ffffffff81096c3e>] ? print_irq_inversion_bug+0x1d9/0x1d9
 [<ffffffff810982c0>] mark_lock_irq+0xbb/0x22e
 [<ffffffff810c5399>] ? free_hot_cold_page+0x13d/0x14f
 [<ffffffff81098684>] mark_lock+0x251/0x331
 [<ffffffff81098893>] mark_irqflags+0x12f/0x141
 [<ffffffff81098e32>] __lock_acquire+0x58d/0x753
 [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
 [<ffffffff81099433>] lock_acquire+0x54/0x6a
 [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
 [<ffffffff8107a5b8>] ? add_preempt_count+0xa9/0xae
 [<ffffffff814e0a21>] mutex_lock_nested+0x5e/0x315
 [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
 [<ffffffff81098f81>] ? __lock_acquire+0x6dc/0x753
 [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
 [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
 [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
 [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
 [<ffffffff8106c35e>] schedule_on_each_cpu+0x23/0x110
 [<ffffffff810c9fcb>] lru_add_drain_all+0x10/0x12
 [<ffffffff810f126f>] __compact_pgdat+0x20/0x182
 [<ffffffff810f15c2>] compact_pgdat+0x27/0x29
 [<ffffffff810c306b>] ? zone_watermark_ok+0x1a/0x1c
 [<ffffffff810cdf6f>] balance_pgdat+0x732/0x751
 [<ffffffff810ce0ed>] kswapd+0x15f/0x178
 [<ffffffff810cdf8e>] ? balance_pgdat+0x751/0x751
 [<ffffffff8106fd11>] kthread+0x84/0x8c
 [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
 [<ffffffff810787ed>] ? finish_task_switch+0x85/0xea
 [<ffffffff814e3861>] ? retint_restore_args+0xe/0xe
 [<ffffffff8106fc8d>] ? __init_kthread_worker+0x56/0x56
 [<ffffffff814e51e0>] ? gs_change+0xb/0xb

The RECLAIM_FS notations indicate that it's doing the GFP_FS checking that
Nick hacked into lockdep a while back: I think we're intended to read that
"<Interrupt>" in the DEADLOCK scenario as "<Direct reclaim>".

I'm hazy, I have not reached any conclusion as to whether it's right to
complain or not; but I believe it's uneasy about kswapd now doing the
mutex_lock(&pcpu_alloc_mutex) which lru_add_drain_all() entails.  Nor have
I reached any conclusion as to whether it's important for kswapd to do
that draining or not.

But so as not to get blocked on this, with lockdep disabled from giving
further reports, here's a patch which removes the lru_add_drain_all() from
kswapd's callpath (and calls it only once from compact_nodes(), instead of
once per node).

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agovmscan: only defer compaction for failed order and higher
Rik van Riel [Wed, 21 Mar 2012 23:33:52 +0000 (16:33 -0700)]
vmscan: only defer compaction for failed order and higher

Currently a failed order-9 (transparent hugepage) compaction can lead to
memory compaction being temporarily disabled for a memory zone.  Even if
we only need compaction for an order 2 allocation, eg.  for jumbo frames
networking.

The fix is relatively straightforward: keep track of the highest order at
which compaction is succeeding, and only defer compaction for orders at
which compaction is failing.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agovmscan: kswapd carefully call compaction
Rik van Riel [Wed, 21 Mar 2012 23:33:52 +0000 (16:33 -0700)]
vmscan: kswapd carefully call compaction

With CONFIG_COMPACTION enabled, kswapd does not try to free contiguous
free pages, even when it is woken for a higher order request.

This could be bad for eg.  jumbo frame network allocations, which are done
from interrupt context and cannot compact memory themselves.  Higher than
before allocation failure rates in the network receive path have been
observed in kernels with compaction enabled.

Teach kswapd to defragment the memory zones in a node, but only if
required and compaction is not deferred in a zone.

[akpm@linux-foundation.org: reduce scope of zones_need_compaction]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agovmscan: reclaim at order 0 when compaction is enabled
Rik van Riel [Wed, 21 Mar 2012 23:33:51 +0000 (16:33 -0700)]
vmscan: reclaim at order 0 when compaction is enabled

When built with CONFIG_COMPACTION, kswapd should not try to free
contiguous pages, because it is not trying hard enough to have a real
chance at being successful, but still disrupts the LRU enough to break
other things.

Do not do higher order page isolation unless we really are in lumpy
reclaim mode.

Stop reclaiming pages once we have enough free pages that compaction can
deal with things, and we hit the normal order 0 watermarks used by kswapd.

Also remove a line of code that increments balanced right before exiting
the function.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: make swapin readahead skip over holes
Rik van Riel [Wed, 21 Mar 2012 23:33:50 +0000 (16:33 -0700)]
mm: make swapin readahead skip over holes

Ever since abandoning the virtual scan of processes, for scalability
reasons, swap space has been a little more fragmented than before.  This
can lead to the situation where a large memory user is killed, swap space
ends up full of "holes" and swapin readahead is totally ineffective.

On my home system, after killing a leaky firefox it took over an hour to
page just under 2GB of memory back in, slowing the virtual machines down
to a crawl.

This patch makes swapin readahead simply skip over holes, instead of
stopping at them.  This allows the system to swap things back in at rates
of several MB/second, instead of a few hundred kB/second.

The checks done in valid_swaphandles are already done in
read_swap_cache_async as well, allowing us to remove a fair amount of
code.

[akpm@linux-foundation.org: fix it for page_cluster >= 32]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Adrian Drzewiecki <z@drze.net>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: vmscan: fix misused nr_reclaimed in shrink_mem_cgroup_zone()
Hillf Danton [Wed, 21 Mar 2012 23:33:50 +0000 (16:33 -0700)]
mm: vmscan: fix misused nr_reclaimed in shrink_mem_cgroup_zone()

The value of nr_reclaimed is the number of pages reclaimed in the current
round of the loop, whereas nr_to_reclaim should be compared with the
number of pages reclaimed in all rounds.

In each round of the loop, reclaimed pages are cut off from the reclaim
goal, and the loop stops once the goal achieved.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: make get_mm_counter static-inline
Konstantin Khlebnikov [Wed, 21 Mar 2012 23:33:49 +0000 (16:33 -0700)]
mm: make get_mm_counter static-inline

Make get_mm_counter() always static inline, it is simple enough for that.
And remove unused set_mm_counter()

bloat-o-meter:

add/remove: 0/1 grow/shrink: 4/12 up/down: 99/-341 (-242)
function                                     old     new   delta
try_to_unmap_one                             886     952     +66
sys_remap_file_pages                        1214    1230     +16
dup_mm                                      1684    1700     +16
do_exit                                     2277    2278      +1
zap_page_range                               208     205      -3
unmap_region                                 304     296      -8
static.oom_kill_process                      554     546      -8
try_to_unmap_file                           1716    1700     -16
getrusage                                    925     909     -16
flush_old_exec                              1704    1688     -16
static.dump_header                           416     390     -26
acct_update_integrals                        218     187     -31
do_task_stat                                2986    2954     -32
get_mm_counter                                34       -     -34
xacct_add_tsk                                371     334     -37
task_statm                                   172     118     -54
task_mem                                     383     323     -60

try_to_unmap_one() grows because update_hiwater_rss() now completely inline.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm/vmscan.c: cleanup with s/reclaim_mode/isolate_mode/
Hillf Danton [Wed, 21 Mar 2012 23:33:48 +0000 (16:33 -0700)]
mm/vmscan.c: cleanup with s/reclaim_mode/isolate_mode/

With tons of reclaim_mode (defined as one field of struct scan_control)
already in the file, it is clearer to rename the local reclaim_mode when
setting up the isolation mode.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: add rss counters consistency check
Konstantin Khlebnikov [Wed, 21 Mar 2012 23:33:48 +0000 (16:33 -0700)]
mm: add rss counters consistency check

Warn about non-zero rss counters at final mmdrop.

This check will prevent reoccurences of bugs such as that fixed in "mm:
fix rss count leakage during migration".

I didn't hide this check under CONFIG_VM_DEBUG because it rather small and
rss counters cover whole page-table management, so this is a good
invariant.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm, oom: introduce independent oom killer ratelimit state
David Rientjes [Wed, 21 Mar 2012 23:33:47 +0000 (16:33 -0700)]
mm, oom: introduce independent oom killer ratelimit state

printk_ratelimit() uses the global ratelimit state for all printks.  The
oom killer should not be subjected to this state just because another
subsystem or driver may be flooding the kernel log.

This patch introduces printk ratelimiting specifically for the oom killer.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm, oom: do not emit oom killer warning if chosen thread is already exiting
David Rientjes [Wed, 21 Mar 2012 23:33:47 +0000 (16:33 -0700)]
mm, oom: do not emit oom killer warning if chosen thread is already exiting

If a thread is chosen for oom kill and is already PF_EXITING, then the oom
killer simply sets TIF_MEMDIE and returns.  This allows the thread to have
access to memory reserves so that it may quickly exit.  This logic is
preceeded with a comment saying there's no need to alarm the sysadmin.
This patch adds truth to that statement.

There's no need to emit any warning about the oom condition if the thread
is already exiting since it will not be killed.  In this condition, just
silently return the oom killer since its only giving access to memory
reserves and is otherwise a no-op.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm, oom: fold oom_kill_task() into oom_kill_process()
David Rientjes [Wed, 21 Mar 2012 23:33:46 +0000 (16:33 -0700)]
mm, oom: fold oom_kill_task() into oom_kill_process()

oom_kill_task() has a single caller, so fold it into its parent function,
oom_kill_process().  Slightly reduces the number of lines in the oom
killer.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm, oom: avoid looping when chosen thread detaches its mm
David Rientjes [Wed, 21 Mar 2012 23:33:46 +0000 (16:33 -0700)]
mm, oom: avoid looping when chosen thread detaches its mm

oom_kill_task() returns non-zero iff the chosen process does not have any
threads with an attached ->mm.

In such a case, it's better to just return to the page allocator and retry
the allocation because memory could have been freed in the interim and the
oom condition may no longer exist.  It's unnecessary to loop in the oom
killer and find another thread to kill.

This allows both oom_kill_task() and oom_kill_process() to be converted to
void functions.  If the oom condition persists, the oom killer will be
recalled.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agosparc: use block_sigmask()
Matt Fleming [Wed, 21 Mar 2012 23:33:46 +0000 (16:33 -0700)]
sparc: use block_sigmask()

Use the new helper function introduced in commit 5e6292c0f28f ("signal:
add block_sigmask() for adding sigmask to current->blocked") which
centralises the code for updating current->blocked after successfully
delivering a signal and reduces the amount of duplicate code across
architectures.  In the past some architectures got this code wrong, so
using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agoxtensa: use set_current_blocked() and block_sigmask()
Matt Fleming [Wed, 21 Mar 2012 23:33:45 +0000 (16:33 -0700)]
xtensa: use set_current_blocked() and block_sigmask()

As described in commit e6fa16ab9c1e ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.

Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures.  In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agoxtensa: don't mask signals if we fail to setup signal stack
Matt Fleming [Wed, 21 Mar 2012 23:33:45 +0000 (16:33 -0700)]
xtensa: don't mask signals if we fail to setup signal stack

setup_frame() needs to return an indication of whether it succeeded or
failed in setting up the signal stack frame.  If setup_frame() fails then
we must not modify current->blocked.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agoxtensa: no need to reset handler if SA_ONESHOT
Matt Fleming [Wed, 21 Mar 2012 23:33:44 +0000 (16:33 -0700)]
xtensa: no need to reset handler if SA_ONESHOT

get_signal_to_deliver() already resets the signal handler if SA_ONESHOT
is set in ka->sa.sa_flags, there's no need to do it again in
handle_signal().

Furthermore, because we were modifying ka->sa.sa_handler (which is a
copy of sighand->action[]) instead of sighand->action[] the original
code actually had no effect on signal delivery.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agoxtensa: don't reimplement force_sigsegv()
Matt Fleming [Wed, 21 Mar 2012 23:33:44 +0000 (16:33 -0700)]
xtensa: don't reimplement force_sigsegv()

Instead of open coding the sequence from force_sigsegv() just call it.
This also fixes a bug because we were modifying ka->sa.sa_handler (which
is a copy of sighand->action[]), whereas the intention of the code was to
modify sighand->action[] directly.

As the original code was working with a copy it had no effect on signal
delivery.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agoseq_file: fix mishandling of consecutive pread() invocations.
Earl Chew [Wed, 21 Mar 2012 23:33:43 +0000 (16:33 -0700)]
seq_file: fix mishandling of consecutive pread() invocations.

The following program illustrates the problem:

    char buf[8192];

    int fd = open("/proc/self/maps", O_RDONLY);

    n = pread(fd, buf, sizeof(buf), 0);
    printf("%d\n", n);

    /* lseek(fd, 0, SEEK_CUR); */ /* Uncomment to work around */

    n = pread(fd, buf, sizeof(buf), 0);
    printf("%d\n", n);

The second printf() prints zero, but uncommenting the lseek() corrects its
behaviour.

To fix, make seq_read() mirror seq_lseek() when processing changes in
*ppos.  Restore m->version first, then if required traverse and update
read_pos on success.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=11856

Signed-off-by: Earl Chew <echew@ixiacom.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agodrivers/idle/intel_idle.c: fix confusing code identation
Marcos Paulo de Souza [Wed, 21 Mar 2012 23:33:43 +0000 (16:33 -0700)]
drivers/idle/intel_idle.c: fix confusing code identation

Fix a code indentation in the function intel_idle_cpu_init that looks
confusing.o

Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com>
Cc: "Brown, Len" <len.brown@intel.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agofs/namei.c: fix warnings on 32-bit
Andrew Morton [Wed, 21 Mar 2012 23:33:42 +0000 (16:33 -0700)]
fs/namei.c: fix warnings on 32-bit

i386 allnoconfig:

  fs/namei.c: In function 'has_zero':
  fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type
  fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type
  fs/namei.c: In function 'hash_name':
  fs/namei.c:1635: warning: integer constant is too large for 'unsigned long' type

There must be a tidier way of doing this.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agomm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode
Andrea Arcangeli [Wed, 21 Mar 2012 23:33:42 +0000 (16:33 -0700)]
mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode

In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode.  In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.

It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds).  The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().

Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously.  This is
probably why it wasn't common to run into this.  For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.

Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).

The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value.  Even if the real pmd is changing under the
value we hold on the stack, we don't care.  If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).

All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd.  The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds).  I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).

if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))

Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.

The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.

====== start quote =======
      mapcount 0 page_mapcount 1
      kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

      mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

        143 void pmd_clear_bad(pmd_t *pmd)
        144 {
    ->  145         pmd_ERROR(*pmd);
        146         pmd_clear(pmd);
        147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

               virtual address space
              .---------------------.
              |                     |
              |                     |
            .-|---------------------|
            | |                     |
            | |                     |<-- B(fault)
            | |                     |
      2 MB  | |/////////////////////|-.
      huge <  |/////////////////////|  > A(range)
      page  | |/////////////////////|-'
            | |                     |
            | |                     |
            '-|---------------------|
              |                     |
              |                     |
              '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
      on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
             madvise_dontneed
               zap_page_range
                 unmap_vmas
                   unmap_page_range
                     zap_pud_range
                       zap_pmd_range
                         //
                         // Assume that this huge page has never been accessed.
                         // I.e. content of the PMD entry is zero (not mapped).
                         //
                         if (pmd_trans_huge(*pmd)) {
                             // We don't get here due to the above assumption.
                         }
                         //
                         // Assume that Thread B incurred a page fault and
             .---------> // sneaks in here as shown below.
             |           //
             |           if (pmd_none_or_clear_bad(pmd))
             |               {
             |                 if (unlikely(pmd_bad(*pmd)))
             |                     pmd_clear_bad
             |                     {
             |                       pmd_ERROR
             |                         // Log "bad pmd ..." message here.
             |                       pmd_clear
             |                         // Clear the page's PMD entry.
             |                         // Thread B incremented the map count
             |                         // in page_add_new_anon_rmap(), but
             |                         // now the page is no longer mapped
             |                         // by a PMD entry (-> inconsistency).
             |                     }
             |               }
             |
             v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
      in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
              // We get here due to the above assumption (PMD entry is zero).
              do_huge_pmd_anonymous_page
                alloc_hugepage_vma
                  // Allocate a new transparent huge page here.
                ...
                __do_huge_pmd_anonymous_page
                  ...
                  spin_lock(&mm->page_table_lock)
                  ...
                  page_add_new_anon_rmap
                    // Here we increment the page's map count (starts at -1).
                    atomic_set(&page->_mapcount, 0)
                  set_pmd_at
                    // Here we set the page's PMD entry which will be cleared
                    // when Thread A calls pmd_clear_bad().
                  ...
                  spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read).  Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated.  However, Thread A
    does not synchronize on that lock.

====== end quote =======

[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 years agoMerge tag 'dlm-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Linus Torvalds [Wed, 21 Mar 2012 20:54:22 +0000 (13:54 -0700)]
Merge tag 'dlm-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates for 3.4 from David Teigland:
 "This set includes one trivial fix, and one simple recovery speed up.
  Directory recovery can use the standard hash table to find resources
  rather than always searching the linear recovery list."

* tag 'dlm-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: last element of dlm_local_addr[] never used
  dlm: fix slow rsb search in dir recovery

12 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Wed, 21 Mar 2012 20:36:41 +0000 (13:36 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs pile 1 from Al Viro:
 "This is _not_ all; in particular, Miklos' and Jan's stuff is not there
  yet."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
  ext4: initialization of ext4_li_mtx needs to be done earlier
  debugfs-related mode_t whack-a-mole
  hfsplus: add an ioctl to bless files
  hfsplus: change finder_info to u32
  hfsplus: initialise userflags
  qnx4: new helper - try_extent()
  qnx4: get rid of qnx4_bread/qnx4_getblk
  take removal of PF_FORKNOEXEC to flush_old_exec()
  trim includes in inode.c
  um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
  um: embed ->stub_pages[] into mmu_context
  gadgetfs: list_for_each_safe() misuse
  ocfs2: fix leaks on failure exits in module_init
  ecryptfs: make register_filesystem() the last potential failure exit
  ntfs: forgets to unregister sysctls on register_filesystem() failure
  logfs: missing cleanup on register_filesystem() failure
  jfs: mising cleanup on register_filesystem() failure
  make configfs_pin_fs() return root dentry on success
  configfs: configfs_create_dir() has parent dentry in dentry->d_parent
  configfs: sanitize configfs_create()
  ...

12 years agoMerge branch 'vm' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Wed, 21 Mar 2012 20:32:19 +0000 (13:32 -0700)]
Merge branch 'vm' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull munmap/truncate race fixes from Al Viro:
 "Fixes for racy use of unmap_vmas() on truncate-related codepaths"

* 'vm' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VM: make zap_page_range() callers that act on a single VMA use separate helper
  VM: make unmap_vmas() return void
  VM: don't bother with feeding upper limit to tlb_finish_mmu() in exit_mmap()
  VM: make zap_page_range() return void
  VM: can't go through the inner loop in unmap_vmas() more than once...
  VM: unmap_page_range() can return void

12 years agoMerge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux...
Linus Torvalds [Wed, 21 Mar 2012 20:25:04 +0000 (13:25 -0700)]
Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

Pull security subsystem updates for 3.4 from James Morris:
 "The main addition here is the new Yama security module from Kees Cook,
  which was discussed at the Linux Security Summit last year.  Its
  purpose is to collect miscellaneous DAC security enhancements in one
  place.  This also marks a departure in policy for LSM modules, which
  were previously limited to being standalone access control systems.
  Chromium OS is using Yama, and I believe there are plans for Ubuntu,
  at least.

  This patchset also includes maintenance updates for AppArmor, TOMOYO
  and others."

Fix trivial conflict in <net/sock.h> due to the jumo_label->static_key
rename.

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
  AppArmor: Fix location of const qualifier on generated string tables
  TOMOYO: Return error if fails to delete a domain
  AppArmor: add const qualifiers to string arrays
  AppArmor: Add ability to load extended policy
  TOMOYO: Return appropriate value to poll().
  AppArmor: Move path failure information into aa_get_name and rename
  AppArmor: Update dfa matching routines.
  AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
  AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
  AppArmor: Add const qualifiers to generated string tables
  AppArmor: Fix oops in policy unpack auditing
  AppArmor: Fix error returned when a path lookup is disconnected
  KEYS: testing wrong bit for KEY_FLAG_REVOKED
  TOMOYO: Fix mount flags checking order.
  security: fix ima kconfig warning
  AppArmor: Fix the error case for chroot relative path name lookup
  AppArmor: fix mapping of META_READ to audit and quiet flags
  AppArmor: Fix underflow in xindex calculation
  AppArmor: Fix dropping of allowed operations that are force audited
  AppArmor: Add mising end of structure test to caps unpacking
  ...

12 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Wed, 21 Mar 2012 20:20:43 +0000 (13:20 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto update from Herbert Xu:
 "* sha512 bug fixes (already in your tree).
  * SHA224/SHA384 AEAD support in caam.
  * X86-64 optimised version of Camellia.
  * Tegra AES support.
  * Bulk algorithm registration interface to make driver registration easier.
  * padata race fixes.
  * Misc fixes."

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (31 commits)
  padata: Fix race on sequence number wrap
  padata: Fix race in the serialization path
  crypto: camellia - add assembler implementation for x86_64
  crypto: camellia - rename camellia.c to camellia_generic.c
  crypto: camellia - fix checkpatch warnings
  crypto: camellia - rename camellia module to camellia_generic
  crypto: tcrypt - add more camellia tests
  crypto: testmgr - add more camellia test vectors
  crypto: camellia - simplify key setup and CAMELLIA_ROUNDSM macro
  crypto: twofish-x86_64/i586 - set alignmask to zero
  crypto: blowfish-x86_64 - set alignmask to zero
  crypto: serpent-sse2 - combine ablk_*_init functions
  crypto: blowfish-x86_64 - use crypto_[un]register_algs
  crypto: twofish-x86_64-3way - use crypto_[un]register_algs
  crypto: serpent-sse2 - use crypto_[un]register_algs
  crypto: serpent-sse2 - remove dead code from serpent_sse2_glue.c::serpent_sse2_init()
  crypto: twofish-x86 - Remove dead code from twofish_glue_3way.c::init()
  crypto: In crypto_add_alg(), 'exact' wants to be initialized to 0
  crypto: caam - fix gcc 4.6 warning
  crypto: Add bulk algorithm registration interface
  ...

12 years agoMerge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck...
Linus Torvalds [Wed, 21 Mar 2012 17:37:25 +0000 (10:37 -0700)]
Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull hwmon changes for v3.4 from Guenter Roeck:
 "Mostly cleanup.  No new drivers this time around, but support for
  several chips added to existing drivers: TPS40400, TPS40422, MTD040,
  MAX34446, ZL9101M, ZL9117M, and LM96080.  Also, added watchdog support
  for SCH56xx, and additional attributes for a couple of drivers."

* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: (137 commits)
  hwmon: (sch56xx) Add support for the integrated watchdog (v2)
  hwmon: (w83627ehf) Add support for temperature offset registers
  hwmon: (jc42) Remove unnecessary device IDs
  hwmon: (zl6100) Add support for ZL9101M and ZL9117M
  hwmon: (adm1275) Add support for ADM1075
  hwmon: (max34440) Add support for MAX34446
  hwmon: (pmbus) Add more virtual registers
  hwmon: (pmbus) Add support for Lineage Power MDT040
  hwmon: (pmbus) Add support for TI TPS40400 and TPS40422
  hwmon: (max34440) Add support for 'lowest' output voltage attribute
  hwmon: (jc42) Convert to use devm_kzalloc
  hwmon: (max16065) Convert to use devm_kzalloc
  hwmon: (smm665) Convert to use devm_kzalloc
  hwmon: (ltc4261) Convert to use devm_kzalloc
  hwmon: (pmbus) Simplify remove functions
  hwmon: (pmbus) Convert pmbus drivers to use devm_kzalloc
  hwmon: (lineage-pem) Convert to use devm_kzalloc
  hwmon: (hwmon-vid) Fix checkpatch issues
  hwmon: (hwmon-vid) Add new entries to VRM model table
  hwmon: (lm80) Add detection of NatSemi/TI LM96080
  ...

12 years agoMerge tag 'regulator-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie...
Linus Torvalds [Wed, 21 Mar 2012 17:34:56 +0000 (10:34 -0700)]
Merge tag 'regulator-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator updates for 3.4 from Mark Brown:
 "This has been a fairly quiet release from a regulator point of view,
  the only real framework features added were devm support and a
  convenience helper for setting up fixed voltage regulators.

  We also added a couple of drivers (but will drop the BQ240022 driver
  via the arm-soc tree as it's been replaced by the more generic
  gpio-regulator driver) and Axel Lin continued his relentless and
  generally awesome stream of fixes and cleanups."

* tag 'regulator-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (93 commits)
  regulator: Fix up a confusing dev_warn when DT lookup fails
  regulator: Convert tps6507x to set_voltage_sel
  regulator: Refactor tps6507x to use one tps6507x_pmic_ops for all LDOs and DCDCs
  regulator: Make s5m8767_get_voltage_register always return correct register
  regulator: s5m8767: Check pdata->buck[2|3|4]_gpiodvs earlier
  regulator: tps65910: Provide settling time for DCDC voltage change
  regulator: Add Anatop regulator driver
  regulator: Simplify implementation of tps65912_get_voltage_dcdc
  regulator: Use tps65912_set_voltage_sel for both DCDCx and LDOx
  regulator: tps65910: Provide settling time for enabling rails
  regulator: max8925: Use DIV_ROUND_UP macro
  regulator: tps65912: Use simple equations to get register address
  regulator: Fix the logic of tps65910_get_mode
  regulator: Merge tps65217_pmic_ldo234_ops and tps65217_pmic_dcdc_ops to tps65217_pmic_ops
  regulator: Use DIV_ROUND_CLOSEST in wm8350_isink_get_current
  regulator: Use array to store dcdc_range settings for tps65912
  regulator: Rename s5m8767_convert_voltage to s5m8767_convert_voltage_to_sel
  regulator: tps6524x: Remove unneeded comment for N_REGULATORS
  regulator: Rename set_voltage_sel callback function name to *_sel
  regulator: Fix s5m8767_set_voltage_time_sel calculation value
  ...

12 years agoMerge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland...
Linus Torvalds [Wed, 21 Mar 2012 17:33:42 +0000 (10:33 -0700)]
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband

Pull InfiniBand/RDMA changes for the 3.4 merge window from Roland Dreier:
 "Nothing big really stands out; by patch count lots of fixes to the
  mlx4 driver plus some cleanups and fixes to the core and other
  drivers."

* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (28 commits)
  mlx4_core: Scale size of MTT table with system RAM
  mlx4_core: Allow dynamic MTU configuration for IB ports
  IB/mlx4: Fix info returned when querying IBoE ports
  IB/mlx4: Fix possible missed completion event
  mlx4_core: Report thermal error events
  mlx4_core: Fix one more static exported function
  IB: Change CQE "csum_ok" field to a bit flag
  RDMA/iwcm: Reject connect requests if cmid is not in LISTEN state
  RDMA/cxgb3: Don't pass irq flags to flush_qp()
  mlx4_core: Get rid of redundant ext_port_cap flags
  RDMA/ucma: Fix AB-BA deadlock
  IB/ehca: Fix ilog2() compile failure
  IB: Use central enum for speed instead of hard-coded values
  IB/iser: Post initial receive buffers before sending the final login request
  IB/iser: Free IB connection resources in the proper place
  IB/srp: Consolidate repetitive sysfs code
  IB/srp: Use pr_fmt() and pr_err()/pr_warn()
  IB/core: Fix SDR rates in sysfs
  mlx4: Enforce device max FMR maps in FMR alloc
  IB/mlx4: Set bad_wr for invalid send opcode
  ...

12 years agoMerge tag 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6
Linus Torvalds [Wed, 21 Mar 2012 17:32:00 +0000 (10:32 -0700)]
Merge tag 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6

Pull SPI changes for v3.4 from Grant Likely:
 "Mostly a bunch of new drivers and driver bug fixes; but this also
  includes a few patches that create a core message queue infrastructure
  for the spi subsystem instead of making each driver open code it."

* tag 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6: (34 commits)
  spi/fsl-espi: Make sure pm is within 2..32
  spi/fsl-espi: make the clock computation easier to read
  spi: sh-hspi: modify write/read method
  spi: sh-hspi: control spi clock more correctly
  spi: sh-hspi: convert to using core message queue
  spi: s3c64xx: Fix build
  spi: s3c64xx: remove unnecessary callback msg->complete
  spi: remove redundant variable assignment
  spi: release lock on error path in spi_pump_messages()
  spi: Compatibility with direction which is used in samsung DMA operation
  spi-topcliff-pch: add recovery processing in case wait-event timeout
  spi-topcliff-pch: supports a spi mode setup and bit order setup by IO control
  spi-topcliff-pch: Fix issue for transmitting over 4KByte
  spi-topcliff-pch: Modify pci-bus number dynamically to get DMA device info
  spi/imx: simplify error handling to free gpios
  spi: Convert to DEFINE_PCI_DEVICE_TABLE
  spi: add Broadcom BCM63xx SPI controller driver
  SPI: add CSR SiRFprimaII SPI controller driver
  spi-topcliff-pch: fix -Wuninitialized warning
  spi: Mark spi_register_board_info() __devinit
  ...

12 years agoMerge tag 'dt-for-linus' of git://git.secretlab.ca/git/linux-2.6
Linus Torvalds [Wed, 21 Mar 2012 17:30:03 +0000 (10:30 -0700)]
Merge tag 'dt-for-linus' of git://git.secretlab.ca/git/linux-2.6

Pull core device tree changes for Linux v3.4 from Grant Likely:
 "This branch contains a minor documentation addition, a utility
  function for parsing string properties needed by some of the new ARM
  platforms, disables dynamic DT code that isn't used anywhere but on a
  few PPC machines, and exports DT node compatible data to userspace via
  UEVENT properties.  Nothing earth shattering here."

* tag 'dt-for-linus' of git://git.secretlab.ca/git/linux-2.6:
  of: Only compile OF_DYNAMIC on PowerPC pseries and iseries
  arm/dts: OMAP3: Add omap3evm and am335xevm support
  drivercore: Output common devicetree information in uevent
  of: Add of_property_match_string() to find index into a string list

12 years agoMerge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6
Linus Torvalds [Wed, 21 Mar 2012 17:27:19 +0000 (10:27 -0700)]
Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6

Pull irq_domain support for all architectures from Grant Likely:
 "Generialize powerpc's irq_host as irq_domain

  This branch takes the PowerPC irq_host infrastructure (reverse mapping
  from Linux IRQ numbers to hardware irq numbering), generalizes it,
  renames it to irq_domain, and makes it available to all architectures.

  Originally the plan has been to create an all-new irq_domain
  implementation which addresses some of the powerpc shortcomings such
  as not handling 1:1 mappings well, but doing that proved to be far
  more difficult and invasive than generalizing the working code and
  refactoring it in-place.  So, this branch rips out the 'new'
  irq_domain and replaces it with the modified powerpc version (in a
  fully bisectable way of course).  It converts all users over to the
  new API and makes irq_domain selectable on any architecture.

  No architecture is forced to enable irq_domain, but the infrastructure
  is required for doing OpenFirmware style irq translations.  It will
  even work on SPARC even though SPARC has it's own mechanism for
  translating irqs at boot time.  MIPS, microblaze, embedded x86 and c6x
  are converted too.

  The resulting irq_domain code is probably still too verbose and can be
  optimized more, but that can be done incrementally and is a task for
  follow-on patches."

* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6: (31 commits)
  dt: fix twl4030 for non-dt compile on x86
  mfd: twl-core: Add IRQ_DOMAIN dependency
  devicetree: Add empty of_platform_populate() for !CONFIG_OF_ADDRESS (sparc)
  irq_domain: Centralize definition of irq_dispose_mapping()
  irq_domain/mips: Allow irq_domain on MIPS
  irq_domain/x86: Convert x86 (embedded) to use common irq_domain
  ppc-6xx: fix build failure in flipper-pic.c and hlwd-pic.c
  irq_domain/microblaze: Convert microblaze to use irq_domains
  irq_domain/powerpc: Replace custom xlate functions with library functions
  irq_domain/powerpc: constify irq_domain_ops
  irq_domain/c6x: Use library of xlate functions
  irq_domain/c6x: constify irq_domain structures
  irq_domain/c6x: Convert c6x to use generic irq_domain support.
  irq_domain: constify irq_domain_ops
  irq_domain: Create common xlate functions that device drivers can use
  irq_domain: Remove irq_domain_add_simple()
  irq_domain: Remove 'new' irq_domain in favour of the ppc one
  mfd: twl-core.c: Fix the number of interrupts managed by twl4030
  of/address: add empty static inlines for !CONFIG_OF
  irq_domain: Add support for base irq and hwirq in legacy mappings
  ...

12 years agoMerge tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Wed, 21 Mar 2012 17:15:51 +0000 (10:15 -0700)]
Merge tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates for 3.4 from Rafael Wysocki:
 "Assorted extensions and fixes including:

  * Introduction of early/late suspend/hibernation device callbacks.
  * Generic PM domains extensions and fixes.
  * devfreq updates from Axel Lin and MyungJoo Ham.
  * Device PM QoS updates.
  * Fixes of concurrency problems with wakeup sources.
  * System suspend and hibernation fixes."

* tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (43 commits)
  PM / Domains: Check domain status during hibernation restore of devices
  PM / devfreq: add relation of recommended frequency.
  PM / shmobile: Make MTU2 driver use pm_genpd_dev_always_on()
  PM / shmobile: Make CMT driver use pm_genpd_dev_always_on()
  PM / shmobile: Make TMU driver use pm_genpd_dev_always_on()
  PM / Domains: Introduce "always on" device flag
  PM / Domains: Fix hibernation restore of devices, v2
  PM / Domains: Fix handling of wakeup devices during system resume
  sh_mmcif / PM: Use PM QoS latency constraint
  tmio_mmc / PM: Use PM QoS latency constraint
  PM / QoS: Make it possible to expose PM QoS latency constraints
  PM / Sleep: JBD and JBD2 missing set_freezable()
  PM / Domains: Fix include for PM_GENERIC_DOMAINS=n case
  PM / Freezer: Remove references to TIF_FREEZE in comments
  PM / Sleep: Add more wakeup source initialization routines
  PM / Hibernate: Enable usermodehelpers in hibernate() error path
  PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
  PM / Sleep: Fix race conditions related to wakeup source timer function
  PM / Sleep: Fix possible infinite loop during wakeup source destruction
  PM / Hibernate: print physical addresses consistently with other parts of kernel
  ...

12 years agoMerge branch 'kmap_atomic' of git://github.com/congwang/linux
Linus Torvalds [Wed, 21 Mar 2012 16:40:26 +0000 (09:40 -0700)]
Merge branch 'kmap_atomic' of git://github.com/congwang/linux

Pull kmap_atomic cleanup from Cong Wang.

It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().

Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.

* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
  feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
  highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
  drbd: remove the second argument of k[un]map_atomic()
  zcache: remove the second argument of k[un]map_atomic()
  gma500: remove the second argument of k[un]map_atomic()
  dm: remove the second argument of k[un]map_atomic()
  tomoyo: remove the second argument of k[un]map_atomic()
  sunrpc: remove the second argument of k[un]map_atomic()
  rds: remove the second argument of k[un]map_atomic()
  net: remove the second argument of k[un]map_atomic()
  mm: remove the second argument of k[un]map_atomic()
  lib: remove the second argument of k[un]map_atomic()
  power: remove the second argument of k[un]map_atomic()
  kdb: remove the second argument of k[un]map_atomic()
  udf: remove the second argument of k[un]map_atomic()
  ubifs: remove the second argument of k[un]map_atomic()
  squashfs: remove the second argument of k[un]map_atomic()
  reiserfs: remove the second argument of k[un]map_atomic()
  ocfs2: remove the second argument of k[un]map_atomic()
  ntfs: remove the second argument of k[un]map_atomic()
  ...

12 years agodlm: last element of dlm_local_addr[] never used
David Teigland [Wed, 21 Mar 2012 14:18:34 +0000 (09:18 -0500)]
dlm: last element of dlm_local_addr[] never used

The last element of dlm_local_addr[DLM_MAX_ADDR_COUNT]
was not used because the loop ended at COUNT - 1.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Teigland <teigland@redhat.com>
12 years agoRevert "video:uvesafb: Fix oops that uvesafb try to execute NX-protected page"
Florian Tobias Schandinat [Wed, 21 Mar 2012 13:22:01 +0000 (13:22 +0000)]
Revert "video:uvesafb: Fix oops that uvesafb try to execute NX-protected page"

This reverts commit ec0d22e4d563e7cce9f6678e2000900755c2989d.

This patch requires exporting 'pcibios_enabled' to avoid breaking
modular uvesafb builds. As this gets some opposition by Alan Cox it
needs more discussion, revert the patch for now.

Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
12 years agoOMAPDSS: register dss drivers in module init
Tomi Valkeinen [Mon, 19 Mar 2012 13:05:02 +0000 (15:05 +0200)]
OMAPDSS: register dss drivers in module init

We do the dss driver registration in a rather strange way: we have the
higher level omapdss driver, and we use that driver's probe function to
register the drivers for the rest of the dss devices.

There doesn't seem to be any reason for that, and additionally the
soon-to-be-merged patch "ARM: OMAP: omap_device: remove
omap_device_parent" will break omapdss initialization with the current
registration model.

This patch changes the registration for all drivers to happen at the
same place, in the init of the module.

Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
12 years agovideo: pxafb: add clk_prepare/clk_unprepare calls
Philipp Zabel [Thu, 15 Mar 2012 18:12:00 +0000 (19:12 +0100)]
video: pxafb: add clk_prepare/clk_unprepare calls

This patch adds clk_prepare/clk_unprepare calls to the pxafb
driver by using the helper functions clk_prepare_enable and
clk_disable_unprepare.

Signed-off-by: Philipp Zabel <philipp.zabel@gmail.com>
Cc: Haojian Zhuang <haojian.zhuang@marvell.com>
Cc: Eric Miao <eric.y.miao@gmail.com>
Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
12 years agofbdev: bfin_adv7393fb: Drop needless include
Jean Delvare [Thu, 15 Mar 2012 09:14:49 +0000 (10:14 +0100)]
fbdev: bfin_adv7393fb: Drop needless include

Kernel drivers don't need <linux/i2c-dev.h>.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
12 years agodrm/i915: use DDC_ADDR instead of hard-coding it
Matt Turner [Fri, 23 Sep 2011 17:18:40 +0000 (13:18 -0400)]
drm/i915: use DDC_ADDR instead of hard-coding it

Signed-off-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon: use DDC_ADDR instead of hard-coding it
Matt Turner [Fri, 23 Sep 2011 17:18:16 +0000 (13:18 -0400)]
drm/radeon: use DDC_ADDR instead of hard-coding it

Signed-off-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm: remove unneeded redefinition of DDC_ADDR
Matt Turner [Fri, 23 Sep 2011 17:18:15 +0000 (13:18 -0400)]
drm: remove unneeded redefinition of DDC_ADDR

It's already defined in drm_edid.h.

Signed-off-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/exynos: added virtual display driver.
Inki Dae [Wed, 21 Mar 2012 01:55:26 +0000 (10:55 +0900)]
drm/exynos: added virtual display driver.

this driver would be used for wireless display. virtual display
driver has independent crtc, encoder and connector and to use
this driver, user application should send edid data to this driver
from wireless display.

Signed-off-by: Inki Dae <inki.dae@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
12 years agofbdev: sh_mipi_dsi: add extra phyctrl for sh_mipi_dsi_info
Kuninori Morimoto [Wed, 21 Mar 2012 01:34:10 +0000 (18:34 -0700)]
fbdev: sh_mipi_dsi: add extra phyctrl for sh_mipi_dsi_info

sh_mipi uses some clocks, but the method of setup depends on CPU.

Current SuperH (like sh73a0) can control all of these clocks
by CPG (Clock Pulse Generator).
It means we can control it by clock framework only.
But on sh7372, it needs CPG settings AND sh_mipi PHYCTRL::PLLDS,
and only sh7372 has PHYCTRL::PLLDS.

But on current sh_mipi driver, PHYCTRL::PLLDS of sh7372 was
overwrote since the callback timing of clock setting was changed
by c2658b70f06108361aa5024798f9c1bf47c73374
(fbdev: sh_mipi_dsi: fixup setup timing of sh_mipi_setup()).
To solve this issue, this patch adds extra .phyctrl.

This patch adds detail explanation for unclear mipi settings
and fixup wrong PHYCTRL::PLLDS value for ap4evb (0xb -> 0x6).

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
12 years agofbdev: remove dependency of FB_SH_MOBILE_MERAM from FB_SH_MOBILE_LCDC
Kuninori Morimoto [Wed, 21 Mar 2012 01:27:08 +0000 (18:27 -0700)]
fbdev: remove dependency of FB_SH_MOBILE_MERAM from FB_SH_MOBILE_LCDC

MERAM can be used for other IP blocks as well in the future.
It doesn't necessarily mean that the MERAM driver depends on the LCDC.
This patch corrects dependency.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
12 years agoMerge tag 'asoc-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound...
Takashi Iwai [Wed, 21 Mar 2012 07:06:05 +0000 (08:06 +0100)]
Merge tag 'asoc-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

Last minute ASoC updates for 3.4

There's a couple of small features here that were added late on but have
been in -next in my tree and some bug fixes.  The wm_hubs stuff is
actually bug fixes - the stuff that's currently in 3.4 is a half way
house between the two solutions that the latest change allows the
machine to select between.

12 years agodrm/radeon/kms: update duallink checks for DCE6
Alex Deucher [Tue, 20 Mar 2012 21:18:42 +0000 (17:18 -0400)]
drm/radeon/kms: update duallink checks for DCE6

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add trinity pci ids
Alex Deucher [Tue, 20 Mar 2012 21:18:41 +0000 (17:18 -0400)]
drm/radeon/kms: add trinity pci ids

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add radeon_asic struct for trinity
Alex Deucher [Tue, 20 Mar 2012 21:18:40 +0000 (17:18 -0400)]
drm/radeon/kms: add radeon_asic struct for trinity

Trinity (TN) is an APU with:
- Cayman 3D
- DCE6.1 display

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add support for ucode loading on trinity (v2)
Alex Deucher [Tue, 20 Mar 2012 21:18:39 +0000 (17:18 -0400)]
drm/radeon/kms: add support for ucode loading on trinity (v2)

v2: fix check for MC ucode from Tom.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms/vm: set vram base offset properly for TN
Alex Deucher [Tue, 20 Mar 2012 21:18:38 +0000 (17:18 -0400)]
drm/radeon/kms/vm: set vram base offset properly for TN

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: Update evergreen functions for trinity
Alex Deucher [Tue, 20 Mar 2012 21:18:37 +0000 (17:18 -0400)]
drm/radeon/kms: Update evergreen functions for trinity

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: cayman gpu init updates for trinity
Alex Deucher [Tue, 20 Mar 2012 21:18:36 +0000 (17:18 -0400)]
drm/radeon/kms: cayman gpu init updates for trinity

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: Add checks for TN in the DP bridge code
Alex Deucher [Tue, 20 Mar 2012 21:18:35 +0000 (17:18 -0400)]
drm/radeon/kms: Add checks for TN in the DP bridge code

TN (trinity) uses DP bridges for LVDS and VGA just like llano.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms/DCE6.1: ss is not supported on the internal pplls
Alex Deucher [Tue, 20 Mar 2012 21:18:34 +0000 (17:18 -0400)]
drm/radeon/kms/DCE6.1: ss is not supported on the internal pplls

It's handled via external clock. It should already be protected
by the external ss flag, but add an explicit check just in case.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: disable PPLL0 on DCE6.1 when not in use
Alex Deucher [Tue, 20 Mar 2012 21:18:33 +0000 (17:18 -0400)]
drm/radeon/kms: disable PPLL0 on DCE6.1 when not in use

Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: Adjust pll picker for DCE6.1
Alex Deucher [Tue, 20 Mar 2012 21:18:32 +0000 (17:18 -0400)]
drm/radeon/kms: Adjust pll picker for DCE6.1

On TN, UNIPHYA always uses PPLL2, UNIPHYB/C/D/E/F
can use either PPLL1 or PPLL0.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: DCE6.1 disp eng pll updates
Alex Deucher [Tue, 20 Mar 2012 21:18:31 +0000 (17:18 -0400)]
drm/radeon/kms: DCE6.1 disp eng pll updates

DCE6.1 uses EXT_PLL1 for disp eng.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: DCE6.1 watermark updates for TN
Alex Deucher [Tue, 20 Mar 2012 21:18:30 +0000 (17:18 -0400)]
drm/radeon/kms: DCE6.1 watermark updates for TN

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: no support for internal thermal sensor on TN yet
Alex Deucher [Tue, 20 Mar 2012 21:18:29 +0000 (17:18 -0400)]
drm/radeon/kms: no support for internal thermal sensor on TN yet

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add trinity (TN) chip family
Alex Deucher [Tue, 20 Mar 2012 21:18:28 +0000 (17:18 -0400)]
drm/radeon/kms: add trinity (TN) chip family

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: Add SI pci ids
Alex Deucher [Tue, 20 Mar 2012 21:18:27 +0000 (17:18 -0400)]
drm/radeon/kms: Add SI pci ids

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon: Update radeon_info_ioctl for SI. (v2)
Michel Dänzer [Tue, 20 Mar 2012 21:18:26 +0000 (17:18 -0400)]
drm/radeon: Update radeon_info_ioctl for SI. (v2)

v2: agd5f: add new MAX_PIPES param

Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add radeon_asic struct for SI
Alex Deucher [Tue, 20 Mar 2012 21:18:25 +0000 (17:18 -0400)]
drm/radeon/kms: add radeon_asic struct for SI

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add support for compute rings in CS ioctl on SI
Alex Deucher [Tue, 20 Mar 2012 21:18:24 +0000 (17:18 -0400)]
drm/radeon/kms: add support for compute rings in CS ioctl on SI

Very basic implementation for picking the ring priority.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: fill in startup/shutdown callbacks for SI
Alex Deucher [Tue, 20 Mar 2012 21:18:23 +0000 (17:18 -0400)]
drm/radeon/kms: fill in startup/shutdown callbacks for SI

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add support for interrupts on SI
Alex Deucher [Tue, 20 Mar 2012 21:18:22 +0000 (17:18 -0400)]
drm/radeon/kms: add support for interrupts on SI

This is mostly identical to evergreen/ni, however
there are some additional fields in the IV vector
for RINGID and VMID.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: Add support for RLC init on SI
Alex Deucher [Tue, 20 Mar 2012 21:18:21 +0000 (17:18 -0400)]
drm/radeon/kms: Add support for RLC init on SI

RLC handles the interrupt controller and other tasks
on the GPU.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add IB and fence dispatch functions for SI
Alex Deucher [Tue, 20 Mar 2012 21:18:20 +0000 (17:18 -0400)]
drm/radeon/kms: add IB and fence dispatch functions for SI

Support both IBs (DE) and CONST IBs (CE).

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add support for CP setup on SI
Alex Deucher [Tue, 20 Mar 2012 21:18:19 +0000 (17:18 -0400)]
drm/radeon/kms: add support for CP setup on SI

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add support for MC ucode loading on SI
Alex Deucher [Tue, 20 Mar 2012 21:18:18 +0000 (17:18 -0400)]
drm/radeon/kms: add support for MC ucode loading on SI

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add ucode loading for SI
Alex Deucher [Tue, 20 Mar 2012 21:18:17 +0000 (17:18 -0400)]
drm/radeon/kms: add ucode loading for SI

Currently the driver required 5 sets of ucode:
1. pfp - pre-fetch parser, part of the CP
2. me - micro engine, part of the CP
3. ce - constant engine, part of the CP
4. rlc - interrupt controller
5. mc - memory controller

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: Only VM CS ioctl is supported on SI (v2)
Alex Deucher [Tue, 20 Mar 2012 21:18:16 +0000 (17:18 -0400)]
drm/radeon/kms: Only VM CS ioctl is supported on SI (v2)

v2: avoid double free.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
12 years agodrm/radeon/kms: add VM CS checker for SI
Alex Deucher [Tue, 20 Mar 2012 21:18:15 +0000 (17:18 -0400)]
drm/radeon/kms: add VM CS checker for SI

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>