Re: [PATCH 2/2] memory-hotplug: fix kswapd looping forever problem

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]


On Mon, Jun 25, 2012 at 01:59:27PM +0900, Minchan Kim wrote:
> When hotplug offlining happens on zone A, it starts to mark freed page
> as MIGRATE_ISOLATE type in buddy for preventing further allocation.
> (MIGRATE_ISOLATE is very irony type because it's apparently on buddy
> but we can't allocate them).
> When the memory shortage happens during hotplug offlining,
> current task starts to reclaim, then wake up kswapd.
> Kswapd checks watermark, then go sleep because current zone_watermark_ok_safe
> doesn't consider MIGRATE_ISOLATE freed page count.
> Current task continue to reclaim in direct reclaim path without kswapd's helping.
> The problem is that zone->all_unreclaimable is set by only kswapd
> so that current task would be looping forever like below.
> 
> __alloc_pages_slowpath
> restart:
> 	wake_all_kswapd
> rebalance:
> 	__alloc_pages_direct_reclaim
> 		do_try_to_free_pages
> 			if global_reclaim && !all_unreclaimable
> 				return 1; /* It means we did did_some_progress */
> 	skip __alloc_pages_may_oom
> 	should_alloc_retry
> 		goto rebalance;
> 
> If we apply KOSAKI's patch[1] which doesn't depends on kswapd
> about setting zone->all_unreclaimable, we can solve this problem
> by killing some task in direct reclaim path. But it doesn't wake up kswapd, still.
> It could be a problem still if other subsystem needs GFP_ATOMIC request.
> So kswapd should consider MIGRATE_ISOLATE when it calculate free pages
> BEFORE going sleep.
> 
> This patch counts the number of MIGRATE_ISOLATE page block and
> zone_watermark_ok_safe will consider it if the system has such blocks
> (fortunately, it's very rare so no problem in POV overhead and kswapd is never
> hotpath).
> 
> [1] http://lkml.org/lkml/2012/6/14/74
> 

I have not been following this discussion at all but reading through the
patch I wonder again why memory hotplug is not "allocating" the pageblocks
when they are fully isolated like a balloon driver. This would keep the
free space accounting as it is currently without introducing something
memory hotplug specific to kswapd.

I think historically that memory hotplug did not allocate pages because
it would be difficult to detect if a pageblock was isolated or if part of
some balloon. Allocating just full pageblocks would work around this.
However, it would play very badly with CMA.

It'd be worth mentioning this in the changelog in case someone tries to
"fix" it.

> Suggested-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> Cc: Aaditya Kumar <aaditya.kumar.30@xxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> ---
> 
> Aaditya, coul you confirm this patch solve your problem and 
> make sure nr_migrate_isolate is zero after hotplug end?
> 
> Thanks!
> 
>  include/linux/mmzone.h |    8 ++++++++
>  mm/page_alloc.c        |   36 ++++++++++++++++++++++++++++++++++++
>  mm/page_isolation.c    |   43 +++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 85 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bf3404e..290e186 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -474,6 +474,14 @@ struct zone {
>  	 * rarely used fields:
>  	 */
>  	const char		*name;
> +#ifdef CONFIG_MEMORY_ISOLATION
> +	/*
> +	 * the number of MIGRATE_ISOLATE pageblock
> +	 * We need this for accurate free page counting.
> +	 * It's protected by zone->lock
> +	 */
> +	atomic_t		nr_migrate_isolate;
> +#endif

If it's protected by the zone->lock then it does not need to be atomic.

>  } ____cacheline_internodealigned_in_smp;
>  
>  typedef enum {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c175fa9..626f877 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -218,6 +218,11 @@ EXPORT_SYMBOL(nr_online_nodes);
>  
>  int page_group_by_mobility_disabled __read_mostly;
>  
> +/*
> + * NOTE:
> + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) direclty.

s/direclty/directly/

> + * Instead, use {un}set_pageblock_isolate.
> + */
>  void set_pageblock_migratetype(struct page *page, int migratetype)
>  {
>  	if (unlikely(page_group_by_mobility_disabled))
> @@ -1614,6 +1619,28 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>  	return true;
>  }
>  
> +static unsigned long migrate_isolate_pages(struct zone *zone)
> +{

This name is very misleading as nothing is being migrated. The
other zone-based counters are stored in vmstat[] and accessed
with zone_page_state(). I doubt you want to use vmstat but
nr_isolated_zone_pages() would have been a better name.

> +	unsigned long nr_pages = 0;
> +
> +	if (unlikely(atomic_read(&zone->nr_migrate_isolate))) {
> +		unsigned long flags;
> +		int order;
> +		spin_lock_irqsave(&zone->lock, flags);
> +		for (order = 0; order < MAX_ORDER; order++) {
> +			struct free_area *area = &zone->free_area[order];
> +			long count = 0;
> +			struct list_head *curr;
> +
> +			list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
> +				count++;
> +			nr_pages += (count << order);
> +		}
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +	}

We have a zone->nr_migrate_isolate counter but have to search the buddy
lists to count how many pages are isolated? Don't bother. If the pageblocks
really have been isolated just assume that they are fully isolated for the
purposes of figuring out if kswapd should wake up or not. The consequences
of an inaccurate count is that kswapd wakes up when it potentially could
have stayed asleep but for memory hotplug that is desirable.

> +	return nr_pages;
> +}
> +
>  bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>  		      int classzone_idx, int alloc_flags)
>  {
> @@ -1629,6 +1656,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
>  	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>  		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>  
> +	/*
> +	 * If the zone has MIGRATE_ISOLATE type free page,
> +	 * we should consider it, too. Otherwise, kswapd can sleep forever.
> +	 */
> +	free_pages -= migrate_isolate_pages(z);
> +	if (free_pages < 0)
> +		free_pages = 0;
> +

You are already taking into account that the numbet of isolated pages may
be inaccurate so an exact count in migrate_isolate_pages is unnecessary.

>  	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>  								free_pages);
>  }
> @@ -4407,6 +4442,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  		lruvec_init(&zone->lruvec, zone);
>  		zap_zone_vm_stats(zone);
>  		zone->flags = 0;
> +		atomic_set(&zone->nr_migrate_isolate, 0);
>  		if (!size)
>  			continue;
>  
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 1a9cb36..e95a792 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -8,6 +8,45 @@
>  #include <linux/memory.h>
>  #include "internal.h"
>  
> +static void set_pageblock_isolate(struct zone *zone, struct page *page)
> +{
> +	int old_migratetype;
> +	assert_spin_locked(&zone->lock);
> +
> +	if (unlikely(page_group_by_mobility_disabled)) {
> +		set_pageblock_flags_group(page, MIGRATE_UNMOVABLE,
> +				PB_migrate, PB_migrate_end);
> +		return;
> +	}

Not sure why this is necessary.

> +
> +	old_migratetype = get_pageblock_migratetype(page);
> +	set_pageblock_flags_group(page, MIGRATE_ISOLATE,
> +			PB_migrate, PB_migrate_end);
> +
> +	if (old_migratetype != MIGRATE_ISOLATE)
> +		atomic_inc(&zone->nr_migrate_isolate);
> +}

If the old type was MIGRATE_ISOLATE then it was also unnecessary to call
set_pageblock_flags_group() or anything else. It's also unnecessary to
pass in zone because you can figure it out from the page but no harm.

This could have been a lot easier to read with something like;

static void set_pageblock_isolate(struct zone *zone, struct page *page)
{
	BUG_ON(page_zone(page) ! = zone);
	if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
		return;

	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
	zone->nr_pageblock_isolate++;
}

If set_migratetype_isolate is the only caller then it barely warrents a
helper functions because it simply looks like

if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) {
	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
	zone->nr_pageblock_isolate++;
}
	

> +
> +static void unset_pageblock_isolate(struct zone *zone, struct page *page,
> +		unsigned long migratetype)
> +{

The word "unset" in this context would normally refer to a boolean but
you're actually restoring an old value. Hence reset_pageblock_isolate or
restore_pageblock_migratetype would have been more appropriate.

Oh, I see you are matching the naming of unset_migratetype_isolate().
That sucks, hope it was not me that suggested that name originally :/

migratetype is almost always int too, not sure where that unsigned long
came out of.

> +	assert_spin_locked(&zone->lock);
> +
> +	if (unlikely(page_group_by_mobility_disabled)) {
> +		set_pageblock_flags_group(page, migratetype,
> +				PB_migrate, PB_migrate_end);
> +		return;
> +	}
> +
> +	BUG_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
> +	BUG_ON(migratetype == MIGRATE_ISOLATE);
> +
> +	set_pageblock_flags_group(page, migratetype,
> +			PB_migrate, PB_migrate_end);
> +	atomic_dec(&zone->nr_migrate_isolate);
> +	BUG_ON(atomic_read(&zone->nr_migrate_isolate) < 0);
> +}
> +

static void reset_pageblock_isolate(struct zone *zone, struct page *page,
				    int migratetype)
{
	if (WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE))
		return;

	BUG_ON(zone->nr_pageblock_isolate == 0);
	set_pageblock_migratetype(page, migratetype);
	zone->nr_pageblock_isolate--;
}


>  int set_migratetype_isolate(struct page *page)
>  {
>  	struct zone *zone;
> @@ -54,7 +93,7 @@ int set_migratetype_isolate(struct page *page)
>  
>  out:
>  	if (!ret) {
> -		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> +		set_pageblock_isolate(zone, page);
>  		move_freepages_block(zone, page, MIGRATE_ISOLATE);
>  	}
>  
> @@ -72,8 +111,8 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>  	spin_lock_irqsave(&zone->lock, flags);
>  	if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
>  		goto out;
> -	set_pageblock_migratetype(page, migratetype);
>  	move_freepages_block(zone, page, migratetype);
> +	unset_pageblock_isolate(zone, page, migratetype);
>  out:
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  }

While this patch looks like it would work as advertised I also think that
it is more complicated than it needs to be. The use of atomics and exact
counts of isolated pages look unnecessary.  set_migeratetype_isolate and
unset_pageblock_isolate could have been a lot easier to read.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Other Archives]     [Linux Kernel Newbies]     [Linux Driver Development]     [Linux Kbuild]     [Fedora Kernel]     [Linux Kernel Testers]     [Linux SH]     [Linux Omap]     [Linux Tape]     [Linux Input]     [Linux Kernel Janitors]     [Linux Kernel Packagers]     [Linux Doc]     [Linux Man Pages]     [Linux API]     [Linux Memory Management]     [Linux Modules]     [Linux Standards]     [Kernel Announce]     [Netdev]     [Git]     [Linux PCI]     Linux CAN Development     [Linux I2C]     [Linux RDMA]     [Linux NUMA]     [Netfilter]     [Netfilter Devel]     [SELinux]     [Bugtraq]     [FIO]     [Linux Perf Users]     [Linux Serial]     [Linux PPP]     [Linux ISDN]     [Linux Next]     [Kernel Stable Commits]     [Linux Tip Commits]     [Kernel MM Commits]     [Linux Security Module]     [AutoFS]     [Filesystem Development]     [Ext3 Filesystem]     [Linux bcache]     [Ext4 Filesystem]     [Linux BTRFS]     [Linux CEPH Filesystem]     [Linux XFS]     [XFS]     [Linux NFS]     [Linux CIFS]     [Ecryptfs]     [Linux NILFS]     [Linux Cachefs]     [Reiser FS]     [Initramfs]     [Linux FB Devel]     [Linux OpenGL]     [DRI Devel]     [Fastboot]     [Linux RT Users]     [Linux RT Stable]     [eCos]     [Corosync]     [Linux Clusters]     [LVS Devel]     [Hot Plug]     [Linux Virtualization]     [KVM]     [KVM PPC]     [KVM ia64]     [Linux Containers]     [Linux Hexagon]     [Linux Cgroups]     [Util Linux]     [Wireless]     [Linux Bluetooth]     [Bluez Devel]     [Ethernet Bridging]     [Embedded Linux]     [Barebox]     [Linux MMC]     [Linux IIO]     [Sparse]     [Smatch]     [Linux Arch]     [x86 Platform Driver]     [Linux ACPI]     [Linux IBM ACPI]     [LM Sensors]     [CPU Freq]     [Linux Power Management]     [Linmodems]     [Linux DCCP]     [Linux SCTP]     [ALSA Devel]     [Linux USB]     [Linux PA RISC]     [Linux Samsung SOC]     [MIPS Linux]     [IBM S/390 Linux]     [ARM Linux]     [ARM Kernel]     [ARM MSM]     [Tegra Devel]     [Sparc Linux]     [Linux Security]     [Linux Sound]     [Linux Media]     [Video 4 Linux]     [Linux IRDA Users]     [Linux for the blind]     [Linux RAID]     [Linux ATA RAID]     [Device Mapper]     [Linux SCSI]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Linux IDE]     [Linux SMP]     [Linux AXP]     [Linux Alpha]     [Linux M68K]     [Linux ia64]     [Linux 8086]     [Linux x86_64]     [Linux Config]     [Linux Apps]     [Linux MSDOS]     [Linux X.25]     [Linux Crypto]     [DM Crypt]     [Linux Trace Users]     [Linux Btrace]     [Linux Watchdog]     [Utrace Devel]     [Linux C Programming]     [Linux Assembly]     [Dash]     [DWARVES]     [Hail Devel]     [Linux Kernel Debugger]     [Linux gcc]     [Gcc Help]     [X.Org]     [Wine]

Add to Google Powered by Linux

[Older Kernel Discussion]     [Yosemite National Park Forum]     [Large Format Photos]     [Gimp]     [Yosemite Photos]     [Stuff]