Mon, Jan 21, 2013 20:23
Overview of Linux Memory Management Concepts: Slabs

Jim Blakey

Note: In the Linux 2.6 kernels, a new cache manager called the slub allocator is available and may replace the slab allocator described here. Either can be used through kernel configuration parameters. This paper was written a while ago and does not contain the most current kernel information. I will write an update soon - jimb

Introduction to slabs and kernel cache

Kernel modules and drivers often need to allocate temporary storage for non-persistent structures and objects, such as inodes, task structures, and device structures. These objects are uniform in size and are allocated and released many times during the life of the kernel. In earlier Unix and Linux implementations, the usual mechanisms for creating and releasing these objects were the kmalloc() and kfree() kernel calls.

However, these used an allocation scheme that was optimized for allocating and releasing pages in multiples of the hardware page size. For the small transient objects often required by the kernel and drivers, these page allocation routines were horribly inefficient, leaving the individual kernel modules and drivers responsible for optimizing their own memory usage.

One solution was to create a global kernel caching allocator which manages individual caches of identical objects on behalf of kernel modules and drivers. Each module or driver can ask the kernel cache allocator to create a private cache of a specific object type. The cache allocator handles growing each cache as needed on behalf of the module, and more importantly, the cache allocator can release unused pages back to the free pool in times when memory is in a crunch. The cache allocator works with the rest of the memory system to maintain a balance between the memory needs of each driver or module and the system as a whole.

The Linux 2.4 kernel implements a caching memory allocator that maintains caches of identical objects, each cache being built from one or more slabs. This slab allocator is basically an implementation of the "Slab Allocator" as described in UNIX Internals: The New Frontiers by Uresh Vahalia, Prentice Hall, ISBN 0-13-101908-2, with further tweaks based on The Slab Allocator: An Object-Caching Kernel Memory Allocator by Jeff Bonwick (Sun Microsystems), presented at the USENIX Summer 1994 Technical Conference.

The following is a partial list of caches maintained by the Slab Allocator on a typical Linux 2.4 system. This information comes from the /proc/slabinfo file. Note that most kernel modules have requested their own caches. The columns are cache name, active objects, total number of objects, object size, number of active (full or partial) slabs, total number of allocated slabs, and pages per slab.


 slabinfo - version: 1.1
 kmem_cache 59 78 100 2 2 1
 ip_fib_hash 10 113 32 1 1 1
 ip_conntrack 0 0 384 0 0 1
 urb_priv 0 0 64 0 0 1
 clip_arp_cache 0 0 128 0 0 1
 ip_mrt_cache 0 0 96 0 0 1
 tcp_tw_bucket 0 30 128 0 1 1
 tcp_bind_bucket 5 113 32 1 1 1
 tcp_open_request 0 0 96 0 0 1
 inet_peer_cache 0 0 64 0 0 1
 ip_dst_cache 23 100 192 5 5 1
 arp_cache 2 30 128 1 1 1
 blkdev_requests 256 520 96 7 13 1
 dnotify cache 0 0 20 0 0 1
 file lock cache 2 42 92 1 1 1
 fasync cache 1 202 16 1 1 1
 uid_cache 4 113 32 1 1 1
 skbuff_head_cache 93 96 160 4 4 1
 sock 115 126 1280 40 42 1
 sigqueue 0 29 132 0 1 1
 cdev_cache 156 177 64 3 3 1
 bdev_cache 69 118 64 2 2 1
 mnt_cache 13 40 96 1 1 1
 inode_cache 5561 5580 416 619 620 1
 dentry_cache 7599 7620 128 254 254 1
 dquot 0 0 128 0 0 1
 filp 1249 1280 96 32 32 1
 names_cache 0 8 4096 0 8 1
 buffer_head 15303 16920 96 422 423 1
 mm_struct 47 72 160 2 3 1
 vm_area_struct 1954 2183 64 34 37 1
 fs_cache 46 59 64 1 1 1
 files_cache 46 54 416 6 6 1
 
 <snip>

Figure 1: Typical list of caches maintained by kernel

Slab Overview

A slab is a set of one or more contiguous pages of memory set aside by the slab allocator for an individual cache. This memory is further divided into equal segments the size of the object type that the cache is managing.

Figure 2: Two page slab with 6 objects

As an example, assume a file-system driver wishes to create a cache of inodes that it can pull from. Through the kmem_cache_create() call, the slab allocator will calculate the optimal number of pages (in powers of 2) required for each slab given the inode size and other parameters. A kmem_cache_t pointer to this new inode cache is returned to the file-system driver.

When the file-system driver needs a new inode, it calls kmem_cache_alloc() with the kmem_cache_t pointer. The slab allocator will attempt to find a free inode object within the slabs currently allocated to that cache. If there are no free objects, or no slabs, then the slab allocator will grow the cache by fetching a new slab from the free page memory and returning an inode object from that.

When the file-system driver is finished with the inode object, it calls kmem_cache_free() to release the inode. The slab allocator will then mark that object within the slab as free and available.

If all objects within a slab are free, the pages that make up the slab are available to be returned to the free page pool if memory becomes tight. If more inodes are required at a later time, the slab allocator will re-grow the cache by fetching more slabs from free page memory. All of this is completely transparent to the file-system driver.

Creation and management of slab pages

Each slab of pages has an associated slab management structure. The slab_t struct, defined in /usr/src/linux/mm/slab.c, has the following format:


 typedef struct slab_s {
     struct list_head list;
     unsigned long    colouroff;
     void            *s_mem;   /* including colour offset */
     unsigned int     inuse;   /* num of objs active in slab */
     kmem_bufctl_t    free;
 } slab_t;

Where:

list is a generic linked-list structure found in /usr/include/linux/list.h. This type of list structure is used throughout the Linux kernel. It contains prev and next pointers, which are used to keep track of which list this slab is currently on. The various lists a slab_t can be on are described in the next section.

colouroff is an offset within the slab where the slab_t and allocated objects begin. This is part of the cache coloring described a little later.

s_mem is a pointer to the first object in the slab. The cache objects are contiguous from this point. Any object in the slab can be referenced by (object number * object size) + s_mem.

inuse is a counter of the number of objects currently in use.

free is the integer index of the next free object within the slab.

The slab_t structure is immediately followed by an array of kmem_bufctl_t entries, one for each object in the slab. The purpose of the kmem_bufctl_t array is to keep track of allocated and free objects within the slab. Implementations of slab allocators often have these structs containing forward/backward pointers, reference counts, pointers to the slabs they're in, etc. However, the Linux 2.4 implementation of virtual memory uses the page struct to keep track of which caches and slabs cache objects belong to, so the kmem_bufctl_t can be very simple. In fact, it is a single integer index. This array, together with the slab_t->free member, in effect implements the cache heap.

Figure 3. Off slab slab_t structure with 8 kmem_bufctl_t structs

The above figure shows the bufctl array after several slab objects have been allocated and released. The next object to be allocated will come from slot 3. The free member then takes the value stored in slot 3's bufctl, which is 1, so slot 1 will supply the allocation after that, and so on.
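The index chaining described above can be tried out in user space. The following is a minimal sketch, not the kernel's code: struct toy_slab, the toy_ function names, the fixed count of 8 objects, and the BUFCTL_END marker value are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SLAB_OBJS   8            /* objects per slab (assumed, as in Figure 3) */
#define BUFCTL_END  0xffffffffu  /* assumed marker for the end of the free chain */

typedef unsigned int kmem_bufctl_t;

/* Toy stand-in for a slab_t followed by its bufctl array. */
struct toy_slab {
    char          *s_mem;              /* first object in the slab */
    size_t         objsize;            /* size of each object */
    unsigned int   inuse;              /* objects currently allocated */
    kmem_bufctl_t  free;               /* index of the next free object */
    kmem_bufctl_t  bufctl[SLAB_OBJS];  /* bufctl[i] = free object after i */
};

/* Chain every object onto the free list: 0 -> 1 -> ... -> END. */
static void toy_slab_init(struct toy_slab *s, char *mem, size_t objsize)
{
    s->s_mem = mem;
    s->objsize = objsize;
    s->inuse = 0;
    s->free = 0;
    for (unsigned int i = 0; i < SLAB_OBJS; i++)
        s->bufctl[i] = i + 1;
    s->bufctl[SLAB_OBJS - 1] = BUFCTL_END;
}

/* Pop the head of the free chain; returns NULL when the slab is full. */
static void *toy_slab_alloc(struct toy_slab *s)
{
    if (s->free == BUFCTL_END)
        return NULL;
    kmem_bufctl_t idx = s->free;
    s->free = s->bufctl[idx];          /* free takes the slot's bufctl value */
    s->inuse++;
    return s->s_mem + (size_t)idx * s->objsize;
}

/* Push an object back onto the head of the free chain. */
static void toy_slab_free(struct toy_slab *s, void *obj)
{
    kmem_bufctl_t idx = (kmem_bufctl_t)(((char *)obj - s->s_mem) / s->objsize);

    assert(idx < SLAB_OBJS);
    s->bufctl[idx] = s->free;
    s->free = idx;
    s->inuse--;
}
```

Because a freed object is pushed onto the head of the chain, the most recently released object is the first to be handed out again, which is how an out-of-order chain like the one in the figure comes about.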

A slab_t structure may reside in the slab it describes, or it may be allocated off-slab from a separate kernel cache (see slabp_cache below), depending on the size of the objects in the slab. Figure 2 shows an on-slab slab_t structure, whereas Figure 3 shows an off-slab slab_t. The assumption is that if the objects in the cache are large, it is better to place the slab_t off-slab for less fragmentation and better packing of the objects. The rule is that if the object size is greater than 1/8th of the page size, the slab_t is allocated off-slab.

The number of objects per slab (and therefore the number of pages per slab) is calculated when the cache is initially created. The driver or kernel module that asks for the cache only supplies the size of the objects to be cached, and should never have to care about how many objects there are in the cache.

The basic algorithm for calculating the number of objects in a slab is: First see how many objects fit on one page. If the left over is less than 1/8th the total slab size, then the fragmentation (or wastage) is acceptable and we're done.

If there is too much leftover space on the first try, we try again with a larger slab. Slab sizes are always a power of 2 times the page size (the exponent is kept internally as gfporder). So on the first try gfporder is 0 for 1 page, on the second try gfporder is 1 for 2 pages, the next try gets 4 pages, the next gets 8 pages, etc.

The actual algorithm used is obviously a little more complex than what I've just described. There are a lot of other factors and limits to take into account, such as whether the slab_t is kept on the slab (so its size figures into the equation), cache coloring, L1 cache alignment of the objects to reduce cache misses, etc. But the general idea is to find a balance between the number of pages in the slab and the number of objects in the slab for the most efficient use of the space available.
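As a rough illustration of the simplified rule only, the sketch below tries successive orders until the leftover space drops under 1/8th of the slab size. It deliberately ignores the slab_t overhead, colouring, and alignment; the 4096-byte page size, the order limit, and every toy_ name are assumptions rather than kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE  4096u  /* assumed page size */
#define TOY_MAX_ORDER  5u     /* assumed upper limit on gfporder */

struct toy_slab_geometry {
    unsigned int gfporder;   /* slab is 2^gfporder pages */
    unsigned int num;        /* objects per slab */
    size_t       left_over;  /* unused bytes at the end of the slab */
};

/* Pick the smallest order whose wasted space is under 1/8th of the slab. */
static struct toy_slab_geometry toy_slab_estimate(size_t objsize)
{
    struct toy_slab_geometry g = { 0, 0, 0 };

    assert(objsize > 0);
    for (unsigned int order = 0; order <= TOY_MAX_ORDER; order++) {
        size_t slab_size = (size_t)TOY_PAGE_SIZE << order;

        g.gfporder = order;
        g.num = (unsigned int)(slab_size / objsize);
        g.left_over = slab_size - (size_t)g.num * objsize;

        /* Accept once at least one object fits and the waste is small. */
        if (g.num > 0 && g.left_over * 8 < slab_size)
            break;
    }
    /* If no order qualified, g describes the largest order tried. */
    return g;
}
```

For a 416-byte object this settles on one page holding 9 objects, which is consistent with the inode_cache line in Figure 1 (5580 objects in 620 slabs). A 1500-byte object wastes too much of a single page, so the sketch moves on to a two-page slab holding 5 objects.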

The cache

A cache is a group of one or more slabs holding a single object type. The structure that maintains each cache is the kmem_cache_t. The following is a partial list of important fields within the kmem_cache_t structure. This is not the complete structure, as all SMP and statistics related fields were removed for brevity. The full structure is defined in /usr/src/linux/mm/slab.c.


 struct kmem_cache_s {
     struct list_head slabs_full;
     struct list_head slabs_partial;
     struct list_head slabs_free;
     unsigned int     objsize;
     unsigned int     flags;        /* constant flags */
     unsigned int     num;          /* # of objs per slab */
     spinlock_t       spinlock;
     ...
     unsigned int     gfporder;
     unsigned int     gfpflags;
     size_t           colour;       /* cache colouring range */
     unsigned int     colour_off;   /* colour offset */
     unsigned int     colour_next;  /* cache colouring */
     ...
     kmem_cache_t    *slabp_cache;
     unsigned int     growing;
     ...
     void (*ctor)(void *, kmem_cache_t *, unsigned long);
     void (*dtor)(void *, kmem_cache_t *, unsigned long);
     char             name[CACHE_NAMELEN];
     struct list_head next;
     ...
 };

Where:

slabs_full, slabs_partial, and slabs_free are lists of slabs associated with this cache.

objsize is the size of the objects contained within the cache.

num is the number of cache objects per slab. This is calculated when the cache is created as a function of the size of each object and the best fit for a given number of memory pages per slab.

gfporder is the number of pages per slab as a power of two. For example, for 1 page/slab, gfporder is 0, for 2 pages/slab, gfporder is 1, for 4 pages, 2, for 8 pages, 3, etc... Pages are always allocated in multiples of powers of two.

colour, colour_off and colour_next are used to maintain the cache coloring offset for each slab. This will be discussed later.

*slabp_cache is a pointer to the kernel cache that is used for the slab_t, if the slab_t is maintained off-slab.

growing is a flag indicating that the cache is currently being grown, so that its memory pages are not de-allocated in the meantime.

ctor and dtor are driver callback routines. These are constructor and destructor routines that the driver can supply that will be called when an object is allocated or released. This allows a driver to have a custom initialization or validation routine on object allocation, or to perform any cleanup before an object is released. These are also very useful for driver debugging.

name is a string describing the cache. This will be printed as part of the /proc/slabinfo file. See Figure 1.

next is a list_head linking this cache into the kernel cache chain.

Figure 4. kmem_cache_t and its associated slabs

TODO: Add a couple of paragraphs describing use here

Cache Coloring of Slabs

Cache coloring is a method to ensure that accesses to the slabs in kernel memory make the best use of the processor's L1 cache. This is a performance tweak that tries to ensure we take as few cache misses as possible. Since slabs begin on page boundaries, objects in different slabs tend to sit at the same offsets within their pages and therefore map to the same cache lines, where they compete with and evict one another. This leads to less than optimal hardware cache performance. By offsetting the first object within each slab by a different multiple of the hardware cache line size, these conflicts are reduced.

Figure 5. Cache Coloring within slabs
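One way the colour, colour_off and colour_next fields could cooperate is sketched below: each new slab takes the next colour in turn and wraps around when the range is used up. This is a user-space approximation; struct toy_cache, the toy_ function names, and the idea of deriving the range from the slab's leftover bytes are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy subset of the colouring fields of a cache descriptor. */
struct toy_cache {
    size_t       colour;       /* number of distinct colours available */
    unsigned int colour_off;   /* bytes between colours, e.g. a cache line */
    unsigned int colour_next;  /* colour to give to the next new slab */
};

/* Derive the colouring range from the slab's leftover bytes. */
static void toy_cache_init_colour(struct toy_cache *c, size_t left_over,
                                  unsigned int line_size)
{
    assert(line_size > 0);
    c->colour = left_over / line_size;
    c->colour_off = line_size;
    c->colour_next = 0;
}

/* Byte offset at which the next new slab places its first object. */
static size_t toy_next_colour_offset(struct toy_cache *c)
{
    size_t offset = (size_t)c->colour_next * c->colour_off;

    c->colour_next++;
    if (c->colour_next >= c->colour)
        c->colour_next = 0;      /* wrap around and reuse the colours */
    return offset;
}
```

With 100 leftover bytes and a 32-byte line there are three colours, so successive slabs would start their objects at offsets 0, 32, 64, and then 0 again. A slab with less than one line of leftover space gets no colouring at all.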

The kmalloc() Interface

Of course, there are times when a kernel module or driver needs to allocate memory for an object that doesn't fit one of the uniform types of the other caches, for example string buffers, one-off structures, temporary storage, etc. For those instances drivers and kernel modules use the kmalloc() and kfree() routines. The Linux kernel ties these calls into the slab allocator too.

On initialization, the kernel asks the slab allocator to create several caches of varying sizes for this purpose. Caches for generic objects of 32, 64, 128, 256, all the way to 131072 bytes are created, one set for normal memory and one set for DMA-capable (GFP_DMA) memory.

Figure 6: Use of the cache_sizes array

When a kernel module or driver needs memory, the cache_sizes array is searched to find the smallest cache whose object size fits the requested object. For example, if a driver requests 166 bytes of normal (non-DMA) memory through kmalloc(), an object from the 256 byte cache would be returned.
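The search can be pictured as a walk over a sorted table of sizes that stops at the first entry large enough for the request. The sketch below is only an illustration: the toy_cache_sizes table and both function names are made up here, and the real cache_sizes entries also carry pointers to the caches themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed general cache sizes: powers of two from 32 to 131072 bytes. */
static const size_t toy_cache_sizes[] = {
    32, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536, 131072
};

#define TOY_NUM_SIZES (sizeof(toy_cache_sizes) / sizeof(toy_cache_sizes[0]))

/* Object size of the cache that would serve the request, or 0 if too large. */
static size_t toy_kmalloc_cache_size(size_t request)
{
    for (size_t i = 0; i < TOY_NUM_SIZES; i++) {
        if (request <= toy_cache_sizes[i])
            return toy_cache_sizes[i];
    }
    return 0;
}

/* Bytes wasted when the request is rounded up to its cache's object size. */
static size_t toy_kmalloc_slack(size_t request)
{
    size_t size = toy_kmalloc_cache_size(request);

    assert(size != 0);   /* the request must fit the largest cache */
    return size - request;
}
```

A 166-byte request lands in the 256-byte cache and leaves 90 bytes unused, which is the price paid for not having a dedicated cache for that object type.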

When kfree() is called to release the object, the page the object resides in is calculated. Then the page struct for that page is looked up in mem_map (it was set up to point to our kmem_cache_t and slab_t when the slab was allocated). Since we now have the slab and cache for the object, we can release it with __kmem_cache_free().

TODO: Add a summary paragraph here

Posted by horace papa on PIXNET

Category: Linux kernel


Thu, Dec 1, 2011 20:35
A Brief Analysis of Linux Network Performance Optimization Methods (3)

2010-12-20 10:56 赵军 IBMDW


Fri, May 20, 2011 17:06
The dma_cache_maint problem when upgrading the Linux kernel from 2.6.31 to 2.6.35

While porting a WiFi driver, the build failed with the error message "error: implicit declaration of function 'dma_cache_maint'".
I found that the function "dma_cache_maint" can be replaced by "dma_map_single".


Wed, Mar 23, 2011 15:44
the conntrack entries

The conntrack entries
Let's take a brief look at a conntrack entry and how to read them in /proc/net/ip_conntrack. This gives a list of all the current entries in your conntrack database. If you have the ip_conntrack module loaded, a cat of /proc/net/ip_conntrack might look like:


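Tue, Mar 22, 2011 22:40
Repost: netfilter hook example

2009/1/7

Using Netfilter Hooks in the Linux Kernel

Author: JuKevin

Hooks are a key mechanism in Linux Netfilter; with them you can readily develop many kinds of in-kernel network processing code. The following is a brief introduction to hooks and how to use them.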

1. Hook-related data structures

struct nf_hook_ops
{
    struct list_head list;

    /* User fills in from here down. */
    nf_hookfn *hook;
    struct module *owner;
    int pf;
    int hooknum;
    /* Hooks are ordered in ascending priority. */
    int priority;
};

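Main members

int pf; is the protocol family.

int hooknum; is the hook point. It determines at which stage of packet processing the hook function runs.

Linux provides the following hook points:

NF_IP_PRE_ROUTING: runs before the packet is routed;

NF_IP_FORWARD: runs before the packet is forwarded to another NIC;

NF_IP_POST_ROUTING: runs just before the packet leaves the host;

NF_IP_LOCAL_IN: runs after routing, for packets destined for the local host;

NF_IP_LOCAL_OUT: runs before routing, for packets sent by the local host.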

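nf_hookfn *hook; is the hook callback function. Its type is defined as:

typedef unsigned int nf_hookfn(
    unsigned int hooknum,            /* hook point */
    struct sk_buff **skb,            /* socket buffer data */
    const struct net_device *in,     /* input device */
    const struct net_device *out,    /* output device */
    int (*okfn)(struct sk_buff *)    /* function invoked if the packet is accepted */
);

After it runs, an nf_hookfn must return one of the following values:

NF_ACCEPT: continue normal packet processing;

NF_DROP: drop the packet;

NF_STOLEN: the hook function has taken over the packet; do not pass it on any further;

NF_QUEUE: queue the packet, usually for handling by a user-space program;

NF_REPEAT: call this hook function again.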

The last member, priority, is the hook priority. The kernel defines the following priorities:

enum nf_ip_hook_priorities
{
    NF_IP_PRI_FIRST = INT_MIN,
    NF_IP_PRI_CONNTRACK = -200,
    NF_IP_PRI_MANGLE = -150,
    NF_IP_PRI_NAT_DST = -100,
    NF_IP_PRI_FILTER = 0,
    NF_IP_PRI_NAT_SRC = 100,
    NF_IP_PRI_LAST = INT_MAX,
};
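To make the effect of the priority value concrete, here is a small user-space model of a hook chain. It is a sketch under stated assumptions and does not use the kernel API: the toy_ types, the two sample hooks, and the string standing in for a packet are all invented for illustration. Hooks run in ascending priority order and the walk stops as soon as one of them drops the packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum toy_verdict { TOY_DROP, TOY_ACCEPT };

typedef enum toy_verdict (*toy_hookfn)(const char *pkt);

struct toy_hook {
    toy_hookfn hook;
    int        priority;   /* lower values run first */
};

/* qsort comparator: ascending priority. */
static int toy_cmp_priority(const void *a, const void *b)
{
    const struct toy_hook *x = a, *y = b;

    return (x->priority > y->priority) - (x->priority < y->priority);
}

/* Sort the hooks by priority, then run them until one drops the packet. */
static enum toy_verdict toy_run_hooks(struct toy_hook *hooks, size_t n,
                                      const char *pkt)
{
    qsort(hooks, n, sizeof(hooks[0]), toy_cmp_priority);
    for (size_t i = 0; i < n; i++) {
        assert(hooks[i].hook != NULL);
        if (hooks[i].hook(pkt) == TOY_DROP)
            return TOY_DROP;
    }
    return TOY_ACCEPT;
}

/* Sample hook: drops anything labelled "icmp". */
static enum toy_verdict toy_drop_icmp(const char *pkt)
{
    return strcmp(pkt, "icmp") == 0 ? TOY_DROP : TOY_ACCEPT;
}

/* Sample hook: counts the packets it sees and accepts them all. */
static int toy_filter_calls;

static enum toy_verdict toy_count_filter(const char *pkt)
{
    (void)pkt;
    toy_filter_calls++;
    return TOY_ACCEPT;
}
```

If toy_drop_icmp is given priority -1 and toy_count_filter priority 0, an "icmp" packet never reaches the counting hook, because the hook with the lower priority value runs first and drops it.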

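2. Registering and unregistering hooks

The registration and unregistration functions are very simple to use. Their prototypes are as follows.

Registering and unregistering a single hook:

int nf_register_hook(struct nf_hook_ops *reg);
void nf_unregister_hook(struct nf_hook_ops *reg);

Registering and unregistering several hooks:

int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);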

3. A simple kernel module that uses a hook to watch the host's ICMP packets

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/types.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/inet.h>
#include <linux/in.h>
#include <linux/ip.h>

/* Hook callback: log every ICMP packet, then let it continue. */
static unsigned int icmp_srv(unsigned int hook,
                             struct sk_buff **pskb,
                             const struct net_device *in,
                             const struct net_device *out,
                             int (*okfn)(struct sk_buff *))
{
    //printk(KERN_INFO"hook_icmp::icmp_srv()\n");
    struct iphdr *iph = (*pskb)->nh.iph;

    if (iph->protocol == IPPROTO_ICMP)
    {
        printk(KERN_INFO "hook_icmp::icmp_srv: receive ICMP packet\n");
        printk(KERN_INFO "src: ");
    }
    return NF_ACCEPT;
}

/* Run just ahead of the filter priority at the pre-routing hook point. */
static struct nf_hook_ops icmpsrv_ops =
{
    .hook = icmp_srv,
    .pf = PF_INET,
    .hooknum = NF_IP_PRE_ROUTING,
    .priority = NF_IP_PRI_FILTER - 1,
};

static int __init init_hook_icmp(void)
{
    return nf_register_hook(&icmpsrv_ops);
}

static void __exit fini_hook_icmp(void)
{
    nf_unregister_hook(&icmpsrv_ops);
}

MODULE_LICENSE("GPL");
module_init(init_hook_icmp);
module_exit(fini_hook_icmp);

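After building the module, load it; you can then test it with the ping command from a DOS prompt.

On the Linux host, use dmesg to see the ICMP packets that were received:

hook_icmp::icmp_srv: receive ICMP packet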


Tue, Mar 22, 2011 13:19
linux kernel list sample written by Adrian Huang

/*
* =============================================================================
*
*       Filename: list_head_ex.h
*
*    Description: Write a doubly linked list by using list_head structure
*
*        Version: 1.0
*        Created: Fri Oct 19 14:17:58 GMT 2007
*       Revision: none
*       Compiler: gcc
*
*         Author: Adrian Huang
*       Web Site: http://adrianhuang.blogspot.com/
*
* =============================================================================
*/


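Fri, Nov 5, 2010 00:20
no symbol version for xxx

Solutions:
1. Rebuild the kernel with the CONFIG_MODVERSIONS option disabled.
2. Rebuild the kernel with the MODULE_FORCE_LOAD option enabled, to force the module to load.
3. Copy Module.symvers into the kernel source directory, run make prepare in that directory, and then add KERN_DIR=/usr/src/linux when building the module.
4. Modify the module code to obtain the symbol's address from /proc/kallsyms and call it through a function pointer.

References: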
http://www.4front-tech.com/forum/viewtopic.php?p=10907&sid=43dc39cc92cbe35d27a4f89ec1208eb0
http://groups.google.com/group/linux.debian.bugs.dist/browse_thread/thread/1c724e019da71903
http://ubuntuforums.org/showthread.php?p=6119045


Wed, Oct 20, 2010 15:20
the email to describe how to analyse linux oops

From: Denis Vlasenko [email blocked]
To: linux-kernel
Subject: [RFC] HOWTO find oops location
Date: Sat, 14 Aug 2004 11:53:06 +0300
Hi folks,
Is this draft HOWTO useful? Comments?
--- cut here --- --- cut here --- --- cut here --- --- cut here ---
Okay, so you've got an oops and want to find out what happened?
In this HOWTO, I presume you did not delete and did not
tamper with your kernel build tree. Also, I recommend you
to enable these options in the .config:
CONFIG_DEBUG_SLAB=y
CONFIG_FRAME_POINTER=y
First one makes use-after-free bug hunt easy, second gives
you much more reliable stacktraces.
Ok, let's take a look at example OOPS. ^^^^ marks are mine.
Unable to handle kernel NULL pointer dereference at virtual address 00000e14
printing eip:
c0162887
*pde = 00000000
Oops: 0000 [#1]
PREEMPT
Modules linked in: eeprom snd_seq_oss snd_seq_midi_event..........
CPU: 0
EIP: 0060:[<c0162887>] Not tainted
EFLAGS: 00010206 (2.6.7-nf2)
EIP is at prune_dcache+0x147/0x1c0
^^^^^^^^^^^^^^^^^^^^^^^^
eax: 00000e00 ebx: d1bde050 ecx: f1b3c050 edx: f1b3ac50
esi: f1b3ac40 edi: c1973000 ebp: 00000036 esp: c1973ef8
ds: 007b es: 007b ss: 0068
Process kswapd0 (pid: 65, threadinfo=c1973000 task=c1986050)
Stack: d7721178 c1973ef8 0000007a 00000000 c1973000 f7ffea48 c0162d1f 0000007a
c0139a2b 0000007a 000000d0 00025528 049dbb00 00000000 000001fa 00000000
c0364564 00000001 0000000a c0364440 c013add1 00000080 000000d0 00000000
Call Trace:
[<c0162d1f>] shrink_dcache_memory+0x1f/0x30
[<c0139a2b>] shrink_slab+0x14b/0x190
[<c013add1>] balance_pgdat+0x1b1/0x200
[<c013aee7>] kswapd+0xc7/0xe0
[<c0114270>] autoremove_wake_function+0x0/0x60
[<c0103e9e>] ret_from_fork+0x6/0x14
[<c0114270>] autoremove_wake_function+0x0/0x60
[<c013ae20>] kswapd+0x0/0xe0
[<c01021d1>] kernel_thread_helper+0x5/0x14
Code: 8b 50 14 85 d2 75 27 89 34 24 e8 4a 2b 00 00 8b 73 0c 89 1c
Let's try to find out where did that exactly happened.
Grep in your kernel tree for prune_dcache. Aha, it is defined in
fs/dcache.c! Ok, execute these two commands:
# objdump -d fs/dcache.o > fs/dcache.disasm
# make fs/cache.s
Now in fs/ you should have:
dcache.c - source code
dcache.o - compiled object file
dcache.s - assembler output of C compiler ('half-compiled' code)
dcache.disasm - disassembled object file
Open dcache.disasm and find "prune_dcache":
00000540 <prune_dcache>:
540: 55 push %ebp
We need to find prune_dcache+0x147. Using shell,
# printf "0x%x\n" $((0x540+0x147))
0x687
and in dcache.disasm:
683: 85 c0 test %eax,%eax
685: 74 07 je 68e <prune_dcache+0x14e>
687: 8b 50 14 mov 0x14(%eax),%edx <======== OOPS
68a: 85 d2 test %edx,%edx
68c: 75 27 jne 6b5 <prune_dcache+0x175>
68e: 89 34 24 mov %esi,(%esp)
691: e8 fc ff ff ff call 692 <prune_dcache+0x152>
696: 8b 73 0c mov 0xc(%ebx),%esi
699: 89 1c 24 mov %ebx,(%esp)
69c: e8 9f f9 ff ff call 40 <d_free>
Comparing with "Code: 8b 50 14 85 d2 75 27 " - match!
We need to find matching line in dcache.s and, eventually, in dcache.c.
It's easy to find prune_dcache in dcache.s:
prune_dcache:
pushl %ebp
but even though it is not too hard to find matching instruction:
movl 8(%edi), %eax
decl 20(%edi)
testb $8, %al
jne .L593
.L517:
movl 68(%ebx), %eax
testl %eax, %eax
je .L532
movl 20(%eax), %edx <========= OOPS
testl %edx, %edx
jne .L594
.L532:
movl %esi, (%esp)
call iput
.L565:
movl 12(%ebx), %esi
movl %ebx, (%esp)
call d_free
it is unclear to which part of .c code it belongs:
static void prune_dcache(int count)
{
    spin_lock(&dcache_lock);
    for (; count ; count--) {
        struct dentry *dentry;
        struct list_head *tmp;
        tmp = dentry_unused.prev;
        if (tmp == &dentry_unused)
            break;
        list_del_init(tmp);
        prefetch(dentry_unused.prev);
        dentry_stat.nr_unused--;
        dentry = list_entry(tmp, struct dentry, d_lru);
        spin_lock(&dentry->d_lock);
        /*
         * We found an inuse dentry which was not removed from
         * dentry_unused because of laziness during lookup. Do not free
         * it - just keep it off the dentry_unused list.
         */
        if (atomic_read(&dentry->d_count)) {
            spin_unlock(&dentry->d_lock);
            continue;
        }
        /* If the dentry was recently referenced, don't free it. */
        if (dentry->d_flags & DCACHE_REFERENCED) {
            dentry->d_flags &= ~DCACHE_REFERENCED;
            list_add(&dentry->d_lru, &dentry_unused);
            dentry_stat.nr_unused++;
            spin_unlock(&dentry->d_lock);
            continue;
        }
        prune_one_dentry(dentry);
    }
    spin_unlock(&dcache_lock);
}
What now?! Well, I have a silly method which helps to find
C code line corresponding to that asm one. Edit your
prune_dcache in dcache.c like this:
static void prune_dcache(int count)
{
    spin_lock(&dcache_lock);
    for (; count ; count--) {
        struct dentry *dentry;
        struct list_head *tmp;
        asm("#1");
        tmp = dentry_unused.prev;
        asm("#2");
        if (tmp == &dentry_unused)
            break;
        asm("#3");
        list_del_init(tmp);
        asm("#4");
        prefetch(dentry_unused.prev);
        asm("#5");
        dentry_stat.nr_unused--;
        asm("#6");
        ...
        ...
        asm("#e");
        prune_one_dentry(dentry);
    }
    asm("#f");
    spin_unlock(&dcache_lock);
}
and do "make fs/dcache.s" again. Look into new dcache.s.
Nasty surprise:
#APP
#e
#NO_APP
testb $16, %al
jne .L495
orl $16, %eax
leal 72(%ecx), %esi
movl %eax, 4(%ebx)
movl 4(%esi), %edx
movl 72(%ecx), %eax
testl %eax, %eax
movl %eax, (%edx)
je .L493
movl %edx, 4(%eax)
.L493:
movl $2097664, 4(%esi)
.L495:
leal 40(%ebx), %ecx
movl 40(%ebx), %eax
movl 4(%ecx), %edx
movl %edx, 4(%eax)
movl %eax, (%edx)
movl $2097664, 4(%ecx)
movl $1048832, 40(%ebx)
decl dentry_stat
movl 8(%ebx), %esi
testl %esi, %esi
je .L536
leal 56(%ebx), %eax
movl $0, 8(%ebx)
movl 56(%ebx), %edx
movl 4(%eax), %ecx
movl %ecx, 4(%edx)
movl %edx, (%ecx)
movl %eax, 4(%eax)
movl %eax, 56(%ebx)
movl 8(%edi), %eax
decl 20(%edi)
testb $8, %al
jne .L592
.L518:
movl 8(%edi), %eax
decl 20(%edi)
testb $8, %al
jne .L593
.L517:
movl 68(%ebx), %eax
testl %eax, %eax
je .L532
movl 20(%eax), %edx <======== OOPS
testl %edx, %edx
jne .L594
.L532:
movl %esi, (%esp)
call iput
How come one line of C code expanded in so much asm?!
Hmm... asm("#e") was directly before prune_one_dentry(dentry),
what's that?
static inline void prune_one_dentry(struct dentry * dentry)
{
    struct dentry * parent;
    __d_drop(dentry);
    list_del(&dentry->d_child);
    dentry_stat.nr_dentry--; /* For d_free, below */
    dentry_iput(dentry);
    parent = dentry->d_parent;
    d_free(dentry);
    if (parent != dentry)
        dput(parent);
    spin_lock(&dcache_lock);
}
Argh! An inline function. Do asm trick to it too:
static inline void prune_one_dentry(struct dentry * dentry)
{
    struct dentry * parent;
    asm("#A");
    __d_drop(dentry);
    asm("#B");
    list_del(&dentry->d_child);
    asm("#C");
    dentry_stat.nr_dentry--; /* For d_free, below */
    asm("#D");
    dentry_iput(dentry);
    asm("#E");
    ...
    ...
}
"make fs/dcache.s", rinse, repeat. You will discover that OOPS
happened after #D mark, inside dentry_iput which is an inline too.
Will this ever end? Luckily, yes. After yet another round of asm
insertion, we arrive at:
static inline void dentry_iput(struct dentry * dentry)
{
    struct inode *inode = dentry->d_inode;
    if (inode) {
        asm("#K");
        dentry->d_inode = NULL;
        asm("#L");
        list_del_init(&dentry->d_alias);
        asm("#M");
        spin_unlock(&dentry->d_lock);
        asm("#N");
        spin_unlock(&dcache_lock);
        asm("#O");
        if (dentry->d_op && dentry->d_op->d_iput)
        {
            asm("#P");
            dentry->d_op->d_iput(dentry, inode);
        }
        else
        ...
Which corresponds to this part of new dcache.s:
.L517:
#APP
#O
#NO_APP
movl 68(%ebx), %eax
testl %eax, %eax
je .L532
movl 20(%eax), %edx <=== OOPS
testl %edx, %edx
jne .L594
.L532:
#APP
#Q
#NO_APP
This is "if (dentry->d_op && dentry->d_op->d_iput)" condition
check, and it is oopsing trying to do second check. dentry->d_op
contains bogus pointer value 0x00000e00.
--
vda
From: Muli Ben-Yehuda [email blocked]
Subject: Re: [RFC] HOWTO find oops location
Date: Sat, 14 Aug 2004 12:11:06 +0300
On Sat, Aug 14, 2004 at 11:53:06AM +0300, Denis Vlasenko wrote:
> Hi folks,
>
> Is this draft HOWTO useful? Comments?
Looks very nice. One small niggle:
> EIP is at prune_dcache+0x147/0x1c0
> ^^^^^^^^^^^^^^^^^^^^^^^^
> Let's try to find out where did that exactly happened.
> Grep in your kernel tree for prune_dcache. Aha, it is defined in
> fs/dcache.c! Ok, execute these two commands:
>
> # objdump -d fs/dcache.o > fs/dcache.disasm
> # make fs/cache.s
you mean 'make fs/dcache.s' here, I believe.
Cheers,
Muli
--
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/
From: Zwane Mwaikambo [email blocked]
Subject: Re: [RFC] HOWTO find oops location
Date: Sat, 14 Aug 2004 09:41:10 -0400 (EDT)
There are a few very simple methods i use all the time;
compile with CONFIG_DEBUG_INFO (it's safe to select the option and
recompile after the oops even) and then;
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
c046a188
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c046a188>] Not tainted VLI
EFLAGS: 00010246 (2.6.6-mm3)
EIP is at serial_open+0x38/0x170
[...]
(gdb) list *serial_open+0x38
0xc046a188 is in serial_open (drivers/usb/serial/usb-serial.c:465).
460
461 /* get the serial object associated with this tty pointer */
462 serial = usb_serial_get_by_index(tty->index);
463
464 /* set up our port structure making the tty driver remember our port object, and us it */
465 portNumber = tty->index - serial->minor;
466 port = serial->port[portNumber];
467 tty->driver_data = port;
468
469 port->tty = tty;
And then for cases where you deadlock and the NMI watchdog triggers with
%eip in a lock section;
NMI Watchdog detected LOCKUP on CPU0,
eip c0119e5e, registers:
Modules linked in:
CPU: 0
EIP: 0060:[<c0119e5e>] Tainted:
EFLAGS: 00000086 (2.6.7)
EIP is at .text.lock.sched+0x89/0x12b
[...]
(gdb) disassemble 0xc0119e5e
Dump of assembler code for function Letext:
[...]
0xc0119e59 <Letext+132>: repz nop
0xc0119e5b <Letext+134>: cmpb $0x0,(%edi)
0xc0119e5e <Letext+137>: jle 0xc0119e59 <Letext+132>
0xc0119e60 <Letext+139>: jmp 0xc0118183 <scheduler_tick+487>
(gdb) list *scheduler_tick+487
0xc0118183 is in scheduler_tick (include/asm/spinlock.h:124).
119 if (unlikely(lock->magic != SPINLOCK_MAGIC)) {
120 printk("eip: %p\n", &&here);
121 BUG();
122 }
123 #endif
124 __asm__ __volatile__(
125 spin_lock_string
126 :"=m" (lock->lock) : : "memory");
127 }
But that's not much help since it's pointing to an inline function and not
the real lock location, so just subtract a few bytes;
(gdb) list *scheduler_tick+450
0xc011815e is in scheduler_tick (kernel/sched.c:2021).
2016 cpustat->system += sys_ticks;
2017
2018 /* Task might have expired already, but not scheduled off yet */
2019 if (p->array != rq->active) {
2020 set_tsk_need_resched(p);
2021 goto out;
2022 }
2023 spin_lock(&rq->lock);
So we have our lock location. Then there are cases where there is a "Bad
EIP" most common ones are when a bad function pointer is followed or if
some of the kernel text or a module got unloaded/unmapped (e.g. via
__init). You can normally determine which is which by noting that bad eip
for unloaded text normally looks like a valid virtual address.
Unable to handle kernel NULL pointer dereference at virtual address 00000000
00000000
*pde = 00000000
Oops: 0000 [#1]
CPU: 0
EIP: 0060:[<00000000>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210246
[...]
Call Trace:
[<c01dbbfb>] smb_readdir+0x4fb/0x6e0
[<c0165560>] filldir64+0x0/0x130
[<c016524a>] vfs_readdir+0x8a/0x90
[<c0165560>] filldir64+0x0/0x130
[<c01656fd>] sys_getdents64+0x6d/0xa6
[<c0165560>] filldir64+0x0/0x130
[<c010adff>] syscall_call+0x7/0xb
Code: Bad EIP value.
From there you're best off examining the call trace to find the culprit.
From: Marcelo Tosatti [email blocked]
Subject: Re: [RFC] HOWTO find oops location
Date: Sat, 14 Aug 2004 11:06:42 -0300
> What now?! Well, I have a silly method which helps to find
> C code line corresponding to that asm one. Edit your
> prune_dcache in dcache.c like this:
>
> static void prune_dcache(int count)
> {
> spin_lock(&dcache_lock);
> for (; count ; count--) {
> struct dentry *dentry;
> struct list_head *tmp;
> asm("#1");
> tmp = dentry_unused.prev;
> asm("#2");
> if (tmp == &dentry_unused)
> break;
> asm("#3");
> list_del_init(tmp);
> asm("#4");
> prefetch(dentry_unused.prev);
> asm("#5");
> dentry_stat.nr_unused--;
> asm("#6");
> ...
> ...
> asm("#e");
> prune_one_dentry(dentry);
> }
> asm("#f");
> spin_unlock(&dcache_lock);
> }
Might be also worth mentioning "gcc -c file.c -g -Wa,-a,-ad > file.s"
which makes gcc output C code mixed with asm output.
Sometimes it's not as effective as the comment method you describe,
but it will be less work for sure :)
The document looks great, but could go deeper into things
like hardware-flaky bitflips, stack junk (explain why
the stack can be "unreliable"), etc. to be even more
useful.
Hosting it somewhere would be nice also.


Fri, Aug 20, 2010 11:25
sk_buff structure

http://blog.chinaunix.net/u3/115276/showart_2284947.html
1. The layout of the sk_buff structure is shown in the following diagram

2. Basic operations on the sk_buff structure


Wed, Aug 4, 2010 15:01
how the skb be indicated to upper layer

http://linux.chinaunix.net/bbs/viewthread.php?tid=886985&extra=



Horace papa's life

To memory my life, learning , family