目录
1 描述
2 结构体
2.1 block_device_operations
2.2 gendisk
2.3 block_device
2.4 request_queue
2.5 request
2.6 bio
3.7 blk_mq_tag_set
3.8 blk_mq_ops
3 相关函数
3.1 注册注销块设备
3.1.1 register_blkdev
3.1.2 unregister_blkdev
3.2 gendisk 结构体操作
3.2.1 alloc_disk
3.2.2 set_capacity
3.2.3 add_disk
3.2.4 del_gendisk
3.3 块设备 I/O 请求
3.3.1 blk_mq_alloc_tag_set
3.3.2 blk_mq_free_tag_set
3.3.3 blk_mq_init_queue
3.3.4 blk_cleanup_queue
3.4 请求 request
3.4.1 blk_mq_start_request
3.4.1 blk_mq_end_request
3.5 数据操作
3.5.1 bio_data
3.5.2 rq_data_dir
4 实验
4.1 代码
4.2 操作
1 描述
块设备是针对存储设备的,比如 SD 卡、EMMC、NAND Flash、Nor Flash、SPI Flash、机械硬盘、固态硬盘等。因此块设备驱动其实就是这些存储设备驱动,块设备驱动相比字符设备驱动的主要区别如下:
①块设备只能以块为单位进行读写访问,块是 linux 虚拟文件系统(VFS)基本的数据传输单位。字符设备是以字节为单位进行数据传输的,不需要缓冲。
②块设备在结构上是可以进行随机访问的,对于这些设备的读写都是按块进行的,块设备使用缓冲区来暂时存放数据,等到条件成熟以后再一次性将缓冲区中的数据写入块设备中。这么做的目的为了提高块设备寿命,大家如果仔细观察的话就会发现有些硬盘或者 NAND Flash 就会标明擦除次数(flash 的特性,写之前要先擦除),比如擦除 100000 次等。因此,为了提高块设备寿命而引入了缓冲区,数据先写入到缓冲区中,等满足一定条件后再一次性写入到真正的物理存储设备中,这样就减少了对块设备的擦除次数,提高了块设备寿命。
字符设备是顺序的数据流设备,字符设备是按照字节进行读写访问的。字符设备不需要缓冲区,对于字符设备的访问都是实时的,而且也不需要按照固定的块大小进行访问。
块设备结构的不同其 I/O 算法也会不同,比如对于 EMMC、SD 卡、NAND Flash 这类没有任何机械设备的存储设备就可以任意读写任何的扇区(块设备物理存储单元)。但是对于机械硬盘这样带有磁头的设备,读取不同的盘面或者磁道里面的数据,磁头都需要进行移动,因此对于机械硬盘而言,将那些杂乱的访问按照一定的顺序进行排列可以有效提高磁盘性能,linux 里面针对不同的存储设备实现了不同的 I/O 调度算法。
Linux MMC/SD存储卡是一种典型的块设备,它的实现位于drivers/mmc。drivers/mmc下又分为card、core和host这3个子目录。card实际上跟Linux的块设备子系统对接,实现块设备驱动以及完成请求,但是具体的协议经过core层的接口,最终通过host完成传输,因此整个MMC子系统的框架结构如图所示。另外,card目录除了实现标准的MMC/SD存储卡以外,该目录还包含一些SDIO外设的卡驱动,如drivers/mmc/card/sdio_uart.c。core目录除了给card提供接口外,实际上也定义好了host驱动的框架。
2 结构体
2.1 block_device_operations
struct block_device_operations 定义了一组与块设备操作相关的函数指针。这些函数实现了设备的打开、关闭、读写、I/O控制等功能,使得不同的块设备能够通过统一的接口与内核进行交互。这个结构体是块设备驱动程序的核心部分,为块设备的管理提供了灵活性和扩展性。
1983 struct block_device_operations {
1984 int (*open) (struct block_device *, fmode_t);
1985 void (*release) (struct gendisk *, fmode_t);
1986 int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
1987 int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
1988 int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
1989 unsigned int (*check_events) (struct gendisk *disk,
1990 unsigned int clearing);
1991 /* ->media_changed() is DEPRECATED, use ->check_events() instead */
1992 int (*media_changed) (struct gendisk *);
1993 void (*unlock_native_capacity) (struct gendisk *);
1994 int (*revalidate_disk) (struct gendisk *);
1995 int (*getgeo)(struct block_device *, struct hd_geometry *);
1996 /* this callback is with swap_lock and sometimes page table lock held */
1997 void (*swap_slot_free_notify) (struct block_device *, unsigned long);
1998 struct module *owner;
1999 const struct pr_ops *pr_ops;
2000 };
int (*open) (struct block_device *, fmode_t);
用于打开块设备的回调函数。它接收一个指向块设备的指针和打开模式,返回一个整数,通常表示成功与否。
void (*release) (struct gendisk *, fmode_t);
用于释放块设备的回调函数。当设备不再使用时调用,清理相关资源。
int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
用于读写页面的回调函数,支持页面级别的I/O操作。
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
用于处理设备特定的控制请求。允许用户空间程序与设备进行交互。
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
处理兼容性I/O控制请求,通常用于旧版本的系统调用。
unsigned int (*check_events) (struct gendisk *disk, unsigned int clearing);
用于检查设备事件,比如介质是否发生变化,取代了 media_changed。
int (*media_changed) (struct gendisk *);
过时的函数,用于检测介质是否改变,现已被 check_events 取代。
void (*unlock_native_capacity) (struct gendisk *);
解锁设备的原生容量,允许设备在容量发生变化时更新。
int (*revalidate_disk) (struct gendisk *);
重新验证块设备的状态,通常在设备状态可能发生变化时调用。
int (*getgeo)(struct block_device *, struct hd_geometry *);
用于获取设备几何信息,例如扇区数、每个扇区的大小等。
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
当交换槽被释放时的通知函数,通常用于处理内存管理。
struct module *owner;
指向拥有该设备操作的模块的指针,用于管理模块引用计数。
const struct pr_ops *pr_ops;
指向打印操作的结构体,通常用于设备的日志和调试输出。
2.2 gendisk
在Linux内核中,使用gendisk(通用磁盘)结构体来表示一个独立的磁盘设备(或分区)
183 struct gendisk {
184 /* major, first_minor and minors are input parameters only,
185 * don't use directly. Use disk_devt() and disk_max_parts().
186 */
187 int major; /* major number of driver */
188 int first_minor;
189 int minors; /* maximum number of minors, =1 for
190 * disks that can't be partitioned. */
191
192 char disk_name[DISK_NAME_LEN]; /* name of major driver */
193 char *(*devnode)(struct gendisk *gd, umode_t *mode);
194
195 unsigned int events; /* supported events */
196 unsigned int async_events; /* async events, subset of all */
197
198 /* Array of pointers to partitions indexed by partno.
199 * Protected with matching bdev lock but stat and other
200 * non-critical accesses use RCU. Always access through
201 * helpers.
202 */
203 struct disk_part_tbl __rcu *part_tbl;
204 struct hd_struct part0;
205
206 const struct block_device_operations *fops;
207 struct request_queue *queue;
208 void *private_data;
209
210 int flags;
211 struct rw_semaphore lookup_sem;
212 struct kobject *slave_dir;
213
214 struct timer_rand_state *random;
215 atomic_t sync_io; /* RAID */
216 struct disk_events *ev;
217 #ifdef CONFIG_BLK_DEV_INTEGRITY
218 struct kobject integrity_kobj;
219 #endif /* CONFIG_BLK_DEV_INTEGRITY */
220 int node_id;
221 struct badblocks *bb;
222 struct lockdep_map lockdep_map;
223
224 ANDROID_KABI_RESERVE(1);
225 ANDROID_KABI_RESERVE(2);
226 ANDROID_KABI_RESERVE(3);
227 ANDROID_KABI_RESERVE(4);
228
229 };
major: 驱动程序的主设备号。
first_minor: 起始次设备号。
minors: 最大次设备数量。如果设备不可分区,通常为 1。
disk_name: 磁盘的名称(如 "sda")。
devnode: 指向生成设备节点名称的函数指针。
events: 设备支持的事件位掩码。
async_events: 异步事件的子集。
part_tbl: 指向分区表的指针,受 bdev 锁保护。
part0: 设备的第一个分区。
fops: 指向设备操作函数的结构体(如读写操作)。
queue: 请求队列,用于处理 I/O 请求。
private_data: 设备私有数据指针。
flags: 标志位,用于设备状态。
lookup_sem: 读写信号量,用于同步。
slave_dir: 指向与设备相关的 kobject。
random: 随机数状态(与设备随机性相关)。
sync_io: 原子计数器,用于 RAID 操作。
ev: 指向设备事件结构体的指针。
integrity_kobj: (可选)用于设备完整性的 kobject。
node_id: 设备节点 ID。
bb: 指向坏块列表的指针。
lockdep_map: 用于锁依赖检查的结构体。
2.3 block_device
内核使用 block_device 来表示一个具体的块设备对象,比如一个硬盘或者分区
454 struct block_device {455 dev_t bd_dev; /* not a kdev_t - it's a search key */456 int bd_openers;457 struct inode * bd_inode; /* will die */458 struct super_block * bd_super;459 struct mutex bd_mutex; /* open/close mutex */460 void * bd_claiming;461 void * bd_holder;462 int bd_holders;463 bool bd_write_holder;464 #ifdef CONFIG_SYSFS465 struct list_head bd_holder_disks;466 #endif 467 struct block_device * bd_contains;468 unsigned bd_block_size;469 u8 bd_partno;470 struct hd_struct * bd_part;471 /* number of times partitions within this device have been opened. */472 unsigned bd_part_count;473 int bd_invalidated;474 struct gendisk * bd_disk;475 struct request_queue * bd_queue;476 struct backing_dev_info *bd_bdi;477 struct list_head bd_list;478 /* 479 * Private data. You must have bd_claim'ed the block_device480 * to use this. NOTE: bd_claim allows an owner to claim481 * the same device multiple times, the owner must take special482 * care to not mess up bd_private for that case.483 */484 unsigned long bd_private;485 486 /* The counter of freeze processes */487 int bd_fsfreeze_count;488 /* Mutex for freeze */489 struct mutex bd_fsfreeze_mutex;490 491 ANDROID_KABI_RESERVE(1);492 ANDROID_KABI_RESERVE(2);493 ANDROID_KABI_RESERVE(3);494 ANDROID_KABI_RESERVE(4);495 } __randomize_layout;
2.4 request_queue
内核将对块设备的读写都发送到请求队列 request_queue 中,request_queue 中是大量的request(请求结构体),而 request 又包含了 bio,bio 保存了读写相关数据
434 struct request_queue {435 /*436 * Together with queue_head for cacheline sharing437 */438 struct list_head queue_head;439 struct request *last_merge;440 struct elevator_queue *elevator;441 int nr_rqs[2]; /* # allocated [a]sync rqs */442 int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */443 444 struct blk_queue_stats *stats;445 struct rq_qos *rq_qos;446 447 /*448 * If blkcg is not used, @q->root_rl serves all requests. If blkcg449 * is used, root blkg allocates from @q->root_rl and all other450 * blkgs from their own blkg->rl. Which one to use should be451 * determined using bio_request_list().452 */453 struct request_list root_rl;454 455 request_fn_proc *request_fn;456 make_request_fn *make_request_fn;457 poll_q_fn *poll_fn;458 prep_rq_fn *prep_rq_fn;459 unprep_rq_fn *unprep_rq_fn;460 softirq_done_fn *softirq_done_fn;461 rq_timed_out_fn *rq_timed_out_fn;462 dma_drain_needed_fn *dma_drain_needed;463 lld_busy_fn *lld_busy_fn;464 /* Called just after a request is allocated */465 init_rq_fn *init_rq_fn;466 /* Called just before a request is freed */467 exit_rq_fn *exit_rq_fn;468 /* Called from inside blk_get_request() */469 void (*initialize_rq_fn)(struct request *rq);470 471 const struct blk_mq_ops *mq_ops;472 473 unsigned int *mq_map;474 475 /* sw queues */476 struct blk_mq_ctx __percpu *queue_ctx;477 unsigned int nr_queues;478 479 unsigned int queue_depth;480 481 /* hw dispatch queues */482 struct blk_mq_hw_ctx **queue_hw_ctx;483 unsigned int nr_hw_queues;484 485 /*486 * Dispatch queue sorting487 */488 sector_t end_sector;489 struct request *boundary_rq;490 491 /*492 * Delayed queue handling493 */494 struct delayed_work delay_work;495 496 struct backing_dev_info *backing_dev_info;497 498 /*499 * The queue owner gets to use this for whatever they like.500 * ll_rw_blk doesn't touch it.501 */502 void *queuedata;503 504 /*505 * various queue flags, see QUEUE_* below506 */507 unsigned long queue_flags;508 /*509 * Number of contexts that have called blk_set_pm_only(). If this510 * counter is above zero then only RQF_PM and RQF_PREEMPT requests are511 * processed.512 */513 atomic_t pm_only;514 515 /*516 * ida allocated id for this queue. Used to index queues from517 * ioctx.518 */519 int id;520 521 /*522 * queue needs bounce pages for pages above this limit523 */524 gfp_t bounce_gfp;525 526 /*527 * protects queue structures from reentrancy. ->__queue_lock should528 * _never_ be used directly, it is queue private. always use529 * ->queue_lock.530 */531 spinlock_t __queue_lock;532 spinlock_t *queue_lock;533 534 /*535 * queue kobject536 */537 struct kobject kobj;538 539 /*540 * mq queue kobject541 */542 struct kobject mq_kobj;
543 544 #ifdef CONFIG_BLK_DEV_INTEGRITY545 struct blk_integrity integrity;546 #endif /* CONFIG_BLK_DEV_INTEGRITY */547 548 #ifdef CONFIG_PM549 struct device *dev;550 int rpm_status;551 unsigned int nr_pending;552 #endif553 554 /*555 * queue settings556 */557 unsigned long nr_requests; /* Max # of requests */558 unsigned int nr_congestion_on;559 unsigned int nr_congestion_off;560 unsigned int nr_batching;561 562 unsigned int dma_drain_size;563 void *dma_drain_buffer;564 unsigned int dma_pad_mask;565 unsigned int dma_alignment;566 567 struct blk_queue_tag *queue_tags;568 569 unsigned int nr_sorted;570 unsigned int in_flight[2];571 572 /*573 * Number of active block driver functions for which blk_drain_queue()574 * must wait. Must be incremented around functions that unlock the575 * queue_lock internally, e.g. scsi_request_fn().576 */577 unsigned int request_fn_active;578 #ifdef CONFIG_BLK_INLINE_ENCRYPTION579 /* Inline crypto capabilities */580 struct keyslot_manager *ksm;581 #endif582 583 unsigned int rq_timeout;584 int poll_nsec;585 586 struct blk_stat_callback *poll_cb;587 struct blk_rq_stat poll_stat[BLK_MQ_POLL_STATS_BKTS];588 589 struct timer_list timeout;590 struct work_struct timeout_work;591 struct list_head timeout_list;592 593 struct list_head icq_list;594 #ifdef CONFIG_BLK_CGROUP595 DECLARE_BITMAP (blkcg_pols, BLKCG_MAX_POLS);596 struct blkcg_gq *root_blkg;
597 struct list_head blkg_list;598 #endif599 600 struct queue_limits limits;601 602 #ifdef CONFIG_BLK_DEV_ZONED603 /*604 * Zoned block device information for request dispatch control.605 * nr_zones is the total number of zones of the device. This is always606 * 0 for regular block devices. seq_zones_bitmap is a bitmap of nr_zones607 * bits which indicates if a zone is conventional (bit clear) or608 * sequential (bit set). seq_zones_wlock is a bitmap of nr_zones609 * bits which indicates if a zone is write locked, that is, if a write610 * request targeting the zone was dispatched. All three fields are611 * initialized by the low level device driver (e.g. scsi/sd.c).612 * Stacking drivers (device mappers) may or may not initialize613 * these fields.614 *615 * Reads of this information must be protected with blk_queue_enter() /616 * blk_queue_exit(). Modifying this information is only allowed while617 * no requests are being processed. See also blk_mq_freeze_queue() and618 * blk_mq_unfreeze_queue().619 */620 unsigned int nr_zones;621 unsigned long *seq_zones_bitmap;622 unsigned long *seq_zones_wlock;623 #endif /* CONFIG_BLK_DEV_ZONED */624 625 /*626 * sg stuff627 */628 unsigned int sg_timeout;629 unsigned int sg_reserved_size;630 int node;631 #ifdef CONFIG_BLK_DEV_IO_TRACE632 struct blk_trace __rcu *blk_trace;633 struct mutex blk_trace_mutex;634 #endif635 /*636 * for flush operations637 */638 struct blk_flush_queue *fq;639 640 struct list_head requeue_list;641 spinlock_t requeue_lock;642 struct delayed_work requeue_work;643 644 struct mutex sysfs_lock;645 646 int bypass_depth;647 atomic_t mq_freeze_depth;648 649 bsg_job_fn *bsg_job_fn;650 struct bsg_class_device bsg_dev;
651 652 #ifdef CONFIG_BLK_DEV_THROTTLING653 /* Throttle data */654 struct throtl_data *td;655 #endif656 struct rcu_head rcu_head;657 wait_queue_head_t mq_freeze_wq;658 struct percpu_ref q_usage_counter;659 struct list_head all_q_node;660 661 struct blk_mq_tag_set *tag_set;662 struct list_head tag_set_list;663 struct bio_set bio_split;664 665 #ifdef CONFIG_BLK_DEBUG_FS666 struct dentry *debugfs_dir;667 struct dentry *sched_debugfs_dir;668 #endif669 670 bool mq_sysfs_init_done;671 672 size_t cmd_size;673 void *rq_alloc_data;674 675 struct work_struct release_work;676 677 #define BLK_MAX_WRITE_HINTS 5678 u64 write_hints[BLK_MAX_WRITE_HINTS];679 };
2.5 request
151 struct request {152 struct request_queue *q;153 struct blk_mq_ctx *mq_ctx;154 155 int cpu;156 unsigned int cmd_flags; /* op and common flags */157 req_flags_t rq_flags;158 159 int internal_tag;160 161 /* the following two fields are internal, NEVER access directly */162 unsigned int __data_len; /* total data len */163 int tag;164 sector_t __sector; /* sector cursor */165 166 struct bio *bio;167 struct bio *biotail;168 169 struct list_head queuelist;170 171 /*172 * The hash is used inside the scheduler, and killed once the173 * request reaches the dispatch list. The ipi_list is only used174 * to queue the request for softirq completion, which is long175 * after the request has been unhashed (and even removed from176 * the dispatch list).177 */178 union {179 struct hlist_node hash; /* merge hash */180 struct list_head ipi_list;181 };182 183 /*184 * The rb_node is only used inside the io scheduler, requests185 * are pruned when moved to the dispatch queue. So let the186 * completion_data share space with the rb_node.187 */188 union {189 struct rb_node rb_node; /* sort/lookup */190 struct bio_vec special_vec;191 void *completion_data;192 int error_count; /* for legacy drivers, don't use */193 };194 195 /*196 * Three pointers are available for the IO schedulers, if they need197 * more they have to dynamically allocate it. Flush requests are198 * never put on the IO scheduler. So let the flush fields share199 * space with the elevator data.200 */201 union {202 struct {203 struct io_cq *icq;
204 void *priv[2];205 } elv;206 207 struct {208 unsigned int seq;209 struct list_head list;210 rq_end_io_fn *saved_end_io;211 } flush;212 };213 214 struct gendisk *rq_disk;215 struct hd_struct *part;216 /* Time that I/O was submitted to the kernel. */217 u64 start_time_ns;218 /* Time that I/O was submitted to the device. */219 u64 io_start_time_ns;220 221 #ifdef CONFIG_BLK_WBT222 unsigned short wbt_flags;223 #endif224 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW225 unsigned short throtl_size;226 #endif227 228 /*229 * Number of scatter-gather DMA addr+len pairs after230 * physical address coalescing is performed.231 */232 unsigned short nr_phys_segments;233 234 #if defined(CONFIG_BLK_DEV_INTEGRITY)235 unsigned short nr_integrity_segments;236 #endif237 238 unsigned short write_hint;239 unsigned short ioprio;240 241 void *special; /* opaque pointer available for LLD use */242 243 unsigned int extra_len; /* length of alignment and padding */244 245 enum mq_rq_state state;246 refcount_t ref;247 248 unsigned int timeout;249 250 /* access through blk_rq_set_deadline, blk_rq_deadline */251 unsigned long __deadline;252 253 struct list_head timeout_list;254 255 union {256 struct __call_single_data csd;257 u64 fifo_time;258 };259 260 /*261 * completion callback.262 */263 rq_end_io_fn *end_io;264 void *end_io_data;265 266 /* for bidi */267 struct request *next_rq;268 269 #ifdef CONFIG_BLK_CGROUP270 struct request_list *rl; /* rl this rq is alloced from */271 #endif272 };
2.6 bio
每个 request 里面里面会有多个 bio,bio 保存着最终要读写的数据、地址等信息。上层应用程序对于块设备的读写会被构造成一个或多个 bio 结构,bio 结构描述了要读写的起始扇区、要读写的扇区数量、是读取还是写入、页便宜、数据长度等等信息。上层会将 bio 提交给 I/O 调度器,I/O 调度器会将这些 bio 构造成 request 结构,request_queue 里面顺序存放着一系列的 request。新产生的 bio 可能被合并到 request_queue 里现有的 request 中,也可能产生新的 request,然后插入到 request_queue 中合适的位置,这一切都是由 I/O 调度器来完成的。
request_queue、request 和 bio 之间的关系如图
146 struct bio {
147 struct bio *bi_next; /* request queue link */
148 struct gendisk *bi_disk;
149 unsigned int bi_opf; /* bottom bits req flags,
150 * top bits REQ_OP. Use
151 * accessors.
152 */
153 unsigned short bi_flags; /* status, etc and bvec pool number */
154 unsigned short bi_ioprio;
155 unsigned short bi_write_hint;
156 blk_status_t bi_status;
157 u8 bi_partno;
158
159 /* Number of segments in this BIO after
160 * physical address coalescing is performed.
161 */
162 unsigned int bi_phys_segments;
163
164 /*
165 * To keep track of the max segment size, we account for the
166 * sizes of the first and last mergeable segments in this bio.
167 */
168 unsigned int bi_seg_front_size;
169 unsigned int bi_seg_back_size;
170
171 struct bvec_iter bi_iter;
172
173 atomic_t __bi_remaining;
174 bio_end_io_t *bi_end_io;
175
176 void *bi_private;
177 #ifdef CONFIG_BLK_CGROUP
178 /*
179 * Optional ioc and css associated with this bio. Put on bio
180 * release. Read comment on top of bio_associate_current().
181 */
182 struct io_context *bi_ioc;
183 struct cgroup_subsys_state *bi_css;
184 struct blkcg_gq *bi_blkg;
185 struct bio_issue bi_issue;
186 #endif
187
188 #ifdef CONFIG_BLK_INLINE_ENCRYPTION
189 struct bio_crypt_ctx *bi_crypt_context;
190 #if IS_ENABLED(CONFIG_DM_DEFAULT_KEY)
191 bool bi_skip_dm_default_key;
192 #endif
193 #endif
194
195 union {
196 #if defined(CONFIG_BLK_DEV_INTEGRITY)
197 struct bio_integrity_payload *bi_integrity; /* data integrity */
198 #endif
199 };
200
201 unsigned short bi_vcnt; /* how many bio_vec's */
202
203 /*
204 * Everything starting with bi_max_vecs will be preserved by bio_reset()
205 */
206
207 unsigned short bi_max_vecs; /* max bvl_vecs we can hold */
208
209 atomic_t __bi_cnt; /* pin count */
210
211 struct bio_vec *bi_io_vec; /* the actual vec list */
212
213 struct bio_set *bi_pool;
214
215 ktime_t bi_alloc_ts; /* for mm_event */
216
217 ANDROID_KABI_RESERVE(1);
218 ANDROID_KABI_RESERVE(2);
219
220 /*
221 * We can inline a number of vecs at the end of the bio, to avoid
222 * double allocations for a small number of bio_vecs. This member
223 * MUST obviously be kept at the very end of the bio.
224 */
225 struc
226 };
bvec_iter 结构体描述了要操作的设备扇区等信息
36 struct bvec_iter {37 sector_t bi_sector; /* device address in 512 byte38 sectors */39 unsigned int bi_size; /* residual I/O count */40 41 unsigned int bi_idx; /* current index into bvl_vec */42 43 unsigned int bi_done; /* number of bytes completed */44 45 unsigned int bi_bvec_done; /* number of bytes completed in46 current bvec */47 };
bio_vec结构体用来描述与这个bio请求对应的所有的内存,它可能不总是在一个页面里面,因此需要一个向量,向量中的每个元素实际是一个[page,offset,len],我们一般也称它为一个片段
30 struct bio_vec { 31 struct page *bv_page;32 unsigned int bv_len;33 unsigned int bv_offset;34 };
bio、bvec_iter 以及 bio_vec 这三个结构体之间的关系如图
3.7 blk_mq_tag_set
该结构体是块设备多队列支持的核心,提供了管理请求标签的必要信息和功能,使得设备能够高效地处理并发 I/O 操作
77 struct blk_mq_tag_set {78 unsigned int *mq_map;79 const struct blk_mq_ops *ops;80 unsigned int nr_hw_queues; 81 unsigned int queue_depth; /* max hw supported */82 unsigned int reserved_tags;83 unsigned int cmd_size; /* per-request extra data */84 int numa_node;85 unsigned int timeout;86 unsigned int flags; /* BLK_MQ_F_* */87 void *driver_data;88 89 struct blk_mq_tags **tags;90 91 struct mutex tag_list_lock;92 struct list_head tag_list;93 };
unsigned int *mq_map: 指向用于映射请求到队列的数组,帮助管理请求分发。
const struct blk_mq_ops *ops: 指向操作集的指针,定义了与标签相关的操作,如请求分配和释放。
unsigned int nr_hw_queues: 表示硬件支持的队列数量,用于配置和优化并发操作。
unsigned int queue_depth: 硬件支持的最大队列深度,定义了每个队列可以同时处理的请求数。
unsigned int reserved_tags: 保留的标签数量,用于特殊用途或特定条件下的请求处理。
unsigned int cmd_size: 每个请求额外数据的大小,用于存储与请求相关的附加信息。
int numa_node: NUMA节点信息,帮助优化内存访问以提高性能。
unsigned int timeout: 请求超时时间,指定请求等待响应的最长时间。
unsigned int flags: 标志位,使用 BLK_MQ_F_* 宏定义,以配置标签集合的行为。
void *driver_data: 指向驱动程序特定数据的指针,可以存储额外的上下文信息。
struct blk_mq_tags **tags: 指向存储标签信息的数组,管理多个请求标签。
struct mutex tag_list_lock: 互斥锁,用于保护标签列表的并发访问,确保线程安全。
struct list_head tag_list: 链表头,用于维护标签列表,便于管理和遍历。
3.8 blk_mq_ops
struct blk_mq_ops 是一个用于定义块设备(block device)驱动程序的操作集合的结构体,这些操作涉及到请求队列的管理和处理。
120 struct blk_mq_ops {
121 /*
122 * Queue request
123 */
124 queue_rq_fn *queue_rq;
125
126 /*
127 * Reserve budget before queue request, once .queue_rq is
128 * run, it is driver's responsibility to release the
129 * reserved budget. Also we have to handle failure case
130 * of .get_budget for avoiding I/O deadlock.
131 */
132 get_budget_fn *get_budget;
133 put_budget_fn *put_budget;
134
135 /*
136 * Called on request timeout
137 */
138 timeout_fn *timeout;
139
140 /*
141 * Called to poll for completion of a specific tag.
142 */
143 poll_fn *poll;
144
145 softirq_done_fn *complete;
146
147 /*
148 * Called when the block layer side of a hardware queue has been
149 * set up, allowing the driver to allocate/init matching structures.
150 * Ditto for exit/teardown.
151 */
152 init_hctx_fn *init_hctx;
153 exit_hctx_fn *exit_hctx;
154
155 /*
156 * Called for every command allocated by the block layer to allow
157 * the driver to set up driver specific data.
158 *
159 * Tag greater than or equal to queue_depth is for setting up
160 * flush request.
161 *
162 * Ditto for exit/teardown.
163 */
164 init_request_fn *init_request;
165 exit_request_fn *exit_request;
166 /* Called from inside blk_get_request() */
167 void (*initialize_rq_fn)(struct request *rq);
168
169 /*
170 * Called before freeing one request which isn't completed yet,
171 * and usually for freeing the driver private data
172 */
173 cleanup_rq_fn *cleanup_rq;
174
175 map_queues_fn *map_queues;
176
177 #ifdef CONFIG_BLK_DEBUG_FS
178 /*
179 * Used by the debugfs implementation to show driver-specific
180 * information about a request.
181 */
182 void (*show_rq)(struct seq_file *m, struct request *rq);
183 #endif
184 };
queue_rq_fn * queue_rq:
描述: 指向一个函数,该函数负责将请求(request)排入队列。它是块设备处理 I/O 请求的核心函数。
get_budget_fn * get_budget:
描述: 在提交请求之前,用于获取预算(budget)。它允许驱动在处理请求之前预留资源。
put_budget_fn * put_budget:
描述: 用于释放先前保留的预算,确保资源被正确管理。
timeout_fn * timeout:
描述: 当请求超时时调用的函数,通常用于处理超时情况下的清理或错误处理。
poll_fn * poll:
描述: 用于轮询特定标签的请求完成情况的函数。
softirq_done_fn * complete:
描述: 当请求处理完成时调用的函数,用于通知系统。
init_hctx_fn * init_hctx:
描述: 在硬件上下文(hardware context)设置完成后调用的函数,驱动可以在此分配和初始化相关的数据结构。
exit_hctx_fn * exit_hctx:
描述: 当硬件上下文被销毁时调用的函数,负责清理和释放资源。
init_request_fn * init_request:
描述: 每次分配请求时调用的函数,允许驱动设置特定的数据。
exit_request_fn * exit_request:
描述: 在请求处理完成后调用的函数,负责清理和释放与请求相关的资源。
void (*initialize_rq_fn)(struct request *rq):
描述: 在获取请求(request)时调用的函数,通常用于初始化请求的状态或数据。
cleanup_rq_fn * cleanup_rq:
描述: 在请求未完成时被调用,通常用于释放驱动专用的数据。
map_queues_fn * map_queues:
描述: 用于映射队列的函数,可能用于复杂的多队列场景。
3 相关函数
3.1 注册注销块设备
3.1.1 register_blkdev
函数原型 | int register_blkdev(unsigned int major, const char *name) | |
参数 | unsigned int major | 指定主设备号。主设备号是用于区分不同类型设备的标识符。 |
const char *name | 设备的名称,通常是在 /dev 目录下显示的名称。 | |
返回值 | int | 成功:主设备号 失败:负数 |
功能 | 将一个块设备注册到内核,使其能够被用户空间访问。块设备是支持随机访问的数据存储设备,如硬盘、闪存等。注册后,内核会将设备名称与主设备号关联,这样用户空间程序就可以通过设备名称访问这个块设备 |
343 int register_blkdev(unsigned int major, const char *name)344 {345 struct blk_major_name **n, *p;346 int index, ret = 0;347 348 mutex_lock(&block_class_lock);349 350 /* temporary */351 if (major == 0) {352 for (index = ARRAY_SIZE(major_names)-1; index > 0; index--) {353 if (major_names[index] == NULL)354 break;355 }356 357 if (index == 0) {358 printk("register_blkdev: failed to get major for %s\n",359 name);360 ret = -EBUSY;361 goto out;362 }363 major = index;364 ret = major;365 }366 367 if (major >= BLKDEV_MAJOR_MAX) {368 pr_err("register_blkdev: major requested (%u) is greater than the maximum (%u) for %s\n",369 major, BLKDEV_MAJOR_MAX-1, name);370 371 ret = -EINVAL;372 goto out;373 }374 375 p = kmalloc(sizeof(struct blk_major_name), GFP_KERNEL);376 if (p == NULL) {377 ret = -ENOMEM;378 goto out;379 }380 381 p->major = major;382 strlcpy(p->name, name, sizeof(p->name));383 p->next = NULL;384 index = major_to_index(major);385 386 for (n = &major_names[index]; *n; n = &(*n)->next) {387 if ((*n)->major == major)388 break;389 }390 if (!*n)391 *n = p;392 else393 ret = -EBUSY;394 395 if (ret < 0) {396 printk("register_blkdev: cannot get major %u for %s\n",397 major, name);398 kfree(p);399 }400 out:401 mutex_unlock(&block_class_lock);402 return ret;403 }
3.1.2 unregister_blkdev
函数原型 | void unregister_blkdev(unsigned int major, const char *name) | |
参数 | unsigned int major | 指定主设备号。主设备号是用于区分不同类型设备的标识符。 |
const char *name | 设备的名称,通常是在 /dev 目录下显示的名称。 | |
返回值 |
|
|
功能 | 从内核中注销先前注册的块设备。它通过主设备号和设备名称找到对应的设备,并将其从设备列表中移除。 |
407 void unregister_blkdev(unsigned int major, const char *name)408 {409 struct blk_major_name **n;410 struct blk_major_name *p = NULL;411 int index = major_to_index(major);412 413 mutex_lock(&block_class_lock);414 for (n = &major_names[index]; *n; n = &(*n)->next)415 if ((*n)->major == major)416 break;417 if (!*n || strcmp((*n)->name, name)) {418 WARN_ON(1);419 } else {420 p = *n;421 *n = p->next;422 }423 mutex_unlock(&block_class_lock);424 kfree(p);425 }
3.2 gendisk 结构体操作
3.2.1 alloc_disk
函数原型 | #define alloc_disk(minors) alloc_disk_node(minors, NUMA_NO_NODE) | |
参数 | int minors | 次设备号数量,也就是 gendisk 对应的分区数量 |
返回值 | struct gendisk * | 成功:返回申请到的 gendisk,失败:NULL |
功能 | 用于分配一个新的块设备(gendisk)结构体 |
#define alloc_disk(minors) alloc_disk_node(minors, NUMA_NO_NODE)655 #define alloc_disk_node(minors, node_id) \
656 ({ \
657 static struct lock_class_key __key; \
658 const char *__name; \
659 struct gendisk *__disk; \
660 \
661 __name = "(gendisk_completion)"#minors"("#node_id")"; \
662 \
663 __disk = __alloc_disk_node(minors, node_id); \
664 \
665 if (__disk) \
666 lockdep_init_map(&__disk->lockdep_map, __name, &__key, 0); \
667 \
668 __disk; \
669 })
1438 struct gendisk *__alloc_disk_node(int minors, int node_id)
1439 {
1440 struct gendisk *disk;
1441 struct disk_part_tbl *ptbl;
1442
1443 if (minors > DISK_MAX_PARTS) {
1444 printk(KERN_ERR
1445 "block: can't allocate more than %d partitions\n",
1446 DISK_MAX_PARTS);
1447 minors = DISK_MAX_PARTS;
1448 }
1449
1450 disk = kzalloc_node(sizeof(struct gendisk), GFP_KERNEL, node_id);
1451 if (disk) {
1452 if (!init_part_stats(&disk->part0)) {
1453 kfree(disk);
1454 return NULL;
1455 }
1456 init_rwsem(&disk->lookup_sem);
1457 disk->node_id = node_id;
1458 if (disk_expand_part_tbl(disk, 0)) {
1459 free_part_stats(&disk->part0);
1460 kfree(disk);
1461 return NULL;
1462 }
1463 ptbl = rcu_dereference_protected(disk->part_tbl, 1);
1464 rcu_assign_pointer(ptbl->part[0], &disk->part0);
1465
1466 /*
1467 * set_capacity() and get_capacity() currently don't use
1468 * seqcounter to read/update the part0->nr_sects. Still init
1469 * the counter as we can read the sectors in IO submission
1470 * patch using seqence counters.
1471 *
1472 * TODO: Ideally set_capacity() and get_capacity() should be
1473 * converted to make use of bd_mutex and sequence counters.
1474 */
1475 seqcount_init(&disk->part0.nr_sects_seq);
1476 if (hd_ref_init(&disk->part0)) {
1477 hd_free_part(&disk->part0);
1478 kfree(disk);
1479 return NULL;
1480 }
1481
1482 disk->minors = minors;
1483 rand_initialize_disk(disk);
1484 disk_to_dev(disk)->class = &block_class;
1485 disk_to_dev(disk)->type = &disk_type;
1486 device_initialize(disk_to_dev(disk));
1487 }
1488 return disk;
1489 }
3.2.2 set_capacity
函数原型 | void set_capacity(struct gendisk *disk, sector_t size) | |
参数 | struct gendisk *disk | 指向 gendisk 结构体的指针,表示要设置容量的块设备 |
sector_t size | 表示设备的总容量,以扇区为单位。块设备中最小的可寻址单元是扇区,一个扇区一般是 512 字节,有些设备的物理扇区可能不是 512 字节。不管物理扇区是多少,内核和块设备驱动之间的扇区都是 512 字节 | |
返回值 |
|
|
功能 | 设置块设备(gendisk 结构体)容量 |
460 static inline void set_capacity(struct gendisk *disk, sector_t size)
461 {
462 disk->part0.nr_sects = size;
463 }
3.2.3 add_disk
函数原型 | void add_disk(struct gendisk *disk) | |
参数 | struct gendisk *disk | 指向 gendisk 结构体的指针,代表要添加的块设备 |
返回值 |
|
|
功能 | 将给定的块设备(gendisk)添加到设备模型中 |
421 static inline void add_disk(struct gendisk *disk)
422 {
423 device_add_disk(NULL, disk);
424 }
730 void device_add_disk(struct device *parent, struct gendisk *disk)731 {732 __device_add_disk(parent, disk, true);733 }672 static void __device_add_disk(struct device *parent, struct gendisk *disk,673 bool register_queue)674 {675 dev_t devt;676 int retval;677 678 /* minors == 0 indicates to use ext devt from part0 and should679 * be accompanied with EXT_DEVT flag. Make sure all680 * parameters make sense.681 */682 WARN_ON(disk->minors && !(disk->major || disk->first_minor));683 WARN_ON(!disk->minors &&684 !(disk->flags & (GENHD_FL_EXT_DEVT | GENHD_FL_HIDDEN)));685 686 disk->flags |= GENHD_FL_UP;687 688 retval = blk_alloc_devt(&disk->part0, &devt);689 if (retval) {690 WARN_ON(1);691 return;692 }693 disk->major = MAJOR(devt);694 disk->first_minor = MINOR(devt);695 696 disk_alloc_events(disk);697 698 if (disk->flags & GENHD_FL_HIDDEN) {699 /*700 * Don't let hidden disks show up in /proc/partitions,701 * and don't bother scanning for partitions either.702 */703 disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;704 disk->flags |= GENHD_FL_NO_PART_SCAN;705 } else {706 int ret;707 708 /* Register BDI before referencing it from bdev */709 disk_to_dev(disk)->devt = devt;710 ret = bdi_register_owner(disk->queue->backing_dev_info,711 disk_to_dev(disk));712 WARN_ON(ret);713 blk_register_region(disk_devt(disk), disk->minors, NULL,714 exact_match, exact_lock, disk);715 }716 register_disk(parent, disk);717 if (register_queue)718 blk_register_queue(disk);719 720 /*721 * Take an extra ref on queue which will be put on disk_release()722 * so that it sticks around as long as @disk is there.723 */724 WARN_ON_ONCE(!blk_get_queue(disk->queue));725 726 disk_add_events(disk);727 blk_integrity_add(disk);728 }
3.2.4 del_gendisk
函数原型 | void del_gendisk(struct gendisk *disk) | |
参数 | struct gendisk *disk | 指向 gendisk 结构体的指针,代表要删除的块设备 |
返回值 |
|
|
功能 | 用于删除块设备(gendisk 结构体) |
742 void del_gendisk(struct gendisk *disk)743 {744 struct disk_part_iter piter;745 struct hd_struct *part;746 747 blk_integrity_del(disk);748 disk_del_events(disk);749 750 /*751 * Block lookups of the disk until all bdevs are unhashed and the752 * disk is marked as dead (GENHD_FL_UP cleared).753 */754 down_write(&disk->lookup_sem);755 /* invalidate stuff */756 disk_part_iter_init(&piter, disk,757 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);758 while ((part = disk_part_iter_next(&piter))) {759 invalidate_partition(disk, part->partno);760 bdev_unhash_inode(part_devt(part));761 delete_partition(disk, part->partno);762 }763 disk_part_iter_exit(&piter);764 765 invalidate_partition(disk, 0);766 bdev_unhash_inode(disk_devt(disk));767 set_capacity(disk, 0);768 disk->flags &= ~GENHD_FL_UP;769 up_write(&disk->lookup_sem);770 771 if (!(disk->flags & GENHD_FL_HIDDEN))772 sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");773 if (disk->queue) {774 /*775 * Unregister bdi before releasing device numbers (as they can776 * get reused and we'd get clashes in sysfs).777 */778 if (!(disk->flags & GENHD_FL_HIDDEN))779 bdi_unregister(disk->queue->backing_dev_info);780 blk_unregister_queue(disk);781 } else {782 WARN_ON(1);783 }784 785 if (!(disk->flags & GENHD_FL_HIDDEN))786 blk_unregister_region(disk_devt(disk), disk->minors);787 /*788 * Remove gendisk pointer from idr so that it cannot be looked up789 * while RCU period before freeing gendisk is running to prevent790 * use-after-free issues. Note that the device number stays791 * "in-use" until we really free the gendisk.792 */793 blk_invalidate_devt(disk_devt(disk));794 795 kobject_put(disk->part0.holder_dir);796 kobject_put(disk->slave_dir);797 798 part_stat_set_all(&disk->part0, 0);799 disk->part0.stamp = 0;800 if (!sysfs_deprecated)801 sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));802 pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);803 device_del(disk_to_dev(disk));804 }
3.3 块设备 I/O 请求
3.3.1 blk_mq_alloc_tag_set
函数原型 | int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set) | |
参数 | struct blk_mq_tag_set *set | 指向块设备标签集的结构体指针,包含了与请求标签相关的信息,比如锁、调度策略等。这个结构体定义了如何管理并发请求,以及如何分配和回收请求标签。 |
返回值 | int | 成功:0 失败:负数 |
功能 | 初始化和分配一个标签集合,这个集合用于管理并发请求的标识符(标签),以支持多队列的块设备操作。 |
2785 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
2786 {
2787 int ret;
2788
2789 BUILD_BUG_ON(BLK_MQ_MAX_DEPTH > 1 << BLK_MQ_UNIQUE_TAG_BITS);
2790
2791 if (!set->nr_hw_queues)
2792 return -EINVAL;
2793 if (!set->queue_depth)
2794 return -EINVAL;
2795 if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN)
2796 return -EINVAL;
2797
2798 if (!set->ops->queue_rq)
2799 return -EINVAL;
2800
2801 if (!set->ops->get_budget ^ !set->ops->put_budget)
2802 return -EINVAL;
2803
2804 if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
2805 pr_info("blk-mq: reduced tag depth to %u\n",
2806 BLK_MQ_MAX_DEPTH);
2807 set->queue_depth = BLK_MQ_MAX_DEPTH;
2808 }
2809
2810 /*
2811 * If a crashdump is active, then we are potentially in a very
2812 * memory constrained environment. Limit us to 1 queue and
2813 * 64 tags to prevent using too much memory.
2814 */
2815 if (is_kdump_kernel()) {
2816 set->nr_hw_queues = 1;
2817 set->queue_depth = min(64U, set->queue_depth);
2818 }
2819 /*
2820 * There is no use for more h/w queues than cpus.
2821 */
2822 if (set->nr_hw_queues > nr_cpu_ids)
2823 set->nr_hw_queues = nr_cpu_ids;
2824
2825 set->tags = kcalloc_node(nr_cpu_ids, sizeof(struct blk_mq_tags *),
2826 GFP_KERNEL, set->numa_node);
2827 if (!set->tags)
2828 return -ENOMEM;
2829
2830 ret = -ENOMEM;
2831 set->mq_map = kcalloc_node(nr_cpu_ids, sizeof(*set->mq_map),
2832 GFP_KERNEL, set->numa_node);
2833 if (!set->mq_map)
2834 goto out_free_tags;
2835
2836 ret = blk_mq_update_queue_map(set);
2837 if (ret)
2838 goto out_free_mq_map;
2839
2840 ret = blk_mq_alloc_rq_maps(set);
2841 if (ret)
2842 goto out_free_mq_map;
2843
2844 mutex_init(&set->tag_list_lock);
2845 INIT_LIST_HEAD(&set->tag_list);
2846
2847 return 0;
2848
2849 out_free_mq_map:
2850 kfree(set->mq_map);
2851 set->mq_map = NULL;
2852 out_free_tags:
2853 kfree(set->tags);
2854 set->tags = NULL;
2855 return ret;
2856 }
3.3.2 blk_mq_free_tag_set
函数原型 | void blk_mq_free_tag_set(struct blk_mq_tag_set *set) | |
参数 | struct blk_mq_tag_set *set | 指向块设备标签集的结构体指针,包含了与请求标签相关的信息,比如锁、调度策略等。这个结构体定义了如何管理并发请求,以及如何分配和回收请求标签。 |
返回值 |
|
|
功能 | Linux 内核中用于释放块设备标签集的函数。这个函数的主要作用是清理和释放与 blk_mq_tag_set 结构相关联的资源,以防止内存泄漏和保持系统的稳定性。 |
2859 void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
2860 {
2861 int i;
2862
2863 for (i = 0; i < nr_cpu_ids; i++)
2864 blk_mq_free_map_and_requests(set, i);
2865
2866 kfree(set->mq_map);
2867 set->mq_map = NULL;
2868
2869 kfree(set->tags);
2870 set->tags = NULL;
2871 }
3.3.3 blk_mq_init_queue
函数原型 | struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set) | |
参数 | struct blk_mq_tag_set *set | 指向块设备标签集的结构体指针,包含了与请求标签相关的信息,比如锁、调度策略等。这个结构体定义了如何管理并发请求,以及如何分配和回收请求标签。 |
返回值 | struct request_queue * | 指向初始化后的请求队列的指针;如果初始化失败,则返回 NULL。 |
功能 | linux 内核中用于初始化块设备的请求队列的函数,特别是针对多队列(multi-queue)块层的实现。这个函数主要用于设置和准备一个新的请求队列,以便能够接收和处理来自上层的 I/O 请求。 |
2496 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
2497 {
2498 struct request_queue *uninit_q, *q;
2499
2500 uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node, NULL);
2501 if (!uninit_q)
2502 return ERR_PTR(-ENOMEM);
2503
2504 q = blk_mq_init_allocated_queue(set, uninit_q);
2505 if (IS_ERR(q))
2506 blk_cleanup_queue(uninit_q);
2507
2508 return q;
2509 }
3.3.4 blk_cleanup_queue
函数原型 | void blk_cleanup_queue(struct request_queue *q) | |
参数 | struct request_queue *q | 指向请求队列的指针,该队列用于管理对块设备的请求 |
返回值 |
|
|
功能 | Linux 内核中用于清理请求队列的函数。它的主要作用是释放与请求队列相关的资源,以便在不再需要该队列时进行清理 |
757 void blk_cleanup_queue(struct request_queue *q)758 {759 spinlock_t *lock = q->queue_lock;760 761 /* mark @q DYING, no new request or merges will be allowed afterwards */762 mutex_lock(&q->sysfs_lock);763 blk_set_queue_dying(q);764 spin_lock_irq(lock);765 766 /*767 * A dying queue is permanently in bypass mode till released. Note768 * that, unlike blk_queue_bypass_start(), we aren't performing769 * synchronize_rcu() after entering bypass mode to avoid the delay770 * as some drivers create and destroy a lot of queues while771 * probing. This is still safe because blk_release_queue() will be772 * called only after the queue refcnt drops to zero and nothing,773 * RCU or not, would be traversing the queue by then.774 */775 q->bypass_depth++;776 queue_flag_set(QUEUE_FLAG_BYPASS, q);777 778 queue_flag_set(QUEUE_FLAG_NOMERGES, q);779 queue_flag_set(QUEUE_FLAG_NOXMERGES, q);780 queue_flag_set(QUEUE_FLAG_DYING, q);781 spin_unlock_irq(lock);782 mutex_unlock(&q->sysfs_lock);783 784 /*785 * Drain all requests queued before DYING marking. Set DEAD flag to786 * prevent that q->request_fn() gets invoked after draining finished.787 */788 blk_freeze_queue(q);789 790 rq_qos_exit(q);791 792 spin_lock_irq(lock);793 queue_flag_set(QUEUE_FLAG_DEAD, q);794 spin_unlock_irq(lock);795 796 /*797 * make sure all in-progress dispatch are completed because798 * blk_freeze_queue() can only complete all requests, and799 * dispatch may still be in-progress since we dispatch requests800 * from more than one contexts.801 *802 * We rely on driver to deal with the race in case that queue803 * initialization isn't done.804 */805 if (q->mq_ops && blk_queue_init_done(q))806 blk_mq_quiesce_queue(q);807 808 /* for synchronous bio-based driver finish in-flight integrity i/o */809 blk_flush_integrity();810 811 /* @q won't process any more request, flush async actions */812 del_timer_sync(&q->backing_dev_info->laptop_mode_wb_timer);813 blk_sync_queue(q);814 815 /*816 * I/O scheduler exit is only safe after the sysfs scheduler attribute817 * has been removed.818 */819 WARN_ON_ONCE(q->kobj.state_in_sysfs);820 821 blk_exit_queue(q);822 823 if (q->mq_ops)824 blk_mq_exit_queue(q);825 826 percpu_ref_exit(&q->q_usage_counter);827 828 spin_lock_irq(lock);829 if (q->queue_lock != &q->__queue_lock)830 q->queue_lock = &q->__queue_lock;831 spin_unlock_irq(lock);832 833 /* @q is and will stay empty, shutdown and put */834 blk_put_queue(q);835 }
3.4 请求 request
3.4.1 blk_mq_start_request
函数原型 | void blk_mq_start_request(struct request *rq) | |
参数 | struct request *rq | 指向待处理请求的指针。该请求结构体包含了有关 I/O 操作的信息,如操作类型、目标设备、数据缓冲区等 |
返回值 |
|
|
功能 | Linux 内核中用于开始处理块设备请求的函数。它主要用于标记一个请求为正在处理状态,并触发相关的处理流程,以确保请求能够被正确地调度和执行。 |
630 void blk_mq_start_request(struct request *rq)631 { 632 struct request_queue *q = rq->q;633 634 blk_mq_sched_started_request(rq);635 636 trace_block_rq_issue(q, rq);637 638 if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {639 rq->io_start_time_ns = ktime_get_ns();640 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW641 rq->throtl_size = blk_rq_sectors(rq);642 #endif 643 rq->rq_flags |= RQF_STATS; 644 rq_qos_issue(q, rq);645 } 646 647 WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);648 649 blk_add_timer(rq);650 WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);651 652 if (q->dma_drain_size && blk_rq_bytes(rq)) {653 /*654 * Make sure space for the drain appears. We know we can do655 * this because max_hw_segments has been adjusted to be one656 * fewer than the device can handle.657 */658 rq->nr_phys_segments++;659 }660 }
3.4.1 blk_mq_end_request
函数原型 | void blk_mq_end_request(struct request *rq, blk_status_t error) | |
参数 | struct request *rq | 指向已处理请求的指针。该请求结构体包含了关于 I/O 操作的各种信息。 |
blk_status_t error | 表示请求执行结果的状态码。它指示请求是成功还是失败,通常会传递一个错误代码或状态值。 | |
返回值 |
|
|
功能 | Linux 内核中用于结束块设备请求的函数。它在请求处理完成后被调用,用于更新请求状态并进行必要的清理工作。 |
542 void blk_mq_end_request(struct request *rq, blk_status_t error)543 { 544 if (blk_update_request(rq, error, blk_rq_bytes(rq)))545 BUG();546 __blk_mq_end_request(rq, error);547 } 520 inline void __blk_mq_end_request(struct request *rq, blk_status_t error)521 {522 u64 now = ktime_get_ns();523 524 if (rq->rq_flags & RQF_STATS) {525 blk_mq_poll_stats_start(rq->q);526 blk_stat_add(rq, now);527 }528 529 blk_account_io_done(rq, now);530 531 if (rq->end_io) {532 rq_qos_done(rq->q, rq);533 rq->end_io(rq, error);534 } else {535 if (unlikely(blk_bidi_rq(rq)))536 blk_mq_free_request(rq->next_rq);537 blk_mq_free_request(rq);538 } 539 }
3.5 数据操作
3.5.1 bio_data
函数原型 | void *bio_data(struct bio *bio) | |
参数 | struct bio *bio | 指向 bio 结构体的指针。bio 结构体用于描述一次块 I/O 操作,包括请求的数据位置、长度、数据缓冲区等信息。 |
返回值 |
|
|
功能 | Linux 内核中用于获取块 I/O 操作(即 bio 结构体)中数据缓冲区的指针的函数。它提供了一种方便的方式来访问与特定 I/O 操作相关联的数据 |
124 static inline void *bio_data(struct bio *bio)
125 {
126 if (bio_has_data(bio))
127 return page_address(bio_page(bio)) + bio_offset(bio);
128
129 return NULL;
130 }
3.5.2 rq_data_dir
函数原型 | #define rq_data_dir(rq) (op_is_write(req_op(rq)) ? WRITE : READ) | |
参数 | struct request *rq | 指向待处理请求的指针。该请求结构体包含了有关 I/O 操作的信息,如操作类型、目标设备、数据缓冲区等 |
返回值 |
| WRITE 或者 READ |
功能 | 确定给定请求(request)是读取操作还是写入操作。这个宏在块设备的 I/O 请求处理中非常有用,特别是在进行读写操作时。 |
771 #define rq_data_dir(rq) (op_is_write(req_op(rq)) ? WRITE : READ)386 #define req_op(req) ((req)->cmd_flags & REQ_OP_MASK)396 static inline bool op_is_write(unsigned int op)
397 {
398 return (op & 1);
399 }
4 实验
4.1 代码
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/fcntl.h>
#include <linux/hdreg.h>
#include <linux/kdev_t.h>
#include <linux/vmalloc.h>
#include <linux/genhd.h>
#include <linux/blk-mq.h>
#include <linux/buffer_head.h>
#include <linux/bio.h>#define RAMDISK_SIZE (2 * 1024 * 1024) /* 容量大小为2MB */
#define RAMDISK_NAME "ramdisk" /* 名字 */
#define RADMISK_MINOR 3 /* 表示有三个磁盘分区!不是次设备号为3!*//* ramdisk设备结构体 */
struct ramdisk_dev{int major; /* 主设备号 */unsigned char *ramdiskbuf; /* ramdisk内存空间,用于模拟块设备 */struct gendisk *gendisk; /* gendisk */struct request_queue *queue; /* 请求队列 */struct blk_mq_tag_set tag_set; /* blk_mq_tag_set */spinlock_t lock; /* 自旋锁 */
};struct ramdisk_dev *ramdisk = NULL; /* ramdisk设备指针 *//** @description : 处理传输过程* @param-req : 请求* @return : 0,成功;其它表示失败*/
static int ramdisk_transfer(struct request *req)
{ unsigned long start = blk_rq_pos(req) << 9; /* blk_rq_pos获取到的是扇区地址,左移9位转换为字节地址 */unsigned long len = blk_rq_cur_bytes(req); /* 大小 *//* bio中的数据缓冲区* 读:从磁盘读取到的数据存放到buffer中* 写:buffer保存这要写入磁盘的数据*/void *buffer = bio_data(req->bio); if(rq_data_dir(req) == READ) /* 读数据 */ memcpy(buffer, ramdisk->ramdiskbuf + start, len);else if(rq_data_dir(req) == WRITE) /* 写数据 */memcpy(ramdisk->ramdiskbuf + start, buffer, len);return 0;}/** @description : 开始处理传输数据的队列* @hctx : 硬件相关的队列结构体* @bd : 数据相关的结构体* @return : 0,成功;其它值为失败*/
static blk_status_t _queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data* bd)
{struct request *req = bd->rq; /* 通过bd获取到request队列*/struct ramdisk_dev *dev = req->rq_disk->private_data;int ret;blk_mq_start_request(req); /* 开启处理队列 */spin_lock(&dev->lock); ret = ramdisk_transfer(req); /* 处理数据 */blk_mq_end_request(req, ret); /* 结束处理队列 */spin_unlock(&dev->lock);return BLK_STS_OK;}
/** 队列操作函数*/
static struct blk_mq_ops mq_ops = {.queue_rq = _queue_rq,
};int ramdisk_open(struct block_device *dev, fmode_t mode)
{printk("ramdisk open\r\n");return 0;
}void ramdisk_release(struct gendisk *disk, fmode_t mode)
{printk("ramdisk release\r\n");
}/** @description : 获取磁盘信息* @param - dev : 块设备* @param - geo : 模式* @return : 0 成功;其他 失败*/
int ramdisk_getgeo(struct block_device *dev, struct hd_geometry *geo)
{/* 这是相对于机械硬盘的概念 */geo->heads = 2; /* 磁头 */geo->cylinders = 32; /* 柱面 */geo->sectors = RAMDISK_SIZE / (2 * 32 *512); /* 一个磁道上的扇区数量 */return 0;
}/* * 块设备操作函数 */
static struct block_device_operations ramdisk_fops =
{.owner = THIS_MODULE,.open = ramdisk_open,.release = ramdisk_release,.getgeo = ramdisk_getgeo,
};/** @description : 初始化队列相关操作* @set : blk_mq_tag_set对象* @return : request_queue的地址*/
static struct request_queue * create_req_queue(struct blk_mq_tag_set *set)
{struct request_queue *q;#if 0/**这里是使用了blk_mq_init_sq_queue 函数*进行初始化的。*/q = blk_mq_init_sq_queue(set, &mq_ops, 2, BLK_MQ_F_SHOULD_MERGE);#elseint ret;memset(set, 0, sizeof(*set));set->ops = &mq_ops; //操作函数set->nr_hw_queues = 2; //硬件队列set->queue_depth = 2; //队列深度set->numa_node = NUMA_NO_NODE;//numa节点set->flags = BLK_MQ_F_SHOULD_MERGE; //标记在bio下发时需要合并ret = blk_mq_alloc_tag_set(set); //使用函数进行再次初始化if (ret) {printk(KERN_WARNING "sblkdev: unable to allocate tag set\n");return ERR_PTR(ret);}q = blk_mq_init_queue(set); //分配请求队列if(IS_ERR(q)) {blk_mq_free_tag_set(set);return q;}
#endifreturn q;
}/** @description : 创建块设备,为应用层提供接口。* @set : ramdisk_dev对象* @return : 0,表示成功;其它值为失败*/
static int create_req_gendisk(struct ramdisk_dev *set)
{struct ramdisk_dev *dev = set;/* 1、分配并初始化 gendisk */dev->gendisk = alloc_disk(RADMISK_MINOR);if(dev == NULL)return -ENOMEM;/* 2、添加(注册)disk */dev->gendisk->major = ramdisk->major; /* 主设备号 */dev->gendisk->first_minor = 0; /* 起始次设备号 */dev->gendisk->fops = &ramdisk_fops; /* 操作函数 */dev->gendisk->private_data = set; /* 私有数据 */dev->gendisk->queue = dev->queue; /* 请求队列 */sprintf(dev->gendisk->disk_name, RAMDISK_NAME); /* 名字 */set_capacity(dev->gendisk, RAMDISK_SIZE/512); /* 设备容量(单位为扇区)*/add_disk(dev->gendisk);return 0;
}static int __init ramdisk_init(void)
{int ret = 0;struct ramdisk_dev * dev;printk("ramdisk init\r\n");/* 1、申请内存 */dev = kzalloc(sizeof(*dev), GFP_KERNEL);if(dev == NULL) {return -ENOMEM;}dev->ramdiskbuf = kmalloc(RAMDISK_SIZE, GFP_KERNEL);if(dev->ramdiskbuf == NULL) {printk(KERN_WARNING "dev->ramdiskbuf: vmalloc failure.\n");return -ENOMEM;}ramdisk = dev;/* 2、初始化自旋锁 */spin_lock_init(&dev->lock);/* 3、注册块设备 */dev->major = register_blkdev(0, RAMDISK_NAME); /* 由系统自动分配主设备号 */if(dev->major < 0) {goto register_blkdev_fail;}/* 4、创建多队列 */dev->queue = create_req_queue(&dev->tag_set);if(dev->queue == NULL) {goto create_queue_fail;}/* 5、创建块设备 */ret = create_req_gendisk(dev);if(ret < 0)goto create_gendisk_fail;return 0;create_gendisk_fail:blk_cleanup_queue(dev->queue);blk_mq_free_tag_set(&dev->tag_set);
create_queue_fail:unregister_blkdev(dev->major, RAMDISK_NAME);
register_blkdev_fail:kfree(dev->ramdiskbuf);kfree(dev);return -ENOMEM;
}static void __exit ramdisk_exit(void)
{printk("ramdisk exit\r\n");/* 释放gendisk */del_gendisk(ramdisk->gendisk);put_disk(ramdisk->gendisk);/* 清除请求队列 */blk_cleanup_queue(ramdisk->queue);/* 释放blk_mq_tag_set */blk_mq_free_tag_set(&ramdisk->tag_set);/* 注销块设备 */unregister_blkdev(ramdisk->major, RAMDISK_NAME);/* 释放内存 */kfree(ramdisk->ramdiskbuf);kfree(ramdisk);
}module_init(ramdisk_init);
module_exit(ramdisk_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("ALIENTEK");
MODULE_INFO(intree, "Y");
4.2 操作
加载驱动:insmod ramdisk_withrequest_test.ko
console:/data # insmod ramdisk_withrequest_test.ko
[ 916.610832] ramdisk init
[ 916.613432] ramdisk open
[ 916.613718] ramdisk release
[ 916.613758] type=1400 audit(1727152399.923:65): avc: denied { create } for comm="kdevtmpfs" name="ramdisk" scontext=u:r:kernel:s0 tcontext=u:object_r:device:s0 tclass=blk_file permissive=1
[ 916.614331] type=1400 audit(1727152399.923:66): avc: denied { setattr } for comm="kdevtmpfs" name="ramdisk" dev="devtmpfs" ino=38456 scontext=u:r:kernel:s0 tcontext=u:object_r:device:s0 tclass=blk_file permissive=1
console:/data #
cat /proc/partitions,可以看到ramdisk的信息
console:/dev/block # cat /proc/partitions
major minor #blocks name1 0 8192 ram01 1 8192 ram11 2 8192 ram21 3 8192 ram31 4 8192 ram41 5 8192 ram51 6 8192 ram61 7 8192 ram71 8 8192 ram81 9 8192 ram91 10 8192 ram101 11 8192 ram111 12 8192 ram121 13 8192 ram131 14 8192 ram141 15 8192 ram15254 0 1952508 zram0179 0 30535680 mmcblk2179 1 4096 mmcblk2p1179 2 4096 mmcblk2p2179 3 4096 mmcblk2p3179 4 4096 mmcblk2p4179 5 4096 mmcblk2p5179 6 1024 mmcblk2p6179 7 73728 mmcblk2p7179 8 98304 mmcblk2p8179 9 393216 mmcblk2p9179 10 393216 mmcblk2p10179 11 16384 mmcblk2p11179 12 1024 mmcblk2p12179 13 3186688 mmcblk2p13179 14 26347488 mmcblk2p14253 0 1589636 dm-0253 1 120540 dm-1253 2 201604 dm-2253 3 208704 dm-3253 4 608 dm-4253 5 26347488 dm-58 0 30720000 sda8 4 30719872 sda4252 0 2048 ramdisk
console:/dev/block #
ls /dev/block
console:/data # ls /dev/block/
by-name loop10 loop5 mmcblk2p1 mmcblk2p5 ram11 ram6
dm-0 loop11 loop6 mmcblk2p10 mmcblk2p6 ram12 ram7
dm-1 loop12 loop7 mmcblk2p11 mmcblk2p7 ram13 ram8
dm-2 loop13 loop8 mmcblk2p12 mmcblk2p8 ram14 ram9
dm-3 loop14 loop9 mmcblk2p13 mmcblk2p9 ram15 ramdisk
dm-4 loop15 mapper mmcblk2p14 platform ram2 sda
dm-5 loop2 mmcblk2 mmcblk2p2 ram0 ram3 sda4
loop0 loop3 mmcblk2boot0 mmcblk2p3 ram1 ram4 vold
loop1 loop4 mmcblk2boot1 mmcblk2p4 ram10 ram5 zram0
console:/data #
console:/data # ls /dev/block/ramdisk -l
brw------- 1 root root 252, 0 2024-09-23 09:24 /dev/block/ramdisk
console:/data #
挂载:mount /dev/block/ramdisk disk/
1|console:/data # mount /dev/block/ramdisk disk/
[224737.460103] ramdisk open
[224737.460238] ramdisk release
[224737.460275] ramdisk open
[224737.460316] ramdisk release
[224737.460348] ramdisk open
[224737.460389] ramdisk release
[224737.460420] ramdisk open
[224737.460480] ramdisk relemount: /dev/block/ramdisk: need -ta
se
[224737.460552] ramdisk open
[224737.460599] ramdisk release
[221|console:/data # 4737.460637] ramdisk open
[224737.461348] ramdisk release
[224737.461398] ramdisk open
[224737.461424] ramdisk release
[224737.461456] ramdisk open
[224737.461589] ramdisk release
[224737.461627] ramdisk open
[224737.461689] F2FS-fs (ramdisk): Magic Mismatch, valid(0xf2f52010) - read(0xffffffff)
[224737.461704] F2FS-fs (ramdisk): Can't find valid F2FS filesystem in 1th superblock
[224737.461724] F2FS-fs (ramdisk): Magic Mismatch, valid(0xf2f52010) - read(0xffffffff)
[224737.461738] F2FS-fs (ramdisk): Can't find valid F2FS filesystem in 2th superblock
[224737.461758] ramdisk release1|console:/data #
格式化:mke2fs /dev/block/ramdisk
127|console:/data # mke2fs /dev/block/ramdisk
mke2fs 1.45.4 (23-Sep-2019)
[225588.883514] ramdisk open
[225588.883552] ramdisk release
[225588.883644] ramdisk open
/dev/block/ramdisk contains a ext4 file system
[225588.884440] ramdisk release
[225588.884578] ramdisk opencreated on Mon Sep 23 09:38:55 2024
Proceed anyway? (y,N) [225588.885256] ramdisk release
y
[225596.336292] ramdisk open
[225596.336323] ramdisk release
[225596.336338] ramdisk open
[225596.336353] ramdisk releaCreating filesystem with 2048 1k blsocks and 256 inodess: done [2255W9r6i.t3ing inode tables: 36415] ramdisk open
[225done 5
96.336426] ramdisk release
[225596.336438] ramdisk open
[225596.336447] ramdisk release
[225596.3365Writing superblocks and filesystem accounting information: 19] ramdisk open
[225596.341289] ramdisk release
doneconsole:/data #